
Conversation

@krancour
Member

@krancour krancour commented Oct 28, 2025

Fixes #5076

Highlights:

  1. Creates a new, very simple cache abstraction with one implementation based on github.com/hashicorp/golang-lru/v2. This replaces https://github.com/patrickmn/go-cache. Being, as the name implies, an LRU cache, github.com/hashicorp/golang-lru/v2 is better suited to the rest of the changes in this PR. (A sketch of the abstraction appears after this list.)

  2. Before, each registry got its own cache. Now there's just one shared cache. Its size is configurable. By default, it holds up to 100,000 entries. Each entry is pretty small.

  3. Container image subscriptions can now opt into using cached tags. When this is enabled, all image selection strategies except Digest will cache image information by tag rather than by digest. If one does not opt in, things work the same as they always have.

  4. Operators can set policies that allow/disallow the use of cached tags, or even require it. This way, operators who are not confident that users are working with immutable tags can forbid tags from being cached, while those who are confident that users ARE working with immutable tags can require tag caching.
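
For illustration, here's a minimal sketch of what the abstraction might look like. Only the NewInMemoryCache constructor is confirmed by the diff snippets later in this thread; the Cache interface shape and the Get/Set method names are assumptions, not the PR's actual code.

```go
package cache

import (
	lru "github.com/hashicorp/golang-lru/v2"
)

// Cache is a minimal key/value cache abstraction (method names assumed).
type Cache[V any] interface {
	Get(key string) (V, bool)
	Set(key string, value V)
}

// inMemoryCache implements Cache with a fixed-size LRU.
type inMemoryCache[V any] struct {
	lru *lru.Cache[string, V]
}

// NewInMemoryCache returns a Cache that holds at most size entries,
// evicting the least recently used entry once full.
func NewInMemoryCache[V any](size int) (Cache[V], error) {
	l, err := lru.New[string, V](size)
	if err != nil {
		return nil, err
	}
	return &inMemoryCache[V]{lru: l}, nil
}

func (c *inMemoryCache[V]) Get(key string) (V, bool) {
	return c.lru.Get(key)
}

func (c *inMemoryCache[V]) Set(key string, value V) {
	c.lru.Add(key, value)
}
```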

@krancour krancour added this to the v1.9.0 milestone Oct 28, 2025
@krancour krancour self-assigned this Oct 28, 2025
@krancour krancour requested review from a team as code owners October 28, 2025 00:25
@krancour krancour added kind/enhancement An entirely new feature priority/normal This is the priority for most work area/controller Affects the (main) controller area/chart Affects the Helm chart labels Oct 28, 2025
@netlify

netlify bot commented Oct 28, 2025

Deploy Preview for docs-kargo-io ready!

🔨 Latest commit 1a3e52f
🔍 Latest deploy log https://app.netlify.com/projects/docs-kargo-io/deploys/690a6cd1797e820007b557fd
😎 Deploy Preview https://deploy-preview-5298.docs.kargo.io

@krancour krancour marked this pull request as draft October 28, 2025 00:25
@codecov

codecov bot commented Oct 28, 2025

Codecov Report

❌ Patch coverage is 58.62069% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.93%. Comparing base (372bd42) to head (1a3e52f).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/image/repository_client.go 60.00% 8 Missing and 8 partials ⚠️
pkg/controller/warehouses/images.go 0.00% 10 Missing ⚠️
pkg/image/cache.go 55.55% 3 Missing and 1 partial ⚠️
pkg/image/newest_build_selector.go 40.00% 3 Missing ⚠️
pkg/image/tag_based_selector.go 33.33% 2 Missing ⚠️
pkg/image/digest_selector.go 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5298      +/-   ##
==========================================
- Coverage   55.95%   55.93%   -0.03%     
==========================================
  Files         407      409       +2     
  Lines       29906    29955      +49     
==========================================
+ Hits        16735    16754      +19     
- Misses      12200    12221      +21     
- Partials      971      980       +9     

☔ View full report in Codecov by Sentry.

Contributor

@hiddeco hiddeco left a comment


In addition to what is being done in this PR, which is excellent, I wonder if this would be a good time to handle the // TODO about making the concurrency settings configurable.

We have had reports internally that adding a zero to the current settings (metadata concurrency and rate limiter) brings down the refresh time from ~15 minutes to just over one.

@nikolay-te

nikolay-te commented Oct 30, 2025

In addition to what is being done in this PR, which is excellent, I wonder if this would be a good time to handle the // TODO about making the concurrency settings configurable.

We have had reports internally that adding a zero to the current settings (metadata concurrency and rate limiter) brings down the refresh time from ~15 minutes to just over one.

Yes, please 🙏🏻 This made a HUGE difference in our environment. For various reasons we're stuck using NewestBuild, and patching the warehouse shards to increase rate limits and parallelism helped a lot.

Another potential improvement would be some connection reuse on the client (though it would be interesting to see how this would work in a multi-tenant environment), as the other thing we've noticed after bumping these parameters manually is that querying the repositories can be very aggressive, with many TCP connections getting recycled and a high number of /v2/token requests to Docker, etc.
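
For what it's worth, a shared http.Transport in Go already pools connections; something along these lines (illustrative values, not Kargo's actual client configuration) would reuse TCP connections to the same registry instead of recycling them per request:

```go
import (
	"net/http"
	"time"
)

// A single shared transport pools idle TCP connections and reuses them
// across requests to the same host, instead of dialing per request.
// The values below are illustrative, not Kargo's actual settings.
var sharedTransport = &http.Transport{
	MaxIdleConns:        100,
	MaxIdleConnsPerHost: 20,
	IdleConnTimeout:     90 * time.Second,
}

var registryClient = &http.Client{
	Transport: sharedTransport,
	Timeout:   30 * time.Second,
}
```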

@krancour
Member Author

In addition to what is being done in this PR, which is excellent, I wonder if this would be a good time to handle the // TODO about making the concurrency settings configurable.

Are you referring to the rate limit (currently 20 per registry) or the number of goroutines allowed to work on pulling down metadata concurrently (currently capped at 1,000, but shared across all registries)?
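
(For context, those two limits compose roughly like the sketch below. This is the general pattern, not Kargo's actual code, and the per-second unit on the limiter is an assumption; the thread doesn't specify it.)

```go
import (
	"context"

	"golang.org/x/sync/semaphore"
	"golang.org/x/time/rate"
)

// One limiter per registry (20 in the current code; per-second assumed)...
var registryLimiter = rate.NewLimiter(rate.Limit(20), 1)

// ...and one semaphore shared across ALL registries, capping the total
// number of goroutines pulling metadata concurrently at 1,000.
var metadataSem = semaphore.NewWeighted(1000)

func fetchMetadata(ctx context.Context, tag string) error {
	if err := metadataSem.Acquire(ctx, 1); err != nil {
		return err
	}
	defer metadataSem.Release(1)
	if err := registryLimiter.Wait(ctx); err != nil {
		return err
	}
	// ... perform the registry request for this tag ...
	return nil
}
```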

Assuming you're referring to the rate limit, I've always been open to making that configurable, but I have strongly held opinions about how it's approached. #1139 is the issue capturing my thoughts on this in more detail. Someone did open a PR not too long ago that was rejected; it only addressed rate limits and not the rest of #1139, which is OK in itself, but it did so in a way that was inadequately future-proof. We'd have been unable to build a comprehensive solution to registry configuration on top of it, which would inevitably have forced a breaking change on us. I'm in no way saying that we can't make rate limits configurable without addressing the rest of #1139. What I am saying is that whatever is done to make rate limits configurable must not paint us into a corner in terms of how we'd approach the rest of #1139.

Separate from that, I have some worry that increasing the rate limit isn't the panacea people seem to think it is. There's a natural inclination toward thinking that if you crank that way up, things just get faster. I think there's not enough weight being given to the possibility that when the client isn't self-limiting and the rate limits are being enforced server-side instead, things might actually get worse. This isn't a reason not to do it. It's just a concern.

Between my concern and my opinions about there being a (somewhat involved) right way of approaching this, my inclination is to not bundle that improvement into this cache improvement.

Why don't we see about getting @jessesuen's buy-in on the idea of prioritizing #1139 in v1.10.0?

@krancour krancour changed the title feat: allow opting into caching tags feat: allow opting into caching image metadata by tag Oct 31, 2025
@hiddeco
Contributor

hiddeco commented Oct 31, 2025

Are you referring to the rate limit (currently 20 per registry) or the number of goroutines allowed to work on pulling down metadata concurrently (currently capped at 1,000, but shared across all registries)?

Based on @nikolay-te's report, it is both.

@nikolay-te

Are you referring to the rate limit (currently 20 per registry) or the number of goroutines allowed to work on pulling down metadata concurrently (currently capped at 1,000, but shared across all registries)?

Based on @nikolay-te's report, it is both.

I think it was probably mostly the per-registry rate limit that gave us the improvement.
Since it's per registry, and we have only one registry for all images, we were seeing image warehouse reconciliations take up to an hour. This was even with all warehouses spread across 10 separate Kargo shards/instances dedicated to warehouse tasks. The Kargo agents, our registry, and the caching proxy in front of it were all mostly idle until we increased the rate limits. After that, discoveries that had been taking an hour dropped to 1-2 minutes; many finished in seconds.

Having said that, I see now that the caching proxies I've set up (nginx, passing through tag-list operations and caching only manifests) are able to hold ALL images in our environment mostly in memory (just below 2G RAM), which makes me think this caching feature has the potential to bring similar improvements even with the default rate limits, except perhaps on cold starts.

Signed-off-by: Kent Rancourt <[email protected]>
@krancour
Member Author

@nikolay-te there's a lot of interesting insight there. Thank you.

Rate limit-related improvements are coming, but they won't be bundled into this PR.

Signed-off-by: Kent Rancourt <[email protected]>
"tag", tag,
)

cacheKey := fmt.Sprintf("%s:%s", r.repoURL, tag)
Contributor

Should this take the platform constraints into account? Otherwise, I think we could get mismatches?

Member Author

🤔 I think you're right. I'll look at that.

Member Author

Yep. Will fix. Converting to draft in the meantime.
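
A fix along these lines would fold the platform constraint into the key; the r.platform field and its OS/Arch/Variant members here are hypothetical stand-ins for whatever the selector actually carries:

```go
// Hypothetical sketch: include any platform constraint in the cache key so
// that the same tag under different platform constraints can't collide.
cacheKey := fmt.Sprintf("%s:%s", r.repoURL, tag)
if r.platform != nil {
	cacheKey = fmt.Sprintf(
		"%s:%s/%s/%s",
		cacheKey, r.platform.OS, r.platform.Arch, r.platform.Variant,
	)
}
```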


func init() {
var err error
imageCache, err = cache.NewInMemoryCache[Image](100000)
Contributor

Should the size be configurable?

Member Author

I thought it was... there might be a commit missing from this PR. Will follow up on that.

Member Author

ffb1e16 was missing previously. It's configurable now.
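
For illustration, wiring the size through might look something like this; the env var name and the fallback are hypothetical, with 100,000 taken from the previous hard-coded default:

```go
import (
	"os"
	"strconv"
)

// Hypothetical sketch: read the cache size from the environment, falling
// back to the previous hard-coded default of 100,000 entries.
func imageCacheSize() int {
	if v := os.Getenv("IMAGE_CACHE_SIZE"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return 100000
}
```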

Signed-off-by: Kent Rancourt <[email protected]>
@krancour krancour marked this pull request as draft November 4, 2025 21:36