
Conversation

@krancour
Member

@krancour krancour commented Oct 28, 2025

Fixes #5076

Highlights:

  1. Creates a new, very simple cache abstraction with one implementation based on github.com/hashicorp/golang-lru/v2. This replaces https://github.com/patrickmn/go-cache. Being, as the name implies, an LRU cache, github.com/hashicorp/golang-lru/v2 is better suited to the rest of the changes in this PR. (A sketch of the abstraction appears after this list.)

  2. Before, each registry got its own cache. Now there's just one shared cache. Its size is configurable. By default, it holds up to 100,000 entries. Each entry is pretty small.

  3. Container image subscriptions can now opt into using cached tags. When this is enabled, all image selection strategies except Digest will cache image information by tag rather than by digest. If one does not opt in, things work the same as they always have.

  4. Operators can set policies that allow/disallow the use of cached tags, or even require it. This way, operators who are not confident that users are working with immutable tags can forbid tags from being cached, while those who are confident that users ARE working with immutable tags can require tag caching.
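
For illustration, here's a minimal sketch of what the abstraction might look like. Only the NewInMemoryCache constructor is confirmed by the diff snippets later in this thread; the Cache interface shape and the Get/Set method names are assumptions, not the PR's actual code.

```go
package cache

import (
	lru "github.com/hashicorp/golang-lru/v2"
)

// Cache is a minimal key/value cache abstraction (method names assumed).
type Cache[V any] interface {
	Get(key string) (V, bool)
	Set(key string, value V)
}

// inMemoryCache implements Cache with a fixed-size LRU.
type inMemoryCache[V any] struct {
	lru *lru.Cache[string, V]
}

// NewInMemoryCache returns a Cache that holds at most size entries,
// evicting the least recently used entry once full.
func NewInMemoryCache[V any](size int) (Cache[V], error) {
	l, err := lru.New[string, V](size)
	if err != nil {
		return nil, err
	}
	return &inMemoryCache[V]{lru: l}, nil
}

func (c *inMemoryCache[V]) Get(key string) (V, bool) {
	return c.lru.Get(key)
}

func (c *inMemoryCache[V]) Set(key string, value V) {
	c.lru.Add(key, value)
}
```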

@krancour krancour added this to the v1.9.0 milestone Oct 28, 2025
@krancour krancour self-assigned this Oct 28, 2025
@krancour krancour requested review from a team as code owners October 28, 2025 00:25
@krancour krancour added kind/enhancement An entirely new feature priority/normal This is the priority for most work area/controller Affects the (main) controller area/chart Affects the Helm chart labels Oct 28, 2025
@netlify

netlify bot commented Oct 28, 2025

Deploy Preview for docs-kargo-io ready!

🔨 Latest commit 1a3e52f
🔍 Latest deploy log https://app.netlify.com/projects/docs-kargo-io/deploys/690a6cd1797e820007b557fd
😎 Deploy Preview https://deploy-preview-5298.docs.kargo.io

@krancour krancour marked this pull request as draft October 28, 2025 00:25
@codecov

codecov bot commented Oct 28, 2025

Codecov Report

❌ Patch coverage is 58.62069% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 55.93%. Comparing base (372bd42) to head (1a3e52f).
⚠️ Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
pkg/image/repository_client.go 60.00% 8 Missing and 8 partials ⚠️
pkg/controller/warehouses/images.go 0.00% 10 Missing ⚠️
pkg/image/cache.go 55.55% 3 Missing and 1 partial ⚠️
pkg/image/newest_build_selector.go 40.00% 3 Missing ⚠️
pkg/image/tag_based_selector.go 33.33% 2 Missing ⚠️
pkg/image/digest_selector.go 83.33% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5298      +/-   ##
==========================================
- Coverage   55.95%   55.93%   -0.03%     
==========================================
  Files         407      409       +2     
  Lines       29906    29955      +49     
==========================================
+ Hits        16735    16754      +19     
- Misses      12200    12221      +21     
- Partials      971      980       +9     

☔ View full report in Codecov by Sentry.

Contributor

@hiddeco hiddeco left a comment


In addition to what is being done in this PR, which is excellent, I wonder if this would be a good time to handle the // TODO about making the concurrency settings configurable.

We have had reports internally that adding a zero to the current settings (metadata concurrency and rate limiter) brings down the refresh time from ~15 minutes to just over one.

@nikolay-te

nikolay-te commented Oct 30, 2025

In addition to what is being done in this PR, which is excellent, I wonder if this would be a good time to handle the // TODO about making the concurrency settings configurable.

We have had reports internally that adding a zero to the current settings (metadata concurrency and rate limiter) brings down the refresh time from ~15 minutes to just over one.

Yes, please 🙏🏻 This made a HUGE difference in our environment. For various reasons we're stuck using NewestBuild, and patching the warehouse shards to increase rate limits and parallelism helped a lot.

Another potential improvement would be some connection reuse on the client (though it would be interesting to see how this would work in a multi-tenant environment), as the other thing we've noticed after bumping these parameters manually is that querying the repositories can be very aggressive, with many TCP connections getting recycled and a high number of /v2/token requests to Docker, etc.
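
For what it's worth, a shared http.Transport in Go already pools connections; something along these lines (illustrative values, not Kargo's actual client configuration) would reuse TCP connections to the same registry instead of recycling them per request:

```go
import (
	"net/http"
	"time"
)

// A single shared transport pools idle TCP connections and reuses them
// across requests to the same host, instead of dialing per request.
// The values below are illustrative, not Kargo's actual settings.
var sharedTransport = &http.Transport{
	MaxIdleConns:        100,
	MaxIdleConnsPerHost: 20,
	IdleConnTimeout:     90 * time.Second,
}

var registryClient = &http.Client{
	Transport: sharedTransport,
	Timeout:   30 * time.Second,
}
```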

@krancour
Member Author

In addition to what is being done in this PR, which is excellent, I wonder if this would be a good time to handle the // TODO about making the concurrency settings configurable.

Are you referring to the rate limit (currently 20 per registry) or the number of goroutines allowed to work on pulling down metadata concurrently (currently capped at 1,000, but shared across all registries)?
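
(For context, those two limits compose roughly like the sketch below. This is the general pattern, not Kargo's actual code, and the per-second unit on the limiter is an assumption; the thread doesn't specify it.)

```go
import (
	"context"

	"golang.org/x/sync/semaphore"
	"golang.org/x/time/rate"
)

// One limiter per registry (20 in the current code; per-second assumed)...
var registryLimiter = rate.NewLimiter(rate.Limit(20), 1)

// ...and one semaphore shared across ALL registries, capping the total
// number of goroutines pulling metadata concurrently at 1,000.
var metadataSem = semaphore.NewWeighted(1000)

func fetchMetadata(ctx context.Context, tag string) error {
	if err := metadataSem.Acquire(ctx, 1); err != nil {
		return err
	}
	defer metadataSem.Release(1)
	if err := registryLimiter.Wait(ctx); err != nil {
		return err
	}
	// ... perform the registry request for this tag ...
	return nil
}
```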

Assuming you're referring to the rate limit, I've always been open to making that configurable, but I have strongly held opinions about how it's approached. #1139 is the issue capturing my thoughts on this in more detail. Someone did open a PR not too long ago that was rejected; it only addressed rate limits and not the rest of #1139, which is OK in itself, but it did so in a way that was inadequately future-proof. We'd have been unable to build a comprehensive solution to registry configuration on top of it, which would inevitably have forced a breaking change on us. I'm in no way saying that we can't make rate limits configurable without addressing the rest of #1139. What I am saying is that whatever is done to make rate limits configurable must not paint us into a corner in terms of how we'd approach the rest of #1139.

Separate from that, I have some worry that increasing the rate limit isn't the panacea people seem to think it is. There's a natural inclination toward thinking that if you crank that way up, things just get faster. I think there's not enough weight being given to the possibility that when the client isn't self-limiting and the rate limits are being enforced server-side instead, things might actually get worse. This isn't a reason not to do it. It's just a concern.

Between my concern and my opinions about there being a (somewhat involved) right way of approaching this, my inclination is to not bundle that improvement into this cache improvement.

Why don't we see about getting @jessesuen's buy-in on the idea of prioritizing #1139 in v1.10.0?

@krancour krancour changed the title feat: allow opting into caching tags feat: allow opting into caching image metadata by tag Oct 31, 2025
@hiddeco
Contributor

hiddeco commented Oct 31, 2025

Are you referring to the rate limit (currently 20 per registry) or the number of goroutines allowed to work on pulling down metadata concurrently (currently capped at 1,000, but shared across all registries)?

Based on @nikolay-te's report, it is both.

@nikolay-te

Are you referring to the rate limit (currently 20 per registry) or the number of goroutines allowed to work on pulling down metadata concurrently (currently capped at 1,000, but shared across all registries)?

Based on @nikolay-te's report, it is both.

I think it was probably mostly the per-registry rate limit that gave us the improvement.
Since it's per registry, and we have only one registry for all images, we were seeing image warehouse reconciliations take up to an hour. This was even with all warehouses spread across 10 separate Kargo shards/instances dedicated to warehouse tasks. The Kargo agents, our registry, and the caching proxy in front of it were all mostly idle until we increased the rate limits. After that, discoveries that had been taking an hour dropped to 1-2 minutes; many finished in seconds.

Having said that, I see now that the caching proxies I've set up (nginx, passing through tag-list operations and caching only manifests) are able to hold ALL images in our environment mostly in memory (just below 2G RAM), which makes me think this caching feature has the potential to bring similar improvements even with the default rate limits, except perhaps on cold starts.

Signed-off-by: Kent Rancourt <[email protected]>
@krancour
Member Author

@nikolay-te there's a lot of interesting insight there. Thank you.

Rate limit-related improvements are coming, but they won't be bundled into this PR.

Signed-off-by: Kent Rancourt <[email protected]>
"tag", tag,
)

cacheKey := fmt.Sprintf("%s:%s", r.repoURL, tag)
Contributor

Should this take the platform constraints into account? Otherwise, I think we could get mismatches?

Member Author

🤔 I think you're right. I'll look at that.

Member Author

Yep. Will fix. Converting to draft in the meantime.
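
A fix along these lines would fold the platform constraint into the key; the r.platform field and its OS/Arch/Variant members here are hypothetical stand-ins for whatever the selector actually carries:

```go
// Hypothetical sketch: include any platform constraint in the cache key so
// that the same tag under different platform constraints can't collide.
cacheKey := fmt.Sprintf("%s:%s", r.repoURL, tag)
if r.platform != nil {
	cacheKey = fmt.Sprintf(
		"%s:%s/%s/%s",
		cacheKey, r.platform.OS, r.platform.Arch, r.platform.Variant,
	)
}
```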


func init() {
var err error
imageCache, err = cache.NewInMemoryCache[Image](100000)
Contributor

Should the size be configurable?

Member Author

I thought it was... there might be a commit missing from this PR. Will follow up on that.

Member Author

ffb1e16 was missing previously. It's configurable now.
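
For illustration, wiring the size through might look something like this; the env var name and the fallback are hypothetical, with 100,000 taken from the previous hard-coded default:

```go
import (
	"os"
	"strconv"
)

// Hypothetical sketch: read the cache size from the environment, falling
// back to the previous hard-coded default of 100,000 entries.
func imageCacheSize() int {
	if v := os.Getenv("IMAGE_CACHE_SIZE"); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return 100000
}
```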

Signed-off-by: Kent Rancourt <[email protected]>
@krancour krancour marked this pull request as draft November 4, 2025 21:36