Skip to content

APIGOV-31191 validate cache#1015

Open
alrosca wants to merge 7 commits intomainfrom
APIGOV-31191
Open

APIGOV-31191 validate cache#1015
alrosca wants to merge 7 commits intomainfrom
APIGOV-31191

Conversation

@alrosca
Copy link
Copy Markdown
Collaborator

@alrosca alrosca commented Mar 10, 2026

Cache validation on start up and reconnect

jcollins-axway
jcollins-axway previously approved these changes Mar 11, 2026
Copy link
Copy Markdown
Collaborator

@jcollins-axway jcollins-axway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good from a high level, needs pounded testing wise

@jcollins-axway
Copy link
Copy Markdown
Collaborator

copilots review

Code Review: APIGOV-31191 vs main

Summary

  • Scope: reviewed 8 files (314 insertions, 4 deletions), focused on new cache validation flow, reconnect hooks, and cache manager interfaces.
  • Verdict: request changes
  • Severity: medium (2), low (0), high (0), blocker (0)

Findings

  • Medium — Cache validation uses non-unique key (name) and can collide across scoped resources

  • Medium — Cache validator hard-codes API version to v1alpha1, which can skip/incorrectly query some filters

    • Where: pkg/agent/cachevalidationjob.go, pkg/agent/cachevalidationjob.go
    • Description: validation URL is built from a synthetic ResourceInstance with fixed APIVersion: "v1alpha1". For filters whose resource version differs, GetKindLink() may produce the wrong endpoint or empty link, and validation is skipped (return true).
    • Possible impacts to the codebase: out-of-sync persisted cache may pass validation and remain in use after startup/reconnect for affected resource kinds.

Positives

  • Cache validation logic is cleanly encapsulated (cacheValidator) and integrated into both startup and reconnect paths.
  • Reconnect hooks were added consistently to poll and stream clients with minimal API surface change (WithOnReconnect options).
  • Logging around validation failures includes useful operational context (resource counts, resource names, timestamps).

Recommendations

  • Use a stable unique key for comparisons (for example selfLink or a composite of group/kind/scope/name) instead of name alone.
  • Derive API version from filter/resource metadata or from server-discovered GVK mappings rather than hard-coding "v1alpha1".
  • Add targeted unit tests for: duplicate names across scopes, non-v1alpha1 resources, and reconnect-triggered validation behavior.

@alrosca alrosca closed this Mar 12, 2026
@alrosca alrosca reopened this Mar 12, 2026
@alrosca
Copy link
Copy Markdown
Collaborator Author

alrosca commented Mar 12, 2026

Not a bad review at all

jcollins-axway
jcollins-axway previously approved these changes Mar 12, 2026
jcollins-axway
jcollins-axway previously approved these changes Mar 13, 2026
@jcollins-axway
Copy link
Copy Markdown
Collaborator

caching changes require an extra amount of testing as its usually to blame for our duplicate service issues

@sbolosan
Copy link
Copy Markdown
Collaborator

Give me time to look at this closer

}

func (es *EventSync) RebuildCache() {
// SDB - NOTE : Do we need to pause jobs.
Copy link
Copy Markdown
Collaborator

@sbolosan sbolosan Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dang! @vivekschauhan @jcollins-axway I put this note in a very long time ago ;).
I was looking at another race condition down stream, but decided to see who called RebuildCache().

So RebuildCache is reachable by WithOnClientStop (poller) and With EventSyncError (stream) while the event listener go routine is alive, well technically alive.

In those paths the race is meh, maybe not a big deal because event flow has stopped like a harvester error or when the stream is not yet producing. But PublishingLock still doesn't coordinate with the listener. Maybe we should take a look at this again?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we eat the error?
At least log the error/warn and then if requirement is to rebuild, then rebuild?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe something like

    timeToRebuild, err := a.shouldRebuildCache()
    if err != nil {
        a.logger.WithError(err).Warn("unable to determine cache rebuild state, triggering rebuild")
        timeToRebuild = true
    }
    if timeToRebuild && a.rebuildCache != nil {
        a.rebuildCache.RebuildCache()
    }

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can skip the error logging outside shouldRebuildCache since we already log it inside the method. Returning true on error inside the method would not require checking the return error and explicitly putting timeToRebuild on true

if value != nil {
logger := a.logger.WithField("cacheUpdateTime", value)
// get current cacheUpdateTime from x-agent-details
convToTimestamp, err := strconv.ParseInt(value.(string), 10, 64)
Copy link
Copy Markdown
Collaborator

@sbolosan sbolosan Mar 25, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is old and not part of your changes, but can we do a safe guard here

strVal, ok := value.(string)
            if !ok {
                logger.Warn("cacheUpdateTime is not a string, triggering rebuild")
                return true, nil
            }

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this defensive code since in pkg/agent/evensync.go, line 162 that value is put in agent details and will always be a string? I don't think there are permissions to alter the agent details subresource from cli and I don't think anyone would do that, right?

Copy link
Copy Markdown

@jasonmscollins jasonmscollins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments from Shane

Copy link
Copy Markdown
Collaborator

@jcollins-axway jcollins-axway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see comments

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants