Skip to content

fix(scheduler cache): evict goroutine leak, premature event, silent errors#5225

Open
PrasannaMishra001 wants to merge 1 commit intovolcano-sh:masterfrom
PrasannaMishra001:fix/evict-goroutine-leak-event
Open

fix(scheduler cache): evict goroutine leak, premature event, silent errors#5225
PrasannaMishra001 wants to merge 1 commit intovolcano-sh:masterfrom
PrasannaMishra001:fix/evict-goroutine-leak-event

Conversation

@PrasannaMishra001
Copy link
Copy Markdown

@PrasannaMishra001 PrasannaMishra001 commented Apr 17, 2026

What type of PR is this?

/kind bug

What this PR does / why we need it:

SchedulerCache.Evict() spawns a background goroutine to call Evictor.Evict() against the apiserver. Three issues in that goroutine:

  1. It has no stopCh awareness, so in flight evictions keep running past scheduler shutdown while every other worker spawned from Run() stops.
  2. The Recorder.Eventf(..., "Evict", ...) kubernetes event fires synchronously before the evict call, so the event is recorded even when the evict later fails.
  3. On error the goroutine silently calls resyncTask and throws the error away, with no log and no event.

Fix: store stopCh on SchedulerCache and skip the call on shutdown, move the Normal event to the success branch, emit a Warning EvictFailed event and klog.Errorf on failure.

Which issue(s) this PR fixes:

Fixes #5224

Special notes for your reviewer:

Does this PR introduce a user-facing change?

NONE

Signed-off-by: Leonidas <mishra.prasanna838@gmail.com>
Copilot AI review requested due to automatic review settings April 17, 2026 08:06
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign wangyang0616 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 17, 2026
@PrasannaMishra001
Copy link
Copy Markdown
Author

PrasannaMishra001 commented Apr 17, 2026

@hajnalmt @JesseStutler could you please take a look when you have time. This PR addresses issue #5224 in the scheduler cache eviction path and aligns with the kind of cache/event handler changes you have reviewed recently (for example #5184, #5182, #5172).

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes issues in SchedulerCache.Evict() related to scheduler shutdown handling, Kubernetes event ordering, and eviction error visibility (Fixes #5224).

Changes:

  • Store stopCh on SchedulerCache at Run() time so eviction work can observe scheduler shutdown.
  • Move the Normal Evict event to fire only after a successful Evictor.Evict() call; emit a Warning EvictFailed event on failure.
  • Log eviction failures via klog.Errorf instead of silently resyncing.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +972 to +981
// Skip eviction if the scheduler is shutting down to avoid holding the
// apiserver connection open past shutdown and firing misleading events.
if sc.stopCh != nil {
select {
case <-sc.stopCh:
klog.V(3).Infof("Skipping evict of pod %s/%s: scheduler is shutting down", p.Namespace, p.Name)
return
default:
}
}
Comment on lines 971 to 991
go func() {
err := sc.Evictor.Evict(p, reason)
if err != nil {
// Skip eviction if the scheduler is shutting down to avoid holding the
// apiserver connection open past shutdown and firing misleading events.
if sc.stopCh != nil {
select {
case <-sc.stopCh:
klog.V(3).Infof("Skipping evict of pod %s/%s: scheduler is shutting down", p.Namespace, p.Name)
return
default:
}
}

if err := sc.Evictor.Evict(p, reason); err != nil {
klog.Errorf("Failed to evict pod %s/%s: %v", p.Namespace, p.Name, err)
sc.Recorder.Eventf(podgroup, v1.EventTypeWarning, "EvictFailed",
"Failed to evict pod %s/%s: %v", p.Namespace, p.Name, err)
sc.resyncTask(task)
return
}
sc.Recorder.Eventf(podgroup, v1.EventTypeNormal, "Evict", reason)
}()
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a shutdown observation mechanism in the SchedulerCache by storing the stopCh and checking it within the Evict method to prevent new evictions during shutdown. Feedback suggests using a static format string for event recording to avoid potential formatting issues with the reason variable. Additionally, it was noted that while the shutdown check is beneficial, the eviction call itself remains a blocking operation that could benefit from context-based cancellation to ensure cleaner shutdowns and consistent state.

Comment on lines +974 to +981
if sc.stopCh != nil {
select {
case <-sc.stopCh:
klog.V(3).Infof("Skipping evict of pod %s/%s: scheduler is shutting down", p.Namespace, p.Name)
return
default:
}
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The non-blocking check for stopCh at the start of the goroutine is a good addition to prevent starting new evictions during shutdown. However, the sc.Evictor.Evict call itself (line 983) is a blocking operation that performs network I/O to the API server. If the scheduler shuts down while this call is in progress, the goroutine will still hang until the network call completes or times out, as the Evictor interface does not currently accept a context.Context for cancellation. While changing the interface might be out of scope for this PR, consider that the task status remains in Releasing state if this goroutine returns early due to shutdown, which could lead to inconsistent cache state if the scheduler is restarted within the same process.

sc.resyncTask(task)
return
}
sc.Recorder.Eventf(podgroup, v1.EventTypeNormal, "Evict", reason)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using the reason variable directly as the format string in Eventf can be risky if the reason string contains unexpected characters like %. It is safer to use a static format string.

Suggested change
sc.Recorder.Eventf(podgroup, v1.EventTypeNormal, "Evict", reason)
sc.Recorder.Eventf(podgroup, v1.EventTypeNormal, "Evict", "%s", reason)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(scheduler-cache): Evict() leaks goroutine on shutdown, emits "Evict" event before eviction, swallows errors silently

3 participants