fix(scheduler cache): evict goroutine leak, premature event, silent errors#5225
fix(scheduler cache): evict goroutine leak, premature event, silent errors#5225PrasannaMishra001 wants to merge 1 commit intovolcano-sh:masterfrom
Conversation
Signed-off-by: Leonidas <mishra.prasanna838@gmail.com>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Pull request overview
This PR fixes issues in SchedulerCache.Evict() related to scheduler shutdown handling, Kubernetes event ordering, and eviction error visibility (Fixes #5224).
Changes:
- Store
stopChonSchedulerCacheatRun()time so eviction work can observe scheduler shutdown. - Move the Normal
Evictevent to fire only after a successfulEvictor.Evict()call; emit a WarningEvictFailedevent on failure. - Log eviction failures via
klog.Errorfinstead of silently resyncing.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // Skip eviction if the scheduler is shutting down to avoid holding the | ||
| // apiserver connection open past shutdown and firing misleading events. | ||
| if sc.stopCh != nil { | ||
| select { | ||
| case <-sc.stopCh: | ||
| klog.V(3).Infof("Skipping evict of pod %s/%s: scheduler is shutting down", p.Namespace, p.Name) | ||
| return | ||
| default: | ||
| } | ||
| } |
| go func() { | ||
| err := sc.Evictor.Evict(p, reason) | ||
| if err != nil { | ||
| // Skip eviction if the scheduler is shutting down to avoid holding the | ||
| // apiserver connection open past shutdown and firing misleading events. | ||
| if sc.stopCh != nil { | ||
| select { | ||
| case <-sc.stopCh: | ||
| klog.V(3).Infof("Skipping evict of pod %s/%s: scheduler is shutting down", p.Namespace, p.Name) | ||
| return | ||
| default: | ||
| } | ||
| } | ||
|
|
||
| if err := sc.Evictor.Evict(p, reason); err != nil { | ||
| klog.Errorf("Failed to evict pod %s/%s: %v", p.Namespace, p.Name, err) | ||
| sc.Recorder.Eventf(podgroup, v1.EventTypeWarning, "EvictFailed", | ||
| "Failed to evict pod %s/%s: %v", p.Namespace, p.Name, err) | ||
| sc.resyncTask(task) | ||
| return | ||
| } | ||
| sc.Recorder.Eventf(podgroup, v1.EventTypeNormal, "Evict", reason) | ||
| }() |
There was a problem hiding this comment.
Code Review
This pull request introduces a shutdown observation mechanism in the SchedulerCache by storing the stopCh and checking it within the Evict method to prevent new evictions during shutdown. Feedback suggests using a static format string for event recording to avoid potential formatting issues with the reason variable. Additionally, it was noted that while the shutdown check is beneficial, the eviction call itself remains a blocking operation that could benefit from context-based cancellation to ensure cleaner shutdowns and consistent state.
| if sc.stopCh != nil { | ||
| select { | ||
| case <-sc.stopCh: | ||
| klog.V(3).Infof("Skipping evict of pod %s/%s: scheduler is shutting down", p.Namespace, p.Name) | ||
| return | ||
| default: | ||
| } | ||
| } |
There was a problem hiding this comment.
The non-blocking check for stopCh at the start of the goroutine is a good addition to prevent starting new evictions during shutdown. However, the sc.Evictor.Evict call itself (line 983) is a blocking operation that performs network I/O to the API server. If the scheduler shuts down while this call is in progress, the goroutine will still hang until the network call completes or times out, as the Evictor interface does not currently accept a context.Context for cancellation. While changing the interface might be out of scope for this PR, consider that the task status remains in Releasing state if this goroutine returns early due to shutdown, which could lead to inconsistent cache state if the scheduler is restarted within the same process.
| sc.resyncTask(task) | ||
| return | ||
| } | ||
| sc.Recorder.Eventf(podgroup, v1.EventTypeNormal, "Evict", reason) |
There was a problem hiding this comment.
Using the reason variable directly as the format string in Eventf can be risky if the reason string contains unexpected characters like %. It is safer to use a static format string.
| sc.Recorder.Eventf(podgroup, v1.EventTypeNormal, "Evict", reason) | |
| sc.Recorder.Eventf(podgroup, v1.EventTypeNormal, "Evict", "%s", reason) |
What type of PR is this?
/kind bug
What this PR does / why we need it:
SchedulerCache.Evict()spawns a background goroutine to callEvictor.Evict()against the apiserver. Three issues in that goroutine:Run()stops.Recorder.Eventf(..., "Evict", ...)kubernetes event fires synchronously before the evict call, so the event is recorded even when the evict later fails.resyncTaskand throws the error away, with no log and no event.Fix: store
stopChonSchedulerCacheand skip the call on shutdown, move the Normal event to the success branch, emit a WarningEvictFailedevent andklog.Errorfon failure.Which issue(s) this PR fixes:
Fixes #5224
Special notes for your reviewer:
Does this PR introduce a user-facing change?