Skip to content

test(job/state): add unit tests for job controller state machine#5244

Closed
kube-gopher wants to merge 13 commits into
volcano-sh:masterfrom
kube-gopher:test/job-state-controller-unit-tests
Closed

test(job/state): add unit tests for job controller state machine#5244
kube-gopher wants to merge 13 commits into
volcano-sh:masterfrom
kube-gopher:test/job-state-controller-unit-tests

Conversation

@kube-gopher
Copy link
Copy Markdown
Contributor

@kube-gopher kube-gopher commented Apr 26, 2026

Summary

Add unit tests for pkg/controllers/job/state/ package, which previously
had 0% test coverage. This package implements the Volcano Job lifecycle
state machine (Pending → Running → Completing/Terminating/Aborting/
Restarting → terminal states).

Changes

File Test File Coverage
util.go util_test.go 100%
factory.go factory_test.go 100%
finished.go finished_test.go 100%
aborted.go aborted_test.go 100%
terminating.go terminating_test.go 100%
aborting.go aborting_test.go 100%
pending.go pending_test.go 100%
completing.go completing_test.go 100%
restarting.go restarting_test.go 100%
running.go running_test.go 100%

Test Coverage

Before: 0.0%
After: 98.6%

How to Verify

go test ./pkg/controllers/job/state/... -v -coverprofile=cover.out
go tool cover -func=cover.out

Signed-off-by: kube-gopher <jzlyy68@gmail.com>
Signed-off-by: kube-gopher <jzlyy68@gmail.com>
Signed-off-by: kube-gopher <jzlyy68@gmail.com>
Signed-off-by: kube-gopher <jzlyy68@gmail.com>
Signed-off-by: kube-gopher <jzlyy68@gmail.com>
@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 26, 2026
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign shinytang6 for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Apr 26, 2026
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a comprehensive suite of unit tests for the job state machine, covering various states (Aborted, Finished, Terminating), the state factory, and utility functions. The feedback focuses on improving test maintainability by unifying duplicated test helpers across files and utilizing existing constants for pod phase retention checks instead of hardcoding values.

Comment on lines +29 to +51
// capturedKillJob records one KillJob invocation and preserves the updateFn
// so tests can invoke it directly to verify the state-transition logic.
type capturedKillJob struct {
job *apis.JobInfo
podRetainPhase PhaseMap
updateFn UpdateStatusFn
}

// captureKillJob replaces KillJob with a stub that records its arguments and
// returns returnErr. The original value is restored via t.Cleanup.
func captureKillJob(t *testing.T, returnErr error) *capturedKillJob {
t.Helper()
original := KillJob
cap := &capturedKillJob{}
KillJob = func(job *apis.JobInfo, podRetainPhase PhaseMap, fn UpdateStatusFn) error {
cap.job = job
cap.podRetainPhase = podRetainPhase
cap.updateFn = fn
return returnErr
}
t.Cleanup(func() { KillJob = original })
return cap
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The test helper struct capturedKillJob and the function captureKillJob are duplicated with slight variations (e.g., killJobCall and stubKillJob in finished_test.go) across multiple test files in the same package. Since these files share the same package namespace, it is recommended to unify these helpers into a single implementation (e.g., in a common_test.go file or by defining it once in a primary test file) to improve maintainability and reduce code duplication.

Comment on lines +100 to +107
for _, phase := range []v1.PodPhase{v1.PodSucceeded, v1.PodFailed} {
if _, ok := call.podRetainPhase[phase]; !ok {
t.Errorf("podRetainPhase missing %q; got %v", phase, call.podRetainPhase)
}
}
if len(call.podRetainPhase) != 2 {
t.Errorf("podRetainPhase should have exactly 2 entries, got %d", len(call.podRetainPhase))
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The test hardcodes the pod phases (PodSucceeded, PodFailed) and the expected count (2) when verifying the podRetainPhase map. It is better to iterate over the PodRetainPhaseSoft constant defined in factory.go to ensure consistency and avoid brittleness if the set of retained phases changes in the future. This approach is already used in aborted_test.go.

Suggested change
for _, phase := range []v1.PodPhase{v1.PodSucceeded, v1.PodFailed} {
if _, ok := call.podRetainPhase[phase]; !ok {
t.Errorf("podRetainPhase missing %q; got %v", phase, call.podRetainPhase)
}
}
if len(call.podRetainPhase) != 2 {
t.Errorf("podRetainPhase should have exactly 2 entries, got %d", len(call.podRetainPhase))
}
for phase := range PodRetainPhaseSoft {
if _, ok := call.podRetainPhase[phase]; !ok {
t.Errorf("podRetainPhase missing %q; got %v", phase, call.podRetainPhase)
}
}
if len(call.podRetainPhase) != len(PodRetainPhaseSoft) {
t.Errorf("podRetainPhase has %d entries, want %d", len(call.podRetainPhase), len(PodRetainPhaseSoft))
}

Signed-off-by: kube-gopher <jzlyy68@gmail.com>
Signed-off-by: kube-gopher <jzlyy68@gmail.com>
Signed-off-by: kube-gopher <jzlyy68@gmail.com>
Signed-off-by: kube-gopher <jzlyy68@gmail.com>
Signed-off-by: kube-gopher <jzlyy68@gmail.com>
@kube-gopher kube-gopher marked this pull request as ready for review April 27, 2026 01:50
Copilot AI review requested due to automatic review settings April 27, 2026 01:50
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 27, 2026
@kube-gopher
Copy link
Copy Markdown
Contributor Author

/test all

@volcano-sh-bot
Copy link
Copy Markdown
Contributor

@kube-gopher: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

Details

In response to this:

/test all

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds unit test coverage for pkg/controllers/job/state, validating the Volcano Job lifecycle state-machine behavior (action routing and phase transitions) without needing a fake k8s client.

Changes:

  • Adds table-driven unit tests for state Execute() methods across phases (Pending/Running/Restarting/Completing/Terminating/Aborting/Aborted/Finished).
  • Adds unit tests for shared helpers (TotalTasks, TotalTaskMinAvailable) and for NewState factory behavior / pod-retain phase maps.
  • Introduces lightweight stubs for package-level action functions (KillJob, KillTarget, SyncJob) to assert call arguments and exercise update functions.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
pkg/controllers/job/state/util_test.go Adds tests for task counting helpers (TotalTasks, TotalTaskMinAvailable).
pkg/controllers/job/state/factory_test.go Adds tests for NewState phase-to-state mapping and PodRetainPhase* constants.
pkg/controllers/job/state/finished_test.go Adds tests ensuring finished-state always calls KillJob with soft retention and nil updateFn.
pkg/controllers/job/state/pending_test.go Adds tests for Pending state action routing and SyncJob updateFn behavior.
pkg/controllers/job/state/running_test.go Adds tests for Running state action branches and default SyncJob updateFn phase decisions.
pkg/controllers/job/state/restarting_test.go Adds tests for restartingUpdateStatus and Restarting state action routing.
pkg/controllers/job/state/completing_test.go Adds tests for Completing state KillJob usage and drain-to-Completed transition logic.
pkg/controllers/job/state/terminating_test.go Adds tests for Terminating state KillJob usage and drain-to-Terminated transition logic.
pkg/controllers/job/state/aborting_test.go Adds tests for Aborting state Resume vs default behavior and Aborted transition logic.
pkg/controllers/job/state/aborted_test.go Adds tests for Aborted state Resume behavior vs default (nil updateFn) behavior.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +170 to +177
originalPhase := tc.status.State.Phase
changed := cap.updateFn(&tc.status)

if changed {
t.Error("updateFn should return false while alive pods exist")
}
if tc.status.State.Phase != originalPhase {
t.Errorf("phase should not change: got %q, want %q", tc.status.State.Phase, originalPhase)
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this “alive pods” test, tc.status.State.Phase is never initialized (defaults to ""), so asserting it remains equal to originalPhase doesn’t verify that the phase stays Terminating. Set tc.status.State.Phase to vcbatch.Terminating before calling updateFn and assert it remains Terminating when changed==false.

Suggested change
originalPhase := tc.status.State.Phase
changed := cap.updateFn(&tc.status)
if changed {
t.Error("updateFn should return false while alive pods exist")
}
if tc.status.State.Phase != originalPhase {
t.Errorf("phase should not change: got %q, want %q", tc.status.State.Phase, originalPhase)
tc.status.State.Phase = vcbatch.Terminating
changed := cap.updateFn(&tc.status)
if changed {
t.Error("updateFn should return false while alive pods exist")
}
if tc.status.State.Phase != vcbatch.Terminating {
t.Errorf("phase should remain %q: got %q", vcbatch.Terminating, tc.status.State.Phase)

Copilot uses AI. Check for mistakes.
Comment on lines +170 to +177
originalPhase := tc.status.State.Phase
changed := cap.updateFn(&tc.status)

if changed {
t.Error("updateFn should return false while alive pods exist")
}
if tc.status.State.Phase != originalPhase {
t.Errorf("phase should not change: got %q, want %q", tc.status.State.Phase, originalPhase)
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue as terminatingState alive-pod tests: tc.status.State.Phase is left as the zero value, so the “phase unchanged” assertion can pass even if updateFn incorrectly clears/changes the phase. Initialize the status phase to vcbatch.Completing before invoking updateFn and assert it stays Completing when changed==false.

Suggested change
originalPhase := tc.status.State.Phase
changed := cap.updateFn(&tc.status)
if changed {
t.Error("updateFn should return false while alive pods exist")
}
if tc.status.State.Phase != originalPhase {
t.Errorf("phase should not change: got %q, want %q", tc.status.State.Phase, originalPhase)
tc.status.State.Phase = vcbatch.Completing
changed := cap.updateFn(&tc.status)
if changed {
t.Error("updateFn should return false while alive pods exist")
}
if tc.status.State.Phase != vcbatch.Completing {
t.Errorf("phase should remain %q: got %q", vcbatch.Completing, tc.status.State.Phase)

Copilot uses AI. Check for mistakes.
if err := s.Execute(Action{Action: v1alpha1.SyncJobAction}); err != nil {
t.Fatalf("unexpected error: %v", err)
}

Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here originalPhase is taken from tc.status.State.Phase, but the test cases don’t set an initial phase (so it’s ""). To actually verify abortingState’s updateFn keeps the phase Aborting while pods are still alive, set tc.status.State.Phase = vcbatch.Aborting before calling updateFn and assert it remains Aborting when changed==false.

Suggested change
tc.status.State.Phase = vcbatch.Aborting

Copilot uses AI. Check for mistakes.
Comment on lines +503 to +505
originalPhase := tc.status.State.Phase
changed := cap.updateFn(&tc.status)

Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These “stays pending” cases never set tc.status.State.Phase to Pending, so the “phase unchanged” assertion is comparing against the zero value ("") rather than vcbatch.Pending. Initialize each test status with State.Phase = vcbatch.Pending so the test verifies the updateFn truly preserves the Pending phase when MinAvailable is not met.

Copilot uses AI. Check for mistakes.
Comment on lines +640 to +642
if status.State.Phase != vcbatch.Running && status.State.Phase != "" {
t.Errorf("phase should remain unchanged, got %q", status.State.Phase)
}
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allowing status.State.Phase to remain "" weakens this assertion: in real usage JobStatus.State.Phase should already be Running, and the updateFn should leave it Running when no transition is needed. Consider setting the initial status.State.Phase = vcbatch.Running in these test cases and then asserting it remains Running when changed==false.

Copilot uses AI. Check for mistakes.

// stubKillJob replaces the KillJob package variable with a stub that records
// its arguments and returns the given error. It restores the original value
// via t.Cleanup so parallel tests are not affected.
Copy link

Copilot AI Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment is misleading: restoring a package-level variable in t.Cleanup prevents cross-test pollution when tests run sequentially, but it does not make the stub safe for parallel tests (t.Parallel), since other tests could observe the stub concurrently. Consider rewording to avoid implying parallel-safety, or explicitly note that tests must not run in parallel while stubbing globals.

Suggested change
// via t.Cleanup so parallel tests are not affected.
// via t.Cleanup to avoid leaking the stub into later sequential tests; tests
// that use this helper must not run in parallel because KillJob is global.

Copilot uses AI. Check for mistakes.
@JesseStutler
Copy link
Copy Markdown
Member

This PR appears to contain a lot of AI-generated code, including the description. Please explain your use of AI.

@kube-gopher
Copy link
Copy Markdown
Contributor Author

@JesseStutler Thanks for pointing this out. I did use AI assistance to help write parts of the code and description. That said, I’ve reviewed and tested the code myself and understand what it does.

@kube-gopher
Copy link
Copy Markdown
Contributor Author

@JesseStutler After using AI assistance to write code, I will carefully review every detail written by the AI to ensure code quality.

Signed-off-by: kube-gopher <jzlyy68@gmail.com>
@kube-gopher kube-gopher force-pushed the test/job-state-controller-unit-tests branch from 702d972 to 3e402bf Compare April 27, 2026 08:22
Signed-off-by: kube-gopher <jzlyy68@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: kube-gopher <jzlyy68@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@kube-gopher
Copy link
Copy Markdown
Contributor Author

@JesseStutler I've carefully reviewed the quality of the relevant code, found and fixed some minor errors. Could you please take a look and let me know if there are any other areas that could be improved?

@kube-gopher kube-gopher deleted the test/job-state-controller-unit-tests branch May 11, 2026 04:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants