Skip to content

enhancement(agent): networkqos adapt to cgroup v2#5239

Open
3th4novo wants to merge 6 commits intovolcano-sh:masterfrom
3th4novo:networkqos-v2
Open

enhancement(agent): networkqos adapt to cgroup v2#5239
3th4novo wants to merge 6 commits intovolcano-sh:masterfrom
3th4novo:networkqos-v2

Conversation

@3th4novo
Copy link
Copy Markdown
Contributor

@3th4novo 3th4novo commented Apr 24, 2026

What type of PR is this?

area/agent

What this PR does / why we need it:

Adapts volcano agent by utilizing oncn-bwm to set network priority under cgroup v2.

Which issue(s) this PR fixes:

Fixes #5238

Special notes for your reviewer:

@JesseStutler @Tau721 @wangyang0616

Does this PR introduce a user-facing change?

NONE

Signed-off-by: Yiteng Zhang <ethanq918@163.com>
Copilot AI review requested due to automatic review settings April 24, 2026 09:27
@volcano-sh-bot
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign william-wang for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 24, 2026
@3th4novo 3th4novo changed the title networkqos adapt to cgroup v2 enhancement(agent): networkqos adapt to cgroup v2 Apr 24, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adapts Volcano Agent’s Network QoS pod-level marking to work on cgroup v2 hosts by switching from writing net_cls.classid (v1-only) to invoking bwmcli with the unified cgroup path.

Changes:

  • Add cgroup version detection in the NetworkQoS pod event handler and a cgroup v2 implementation that calls bwmcli.
  • Clarify in cgroup utilities that net_cls and net_cls.classid are cgroup v1-only concepts.
  • Add unit tests covering cgroup v2 behavior (bwmcli invocation, QoS normalization) and bandwidth-annotation skip behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
pkg/agent/utils/cgroup/cgroup.go Documents net_cls/net_cls.classid as v1-only and points to bwmcli for v2.
pkg/agent/events/handlers/networkqos/networkqos_linux.go Adds v1/v2 dispatch; implements v2 path using bwmcli with timeout.
pkg/agent/events/handlers/networkqos/networkqos_linux_test.go Adds v2 unit tests with mocked executor and tests for annotation-based skipping.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

informerFactory := informers.NewSharedInformerFactory(fakeClient, 0)
informerFactory.Core().V1().Pods().Informer()
informerFactory.Start(context.TODO().Done())
cache.WaitForNamedCacheSync("", context.TODO().Done(), informerFactory.Core().V1().Pods().Informer().HasSynced)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache.WaitForNamedCacheSync return value is ignored here. Please check/assert the boolean return so this test doesn't proceed with an unsynced lister (which can lead to flakes if the pod isn't observed yet).

Suggested change
cache.WaitForNamedCacheSync("", context.TODO().Done(), informerFactory.Core().V1().Pods().Informer().HasSynced)
synced := cache.WaitForNamedCacheSync("", context.TODO().Done(), informerFactory.Core().V1().Pods().Informer().HasSynced)
assert.True(t, synced, "failed to sync pod informer cache")

Copilot uses AI. Check for mistakes.
Comment on lines 122 to 128
err = utils.UpdatePodCgroup(qosLevelFile, qosLevel)
if os.IsNotExist(err) {
klog.InfoS("Cgroup file not existed", "cgroupFile", qosLevelFile)
return nil
}
klog.InfoS("Successfully set network qos level to cgroup file", "qosLevel", string(qosLevel), "cgroupFile", qosLevelFile)
return nil
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handleV1 ignores any UpdatePodCgroup error other than os.IsNotExist(err) and still logs success. This can mask permission/IO failures and make NetworkQoS silently ineffective. Please handle err != nil explicitly (e.g., return the error or log and return), and only log success when the update succeeds.

Copilot uses AI. Check for mistakes.
Comment on lines +170 to +171
os.Setenv("VOLCANO_TEST_CGROUP_VERSION", "v2")
defer os.Unsetenv("VOLCANO_TEST_CGROUP_VERSION")
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test sets VOLCANO_TEST_CGROUP_VERSION and always unsets it in defer, which can clobber an existing value (and differs from the restore pattern used in other tests). Please preserve and restore the original value (or use t.Setenv).

Suggested change
os.Setenv("VOLCANO_TEST_CGROUP_VERSION", "v2")
defer os.Unsetenv("VOLCANO_TEST_CGROUP_VERSION")
t.Setenv("VOLCANO_TEST_CGROUP_VERSION", "v2")

Copilot uses AI. Check for mistakes.
informerFactory := informers.NewSharedInformerFactory(fakeClient, 0)
informerFactory.Core().V1().Pods().Informer()
informerFactory.Start(context.TODO().Done())
cache.WaitForNamedCacheSync("", context.TODO().Done(), informerFactory.Core().V1().Pods().Informer().HasSynced)
Copy link

Copilot AI Apr 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache.WaitForNamedCacheSync return value is ignored here. If the informer cache doesn't sync, the handler may skip (pod not found) and the assertions on mockExec.lastCmd can become flaky. Please assert the sync succeeds (check the boolean return) before proceeding.

Suggested change
cache.WaitForNamedCacheSync("", context.TODO().Done(), informerFactory.Core().V1().Pods().Informer().HasSynced)
assert.True(t,
cache.WaitForNamedCacheSync("", context.TODO().Done(), informerFactory.Core().V1().Pods().Informer().HasSynced),
"pod informer cache failed to sync",
)

Copilot uses AI. Check for mistakes.
Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for cgroup v2 in the network QoS handler by implementing a new handleV2 method that utilizes the bwmcli tool, as the net_cls controller is unavailable in cgroup v2. The changes include logic to detect the cgroup version, a command executor with a timeout for bwmcli calls, and comprehensive unit tests for the new functionality and edge cases like bandwidth annotations. Feedback was provided regarding the safe handling of file paths in shell commands to prevent potential injection or parsing errors.


cmdCtx, cancel := context.WithTimeout(context.Background(), bwmcliCmdTimeout)
defer cancel()
cmd := fmt.Sprintf("bwmcli -s %s %d", cgroupPath, qosLevel)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The cgroupPath is used directly in a shell command string. If the path contains spaces or other shell-special characters, the command execution will fail or behave unexpectedly. Although cgroup paths in Kubernetes typically do not contain spaces, it is safer to wrap the path in single quotes to prevent potential shell injection or parsing issues, especially since the underlying executor uses sh -c.

Suggested change
cmd := fmt.Sprintf("bwmcli -s %s %d", cgroupPath, qosLevel)
cmd := fmt.Sprintf("bwmcli -s '%s' %d", cgroupPath, qosLevel)

Signed-off-by: Yiteng Zhang <ethanq918@163.com>
Signed-off-by: Yiteng Zhang <ethanq918@163.com>
Signed-off-by: Yiteng Zhang <ethanq918@163.com>
@volcano-sh-bot volcano-sh-bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels May 6, 2026
3th4novo added 2 commits May 6, 2026 15:37
Signed-off-by: Yiteng Zhang <ethanq918@163.com>
Signed-off-by: Yiteng Zhang <ethanq918@163.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

enhancement(agent): Volcano agent does not support Network QoS functionality under cgroup v2

3 participants