pillar, kube: SR-IOV VF passthrough via Multus + sriov-cni for KubeVirt#5908
pillar, kube: SR-IOV VF passthrough via Multus + sriov-cni for KubeVirt#5908zedi-pramodh wants to merge 1 commit into
Conversation
| install_sriov_manifests() { | ||
| [ -d "${KUBE_MANIFESTS_DIR}" ] || return | ||
| ls /sys/bus/pci/devices/*/sriov_numvfs >/dev/null 2>&1 || return | ||
| cp -u /etc/k3s-manifests/sriov-cni-daemonset.yaml \ |
There was a problem hiding this comment.
if we download during the k3s/multus install time, then we can remove this sriov-cni-daemonset.yaml
There was a problem hiding this comment.
I do not like the k3s/multus way. Not every device model has SRIOV enabled. So I am not seeing reason to do as part of k3s install. Instead we will load through Kubernetes DS only if we see SRIOV enabled interfaces.
There was a problem hiding this comment.
in /var/lib/cni/bin, they are a number of binaries just loaded by default, no use here. but just happens this sr-iov is not currently part of that.
|
|
||
| // sriovNADName returns the NetworkAttachmentDefinition name for a given PF. | ||
| // One NAD per PF — VLAN, when configured, lives in the NAD's CNI config. | ||
| func sriovNADName(pfName string) string { |
There was a problem hiding this comment.
Should pass in the vlanID:
func sriovNADName(pfName string, vlanID uint32) string {
if vlanID == 0 {
return "sriov-" + pfName
}
return fmt.Sprintf("sriov-%s-vlan%d", pfName, vlanID)
}
|
|
||
| nadName := sriovNADName(pfName) | ||
| resourceName := sriovResourceName(pfName) | ||
| cniConfig := fmt.Sprintf( |
There was a problem hiding this comment.
this may need to exclude of vlanID 0:
to check if vlanID > 0, then include this vlanID in the string.
claude-rsp
left a comment
There was a problem hiding this comment.
Code review
Thanks for the very thorough commit message — it makes the rationale for each piece easy to follow. Overall the design is sound: routing SR-IOV VFs through Multus + sriov-cni instead of KubeVirt HostDevices is the right fix for both the "fungible pool" allocation issue and the "all VMs see PF MAC" symptom, and the dynamic ConfigMap reconciler is a clean way to bridge EVE's AssignableAdapters model to the upstream device-plugin's static config schema.
A few concerns worth addressing before merge:
Correctness / behaviour
-
resolveVFMacdoes not canonicalise the user-configured MAC (pkg/pillar/hypervisor/kubevirt.golines 2435–2445). Today it returnsstrings.ToLower(cfgMac)directly. KubeVirt validates the MAC with a strict regex (^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})$-ish) and will reject formats like"aabb.ccdd.eeff"or values with leading/trailing whitespace already swallowed byTrimSpace. Suggest:if hw, err := net.ParseMAC(cfgMac); err == nil { return hw.String(), "config" }
This also defends against a user typo silently turning into a VMI admission failure deep in the start path.
-
derivePFMacFromVFSysfserrors are swallowed silently (line 2438:pfMac, _ := derivePFMacFromVFSysfs(...)). When sysfs is temporarily unreadable we treat that as "PF MAC unknown" and acceptcfgMacas-is, which is the exact pathological value (cfgMac == PF MAC) the check exists to guard against. At minimum log the error at Debug; better, treat sysfs failure as "can't trust cfgMac, fall through to generated". -
Concurrent reconcile race in
reconcileSRIOVDevicePlugin(pkg/pillar/cmd/zedkube/sriov_devplugin.go). The function is invoked from two paths — the AA subscription handler and thekubeCfgTimersafety net every 60 s. There's no mutex around the Get → compare → Update → DeletePod sequence. In practice the K8s API will reject one of two simultaneous Updates with a 409 (logged as a generic error), and the pod-delete on the loser will still happen against the new pod created by the DaemonSet after the winner's bounce. Add async.Mutex(or asingleflight.Group) on*zedkubefor this reconciler. -
garbageCollectSRIOVNADsis not called on theIsNotFound(initial-create) branch (lines 635–652). In the current shape, GC only runs when an existing ConfigMap drifts — so any leftover NAD that survives a fresh-install delete-then-recreate of the ConfigMap will sit ineve-kube-appuntil the next drift. Not a hot bug today (the only way to hit it is a manualkubectl delete cm sriovdp-config) but worth a one-liner: call GC after the create as well. -
Steady-state
setupVfreconcile re-programs MAC/VLAN on every PhysicalIOAdapterList event (pkg/pillar/cmd/domainmgr/domainmgr.golines 3452–3466). The new block runs unconditionally — both in the "no change" else branch and the add/modify branch.setupVf→createVfIoBundlecallssriov.SetupVfHardwareAddr/SetupVfVlanwhenever user config sets them, so we re-issue netlink ops on every adapter event. Idempotent on the kernel side, but it's churn. Consider gating the new call on eithercurrentIbPtr == nil(initial population missed) or an explicit "Vfs.Count changed" check; the no-change else branch is precisely the case where re-running setupVf is least useful. -
BindVFToVfioPCIassumes the VF arrives driverless. The comment says so, but the implementation will silently no-op if the VF is already bound to another driver (writingdriver_overridedoesn't unbind, anddrivers_probeonly probes unbound devices). Today's EVE flow (autoprobe=0) means this is fine, but the failure mode is silent — the device-plugin selector quietly matches nothing and the resource never appears, with nothing in the logs to point at the cause. Either (a) explicitly unbind first ifdriversymlink exists, or (b) at minimum checkdriverpost-write and warn if it's notvfio-pci. -
deleteLocalSriovDpPodreturns an error whenz.nodeName == "". Caller logs at Warn.z.nodeNameis populated by the EdgeNodeInfo handler (zedkube.go:1084), which can race the first AA Create event during boot. Not fatal — thekubeCfgTimersafety net will retry — but the logged error reads alarming. Worth downgrading the empty-nodeName case to a Debug or Info log. -
sriov-device-pluginDaemonSet is unconditionally scheduled on every amd64 node (pkg/kube/sriov/sriov-device-plugin.yaml). On clusters where some nodes have SR-IOV NICs and others do not, the DaemonSet will CrashLoopBackOff on the latter (the comment in the manifest acknowledges this for the bootstrap window only). For long-running mixed clusters, consider a node selector tied to a label thatcluster-init.shonly sets on SR-IOV-capable nodes — same pattern the manifests already use forkubernetes.io/arch: amd64.
Smaller nits
pkg/pillar/hypervisor/kubevirt.go:2349—ifName := fmt.Sprintf("sriov%d", i+1)indexes off the loop counter only. Today no other code path adds VMI interfaces namedsriov*, so this is fine, but a slightly more defensive form would usepa.ioBundle.Phylabel(already unique per VF). Cosmetic.pkg/pillar/cmd/zedkube/sriov_devplugin.go:684–688— the resource-key construction (p.ResourcePrefix + "/" + p.ResourceName) duplicates whatsriovResourceNamedoes inkubevirt.go. Pull this into one helper to keep the two file in lockstep — they have to agree exactly or NADs and pools won't match up.generateVFMacdoc says "collision probability with 24 random bits and 1000 VFs is well under 0.01 percent". With 24 bits and 1000 VFs the birthday-collision probability is ≈ 3% (matches the comment inzedrouter/ipam.go:56). Tighten or drop the number.pkg/pillar/sriov/sriov.go— bothBindVFToVfioPCIandGetPFIfaceFromVFBDFare good candidates for a small unit test using a tmpfs sysfs mock; same forreadVfVendorDeviceandgenerateVFMacinkubevirt.go. The PR adds 1036 lines of new logic with no test coverage on the pure-helper layer.
Things I liked
- Splitting
pciAssignmentsintohostDevAssignments/vfAssignmentskeeps non-SR-IOV passthrough on its existing path entirely — minimal blast radius. - The
IoNetEthPFskip inCreateReplicaVMIConfigis a nice guardrail against accidentally trying to attach a Physical Function to a VM. - Reusing the zedrouter OUI (
02:16:3e) and SHA-256 scheme keeps EVE-generated MACs visually consistent across the codebase. cluster-init.sh's shift to runninginstall_sriov_manifestsevery loop iteration (idempotent viacp -u) is the right call for upgrade flows — the previous one-shot pattern would have left old manifests on disk after an EVE bump.- The
physfn/netsysfs lookup for PF resolution sidesteps having to keep a separate cache; clean. - The
Drivers: ["vfio-pci"]selector in the generated ConfigMap pairs cleanly with the newBindVFToVfioPCIstep insetupVf— both sides of that contract are in this PR, so there's no out-of-tree config the user has to know about.
Build / packaging
- Build tag
//go:build kis consistent with the rest ofpkg/pillar/cmd/zedkube/. ✓ - Manifests added under
pkg/kube/sriov/and copied into/etc/k3s-manifests/via the Dockerfile match the existing nvidia-device-plugin pattern. ✓ - No new go module dependencies;
k8snetworkplumbingwg/network-attachment-definition-clientis already vendored.
Overall: the design and the bulk of the implementation look right; the items above are mostly hardening (concurrency, format-validation, observability) rather than structural. Happy to re-review once the MAC format-handling and reconcile-mutex are addressed.
| if cfgMac != "" { | ||
| pfMac, _ := derivePFMacFromVFSysfs(pa.ioBundle.PciLong) | ||
| if pfMac == "" || !strings.EqualFold(cfgMac, pfMac) { | ||
| return strings.ToLower(cfgMac), "config" |
There was a problem hiding this comment.
MAC format normalisation — strings.ToLower(cfgMac) only canonicalises case. KubeVirt validates the MAC with a strict regex (xx:xx:xx:xx:xx:xx) and will reject formats like aabb.ccdd.eeff, AA-BB-CC-DD-EE-FF, or values with embedded whitespace. A user typo here turns into a VMI admission failure deep in the start path, with a confusing error message.
Suggest:
if hw, err := net.ParseMAC(cfgMac); err == nil {
return hw.String(), "config"
}
// fall through to generatedThat also rejects malformed input cleanly instead of forwarding it to KubeVirt.
| func resolveVFMac(pa pciDevice, appUUID uuid.UUID) (string, string) { | ||
| cfgMac := strings.TrimSpace(pa.ioBundle.MacAddr) | ||
| if cfgMac != "" { | ||
| pfMac, _ := derivePFMacFromVFSysfs(pa.ioBundle.PciLong) |
There was a problem hiding this comment.
Sysfs error swallowed silently — the _ discards the error from derivePFMacFromVFSysfs. When sysfs is transiently unreadable we treat that as "PF MAC unknown" and accept cfgMac as-is, which is the exact pathological value (cfgMac == PF MAC) the check exists to guard against. That defeats the very protection this priority order was designed to add.
At minimum log the error at Debug; better, treat sysfs failure as "can't trust cfgMac → fall through to generated". Trusting input we couldn't validate is the worse default of the two.
| // | ||
| // Idempotent and cheap when nothing changes — it short-circuits if the JSON is | ||
| // byte-identical to what's already in the ConfigMap. | ||
| func (z *zedkube) reconcileSRIOVDevicePlugin(aa *types.AssignableAdapters) { |
There was a problem hiding this comment.
Concurrent reconcile race — this function is invoked from two paths: the AssignableAdapters subscription handlers (Create/Modify/Delete) and the kubeCfgTimer safety net every 60 s in zedkube.go:732. There is no mutex around the Get → compare → Update → DeletePod sequence.
In practice the K8s API will reject one of two simultaneous Updates with a 409 (logged as a generic error), and the pod-delete on the loser will still happen against the new pod created by the DaemonSet after the winner's bounce — a wasted restart, not a corruption.
Nonetheless: add a sync.Mutex (or singleflight.Group) on *zedkube for this reconciler. Cheap and removes the surprising log noise.
| cmIface := clientset.CoreV1().ConfigMaps(sriovDPNamespace) | ||
|
|
||
| cm, err := cmIface.Get(ctx, sriovDPConfigMapName, metav1.GetOptions{}) | ||
| if errors.IsNotFound(err) { |
There was a problem hiding this comment.
garbageCollectSRIOVNADs skipped on the initial-create branch — GC only runs after a successful Update (see line ~161). On the IsNotFound path (fresh cluster bootstrap, or after a manual kubectl delete cm sriovdp-config) we create the ConfigMap and return — any leftover NAD that survived the deletion sits in eve-kube-app until the next drift event.
Not a hot bug, but a one-liner: call garbageCollectSRIOVNADs(ctx, currentResources) after the create as well.
| // it here on every event is cheap. | ||
| ib = aa.LookupIoBundlePhylabel(phyAdapter.Phylabel) | ||
| if ib != nil && ib.Type == types.IoNetEthPF && ib.Vfs.Count > 0 { | ||
| if err := setupVf(ib, aa, log); err != nil { |
There was a problem hiding this comment.
Steady-state setupVf re-runs unconditionally — the new block sits after the if/else that distinguishes "add/modify" from "no change", so it runs on every adapter on every PhysicalIOAdapterList event regardless of whether anything actually changed.
setupVf → createVfIoBundle calls sriov.SetupVfHardwareAddr / SetupVfVlan whenever user config has them set, so we re-issue netlink ops to the kernel on every adapter event for every PF. Idempotent on the kernel side, but it's churn — and it makes the logs harder to read because the same VF re-creation messages appear on every controller heartbeat.
The stated motivations (Vfs.Count change, sriov_numvfs reset on reboot) are both already covered: a Count change is an adapter change and lands in the add/modify branch above; a reboot triggers !aa.Initialized which already runs setupVf (line ~3370). Consider gating this new call on currentIbPtr == nil or an explicit Count-changed check, or dropping it entirely.
| if err := os.WriteFile(overridePath, []byte("vfio-pci"), 0); err != nil { | ||
| return fmt.Errorf("write driver_override for %s: %w", vfBDF, err) | ||
| } | ||
| if err := os.WriteFile("/sys/bus/pci/drivers_probe", []byte(vfBDF), 0); err != nil { |
There was a problem hiding this comment.
Silent no-op when the VF is already bound to a different driver — writing driver_override does not unbind from the current driver, and drivers_probe only probes unbound devices. So if a VF arrives bound to e.g. igbvf (because some other path autoprobed it, or because of a leftover state across an EVE upgrade), this function returns success and leaves the VF on igbvf. The sriov-network-device-plugin's drivers: vfio-pci selector then quietly matches nothing and the resource never appears on the node — with no log line pointing at the cause.
Two options:
- Explicitly unbind first: read
<vfBDF>/driversymlink; if it exists and isn'tvfio-pci, writevfBDFto<driver>/unbindbefore the override. - At minimum verify post-write: after
drivers_probe, re-read<vfBDF>/driverand return an error (or warn loudly) if it is notvfio-pci.
Today's flow with sriov_drivers_autoprobe=0 makes this unlikely to bite, but the failure mode is invisible enough that a defensive check is worth it.
| // the local pod — neighbours' inventories are independent. | ||
| func deleteLocalSriovDpPod(ctx context.Context, clientset *kubernetes.Clientset, nodeName string) error { | ||
| if nodeName == "" { | ||
| return fmt.Errorf("nodeName is empty") |
There was a problem hiding this comment.
Boot-time race against z.nodeName — z.nodeName is set by the EdgeNodeInfo handler (zedkube.go:1084), which can fire after the first AssignableAdapters Create event during boot. On that first AA event this function returns the error "nodeName is empty", which the caller logs at Warn — alarming for what is just an early-boot ordering issue.
The kubeCfgTimer 60 s safety net will recover, so it's not a real bug, but consider downgrading the empty-nodeName case to Debug/Info (or returning a sentinel that the caller can recognise and log differently). Otherwise every clean boot produces a scary-looking warning.
| spec: | ||
| hostNetwork: true | ||
| nodeSelector: | ||
| kubernetes.io/arch: amd64 |
There was a problem hiding this comment.
DaemonSet runs on every amd64 node, including non-SR-IOV ones — the only nodeSelector is kubernetes.io/arch: amd64. On a mixed cluster where some nodes have SR-IOV-capable NICs and others don't, the device-plugin pod will sit in CrashLoopBackOff on the latter (as the manifest's own comment acknowledges for the bootstrap window).
For long-running mixed clusters this is more than a transient: it's permanent log/event noise and a small but non-zero resource cost. Consider a label like eve.zededa.com/sriov-capable=true that cluster-init.sh sets only when /sys/bus/pci/devices/*/sriov_numvfs exists (the same check install_sriov_manifests already performs), and use it as an additional nodeSelector here. The sriov-cni DaemonSet has the same property and would benefit from the same gate.
|
@naiming-zededa mark this as 'approved' when you are satisfied (then I or @rene can approve if need be and merge) |
@eriknordmark I think @zedi-pramodh will modify more on this. |
Yes I will submit amended PR on Friday once I am back. I need to retest everything. |
17911e4 to
1160c82
Compare
|
@naiming-zededa addressed your comments and retested. |
1160c82 to
ea86ff1
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #5908 +/- ##
==========================================
+ Coverage 20.68% 21.00% +0.31%
==========================================
Files 491 502 +11
Lines 90599 92782 +2183
==========================================
+ Hits 18744 19492 +748
- Misses 70274 71528 +1254
- Partials 1581 1762 +181 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
I see Yetus errors, I will try to address those though few of them are coming from upstream yaml files, I will try to ignore those, same with spdx errors. |
| # DaemonSet at runtime just to stage the binary on each node. Pinned by | ||
| # digest for reproducibility. v2.7.0 ships amd64 only, matching the | ||
| # previous DaemonSet's nodeSelector. | ||
| FROM --platform=linux/amd64 ghcr.io/k8snetworkplumbingwg/sriov-cni@sha256:fce504e683275ab59215065ddf6433ca80d1b4b53f0d4b0cdf037da6701c7cb4 AS sriov-cni-bin |
There was a problem hiding this comment.
@zedi-pramodh , I see that other versions do have arm64 images, is v2.7.0 really mandatory? For instance, v2.8.1 has all images: https://github.com/k8snetworkplumbingwg/sriov-cni/pkgs/container/sriov-cni/260647076?tag=v2.8.1
There was a problem hiding this comment.
@rene we just need sriov binary, but we also need to pick a version so I picked v2.7.0 which was latest I think when I started this PR. Coming to arm64 we do not support VMs in kubevirt arm64. when we get there we can pull the arm64 too.
@zedi-pramodh , pls, add the upstream files with errors to the corresponding ignore files for Yetus and spdx checker: https://github.com/lf-edge/eve/blob/master/.yamllint and https://github.com/lf-edge/eve/blob/master/.spdxignore |
|
Regarding PR description template, pls, do not remove the check boxes you didn't select, they should stay unselected so we can see what was checked or not.... |
Got it. |
Will do. Thanks. |
|
I probably need one more amend to commit since I am seeing issues in testing on a Dell device, HP device is OK. I am debugging that issue. |
ea86ff1 to
d6997d1
Compare
|
Amended commit, rebased to master. @naiming-zededa can you review again. The new commit comments explains everything, |
|
This needs to be back ported to 17.0.0 branch. |
eriknordmark
left a comment
There was a problem hiding this comment.
Looks good to Naiming
| # below (eve-alpine-deploy, etcdctl, eve-bridge, etc.) execute inside `build` | ||
| # and not inside this passthrough stage — otherwise the COPY --from=sriov-cni-bin | ||
| # below ends up referencing the stage it's running in (circular dependency). | ||
| FROM --platform=linux/amd64 ghcr.io/k8snetworkplumbingwg/sriov-cni@sha256:fce504e683275ab59215065ddf6433ca80d1b4b53f0d4b0cdf037da6701c7cb4 AS sriov-cni-bin |
There was a problem hiding this comment.
Please check what DL3029 from hadolint is all about.
There was a problem hiding this comment.
Fixed and amended the commit. This should go away now.
This commit reworks how EVE-Kubernetes assigns SR-IOV Virtual Functions to
VM application instances. Previously EVE used KubeVirt PCI HostDevices to
pass VFs to VMs, which has two problems:
1. KubeVirt's HostDevices path treats all VFs sharing a vendor:device
tuple as a fungible pool. When two VMs each request "1x VF from this
NIC", KubeVirt cannot pin a specific VF (by BDF) to a specific VM, so
the per-VF admin MAC programmed on the PF cannot be tied to the VM
that ends up using that VF.
2. When the device model lacks per-VF MAC user config (pf.Vfs.Data[i].Mac
empty), the VF IoBundle MacAddr field inherits the parent PF's
hardware MAC. Combined with (1), every VM sharing the pool then
observes the same MAC address inside the guest — exactly the bug
reported.
The new path attaches each VF to the VMI as a KubeVirt SR-IOV interface
backed by a Multus NetworkAttachmentDefinition and the upstream
sriov-network-device-plugin. Each VF is its own resource pool, pinned to
its exact PCI BDF via the device plugin's pciAddresses selector — so a VMI
asking for eve.network/<pf>_vf<i> is guaranteed to receive that one BDF.
Each VMI also gets a stable, per-(app, VF) MAC that the in-guest driver
observes verbatim.
Code changes
------------
pillar/hypervisor/kubevirt.go
* CreateReplicaVMIConfig now buckets PCI assignments: SR-IOV VFs
(IoNetEthVF) go through attachSRIOVInterfaces (Multus + sriov-cni);
other PCI devices (GPUs, NVMe, USB controllers, plain NICs) keep the
existing registerWithKV / HostDevices path. IoNetEthPF entries are
skipped (PFs must remain host-bound; only their VFs are passed
through).
* registerWithKV is reduced to its non-VF responsibility: register
vendor:device in KubeVirt's PermittedHostDevices CR and append the
resource to vmi.Spec.Domain.Devices.HostDevices. The SR-IOV-only
scaffolding that used to live here — per-node SR-IOV capability
labeling, the post-update wait for kubelet to advertise the new
extended resource, and the helpers addSRIOVNodeSelectors,
labelNodeForSRIOV, waitForKVDeviceResource, sriovNodeLabelKey — is
removed; those are unreachable on the VF path and unnecessary on the
non-VF path. The kubeConfig and nodeName parameters are dropped from
registerWithKV accordingly.
* ensureSRIOVNAD creates one NetworkAttachmentDefinition per (PF, VF
index, VLAN) in the eve-kube-app namespace. The NAD's annotation
k8s.v1.cni.cncf.io/resourceName tells Multus to ask the device plugin
for that specific VF's pool; spec.config tells sriov-cni to bind it
and apply the VLAN. Idempotent — refreshed only on content drift
(e.g. VLAN change).
* attachSRIOVInterfaces wires vmi.Spec.Domain.Devices.Interfaces[] with
SRIOV binding and a Multus network reference, and sets MacAddress per
VF. VF index is resolved from the IoBundle Phylabel (e.g. "eth2vf5"
-> 5) via sriov.ParseVfIfaceName, not from VfParams.Index — that
field can be 0 for every statically-declared VF until
checkAndFillIoBundle has run, which previously produced per-VM NADs
all pointing at "<pf>-vf0" and resource requests all targeting
"<pf>_vf0".
MAC selection priority (resolveVFMac):
1. IoBundle.MacAddr, when present and not a stale copy of the
parent PF's hardware MAC (detected by reading the PF MAC out of
/sys/bus/pci/devices/<vf>/physfn/net/<ifname>/address).
2. A deterministic locally-administered MAC generated via
SHA-256(appUUID || vfBDF) with OUI 02:16:3E — same scheme
zedrouter uses for switch-network VIFs (see ipam.go:
generateAppMac). Stable across reboots, unique per (app, VF).
Backstop: attachSRIOVInterfaces also calls BindVFToVfioPCI on each VF
BDF before handing the VMI off to KubeVirt. BindVFToVfioPCI is
idempotent, so this is free on the happy path and self-healing on
the bad one (e.g. iavf re-grabbed the VF after an autoprobe race).
PFIface lookup: domainmgr populates VfParams.PFIface at parse time,
but on systems where the VF didn't yet exist in sysfs at that point
(sriov_numvfs not written), we recover here via
sriov.GetPFIfaceFromVFBDF before failing.
* vmiMetaData gains a sriovVFs []sriovVFRef field that records (PF
ifname, VF index) for every VF wired into this VMI. On Stop/Delete,
clearSRIOVAdminMACs walks this list and zeroes the per-VF admin MAC
on the PF via sriov.ClearVFAdminMAC. sriov-cni's DEL path doesn't
reliably clear the admin MAC when the VF is pre-bound to vfio-pci
(no kernel netdev to operate on), so EVE owns the cleanup.
pillar/sriov/sriov.go
* CreateVF now takes the PF's PCI BDF instead of the kernel netdev
name. /sys/class/net/<dev>/device/sriov_* is sensitive to a netdev
rename (NIM bridges PFs and renames eth2 -> keth2); if that rename
overlaps CreateVF's polling, a name-anchored caller fails with "Link
not found" and the static VF IoBundle entries never get populated.
/sys/bus/pci/devices/<bdf>/sriov_* is BDF-anchored and stable
across renames.
* ensurePFLinkUp -> ensurePFAdminUp. Bumping sriov_numvfs causes some
drivers (ixgbe, i40e) to reset the PF and clear IFF_UP; the old code
additionally waited for OperState=OperUp, which never happens on
cable-less SR-IOV PFs used purely as VF sources for app passthrough.
The kernel allows VF allocation and per-VF config on an IFF_UP /
no-carrier PF, so checking IFF_UP only is sufficient. Now also
runs on the numvfs-already-at-target early-return path so a previous
CreateVF that aborted mid-sequence still recovers IFF_UP on the next
call.
* BindVFToVfioPCI attaches a VF to the vfio-pci driver via
driver_override + drivers_probe (with unbind from the current
driver if any). Required because EVE creates VFs with
sriov_drivers_autoprobe=0 only on first-boot CreateVF; on a reboot
where numvfs is already at target, CreateVF early-returns before
writing the autoprobe flag and the resident VF driver (ixgbevf /
iavf / igbvf) grabs the VFs. Post-bind verification (readlink the
driver symlink) catches the case where vfio-pci isn't registered in
the kernel.
* GetPFIfaceFromVFBDF resolves a VF's parent PF ifname via
/sys/bus/pci/devices/<vf-bdf>/physfn/net/<ifname>. Used by
domainmgr to populate VfParams.PFIface for VFs that arrive without
it (statically declared phyio entries that bypass createVfIoBundle).
* ParseVfIfaceName rewritten. The previous version used
fmt.Sscanf("%svf%d", ...), which is broken — Go's %s verb is greedy
and consumes the entire input, leaving nothing for the literal "vf"
and trailing %d to match. The replacement scans backwards for the
last "vf<digits>" suffix.
* ClearVFAdminMAC zeroes the per-VF admin MAC programmed on the PF's
VF table via netlink.LinkSetVfHardwareAddr(pf, idx, 00:..:00).
Called by kubevirt on VM stop/delete.
pillar/cmd/domainmgr/domainmgr.go
* checkAndFillIoBundle now populates VfParams.PFIface AND
VfParams.Index for IoNetEthVF bundles. IoBundleFromPhyAdapter
leaves both at zero for statically-declared VFs, which downstream
(sriov_devplugin's buildSRIOVPools, attachSRIOVInterfaces) cannot
distinguish from "uninitialised" vs "actually VF 0". Index is
re-derived from Phylabel (always idempotent). PFIface is sourced
from sysfs (handles NIM rename), best-effort: the VF may not yet
exist in sysfs at parse time, in which case attachSRIOVInterfaces
falls back to the same sysfs helper at VMI-attach time.
* setupVf is restructured around the stable PCI BDF. CreateVF is
driven by ib.PciLong (not the ifname, which may rename mid-call),
and the ifname is re-resolved AFTER CreateVF for the GetVfByTimeout
+ per-VF MAC/VLAN steps.
Two-phase per-VF setup: bind to vfio-pci FIRST, register with AA
SECOND. Previously a single createVfIoBundle failure (e.g.
SetupVfHardwareAddr rejected because the PF was momentarily admin-
down) returned from setupVf, leaving every subsequent VF in this PF
driverless — observed in the field as "one PF works, the other has
20 driverless VFs and zero allocatable resources". Now: bind every
VF, log per-VF createVfIoBundle failures, keep going. setupVf
returns the first error so the caller still sees something went
wrong, but AA tracks every VF that was bound (createVfIoBundle now
returns the partial bundle on error rather than zero-value).
* handlePhysicalIOAdapterListImpl now calls setupVf in the steady-
state branch as well as the !Initialized first-pass branch. Without
this, a controller publishing the device model in two phases —
initial Vfs.Count=0, later bumped to N — leaves sriov_numvfs at 0
forever because setupVf was previously only invoked during AA
initialization. CreateVF is idempotent, so reconciling on every
event is cheap and self-heals after kernel reboots reset
sriov_numvfs to zero.
* createVfIoBundle bug fixes:
- SetupVfHardwareAddr and SetupVfVlan are now called on the PF
link (pfIb.Ifname, e.g. "eth0"), not the VF link
(vfIb.Ifname = "eth0vf0"). netlink.LinkSetVfHardwareAddr /
LinkSetVfVlan operate on a PF with a VF index; calling them on
the VF link silently no-ops or errors depending on driver.
- SetupVfVlan is now called with vfUserConfig.VlanID; vf.VlanID
was always 0 for freshly created VFs (GetVf does not read the
current VLAN from sysfs), so user-configured VLANs were never
applied.
- The configured VlanID is persisted in vfIb.VfParams.VlanID so
the hypervisor knows it for NAD construction.
pillar/cmd/zedkube/sriov_devplugin.go (new)
* reconcileSRIOVDevicePlugin projects the AssignableAdapters
publication into the sriov-network-device-plugin ConfigMap
(kube-system/sriovdp-config). One pool per VF (not per PF), with a
pciAddresses selector pinning the single BDF — so kubelet
allocation is a 1:1 BDF->VMI mapping, never a pool lottery. This
is the lever that gives EVE deterministic per-VF assignment.
* Resource name is "<pf>_vf<index>", matching what
kubevirt.sriovResourceName produces. Per-PF / VF-index are
resolved from Phylabel via sriov.ParseVfIfaceName (not VfParams,
same justification as in attachSRIOVInterfaces). VF vendor:device
is read from sysfs of virtfn0 — PF and VF have different IDs
(e.g. I350 PF 8086:1521, VF 8086:1520), and the device plugin
enumerates VFs by VF ID.
* On configuration drift, updates the ConfigMap and deletes only the
local sriov-network-device-plugin pod (filtered by spec.nodeName);
the DaemonSet recreates it within seconds, and the new pod re-reads
the config. Per-node delete avoids disrupting neighbours in
multi-node clusters.
* healDriverlessVFs walks every IoNetEthVF and binds any whose host-
side driver isn't vfio-pci. Steady-state self-healer for the "20
driverless VFs on one PF" failure mode caused by historical bind
short-circuits; idempotent on the happy path. Runs at the start of
every reconcile so a freshly-bound VF lands in the pool's first
advertisement.
* garbageCollectSRIOVNADs sweeps NetworkAttachmentDefinitions in
eve-kube-app whose k8s.v1.cni.cncf.io/resourceName annotation
references an SR-IOV pool that is no longer in the current AA
projection. Scoped by the eve.network/ resource-prefix so only
NADs created by ensureSRIOVNAD are touched. Runs after every
successful ConfigMap reconcile, so removing a PF (or shrinking
its VF count) also cleans up the dangling NADs.
* Self-healing: if z.config is nil at reconcile time (kube clientset
not yet ready when the first AA event arrived), re-acquires
kubeapi.GetKubeConfig() inside the reconciler so subsequent calls
succeed without depending on a fresh AA event.
pillar/cmd/zedkube/zedkube.go
* Subscribes to AssignableAdapters from domainmgr; routes
Create/Modify/Delete to reconcileSRIOVDevicePlugin.
* Adds a safety-net call on the existing kubeCfgTimer (60s period)
that re-drives reconcile from the pubsub snapshot. Guarantees
eventual consistency even if the first AA Create event raced
against k3s startup and the AA never republishes after that.
pillar/types/assignableadapters.go
* IsNetEthVF() predicate on IoType — used wherever VF-specific
behaviour is required. VFs are intentionally excluded from IsNet()
because they must be vfio-pci-bound regardless of testing mode and
they are never EVE network ports.
pkg/kube/sriov/sriov-device-plugin.yaml (new)
* ServiceAccount + DaemonSet for the upstream
sriov-network-device-plugin (v3.7.0).
* The sriovdp-config ConfigMap is intentionally NOT shipped here:
k3s's deploy controller manages every resource defined under
/var/lib/rancher/k3s/server/manifests/ and reverts any drift, which
would undo zedkube's writes. Instead zedkube's reconciler creates
the ConfigMap on first run via the IsNotFound branch; the DaemonSet
pod CrashLoopBackOffs for a few seconds at first boot until the
ConfigMap exists, then starts normally.
pkg/kube/Dockerfile
* Adds a passthrough build stage `sriov-cni-bin` that pulls
ghcr.io/k8snetworkplumbingwg/sriov-cni (pinned by digest, amd64
only) and COPYs /usr/bin/sriov into the kube image. The binary is
staged from the host at /usr/bin/sriov by cluster-init.sh — no
sriov-cni DaemonSet at runtime, removing one moving part vs.
upstream deployments that ship a DaemonSet just to drop the binary
on each node.
* COPYs sriov-device-plugin.yaml into /etc/k3s-manifests/ so
cluster-init.sh can stage it.
pkg/kube/cluster-init.sh
* install_sriov_manifests copies the sriov binary to both
/var/lib/cni/bin (where Multus looks: binDir setting) and
/opt/cni/bin (the k3s/flannel default), and stages
sriov-device-plugin.yaml into KUBE_MANIFESTS_DIR. Gated on
/sys/bus/pci/devices/*/sriov_numvfs existing — /sys/class/net is
per-network-namespace and doesn't include host NICs from inside the
kube container's netns; /sys/bus/pci is namespace-independent.
* Hoisted out of the one-shot all_components_initialized guard and
called every iteration of the main loop. Makes EVE upgrades pick
up new or updated manifests without requiring a cluster re-init.
Idempotent via cp -u.
Backward compatibility
----------------------
* Non-SR-IOV PCI passthrough is unchanged — IoNetEth (whole NICs),
GPUs, NVMe, USB, etc. continue to flow through registerWithKV /
KubeVirt PermittedHostDevices.
Signed-off-by: Pramodh Pallapothu <pramodh@zededa.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
d6997d1 to
2c31334
Compare
|
Someone should approve again since I amended the commit for DL3029 fix. |
This commit reworks how EVE-Kubernetes assigns SR-IOV Virtual Functions to VM application instances. Previously EVE used KubeVirt PCI HostDevices to pass VFs to VMs, which has two problems:
KubeVirt's HostDevices path treats all VFs sharing a vendor:device tuple as a fungible pool. When two VMs each request "1x VF from this NIC", KubeVirt cannot pin a specific VF (by BDF) to a specific VM, so the per-VF admin MAC programmed on the PF cannot be tied to the VM that ends up using that VF.
When the device model lacks per-VF MAC user config (pf.Vfs.Data[i].Mac empty), the VF IoBundle MacAddr field inherits the parent PF's hardware MAC. Combined with (1), every VM sharing the pool then observes the same MAC address inside the guest.
The new path attaches each VF to the VMI as a KubeVirt SR-IOV interface backed by a Multus NetworkAttachmentDefinition and the upstream sriov-network-device-plugin. The device plugin tracks each VF by PCI BDF in a per-PF resource pool, so two VMs requesting the same pool get distinct VFs. Each VMI also receives a stable, per-(app, VF) MAC.
Code changes
pillar/hypervisor/kubevirt.go
CreateReplicaVMIConfig now buckets PCI assignments: SR-IOV VFs (IoNetEthVF) go through attachSRIOVInterfaces (Multus + sriov-cni); other PCI devices (GPUs, NVMe, USB controllers, plain NICs) keep the existing registerWithKV / HostDevices path.
registerWithKV is reduced to its non-VF responsibility: register vendor:device in KubeVirt's PermittedHostDevices CR and append the resource to vmi.Spec.Domain.Devices.HostDevices. The SR-IOV-only scaffolding that used to live here — per-node SR-IOV capability labeling, the post-update wait for kubelet to advertise the new extended resource, and the helpers addSRIOVNodeSelectors, labelNodeForSRIOV, waitForKVDeviceResource, sriovNodeLabelKey — is removed; those are unreachable on the VF path and unnecessary on the non-VF path (waitForVMI in Start() already polls up to 15 min for the resource to become schedulable). The kubeConfig and nodeName parameters are dropped from registerWithKV accordingly.
ensureSRIOVNAD creates one NetworkAttachmentDefinition per (PF, VLAN) in the eve-kube-app namespace. The NAD's annotation k8s.v1.cni.cncf.io/resourceName tells Multus to ask the device plugin for a VF from the named pool; spec.config tells sriov-cni to bind it. Idempotent — refreshed only on content drift (e.g. VLAN change).
attachSRIOVInterfaces wires vmi.Spec.Domain.Devices.Interfaces[] with SRIOV binding and a Multus network reference, and sets MacAddress per VF. MAC selection priority:
VfParams.PFIface is sourced from domainmgr's parse-time population
(see below). If a VF arrives without it set — e.g. because
sriov_numvfs hadn't been written when the phyAdapter list was first
processed on reboot — attachSRIOVInterfaces recovers via
sriov.GetPFIfaceFromVFBDF before failing.
pillar/sriov/sriov.go
GetPFIfaceFromVFBDF resolves a VF's parent PF ifname via /sys/bus/pci/devices//physfn/net/. Used by domainmgr to populate VfParams.PFIface for VFs that arrive without it (statically declared phyio entries that bypass createVfIoBundle).
BindVFToVfioPCI attaches a VF to the vfio-pci driver via driver_override + drivers_probe. Required because EVE creates VFs with sriov_drivers_autoprobe=0 (so the host driver doesn't claim them), but the upstream sriov-network-device-plugin's default selector requires VFs to already be bound to vfio-pci before it advertises them as a kubelet resource.
pillar/cmd/domainmgr/domainmgr.go
checkAndFillIoBundle now populates VfParams.PFIface for IoNetEthVF bundles whose PFIface is empty, using sriov.GetPFIfaceFromVFBDF. This is best-effort: the VF may not yet exist in sysfs at parse time (e.g. on reboot before sriov_numvfs has been written), in which case attachSRIOVInterfaces falls back to the same helper at VMI-attach time.
setupVf binds each newly created VF to vfio-pci via sriov.BindVFToVfioPCI immediately after createVfIoBundle records it in AssignableAdapters. Without this, the kube sriov-network-device-plugin's drivers=vfio-pci selector matches no VFs and the eve.network/_vfs resource is never advertised on the node, leaving VMIs unable to claim a VF. Non-fatal per VF: a failure is logged and the loop continues so one bad VF does not block its siblings.
handlePhysicalIOAdapterListImpl now calls setupVf in the steady- state branch as well as the !Initialized first-pass branch. Without this, a controller publishing the device model in two phases — initial Vfs.Count=0, later bumped to N — leaves sriov_numvfs at 0 forever because setupVf was previously only invoked during AA initialization. CreateVF is idempotent (returns early when numvfs already matches), so reconciling on every event is cheap and self-heals after kernel reboots reset sriov_numvfs to zero.
pillar/cmd/zedkube/sriov_devplugin.go (new)
reconcileSRIOVDevicePlugin projects the AssignableAdapters publication into the sriov-network-device-plugin ConfigMap (kube-system/sriovdp-config). For each IoNetEthPF bundle with Vfs.Count > 0, builds one pool entry keyed by PF Ifname; vendor and device IDs are read from the first VF in sysfs (PFs and VFs have different IDs — VFs are what the plugin needs to enumerate).
On configuration drift, updates the ConfigMap and deletes only the local sriov-network-device-plugin pod (the DaemonSet recreates it, new pod re-reads the config). Per-node delete avoids disrupting neighbours in multi-node clusters.
garbageCollectSRIOVNADs sweeps NetworkAttachmentDefinitions in eve-kube-app whose k8s.v1.cni.cncf.io/resourceName annotation references an SR-IOV pool that is no longer in the current AA projection. Scoped by the eve.network/ resource-prefix so only NADs created by ensureSRIOVNAD are touched. Runs after every successful ConfigMap reconcile, so removing a PF from the device model also cleans up its dangling NAD.
Self-healing: if z.config is nil at reconcile time (kube clientset not yet ready when the first AA event arrived), re-acquires kubeapi.GetKubeConfig() inside the reconciler so subsequent calls succeed without depending on a fresh AA event.
pillar/cmd/zedkube/zedkube.go
Subscribes to AssignableAdapters from domainmgr; routes Create/Modify/Delete to reconcileSRIOVDevicePlugin.
Adds a safety-net call on the existing kubeCfgTimer (60s period) that re-drives reconcile from the pubsub snapshot. Guarantees eventual consistency even if the first AA Create event raced against k3s startup and the AA never republishes after that.
pkg/kube/sriov/ (new)
sriov-cni-daemonset.yaml — upstream sriov-cni DaemonSet that installs /opt/cni/bin/sriov on every amd64 node.
sriov-device-plugin.yaml — ServiceAccount and DaemonSet for the upstream sriov-network-device-plugin. The sriovdp-config ConfigMap is intentionally NOT shipped here: k3s's deploy controller manages every resource defined under /var/lib/rancher/k3s/server/manifests/ and reverts any drift, which would undo zedkube's writes. Instead zedkube's reconciler creates the ConfigMap on first run via the IsNotFound branch.
pkg/kube/cluster-init.sh
install_sriov_manifests gates manifest installation on /sys/bus/pci/devices/*/sriov_numvfs existing. /sys/class/net is per-network-namespace and does not include host NICs from inside the kube container's netns; /sys/bus/pci is namespace-independent.
Hoisted out of the one-shot all_components_initialized guard and called every iteration of the main loop. Makes EVE upgrades pick up new or updated manifests without requiring a cluster re-init. Idempotent via cp -u.
pkg/kube/Dockerfile
Backward compatibility
How to test and validate this PR
Generate device model with something like this
{
"ztype": 11, ----> VF type
"phylabel": "eth1",
"phyaddrs": {
"Ifname": "eth1",
"PciLong": "0000:02:00.1"
},
"logicallabel": "eth1",
"assigngrp": "group18",
"usage": 1,
"cbattr": {},
"usagePolicy": {},
"cost": 0,
"parentassigngrp": "",
"vfs": {
"count": 4. ---> number of VFs
}
},
Define those VFs in device adapter menu in Controller as AppDirect
Deploy one or more VMs and pick those VFs. I tried 3 VMs.
Verify that in Kubernetes config map is generated and resourceLIst is auto populated.
[kube] root@59263949-dcff-489f-a832-4c60469b3955:/$ kubectl describe cm sriovdp-config -n kube-system
Name: sriovdp-config
Namespace: kube-system
Labels:
Annotations:
Data
config.json:
{
"resourceList": [
{
"resourceName": "eth1_vfs", ==> this is eth1 with 4VFs
"resourcePrefix": "eve.network",
"selectors": {
"vendors": [
"8086"
],
"devices": [
"1520"
],
"drivers": [
"vfio-pci"
],
"pfNames": [
"eth1"
]
}
}
]
}
Verify the eth1 interface shows all VFs and unique MAC for every VM.
[kube] root@59263949-dcff-489f-a832-4c60469b3955:/$ ip link show eth1
5: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
link/ether 94:40:c9:4b:37:41 brd ff:ff:ff:ff:ff:ff
vf 0 link/ether 02:16:3e:98:7d:04 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 1 link/ether 02:16:3e:78:5a:8e brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 2 link/ether 02:16:3e:7a:cc:14 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
vf 3 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
Login into VM and verify IP addresses got assigned on VF interface and also can ping other VMs or run iperf test.
Reboot the device and verify the MAC addresses did not change for VM so that IP addressed stay same.
I have not tested failovers yet.
Changelog notes
Users can now use SRIOV-VFs to passthrough to VMs in eve-k
And the last but not least:
check them.
Please, check the boxes above after submitting the PR in interactive mode.