Skip to content

pillar, kube: SR-IOV VF passthrough via Multus + sriov-cni for KubeVirt#5908

Open
zedi-pramodh wants to merge 1 commit into
lf-edge:masterfrom
zedi-pramodh:sriov-feature
Open

pillar, kube: SR-IOV VF passthrough via Multus + sriov-cni for KubeVirt#5908
zedi-pramodh wants to merge 1 commit into
lf-edge:masterfrom
zedi-pramodh:sriov-feature

Conversation

@zedi-pramodh
Copy link
Copy Markdown

This commit reworks how EVE-Kubernetes assigns SR-IOV Virtual Functions to VM application instances. Previously EVE used KubeVirt PCI HostDevices to pass VFs to VMs, which has two problems:

  1. KubeVirt's HostDevices path treats all VFs sharing a vendor:device tuple as a fungible pool. When two VMs each request "1x VF from this NIC", KubeVirt cannot pin a specific VF (by BDF) to a specific VM, so the per-VF admin MAC programmed on the PF cannot be tied to the VM that ends up using that VF.

  2. When the device model lacks per-VF MAC user config (pf.Vfs.Data[i].Mac empty), the VF IoBundle MacAddr field inherits the parent PF's hardware MAC. Combined with (1), every VM sharing the pool then observes the same MAC address inside the guest.

The new path attaches each VF to the VMI as a KubeVirt SR-IOV interface backed by a Multus NetworkAttachmentDefinition and the upstream sriov-network-device-plugin. The device plugin tracks each VF by PCI BDF in a per-PF resource pool, so two VMs requesting the same pool get distinct VFs. Each VMI also receives a stable, per-(app, VF) MAC.

Code changes

pillar/hypervisor/kubevirt.go

  • CreateReplicaVMIConfig now buckets PCI assignments: SR-IOV VFs (IoNetEthVF) go through attachSRIOVInterfaces (Multus + sriov-cni); other PCI devices (GPUs, NVMe, USB controllers, plain NICs) keep the existing registerWithKV / HostDevices path.

  • registerWithKV is reduced to its non-VF responsibility: register vendor:device in KubeVirt's PermittedHostDevices CR and append the resource to vmi.Spec.Domain.Devices.HostDevices. The SR-IOV-only scaffolding that used to live here — per-node SR-IOV capability labeling, the post-update wait for kubelet to advertise the new extended resource, and the helpers addSRIOVNodeSelectors, labelNodeForSRIOV, waitForKVDeviceResource, sriovNodeLabelKey — is removed; those are unreachable on the VF path and unnecessary on the non-VF path (waitForVMI in Start() already polls up to 15 min for the resource to become schedulable). The kubeConfig and nodeName parameters are dropped from registerWithKV accordingly.

  • ensureSRIOVNAD creates one NetworkAttachmentDefinition per (PF, VLAN) in the eve-kube-app namespace. The NAD's annotation k8s.v1.cni.cncf.io/resourceName tells Multus to ask the device plugin for a VF from the named pool; spec.config tells sriov-cni to bind it. Idempotent — refreshed only on content drift (e.g. VLAN change).

  • attachSRIOVInterfaces wires vmi.Spec.Domain.Devices.Interfaces[] with SRIOV binding and a Multus network reference, and sets MacAddress per VF. MAC selection priority:

    1. IoBundle.MacAddr, when present and not a stale copy of the parent PF's hardware MAC.
    2. A deterministic locally-administered MAC generated via SHA-256(appUUID || vfBDF) with OUI 02:16:3E — same scheme zedrouter uses for switch-network VIFs (see ipam.go: generateAppMac). Stable across reboots, unique per (app, VF).

    VfParams.PFIface is sourced from domainmgr's parse-time population
    (see below). If a VF arrives without it set — e.g. because
    sriov_numvfs hadn't been written when the phyAdapter list was first
    processed on reboot — attachSRIOVInterfaces recovers via
    sriov.GetPFIfaceFromVFBDF before failing.

pillar/sriov/sriov.go

  • GetPFIfaceFromVFBDF resolves a VF's parent PF ifname via /sys/bus/pci/devices//physfn/net/. Used by domainmgr to populate VfParams.PFIface for VFs that arrive without it (statically declared phyio entries that bypass createVfIoBundle).

  • BindVFToVfioPCI attaches a VF to the vfio-pci driver via driver_override + drivers_probe. Required because EVE creates VFs with sriov_drivers_autoprobe=0 (so the host driver doesn't claim them), but the upstream sriov-network-device-plugin's default selector requires VFs to already be bound to vfio-pci before it advertises them as a kubelet resource.

pillar/cmd/domainmgr/domainmgr.go

  • checkAndFillIoBundle now populates VfParams.PFIface for IoNetEthVF bundles whose PFIface is empty, using sriov.GetPFIfaceFromVFBDF. This is best-effort: the VF may not yet exist in sysfs at parse time (e.g. on reboot before sriov_numvfs has been written), in which case attachSRIOVInterfaces falls back to the same helper at VMI-attach time.

  • setupVf binds each newly created VF to vfio-pci via sriov.BindVFToVfioPCI immediately after createVfIoBundle records it in AssignableAdapters. Without this, the kube sriov-network-device-plugin's drivers=vfio-pci selector matches no VFs and the eve.network/_vfs resource is never advertised on the node, leaving VMIs unable to claim a VF. Non-fatal per VF: a failure is logged and the loop continues so one bad VF does not block its siblings.

  • handlePhysicalIOAdapterListImpl now calls setupVf in the steady- state branch as well as the !Initialized first-pass branch. Without this, a controller publishing the device model in two phases — initial Vfs.Count=0, later bumped to N — leaves sriov_numvfs at 0 forever because setupVf was previously only invoked during AA initialization. CreateVF is idempotent (returns early when numvfs already matches), so reconciling on every event is cheap and self-heals after kernel reboots reset sriov_numvfs to zero.

pillar/cmd/zedkube/sriov_devplugin.go (new)

  • reconcileSRIOVDevicePlugin projects the AssignableAdapters publication into the sriov-network-device-plugin ConfigMap (kube-system/sriovdp-config). For each IoNetEthPF bundle with Vfs.Count > 0, builds one pool entry keyed by PF Ifname; vendor and device IDs are read from the first VF in sysfs (PFs and VFs have different IDs — VFs are what the plugin needs to enumerate).

  • On configuration drift, updates the ConfigMap and deletes only the local sriov-network-device-plugin pod (the DaemonSet recreates it, new pod re-reads the config). Per-node delete avoids disrupting neighbours in multi-node clusters.

  • garbageCollectSRIOVNADs sweeps NetworkAttachmentDefinitions in eve-kube-app whose k8s.v1.cni.cncf.io/resourceName annotation references an SR-IOV pool that is no longer in the current AA projection. Scoped by the eve.network/ resource-prefix so only NADs created by ensureSRIOVNAD are touched. Runs after every successful ConfigMap reconcile, so removing a PF from the device model also cleans up its dangling NAD.

  • Self-healing: if z.config is nil at reconcile time (kube clientset not yet ready when the first AA event arrived), re-acquires kubeapi.GetKubeConfig() inside the reconciler so subsequent calls succeed without depending on a fresh AA event.

pillar/cmd/zedkube/zedkube.go

  • Subscribes to AssignableAdapters from domainmgr; routes Create/Modify/Delete to reconcileSRIOVDevicePlugin.

  • Adds a safety-net call on the existing kubeCfgTimer (60s period) that re-drives reconcile from the pubsub snapshot. Guarantees eventual consistency even if the first AA Create event raced against k3s startup and the AA never republishes after that.

pkg/kube/sriov/ (new)

  • sriov-cni-daemonset.yaml — upstream sriov-cni DaemonSet that installs /opt/cni/bin/sriov on every amd64 node.

  • sriov-device-plugin.yaml — ServiceAccount and DaemonSet for the upstream sriov-network-device-plugin. The sriovdp-config ConfigMap is intentionally NOT shipped here: k3s's deploy controller manages every resource defined under /var/lib/rancher/k3s/server/manifests/ and reverts any drift, which would undo zedkube's writes. Instead zedkube's reconciler creates the ConfigMap on first run via the IsNotFound branch.

pkg/kube/cluster-init.sh

  • install_sriov_manifests gates manifest installation on /sys/bus/pci/devices/*/sriov_numvfs existing. /sys/class/net is per-network-namespace and does not include host NICs from inside the kube container's netns; /sys/bus/pci is namespace-independent.

  • Hoisted out of the one-shot all_components_initialized guard and called every iteration of the main loop. Makes EVE upgrades pick up new or updated manifests without requiring a cluster re-init. Idempotent via cp -u.

pkg/kube/Dockerfile

  • COPYs sriov/sriov-cni-daemonset.yaml and sriov-device-plugin.yaml into /etc/k3s-manifests/ so cluster-init.sh can stage them.

Backward compatibility

  • Non-SR-IOV PCI passthrough is unchanged — IoNetEth (whole NICs), GPUs, NVMe, USB, etc. continue to flow through registerWithKV / KubeVirt PermittedHostDevices.

How to test and validate this PR

  1. Generate device model with something like this
    {
    "ztype": 11, ----> VF type
    "phylabel": "eth1",
    "phyaddrs": {
    "Ifname": "eth1",
    "PciLong": "0000:02:00.1"
    },
    "logicallabel": "eth1",
    "assigngrp": "group18",
    "usage": 1,
    "cbattr": {},
    "usagePolicy": {},
    "cost": 0,
    "parentassigngrp": "",
    "vfs": {
    "count": 4. ---> number of VFs
    }
    },

  2. Define those VFs in device adapter menu in Controller as AppDirect

  3. Deploy one or more VMs and pick those VFs. I tried 3 VMs.

  4. Verify that in Kubernetes config map is generated and resourceLIst is auto populated.
    [kube] root@59263949-dcff-489f-a832-4c60469b3955:/$ kubectl describe cm sriovdp-config -n kube-system
    Name: sriovdp-config
    Namespace: kube-system
    Labels:
    Annotations:

Data

config.json:

{
"resourceList": [
{
"resourceName": "eth1_vfs", ==> this is eth1 with 4VFs
"resourcePrefix": "eve.network",
"selectors": {
"vendors": [
"8086"
],
"devices": [
"1520"
],
"drivers": [
"vfio-pci"
],
"pfNames": [
"eth1"
]
}
}
]
}

  1. Verify the eth1 interface shows all VFs and unique MAC for every VM.
    [kube] root@59263949-dcff-489f-a832-4c60469b3955:/$ ip link show eth1
    5: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether 94:40:c9:4b:37:41 brd ff:ff:ff:ff:ff:ff
    vf 0 link/ether 02:16:3e:98:7d:04 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 1 link/ether 02:16:3e:78:5a:8e brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 2 link/ether 02:16:3e:7a:cc:14 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off
    vf 3 link/ether 00:00:00:00:00:00 brd ff:ff:ff:ff:ff:ff, spoof checking on, link-state auto, trust off

  2. Login into VM and verify IP addresses got assigned on VF interface and also can ping other VMs or run iperf test.

  3. Reboot the device and verify the MAC addresses did not change for VM so that IP addressed stay same.

I have not tested failovers yet.

Changelog notes

Users can now use SRIOV-VFs to passthrough to VMs in eve-k

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've written the test verification instructions

And the last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please, check the boxes above after submitting the PR in interactive mode.

Comment thread pkg/kube/sriov/sriov-cni-daemonset.yaml Outdated
Comment thread pkg/kube/cluster-init.sh Outdated
install_sriov_manifests() {
[ -d "${KUBE_MANIFESTS_DIR}" ] || return
ls /sys/bus/pci/devices/*/sriov_numvfs >/dev/null 2>&1 || return
cp -u /etc/k3s-manifests/sriov-cni-daemonset.yaml \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we download during the k3s/multus install time, then we can remove this sriov-cni-daemonset.yaml

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not like the k3s/multus way. Not every device model has SRIOV enabled. So I am not seeing reason to do as part of k3s install. Instead we will load through Kubernetes DS only if we see SRIOV enabled interfaces.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in /var/lib/cni/bin, they are a number of binaries just loaded by default, no use here. but just happens this sr-iov is not currently part of that.

Comment thread pkg/kube/cluster-init.sh
Comment thread pkg/pillar/hypervisor/kubevirt.go Outdated

// sriovNADName returns the NetworkAttachmentDefinition name for a given PF.
// One NAD per PF — VLAN, when configured, lives in the NAD's CNI config.
func sriovNADName(pfName string) string {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should pass in the vlanID:

    func sriovNADName(pfName string, vlanID uint32) string {                                                                                                
        if vlanID == 0 {                                                                                                                                    
            return "sriov-" + pfName                                                                                                                        
        }                                                                                                                                                   
        return fmt.Sprintf("sriov-%s-vlan%d", pfName, vlanID)                                                                                               
    }                                                                                                                                                       

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Comment thread pkg/pillar/hypervisor/kubevirt.go Outdated
Comment thread pkg/kube/sriov/sriov-device-plugin.yaml Outdated

nadName := sriovNADName(pfName)
resourceName := sriovResourceName(pfName)
cniConfig := fmt.Sprintf(
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this may need to exclude of vlanID 0:
to check if vlanID > 0, then include this vlanID in the string.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Comment thread pkg/pillar/cmd/domainmgr/domainmgr.go Outdated
Copy link
Copy Markdown

@claude-rsp claude-rsp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code review

Thanks for the very thorough commit message — it makes the rationale for each piece easy to follow. Overall the design is sound: routing SR-IOV VFs through Multus + sriov-cni instead of KubeVirt HostDevices is the right fix for both the "fungible pool" allocation issue and the "all VMs see PF MAC" symptom, and the dynamic ConfigMap reconciler is a clean way to bridge EVE's AssignableAdapters model to the upstream device-plugin's static config schema.

A few concerns worth addressing before merge:

Correctness / behaviour

  1. resolveVFMac does not canonicalise the user-configured MAC (pkg/pillar/hypervisor/kubevirt.go lines 2435–2445). Today it returns strings.ToLower(cfgMac) directly. KubeVirt validates the MAC with a strict regex (^([0-9A-Fa-f]{2}[:-]){5}([0-9A-Fa-f]{2})$-ish) and will reject formats like "aabb.ccdd.eeff" or values with leading/trailing whitespace already swallowed by TrimSpace. Suggest:

    if hw, err := net.ParseMAC(cfgMac); err == nil {
        return hw.String(), "config"
    }

    This also defends against a user typo silently turning into a VMI admission failure deep in the start path.

  2. derivePFMacFromVFSysfs errors are swallowed silently (line 2438: pfMac, _ := derivePFMacFromVFSysfs(...)). When sysfs is temporarily unreadable we treat that as "PF MAC unknown" and accept cfgMac as-is, which is the exact pathological value (cfgMac == PF MAC) the check exists to guard against. At minimum log the error at Debug; better, treat sysfs failure as "can't trust cfgMac, fall through to generated".

  3. Concurrent reconcile race in reconcileSRIOVDevicePlugin (pkg/pillar/cmd/zedkube/sriov_devplugin.go). The function is invoked from two paths — the AA subscription handler and the kubeCfgTimer safety net every 60 s. There's no mutex around the Get → compare → Update → DeletePod sequence. In practice the K8s API will reject one of two simultaneous Updates with a 409 (logged as a generic error), and the pod-delete on the loser will still happen against the new pod created by the DaemonSet after the winner's bounce. Add a sync.Mutex (or a singleflight.Group) on *zedkube for this reconciler.

  4. garbageCollectSRIOVNADs is not called on the IsNotFound (initial-create) branch (lines 635–652). In the current shape, GC only runs when an existing ConfigMap drifts — so any leftover NAD that survives a fresh-install delete-then-recreate of the ConfigMap will sit in eve-kube-app until the next drift. Not a hot bug today (the only way to hit it is a manual kubectl delete cm sriovdp-config) but worth a one-liner: call GC after the create as well.

  5. Steady-state setupVf reconcile re-programs MAC/VLAN on every PhysicalIOAdapterList event (pkg/pillar/cmd/domainmgr/domainmgr.go lines 3452–3466). The new block runs unconditionally — both in the "no change" else branch and the add/modify branch. setupVfcreateVfIoBundle calls sriov.SetupVfHardwareAddr / SetupVfVlan whenever user config sets them, so we re-issue netlink ops on every adapter event. Idempotent on the kernel side, but it's churn. Consider gating the new call on either currentIbPtr == nil (initial population missed) or an explicit "Vfs.Count changed" check; the no-change else branch is precisely the case where re-running setupVf is least useful.

  6. BindVFToVfioPCI assumes the VF arrives driverless. The comment says so, but the implementation will silently no-op if the VF is already bound to another driver (writing driver_override doesn't unbind, and drivers_probe only probes unbound devices). Today's EVE flow (autoprobe=0) means this is fine, but the failure mode is silent — the device-plugin selector quietly matches nothing and the resource never appears, with nothing in the logs to point at the cause. Either (a) explicitly unbind first if driver symlink exists, or (b) at minimum check driver post-write and warn if it's not vfio-pci.

  7. deleteLocalSriovDpPod returns an error when z.nodeName == "". Caller logs at Warn. z.nodeName is populated by the EdgeNodeInfo handler (zedkube.go:1084), which can race the first AA Create event during boot. Not fatal — the kubeCfgTimer safety net will retry — but the logged error reads alarming. Worth downgrading the empty-nodeName case to a Debug or Info log.

  8. sriov-device-plugin DaemonSet is unconditionally scheduled on every amd64 node (pkg/kube/sriov/sriov-device-plugin.yaml). On clusters where some nodes have SR-IOV NICs and others do not, the DaemonSet will CrashLoopBackOff on the latter (the comment in the manifest acknowledges this for the bootstrap window only). For long-running mixed clusters, consider a node selector tied to a label that cluster-init.sh only sets on SR-IOV-capable nodes — same pattern the manifests already use for kubernetes.io/arch: amd64.

Smaller nits

  • pkg/pillar/hypervisor/kubevirt.go:2349ifName := fmt.Sprintf("sriov%d", i+1) indexes off the loop counter only. Today no other code path adds VMI interfaces named sriov*, so this is fine, but a slightly more defensive form would use pa.ioBundle.Phylabel (already unique per VF). Cosmetic.
  • pkg/pillar/cmd/zedkube/sriov_devplugin.go:684–688 — the resource-key construction (p.ResourcePrefix + "/" + p.ResourceName) duplicates what sriovResourceName does in kubevirt.go. Pull this into one helper to keep the two file in lockstep — they have to agree exactly or NADs and pools won't match up.
  • generateVFMac doc says "collision probability with 24 random bits and 1000 VFs is well under 0.01 percent". With 24 bits and 1000 VFs the birthday-collision probability is ≈ 3% (matches the comment in zedrouter/ipam.go:56). Tighten or drop the number.
  • pkg/pillar/sriov/sriov.go — both BindVFToVfioPCI and GetPFIfaceFromVFBDF are good candidates for a small unit test using a tmpfs sysfs mock; same for readVfVendorDevice and generateVFMac in kubevirt.go. The PR adds 1036 lines of new logic with no test coverage on the pure-helper layer.

Things I liked

  • Splitting pciAssignments into hostDevAssignments / vfAssignments keeps non-SR-IOV passthrough on its existing path entirely — minimal blast radius.
  • The IoNetEthPF skip in CreateReplicaVMIConfig is a nice guardrail against accidentally trying to attach a Physical Function to a VM.
  • Reusing the zedrouter OUI (02:16:3e) and SHA-256 scheme keeps EVE-generated MACs visually consistent across the codebase.
  • cluster-init.sh's shift to running install_sriov_manifests every loop iteration (idempotent via cp -u) is the right call for upgrade flows — the previous one-shot pattern would have left old manifests on disk after an EVE bump.
  • The physfn/net sysfs lookup for PF resolution sidesteps having to keep a separate cache; clean.
  • The Drivers: ["vfio-pci"] selector in the generated ConfigMap pairs cleanly with the new BindVFToVfioPCI step in setupVf — both sides of that contract are in this PR, so there's no out-of-tree config the user has to know about.

Build / packaging

  • Build tag //go:build k is consistent with the rest of pkg/pillar/cmd/zedkube/. ✓
  • Manifests added under pkg/kube/sriov/ and copied into /etc/k3s-manifests/ via the Dockerfile match the existing nvidia-device-plugin pattern. ✓
  • No new go module dependencies; k8snetworkplumbingwg/network-attachment-definition-client is already vendored.

Overall: the design and the bulk of the implementation look right; the items above are mostly hardening (concurrency, format-validation, observability) rather than structural. Happy to re-review once the MAC format-handling and reconcile-mutex are addressed.

if cfgMac != "" {
pfMac, _ := derivePFMacFromVFSysfs(pa.ioBundle.PciLong)
if pfMac == "" || !strings.EqualFold(cfgMac, pfMac) {
return strings.ToLower(cfgMac), "config"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MAC format normalisationstrings.ToLower(cfgMac) only canonicalises case. KubeVirt validates the MAC with a strict regex (xx:xx:xx:xx:xx:xx) and will reject formats like aabb.ccdd.eeff, AA-BB-CC-DD-EE-FF, or values with embedded whitespace. A user typo here turns into a VMI admission failure deep in the start path, with a confusing error message.

Suggest:

if hw, err := net.ParseMAC(cfgMac); err == nil {
    return hw.String(), "config"
}
// fall through to generated

That also rejects malformed input cleanly instead of forwarding it to KubeVirt.

Comment thread pkg/pillar/hypervisor/kubevirt.go Outdated
func resolveVFMac(pa pciDevice, appUUID uuid.UUID) (string, string) {
cfgMac := strings.TrimSpace(pa.ioBundle.MacAddr)
if cfgMac != "" {
pfMac, _ := derivePFMacFromVFSysfs(pa.ioBundle.PciLong)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sysfs error swallowed silently — the _ discards the error from derivePFMacFromVFSysfs. When sysfs is transiently unreadable we treat that as "PF MAC unknown" and accept cfgMac as-is, which is the exact pathological value (cfgMac == PF MAC) the check exists to guard against. That defeats the very protection this priority order was designed to add.

At minimum log the error at Debug; better, treat sysfs failure as "can't trust cfgMac → fall through to generated". Trusting input we couldn't validate is the worse default of the two.

//
// Idempotent and cheap when nothing changes — it short-circuits if the JSON is
// byte-identical to what's already in the ConfigMap.
func (z *zedkube) reconcileSRIOVDevicePlugin(aa *types.AssignableAdapters) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Concurrent reconcile race — this function is invoked from two paths: the AssignableAdapters subscription handlers (Create/Modify/Delete) and the kubeCfgTimer safety net every 60 s in zedkube.go:732. There is no mutex around the Get → compare → Update → DeletePod sequence.

In practice the K8s API will reject one of two simultaneous Updates with a 409 (logged as a generic error), and the pod-delete on the loser will still happen against the new pod created by the DaemonSet after the winner's bounce — a wasted restart, not a corruption.

Nonetheless: add a sync.Mutex (or singleflight.Group) on *zedkube for this reconciler. Cheap and removes the surprising log noise.

cmIface := clientset.CoreV1().ConfigMaps(sriovDPNamespace)

cm, err := cmIface.Get(ctx, sriovDPConfigMapName, metav1.GetOptions{})
if errors.IsNotFound(err) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

garbageCollectSRIOVNADs skipped on the initial-create branch — GC only runs after a successful Update (see line ~161). On the IsNotFound path (fresh cluster bootstrap, or after a manual kubectl delete cm sriovdp-config) we create the ConfigMap and return — any leftover NAD that survived the deletion sits in eve-kube-app until the next drift event.

Not a hot bug, but a one-liner: call garbageCollectSRIOVNADs(ctx, currentResources) after the create as well.

Comment thread pkg/pillar/cmd/domainmgr/domainmgr.go Outdated
// it here on every event is cheap.
ib = aa.LookupIoBundlePhylabel(phyAdapter.Phylabel)
if ib != nil && ib.Type == types.IoNetEthPF && ib.Vfs.Count > 0 {
if err := setupVf(ib, aa, log); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Steady-state setupVf re-runs unconditionally — the new block sits after the if/else that distinguishes "add/modify" from "no change", so it runs on every adapter on every PhysicalIOAdapterList event regardless of whether anything actually changed.

setupVfcreateVfIoBundle calls sriov.SetupVfHardwareAddr / SetupVfVlan whenever user config has them set, so we re-issue netlink ops to the kernel on every adapter event for every PF. Idempotent on the kernel side, but it's churn — and it makes the logs harder to read because the same VF re-creation messages appear on every controller heartbeat.

The stated motivations (Vfs.Count change, sriov_numvfs reset on reboot) are both already covered: a Count change is an adapter change and lands in the add/modify branch above; a reboot triggers !aa.Initialized which already runs setupVf (line ~3370). Consider gating this new call on currentIbPtr == nil or an explicit Count-changed check, or dropping it entirely.

Comment thread pkg/pillar/sriov/sriov.go
if err := os.WriteFile(overridePath, []byte("vfio-pci"), 0); err != nil {
return fmt.Errorf("write driver_override for %s: %w", vfBDF, err)
}
if err := os.WriteFile("/sys/bus/pci/drivers_probe", []byte(vfBDF), 0); err != nil {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Silent no-op when the VF is already bound to a different driver — writing driver_override does not unbind from the current driver, and drivers_probe only probes unbound devices. So if a VF arrives bound to e.g. igbvf (because some other path autoprobed it, or because of a leftover state across an EVE upgrade), this function returns success and leaves the VF on igbvf. The sriov-network-device-plugin's drivers: vfio-pci selector then quietly matches nothing and the resource never appears on the node — with no log line pointing at the cause.

Two options:

  1. Explicitly unbind first: read <vfBDF>/driver symlink; if it exists and isn't vfio-pci, write vfBDF to <driver>/unbind before the override.
  2. At minimum verify post-write: after drivers_probe, re-read <vfBDF>/driver and return an error (or warn loudly) if it is not vfio-pci.

Today's flow with sriov_drivers_autoprobe=0 makes this unlikely to bite, but the failure mode is invisible enough that a defensive check is worth it.

// the local pod — neighbours' inventories are independent.
func deleteLocalSriovDpPod(ctx context.Context, clientset *kubernetes.Clientset, nodeName string) error {
if nodeName == "" {
return fmt.Errorf("nodeName is empty")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Boot-time race against z.nodeNamez.nodeName is set by the EdgeNodeInfo handler (zedkube.go:1084), which can fire after the first AssignableAdapters Create event during boot. On that first AA event this function returns the error "nodeName is empty", which the caller logs at Warn — alarming for what is just an early-boot ordering issue.

The kubeCfgTimer 60 s safety net will recover, so it's not a real bug, but consider downgrading the empty-nodeName case to Debug/Info (or returning a sentinel that the caller can recognise and log differently). Otherwise every clean boot produces a scary-looking warning.

Comment thread pkg/kube/sriov/sriov-device-plugin.yaml Outdated
spec:
hostNetwork: true
nodeSelector:
kubernetes.io/arch: amd64
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DaemonSet runs on every amd64 node, including non-SR-IOV ones — the only nodeSelector is kubernetes.io/arch: amd64. On a mixed cluster where some nodes have SR-IOV-capable NICs and others don't, the device-plugin pod will sit in CrashLoopBackOff on the latter (as the manifest's own comment acknowledges for the bootstrap window).

For long-running mixed clusters this is more than a transient: it's permanent log/event noise and a small but non-zero resource cost. Consider a label like eve.zededa.com/sriov-capable=true that cluster-init.sh sets only when /sys/bus/pci/devices/*/sriov_numvfs exists (the same check install_sriov_manifests already performs), and use it as an additional nodeSelector here. The sriov-cni DaemonSet has the same property and would benefit from the same gate.

@eriknordmark
Copy link
Copy Markdown
Contributor

@naiming-zededa mark this as 'approved' when you are satisfied (then I or @rene can approve if need be and merge)

@naiming-zededa
Copy link
Copy Markdown
Contributor

@naiming-zededa mark this as 'approved' when you are satisfied (then I or @rene can approve if need be and merge)

@eriknordmark I think @zedi-pramodh will modify more on this.

@zedi-pramodh
Copy link
Copy Markdown
Author

@naiming-zededa mark this as 'approved' when you are satisfied (then I or @rene can approve if need be and merge)

@eriknordmark I think @zedi-pramodh will modify more on this.

Yes I will submit amended PR on Friday once I am back. I need to retest everything.

@zedi-pramodh
Copy link
Copy Markdown
Author

@naiming-zededa addressed your comments and retested.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 17, 2026

Codecov Report

❌ Patch coverage is 2.53165% with 539 lines in your changes missing coverage. Please review.
✅ Project coverage is 21.00%. Comparing base (324bd3c) to head (d6997d1).

Files with missing lines Patch % Lines
pkg/pillar/hypervisor/kubevirt.go 0.00% 199 Missing ⚠️
pkg/pillar/cmd/zedkube/sriov_devplugin.go 0.00% 193 Missing ⚠️
pkg/pillar/sriov/sriov.go 15.55% 75 Missing and 1 partial ⚠️
pkg/pillar/cmd/domainmgr/domainmgr.go 0.00% 48 Missing ⚠️
pkg/pillar/cmd/zedkube/zedkube.go 0.00% 21 Missing ⚠️
pkg/pillar/types/assignableadapters.go 0.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5908      +/-   ##
==========================================
+ Coverage   20.68%   21.00%   +0.31%     
==========================================
  Files         491      502      +11     
  Lines       90599    92782    +2183     
==========================================
+ Hits        18744    19492     +748     
- Misses      70274    71528    +1254     
- Partials     1581     1762     +181     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@zedi-pramodh
Copy link
Copy Markdown
Author

I see Yetus errors, I will try to address those though few of them are coming from upstream yaml files, I will try to ignore those, same with spdx errors.

Comment thread pkg/kube/Dockerfile
# DaemonSet at runtime just to stage the binary on each node. Pinned by
# digest for reproducibility. v2.7.0 ships amd64 only, matching the
# previous DaemonSet's nodeSelector.
FROM --platform=linux/amd64 ghcr.io/k8snetworkplumbingwg/sriov-cni@sha256:fce504e683275ab59215065ddf6433ca80d1b4b53f0d4b0cdf037da6701c7cb4 AS sriov-cni-bin
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@zedi-pramodh , I see that other versions do have arm64 images, is v2.7.0 really mandatory? For instance, v2.8.1 has all images: https://github.com/k8snetworkplumbingwg/sriov-cni/pkgs/container/sriov-cni/260647076?tag=v2.8.1

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rene we just need sriov binary, but we also need to pick a version so I picked v2.7.0 which was latest I think when I started this PR. Coming to arm64 we do not support VMs in kubevirt arm64. when we get there we can pull the arm64 too.

@rene
Copy link
Copy Markdown
Contributor

rene commented May 18, 2026

I see Yetus errors, I will try to address those though few of them are coming from upstream yaml files, I will try to ignore those, same with spdx errors.

@zedi-pramodh , pls, add the upstream files with errors to the corresponding ignore files for Yetus and spdx checker: https://github.com/lf-edge/eve/blob/master/.yamllint and https://github.com/lf-edge/eve/blob/master/.spdxignore

@rene
Copy link
Copy Markdown
Contributor

rene commented May 18, 2026

Regarding PR description template, pls, do not remove the check boxes you didn't select, they should stay unselected so we can see what was checked or not....

@zedi-pramodh
Copy link
Copy Markdown
Author

Regarding PR description template, pls, do not remove the check boxes you didn't select, they should stay unselected so we can see what was checked or not....

Got it.

@zedi-pramodh
Copy link
Copy Markdown
Author

I see Yetus errors, I will try to address those though few of them are coming from upstream yaml files, I will try to ignore those, same with spdx errors.

@zedi-pramodh , pls, add the upstream files with errors to the corresponding ignore files for Yetus and spdx checker: https://github.com/lf-edge/eve/blob/master/.yamllint and https://github.com/lf-edge/eve/blob/master/.spdxignore

Will do. Thanks.

@zedi-pramodh
Copy link
Copy Markdown
Author

I probably need one more amend to commit since I am seeing issues in testing on a Dell device, HP device is OK. I am debugging that issue.

@github-actions github-actions Bot requested a review from rene May 25, 2026 17:14
@zedi-pramodh
Copy link
Copy Markdown
Author

Amended commit, rebased to master. @naiming-zededa can you review again.

The new commit comments explains everything,

@zedi-pramodh
Copy link
Copy Markdown
Author

This needs to be back ported to 17.0.0 branch.

Copy link
Copy Markdown
Contributor

@naiming-zededa naiming-zededa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@eriknordmark eriknordmark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to Naiming

Comment thread pkg/kube/Dockerfile
# below (eve-alpine-deploy, etcdctl, eve-bridge, etc.) execute inside `build`
# and not inside this passthrough stage — otherwise the COPY --from=sriov-cni-bin
# below ends up referencing the stage it's running in (circular dependency).
FROM --platform=linux/amd64 ghcr.io/k8snetworkplumbingwg/sriov-cni@sha256:fce504e683275ab59215065ddf6433ca80d1b4b53f0d4b0cdf037da6701c7cb4 AS sriov-cni-bin
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check what DL3029 from hadolint is all about.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed and amended the commit. This should go away now.

This commit reworks how EVE-Kubernetes assigns SR-IOV Virtual Functions to
VM application instances.  Previously EVE used KubeVirt PCI HostDevices to
pass VFs to VMs, which has two problems:

  1. KubeVirt's HostDevices path treats all VFs sharing a vendor:device
     tuple as a fungible pool.  When two VMs each request "1x VF from this
     NIC", KubeVirt cannot pin a specific VF (by BDF) to a specific VM, so
     the per-VF admin MAC programmed on the PF cannot be tied to the VM
     that ends up using that VF.

  2. When the device model lacks per-VF MAC user config (pf.Vfs.Data[i].Mac
     empty), the VF IoBundle MacAddr field inherits the parent PF's
     hardware MAC.  Combined with (1), every VM sharing the pool then
     observes the same MAC address inside the guest — exactly the bug
     reported.

The new path attaches each VF to the VMI as a KubeVirt SR-IOV interface
backed by a Multus NetworkAttachmentDefinition and the upstream
sriov-network-device-plugin.  Each VF is its own resource pool, pinned to
its exact PCI BDF via the device plugin's pciAddresses selector — so a VMI
asking for eve.network/<pf>_vf<i> is guaranteed to receive that one BDF.
Each VMI also gets a stable, per-(app, VF) MAC that the in-guest driver
observes verbatim.

Code changes
------------

pillar/hypervisor/kubevirt.go

  * CreateReplicaVMIConfig now buckets PCI assignments: SR-IOV VFs
    (IoNetEthVF) go through attachSRIOVInterfaces (Multus + sriov-cni);
    other PCI devices (GPUs, NVMe, USB controllers, plain NICs) keep the
    existing registerWithKV / HostDevices path.  IoNetEthPF entries are
    skipped (PFs must remain host-bound; only their VFs are passed
    through).

  * registerWithKV is reduced to its non-VF responsibility: register
    vendor:device in KubeVirt's PermittedHostDevices CR and append the
    resource to vmi.Spec.Domain.Devices.HostDevices.  The SR-IOV-only
    scaffolding that used to live here — per-node SR-IOV capability
    labeling, the post-update wait for kubelet to advertise the new
    extended resource, and the helpers addSRIOVNodeSelectors,
    labelNodeForSRIOV, waitForKVDeviceResource, sriovNodeLabelKey — is
    removed; those are unreachable on the VF path and unnecessary on the
    non-VF path.  The kubeConfig and nodeName parameters are dropped from
    registerWithKV accordingly.

  * ensureSRIOVNAD creates one NetworkAttachmentDefinition per (PF, VF
    index, VLAN) in the eve-kube-app namespace.  The NAD's annotation
    k8s.v1.cni.cncf.io/resourceName tells Multus to ask the device plugin
    for that specific VF's pool; spec.config tells sriov-cni to bind it
    and apply the VLAN.  Idempotent — refreshed only on content drift
    (e.g. VLAN change).

  * attachSRIOVInterfaces wires vmi.Spec.Domain.Devices.Interfaces[] with
    SRIOV binding and a Multus network reference, and sets MacAddress per
    VF.  VF index is resolved from the IoBundle Phylabel (e.g. "eth2vf5"
    -> 5) via sriov.ParseVfIfaceName, not from VfParams.Index — that
    field can be 0 for every statically-declared VF until
    checkAndFillIoBundle has run, which previously produced per-VM NADs
    all pointing at "<pf>-vf0" and resource requests all targeting
    "<pf>_vf0".

    MAC selection priority (resolveVFMac):
      1. IoBundle.MacAddr, when present and not a stale copy of the
         parent PF's hardware MAC (detected by reading the PF MAC out of
         /sys/bus/pci/devices/<vf>/physfn/net/<ifname>/address).
      2. A deterministic locally-administered MAC generated via
         SHA-256(appUUID || vfBDF) with OUI 02:16:3E — same scheme
         zedrouter uses for switch-network VIFs (see ipam.go:
         generateAppMac).  Stable across reboots, unique per (app, VF).

    Backstop: attachSRIOVInterfaces also calls BindVFToVfioPCI on each VF
    BDF before handing the VMI off to KubeVirt.  BindVFToVfioPCI is
    idempotent, so this is free on the happy path and self-healing on
    the bad one (e.g. iavf re-grabbed the VF after an autoprobe race).

    PFIface lookup: domainmgr populates VfParams.PFIface at parse time,
    but on systems where the VF didn't yet exist in sysfs at that point
    (sriov_numvfs not written), we recover here via
    sriov.GetPFIfaceFromVFBDF before failing.

  * vmiMetaData gains a sriovVFs []sriovVFRef field that records (PF
    ifname, VF index) for every VF wired into this VMI.  On Stop/Delete,
    clearSRIOVAdminMACs walks this list and zeroes the per-VF admin MAC
    on the PF via sriov.ClearVFAdminMAC.  sriov-cni's DEL path doesn't
    reliably clear the admin MAC when the VF is pre-bound to vfio-pci
    (no kernel netdev to operate on), so EVE owns the cleanup.

pillar/sriov/sriov.go

  * CreateVF now takes the PF's PCI BDF instead of the kernel netdev
    name.  /sys/class/net/<dev>/device/sriov_* is sensitive to a netdev
    rename (NIM bridges PFs and renames eth2 -> keth2); if that rename
    overlaps CreateVF's polling, a name-anchored caller fails with "Link
    not found" and the static VF IoBundle entries never get populated.
    /sys/bus/pci/devices/<bdf>/sriov_* is BDF-anchored and stable
    across renames.

  * ensurePFLinkUp -> ensurePFAdminUp.  Bumping sriov_numvfs causes some
    drivers (ixgbe, i40e) to reset the PF and clear IFF_UP; the old code
    additionally waited for OperState=OperUp, which never happens on
    cable-less SR-IOV PFs used purely as VF sources for app passthrough.
    The kernel allows VF allocation and per-VF config on an IFF_UP /
    no-carrier PF, so checking IFF_UP only is sufficient.  Now also
    runs on the numvfs-already-at-target early-return path so a previous
    CreateVF that aborted mid-sequence still recovers IFF_UP on the next
    call.

  * BindVFToVfioPCI attaches a VF to the vfio-pci driver via
    driver_override + drivers_probe (with unbind from the current
    driver if any).  Required because EVE creates VFs with
    sriov_drivers_autoprobe=0 only on first-boot CreateVF; on a reboot
    where numvfs is already at target, CreateVF early-returns before
    writing the autoprobe flag and the resident VF driver (ixgbevf /
    iavf / igbvf) grabs the VFs.  Post-bind verification (readlink the
    driver symlink) catches the case where vfio-pci isn't registered in
    the kernel.

  * GetPFIfaceFromVFBDF resolves a VF's parent PF ifname via
    /sys/bus/pci/devices/<vf-bdf>/physfn/net/<ifname>.  Used by
    domainmgr to populate VfParams.PFIface for VFs that arrive without
    it (statically declared phyio entries that bypass createVfIoBundle).

  * ParseVfIfaceName rewritten.  The previous version used
    fmt.Sscanf("%svf%d", ...), which is broken — Go's %s verb is greedy
    and consumes the entire input, leaving nothing for the literal "vf"
    and trailing %d to match.  The replacement scans backwards for the
    last "vf<digits>" suffix.

  * ClearVFAdminMAC zeroes the per-VF admin MAC programmed on the PF's
    VF table via netlink.LinkSetVfHardwareAddr(pf, idx, 00:..:00).
    Called by kubevirt on VM stop/delete.

pillar/cmd/domainmgr/domainmgr.go

  * checkAndFillIoBundle now populates VfParams.PFIface AND
    VfParams.Index for IoNetEthVF bundles.  IoBundleFromPhyAdapter
    leaves both at zero for statically-declared VFs, which downstream
    (sriov_devplugin's buildSRIOVPools, attachSRIOVInterfaces) cannot
    distinguish from "uninitialised" vs "actually VF 0".  Index is
    re-derived from Phylabel (always idempotent).  PFIface is sourced
    from sysfs (handles NIM rename), best-effort: the VF may not yet
    exist in sysfs at parse time, in which case attachSRIOVInterfaces
    falls back to the same sysfs helper at VMI-attach time.

  * setupVf is restructured around the stable PCI BDF.  CreateVF is
    driven by ib.PciLong (not the ifname, which may rename mid-call),
    and the ifname is re-resolved AFTER CreateVF for the GetVfByTimeout
    + per-VF MAC/VLAN steps.

    Two-phase per-VF setup: bind to vfio-pci FIRST, register with AA
    SECOND.  Previously a single createVfIoBundle failure (e.g.
    SetupVfHardwareAddr rejected because the PF was momentarily admin-
    down) returned from setupVf, leaving every subsequent VF in this PF
    driverless — observed in the field as "one PF works, the other has
    20 driverless VFs and zero allocatable resources".  Now: bind every
    VF, log per-VF createVfIoBundle failures, keep going.  setupVf
    returns the first error so the caller still sees something went
    wrong, but AA tracks every VF that was bound (createVfIoBundle now
    returns the partial bundle on error rather than zero-value).

  * handlePhysicalIOAdapterListImpl now calls setupVf in the steady-
    state branch as well as the !Initialized first-pass branch.  Without
    this, a controller publishing the device model in two phases —
    initial Vfs.Count=0, later bumped to N — leaves sriov_numvfs at 0
    forever because setupVf was previously only invoked during AA
    initialization.  CreateVF is idempotent, so reconciling on every
    event is cheap and self-heals after kernel reboots reset
    sriov_numvfs to zero.

  * createVfIoBundle bug fixes:
      - SetupVfHardwareAddr and SetupVfVlan are now called on the PF
        link (pfIb.Ifname, e.g. "eth0"), not the VF link
        (vfIb.Ifname = "eth0vf0").  netlink.LinkSetVfHardwareAddr /
        LinkSetVfVlan operate on a PF with a VF index; calling them on
        the VF link silently no-ops or errors depending on driver.
      - SetupVfVlan is now called with vfUserConfig.VlanID; vf.VlanID
        was always 0 for freshly created VFs (GetVf does not read the
        current VLAN from sysfs), so user-configured VLANs were never
        applied.
      - The configured VlanID is persisted in vfIb.VfParams.VlanID so
        the hypervisor knows it for NAD construction.

pillar/cmd/zedkube/sriov_devplugin.go (new)

  * reconcileSRIOVDevicePlugin projects the AssignableAdapters
    publication into the sriov-network-device-plugin ConfigMap
    (kube-system/sriovdp-config).  One pool per VF (not per PF), with a
    pciAddresses selector pinning the single BDF — so kubelet
    allocation is a 1:1 BDF->VMI mapping, never a pool lottery.  This
    is the lever that gives EVE deterministic per-VF assignment.

  * Resource name is "<pf>_vf<index>", matching what
    kubevirt.sriovResourceName produces.  Per-PF / VF-index are
    resolved from Phylabel via sriov.ParseVfIfaceName (not VfParams,
    same justification as in attachSRIOVInterfaces).  VF vendor:device
    is read from sysfs of virtfn0 — PF and VF have different IDs
    (e.g. I350 PF 8086:1521, VF 8086:1520), and the device plugin
    enumerates VFs by VF ID.

  * On configuration drift, updates the ConfigMap and deletes only the
    local sriov-network-device-plugin pod (filtered by spec.nodeName);
    the DaemonSet recreates it within seconds, and the new pod re-reads
    the config.  Per-node delete avoids disrupting neighbours in
    multi-node clusters.

  * healDriverlessVFs walks every IoNetEthVF and binds any whose host-
    side driver isn't vfio-pci.  Steady-state self-healer for the "20
    driverless VFs on one PF" failure mode caused by historical bind
    short-circuits; idempotent on the happy path.  Runs at the start of
    every reconcile so a freshly-bound VF lands in the pool's first
    advertisement.

  * garbageCollectSRIOVNADs sweeps NetworkAttachmentDefinitions in
    eve-kube-app whose k8s.v1.cni.cncf.io/resourceName annotation
    references an SR-IOV pool that is no longer in the current AA
    projection.  Scoped by the eve.network/ resource-prefix so only
    NADs created by ensureSRIOVNAD are touched.  Runs after every
    successful ConfigMap reconcile, so removing a PF (or shrinking
    its VF count) also cleans up the dangling NADs.

  * Self-healing: if z.config is nil at reconcile time (kube clientset
    not yet ready when the first AA event arrived), re-acquires
    kubeapi.GetKubeConfig() inside the reconciler so subsequent calls
    succeed without depending on a fresh AA event.

pillar/cmd/zedkube/zedkube.go

  * Subscribes to AssignableAdapters from domainmgr; routes
    Create/Modify/Delete to reconcileSRIOVDevicePlugin.

  * Adds a safety-net call on the existing kubeCfgTimer (60s period)
    that re-drives reconcile from the pubsub snapshot.  Guarantees
    eventual consistency even if the first AA Create event raced
    against k3s startup and the AA never republishes after that.

pillar/types/assignableadapters.go

  * IsNetEthVF() predicate on IoType — used wherever VF-specific
    behaviour is required.  VFs are intentionally excluded from IsNet()
    because they must be vfio-pci-bound regardless of testing mode and
    they are never EVE network ports.

pkg/kube/sriov/sriov-device-plugin.yaml (new)

  * ServiceAccount + DaemonSet for the upstream
    sriov-network-device-plugin (v3.7.0).

  * The sriovdp-config ConfigMap is intentionally NOT shipped here:
    k3s's deploy controller manages every resource defined under
    /var/lib/rancher/k3s/server/manifests/ and reverts any drift, which
    would undo zedkube's writes.  Instead zedkube's reconciler creates
    the ConfigMap on first run via the IsNotFound branch; the DaemonSet
    pod CrashLoopBackOffs for a few seconds at first boot until the
    ConfigMap exists, then starts normally.

pkg/kube/Dockerfile

  * Adds a passthrough build stage `sriov-cni-bin` that pulls
    ghcr.io/k8snetworkplumbingwg/sriov-cni (pinned by digest, amd64
    only) and COPYs /usr/bin/sriov into the kube image.  The binary is
    staged from the host at /usr/bin/sriov by cluster-init.sh — no
    sriov-cni DaemonSet at runtime, removing one moving part vs.
    upstream deployments that ship a DaemonSet just to drop the binary
    on each node.

  * COPYs sriov-device-plugin.yaml into /etc/k3s-manifests/ so
    cluster-init.sh can stage it.

pkg/kube/cluster-init.sh

  * install_sriov_manifests copies the sriov binary to both
    /var/lib/cni/bin (where Multus looks: binDir setting) and
    /opt/cni/bin (the k3s/flannel default), and stages
    sriov-device-plugin.yaml into KUBE_MANIFESTS_DIR.  Gated on
    /sys/bus/pci/devices/*/sriov_numvfs existing — /sys/class/net is
    per-network-namespace and doesn't include host NICs from inside the
    kube container's netns; /sys/bus/pci is namespace-independent.

  * Hoisted out of the one-shot all_components_initialized guard and
    called every iteration of the main loop.  Makes EVE upgrades pick
    up new or updated manifests without requiring a cluster re-init.
    Idempotent via cp -u.

Backward compatibility
----------------------

  * Non-SR-IOV PCI passthrough is unchanged — IoNetEth (whole NICs),
    GPUs, NVMe, USB, etc. continue to flow through registerWithKV /
    KubeVirt PermittedHostDevices.

Signed-off-by: Pramodh Pallapothu <pramodh@zededa.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@zedi-pramodh
Copy link
Copy Markdown
Author

Someone should approve again since I amended the commit for DL3029 fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants