Description
What happened:
On Amazon Linux 2023 EKS nodes, systemd-networkd sometimes configures secondary ENIs before the VPC CNI is initialized, resulting in missing policy routing rules for pod IPs. This leads to intermittent network failures, stale routes, and martian source kernel messages.
The behavior is inconsistent across node groups - some have secondary ENIs marked as Unmanaged by systemd-networkd as expected, while others end up with systemd-networkd fully managing the ENIs, causing duplicate or missing routes.
# networkctl status ens5
● 2: ens5
Link File: /usr/lib/systemd/network/99-default.link
Network File: /run/systemd/network/70-eks-ens5.network
State: routable (configured)
Online state: online
Type: ether
Path: pci-0000:00:05.0
Driver: ena
Vendor: Amazon.com, Inc.
Model: Elastic Network Adapter (ENA)
Alternative Names: enp0s5
Hardware Address: 0a:ff:e1:47:45:f3
MTU: 9001 (min: 128, max: 9216)
QDisc: mq
IPv6 Address Generation Mode: eui64
Number of Queues (Tx/Rx): 2/2
Address: 10.101.41.199 (DHCP4 via 10.101.41.1)
fe80::8ff:e1ff:fe47:45f3
Gateway: 10.101.41.1
DNS: 10.101.0.2
Search Domains: ec2.internal
Activation Policy: up
Required For Online: yes
DHCP4 Client ID: IAID:0xed10bdb8/DUID
DHCP6 Client IAID: 0xed10bdb8
DHCP6 Client DUID: DUID-EN/Vendor:0000ab11cc500c456718037f
# networkctl status ens6
● 7: ens6
Link File: /usr/lib/systemd/network/99-default.link
Network File: /run/systemd/network/70-eks-ens6.network
State: routable (configured)
Online state: online
Type: ether
Path: pci-0000:00:06.0
Driver: ena
Vendor: Amazon.com, Inc.
Model: Elastic Network Adapter (ENA)
Alternative Names: enp0s6
Hardware Address: 0a:ff:c9:51:6a:c7
MTU: 9001 (min: 128, max: 9216)
QDisc: mq
IPv6 Address Generation Mode: eui64
Number of Queues (Tx/Rx): 2/2
Address: 10.101.41.164 (DHCP4 via 10.101.41.1)
fe80::8ff:c9ff:fe51:6ac7
Gateway: 10.101.41.1
DNS: 10.101.0.2
Search Domains: ec2.internal
Activation Policy: up
Required For Online: yes
DHCP4 Client ID: IAID:0x6618dd42/DUID
DHCP6 Client IAID: 0x6618dd42
DHCP6 Client DUID: DUID-EN/Vendor:0000ab11cc500c456718037f
IPv4: martian source 10.101.40.133 from 10.101.40.1, on dev ens6
$ ip route
default via 10.101.40.1 dev ens5 proto dhcp src 10.101.40.117 metric 512
default via 10.101.40.1 dev ens6 proto dhcp src 10.101.40.196 metric 513
default via 10.101.40.1 dev ens7 proto dhcp src 10.101.40.20 metric 514
_Note: the IP addresses differ between the outputs above because the logs come from different failing nodes._
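To make these two symptoms easier to spot across many nodes, here is a minimal Python sketch that parses the text shown above. The function names are illustrative and not part of any AWS or systemd tooling. It assumes `ip route` and kernel log lines look like the samples in this report.

```python
import re

# Sample copied from the `ip route` output in this report.
SAMPLE_ROUTES = """\
default via 10.101.40.1 dev ens5 proto dhcp src 10.101.40.117 metric 512
default via 10.101.40.1 dev ens6 proto dhcp src 10.101.40.196 metric 513
default via 10.101.40.1 dev ens7 proto dhcp src 10.101.40.20 metric 514
"""

# Matches kernel lines such as:
#   IPv4: martian source 10.101.40.133 from 10.101.40.1, on dev ens6
MARTIAN_RE = re.compile(r"martian source (\S+) from (\S+), on dev (\S+)")


def default_route_devices(ip_route_output):
    """Return the devices that carry a default route, in output order."""
    devices = []
    for line in ip_route_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "default" and "dev" in fields:
            devices.append(fields[fields.index("dev") + 1])
    return devices


def parse_martian(line):
    """Return (first_address, second_address, device) or None if no match."""
    match = MARTIAN_RE.search(line)
    return match.groups() if match else None


if __name__ == "__main__":
    devices = default_route_devices(SAMPLE_ROUTES)
    if len(devices) > 1:
        print("multiple default routes in main table:", ", ".join(devices))
```

More than one device in the result suggests that DHCP on a secondary ENI installed its own default route in the main table, which matches the broken state described in this report.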
What you expected to happen:
Secondary ENIs (ens6 and above) should remain unmanaged by systemd-networkd so that the AWS VPC CNI can fully configure policy routing, IP rules, and ENI-specific routing tables.
Actual Behavior:
Occasionally, systemd-networkd takes over configuration of secondary ENIs during boot, preventing the AWS VPC CNI from setting correct routing rules. This results in pod connectivity failures, missing route tables, and dropped packets.
How to reproduce it (as minimally and precisely as possible):
This issue is difficult to reproduce reliably because it appears to be a boot-time race condition between:
• systemd-networkd, which configures interfaces as soon as they appear
• AWS VPC CNI, which expects to configure secondary ENIs before systemd-networkd touches them
However, the problem can be reproduced more consistently by intentionally slowing down early boot steps so the secondary ENIs attach before kubelet and aws-node are fully initialized.
Suggested reproduction strategy:
1. Create an AL2023 EKS node group on EKS 1.34
2. Configure your cluster or nodegroup to ensure secondary ENIs will be attached:
• set WARM_ENI_TARGET >= 1
• or set a high WARM_IP_TARGET
3. Delay node initialization so ENIs attach before CNI is running.
For example:
• inject an artificial delay in the pre-nodeadm user data, before nodeadm init runs (e.g., sleep 300)
• add a slow external proxy or metadata throttling
• use a large cloud-init payload
4. Check network state after each boot:
• networkctl status
• ip rule
• ip route show table 10001
• dmesg | grep -i martian
Eventually, a node will come up with missing policy routing rules and pod connectivity failures.
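The first check in the list above can be automated. The following is a rough Python sketch that reads the text output of `networkctl list` and reports interfaces other than the primary whose SETUP column is anything except `unmanaged`. The sample output is hypothetical (it was not captured from an affected node). The helper names and the assumption that `ens5` is the primary ENI are illustrative too.

```python
# Hypothetical `networkctl list` output, for illustration only.
SAMPLE_LIST = """\
IDX LINK TYPE     OPERATIONAL SETUP
  1 lo   loopback carrier     unmanaged
  2 ens5 ether    routable    configured
  7 ens6 ether    routable    configured
  8 ens7 ether    routable    unmanaged

4 links listed.
"""


def parse_networkctl_list(text):
    """Map link name -> SETUP state from `networkctl list` text output."""
    links = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows have five columns and start with a numeric index;
        # this skips the header, blank lines, and the trailing summary.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        links[fields[1]] = fields[4]
    return links


def managed_secondary_links(links, primary="ens5", prefix="ens"):
    """Return non-primary links with the given prefix that networkd manages."""
    return sorted(
        name
        for name, setup in links.items()
        if name.startswith(prefix) and name != primary and setup != "unmanaged"
    )


if __name__ == "__main__":
    for name in managed_secondary_links(parse_networkctl_list(SAMPLE_LIST)):
        print("systemd-networkd is managing secondary interface", name)
```

On a node, the text could be obtained by running `networkctl list --no-pager` and passing its standard output to `parse_networkctl_list`.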
Expected symptoms in the broken state:
• systemd-networkd configures the secondary ENI
• CNI skips configuring the ENI because systemd marked it as “managed”
• The routing table for the ENI is missing (no from <POD_IP> lookup 10001)
• Duplicate DHCP-added default routes appear
• Kernel logs martian source messages
• Pods scheduled to that ENI lose connectivity
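The missing `from <POD_IP> lookup <table>` rule can be checked with a short script. This is a sketch under assumptions: the `ip rule` sample below is hypothetical, its priorities are illustrative, and the helper names are invented for this example. The caller is expected to pass only pod IPs that were assigned from secondary ENIs, since pods on the primary ENI may legitimately have no source-based rule.

```python
# Hypothetical `ip rule show` output, for illustration only.
SAMPLE_RULES = (
    "0:\tfrom all lookup local\n"
    "512:\tfrom all to 10.101.40.50 lookup main\n"
    "1536:\tfrom 10.101.40.133 lookup 10001\n"
    "32766:\tfrom all lookup main\n"
    "32767:\tfrom all lookup default\n"
)


def rules_by_source(ip_rule_output):
    """Map each `from` selector to the list of tables its rules look up."""
    rules = {}
    for line in ip_rule_output.splitlines():
        fields = line.split()
        if "from" not in fields or "lookup" not in fields:
            continue
        source = fields[fields.index("from") + 1]
        table = fields[fields.index("lookup") + 1]
        rules.setdefault(source, []).append(table)
    return rules


def pods_missing_rule(pod_ips, ip_rule_output):
    """Return the pod IPs that have no source-based policy routing rule."""
    rules = rules_by_source(ip_rule_output)
    return [
        ip for ip in pod_ips
        if ip not in rules and ip + "/32" not in rules
    ]


if __name__ == "__main__":
    missing = pods_missing_rule(["10.101.40.133", "10.101.40.134"], SAMPLE_RULES)
    for ip in missing:
        print("no 'from", ip, "lookup ...' rule found")
```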
Anything else we need to know?:
Impact
• Pods on affected nodes cannot reach services or other pods
• DNS resolves but traffic drops on reply path (asymmetric routing)
• Kernel drops packets (rp_filter)
• EKS upgrades are blocked due to instability
• Production workloads experience intermittent failures
• AL2023 migration becomes unreliable without manual overrides
Hotfix
A reliable short-term mitigation is to prevent systemd-networkd from managing secondary ENIs entirely, ensuring that the AWS VPC CNI has exclusive control over routing and interface configuration.
We confirmed that placing the following override file on AL2023 nodes resolves the issue:
--BOUNDARY
Content-Type: text/cloud-config; charset="us-ascii"

#cloud-config
write_files:
  - path: /etc/systemd/network/10-vpc-cni-secondary.network
    owner: root:root
    permissions: '0644'
    content: |
      [Match]
      Name=ens[6-9]* ens[1-9][0-9]*
      [Link]
      Unmanaged=yes
runcmd:
  - [systemctl, daemon-reload]
  - [systemctl, restart, systemd-networkd]
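To sanity-check which interface names the `Name=` globs in this override cover, here is a small Python sketch using the standard `fnmatch` module. Python's matcher is not the same implementation systemd uses, so treat this as an approximation. For simple patterns like these (a character range followed by `*`), the behavior should be the same. The function name is illustrative.

```python
from fnmatch import fnmatchcase

# The two space-separated globs from the [Match] section of the override.
PATTERNS = "ens[6-9]* ens[1-9][0-9]*".split()


def covered_by_override(ifname):
    """Return True if the interface name matches either glob."""
    return any(fnmatchcase(ifname, pattern) for pattern in PATTERNS)


if __name__ == "__main__":
    for name in ["ens5", "ens6", "ens7", "ens10", "eth0"]:
        state = "unmanaged" if covered_by_override(name) else "not matched"
        print(name, "->", state)
```

The second glob requires at least two digits after `ens`, so `ens5` (the primary ENI on the nodes in this report) stays under systemd-networkd while `ens6` and above are left alone. This assumes the primary interface is always named `ens5`.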
This seems related to the discussion in:
• awslabs/amazon-eks-ami#1738
Environment:
- Kubernetes version: Server Version: v1.34.1-eks-3cfe0ce
- CNI Version: v1.20.2-eksbuild.1
- OS (e.g., cat /etc/os-release):
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023.9.20251105"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
VENDOR_NAME="AWS"
VENDOR_URL="https://aws.amazon.com/"
SUPPORT_END="2029-06-30"
- Kernel:
Linux 6.12.53-69.119.amzn2023.x86_64 #1 SMP PREEMPT_DYNAMIC Tue Oct 21 22:19:00 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux