Conversation

@netoptimizer
Owner

Created this PR so @simosund can see the changes and evaluation numbers from the prod experiment.

Based on code from PR xdp-project/bpf-examples#129

This is a direct copy from bpf-examples,
but taken from Simon's devel branch 'netstacklat-groupby':

https://github.com/simosund/bpf-examples/tree/netstacklat-groupby/netstacklat

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Keeping this as a separate commit to track what needed to change

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Gotcha#1: My devel laptop has a TAI offset of zero
 - Other systems (incl. prod) all have 37 sec

Gotcha#2: RX-timestamping needs to be enabled manually
 - something else has to enable RX-timestamping

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
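As a side note on Gotcha#1, a minimal userspace sketch (not part of this patch set) that prints the kernel's TAI-UTC offset by comparing CLOCK_TAI against CLOCK_REALTIME:

/* Minimal sketch: print the kernel's TAI-UTC offset.  A correctly
 * configured system (leap-second info pushed by ntpd/chrony) currently
 * reports 37 seconds; 0 means the offset was never set, as on the devel
 * laptop mentioned above. */
#include <stdio.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11	/* Linux clock id for International Atomic Time */
#endif

int main(void)
{
	struct timespec tai, real;

	if (clock_gettime(CLOCK_TAI, &tai) || clock_gettime(CLOCK_REALTIME, &real)) {
		perror("clock_gettime");
		return 1;
	}

	printf("TAI-UTC offset: ~%ld sec\n", (long)(tai.tv_sec - real.tv_sec));
	return 0;
}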
For ebpf_exporter we cannot control which BPF sections
get loaded.  Instead we disable some of the hooks at
compile time via defines/ifdefs.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Instead, hardcode ifindex limits based on the prod setup,
as we don't have a way to configure this via YAML.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
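A minimal sketch of what such a hardcoded ifindex filter can look like on the BPF side; the bounds below are illustrative assumptions, not the actual prod values:

#include <stdbool.h>

/* Sketch only: hardcoded ifindex range filter, since the ebpf_exporter YAML
 * cannot pass an interface selection down to the BPF program.  The bounds
 * are made-up values for illustration. */
static bool filter_ifindex(int ifindex)
{
	return ifindex >= 2 && ifindex <= 16;
}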
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 18885872401 run_cnt 71,443,820 = 264.34 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 6124797061 run_cnt 16,324,309 = 375.19 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Something is buggy with this filter
 - All latency records end up in the max bucket

The READ_ONCE change doesn't fix the issue

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
This was the real reason for seeing wrong numbers in prod.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
…sync

Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 32164560127 run_cnt 116,590,498 = 275.88 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 10490230543 run_cnt 23,993,428 = 437.21 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Moved filter_nonempty_sockqueue to callers
 - because record_socket_latency() becomes a BPF function call
 - e.g. perf shows bpf_prog_fb69587c6ea462b7_record_socket_latency

Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 181391511590 run_cnt 788663546 = 229.99 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 16212598612 run_cnt 137812779 = 117.64 ns

This is clearly a huge improvement for UDP packets.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
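A sketch of the caller-side ordering this commit describes; the hook and function bodies below are modeled on the names in the commit text and are assumptions, not the actual patch:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

static __u64 recorded_events;

/* Cheap check, done directly in the hook */
static bool filter_nonempty_sockqueue(const struct sock *sk)
{
	return BPF_CORE_READ(sk, sk_receive_queue.qlen) > 0;
}

/* Large enough that the compiler emits a real BPF-to-BPF call, which is
 * what shows up in perf as bpf_prog_..._record_socket_latency */
static __noinline void record_socket_latency(const struct sock *sk)
{
	/* the real code updates the latency histogram; a counter stands in here */
	__sync_fetch_and_add(&recorded_events, 1);
}

SEC("fentry/tcp_recv_timestamp")
int BPF_PROG(netstacklat_tcp_recv_timestamp, struct msghdr *msg,
	     const struct sock *sk)
{
	if (!filter_nonempty_sockqueue(sk))
		return 0;	/* skip the function call entirely */

	record_socket_latency(sk);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";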
Enable filter_nonempty_sockqueue at compile time to make sure
that this config setting doesn't influence performance.

I'm only seeing a small effect.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
The filter_nonempty_sockqueue() is efficient for UDP packets,
but doesn't work well for TCP packets.

Add filtering on the socket queue length: skip recording
if qlen is not above 3 packets for TCP.

Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 10690540076 run_cnt 117852699 = 90.71 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 2206621632 run_cnt 20004338 = 110.30 ns

This is a HUGE improvement for the TCP case.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
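A minimal sketch of such a queue-length filter; the name follows the filter_min_sockqueue_len helper mentioned later in this PR, while the field read and the >= comparison (which a later commit settles on) are assumptions:

#include "vmlinux.h"
#include <bpf/bpf_core_read.h>

/* Sketch: only record latency once the socket receive queue has built up
 * to at least min_qlen packets (e.g. a few packets for TCP). */
static bool filter_min_sockqueue_len(const struct sock *sk, __u32 min_qlen)
{
	return BPF_CORE_READ(sk, sk_receive_queue.qlen) >= min_qlen;
}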
Leverage BPF_MAP_TYPE_CGRP_STORAGE for our cgroup filter.

To evaluate the two different cgroup_id_map types, the
macro CONFIG_CGRP_STORAGE is introduced, allowing
us to switch implementations at compile time.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
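A sketch of the CGRP_STORAGE-backed filter that CONFIG_CGRP_STORAGE selects; it follows the lookup pattern quoted later in the review thread, while the map definition details are assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);	/* required for local-storage maps */
	__type(key, int);
	__type(value, __u64);
} netstack_cgroupfilter SEC(".maps");

static bool filter_cgroup(void)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct cgroup *cgrp = task->cgroups->dfl_cgrp;

	/* Existence of a storage entry marks the cgroup as one we record for */
	return bpf_cgrp_storage_get(&netstack_cgroupfilter, cgrp, 0, 0) != NULL;
}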
The ebpf_exporter variant of netstacklat is not runtime configurable at
BPF-load time. Thus, the user_config below isn't defined as 'volatile';
instead, the 'const' allows the compiler to do dead-code elimination.

We also disable user_config.filter_nonempty_sockqueue as we want to
stress the cgroup lookup types some more.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
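A sketch of the const user_config pattern being described; the struct layout and field names are assumptions pieced together from the surrounding commit messages:

#include "vmlinux.h"

struct netstacklat_config {
	bool filter_nonempty_sockqueue;
	bool filter_cgroup;
	bool filter_network_ns;
	__u32 network_ns;	/* netns inode to match when filtering */
};

/* Plain 'const' (not 'volatile const'): the values cannot be overridden at
 * BPF-load time in the ebpf_exporter setup, so the compiler may constant-fold
 * them and dead-code eliminate branches guarded by disabled features. */
static const struct netstacklat_config user_config = {
	.filter_nonempty_sockqueue = false,	/* off: stress the cgroup lookups */
	.filter_cgroup = true,
	.filter_network_ns = false,
};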
This required changing the call signature of filter_cgroup() to also
populate the cgroup_id, as that is used for groupby_cgroup.

We are still using CONFIG_CGRP_STORAGE, which leverages
BPF_MAP_TYPE_CGRP_STORAGE. Without the queue-length filters
(filter_nonempty_sockqueue) we are getting more calls. This is on purpose,
to evaluate the cgroup_id_map types.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 17953390952 run_cnt 55079620 = 325.95 ns

name netstacklat_skb_consume_udp
 - run_time_ns 5779869863 run_cnt 11650472 = 496.10 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Changing the filter_cgroup() to use the cgroup_id_map type hash.
Surprisingly this type seems to be faster than type cgrp_storage.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 10705407914 run_cnt 41576592 = 257.48 ns
 - diff: 325.95 - 257.48 = 68.47 ns better

name netstacklat_skb_consume_udp
 - run_time_ns 3716653454 run_cnt 8499677 = 437.27 ns
 - diff: 496.10 - 437.27 = 58.83 ns better

On this AMD CPU with the SRSO mitigation enabled, we have extra overhead on BPF helper calls.
The filter_cgroup() for type cgrp_storage has two extra helper calls,
bpf_get_current_task_btf() and bpf_cgrp_storage_get(), but we eliminated the
bpf_get_current_cgroup_id() helper call. Still, this doesn't fully account for the
diff. That said, if a BPF prog already has the task_struct available, then the
bpf_get_current_task_btf() call can also be eliminated, so type cgrp_storage might
still be useful in that case.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
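For comparison, a sketch of the hash-based variant that won here; the map name and sizing are assumptions, while the two helper calls are the ones named above:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);	/* cgroup id */
	__type(value, __u8);	/* presence marks an allowed cgroup */
} cgroup_id_map SEC(".maps");

static bool filter_cgroup(__u64 *cgroup_id)
{
	/* Two helper calls total; the id is also returned to the caller
	 * because groupby_cgroup needs it for the histogram key. */
	*cgroup_id = bpf_get_current_cgroup_id();
	return bpf_map_lookup_elem(&cgroup_id_map, cgroup_id) != NULL;
}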
If we have disabled network_ns filtering, the code still does a lookup of the
current namespace by calling get_network_ns().

Reorganize the code to avoid this call when the feature is disabled.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 10623365578 run_cnt 44842812 = 236.90 ns
 - diff: 257.48 - 236.90 = 20.58 ns better

name netstacklat_skb_consume_udp
 - run_time_ns 3718153230 run_cnt 9902613 = 375.47 ns
 - diff: 437.27 - 375.47 = 61.80 ns better

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
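A sketch of the reorganized check; get_network_ns() is shown as the common CO-RE read of the netns inode number, and the user_config fields are assumptions:

#include "vmlinux.h"
#include <bpf/bpf_core_read.h>

static __u32 get_network_ns(const struct sock *sk)
{
	return BPF_CORE_READ(sk, __sk_common.skc_net.net, ns.inum);
}

static bool filter_network_ns(const struct sock *sk)
{
	/* Early return: with the feature disabled we never pay for the
	 * namespace lookup above. */
	if (!user_config.filter_network_ns)
		return true;

	return get_network_ns(sk) == user_config.network_ns;
}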
Let us control filter_queue_len() via a separate user_config setting.
Also enable this for UDP sockets.

Notice that filter_cgroup() still runs first, so it is the primary
filter. And filter_nonempty_sockqueue is still false.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 15661530364 run_cnt 94922963 = 164.99 ns

name netstacklat_skb_consume_udp
 - run_time_ns 3255451250 run_cnt 14532586 = 224.01 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Verified that performance is the same as with the previous patch.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
It is more efficient to first filter out sockets with empty or small queues.
This is a tradeoff, as our production use case is to capture when the system
becomes overloaded.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 15347880145 run_cnt 177786472 = 86.32 ns

name netstacklat_skb_consume_udp
 - run_time_ns 3096529442 run_cnt 33903931 = 91.33 ns

The performance gain is huge. Do remember that this is the average runtime cost
that is reduced, because we can skip recording many of these events.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
@simosund

Changing the filter_cgroup() to use the cgroup_id_map type hash.
Surprisingly this type seems to be faster than type cgrp_storage.

name netstacklat_tcp_recv_timestamp

  • run_time_ns 10705407914 run_cnt 41576592 = 257.48 ns
  • diff: 325.95 - 257.48 = 68.47 ns better

name netstacklat_skb_consume_udp

  • run_time_ns 3716653454 run_cnt 8499677 = 437.27 ns
  • diff: 496.10 - 437.27 = 58.83 ns better

On this AMD CPU with the SRSO mitigation enabled, we have extra overhead on BPF helper calls.
The filter_cgroup() for type cgrp_storage has two extra helper calls,
bpf_get_current_task_btf() and bpf_cgrp_storage_get(), but we eliminated the
bpf_get_current_cgroup_id() helper call. Still, this doesn't fully account for the
diff. That said, if a BPF prog already has the task_struct available, then the
bpf_get_current_task_btf() call can also be eliminated, so type cgrp_storage might
still be useful in that case.

Hmm, that the cgroup storage was slower was a bit unexpected.

And shouldn't it even be the same number of helper calls? You essentially trade bpf_get_current_cgroup_id() for bpf_get_current_task() and bpf_map_lookup_elem() for bpf_cgrp_storage_get().

struct task_struct *task = bpf_get_current_task_btf();
struct cgroup *cgrp = task->cgroups->dfl_cgrp;

return bpf_cgrp_storage_get(&netstack_cgroupfilter, cgrp, 0, 0) != NULL;

One thought here. In the example cgroup_id_map_cgrp_storage.bpf.c that PR cloudflare#565 added, they implement their lookup/check slightly differently from you.

Instead of just checking if the cgrp storage exists for the given cgroup, in the example they zero-initialize it if it doesn't exist, and then check if the content is non-zero. To me that seems like it should be more work than just checking for existence as you do. However, if a cgrp lookup miss takes significantly longer than a lookup hit for some reason, perhaps the way they do it in the example makes sense. Might be worth trying at least.
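For reference, a minimal sketch of the lookup style described above, reusing the netstack_cgroupfilter map from the earlier snippet; the value semantics (0 = not allowed, non-zero = allowed) are an assumption:

static bool filter_cgroup_create_style(void)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct cgroup *cgrp = task->cgroups->dfl_cgrp;
	__u64 *val;

	/* Zero-initializes the per-cgroup storage on a miss instead of
	 * returning NULL ... */
	val = bpf_cgrp_storage_get(&netstack_cgroupfilter, cgrp, 0,
				   BPF_LOCAL_STORAGE_GET_F_CREATE);

	/* ... so the decision is made on the stored value, not on existence */
	return val && *val != 0;
}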

@netoptimizer (Owner, Author) Aug 27, 2025

I tried it again and benchmarked this on a production workload.

  1. First line is: Type: hash = 183 ns
  2. Second line is: Type: cgrp_storage = 240 ns with zero as flags.
  3. Last line is: Type: cgrp_storage = 248 ns with BPF_LOCAL_STORAGE_GET_F_CREATE (and comparing value 0 vs 1)
[image: runtime graph showing the three variants listed above]

Conclusion is clear: the cgrp_storage type is slower.

That cgrp storage was not only slower, but that much slower (~60 ns), is a bit surprising (and disappointing). But thanks for doing the tests; now we at least know!

I see that you have it running with some dashboard at least =)

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Our checks for empty or almost empty sockets were wrong,
because sockets also have a backlog queue.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
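A sketch of one way to account for the backlog queue in the emptiness check; whether upstream tests exactly these two fields is an assumption:

#include "vmlinux.h"
#include <bpf/bpf_core_read.h>

static bool sock_queues_nonempty(const struct sock *sk)
{
	/* Packets can sit on the socket backlog while sk_receive_queue is
	 * still empty, which is what made the earlier check wrong. */
	return BPF_CORE_READ(sk, sk_receive_queue.qlen) > 0 ||
	       BPF_CORE_READ(sk, sk_backlog.len) > 0;
}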
This changes the test to be 'ge' (greater-than-or-equal).
We want the ability to specify 1, meaning a queue size of one
and above. Before, 1 meant a queue size of 2 and above.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Let record_skb_latency() also use it.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
For production usage, we want the ability to disable the UDP
hooks.  Introduce CONFIG_xxx_HOOKS for IP, UDP and TCP.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
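A sketch of how such CONFIG_xxx_HOOKS gating can look; the macro spelling and the hook shown here are assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define CONFIG_TCP_HOOKS 1
/* #define CONFIG_UDP_HOOKS 1 */	/* disabled for this production deploy */
/* #define CONFIG_IP_HOOKS 1 */

#ifdef CONFIG_UDP_HOOKS
SEC("fentry/skb_consume_udp")
int BPF_PROG(netstacklat_skb_consume_udp, struct sock *sk, struct sk_buff *skb)
{
	/* UDP dequeue-latency recording elided */
	return 0;
}
#endif /* CONFIG_UDP_HOOKS */

char LICENSE[] SEC("license") = "GPL";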
For production we need to reduce the number of Prometheus
histogram buckets for this metric.  As the dequeue hooks are unlikely
to see sub-microsecond latencies, we reduce the resolution to usecs.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
When importing this into ebpf_exporter we need to reformat
the C code, which is done via the command line:

 clang-format -i configs/netstacklat.{h,bpf.c}

But it doesn't convert these macros and comments.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
We are currently only using a single hook, so change N_HOOKS,
which is used in the max_entries calculation for netstack_latency_seconds.

Detect if other hooks get enabled and make the compile fail.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
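A sketch of such a compile-time guard, reusing the CONFIG_xxx_HOOKS naming from the earlier commit; the exact condition is an assumption:

/* Only the TCP hook is compiled in, so the histogram map is sized for one
 * hook; break the build if someone re-enables the others without updating
 * N_HOOKS and the netstack_latency_seconds max_entries calculation. */
#define N_HOOKS 1

#if defined(CONFIG_IP_HOOKS) || defined(CONFIG_UDP_HOOKS)
#error "N_HOOKS assumes only the TCP hook is enabled; update the max_entries calc"
#endif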
We unfortunately need atomic counter updates for the nth-filter.
This is because hooks like tcp-socket-read run outside the socket
lock in a preempt/migrate-able user context.

We don't need an accurate nth-counter across CPUs, as this is
just a down-sampling mechanism.  Thus, we keep the PERCPU
array map and have the nth-counter on a per-CPU basis.  The
trick here is that in most cases the counter is only used
by the currently running CPU, and the cache line will mostly
be in a cache-coherency Exclusive/Modified (MOESI) state,
which costs less when doing atomic updates.

Manual testing in production showed a 7 ns runtime increase
(before 150.88 ns, after 157.67 ns).

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
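A sketch of the per-CPU atomic nth-sampler described above; the map name and how the sampling period is supplied are assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} nth_counter SEC(".maps");

/* period 32 => sample ~3.12%, period 2 => sample 50% */
static bool sample_nth(__u64 period)
{
	__u32 key = 0;
	__u64 *cnt = bpf_map_lookup_elem(&nth_counter, &key);

	if (!cnt)
		return false;

	/* Atomic add: the hook can run in a preempt/migrate-able context, so
	 * two tasks can race on the same per-CPU slot.  Usually the cache
	 * line stays Exclusive/Modified on the local CPU, keeping this cheap. */
	return (__sync_fetch_and_add(cnt, 1) % period) == 0;
}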
Sample every 2nd packet (50%)
 - Overhead: 6039419756 / 32219314 = 187.44 ns

Compared to local atomic-nth overhead: 157.67 ns
 - approx 30 ns extra cost to sample 50% vs 3.12%

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
name netstacklat_tcp_recv_timestamp
 - run_time_ns 17510185954 run_cnt 101083454 = 173.23 ns

Sample every 4th packet (25%)
 - Overhead: 173.23 ns
 - Compared to nth-2 (187.44 ns) saved 14.21 ns (187.44-173.23)
 - Compared to nth-32 (157.67 ns) cost 15.56 ns more (173.23-157.67)

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
name netstacklat_tcp_recv_timestamp
 - run_time_ns 24383044912 run_cnt 121125888 = 201.30 ns

Compared to
 - nth-2  : 186 ns -> +15 ns
 - nth-4  : 173 ns -> +28 ns
 - nth-32 : 157 ns -> +44 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Because production has too little traffic on the internal
interface, the latency stats become unusable.

Via CONFIG_GROUPBY_IFINDEX, also remove this from the
hist_key; note that the YAML file also needs adjustments.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
The upstream version of netstacklat that we are based on
got merged; see PR#129.

xdp-project/bpf-examples#129

Some adjustments were made, so let's sync with these
to avoid diverging too much from upstream.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
As filter_min_sockqueue_len can replace it.

This was also part of the PR#129 merge, but keeping this in a
separate commit makes it easier to review.

xdp-project/bpf-examples#129

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Lacking a YAML ebpf_exporter config for selecting iface names,
we hard-coded the ifindex, but some production servers have a
higher ifindex for vlan100.

Relax the ifindex range as a workaround.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>