Conversation

@netoptimizer
Owner

Created this PR so @simosund can see the changes and evaluation numbers from the prod experiment.

Based on code from PR xdp-project/bpf-examples#129

This is a direct copy from bpf-examples,
but taken from Simon's devel branch 'netstacklat-groupby':

https://github.com/simosund/bpf-examples/tree/netstacklat-groupby/netstacklat

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Keeping this as a separate commit to track what needed to change

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Gotcha#1: My devel laptop has a TAI offset of zero
 - Other systems (incl. prod) all have 37 sec

Gotcha#2: RX-timestamping needs to be enabled manually
 - something else has to enable RX-timestamping

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
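As a side note on Gotcha#1, a minimal userspace sketch (not part of this patch set) that prints the kernel's TAI-UTC offset by comparing CLOCK_TAI against CLOCK_REALTIME:

/* Minimal sketch: print the kernel's TAI-UTC offset.  A correctly
 * configured system (leap-second info pushed by ntpd/chrony) currently
 * reports 37 seconds; 0 means the offset was never set, as on the devel
 * laptop mentioned above. */
#include <stdio.h>
#include <time.h>

#ifndef CLOCK_TAI
#define CLOCK_TAI 11	/* Linux clock id for International Atomic Time */
#endif

int main(void)
{
	struct timespec tai, real;

	if (clock_gettime(CLOCK_TAI, &tai) || clock_gettime(CLOCK_REALTIME, &real)) {
		perror("clock_gettime");
		return 1;
	}

	printf("TAI-UTC offset: ~%ld sec\n", (long)(tai.tv_sec - real.tv_sec));
	return 0;
}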
For ebpf_exporter we cannot control which BPF sections
get loaded.  Instead we disable some of the hooks at
compile time via defines/ifdefs.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Instead, hardcode ifindex limits based on the prod setup,
as we don't have a way to configure this via YAML.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
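A minimal sketch of what such a hardcoded ifindex filter can look like on the BPF side; the bounds below are illustrative assumptions, not the actual prod values:

#include <stdbool.h>

/* Sketch only: hardcoded ifindex range filter, since the ebpf_exporter YAML
 * cannot pass an interface selection down to the BPF program.  The bounds
 * are made-up values for illustration. */
static bool filter_ifindex(int ifindex)
{
	return ifindex >= 2 && ifindex <= 16;
}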
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 18885872401 run_cnt 71,443,820 = 264.34 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 6124797061 run_cnt 16,324,309 = 375.19 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Something is buggy with this filter
 - All latency records end up in the max bucket

The READ_ONCE change doesn't fix the issue

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
This was the real reason for seeing wrong numbers in prod.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
…sync

Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 32164560127 run_cnt 116,590,498 = 275.88 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 10490230543 run_cnt 23,993,428 = 437.21 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Moved filter_nonempty_sockqueue to callers
 - because record_socket_latency() becomes a BPF function call
 - e.g. perf shows bpf_prog_fb69587c6ea462b7_record_socket_latency

Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 181391511590 run_cnt 788663546 = 229.99 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 16212598612 run_cnt 137812779 = 117.64 ns

This is clearly a huge improvement for UDP packets.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
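A sketch of the caller-side ordering this commit describes; the hook and function bodies below are modeled on the names in the commit text and are assumptions, not the actual patch:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

static __u64 recorded_events;

/* Cheap check, done directly in the hook */
static bool filter_nonempty_sockqueue(const struct sock *sk)
{
	return BPF_CORE_READ(sk, sk_receive_queue.qlen) > 0;
}

/* Large enough that the compiler emits a real BPF-to-BPF call, which is
 * what shows up in perf as bpf_prog_..._record_socket_latency */
static __noinline void record_socket_latency(const struct sock *sk)
{
	/* the real code updates the latency histogram; a counter stands in here */
	__sync_fetch_and_add(&recorded_events, 1);
}

SEC("fentry/tcp_recv_timestamp")
int BPF_PROG(netstacklat_tcp_recv_timestamp, struct msghdr *msg,
	     const struct sock *sk)
{
	if (!filter_nonempty_sockqueue(sk))
		return 0;	/* skip the function call entirely */

	record_socket_latency(sk);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";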
Enable filter_nonempty_sockqueue at compile time to make sure
that this config setting doesn't influence performance.

I'm only seeing a small effect.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
The filter_nonempty_sockqueue() is efficient for UDP packets,
but doesn't work well for TCP packets.

Add filtering on the socket queue length: skip recording
if qlen is not above 3 packets for TCP.

Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 10690540076 run_cnt 117852699 = 90.71 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 2206621632 run_cnt 20004338 = 110.30 ns

This is a HUGE improvement for the TCP case.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
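A minimal sketch of such a queue-length filter; the name follows the filter_min_sockqueue_len helper mentioned later in this PR, while the field read and the >= comparison (which a later commit settles on) are assumptions:

#include "vmlinux.h"
#include <bpf/bpf_core_read.h>

/* Sketch: only record latency once the socket receive queue has built up
 * to at least min_qlen packets (e.g. a few packets for TCP). */
static bool filter_min_sockqueue_len(const struct sock *sk, __u32 min_qlen)
{
	return BPF_CORE_READ(sk, sk_receive_queue.qlen) >= min_qlen;
}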
Leverage BPF_MAP_TYPE_CGRP_STORAGE for our cgroup filter.

To evaluate the two different cgroup_id_map types, the
macro CONFIG_CGRP_STORAGE is introduced, allowing
us to switch implementations at compile time.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
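A sketch of the CGRP_STORAGE-backed filter that CONFIG_CGRP_STORAGE selects; it follows the lookup pattern quoted later in the review thread, while the map definition details are assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);	/* required for local-storage maps */
	__type(key, int);
	__type(value, __u64);
} netstack_cgroupfilter SEC(".maps");

static bool filter_cgroup(void)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct cgroup *cgrp = task->cgroups->dfl_cgrp;

	/* Existence of a storage entry marks the cgroup as one we record for */
	return bpf_cgrp_storage_get(&netstack_cgroupfilter, cgrp, 0, 0) != NULL;
}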
The ebpf_exporter variant of netstacklat is not runtime configurable at
BPF-load time. Thus, the user_config below isn't defined as 'volatile';
instead, the 'const' allows the compiler to do dead-code elimination.

We also disable user_config.filter_nonempty_sockqueue as we want to
stress the cgroup lookup types some more.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
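A sketch of the const user_config pattern being described; the struct layout and field names are assumptions pieced together from the surrounding commit messages:

#include "vmlinux.h"

struct netstacklat_config {
	bool filter_nonempty_sockqueue;
	bool filter_cgroup;
	bool filter_network_ns;
	__u32 network_ns;	/* netns inode to match when filtering */
};

/* Plain 'const' (not 'volatile const'): the values cannot be overridden at
 * BPF-load time in the ebpf_exporter setup, so the compiler may constant-fold
 * them and dead-code eliminate branches guarded by disabled features. */
static const struct netstacklat_config user_config = {
	.filter_nonempty_sockqueue = false,	/* off: stress the cgroup lookups */
	.filter_cgroup = true,
	.filter_network_ns = false,
};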
This required changing the call signature of filter_cgroup() to also
populate the cgroup_id, as that is used for groupby_cgroup.

We are still using CONFIG_CGRP_STORAGE, which leverages
BPF_MAP_TYPE_CGRP_STORAGE. Without the queue-length filters
(filter_nonempty_sockqueue) we are getting more calls. This is on purpose,
to evaluate the cgroup_id_map types.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 17953390952 run_cnt 55079620 = 325.95 ns

name netstacklat_skb_consume_udp
 - run_time_ns 5779869863 run_cnt 11650472 = 496.10 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Changing the filter_cgroup() to use the cgroup_id_map type hash.
Surprisingly this type seems to be faster than type cgrp_storage.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 10705407914 run_cnt 41576592 = 257.48 ns
 - diff: 325.95 - 257.48 = 68.47 ns better

name netstacklat_skb_consume_udp
 - run_time_ns 3716653454 run_cnt 8499677 = 437.27 ns
 - diff: 496.10 - 437.27 = 58.83 ns better

On this AMD CPU with the SRSO mitigation enabled, we have extra overhead on BPF helper calls.
The filter_cgroup() for type cgrp_storage has two extra helper calls,
bpf_get_current_task_btf() and bpf_cgrp_storage_get(), but we eliminated the
bpf_get_current_cgroup_id() helper call. Still, this doesn't fully account for the
diff. That said, if a BPF prog already has the task_struct available, then the
bpf_get_current_task_btf() call can also be eliminated, so type cgrp_storage might
still be useful in that case.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
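For comparison, a sketch of the hash-based variant that won here; the map name and sizing are assumptions, while the two helper calls are the ones named above:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);	/* cgroup id */
	__type(value, __u8);	/* presence marks an allowed cgroup */
} cgroup_id_map SEC(".maps");

static bool filter_cgroup(__u64 *cgroup_id)
{
	/* Two helper calls total; the id is also returned to the caller
	 * because groupby_cgroup needs it for the histogram key. */
	*cgroup_id = bpf_get_current_cgroup_id();
	return bpf_map_lookup_elem(&cgroup_id_map, cgroup_id) != NULL;
}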
If we have disabled network_ns filtering, the code still does a lookup of the
current namespace by calling get_network_ns().

Reorganize the code to avoid this call when the feature is disabled.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 10623365578 run_cnt 44842812 = 236.90 ns
 - diff: 257.48 - 236.90 = 20.58 ns better

name netstacklat_skb_consume_udp
 - run_time_ns 3718153230 run_cnt 9902613 = 375.47 ns
 - diff: 437.27 - 375.47 = 61.80 ns better

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
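A sketch of the reorganized check; get_network_ns() is shown as the common CO-RE read of the netns inode number, and the user_config fields are assumptions:

#include "vmlinux.h"
#include <bpf/bpf_core_read.h>

static __u32 get_network_ns(const struct sock *sk)
{
	return BPF_CORE_READ(sk, __sk_common.skc_net.net, ns.inum);
}

static bool filter_network_ns(const struct sock *sk)
{
	/* Early return: with the feature disabled we never pay for the
	 * namespace lookup above. */
	if (!user_config.filter_network_ns)
		return true;

	return get_network_ns(sk) == user_config.network_ns;
}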
Let us control filter_queue_len() via a separate user_config setting.
Also enable this for UDP sockets.

Notice that filter_cgroup() still runs first, so it is the primary
filter. And filter_nonempty_sockqueue is still false.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 15661530364 run_cnt 94922963 = 164.99 ns

name netstacklat_skb_consume_udp
 - run_time_ns 3255451250 run_cnt 14532586 = 224.01 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Verified that performance is the same as with the previous patch.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
It is more efficient to first filter out sockets with empty or small queues.
This is a tradeoff, as our production use case is to capture when the system
becomes overloaded.

name netstacklat_tcp_recv_timestamp
 - run_time_ns 15347880145 run_cnt 177786472 = 86.32 ns

name netstacklat_skb_consume_udp
 - run_time_ns 3096529442 run_cnt 33903931 = 91.33 ns

The performance gain is huge. Do remember that this is the average runtime cost
that is reduced, because we can skip recording many of these events.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
@simosund

Changing the filter_cgroup() to use the cgroup_id_map type hash.
Surprisingly this type seems to be faster than type cgrp_storage.

name netstacklat_tcp_recv_timestamp

  • run_time_ns 10705407914 run_cnt 41576592 = 257.48 ns
  • diff: 325.95 - 257.48 = 68.47 ns better

name netstacklat_skb_consume_udp

  • run_time_ns 3716653454 run_cnt 8499677 = 437.27 ns
  • diff: 496.10 - 437.27 = 58.83 ns better

On this AMD CPU with the SRSO mitigation enabled, we have extra overhead on BPF helper calls.
The filter_cgroup() for type cgrp_storage has two extra helper calls,
bpf_get_current_task_btf() and bpf_cgrp_storage_get(), but we eliminated the
bpf_get_current_cgroup_id() helper call. Still, this doesn't fully account for the
diff. That said, if a BPF prog already has the task_struct available, then the
bpf_get_current_task_btf() call can also be eliminated, so type cgrp_storage might
still be useful in that case.

Hmm, that the cgroup storage was slower was a bit unexpected.

And shouldn't it even be the same number of helper calls? You essentially trade bpf_get_current_cgroup_id() for bpf_get_current_task() and bpf_map_lookup_elem() for bpf_cgrp_storage_get().

struct task_struct *task = bpf_get_current_task_btf();
struct cgroup *cgrp = task->cgroups->dfl_cgrp;

return bpf_cgrp_storage_get(&netstack_cgroupfilter, cgrp, 0, 0) != NULL;

One thought here. In the example cgroup_id_map_cgrp_storage.bpf.c that PR cloudflare#565 added, they implement their lookup/check slightly differently from you.

Instead of just checking if the cgrp storage exists for the given cgroup, in the example they zero-initialize it if it doesn't exist, and then check if the content is non-zero. To me that seems like it should be more work than just checking for existence as you do. However, if a cgrp lookup miss takes significantly longer than a lookup hit for some reason, perhaps the way they do it in the example makes sense. Might be worth trying at least.
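For reference, a minimal sketch of the lookup style described above, reusing the netstack_cgroupfilter map from the earlier snippet; the value semantics (0 = not allowed, non-zero = allowed) are an assumption:

static bool filter_cgroup_create_style(void)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct cgroup *cgrp = task->cgroups->dfl_cgrp;
	__u64 *val;

	/* Zero-initializes the per-cgroup storage on a miss instead of
	 * returning NULL ... */
	val = bpf_cgrp_storage_get(&netstack_cgroupfilter, cgrp, 0,
				   BPF_LOCAL_STORAGE_GET_F_CREATE);

	/* ... so the decision is made on the stored value, not on existence */
	return val && *val != 0;
}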

@netoptimizer (Owner, Author) Aug 27, 2025

I tried it again and benchmarked this on a production workload.

  1. First line is: Type: hash = 183 ns
  2. Second line is: Type: cgrp_storage = 240 ns with zero as flags.
  3. Last line is: Type: cgrp_storage = 248 ns with BPF_LOCAL_STORAGE_GET_F_CREATE (and comparing value 0 vs 1)
[image: runtime graph showing the three variants listed above]

Conclusion is clear: the cgrp_storage type is slower.

That cgrp storage was not only slower, but that much slower (~60 ns), is a bit surprising (and disappointing). But thanks for doing the tests; now we at least know!

I see that you have it running with some dashboard at least =)

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Our checks for empty or almost empty sockets were wrong,
because sockets also have a backlog queue.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
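A sketch of one way to account for the backlog queue in the emptiness check; whether upstream tests exactly these two fields is an assumption:

#include "vmlinux.h"
#include <bpf/bpf_core_read.h>

static bool sock_queues_nonempty(const struct sock *sk)
{
	/* Packets can sit on the socket backlog while sk_receive_queue is
	 * still empty, which is what made the earlier check wrong. */
	return BPF_CORE_READ(sk, sk_receive_queue.qlen) > 0 ||
	       BPF_CORE_READ(sk, sk_backlog.len) > 0;
}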
This changes the test to be 'ge' (greater-than-or-equal).
We want the ability to specify 1, meaning a queue size of one
and above. Before, 1 meant a queue size of 2 and above.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Let record_skb_latency() also use it.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
For production usage, we want the ability to disable the UDP
hooks.  Introduce CONFIG_xxx_HOOKS for IP, UDP and TCP.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
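A sketch of how such CONFIG_xxx_HOOKS gating can look; the macro spelling and the hook shown here are assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

#define CONFIG_TCP_HOOKS 1
/* #define CONFIG_UDP_HOOKS 1 */	/* disabled for this production deploy */
/* #define CONFIG_IP_HOOKS 1 */

#ifdef CONFIG_UDP_HOOKS
SEC("fentry/skb_consume_udp")
int BPF_PROG(netstacklat_skb_consume_udp, struct sock *sk, struct sk_buff *skb)
{
	/* UDP dequeue-latency recording elided */
	return 0;
}
#endif /* CONFIG_UDP_HOOKS */

char LICENSE[] SEC("license") = "GPL";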
For production we need to reduce the number of Prometheus
histogram buckets for this metric.  As the dequeue hooks are unlikely
to see sub-microsecond latencies, we reduce the resolution to usecs.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
When importing this into ebpf_exporter we need to reformat
the C code, which is done via the command line:

 clang-format -i configs/netstacklat.{h,bpf.c}

But it doesn't convert these macros and comments.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
We are currently only using a single hook, so change N_HOOKS,
which is used in the max_entries calculation for netstack_latency_seconds.

Detect if other hooks get enabled and make the compile fail.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
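A sketch of such a compile-time guard, reusing the CONFIG_xxx_HOOKS naming from the earlier commit; the exact condition is an assumption:

/* Only the TCP hook is compiled in, so the histogram map is sized for one
 * hook; break the build if someone re-enables the others without updating
 * N_HOOKS and the netstack_latency_seconds max_entries calculation. */
#define N_HOOKS 1

#if defined(CONFIG_IP_HOOKS) || defined(CONFIG_UDP_HOOKS)
#error "N_HOOKS assumes only the TCP hook is enabled; update the max_entries calc"
#endif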
We unfortunately need atomic counter updates for the nth-filter.
This is because hooks like tcp-socket-read run outside the socket
lock in a preempt/migrate-able user context.

We don't need an accurate nth-counter across CPUs, as this is
just a down-sampling mechanism.  Thus, we keep the PERCPU
array map and have the nth-counter on a per-CPU basis.  The
trick here is that in most cases the counter is only used
by the currently running CPU, and the cache line will mostly
be in a cache-coherency Exclusive/Modified (MOESI) state,
which costs less when doing atomic updates.

Manual testing in production showed a 7 ns runtime increase
(before 150.88 ns, after 157.67 ns).

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
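A sketch of the per-CPU atomic nth-sampler described above; the map name and how the sampling period is supplied are assumptions:

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} nth_counter SEC(".maps");

/* period 32 => sample ~3.12%, period 2 => sample 50% */
static bool sample_nth(__u64 period)
{
	__u32 key = 0;
	__u64 *cnt = bpf_map_lookup_elem(&nth_counter, &key);

	if (!cnt)
		return false;

	/* Atomic add: the hook can run in a preempt/migrate-able context, so
	 * two tasks can race on the same per-CPU slot.  Usually the cache
	 * line stays Exclusive/Modified on the local CPU, keeping this cheap. */
	return (__sync_fetch_and_add(cnt, 1) % period) == 0;
}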
Sample every 2nd packet (50%)
 - Overhead: 6039419756 / 32219314 = 187.44 ns

Compared to local atomic-nth overhead: 157.67 ns
 - approx 30 ns extra cost to sample 50% vs 3.12%

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
name netstacklat_tcp_recv_timestamp
 - run_time_ns 17510185954 run_cnt 101083454 = 173.23 ns

Sample every 4th packet (25%)
 - Overhead: 173.23 ns
 - Compared to nth-2 (187.44 ns) saved 14.21 ns (187.44-173.23)
 - Compared to nth-32 (157.67 ns) cost 15.56 ns more (173.23-157.67)

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
name netstacklat_tcp_recv_timestamp
 - run_time_ns 24383044912 run_cnt 121125888 = 201.30 ns

Compared to
 - nth-2  : 186 ns -> +15 ns
 - nth-4  : 173 ns -> +28 ns
 - nth-32 : 157 ns -> +44 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Because production has too little traffic on the internal
interface, the latency stats become unusable.

Via CONFIG_GROUPBY_IFINDEX, also remove this from the
hist_key; note that the YAML file also needs adjustments.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
The upstream version of netstacklat that we are based on
got merged; see PR#129.

xdp-project/bpf-examples#129

Some adjustments were made, so let's sync with these
to avoid diverging too much from upstream.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
As filter_min_sockqueue_len can replace it.

This was also part of the PR#129 merge, but keeping this in a
separate commit makes it easier to review.

xdp-project/bpf-examples#129

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Lacking a YAML ebpf_exporter config for selecting iface names,
we hard-coded the ifindex, but some production servers have a
higher ifindex for vlan100.

Relax the ifindex range as a workaround.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>