Netstacklat: Evaluation porting Simon's netstacklat to ebpf_exporter #1
Conversation
This is a direct copy from bpf-examples but from Simon's devel branch 'netstacklat-groupby' https://github.com/simosund/bpf-examples/tree/netstacklat-groupby/netstacklat Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Keeping this as a separate commit to track what needed to change Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Gotcha#1: My devel laptop has TAI offset zero - Other systems (incl prod) all have 37 sec Gotcha#2: RX-timestamping needs to be enabled manually - something else has to enable RX-timestamping Signed-off-by: Jesper Dangaard Brouer <[email protected]>
For ebpf_exporter we cannot control which BPF sections get loaded. Instead we disable some of the hooks at compile time via defines/ifdefs. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
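A minimal sketch of the compile-time gating idea, assuming a define name like CONFIG_UDP_HOOKS and a hypothetical recording helper; the actual define names and hook bodies in the patch may differ.

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

/* Sketch: gate a whole hook behind a compile-time define, so the disabled
 * BPF section never exists and ebpf_exporter has nothing to load for it. */
#ifdef CONFIG_UDP_HOOKS
SEC("fentry/skb_consume_udp")
int BPF_PROG(netstacklat_skb_consume_udp, struct sock *sk,
	     struct sk_buff *skb, int len)
{
	record_latency_since_tstamp(sk, skb); /* hypothetical helper */
	return 0;
}
#endif /* CONFIG_UDP_HOOKS */
```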
Instead hardcode ifindex limits based on the prod setup, as we don't have a way to configure this via YAML. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Execution runtime: netstacklat_tcp_recv_timestamp - run_time_ns 18885872401 run_cnt 71,443,820 = 264.34 ns Execution runtime: netstacklat_skb_consume_udp - run_time_ns 6124797061 run_cnt 16,324,309 = 375.19 ns Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Something is buggy with this filter - all latency records are in the max bucket. The READ_ONCE change doesn't fix the issue Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
This was the real reason for seeing wrong numbers in prod. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
…sync Execution runtime: netstacklat_tcp_recv_timestamp - run_time_ns 32164560127 run_cnt 116,590,498 = 275.88 ns Execution runtime: netstacklat_skb_consume_udp - run_time_ns 10490230543 run_cnt 23,993,428 = 437.21 ns Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Moved filter_nonempty_sockqueue to callers - because record_socket_latency() becomes a BPF function-call - perf e.g. shows bpf_prog_fb69587c6ea462b7_record_socket_latency Execution runtime: netstacklat_tcp_recv_timestamp - run_time_ns 181391511590 run_cnt 788663546 = 229.99 ns Execution runtime: netstacklat_skb_consume_udp - run_time_ns 16212598612 run_cnt 137812779 = 117.64 ns This is clearly a huge improvement for UDP packets. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Enable filter_nonempty_sockqueue at compile time to make sure that this config setting doesn't influence performance. I'm only seeing a small effect. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
The filter_nonempty_sockqueue() is efficient for UDP packets, but doesn't work well for TCP packets. Add filtering on socket queue length. Try filtering out events when qlen is not above 3 packets for TCP. Execution runtime: netstacklat_tcp_recv_timestamp - run_time_ns 10690540076 run_cnt 117852699 = 90.71 ns Execution runtime: netstacklat_skb_consume_udp - run_time_ns 2206621632 run_cnt 20004338 = 110.30 ns This is a HUGE improvement for the TCP case. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
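A rough sketch of the queue-length check, assuming the usual netstacklat includes; the function name, threshold handling and the CO-RE field access are illustrative rather than the exact code in this commit.

```c
#include <bpf/bpf_core_read.h>

/* Only record latency when the socket receive queue already holds at
 * least min_qlen packets, i.e. when the system is starting to back up. */
static __always_inline bool filter_queue_len(struct sock *sk, __u32 min_qlen)
{
	__u32 qlen = BPF_CORE_READ(sk, sk_receive_queue.qlen);

	return qlen >= min_qlen;
}
```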
Leverage BPF_MAP_TYPE_CGRP_STORAGE for our cgroup filter. To evaluate the two different cgroup_id_map types, the macro CONFIG_CGRP_STORAGE is introduced, allowing us to switch implementations at compile time. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
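For reference, a sketch of what the cgrp_storage variant of the filter looks like; the map name matches the code quoted later in this thread, but the key/value types and function signature here are assumptions.

```c
struct {
	__uint(type, BPF_MAP_TYPE_CGRP_STORAGE);
	__uint(map_flags, BPF_F_NO_PREALLOC);
	__type(key, int);
	__type(value, __u64);
} netstack_cgroupfilter SEC(".maps");

#ifdef CONFIG_CGRP_STORAGE
static __always_inline bool filter_cgroup(void)
{
	struct task_struct *task = bpf_get_current_task_btf();
	struct cgroup *cgrp = task->cgroups->dfl_cgrp;

	/* An existing storage entry for this cgroup means "include it" */
	return bpf_cgrp_storage_get(&netstack_cgroupfilter, cgrp, 0, 0) != NULL;
}
#endif
```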
The ebpf_exporter variant of netstacklat is not runtime configurable at BPF-load time. Thus, the user_config below isn't defined as 'volatile'; instead the 'const' allows the compiler to do dead-code elimination. We also disable user_config.filter_nonempty_sockqueue as we want to stress the cgroup lookup types some more. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
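A small sketch of the const vs. 'const volatile' point; the struct and member names are assumptions, not the exact config layout.

```c
struct netstacklat_bpf_config { /* illustrative struct and members */
	bool filter_nonempty_sockqueue;
	bool filter_cgroup;
};

/* bpf-examples style: 'const volatile' lands in .rodata and can be
 * patched by the loader before attach, i.e. runtime configurable:
 *   const volatile struct netstacklat_bpf_config user_config;
 *
 * ebpf_exporter style: plain 'const' with fixed initializers, so the
 * compiler can constant-fold the checks and drop disabled code paths: */
static const struct netstacklat_bpf_config user_config = {
	.filter_nonempty_sockqueue = false,
	.filter_cgroup = true,
};
```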
This required changing the call signature of filter_cgroup() to also populate the cgroup_id, as that is used for groupby_cgroup. We are still using CONFIG_CGRP_STORAGE, which leverages BPF_MAP_TYPE_CGRP_STORAGE. Without the queue length filters (filter_nonempty_sockqueue) we are getting more calls. This is on purpose, to evaluate the cgroup_id_map types. name netstacklat_tcp_recv_timestamp - run_time_ns 17953390952 run_cnt 55079620 = 325.95 ns netstacklat_skb_consume_udp - run_time_ns 5779869863 run_cnt 11650472 = 496.10 ns Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Changing the filter_cgroup() to use the cgroup_id_map type hash. Surprisingly this type seems to be faster than type cgrp_storage. name netstacklat_tcp_recv_timestamp - run_time_ns 10705407914 run_cnt 41576592 = 257.48 ns - diff: 325.95 - 257.48 = 68.47 ns better name netstacklat_skb_consume_udp - run_time_ns 3716653454 run_cnt 8499677 = 437.27 ns - diff: 496.10 - 437.27 = 58.83 ns better On this AMD CPU with SRSO enabled, we have extra overhead on BPF helper calls. The filter_cgroup() for type cgrp_storage has two extra helper calls, bpf_get_current_task_btf() and bpf_cgrp_storage_get(), but we eliminated the bpf_get_current_cgroup_id() helper call. Still, this doesn't fully account for the diff. That said, if a BPF-prog already has the task_struct available then the bpf_get_current_task_btf() can also be eliminated, so type cgrp_storage might still be useful in that case. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
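For comparison, a sketch of the hash-map variant that won here; the map name follows the 'cgroup_id_map' wording above, while the sizing, value type and signature are assumptions.

```c
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64); /* cgroup id */
	__type(value, __u8);
} cgroup_id_map SEC(".maps");

#ifndef CONFIG_CGRP_STORAGE
static __always_inline bool filter_cgroup(__u64 *cgroup_id)
{
	/* One helper call for the cgroup id plus one map lookup;
	 * the id is also passed back for groupby_cgroup. */
	*cgroup_id = bpf_get_current_cgroup_id();

	return bpf_map_lookup_elem(&cgroup_id_map, cgroup_id) != NULL;
}
#endif
```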
If we have disabled network_ns filtering, the code still does a lookup of the current namespace by calling get_network_ns(). Reorganize the code to avoid this call when the feature is disabled. name netstacklat_tcp_recv_timestamp - run_time_ns 10623365578 run_cnt 44842812 = 236.90 ns - diff: 257.48 - 236.90 = 20.58 ns better name netstacklat_skb_consume_udp - run_time_ns 3718153230 run_cnt 9902613 = 375.47 ns - diff: 437.27 - 375.47 = 61.80 ns better Signed-off-by: Jesper Dangaard Brouer <[email protected]>
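A sketch of the reorganization; it assumes a boolean user_config knob and reuses the existing get_network_ns() helper, whose exact signature and the config member names are assumptions here.

```c
static __always_inline bool filter_network_ns(struct sock *sk)
{
	/* Feature disabled: accept immediately, skipping the namespace lookup */
	if (!user_config.filter_network_ns)
		return true;

	return get_network_ns(sk) == user_config.network_ns; /* illustrative */
}
```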
Let us control the filter_queue_len() via a separate user_config. Also enable this for UDP sockets. Notice the filter_cgroup() is still running first, so this is the primary filter. And filter_nonempty_sockqueue is still false. name netstacklat_tcp_recv_timestamp - run_time_ns 15661530364 run_cnt 94922963 = 164.99 ns name netstacklat_skb_consume_udp - run_time_ns 3255451250 run_cnt 14532586 = 224.01 ns Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Verified performance is the same as the previous patch. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
It is more efficient to first filter out sockets with empty or small queues. This is a tradeoff as our production use-case is to capture when the system becomes overloaded. name netstacklat_tcp_recv_timestamp - run_time_ns 15347880145 run_cnt 177786472 = 86.32 ns name netstacklat_skb_consume_udp - run_time_ns 3096529442 run_cnt 33903931 = 91.33 ns The performance gain is huge. Do remember that this is the average runtime cost that is reduced, because we can skip recording many of these events. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Hmm, that the cgroup storage was slower was a bit unexpected. And shouldn't it even be the same number of helper calls? You essentially trade

```c
struct task_struct *task = bpf_get_current_task_btf();
struct cgroup *cgrp = task->cgroups->dfl_cgrp;

return bpf_cgrp_storage_get(&netstack_cgroupfilter, cgrp, 0, 0) != NULL;
```
One thought here. In the example cgroup_id_map_cgrp_storage.bpf.c that PR cloudflare#565 added, they implement their lookup/check slightly differently from you.
Instead of just checking if the cgrp storage exists for the given cgroup, in the example they zero-initialize it if it doesn't exist, and then check if the content is non-zero. To me that seems like it should be more work than just checking for existence as you do. However, if a cgrp lookup miss takes significantly longer than a lookup hit for some reason, perhaps the way they do it in the example makes sense. Might be worth trying at least.
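For reference, the variant being described would look roughly like this; the names reuse the code quoted above, and the create-on-miss plus non-zero-value semantics follow the cloudflare#565 example as described here.

```c
static __always_inline bool filter_cgroup_create_on_miss(struct cgroup *cgrp)
{
	__u64 *val;

	/* Create a zero-initialized entry on miss, then require a non-zero
	 * value (set by userspace) for the cgroup to pass the filter. */
	val = bpf_cgrp_storage_get(&netstack_cgroupfilter, cgrp, 0,
				   BPF_LOCAL_STORAGE_GET_F_CREATE);
	return val && *val != 0;
}
```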
I tried it again and benchmarked this on production workload.
- First line is: Type: hash = 183 ns
- Second line is: Type: cgrp_storage = 240 ns with zero as flags.
- Last line is: Type: cgrp_storage = 248 ns with BPF_LOCAL_STORAGE_GET_F_CREATE (and comparing value 0 vs 1)
Conclusion is clear: the cgrp_storage type is slower.
That cgrp storage was not only slower, but that much slower (~60ns), is a bit surprising (and disappointing). But thanks for doing the tests, then we at least know!
See that you have it running with some dashboard at least =)
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Our checks for empty or almost empty sockets were wrong, because sockets also have a backlog queue. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
This changes the test to be 'ge' (greater-than-or-equal). We want the ability to specify 1, meaning a queue size of one and above. Before, 1 meant queue size 2 and above. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Let record_skb_latency() also use it. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
For production usage, we want the ability to disable the UDP hooks. Introduce CONFIG_xxx_HOOKS for IP, UDP and TCP. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
For production we need to reduce the number of Prometheus buckets for the metric. As the dequeue hooks are unlikely to see sub-usec latencies, we reduce the resolution to usecs. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
When importing this into ebpf_exporter we need to reformat
the C-code, which is done via command line:
clang-format -i configs/netstacklat.{h,bpf.c}
But it doesn't convert these macros and comments.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
We are currently only using a single hook, so change N_HOOKS that is used in the max_entries calculation of netstack_latency_seconds. Detect if other hooks get enabled and make the compile fail. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
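A sketch of the guard, assuming the TCP hook is the single one left enabled (as in the later measurements); the define names mirror the CONFIG_xxx_HOOKS introduced above but the exact wording is illustrative.

```c
/* The netstack_latency_seconds map is sized for a single hook */
#define N_HOOKS 1

#if defined(CONFIG_IP_HOOKS) || defined(CONFIG_UDP_HOOKS)
#error "N_HOOKS and netstack_latency_seconds max_entries only account for the TCP hook"
#endif
```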
We unfortunately need an atomic counter update for the nth-filter. This is because hooks like tcp-socket-read run outside the socket lock in a preempt/migrate-able user context. We don't need an accurate nth-counter across CPUs, as this is just a down-sampling mechanism. Thus, we keep the PERCPU array map and have the nth-counter on a per-CPU basis. The trick here is that in most cases the counter is only used by the currently running CPU, and the cache-line will mostly be in a cache coherency Exclusive/Modified (MOESI) state, which costs less when doing atomic updates. Manual testing in production showed a 7 ns runtime increase (before 150.88 ns, after 157.67 ns). Signed-off-by: Jesper Dangaard Brouer <[email protected]>
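A sketch of the per-CPU nth-counter with the atomic increment described above; the map and function names are illustrative.

```c
struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, 1);
	__type(key, __u32);
	__type(value, __u64);
} netstack_nth_counter SEC(".maps");

/* Returns true for every nth call on this CPU. The increment is atomic
 * because the hook can be preempted/migrated mid-update, but the counter
 * itself stays per-CPU so the cache line usually stays local. */
static __always_inline bool sample_every_nth(__u32 nth)
{
	__u32 key = 0;
	__u64 *cnt;

	cnt = bpf_map_lookup_elem(&netstack_nth_counter, &key);
	if (!cnt)
		return false;

	return (__sync_fetch_and_add(cnt, 1) % nth) == 0;
}
```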
Sample every 2-nth packet (50%) - Overhead: 6039419756 / 32219314 = 187.44 ns Compared to local atomic-nth overhead: 157.67 ns - approx 30 ns extra cost to sample 50% vs 3.12% Signed-off-by: Jesper Dangaard Brouer <[email protected]>
name netstacklat_tcp_recv_timestamp - run_time_ns 17510185954 run_cnt 101083454 = 173.23 ns Sample every 4-nth packet (25%) - Overhead: 173.23 ns - Compared to nth-2 (187.44 ns) saved 14.21 ns (187.44-173.23) - Compared to nth-32 (157.67 ns) cost 15.56 ns more (173.23-157.67) Signed-off-by: Jesper Dangaard Brouer <[email protected]>
name netstacklat_tcp_recv_timestamp - run_time_ns 24383044912 run_cnt 121125888 = 201.30 ns Compared to - nth-2 : 186 ns -> +15 ns - nth-4 : 173 ns -> +28 ns - nth-32 : 157 ns -> +44 ns Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Because production has too little traffic on the internal interface, the latency stats become unusable. Via CONFIG_GROUPBY_IFINDEX also remove this from the hist_key; note that the YAML file also needs adjustments. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
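Roughly what dropping the ifindex dimension looks like; the other hist_key members shown here are assumptions, and the matching label list in the YAML config has to be trimmed accordingly.

```c
struct hist_key {
	__u64 cgroup_id;
#ifdef CONFIG_GROUPBY_IFINDEX
	__u32 ifindex; /* compiled out when not grouping by interface */
#endif
	__u16 hook;    /* illustrative remaining key members */
	__u64 bucket;
};
```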
The upstream version of netstacklat that we are based on got merged, see PR#129. xdp-project/bpf-examples#129 Some adjustments were made, so let's sync with these to avoid diverging too much from upstream. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
As the filter_min_sockqueue_len can replace it. This was also part of the PR#129 merge, but keeping this in a separate commit makes it easier to review. xdp-project/bpf-examples#129 Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Lacking a YAML ebpf_exporter config for selecting iface names, we hard-coded the ifindex, but some production servers have a higher ifindex for vlan100. Relax the ifindex range as a workaround. Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Created this PR so @simosund can see the changes and evaluation numbers from prod experiment.
Based on code from PR xdp-project/bpf-examples#129