
Commit 885f101

simosund authored and netoptimizer committed
netstacklat: Exclude TCP reads for HOL blocked segments
The 'tcp-socket-read' hook currently reports the latency for the skb
containing the last TCP segment read from the socket. However, this
segment might have been head-of-line (HOL) blocked by a previous
missing segment. In that case, netstacklat's reported latency will
include HOL-blocking periods that depend on external factors (such as
network packet loss and network latency, which affect retransmission
time). As netstacklat is primarily intended to identify issues within
the local host (in the network stack or receiving applications), by
default filter out any socket reads where the last read SKB might have
experienced HOL-blocking.

Add the new -y/--include-tcp-hol-delay option to retain the old
behavior of reporting latency for all reads, including those that are
HOL-blocked. This may be useful in scenarios where you still want to be
aware of latency issues caused by HOL-blocking, even though it is
caused by external components. For example, in a data center context
where you have full control over the network, it may still be relevant
to monitor HOL-blocking delays caused by the network.

To exclude HOL-blocked reads, detect whether any new ooo-segments have
arrived by checking for a change in the number of ooo-packets in
tcp_sock->rcv_ooopack. If any new ooo-segments have arrived, exclude
the latency sample for the current read and set a limit for the next
safe sequence number to read, past which the current ooo-packets must
have been consumed so that segments can no longer be HOL-blocked. If
there are skbs in the ooo-queue, set the limit to the end of the
ooo-queue. Otherwise, set the limit to the current rcv_nxt (if the
ooo-queue is empty, the detected ooo-segments must already have been
merged into the receive queue and rcv_nxt must have advanced past
them). If the read is past the safe sequence limit and no new
ooo-segments have arrived, it is safe to start including the latency
samples again.

For sockets where some ooo-segments have been observed, keep the
ooo-range state in socket storage (BPF_MAP_TYPE_SK_STORAGE). Skip
protecting this state with a spin-lock, as it should only be
concurrently accessed if there are concurrent reads on the same TCP
socket, which is assumed to be very rare since applications attempting
that cannot know which part of the data each of their concurrent reads
will get.

There are some scenarios that may cause this ooo-filtering to fail:

- If multiple reads are done on the socket concurrently, we may not
  correctly track the last read byte. The kernel does not hold a lock
  on the TCP socket at the time our hooked function
  tcp_recv_timestamp() runs. If two reads are done in parallel, it is
  therefore possible that for both reads we check the last read byte
  (tcp_sock.copied_seq) after the second read has updated it. We may
  then incorrectly conclude that the first read was ahead of the
  ooo-range when it was not, and record its latency when we should have
  excluded it. In practice I believe this issue should be quite rare,
  as most applications will probably not attempt multiple concurrent
  reads on a single connected TCP socket (as then they cannot know
  which part of the payload each parallel read will return).

- As tcp_recv_timestamp() runs outside of the socket lock, the various
  state members we access may be updated concurrently while we are
  attempting to read them. An especially problematic one is
  tcp_sock.ooo_last_skb, which keeps a pointer to an SKB that is only
  valid while the ooo-queue is non-empty. It is possible that between
  our check for whether the ooo-queue is non-empty and following the
  ooo_last_skb pointer, the ooo-queue is cleared and the ooo_last_skb
  pointer may end up pointing to a freed SKB. If the socket members we
  access are updated before or while we read them, it can break the
  filtering in numerous ways, e.g. including samples that should have
  been excluded (due to e.g. copied_seq being updated before our read)
  or excluding a large number of valid samples (due to e.g. setting a
  sequence limit based on garbage in a freed SKB).

Signed-off-by: Simon Sundberg <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
1 parent ba856d8 commit 885f101

File tree

2 files changed: +129 -0 lines changed
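The safe-sequence-limit check described in the commit message has to survive 32-bit TCP sequence number wrap-around, which is why the patch adds a u32_lt() helper modelled on the kernel's before(). The following stand-alone user-space snippet is only an illustration of that comparison with made-up sequence numbers; it is not part of the patch:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Is a < b, taking u32 sequence number wrap-around into account?
 * Same idea as before() in /include/net/tcp.h and the u32_lt() helper
 * added by this commit: the signed difference treats anything less
 * than half the sequence space ahead of 'a' as being "after" it. */
static bool u32_lt(uint32_t a, uint32_t b)
{
        return (int32_t)(a - b) < 0;
}

int main(void)
{
        /* Plain case: limit 1000 is below copied_seq 2000, so a read
         * that has copied up to 2000 is past the ooo range. */
        printf("%d\n", u32_lt(1000u, 2000u));        /* prints 1 */

        /* Wrap-around case: the ooo range ended just before the u32
         * wrap (0xffffff00) and copied_seq has wrapped to 0x100; the
         * read is still correctly considered past the limit. */
        printf("%d\n", u32_lt(0xffffff00u, 0x100u)); /* prints 1 */

        /* Not yet past: copied_seq 500 is still before the limit. */
        printf("%d\n", u32_lt(1000u, 500u));         /* prints 0 */

        return 0;
}

In the BPF code below the same comparison is done as u32_lt(ooo_range->ooo_seq_end, read_seq), i.e. the sample is included again once copied_seq has moved past the recorded end of the ooo range.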

examples/netstacklat.bpf.c

Lines changed: 128 additions & 0 deletions
@@ -21,6 +21,10 @@
 
 #define READ_ONCE(x) (*(volatile typeof(x) *)&(x))
 
+// Mimic macros from /include/net/tcp.h
+#define tcp_sk(ptr) container_of(ptr, struct tcp_sock, inet_conn.icsk_inet.sk)
+#define TCP_SKB_CB(__skb) ((struct tcp_skb_cb *)&((__skb)->cb[0]))
+
 char LICENSE[] SEC("license") = "GPL";
 
 /* The ebpf_exporter variant of netstacklat is not runtime configurable at
@@ -37,6 +41,7 @@ const struct netstacklat_bpf_config user_config = {
         .filter_cgroup = true,
         .groupby_ifindex = false, /* If true also define CONFIG_GROUPBY_IFINDEX */
         .groupby_cgroup = true,
+        .include_hol_blocked = false,
 };
 
 /* This provide easy way compile-time to disable some hooks */
@@ -86,6 +91,13 @@ struct sk_buff___old {
 #err "Please update N_HOOKS"
 #endif
 
+struct tcp_sock_ooo_range {
+        u32 prev_n_ooopkts;
+        u32 ooo_seq_end;
+        /* indicates if ooo_seq_end is still valid (as 0 can be valid seq) */
+        bool active;
+};
+
 struct {
         __uint(type, BPF_MAP_TYPE_PERCPU_HASH);
         __uint(max_entries, HIST_NBUCKETS * N_HOOKS * N_CGROUPS * N_IFACES);
@@ -151,6 +163,22 @@ static ktime_t time_since(ktime_t tstamp)
         return (now - tstamp) / LATENCY_SCALE;
 }
 
+/*
+ * Is a < b considering u32 wrap around?
+ * Based on the before() function in /include/net/tcp.h
+ */
+static bool u32_lt(u32 a, u32 b)
+{
+        return (s32)(a - b) < 0;
+}
+
+struct {
+        __uint(type, BPF_MAP_TYPE_SK_STORAGE);
+        __uint(map_flags, BPF_F_NO_PREALLOC);
+        __type(key, int);
+        __type(value, struct tcp_sock_ooo_range);
+} netstack_tcp_ooo_range SEC(".maps");
+
 /* Determine if ebpf_exporter macro or local C implementation is used */
 #define CONFIG_MAP_MACROS 1
 #ifdef CONFIG_MAP_MACROS
@@ -476,6 +504,102 @@ static __always_inline bool filter_socket(struct sock *sk, struct sk_buff *skb,
 }
 #endif
 
+static int current_max_possible_ooo_seq(struct tcp_sock *tp, u32 *seq)
+{
+        struct tcp_skb_cb cb;
+        u32 max_seq = 0;
+        int err = 0;
+
+        if (BPF_CORE_READ(tp, out_of_order_queue.rb_node) == NULL) {
+                /* No ooo-segments currently in ooo-queue
+                 * Any ooo-segments must already have been merged to the
+                 * receive queue. Current rcv_nxt must therefore be ahead
+                 * of all ooo-segments that have arrived until now.
+                 */
+                err = bpf_core_read(&max_seq, sizeof(max_seq), &tp->rcv_nxt);
+                if (err)
+                        bpf_printk("failed to read tcp_sock->rcv_nxt, err=%d",
+                                   err);
+        } else {
+                /*
+                 * Some ooo-segments currently in ooo-queue
+                 * Max out-of-order seq is given by the seq_end of the tail
+                 * skb in the ooo-queue.
+                 */
+                err = BPF_CORE_READ_INTO(&cb, tp, ooo_last_skb, cb);
+                if (err)
+                        bpf_printk(
+                                "failed to read tcp_sock->ooo_last_skb->cb, err=%d",
+                                err);
+                max_seq = cb.end_seq;
+        }
+
+        *seq = max_seq;
+        return err;
+}
+
+static bool tcp_read_in_ooo_range(struct tcp_sock *tp,
+                                  struct tcp_sock_ooo_range *ooo_range)
+{
+        u32 read_seq;
+        int err;
+
+        if (!ooo_range->active)
+                return false;
+
+        err = bpf_core_read(&read_seq, sizeof(read_seq), &tp->copied_seq);
+        if (err) {
+                bpf_printk("failed to read tcp_sock->copied_seq, err=%d", err);
+                return true; // Assume we may be in ooo-range
+        }
+
+        if (u32_lt(ooo_range->ooo_seq_end, read_seq)) {
+                ooo_range->active = false;
+                return false;
+        } else {
+                return true;
+        }
+}
+
+static bool tcp_read_maybe_holblocked(struct sock *sk)
+{
+        struct tcp_sock_ooo_range *ooo_range;
+        struct tcp_sock *tp = tcp_sk(sk);
+        u32 n_ooopkts, nxt_seq;
+        int err;
+
+        err = bpf_core_read(&n_ooopkts, sizeof(n_ooopkts), &tp->rcv_ooopack);
+        if (err) {
+                bpf_printk("failed to read tcp_sock->rcv_ooopack, err=%d\n",
+                           err);
+                return true; // Assume we may be in ooo-range
+        }
+
+        if (n_ooopkts == 0)
+                return false;
+
+        ooo_range = bpf_sk_storage_get(&netstack_tcp_ooo_range, sk, NULL,
+                                       BPF_SK_STORAGE_GET_F_CREATE);
+        if (!ooo_range) {
+                bpf_printk(
+                        "failed getting ooo-range socket storage for tcp socket");
+                return true; // Assume we may be in ooo-range
+        }
+
+        // Increase in ooo-packets since last - figure out next safe seq
+        if (n_ooopkts > ooo_range->prev_n_ooopkts) {
+                ooo_range->prev_n_ooopkts = n_ooopkts;
+                err = current_max_possible_ooo_seq(tp, &nxt_seq);
+                if (!err) {
+                        ooo_range->ooo_seq_end = nxt_seq;
+                        ooo_range->active = true;
+                }
+                return true;
+        }
+
+        return tcp_read_in_ooo_range(tp, ooo_range);
+}
+
 static void record_socket_latency(struct sock *sk, struct sk_buff *skb,
                                   ktime_t tstamp, enum netstacklat_hook hook,
                                   u64 cgroup_id)
@@ -590,6 +714,10 @@ int BPF_PROG(netstacklat_tcp_recv_timestamp, void *msg, struct sock *sk,
                 return 0;
 
         struct timespec64 *ts = &tss->ts[0];
+
+        if (!user_config.include_hol_blocked && tcp_read_maybe_holblocked(sk))
+                return 0;
+
         record_socket_latency(sk, NULL,
                               (ktime_t)ts->tv_sec * NS_PER_S + ts->tv_nsec,
                               hook, cgroup_id);

examples/netstacklat.h

Lines changed: 1 addition & 0 deletions
@@ -87,6 +87,7 @@ struct netstacklat_bpf_config {
         bool filter_cgroup;
         bool groupby_ifindex;
         bool groupby_cgroup;
+        bool include_hol_blocked;
 };
 
 #endif
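In the ebpf_exporter variant shown here, the new field is a compile-time setting rather than the -y/--include-tcp-hol-delay command-line flag mentioned in the commit message. A minimal sketch of how user_config in examples/netstacklat.bpf.c could be changed to keep the old behavior of also reporting HOL-blocked reads (only the fields visible in this diff are shown; any others are omitted):

/* Sketch only: flip the new compile-time option so latency samples
 * for potentially HOL-blocked reads are reported again. Fields not
 * visible in this diff are omitted for brevity. */
const struct netstacklat_bpf_config user_config = {
        .filter_cgroup = true,
        .groupby_ifindex = false, /* If true also define CONFIG_GROUPBY_IFINDEX */
        .groupby_cgroup = true,
        .include_hol_blocked = true,
};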
