frr: sweep stale grout routes across zebra restart#605
frr: sweep stale grout routes across zebra restart#605maxime-leroy wants to merge 18 commits intoDPDK:mainfrom
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@frr/zebra_dplane_grout.c`:
- Around line 166-178: The getopt_long call is using the shortopts string
"-:K::" so the 'K' option is declared with an optional argument, which means
space-separated args like "-K 30" won't populate optarg; change the shortopts to
require an argument (use "K:" in the shortopts, e.g. "-:K:") so both attached
("-K30") and separated ("-K 30") forms are accepted, keep the existing long
option entry in lo, and continue setting ret = atoi(optarg) inside the
getopt_long loop when c == 'K' and optarg is non-NULL to parse the timeout
correctly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 357a74cc-6952-473e-b474-fc3c6704842c
📒 Files selected for processing (3)
frr/rt_grout.cfrr/rt_grout.hfrr/zebra_dplane_grout.c
ac471d2 to
9bd72e3
Compare
9bd72e3 to
ca3e506
Compare
|
I am still working on a unit test, please don't merge it. |
|
Note: FRRouting/frr#21550 covers the same startup-after-crash class on the kernel-dplane side. It also includes a proto-owned NHG refcount fix. NHG cleanup across zebra restarts should be audited for grout as well (follow-up). |
ca3e506 to
8e53f72
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@frr/zebra_dplane_grout.c`:
- Around line 207-213: zebra_dplane_grout currently treats
grout_read_k_from_cmdline() return value `0` as absent and substitutes
ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME, which breaks explicit `-K0`; update the logic
around zrouter.graceful_restart to distinguish "no value" (-1) from a valid 0 by
using a check like (k >= 0) ? k : ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME so that an
explicit 0 yields immediate sweep; adjust the block that sets `delay` before
calling event_add_timer(zrouter.master, rib_sweep_route, NULL, delay,
&zrouter.t_rib_sweep) and keep references to grout_read_k_from_cmdline(),
ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME, zrouter.t_rib_sweep, and rib_sweep_route.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9affa067-4cf8-477c-88ce-06717c97538d
📒 Files selected for processing (4)
frr/rt_grout.cfrr/rt_grout.hfrr/zebra_dplane_grout.csmoke/frr_restart_sweep_frr_test.sh
🚧 Files skipped from review as they are similar to previous changes (1)
- frr/rt_grout.h
|
WIP STILL |
8e53f72 to
81d3080
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
smoke/frr_restart_graceful_frr_test.sh (1)
108-110:⚠️ Potential issue | 🔴 CriticalDuplicate of the marker-tag mismatch — grep value is wrong here too.
tag 1735157248is0x676C6600, notGROUT_SYNC_MARKER_TAG(0x67727400=1735554048). The guard never matches, so the "marker still in zebra RIB after sweep" assertion is inert in this test as well. Fix together with the sweep-no-k smoke test (see the detailed comment onsmoke/frr_restart_sweep_no_k_frr_test.shlines 101-103).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smoke/frr_restart_graceful_frr_test.sh` around lines 108 - 110, The grep is using the wrong numeric tag (1735157248) for the sync marker; replace that literal with the correct GROUT_SYNC_MARKER_TAG value (0x67727400 / 1735554048) or reference the shared constant/variable used elsewhere so the check matches the actual marker; update the same change in the paired sweep-no-k test to keep both guards consistent (search for the grep invocation or the string "tag 1735157248" in this script and in smoke/frr_restart_sweep_no_k_frr_test.sh and swap to GROUT_SYNC_MARKER_TAG or 1735554048).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@smoke/frr_restart_sweep_no_k_frr_test.sh`:
- Around line 101-103: The test greps for the wrong hard-coded decimal tag;
replace the literal "1735157248" with a decimal derived from the correct hex
macro (GROUT_SYNC_MARKER_TAG == 0x67727400U). Compute the decimal in the shell
(e.g. marker_dec=$((0x67727400)) or convert the macro value to unsigned via
printf/%u) and use that variable in the grep check (grep -qE "$marker_dec") so
the test actually matches the tag printed by vtysh; update both occurrences (the
checks in frr_restart_sweep_no_k_frr_test.sh and
frr_restart_graceful_frr_test.sh) to use this computed value.
---
Duplicate comments:
In `@smoke/frr_restart_graceful_frr_test.sh`:
- Around line 108-110: The grep is using the wrong numeric tag (1735157248) for
the sync marker; replace that literal with the correct GROUT_SYNC_MARKER_TAG
value (0x67727400 / 1735554048) or reference the shared constant/variable used
elsewhere so the check matches the actual marker; update the same change in the
paired sweep-no-k test to keep both guards consistent (search for the grep
invocation or the string "tag 1735157248" in this script and in
smoke/frr_restart_sweep_no_k_frr_test.sh and swap to GROUT_SYNC_MARKER_TAG or
1735554048).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 965dfbed-b4b1-4d60-b04c-53fa3fce363d
📒 Files selected for processing (7)
frr/rt_grout.cfrr/rt_grout.hfrr/zebra_dplane_grout.cfrr/zebra_dplane_grout.hsmoke/_init_frr.shsmoke/frr_restart_graceful_frr_test.shsmoke/frr_restart_sweep_no_k_frr_test.sh
✅ Files skipped from review due to trivial changes (1)
- frr/zebra_dplane_grout.h
🚧 Files skipped from review as they are similar to previous changes (1)
- frr/rt_grout.h
81d3080 to
93e0d98
Compare
|
This PR must be carefully review it before any merge please. |
There was a problem hiding this comment.
♻️ Duplicate comments (3)
frr/zebra_dplane_grout.c (1)
283-291:⚠️ Potential issue | 🟠 MajorExplicit
-K0is silently replaced by the default GR window.
grout_read_k_from_cmdline()returns-1for absent/value-less-Kand0only when the operator explicitly attached-K0(e.g.,-K0inzebra_options). The> 0check conflates the two, so-K0ends up extending the stale-route reclaim byZEBRA_GR_DEFAULT_RIB_SWEEP_TIMEinstead of sweeping immediately — the opposite of what was requested. Use>= 0to distinguish "parsed zero" from "absent".🔧 Suggested fix
long delay = 0; if (zrouter.graceful_restart) { - long window = (grout_ctx.gr_cleanup_time > 0) ? + long window = (grout_ctx.gr_cleanup_time >= 0) ? grout_ctx.gr_cleanup_time : ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME; time_t now = monotime(NULL); time_t deadline = zrouter.startup_time + window; delay = (deadline > now) ? (deadline - now) : 0; }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@frr/zebra_dplane_grout.c` around lines 283 - 291, The grace-period handling incorrectly treats an explicit zero as "absent" because it checks grout_ctx.gr_cleanup_time > 0; change that check to >= 0 so that an explicit -K0 (parsed as 0 by grout_read_k_from_cmdline) results in delay=0; adjust the conditional using the symbols grout_ctx.gr_cleanup_time, ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME and the existing delay calculation in the block guarded by zrouter.graceful_restart to use >= 0 instead of > 0.smoke/frr_restart_graceful_frr_test.sh (1)
114-116:⚠️ Potential issue | 🔴 CriticalSame wrong marker-tag decimal as in the sweep-no-K test — assertion unreachable.
0x67727400is 1735554048, not1735157248. This grep never matches the tagvtyshprints, so the "still in zebra RIB after sweep" check can't actually fail. Derive the decimal from the hex macro to keep the check honest (see fix infrr_restart_sweep_no_k_frr_test.sh).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smoke/frr_restart_graceful_frr_test.sh` around lines 114 - 116, The grep uses a wrong hardcoded decimal tag (1735157248) so the check is unreachable; update the assertion to derive the decimal from the hex marker (0x67727400) instead of using the incorrect literal—replace the hardcoded "tag 1735157248" with a shell-derived decimal (e.g. using $((0x67727400)) or printf conversion) or the same variable/expansion pattern used in frr_restart_sweep_no_k_frr_test.sh so the grep matches the actual tag emitted by vtysh.smoke/frr_restart_sweep_no_k_frr_test.sh (1)
96-98:⚠️ Potential issue | 🔴 CriticalGrep pattern will never match — assertion is a silent no-op.
GROUT_SYNC_MARKER_TAGis0x67727400= 1735554048, not1735157248(that would be0x676C6600).vtyshprints the tag in decimal, so the current regex never fires and any regression leaving the marker in zebra's RIB after the sweep passes silently — exactly the thing this assertion is meant to catch.🔧 Derive the decimal from the macro so it stays correct
-vtysh -c "show ipv6 route ::/128 detail" 2>/dev/null \ - | grep -qE "tag 1735157248" \ - && fail "sync marker still in zebra RIB after sweep" +marker_tag=$(printf '%u' 0x67727400) +vtysh -c "show ipv6 route ::/128 detail" 2>/dev/null \ + | grep -qE "tag $marker_tag" \ + && fail "sync marker still in zebra RIB after sweep"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@smoke/frr_restart_sweep_no_k_frr_test.sh` around lines 96 - 98, The grep is using the wrong decimal tag (1735157248) so the check never fires; update the grep to derive the correct decimal from the GROUT_SYNC_MARKER_TAG hex value instead of hardcoding the wrong number. Locate the vtysh check (vtysh -c "show ipv6 route ::/128 detail" | grep -qE "tag 1735157248") and replace the literal with a shell expression that converts the macro hex to decimal (e.g. use $((0x67727400)) or printf "%d" 0x67727400) so the regex matches the actual tag printed by vtysh.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@frr/zebra_dplane_grout.c`:
- Around line 283-291: The grace-period handling incorrectly treats an explicit
zero as "absent" because it checks grout_ctx.gr_cleanup_time > 0; change that
check to >= 0 so that an explicit -K0 (parsed as 0 by grout_read_k_from_cmdline)
results in delay=0; adjust the conditional using the symbols
grout_ctx.gr_cleanup_time, ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME and the existing
delay calculation in the block guarded by zrouter.graceful_restart to use >= 0
instead of > 0.
In `@smoke/frr_restart_graceful_frr_test.sh`:
- Around line 114-116: The grep uses a wrong hardcoded decimal tag (1735157248)
so the check is unreachable; update the assertion to derive the decimal from the
hex marker (0x67727400) instead of using the incorrect literal—replace the
hardcoded "tag 1735157248" with a shell-derived decimal (e.g. using
$((0x67727400)) or printf conversion) or the same variable/expansion pattern
used in frr_restart_sweep_no_k_frr_test.sh so the grep matches the actual tag
emitted by vtysh.
In `@smoke/frr_restart_sweep_no_k_frr_test.sh`:
- Around line 96-98: The grep is using the wrong decimal tag (1735157248) so the
check never fires; update the grep to derive the correct decimal from the
GROUT_SYNC_MARKER_TAG hex value instead of hardcoding the wrong number. Locate
the vtysh check (vtysh -c "show ipv6 route ::/128 detail" | grep -qE "tag
1735157248") and replace the literal with a shell expression that converts the
macro hex to decimal (e.g. use $((0x67727400)) or printf "%d" 0x67727400) so the
regex matches the actual tag printed by vtysh.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 387be13f-6954-42f6-9ba5-187e630da105
📒 Files selected for processing (8)
frr/rt_grout.cfrr/rt_grout.hfrr/zebra_dplane_grout.cfrr/zebra_dplane_grout.hsmoke/_init_frr.shsmoke/frr_restart_graceful_frr_test.shsmoke/frr_restart_sweep_no_k_frr_test.shsmoke/grout_restart_resync_frr_test.sh
🚧 Files skipped from review as they are similar to previous changes (2)
- smoke/_init_frr.sh
- frr/rt_grout.c
|
what's happen if in between, the marker is injected, there is a crash, then restart with another marker injected (while the former one is not removed) ? |
Grout crash: The old marker does not survive a grout restart because the reconnect path wipes zebra’s RIB through "grout_ns_reset() -> vrf_terminate()". The marker is removed along with all the other routes before the next sync injects a new one. Zebra crash: The marker is not installed into the FIB, only kept in zebra’s RIB for polling visibility. After a zebra crash, that RIB state is gone, so there is no stale marker left when zebra restarts. |
4f9c7d7 to
c56c82f
Compare
|
LGTM @maxime-leroy : it seems OK for me. I feel a bit lost by the over verbose commit logs. For sure, such feature turns the usage of FRR/zebra with Grout to a nice product ready system wich independent services that can be upgraded isolated. It should help during runtime and development too. It was a real missing piece. Cosmetic: I'd prefer a shorter commit log that summarizes the commits's intends. Or if we want to keep such long commit logs which has some values, to have 2 sections into the commit log: a TL;DR summary at the beginning with the intends and then the descriptions you did. |
8da5d76 to
1769516
Compare
set_ip_route, set_srv6_localsid and set_srv6_route declare local max_tries=5 and count=0 but never read them. The retry loops that relied on these counters were replaced by mark_events/wait_event in f60b047, but the helper-local declarations were left behind. Fixes: f60b047 ("smoke: replace busy loops with event-based synchronization") Signed-off-by: Maxime Leroy <maxime@leroys.fr>
The two assignment lines used a bare 'IFS=/ ;' / 'IFS=" " ;' before a
"${sids[*]}" join, leaking a non-default IFS to the rest of the
function. With IFS still set to '/' (or ' '), subsequent unquoted
reads of paths like $tmp/events.log get split into multiple words
and trigger 'ambiguous redirect' in wait_event/mark_events.
Collect SIDs into a single space-separated string and substitute
spaces with '/' via ${sids// //} when building the FRR command.
This avoids touching IFS at all and drops the dead seg_space build
(the variable is never read).
Fixes: be6bcd6 ("smoke: add srv6 test using frr/zebra")
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
set_srv6_route is the only function in _init_frr.sh that uses 4-space indentation; the rest of the file (and the project default in .editorconfig: indent_style = tab) uses tabs. Convert to tabs for consistency. Whitespace-only change; "git diff -w" returns nothing. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
When grout dies, the plugin wipes zebra's RIB via vrf_terminate() and re-syncs from scratch. But rib_add_multipath does not attach routes synchronously: it enqueues a zebra_early_route in META_QUEUE_EARLY_ROUTE, and rib_link runs only when that sub-queue drains. ere's still pending at vrf_terminate time can attach to a freed route_table or leak into the recreated VRF. Introduce a barrier marker the plugin can poll to detect drain completion: a dummy ::/128 SHARP entry with tag GROUT_SYNC_MARKER_TAG and distance INFINITY. rib_process skips it (no SELECT, no redistribute, no dplane install) and FIFO ordering within subq[EARLY_ROUTE] (a single global queue, not per-VRF) guarantees every prior ere is attached when the marker becomes observable via route_node_lookup. In grout_reconnect, inject the marker before vrf_terminate; once polling observes it, run grout_ns_reset and re-schedule grout_sync. Stale markers from a previous cycle are flushed via rib_meta_queue_early_route_cleanup + rib_delete to keep the matched poll unambiguous. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
When zebra is killed and restarted while grout keeps running, every
route and nexthop-group from the previous FRR instance stays in
grout's FIB as an orphan. The startup sync drops self-originated
routes returned by the dump, so zebra's RIB never sees them and
nothing garbage-collects them. Stale prefixes can blackhole traffic
to dead nexthops indefinitely.
The kernel dplane already handles this through the metaQ. Routes are
not attached synchronously: rib_add_multipath enqueues a
zebra_early_route in META_QUEUE_EARLY_ROUTE, and rib_link only runs
when that sub-queue drains. The kernel side re-injects kernel-read
routes as ZEBRA_FLAG_SELFROUTE placeholders, protocol clients reclaim
them via rib_compare_routes, and rib_sweep_route purges anything
unreclaimed at zrouter.startup_time + K. Donald Sharp's 9e74dda819bf
("zebra: Delay some processing until after startup is finished") and
1cd1d048407 ("zebra: Fixup some startup issues") stamp startup_time
after route_read drains, so the operator's K window is measured from
the actual end of startup rather than process-start.
Transplant the mechanism. grout_sync runs from frr_config_post, well
after zebra_main_router_started has stamped startup_time and armed
t_rib_sweep, so the plugin has to glue around the upstream timing.
Drain barrier. Reuse the marker primitive: inject ::/128 SHARP at the
tail of the dump, poll until observable, then re-stamp startup_time
= monotime(NULL) and arm rib_sweep_route at delay = K. FIFO ordering
in META_QUEUE_EARLY_ROUTE guarantees every prior SELFROUTE injection
is attached and visible to rib_sweep_route by the time the marker
becomes observable.
Native sweep. zebra_main_router_started arms t_rib_sweep before our
dump completes. Pre-arm zrouter.t_rib_sweep with a long placeholder
delay at frr_late_init: event_add_timer's *t_ptr!=NULL early-return
(lib/event.c:1430) makes the native arm a no-op. The marker observe
path then cancels and re-arms with the real -K.
-K itself is re-parsed once from /proc/self/cmdline:
zebra_di.gr_cleanup_time is static in main.c and zrouter.t_rib_sweep
loses the value after firing.
Nexthop-groups follow the route lifecycle: rib_sweep_route only
sweeps routes, but dump-injected NHEs lose their refcount when the
last referencing route is swept and are then GC'd by FRR's
nexthop-group keep timer.
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
set_ip_address, set_ip_route, set_srv6_route and set_srv6_localsid all gain a leading --persist flag that writes the configuration verbatim to frr.conf instead of pushing it through vtysh, so tests that need the config to be replayed on the next FRR start no longer have to do the equivalent "echo ... >> frr.conf" by hand. The four helpers now route their mark_events / vtysh / wait_event sequence through a shared _apply_frr_config dispatcher. No call site is changed: --persist is opt-in and the positional signatures of all four helpers stay backward compatible. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
The flog name was unhelpfully terse and looked like a typo. Use the more explicit frr_log everywhere in start_frr. No functional change. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Validate the SELFROUTE lifecycle for a zebra crash while grout keeps running. Two staticd prefixes drive the test: - prefix_orphan: installed via vtysh runtime only, never persisted. After SIGKILL+restart, staticd has no memory of it; the plugin's SELFROUTE placeholder is never reclaimed, rib_sweep_route wipes it at T+K. - prefix_kept: installed via vtysh AND appended to frr.conf. At restart, mgmtd replays the integrated config, staticd re-pushes via zapi within -K, rib_compare_routes promotes the SELFROUTE entry to ZEBRA_ROUTE_STATIC, the sweep predicate no longer matches. The crash is simulated with SIGKILL on zebra and staticd, and the respawn is left to watchfrr instead of being driven by frrinit.sh restart: the latter mixes the in-flight phased restart watchfrr already runs with its own all_stop/all_start pair, which races and kills the freshly respawned zebra a second time before the sweep timer can fire. Two daemon-side knobs are needed to make watchfrr respawn observable within a sane test timeout: zebra wants -K5 to honour the graceful restart window, watchfrr wants --min-restart-interval=1 because its 60s default silently postpones the master daemon's restart (the "too soon" branch in try_restart logs only at LOG_DEBUG+1, invisible at debug level). Both are plumbed through new ZEBRA_EXTRA_OPTS / WATCHFRR_EXTRA_OPTS env vars in start_frr; the test exports them before sourcing _init_frr.sh so the daemons file gets the right cached argv. Synchronisation between the respawn and the assertions uses two gates on frr_log (sync marker observed, then "Sweeping the RIB"), followed by short wait_event timeouts on the actual del events. start_frr's $flog is dropped from the local declaration so tests sourcing _init_frr.sh can grep the FRR log file directly. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Variant of frr_restart_graceful_frr_test.sh that runs zebra with the default command line (no -K). graceful_restart=false and zebra_main_router_started arms rib_sweep_route at delay=0, racing the plugin's async replay chain. The pre-arm placeholder in zd_grout_plugin_init is what neutralises that race; the test asserts exactly one rib_sweep firing as proof it worked. The crash is simulated with SIGKILL on zebra and staticd; the respawn is left to watchfrr, which needs --min-restart-interval=1 (via WATCHFRR_EXTRA_OPTS) because the 60s default silently postpones the master daemon's restart well past any test timeout. Same synchronisation pattern as the graceful test: marker-observed gate, then sweep-fired gate, then wait_event on the del. The NH deletion event is not asserted: without -K the sweep races mgmtd's zapi apply of "nexthop-group keep 1" and the keep-around timer defaults to 180s, exceeding the smoke test window. This is a pre-existing FRR limitation unrelated to the plugin fix. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Validate the reverse direction: zebra survives, grout is SIGKILL'd and respawned with an empty FIB. The plugin detects the broken socket, calls grout_ns_reset which wipes zebra's RIB via vrf_terminate, recreates the default VRF, and re-syncs from the repopulated grout. staticd (still alive across the crash) re-pushes its route via zapi when the nexthop becomes resolvable through the recreated port. Poll grcli's JSON snapshot rather than the events stream: the original grcli events subscriber died with the first grout instance and is not respawned by the smoke framework. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
grout_gr_nexthop_to_frr_nexthop set nh->type only for L3 and
BLACKHOLE branches. The SR6_LOCAL, SR6_OUTPUT and GROUP branches
populated nh_srv6 / ifindex but left nh->type at zero, which is
not a valid nexthop_types_t value.
The omission was latent until grout_sync_routes started replaying
self-originated grout routes as SELFROUTE placeholders at zebra
startup. The flow:
1. grout_sync_nhs(vrf=N) -> for each NH dumped ->
grout_nexthop_change -> zebra_nhg_kernel_find() registers the
NHE in zebra's hash with kernel origin. No dplane install at
this stage.
2. grout_sync_routes(vrf=N) -> injects each route as a
SELFROUTE entry referencing nh_id. The route lands in
zebra's RIB.
3. When zebra processes the SELFROUTE it evaluates the
associated NHG. If the NHG is not marked installed
dplane-side, zebra pushes it through the dplane chain for
confirmation.
4. grout_dplane_nexthop_update receives the install ->
is_selfroute passes (BGP / static / ...) -> calls
grout_add_nexthop -> attempts GR_NH_ADD to grout.
With nh->type left at zero in step 1's conversion, step 4's
switch falls through to the default branch and logs "unsupported
nexthop type 0". The install never completes, rib_sweep_route has
nothing to purge dplane-side, and grout keeps the orphans across
the restart.
The same reclaim round-trip happens with the kernel netlink
dplane. zebra_nhg_install_kernel deliberately forces
dplane_nexthop_add when a protocol takes ownership of a kernel
NHG. Netlink encodes RTM_NEWNEXTHOP with NLM_F_CREATE |
NLM_F_REPLACE so the kernel acks silently. grout's GR_NH_ADD
handler is also idempotent via req->exist_ok, but the request
never gets sent for SRv6 NHEs because the nh->type switch rejects
them first.
Set nh->type to NEXTHOP_TYPE_IPV6 (or the IFINDEX variant when an
outgoing iface is present) for the SRv6 branches, matching what
grout_add_nexthop expects on the inverse direction. For
GR_NH_T_GROUP use NEXTHOP_TYPE_IFINDEX since the group container
has no gate.
Fixes: f481a7b ("frr: synchronize srv6 local nexthops from grout")
Fixes: db5852f ("frr: synchronize srv6 encap nexthops from grout")
Fixes: 4eb7354 ("frr: add basic multipath support")
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
grout_route_change() used gr_nh->vrf_id (nexthop's vrf) instead of gr_r4/gr_r6->vrf_id (route's vrf). They diverge for cross-vrf nexthop patterns (L3VPN SRv6 H.Encaps, vrf-leak, route-target imports): the route lives in a customer vrf while the nh is in the underlay. Fixes: fa424dc ("ip,ip6: support cross-VRF routes with nexthop") Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Take a 5th positional argument naming the decap VRF for the static SID, defaulting to "default" so existing call sites remain unchanged. zebra's SRv6 manager (zebra/zebra_srv6.c, get_srv6_sid_explicit) tracks one SID per (behavior, vrf_id) ctx and explicitly releases the previous one on any second allocation, so coexisting SR6_LOCAL static SIDs must specify distinct VRFs (or behaviors). Used by the upcoming graceful restart srv6 nexthop coverage to keep both an orphan and a kept localsid installed at the same time. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Take optional 3rd and 4th positional arguments naming the route's VRF and the nexthop-VRF, mirroring set_ip_route. Both default to "default" so existing call sites remain unchanged; SIDs in the variadic trailing list are disambiguated from VRF names by the mandatory ':'. Used by the upcoming graceful restart srv6 nexthop coverage to exercise cross-vrf L3VPN encap. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
The per-VRF sync chain (addrs -> nhs -> routes -> next vrf addrs ...)
introduced in commit "frr: drain pending early-routes before grout
reconnect resets VRF" injects routes from VRF N before NHEs from VRFs
N+1 and later are registered in zebra's NHE hash. When a route
references a NHE that lives in a later VRF, the injection lands in
zebra's RIB with an unresolved NH and the entry is later swept as
"unreclaimed" SELFROUTE, even though the NHE registration happens
shortly after.
This pattern arises with SRv6 SR6_LOCAL nexthops: the NHE is stored
in its decap (out_vrf) VRF rather than in the VRF where the SID
prefix is matched. A /128 END.DT4 SID matched in the default VRF can
have its NHE stored in the decap target VRF, so grout_sync_routes for
the default VRF runs before grout_sync_nhs for the decap VRF and
zebra cannot resolve the NHE id; the corresponding error is
[PXNS9-R1D2X] Zebra failed to find the nexthop hash entry for
id=NN in a route entry <pfx>
Reorganise the sync chain along the kernel dplane init pattern:
Pass 1 (addrs): per-VRF chain on dplane thread
addrs(v1) -> addrs(v2) -> ... -> addrs(vN)
Pass 2 (nhs): per-VRF chain on zrouter.master
nhs(v1) -> nhs(v2) -> ... -> nhs(vN)
Pass 3 (routes): per-VRF chain on zrouter.master
routes(v1) -> routes(v2) -> ... -> routes(vN)
This matches the order the kernel dplane uses at zebra startup
(RTM_GETLINK -> RTM_GETADDR -> RTM_GETNEXTHOP -> RTM_GETROUTE), each
stage globally completed before the next starts. When Pass 3 begins,
every NHE from every VRF dumped from grout is already registered in
zebra's hash, so any cross-VRF NH reference resolves at
rib_add_multipath time.
Stage-grouping also reduces dplane <-> zrouter thread switches: a
single transition between Pass 1 (dplane) and Pass 2 (zrouter), in
place of one transition per VRF in the previous interleaved chain.
The marker injection at the end of Pass 3 and the rib_sweep_route
arming behaviour are unchanged. The change is local to grout_sync_addrs
(chain to addrs of next VRF instead of nhs of this VRF; kick Pass 2
once all VRFs have been processed), grout_sync_nhs (chain to nhs of
next VRF instead of addrs of next VRF; kick Pass 3 once all VRFs
have been processed) and grout_sync_routes (chain to routes of next
VRF instead of addrs of next VRF). No other plugin entry point is
affected and the per-stage behaviour of addrs / nhs / routes is
preserved.
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Extend frr_restart_graceful_frr_test.sh with four SRv6 prefixes that exercise the SR6_OUTPUT and SR6_LOCAL branches of grout_gr_nexthop_to_frr_nexthop: - $sr6_orphan : SR6_OUTPUT static, vty-only -> sweep - $sr6_kept : SR6_OUTPUT static, persisted -> reclaim - $sr6_localsid_orphan: SR6_LOCAL /48 vrf=default, vty-only -> sweep - $sr6_localsid_kept : SR6_LOCAL /48 vrf=vrf2, persisted -> reclaim The two SR6_LOCAL SIDs land in distinct (behavior, vrf) contexts so both can coexist: zebra's SRv6 manager keeps a single SID per (behavior, vrf) ctx and silently releases the previous one on a second allocation otherwise. The sweep variants assert the dump-injected SELFROUTE round-trip through grout_add_nexthop succeeds (rib_sweep_route reaches the dplane delete). The reclaim variants additionally assert that rib_compare_routes preserves the seg6 segments (SR6_OUTPUT) and the out_vrf decap target (SR6_LOCAL) when transferring ownership from the plugin's SELFROUTE placeholder to staticd's re-push. The seg6 SID assertion uses "show ip route ... json" because the text format does not surface seg6 segments at all. The runtime install (vtysh) and the persistence (frr.conf) both go through the new --persist flag of set_ip_route, set_srv6_route and set_srv6_localsid instead of being open-coded as "echo ... >> frr.conf". Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Bring in a local DPDK patch that fixes a latent bug in the IPv6 FIB TRIE algorithm where rsvd_tbl8s drifts past zero on certain ADD/DEL patterns and wraps to UINT32_MAX, causing every subsequent long-prefix add to fail with -ENOSPC even though tbl8_pool_pos is far below number_tbl8s. The bug surfaces when a covering parent prefix is deleted before its children: depth_diff at DEL time is computed against a different RIB state than at ADD time and the cumulative reservation accounting diverges. Replacing rsvd_tbl8s with the actual pool occupancy counter (tbl8_pool_pos for trie, cur_tbl8s for dir24_8) eliminates the asymmetry. The pre-allocation refusal semantics are preserved. The patch is intended to be submitted upstream to dpdk-dev; it is carried locally until accepted so grout users get a working FIB6 under SRv6 churn. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
When FRR restarts, it re-reads its configuration and pushes the SRv6 encapsulation source address to grout via GR_SRV6_TUNSRC_SET. Since grout was not restarted, the tunnel source nexthop from the previous FRR session is still present in the nexthop hash table, causing nexthop_new() to fail with EADDRINUSE. Return success if the same address is already configured, instead of failing the request. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
1769516 to
387cd1c
Compare
When zebra is killed and restarted while grout keeps running, grout's FIB retains every route the previous FRR instance had installed. With the current plugin those routes become orphans: the startup sync explicitly drops every self-originated route returned by the grout dump (rt_grout.c:337) so zebra's RIB never sees them, and no mechanism ever garbage-collects them from grout. Unreclaimed prefixes linger indefinitely and can blackhole traffic to stale nexthops.
Upstream FRR solves the same problem for the kernel dplane through the ZEBRA_FLAG_SELFROUTE flag, rib_sweep_route and the GR_RUN metaQ sub-queue: kernel-read routes are re-inserted as selfroute placeholders with a pre-startup uptime; protocol clients take them over via rib_compare_routes when they re-announce; anything unreclaimed is purged when rib_sweep_route fires after -K seconds. The grout plugin could not reuse that directly because grout_sync runs from frr_config_post, which is after zebra_main_router_started() has already stamped zrouter.startup_time and armed t_rib_sweep. Entries created in grout_sync therefore have uptime > startup_time and would never match the sweep predicate, and the sweep itself is already scheduled.
Transplant the mechanism into the plugin:
Thread a bool startup flag through grout_route4_change / grout_route6_change / grout_route_change (mirroring the existing grout_nexthop_change signature), true on the initial dump path, false on asynchronous notifications. The blanket self-origin skip stays for the notification case (an echo of an install we sent) but becomes injection for the dump case: build the route_entry with ZEBRA_FLAG_SELFROUTE and force re->uptime = zrouter.startup_time so the sweep predicate SELFROUTE && uptime <= startup_time matches, and rib_compare_routes will replace the placeholder cleanly when a client reclaims the prefix.
In grout_add_del_route, short-circuit DPLANE_OP_ROUTE_INSTALL and DPLANE_OP_ROUTE_UPDATE when the ctx carries ZEBRA_FLAG_SELFROUTE. Those are the routes we just learned from grout; pushing them back down would be a redundant round-trip. DPLANE_OP_ROUTE_DELETE is left alone so the sweep's uninstall reaches grout via the normal flow.
At the very end of the sync chain (last VRF of grout_sync_routes), cancel zrouter.t_rib_sweep and re-arm it with delay equal to the operator's -K value. This aligns the sweep on sync_complete + -K instead of startup_time + -K, giving our backdated SELFROUTE entries the full reclaim window the operator asked for. The -K value is recovered by re-parsing /proc/self/cmdline with getopt_long on a minimal option table: zebra_di.gr_cleanup_time is static in main.c and not exported, re-invoking frr_getopt would re-run frr_opt() with its side effects (exit on -h/-v, double-allocation of modules, errors++ path), and zrouter.t_rib_sweep loses the -K value once it has fired, so none of those routes are reliable. /proc/self/cmdline always holds the original argv. When -K was passed without a value the plugin falls back to ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME exactly as zebra itself does; non-GR startups keep delay = 0 (fire immediately) to match kernel-dplane wipe semantics.
With these pieces in place, zebra's existing SELFROUTE machinery works uniformly over grout-injected entries:
Problem
When zebra is killed and restarted while grout remains running, grout's FIB retains routes installed by the previous FRR instance. The plugin’s startup sync currently drops self-originated routes returned by the grout dump so zebra’s RIB never sees them; without a reclamation path these prefixes persist in grout and can blackhole traffic.
Technical changes
Route handling during startup sync
Dplane operations
Nexthop handling
Sweep rescheduling and sync marker
API/signature changes
Tests and infra
Behavioral outcomes
Status