Skip to content

frr: sweep stale grout routes across zebra restart#605

Open
maxime-leroy wants to merge 18 commits intoDPDK:mainfrom
maxime-leroy:cleanup_fib_routes_frr_startup
Open

frr: sweep stale grout routes across zebra restart#605
maxime-leroy wants to merge 18 commits intoDPDK:mainfrom
maxime-leroy:cleanup_fib_routes_frr_startup

Conversation

@maxime-leroy
Copy link
Copy Markdown
Collaborator

@maxime-leroy maxime-leroy commented Apr 22, 2026

When zebra is killed and restarted while grout keeps running, grout's FIB retains every route the previous FRR instance had installed. With the current plugin those routes become orphans: the startup sync explicitly drops every self-originated route returned by the grout dump (rt_grout.c:337) so zebra's RIB never sees them, and no mechanism ever garbage-collects them from grout. Unreclaimed prefixes linger indefinitely and can blackhole traffic to stale nexthops.

Upstream FRR solves the same problem for the kernel dplane through the ZEBRA_FLAG_SELFROUTE flag, rib_sweep_route and the GR_RUN metaQ sub-queue: kernel-read routes are re-inserted as selfroute placeholders with a pre-startup uptime; protocol clients take them over via rib_compare_routes when they re-announce; anything unreclaimed is purged when rib_sweep_route fires after -K seconds. The grout plugin could not reuse that directly because grout_sync runs from frr_config_post, which is after zebra_main_router_started() has already stamped zrouter.startup_time and armed t_rib_sweep. Entries created in grout_sync therefore have uptime > startup_time and would never match the sweep predicate, and the sweep itself is already scheduled.

Transplant the mechanism into the plugin:

  • Thread a bool startup flag through grout_route4_change / grout_route6_change / grout_route_change (mirroring the existing grout_nexthop_change signature), true on the initial dump path, false on asynchronous notifications. The blanket self-origin skip stays for the notification case (an echo of an install we sent) but becomes injection for the dump case: build the route_entry with ZEBRA_FLAG_SELFROUTE and force re->uptime = zrouter.startup_time so the sweep predicate SELFROUTE && uptime <= startup_time matches, and rib_compare_routes will replace the placeholder cleanly when a client reclaims the prefix.

  • In grout_add_del_route, short-circuit DPLANE_OP_ROUTE_INSTALL and DPLANE_OP_ROUTE_UPDATE when the ctx carries ZEBRA_FLAG_SELFROUTE. Those are the routes we just learned from grout; pushing them back down would be a redundant round-trip. DPLANE_OP_ROUTE_DELETE is left alone so the sweep's uninstall reaches grout via the normal flow.

  • At the very end of the sync chain (last VRF of grout_sync_routes), cancel zrouter.t_rib_sweep and re-arm it with delay equal to the operator's -K value. This aligns the sweep on sync_complete + -K instead of startup_time + -K, giving our backdated SELFROUTE entries the full reclaim window the operator asked for. The -K value is recovered by re-parsing /proc/self/cmdline with getopt_long on a minimal option table: zebra_di.gr_cleanup_time is static in main.c and not exported, re-invoking frr_getopt would re-run frr_opt() with its side effects (exit on -h/-v, double-allocation of modules, errors++ path), and zrouter.t_rib_sweep loses the -K value once it has fired, so none of those routes are reliable. /proc/self/cmdline always holds the original argv. When -K was passed without a value the plugin falls back to ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME exactly as zebra itself does; non-GR startups keep delay = 0 (fire immediately) to match kernel-dplane wipe semantics.

With these pieces in place, zebra's existing SELFROUTE machinery works uniformly over grout-injected entries:

  • Reclaim: clients reconnecting over zapi take over matching SELFROUTE entries via rib_compare_routes; the fresh entry loses SELFROUTE and has a current uptime.
  • GR_RUN (metaQ sub-q 10): per-client early cleanup triggered by ZEBRA_CLIENT_ROUTE_UPDATE_COMPLETE works automatically because our entries carry the real protocol type (via origin2zebra) and a backdated uptime (older than client->restart_time).
  • rib_sweep_route catch-all: reprogrammed to fire after the full grout sync chain completes, purges any SELFROUTE entry nobody reclaimed, and the subsequent DPLANE_OP_ROUTE_DELETE ctxs flow through grout_add_del_route to remove the routes from grout's FIB (the install guard above is install/update only).

Problem

When zebra is killed and restarted while grout remains running, grout's FIB retains routes installed by the previous FRR instance. The plugin’s startup sync currently drops self-originated routes returned by the grout dump so zebra’s RIB never sees them; without a reclamation path these prefixes persist in grout and can blackhole traffic.

Technical changes

Route handling during startup sync

  • Thread a startup boolean through grout_route4_change(), grout_route6_change(), and grout_route_change() (true for initial dump, false for async notifications).
  • For dump-path entries, set ZEBRA_FLAG_SELFROUTE on the created route_entry and backdate re->uptime to zrouter.startup_time so they match the sweep predicate (SELFROUTE && uptime <= startup_time) and are reclaimable when protocol clients reannounce.
  • Preserve existing self-origin skip behavior for notification-path entries.

Dplane operations

  • In grout_add_del_route(), short-circuit DPLANE_OP_ROUTE_INSTALL and DPLANE_OP_ROUTE_UPDATE when the dplane context carries ZEBRA_FLAG_SELFROUTE (return success without redundant install/update round-trips).
  • Leave DPLANE_OP_ROUTE_DELETE unchanged so rib_sweep_route can uninstall unreclaimed SELFROUTE entries.
  • Add a targeted delete short-circuit for the SRTE end-of-replay sync marker to avoid needless uninstall round-trips.

Nexthop handling

  • When startup is true, lookup corresponding nexthop-hash entries and backdate their uptime to zrouter.startup_time; log a warning if the nexthop entry cannot be found.

Sweep rescheduling and sync marker

  • Append an IPv6 self-route sync marker (::/128) tagged with GROUT_SYNC_MARKER_TAG during replay to detect end-of-replay visibility.
  • After the last VRF in grout_sync_routes completes, cancel any prior zrouter.t_rib_sweep and re-arm it with a delay equal to the operator’s -K value (re-parsed from /proc/self/cmdline once via getopt_long and cached; fallback to ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME if -K provided without a value; non-GR startups keep delay = 0).
  • Defer arming rib_sweep_route until the marker is observed (or max retry reached), so backdated SELFROUTE entries receive the intended reclaim window; delete the marker once observed.

API/signature changes

  • grout_route4_change(bool new, struct gr_ip4_route *gr_r4) → grout_route4_change(bool new, struct gr_ip4_route *gr_r4, bool startup)
  • grout_route6_change(bool new, struct gr_ip6_route *gr_r6) → grout_route6_change(bool new, struct gr_ip6_route *gr_r6, bool startup)
  • grout_route_change(...) now accepts a bool startup parameter.
  • Add public constant GROUT_SYNC_MARKER_TAG.

Tests and infra

  • New smoke tests:
    • smoke/frr_restart_graceful_frr_test.sh: validates graceful restart with -K5, ensuring orphan routes are swept while configured routes persist and are reclassified in zebra.
    • smoke/frr_restart_sweep_no_k_frr_test.sh: validates sweep behavior when zebra is started without -K.
    • smoke/grout_restart_resync_frr_test.sh: validates grout self-resynchronization after grout restart while FRR remains running.
  • smoke/_init_frr.sh: keep flog variable non-local so tests can inspect log file after start.

Behavioral outcomes

  • Reconnecting protocol clients reclaim matching SELFROUTE entries via rib_compare_routes() (SELFROUTE removed, uptime updated to current).
  • Per-client early cleanup functions operate correctly because entries carry protocol type and backdated uptime.
  • rib_sweep_route, after the rescheduled delay, purges unreclaimed SELFROUTE entries and causes DPLANE_OP_ROUTE_DELETE flows that remove routes from grout’s FIB.

Status

  • Work in progress: unit tests still in development; author notes upstream FRR PR Startup after crash issues FRRouting/frr#21550 addresses similar kernel-dplane startup cases and recommends auditing NHG cleanup across zebra restarts as follow-up.

@maxime-leroy maxime-leroy marked this pull request as draft April 22, 2026 10:43
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 22, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a startup boolean to grout route-change handlers (grout_route4_change, grout_route6_change, internal grout_route_change) and passes it from call sites to distinguish replay/full-sync from notification-driven updates. During startup, self-originated routes and nexthops have their uptime backdated to zrouter.startup_time. Implements a replay-completion sync marker (an IPv6 selfroute with GROUT_SYNC_MARKER_TAG) that is injected, polled for, deleted on observation, and used to defer arming rib_sweep_route. Adds startup short-circuits on dplane up-path for selfroutes and an SRTE-marker delete short-circuit. Adds grout_read_k_from_cmdline(), header macro, smoke tests, and makes flog global in the test init script.

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@frr/zebra_dplane_grout.c`:
- Around line 166-178: The getopt_long call is using the shortopts string
"-:K::" so the 'K' option is declared with an optional argument, which means
space-separated args like "-K 30" won't populate optarg; change the shortopts to
require an argument (use "K:" in the shortopts, e.g. "-:K:") so both attached
("-K30") and separated ("-K 30") forms are accepted, keep the existing long
option entry in lo, and continue setting ret = atoi(optarg) inside the
getopt_long loop when c == 'K' and optarg is non-NULL to parse the timeout
correctly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 357a74cc-6952-473e-b474-fc3c6704842c

📥 Commits

Reviewing files that changed from the base of the PR and between 46807d1 and ddc7d5d.

📒 Files selected for processing (3)
  • frr/rt_grout.c
  • frr/rt_grout.h
  • frr/zebra_dplane_grout.c

Comment thread frr/zebra_dplane_grout.c
@maxime-leroy maxime-leroy force-pushed the cleanup_fib_routes_frr_startup branch 2 times, most recently from ac471d2 to 9bd72e3 Compare April 22, 2026 11:00
@grout-bot grout-bot force-pushed the cleanup_fib_routes_frr_startup branch from 9bd72e3 to ca3e506 Compare April 23, 2026 10:00
@maxime-leroy
Copy link
Copy Markdown
Collaborator Author

I am still working on a unit test, please don't merge it.

@maxime-leroy
Copy link
Copy Markdown
Collaborator Author

Note: FRRouting/frr#21550 covers the same startup-after-crash class on the kernel-dplane side. It also includes a proto-owned NHG refcount fix. NHG cleanup across zebra restarts should be audited for grout as well (follow-up).

@maxime-leroy maxime-leroy marked this pull request as ready for review April 23, 2026 10:39
@maxime-leroy maxime-leroy force-pushed the cleanup_fib_routes_frr_startup branch from ca3e506 to 8e53f72 Compare April 23, 2026 15:04
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@frr/zebra_dplane_grout.c`:
- Around line 207-213: zebra_dplane_grout currently treats
grout_read_k_from_cmdline() return value `0` as absent and substitutes
ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME, which breaks explicit `-K0`; update the logic
around zrouter.graceful_restart to distinguish "no value" (-1) from a valid 0 by
using a check like (k >= 0) ? k : ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME so that an
explicit 0 yields immediate sweep; adjust the block that sets `delay` before
calling event_add_timer(zrouter.master, rib_sweep_route, NULL, delay,
&zrouter.t_rib_sweep) and keep references to grout_read_k_from_cmdline(),
ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME, zrouter.t_rib_sweep, and rib_sweep_route.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9affa067-4cf8-477c-88ce-06717c97538d

📥 Commits

Reviewing files that changed from the base of the PR and between ca3e506 and 8e53f72.

📒 Files selected for processing (4)
  • frr/rt_grout.c
  • frr/rt_grout.h
  • frr/zebra_dplane_grout.c
  • smoke/frr_restart_sweep_frr_test.sh
🚧 Files skipped from review as they are similar to previous changes (1)
  • frr/rt_grout.h

Comment thread frr/zebra_dplane_grout.c
@maxime-leroy
Copy link
Copy Markdown
Collaborator Author

WIP STILL

@maxime-leroy maxime-leroy force-pushed the cleanup_fib_routes_frr_startup branch from 8e53f72 to 81d3080 Compare April 24, 2026 12:37
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
smoke/frr_restart_graceful_frr_test.sh (1)

108-110: ⚠️ Potential issue | 🔴 Critical

Duplicate of the marker-tag mismatch — grep value is wrong here too.

tag 1735157248 is 0x676C6600, not GROUT_SYNC_MARKER_TAG (0x67727400 = 1735554048). The guard never matches, so the "marker still in zebra RIB after sweep" assertion is inert in this test as well. Fix together with the sweep-no-k smoke test (see the detailed comment on smoke/frr_restart_sweep_no_k_frr_test.sh lines 101-103).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@smoke/frr_restart_graceful_frr_test.sh` around lines 108 - 110, The grep is
using the wrong numeric tag (1735157248) for the sync marker; replace that
literal with the correct GROUT_SYNC_MARKER_TAG value (0x67727400 / 1735554048)
or reference the shared constant/variable used elsewhere so the check matches
the actual marker; update the same change in the paired sweep-no-k test to keep
both guards consistent (search for the grep invocation or the string "tag
1735157248" in this script and in smoke/frr_restart_sweep_no_k_frr_test.sh and
swap to GROUT_SYNC_MARKER_TAG or 1735554048).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@smoke/frr_restart_sweep_no_k_frr_test.sh`:
- Around line 101-103: The test greps for the wrong hard-coded decimal tag;
replace the literal "1735157248" with a decimal derived from the correct hex
macro (GROUT_SYNC_MARKER_TAG == 0x67727400U). Compute the decimal in the shell
(e.g. marker_dec=$((0x67727400)) or convert the macro value to unsigned via
printf/%u) and use that variable in the grep check (grep -qE "$marker_dec") so
the test actually matches the tag printed by vtysh; update both occurrences (the
checks in frr_restart_sweep_no_k_frr_test.sh and
frr_restart_graceful_frr_test.sh) to use this computed value.

---

Duplicate comments:
In `@smoke/frr_restart_graceful_frr_test.sh`:
- Around line 108-110: The grep is using the wrong numeric tag (1735157248) for
the sync marker; replace that literal with the correct GROUT_SYNC_MARKER_TAG
value (0x67727400 / 1735554048) or reference the shared constant/variable used
elsewhere so the check matches the actual marker; update the same change in the
paired sweep-no-k test to keep both guards consistent (search for the grep
invocation or the string "tag 1735157248" in this script and in
smoke/frr_restart_sweep_no_k_frr_test.sh and swap to GROUT_SYNC_MARKER_TAG or
1735554048).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 965dfbed-b4b1-4d60-b04c-53fa3fce363d

📥 Commits

Reviewing files that changed from the base of the PR and between 8e53f72 and 81d3080.

📒 Files selected for processing (7)
  • frr/rt_grout.c
  • frr/rt_grout.h
  • frr/zebra_dplane_grout.c
  • frr/zebra_dplane_grout.h
  • smoke/_init_frr.sh
  • smoke/frr_restart_graceful_frr_test.sh
  • smoke/frr_restart_sweep_no_k_frr_test.sh
✅ Files skipped from review due to trivial changes (1)
  • frr/zebra_dplane_grout.h
🚧 Files skipped from review as they are similar to previous changes (1)
  • frr/rt_grout.h

Comment thread smoke/frr_restart_sweep_no_k_frr_test.sh Outdated
@maxime-leroy maxime-leroy force-pushed the cleanup_fib_routes_frr_startup branch from 81d3080 to 93e0d98 Compare April 24, 2026 15:16
@maxime-leroy
Copy link
Copy Markdown
Collaborator Author

This PR must be carefully review it before any merge please.

@christophefontaine @rjarry @vjardin

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (3)
frr/zebra_dplane_grout.c (1)

283-291: ⚠️ Potential issue | 🟠 Major

Explicit -K0 is silently replaced by the default GR window.

grout_read_k_from_cmdline() returns -1 for absent/value-less -K and 0 only when the operator explicitly attached -K0 (e.g., -K0 in zebra_options). The > 0 check conflates the two, so -K0 ends up extending the stale-route reclaim by ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME instead of sweeping immediately — the opposite of what was requested. Use >= 0 to distinguish "parsed zero" from "absent".

🔧 Suggested fix
 	long delay = 0;
 	if (zrouter.graceful_restart) {
-		long window = (grout_ctx.gr_cleanup_time > 0) ?
+		long window = (grout_ctx.gr_cleanup_time >= 0) ?
 			grout_ctx.gr_cleanup_time :
 			ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME;
 		time_t now = monotime(NULL);
 		time_t deadline = zrouter.startup_time + window;
 		delay = (deadline > now) ? (deadline - now) : 0;
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@frr/zebra_dplane_grout.c` around lines 283 - 291, The grace-period handling
incorrectly treats an explicit zero as "absent" because it checks
grout_ctx.gr_cleanup_time > 0; change that check to >= 0 so that an explicit -K0
(parsed as 0 by grout_read_k_from_cmdline) results in delay=0; adjust the
conditional using the symbols grout_ctx.gr_cleanup_time,
ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME and the existing delay calculation in the block
guarded by zrouter.graceful_restart to use >= 0 instead of > 0.
smoke/frr_restart_graceful_frr_test.sh (1)

114-116: ⚠️ Potential issue | 🔴 Critical

Same wrong marker-tag decimal as in the sweep-no-K test — assertion unreachable.

0x67727400 is 1735554048, not 1735157248. This grep never matches the tag vtysh prints, so the "still in zebra RIB after sweep" check can't actually fail. Derive the decimal from the hex macro to keep the check honest (see fix in frr_restart_sweep_no_k_frr_test.sh).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@smoke/frr_restart_graceful_frr_test.sh` around lines 114 - 116, The grep uses
a wrong hardcoded decimal tag (1735157248) so the check is unreachable; update
the assertion to derive the decimal from the hex marker (0x67727400) instead of
using the incorrect literal—replace the hardcoded "tag 1735157248" with a
shell-derived decimal (e.g. using $((0x67727400)) or printf conversion) or the
same variable/expansion pattern used in frr_restart_sweep_no_k_frr_test.sh so
the grep matches the actual tag emitted by vtysh.
smoke/frr_restart_sweep_no_k_frr_test.sh (1)

96-98: ⚠️ Potential issue | 🔴 Critical

Grep pattern will never match — assertion is a silent no-op.

GROUT_SYNC_MARKER_TAG is 0x67727400 = 1735554048, not 1735157248 (that would be 0x676C6600). vtysh prints the tag in decimal, so the current regex never fires and any regression leaving the marker in zebra's RIB after the sweep passes silently — exactly the thing this assertion is meant to catch.

🔧 Derive the decimal from the macro so it stays correct
-vtysh -c "show ipv6 route ::/128 detail" 2>/dev/null \
-	| grep -qE "tag 1735157248" \
-	&& fail "sync marker still in zebra RIB after sweep"
+marker_tag=$(printf '%u' 0x67727400)
+vtysh -c "show ipv6 route ::/128 detail" 2>/dev/null \
+	| grep -qE "tag $marker_tag" \
+	&& fail "sync marker still in zebra RIB after sweep"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@smoke/frr_restart_sweep_no_k_frr_test.sh` around lines 96 - 98, The grep is
using the wrong decimal tag (1735157248) so the check never fires; update the
grep to derive the correct decimal from the GROUT_SYNC_MARKER_TAG hex value
instead of hardcoding the wrong number. Locate the vtysh check (vtysh -c "show
ipv6 route ::/128 detail" | grep -qE "tag 1735157248") and replace the literal
with a shell expression that converts the macro hex to decimal (e.g. use
$((0x67727400)) or printf "%d" 0x67727400) so the regex matches the actual tag
printed by vtysh.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@frr/zebra_dplane_grout.c`:
- Around line 283-291: The grace-period handling incorrectly treats an explicit
zero as "absent" because it checks grout_ctx.gr_cleanup_time > 0; change that
check to >= 0 so that an explicit -K0 (parsed as 0 by grout_read_k_from_cmdline)
results in delay=0; adjust the conditional using the symbols
grout_ctx.gr_cleanup_time, ZEBRA_GR_DEFAULT_RIB_SWEEP_TIME and the existing
delay calculation in the block guarded by zrouter.graceful_restart to use >= 0
instead of > 0.

In `@smoke/frr_restart_graceful_frr_test.sh`:
- Around line 114-116: The grep uses a wrong hardcoded decimal tag (1735157248)
so the check is unreachable; update the assertion to derive the decimal from the
hex marker (0x67727400) instead of using the incorrect literal—replace the
hardcoded "tag 1735157248" with a shell-derived decimal (e.g. using
$((0x67727400)) or printf conversion) or the same variable/expansion pattern
used in frr_restart_sweep_no_k_frr_test.sh so the grep matches the actual tag
emitted by vtysh.

In `@smoke/frr_restart_sweep_no_k_frr_test.sh`:
- Around line 96-98: The grep is using the wrong decimal tag (1735157248) so the
check never fires; update the grep to derive the correct decimal from the
GROUT_SYNC_MARKER_TAG hex value instead of hardcoding the wrong number. Locate
the vtysh check (vtysh -c "show ipv6 route ::/128 detail" | grep -qE "tag
1735157248") and replace the literal with a shell expression that converts the
macro hex to decimal (e.g. use $((0x67727400)) or printf "%d" 0x67727400) so the
regex matches the actual tag printed by vtysh.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 387be13f-6954-42f6-9ba5-187e630da105

📥 Commits

Reviewing files that changed from the base of the PR and between 81d3080 and 93e0d98.

📒 Files selected for processing (8)
  • frr/rt_grout.c
  • frr/rt_grout.h
  • frr/zebra_dplane_grout.c
  • frr/zebra_dplane_grout.h
  • smoke/_init_frr.sh
  • smoke/frr_restart_graceful_frr_test.sh
  • smoke/frr_restart_sweep_no_k_frr_test.sh
  • smoke/grout_restart_resync_frr_test.sh
🚧 Files skipped from review as they are similar to previous changes (2)
  • smoke/_init_frr.sh
  • frr/rt_grout.c

@vjardin
Copy link
Copy Markdown
Contributor

vjardin commented Apr 24, 2026

what's happen if in between, the marker is injected, there is a crash, then restart with another marker injected (while the former one is not removed) ?

@maxime-leroy
Copy link
Copy Markdown
Collaborator Author

what's happen if in between, the marker is injected, there is a crash, then restart with another marker injected (while the former one is not removed) ?

Grout crash:

The old marker does not survive a grout restart because the reconnect path wipes zebra’s RIB through "grout_ns_reset() -> vrf_terminate()". The marker is removed along with all the other routes before the next sync injects a new one.

Zebra crash:

The marker is not installed into the FIB, only kept in zebra’s RIB for polling visibility. After a zebra crash, that RIB state is gone, so there is no stale marker left when zebra restarts.

@maxime-leroy maxime-leroy force-pushed the cleanup_fib_routes_frr_startup branch 9 times, most recently from 4f9c7d7 to c56c82f Compare April 27, 2026 15:57
@vjardin
Copy link
Copy Markdown
Contributor

vjardin commented Apr 28, 2026

LGTM

@maxime-leroy : it seems OK for me. I feel a bit lost by the over verbose commit logs.

For sure, such feature turns the usage of FRR/zebra with Grout to a nice product ready system wich independent services that can be upgraded isolated. It should help during runtime and development too. It was a real missing piece.

Cosmetic: I'd prefer a shorter commit log that summarizes the commits's intends. Or if we want to keep such long commit logs which has some values, to have 2 sections into the commit log: a TL;DR summary at the beginning with the intends and then the descriptions you did.

@maxime-leroy maxime-leroy force-pushed the cleanup_fib_routes_frr_startup branch 10 times, most recently from 8da5d76 to 1769516 Compare May 6, 2026 08:25
set_ip_route, set_srv6_localsid and set_srv6_route declare local
max_tries=5 and count=0 but never read them. The retry loops that
relied on these counters were replaced by mark_events/wait_event in
f60b047, but the helper-local declarations were left behind.

Fixes: f60b047 ("smoke: replace busy loops with event-based synchronization")
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
The two assignment lines used a bare 'IFS=/ ;' / 'IFS=" " ;' before a
"${sids[*]}" join, leaking a non-default IFS to the rest of the
function. With IFS still set to '/' (or ' '), subsequent unquoted
reads of paths like $tmp/events.log get split into multiple words
and trigger 'ambiguous redirect' in wait_event/mark_events.

Collect SIDs into a single space-separated string and substitute
spaces with '/' via ${sids// //} when building the FRR command.
This avoids touching IFS at all and drops the dead seg_space build
(the variable is never read).

Fixes: be6bcd6 ("smoke: add srv6 test using frr/zebra")
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
set_srv6_route is the only function in _init_frr.sh that uses
4-space indentation; the rest of the file (and the project default
in .editorconfig: indent_style = tab) uses tabs. Convert to tabs
for consistency.

Whitespace-only change; "git diff -w" returns nothing.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
When grout dies, the plugin wipes zebra's RIB via vrf_terminate()
and re-syncs from scratch. But rib_add_multipath does not attach
routes synchronously: it enqueues a zebra_early_route in
META_QUEUE_EARLY_ROUTE, and rib_link runs only when that sub-queue
drains. ere's still pending at vrf_terminate time can attach to a
freed route_table or leak into the recreated VRF.

Introduce a barrier marker the plugin can poll to detect drain
completion: a dummy ::/128 SHARP entry with tag GROUT_SYNC_MARKER_TAG
and distance INFINITY. rib_process skips it (no SELECT, no
redistribute, no dplane install) and FIFO ordering within
subq[EARLY_ROUTE] (a single global queue, not per-VRF) guarantees
every prior ere is attached when the marker becomes observable via
route_node_lookup.

In grout_reconnect, inject the marker before vrf_terminate; once
polling observes it, run grout_ns_reset and re-schedule grout_sync.
Stale markers from a previous cycle are flushed via
rib_meta_queue_early_route_cleanup + rib_delete to keep the matched
poll unambiguous.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
When zebra is killed and restarted while grout keeps running, every
route and nexthop-group from the previous FRR instance stays in
grout's FIB as an orphan. The startup sync drops self-originated
routes returned by the dump, so zebra's RIB never sees them and
nothing garbage-collects them. Stale prefixes can blackhole traffic
to dead nexthops indefinitely.

The kernel dplane already handles this through the metaQ. Routes are
not attached synchronously: rib_add_multipath enqueues a
zebra_early_route in META_QUEUE_EARLY_ROUTE, and rib_link only runs
when that sub-queue drains. The kernel side re-injects kernel-read
routes as ZEBRA_FLAG_SELFROUTE placeholders, protocol clients reclaim
them via rib_compare_routes, and rib_sweep_route purges anything
unreclaimed at zrouter.startup_time + K. Donald Sharp's 9e74dda819bf
("zebra: Delay some processing until after startup is finished") and
1cd1d048407 ("zebra: Fixup some startup issues") stamp startup_time
after route_read drains, so the operator's K window is measured from
the actual end of startup rather than process-start.

Transplant the mechanism. grout_sync runs from frr_config_post, well
after zebra_main_router_started has stamped startup_time and armed
t_rib_sweep, so the plugin has to glue around the upstream timing.

Drain barrier. Reuse the marker primitive: inject ::/128 SHARP at the
tail of the dump, poll until observable, then re-stamp startup_time
= monotime(NULL) and arm rib_sweep_route at delay = K. FIFO ordering
in META_QUEUE_EARLY_ROUTE guarantees every prior SELFROUTE injection
is attached and visible to rib_sweep_route by the time the marker
becomes observable.

Native sweep. zebra_main_router_started arms t_rib_sweep before our
dump completes. Pre-arm zrouter.t_rib_sweep with a long placeholder
delay at frr_late_init: event_add_timer's *t_ptr!=NULL early-return
(lib/event.c:1430) makes the native arm a no-op. The marker observe
path then cancels and re-arms with the real -K.

-K itself is re-parsed once from /proc/self/cmdline:
zebra_di.gr_cleanup_time is static in main.c and zrouter.t_rib_sweep
loses the value after firing.

Nexthop-groups follow the route lifecycle: rib_sweep_route only
sweeps routes, but dump-injected NHEs lose their refcount when the
last referencing route is swept and are then GC'd by FRR's
nexthop-group keep timer.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
set_ip_address, set_ip_route, set_srv6_route and set_srv6_localsid
all gain a leading --persist flag that writes the configuration
verbatim to frr.conf instead of pushing it through vtysh, so tests
that need the config to be replayed on the next FRR start no longer
have to do the equivalent "echo ... >> frr.conf" by hand.

The four helpers now route their mark_events / vtysh / wait_event
sequence through a shared _apply_frr_config dispatcher.

No call site is changed: --persist is opt-in and the positional
signatures of all four helpers stay backward compatible.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
The flog name was unhelpfully terse and looked like a typo. Use the
more explicit frr_log everywhere in start_frr.

No functional change.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Validate the SELFROUTE lifecycle for a zebra crash while grout keeps
running. Two staticd prefixes drive the test:

- prefix_orphan: installed via vtysh runtime only, never persisted.
  After SIGKILL+restart, staticd has no memory of it; the plugin's
  SELFROUTE placeholder is never reclaimed, rib_sweep_route wipes
  it at T+K.

- prefix_kept: installed via vtysh AND appended to frr.conf. At
  restart, mgmtd replays the integrated config, staticd re-pushes
  via zapi within -K, rib_compare_routes promotes the SELFROUTE
  entry to ZEBRA_ROUTE_STATIC, the sweep predicate no longer
  matches.

The crash is simulated with SIGKILL on zebra and staticd, and the
respawn is left to watchfrr instead of being driven by frrinit.sh
restart: the latter mixes the in-flight phased restart watchfrr
already runs with its own all_stop/all_start pair, which races and
kills the freshly respawned zebra a second time before the sweep
timer can fire. Two daemon-side knobs are needed to make watchfrr
respawn observable within a sane test timeout: zebra wants -K5 to
honour the graceful restart window, watchfrr wants
--min-restart-interval=1 because its 60s default silently postpones
the master daemon's restart (the "too soon" branch in try_restart
logs only at LOG_DEBUG+1, invisible at debug level). Both are
plumbed through new ZEBRA_EXTRA_OPTS / WATCHFRR_EXTRA_OPTS env vars
in start_frr; the test exports them before sourcing _init_frr.sh so
the daemons file gets the right cached argv.

Synchronisation between the respawn and the assertions uses two
gates on frr_log (sync marker observed, then "Sweeping the RIB"),
followed by short wait_event timeouts on the actual del events.

start_frr's $flog is dropped from the local declaration so tests
sourcing _init_frr.sh can grep the FRR log file directly.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Variant of frr_restart_graceful_frr_test.sh that runs zebra with the
default command line (no -K). graceful_restart=false and
zebra_main_router_started arms rib_sweep_route at delay=0, racing
the plugin's async replay chain. The pre-arm placeholder in
zd_grout_plugin_init is what neutralises that race; the test
asserts exactly one rib_sweep firing as proof it worked.

The crash is simulated with SIGKILL on zebra and staticd; the
respawn is left to watchfrr, which needs --min-restart-interval=1
(via WATCHFRR_EXTRA_OPTS) because the 60s default silently
postpones the master daemon's restart well past any test timeout.
Same synchronisation pattern as the graceful test: marker-observed
gate, then sweep-fired gate, then wait_event on the del.

The NH deletion event is not asserted: without -K the sweep races
mgmtd's zapi apply of "nexthop-group keep 1" and the keep-around
timer defaults to 180s, exceeding the smoke test window. This is a
pre-existing FRR limitation unrelated to the plugin fix.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Validate the reverse direction: zebra survives, grout is SIGKILL'd
and respawned with an empty FIB. The plugin detects the broken
socket, calls grout_ns_reset which wipes zebra's RIB via
vrf_terminate, recreates the default VRF, and re-syncs from the
repopulated grout. staticd (still alive across the crash) re-pushes
its route via zapi when the nexthop becomes resolvable through the
recreated port.

Poll grcli's JSON snapshot rather than the events stream: the
original grcli events subscriber died with the first grout instance
and is not respawned by the smoke framework.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
grout_gr_nexthop_to_frr_nexthop set nh->type only for L3 and
BLACKHOLE branches. The SR6_LOCAL, SR6_OUTPUT and GROUP branches
populated nh_srv6 / ifindex but left nh->type at zero, which is
not a valid nexthop_types_t value.

The omission was latent until grout_sync_routes started replaying
self-originated grout routes as SELFROUTE placeholders at zebra
startup. The flow:

  1. grout_sync_nhs(vrf=N) -> for each NH dumped ->
     grout_nexthop_change -> zebra_nhg_kernel_find() registers the
     NHE in zebra's hash with kernel origin. No dplane install at
     this stage.
  2. grout_sync_routes(vrf=N) -> injects each route as a
     SELFROUTE entry referencing nh_id. The route lands in
     zebra's RIB.
  3. When zebra processes the SELFROUTE it evaluates the
     associated NHG. If the NHG is not marked installed
     dplane-side, zebra pushes it through the dplane chain for
     confirmation.
  4. grout_dplane_nexthop_update receives the install ->
     is_selfroute passes (BGP / static / ...) -> calls
     grout_add_nexthop -> attempts GR_NH_ADD to grout.

With nh->type left at zero in step 1's conversion, step 4's
switch falls through to the default branch and logs "unsupported
nexthop type 0". The install never completes, rib_sweep_route has
nothing to purge dplane-side, and grout keeps the orphans across
the restart.

The same reclaim round-trip happens with the kernel netlink
dplane. zebra_nhg_install_kernel deliberately forces
dplane_nexthop_add when a protocol takes ownership of a kernel
NHG. Netlink encodes RTM_NEWNEXTHOP with NLM_F_CREATE |
NLM_F_REPLACE so the kernel acks silently. grout's GR_NH_ADD
handler is also idempotent via req->exist_ok, but the request
never gets sent for SRv6 NHEs because the nh->type switch rejects
them first.

Set nh->type to NEXTHOP_TYPE_IPV6 (or the IFINDEX variant when an
outgoing iface is present) for the SRv6 branches, matching what
grout_add_nexthop expects on the inverse direction. For
GR_NH_T_GROUP use NEXTHOP_TYPE_IFINDEX since the group container
has no gate.

Fixes: f481a7b ("frr: synchronize srv6 local nexthops from grout")
Fixes: db5852f ("frr: synchronize srv6 encap nexthops from grout")
Fixes: 4eb7354 ("frr: add basic multipath support")
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
grout_route_change() used gr_nh->vrf_id (nexthop's vrf) instead
of gr_r4/gr_r6->vrf_id (route's vrf). They diverge for cross-vrf
nexthop patterns (L3VPN SRv6 H.Encaps, vrf-leak, route-target
imports): the route lives in a customer vrf while the nh is in
the underlay.

Fixes: fa424dc ("ip,ip6: support cross-VRF routes with nexthop")
Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Take a 5th positional argument naming the decap VRF for the static
SID, defaulting to "default" so existing call sites remain unchanged.
zebra's SRv6 manager (zebra/zebra_srv6.c, get_srv6_sid_explicit)
tracks one SID per (behavior, vrf_id) ctx and explicitly releases the
previous one on any second allocation, so coexisting SR6_LOCAL static
SIDs must specify distinct VRFs (or behaviors). Used by the upcoming
graceful restart srv6 nexthop coverage to keep both an orphan and a
kept localsid installed at the same time.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Take optional 3rd and 4th positional arguments naming the route's
VRF and the nexthop-VRF, mirroring set_ip_route. Both default to
"default" so existing call sites remain unchanged; SIDs in the
variadic trailing list are disambiguated from VRF names by the
mandatory ':'. Used by the upcoming graceful restart srv6 nexthop
coverage to exercise cross-vrf L3VPN encap.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
The per-VRF sync chain (addrs -> nhs -> routes -> next vrf addrs ...)
introduced in commit "frr: drain pending early-routes before grout
reconnect resets VRF" injects routes from VRF N before NHEs from VRFs
N+1 and later are registered in zebra's NHE hash. When a route
references a NHE that lives in a later VRF, the injection lands in
zebra's RIB with an unresolved NH and the entry is later swept as
"unreclaimed" SELFROUTE, even though the NHE registration happens
shortly after.

This pattern arises with SRv6 SR6_LOCAL nexthops: the NHE is stored
in its decap (out_vrf) VRF rather than in the VRF where the SID
prefix is matched. A /128 END.DT4 SID matched in the default VRF can
have its NHE stored in the decap target VRF, so grout_sync_routes for
the default VRF runs before grout_sync_nhs for the decap VRF and
zebra cannot resolve the NHE id; the corresponding error is

    [PXNS9-R1D2X] Zebra failed to find the nexthop hash entry for
                  id=NN in a route entry <pfx>

Reorganise the sync chain along the kernel dplane init pattern:

  Pass 1 (addrs):  per-VRF chain on dplane thread
                   addrs(v1) -> addrs(v2) -> ... -> addrs(vN)
  Pass 2 (nhs):    per-VRF chain on zrouter.master
                   nhs(v1)   -> nhs(v2)   -> ... -> nhs(vN)
  Pass 3 (routes): per-VRF chain on zrouter.master
                   routes(v1) -> routes(v2) -> ... -> routes(vN)

This matches the order the kernel dplane uses at zebra startup
(RTM_GETLINK -> RTM_GETADDR -> RTM_GETNEXTHOP -> RTM_GETROUTE), each
stage globally completed before the next starts. When Pass 3 begins,
every NHE from every VRF dumped from grout is already registered in
zebra's hash, so any cross-VRF NH reference resolves at
rib_add_multipath time.

Stage-grouping also reduces dplane <-> zrouter thread switches: a
single transition between Pass 1 (dplane) and Pass 2 (zrouter), in
place of one transition per VRF in the previous interleaved chain.

The marker injection at the end of Pass 3 and the rib_sweep_route
arming behaviour are unchanged. The change is local to grout_sync_addrs
(chain to addrs of next VRF instead of nhs of this VRF; kick Pass 2
once all VRFs have been processed), grout_sync_nhs (chain to nhs of
next VRF instead of addrs of next VRF; kick Pass 3 once all VRFs
have been processed) and grout_sync_routes (chain to routes of next
VRF instead of addrs of next VRF). No other plugin entry point is
affected and the per-stage behaviour of addrs / nhs / routes is
preserved.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Extend frr_restart_graceful_frr_test.sh with four SRv6 prefixes that
exercise the SR6_OUTPUT and SR6_LOCAL branches of
grout_gr_nexthop_to_frr_nexthop:

  - $sr6_orphan       : SR6_OUTPUT static, vty-only          -> sweep
  - $sr6_kept         : SR6_OUTPUT static, persisted          -> reclaim
  - $sr6_localsid_orphan: SR6_LOCAL  /48 vrf=default, vty-only -> sweep
  - $sr6_localsid_kept  : SR6_LOCAL  /48 vrf=vrf2,    persisted -> reclaim

The two SR6_LOCAL SIDs land in distinct (behavior, vrf) contexts so
both can coexist: zebra's SRv6 manager keeps a single SID per
(behavior, vrf) ctx and silently releases the previous one on a
second allocation otherwise.

The sweep variants assert the dump-injected SELFROUTE round-trip
through grout_add_nexthop succeeds (rib_sweep_route reaches the
dplane delete). The reclaim variants additionally assert that
rib_compare_routes preserves the seg6 segments (SR6_OUTPUT) and the
out_vrf decap target (SR6_LOCAL) when transferring ownership from
the plugin's SELFROUTE placeholder to staticd's re-push.

The seg6 SID assertion uses "show ip route ... json" because the
text format does not surface seg6 segments at all.

The runtime install (vtysh) and the persistence (frr.conf) both go
through the new --persist flag of set_ip_route, set_srv6_route and
set_srv6_localsid instead of being open-coded as "echo ... >>
frr.conf".

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Bring in a local DPDK patch that fixes a latent bug in the IPv6 FIB
TRIE algorithm where rsvd_tbl8s drifts past zero on certain ADD/DEL
patterns and wraps to UINT32_MAX, causing every subsequent long-prefix
add to fail with -ENOSPC even though tbl8_pool_pos is far below
number_tbl8s.

The bug surfaces when a covering parent prefix is deleted before its
children: depth_diff at DEL time is computed against a different RIB
state than at ADD time and the cumulative reservation accounting
diverges. Replacing rsvd_tbl8s with the actual pool occupancy
counter (tbl8_pool_pos for trie, cur_tbl8s for dir24_8) eliminates
the asymmetry. The pre-allocation refusal semantics are preserved.

The patch is intended to be submitted upstream to dpdk-dev; it is
carried locally until accepted so grout users get a working FIB6
under SRv6 churn.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
When FRR restarts, it re-reads its configuration and pushes the SRv6
encapsulation source address to grout via GR_SRV6_TUNSRC_SET. Since grout
was not restarted, the tunnel source nexthop from the previous FRR
session is still present in the nexthop hash table, causing nexthop_new()
to fail with EADDRINUSE.

Return success if the same address is already configured, instead of
failing the request.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
@maxime-leroy maxime-leroy force-pushed the cleanup_fib_routes_frr_startup branch from 1769516 to 387cd1c Compare May 6, 2026 12:02
@maxime-leroy maxime-leroy marked this pull request as ready for review May 6, 2026 12:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants