Fix stale CASE exchange blocking new session handshake #407

jonlil wants to merge 2 commits into project-chip:main
Conversation
Code Review
This pull request modifies the session handling logic to support peer handshake restarts by detecting new session requests on existing exchanges and marking the old exchanges as dropped. The review feedback points out a critical routing issue where subsequent packets might still be processed by the stale exchange unless the lookup logic is updated to ignore dropped states. It also suggests using the remove_exch method for a cleaner and more idiomatic removal of the stale exchange.
```rust
} else {
    let exch = unwrap!(self.exchanges[exch_index].as_mut());

    exch.post_recv(&rx_header.plain, &rx_header.proto, epoch)?;

    Ok(false)
}
```

```rust
} else {
    if !rx_header.proto.is_initiator()
        || !MessageMeta::from(&rx_header.proto).is_new_exchange()
    {
        // Do not create a new exchange if the peer is not an initiator, or if
        // the packet is NOT a candidate for a new exchange
        // (i.e. it is a standalone ACK or a SC status response)
        Err(ErrorCode::NoExchange)?;
        return Ok(false);
    }
```
There is a significant routing issue here. If get_exch_for_rx returns a stale exchange that was marked as dropped (but not removed from the table), this else block will be executed for subsequent packets in the handshake (e.g., CASESigma3). This will cause the packet to be processed by the stale exchange instead of the new one created for the restarted handshake. To fully address this, get_exch_for_rx should be updated to ignore exchanges in the dropped state, ensuring that the transport can find the active exchange even if a stale one with the same ID still exists in the table.
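The suggested lookup change could be sketched as follows. This is a simplified model under assumed names: `Role`, `Exchange`, and the shape of `get_exch_for_rx` here are illustrative stand-ins, not the actual rs-matter definitions.

```rust
// Hypothetical, simplified model of the exchange table. Skipping
// `Role::Dropped` entries ensures a stale exchange with the same ID
// cannot shadow the exchange created for the restarted handshake.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Role {
    Responder,
    Dropped,
}

#[derive(Clone, Copy, Debug)]
struct Exchange {
    exch_id: u16,
    role: Role,
}

struct Session {
    exchanges: Vec<Option<Exchange>>,
}

impl Session {
    /// Find the active (non-dropped) exchange for an incoming packet.
    fn get_exch_for_rx(&self, exch_id: u16) -> Option<usize> {
        self.exchanges.iter().position(|slot| {
            matches!(slot, Some(exch) if exch.exch_id == exch_id && exch.role != Role::Dropped)
        })
    }
}
```

With a filter like this, subsequent handshake packets (e.g. `CASESigma3`) resolve to the new exchange even while a dropped one with the same ID still occupies a slot.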
Agreed this is a valid concern. With the updated code using remove_exch(), the slot is cleared immediately in the common case (no pending MRP at CASESigma1 stage). This means get_exch_for_rx won't find the stale exchange on subsequent packets.
In the rare edge case where MRP is pending and the exchange is only marked as Dropped (not removed), the stale exchange could theoretically shadow the new one. However, at the CASESigma1 stage no application data has been exchanged yet, so there should be no pending retransmissions or ACKs to delay removal.
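The two-path removal behavior described here (clear the slot immediately, or only mark as dropped when MRP work is pending) can be sketched with simplified stand-in types; the real `remove_exch` operates on rs-matter's own exchange structures, so this is only an illustration of the semantics.

```rust
// Hypothetical sketch: remove_exch frees the slot right away in the
// common case, and only defers (marking Dropped) when an ACK or
// retransmission is still outstanding.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Role {
    Responder,
    Dropped,
}

#[derive(Clone, Copy, Debug)]
struct Exchange {
    role: Role,
    /// True while an ACK or retransmission is still outstanding.
    mrp_pending: bool,
}

struct Session {
    exchanges: Vec<Option<Exchange>>,
}

impl Session {
    /// Returns `true` if the slot was freed immediately.
    fn remove_exch(&mut self, index: usize) -> bool {
        let slot = &mut self.exchanges[index];
        match slot {
            Some(exch) if exch.mrp_pending => {
                // MRP still needs the exchange; defer actual removal.
                exch.role = Role::Dropped;
                false
            }
            Some(_) => {
                // Nothing pending: clear the slot so a new exchange
                // can reuse it right away.
                *slot = None;
                true
            }
            None => true, // already cleaned up
        }
    }
}
```

At the `CASESigma1` stage no application data has flowed yet, so the `mrp_pending` branch should be the rare path, matching the reasoning above.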
```rust
if MessageMeta::from(&rx_header.proto).is_new_session() {
    warn!("New session request on existing exchange — peer restarted handshake, evicting stale exchange");
    let exch = unwrap!(self.exchanges[exch_index].as_mut());
    exch.role.set_dropped_state();
    // Fall through to create a new exchange below
```
Instead of manually setting the dropped state, consider using self.remove_exch(exch_index). This method is more idiomatic as it attempts to cleanly remove the exchange from the session's exchange table, or marks it as dropped if there are pending MRP operations (ACKs or retransmissions). This also increases the chance of freeing a slot in the exchange table for the new handshake if the stale exchange can be removed immediately.
```diff
 if MessageMeta::from(&rx_header.proto).is_new_session() {
     warn!("New session request on existing exchange — peer restarted handshake, evicting stale exchange");
-    let exch = unwrap!(self.exchanges[exch_index].as_mut());
-    exch.role.set_dropped_state();
+    self.remove_exch(exch_index);
     // Fall through to create a new exchange below
```
Good suggestion — updated to use remove_exch() which either clears the slot immediately or marks as Dropped if MRP is pending. This is cleaner and frees the slot for the new exchange when possible.
PR #407: Size comparison from 2af46c4 to 5043e71 Full report (8 builds for (core), dimmable-light, onoff-light, onoff-light-bt, speaker)
jonlil force-pushed from 5043e71 to 30715b1
When a Matter client restarts and sends a new `CASESigma1` reusing the same exchange ID as an in-progress handshake, the old exchange is still awaiting `CASESigma3`. The new `Sigma1` gets routed to the stale exchange, producing "Invalid opcode: CASESigma1, expected: CASESigma3", and the handshake fails. The client must then wait through its retry/backoff cycle (typically 30-60s) before succeeding on a subsequent attempt with a different exchange ID.

Fix: in `Session::post_recv()`, when an existing exchange receives a new session request (`CASESigma1` or `PBKDFParamRequest`), remove the stale exchange via `remove_exch()` and fall through to create a fresh one. `remove_exch()` either clears the slot immediately (allowing the new exchange to reuse it) or marks it as `Dropped` if MRP operations are pending.
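The evict-and-recreate decision described in the fix can be sketched as follows. This is a simplified model under assumed names (`Opcode`, `is_new_session`, and the slot-based table are illustrative, not the real rs-matter types) that only shows the routing decision, not MRP handling.

```rust
// Hypothetical sketch: a new-session packet arriving on an already
// occupied exchange ID evicts the stale exchange; any other packet
// is routed to the existing one.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Opcode {
    CaseSigma1,
    CaseSigma3,
    PbkdfParamRequest,
}

fn is_new_session(op: Opcode) -> bool {
    matches!(op, Opcode::CaseSigma1 | Opcode::PbkdfParamRequest)
}

struct Session {
    exchanges: Vec<Option<u16>>, // each slot holds an exchange ID
}

impl Session {
    /// Returns the slot index the packet should use.
    fn post_recv(&mut self, exch_id: u16, op: Opcode) -> usize {
        if let Some(index) = self.exchanges.iter().position(|s| *s == Some(exch_id)) {
            if is_new_session(op) {
                // Peer restarted the handshake: evict the stale
                // exchange so the slot can host a fresh one.
                self.exchanges[index] = None;
            } else {
                // Normal case: route to the existing exchange.
                return index;
            }
        }
        // Create a fresh exchange in the first free slot.
        let free = self
            .exchanges
            .iter()
            .position(|s| s.is_none())
            .expect("exchange table full");
        self.exchanges[free] = Some(exch_id);
        free
    }
}
```

In the real code the eviction goes through `remove_exch()`, which additionally defers removal when MRP work is pending; this sketch models only the immediate-removal case.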
jonlil force-pushed from 30715b1 to 30dd689
PR #407: Size comparison from 2af46c4 to 30dd689 Full report (8 builds for (core), dimmable-light, onoff-light, onoff-light-bt, speaker)
Here are before/after logs from our production setup (rs-matter server with 7 Matter endpoints, rs-matter client controller).

Before fix: server restart causes a 30-60s reconnect delay.

After fix: immediate reconnect.

Updated to use `remove_exch()`.
How is this possible, given that rs-matter does not yet have a CASE initiator (controller) implementation?
Because I built that too, but have not pushed that PR yet! Is this something you would like to discuss? Perhaps I should create that PR as well.
I feel that PR would be much more valuable, so I would be interested to see it. There is also #371 for it anyway. With that said, estincelle was working on the "controller" aspect as well, so you might want to sync with them (look at the PRs merged from them so far) or at least look at their rs-matter fork, especially this branch. Unfortunately, they have not been active (if at all) since the beginning of this month, so either they are on a long leave, or they pivoted to something else...?

As for this concrete PR, sure, I can merge it once it is stable and all, though it looks a bit of a "hack". Is there anything in the Matter Core spec that mandates we should do what this PR does?
CC @mstallmo ^^^
Unfortunately, some things have come up that have taken our attention away from finishing the controller features that we've been working on. Speaking for myself, I do intend to finish the work, but I don't have a concrete timeline yet for when I'll be able to get back to it. That said, if someone wants to finish the work before I can get back to focusing on it, I'm happy to support in any way that I can.
Well, if you don't mind, I would suggest @jonlil and his LLM look at https://github.com/estincelle/rs-matter/tree/marko/TECH-66-case-initiator/rs-matter/src/sc/case and maybe even contribute (a variation of) that (with some unit tests) as the CASE initiator/controller implementation. I doubt their implementation is that much different anyway; there are only a few meaningful ways in which the initiator code path can be implemented.
Wish it were possible to have reply discussions on GitHub; I will respond here instead. If anyone is interested, I can show a full topology chart of my setup, what I have built so far, and what I do to test that the implementation works.

So, I have already implemented the CASE initiator, and yes, Claude with Opus was used as the LLM; perhaps the test coverage is something I have to work on. I decided to put it behind a cargo feature since we bumped into growth in package size that might impact smaller devices, and that MR is still open. So, please continue the discussion on PR #408.

And thanks for the help and great response on my, and Claude's, work 😄 Been quite many long and interesting nights with Matter 1.5 and home automation, with automated TV morning shows as the target, and scheduling watching hockey when my team plays! Actually working!
Yes, seeing this would be useful.
Thanks. Will look into it tomorrow, as today I'm AFK mostly.
NP. Since you are explicitly mentioning Matter 1.5, I assume you might be doing something related to camera support, or at least WebRTC (because cast-like clusters have been available for a long time, well before 1.5)?
When a stale exchange is evicted by `post_recv` (new `CASESigma1` on an existing exchange), the slot is set to `None`. The old `Exchange` object still holds a reference to that slot index; when it is eventually dropped, its `Drop` impl calls `remove_exch` on the already-cleared slot, causing an `unwrap` panic. Handle the `None` case gracefully by returning `true` (already cleaned up) instead of panicking.
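A minimal sketch of the guard, with assumed simplified types (the real code works with rs-matter's exchange table, not this toy one):

```rust
// Hypothetical sketch of the fix: treat an already-empty slot as
// successfully cleaned up. The old Exchange handle's Drop path may
// call remove_exch on a slot that the eviction in post_recv has
// already cleared, so the call must be idempotent rather than
// unwrap-ing the slot contents.
struct ExchangeTable {
    slots: Vec<Option<u16>>, // each slot holds an exchange ID
}

impl ExchangeTable {
    /// Returns `true` when the slot is cleaned up, including the case
    /// where it was already `None` (previously an unwrap panic).
    fn remove_exch(&mut self, index: usize) -> bool {
        match self.slots[index].take() {
            Some(_id) => true, // removed now
            None => true,      // already evicted; nothing left to do
        }
    }
}
```

Calling `remove_exch` a second time on the same index is now a harmless no-op instead of a panic.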
Added a follow-up fix: the `Drop` path of the old `Exchange` called `remove_exch` on a slot that the eviction had already cleared, hitting an `unwrap` panic. Fix: return `true` (already cleaned up) instead of panicking.
PR #407: Size comparison from 2af46c4 to 4d7e717 Full report (8 builds for (core), dimmable-light, onoff-light, onoff-light-bt, speaker)
Problem
When a Matter client restarts and sends a new `CASESigma1` reusing the same exchange ID as an in-progress handshake, the old exchange is still awaiting `CASESigma3`. The new `Sigma1` gets routed to the stale exchange, producing "Invalid opcode: CASESigma1, expected: CASESigma3". The handshake fails and the client must wait through its retry/backoff cycle (typically 30-60s) before succeeding with a different exchange ID.
This is easily reproducible by restarting a Matter controller that connects to an rs-matter server — the first reconnect attempt always fails.
Fix
In `Session::post_recv()`, when an existing exchange receives a new session request (`CASESigma1` or `PBKDFParamRequest`), mark the stale exchange as `Dropped` and fall through to create a fresh one. The new handshake proceeds immediately from scratch.

Uses the existing `MessageMeta::is_new_session()` check and `Role::set_dropped_state()`; no new types or APIs needed.

Testing
Tested with a Zigbee-to-Matter bridge (rs-matter server) and a Matter controller client (also rs-matter based). Before the fix, every server restart caused a 30-60s reconnect delay. After the fix, the client reconnects on its first attempt.