Skip to content

Fix stale CASE exchange blocking new session handshake#407

Open
jonlil wants to merge 2 commits intoproject-chip:mainfrom
viska-ab:fix/case-stale-exchange
Open

Fix stale CASE exchange blocking new session handshake#407
jonlil wants to merge 2 commits intoproject-chip:mainfrom
viska-ab:fix/case-stale-exchange

Conversation

@jonlil
Copy link
Copy Markdown
Contributor

@jonlil jonlil commented Apr 11, 2026

Problem

When a Matter client restarts and sends a new CASESigma1 reusing the same exchange ID as an in-progress handshake, the old exchange is still awaiting CASESigma3. The new Sigma1 gets routed to the stale exchange, producing:

Invalid opcode: CASESigma1, expected: CASESigma3

The handshake fails and the client must wait through its retry/backoff cycle (typically 30-60s) before succeeding with a different exchange ID.

This is easily reproducible by restarting a Matter controller that connects to an rs-matter server — the first reconnect attempt always fails.

Fix

In Session::post_recv(), when an existing exchange receives a new session request (CASESigma1 or PBKDFParamRequest), mark the stale exchange as Dropped and fall through to create a fresh one. The new handshake proceeds immediately from scratch.

Uses the existing MessageMeta::is_new_session() check and Role::set_dropped_state() — no new types or APIs needed.

Testing

Tested with a Zigbee-to-Matter bridge (rs-matter server) and a Matter controller client (also rs-matter based). Before the fix, every server restart caused a 30-60s reconnect delay. After the fix, the client reconnects on its first attempt.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the session handling logic to support peer handshake restarts by detecting new session requests on existing exchanges and marking the old exchanges as dropped. The review feedback points out a critical routing issue where subsequent packets might still be processed by the stale exchange unless the lookup logic is updated to ignore dropped states. It also suggests using the remove_exch method for a cleaner and more idiomatic removal of the stale exchange.

Comment on lines +323 to 329
} else {
let exch = unwrap!(self.exchanges[exch_index].as_mut());

exch.post_recv(&rx_header.plain, &rx_header.proto, epoch)?;
exch.post_recv(&rx_header.plain, &rx_header.proto, epoch)?;

Ok(false)
} else {
if !rx_header.proto.is_initiator()
|| !MessageMeta::from(&rx_header.proto).is_new_exchange()
{
// Do not create a new exchange if the peer is not an initiator, or if
// the packet is NOT a candidate for a new exchange
// (i.e. it is a standalone ACK or a SC status response)
Err(ErrorCode::NoExchange)?;
return Ok(false);
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

There is a significant routing issue here. If get_exch_for_rx returns a stale exchange that was marked as dropped (but not removed from the table), this else block will be executed for subsequent packets in the handshake (e.g., CASESigma3). This will cause the packet to be processed by the stale exchange instead of the new one created for the restarted handshake. To fully address this, get_exch_for_rx should be updated to ignore exchanges in the dropped state, ensuring that the transport can find the active exchange even if a stale one with the same ID still exists in the table.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed this is a valid concern. With the updated code using remove_exch(), the slot is cleared immediately in the common case (no pending MRP at CASESigma1 stage). This means get_exch_for_rx won't find the stale exchange on subsequent packets.

In the rare edge case where MRP is pending and the exchange is only marked as Dropped (not removed), the stale exchange could theoretically shadow the new one. However, at the CASESigma1 stage no application data has been exchanged yet, so there should be no pending retransmissions or ACKs to delay removal.

Comment on lines +318 to +322
if MessageMeta::from(&rx_header.proto).is_new_session() {
warn!("New session request on existing exchange — peer restarted handshake, evicting stale exchange");
let exch = unwrap!(self.exchanges[exch_index].as_mut());
exch.role.set_dropped_state();
// Fall through to create a new exchange below
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Instead of manually setting the dropped state, consider using self.remove_exch(exch_index). This method is more idiomatic as it attempts to cleanly remove the exchange from the session's exchange table, or marks it as dropped if there are pending MRP operations (ACKs or retransmissions). This also increases the chance of freeing a slot in the exchange table for the new handshake if the stale exchange can be removed immediately.

Suggested change
if MessageMeta::from(&rx_header.proto).is_new_session() {
warn!("New session request on existing exchange — peer restarted handshake, evicting stale exchange");
let exch = unwrap!(self.exchanges[exch_index].as_mut());
exch.role.set_dropped_state();
// Fall through to create a new exchange below
if MessageMeta::from(&rx_header.proto).is_new_session() {
warn!("New session request on existing exchange — peer restarted handshake, evicting stale exchange");
self.remove_exch(exch_index);
// Fall through to create a new exchange below

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion — updated to use remove_exch() which either clears the slot immediately or marks as Dropped if MRP is pending. This is cleaner and frees the slot for the new exchange when possible.

@github-actions
Copy link
Copy Markdown

PR #407: Size comparison from 2af46c4 to 5043e71

Full report (8 builds for (core), dimmable-light, onoff-light, onoff-light-bt, speaker)
platform target config section 2af46c4 5043e71 change % change
(core) riscv32imac-unknown-none-elf infodefmt-optz-ltofat FLASH 403870 404078 208 0.1
RAM 68072 68072 0 0.0
thumbv6m-none-eabi infodefmt-optz-ltofat FLASH 337860 337988 128 0.0
RAM 64312 64320 8 0.0
thumbv7em-none-eabi infodefmt-optz-ltofat FLASH 312748 312856 108 0.0
RAM 63800 63808 8 0.0
x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 839687 839991 304 0.0
RAM 66616 66616 0 0.0
dimmable-light x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 1957576 1957896 320 0.0
RAM 52352 52352 0 0.0
onoff-light x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 1890784 1891080 296 0.0
RAM 52056 52056 0 0.0
onoff-light-bt x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 3218056 3219040 984 0.0
RAM 5784 5784 0 0.0
speaker x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 850424 850424 0 0.0
RAM 2856 2856 0 0.0

@jonlil jonlil force-pushed the fix/case-stale-exchange branch from 5043e71 to 30715b1 Compare April 11, 2026 09:58
When a Matter client restarts and sends a new CASESigma1 reusing
the same exchange ID as an in-progress handshake, the old exchange
is still awaiting CASESigma3. The new Sigma1 gets routed to the
stale exchange, producing "Invalid opcode: CASESigma1, expected:
CASESigma3" and the handshake fails.

The client must then wait through its retry/backoff cycle (typically
30-60s) before succeeding on a subsequent attempt with a different
exchange ID.

Fix: in Session::post_recv(), when an existing exchange receives a
new session request (CASESigma1 or PBKDFParamRequest), remove the
stale exchange via remove_exch() and fall through to create a fresh
one. remove_exch() either clears the slot immediately (allowing the
new exchange to reuse it) or marks it as Dropped if MRP operations
are pending.
@jonlil jonlil force-pushed the fix/case-stale-exchange branch from 30715b1 to 30dd689 Compare April 11, 2026 10:02
@github-actions
Copy link
Copy Markdown

PR #407: Size comparison from 2af46c4 to 30dd689

Full report (8 builds for (core), dimmable-light, onoff-light, onoff-light-bt, speaker)
platform target config section 2af46c4 30dd689 change % change
(core) riscv32imac-unknown-none-elf infodefmt-optz-ltofat FLASH 403870 404060 190 0.0
RAM 68072 68072 0 0.0
thumbv6m-none-eabi infodefmt-optz-ltofat FLASH 337860 337940 80 0.0
RAM 64312 64320 8 0.0
thumbv7em-none-eabi infodefmt-optz-ltofat FLASH 312748 312812 64 0.0
RAM 63800 63808 8 0.0
x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 839687 839911 224 0.0
RAM 66616 66616 0 0.0
dimmable-light x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 1957576 1957800 224 0.0
RAM 52352 52352 0 0.0
onoff-light x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 1890784 1891000 216 0.0
RAM 52056 52056 0 0.0
onoff-light-bt x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 3218056 3218960 904 0.0
RAM 5784 5784 0 0.0
speaker x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 850424 850424 0 0.0
RAM 2856 2856 0 0.0

@jonlil
Copy link
Copy Markdown
Contributor Author

jonlil commented Apr 11, 2026

Here are before/after logs from our production setup (rs-matter server with 7 Matter endpoints, rs-matter client controller):

Before fix — server restart causes 30-60s reconnect delay:

T+0s   Client sends CASESigma1 (reusing exchange ID from previous session)
T+0s   Server: "Invalid opcode: CASESigma1, expected: CASESigma3"
T+0s   Server: "Abandoned because of error Error::Invalid"
T+15s  Client: CASE timed out
T+30s  Client retries with new exchange ID → Sigma1 → Sigma2 → Sigma3 → success
T+30s  Subscriptions primed

After fix — immediate reconnect:

T+0s   Server restarts
T+1s   Client sends CASESigma1
T+1s   Server: "New session request on existing exchange — peer restarted handshake, evicting stale exchange"
T+1s   CASESigma1 → Sigma2 → Sigma3 → CASE established
T+1s   Subscriptions primed

Updated to use remove_exch() per review feedback — the stale exchange slot is freed immediately (no pending MRP at CASESigma1 stage), so get_exch_for_rx won't shadow the new exchange.

@ivmarkov
Copy link
Copy Markdown
Contributor

Testing

and a Matter controller client (also rs-matter based).

How is this possible, given that rs-matter still lacks the capability to actually initiate a CASE session?

@jonlil
Copy link
Copy Markdown
Contributor Author

jonlil commented Apr 11, 2026

Testing

and a Matter controller client (also rs-matter based).

How is this possible, given that rs-matter still lacks the capability to actually initiate a CASE session?

Because I built that to but have not pushed that PR yet! Is this something you would like to discuss? Perhaps I should create that PR as well

@ivmarkov
Copy link
Copy Markdown
Contributor

ivmarkov commented Apr 11, 2026

Testing

and a Matter controller client (also rs-matter based).

How is this possible, given that rs-matter still lacks the capability to actually initiate a CASE session?

Because I built that to but have not pushed that PR yet! Is this something you would like to discuss? Perhaps I should create that PR as well

I feel that PR would be much more valuable, so would be interested to see it.

There is also #371 for it anyway.

With that said, estincelle was working on the "controller" aspect as well, so you might want to sync with them (look at the PRs merged from them so far) or to at least look at their rs-matter fork, especially this branch. Unfortunately, they are not so active (if at all) since beginning of this month so either they are on a long leave, or they pivoted to something else...?

As for this concrete PR, sure I can merge once it is stable and all, though it look a bit of a "hack". Is there anything in the Matter Core spec that mandates we should do what this PR does?

@ivmarkov
Copy link
Copy Markdown
Contributor

CC @mstallmo ^^^

@mstallmo
Copy link
Copy Markdown
Contributor

Testing

and a Matter controller client (also rs-matter based).

How is this possible, given that rs-matter still lacks the capability to actually initiate a CASE session?

Because I built that to but have not pushed that PR yet! Is this something you would like to discuss? Perhaps I should create that PR as well

I feel that PR would be much more valuable, so would be interested to see it.

There is also #371 for it anyway.

With that said, estincelle was working on the "controller" aspect as well, so you might want to sync with them (look at the PRs merged from them so far) or to at least look at their rs-matter fork, especially this branch. Unfortunately, they are not so active (if at all) since beginning of this month so either they are on a long leave, or they pivoted to something else...?

As for this concrete PR, sure I can merge once it is stable and all, though it look a bit of a "hack". Is there anything in the Matter Core spec that mandates we should do what this PR does?

Unfortunately some things have come up that have taken our attention away from finishing the controller features that we've been working on. Speaking for myself, I do intend to finish the work but I don't have a concrete timeline yet for when I'll be able get back to that.

That said if someone wants to finish the work before I can get back to focusing on it I'm happy to support in any way that I can.

@ivmarkov
Copy link
Copy Markdown
Contributor

Testing

and a Matter controller client (also rs-matter based).

How is this possible, given that rs-matter still lacks the capability to actually initiate a CASE session?

Because I built that to but have not pushed that PR yet! Is this something you would like to discuss? Perhaps I should create that PR as well

I feel that PR would be much more valuable, so would be interested to see it.
There is also #371 for it anyway.
With that said, estincelle was working on the "controller" aspect as well, so you might want to sync with them (look at the PRs merged from them so far) or to at least look at their rs-matter fork, especially this branch. Unfortunately, they are not so active (if at all) since beginning of this month so either they are on a long leave, or they pivoted to something else...?
As for this concrete PR, sure I can merge once it is stable and all, though it look a bit of a "hack". Is there anything in the Matter Core spec that mandates we should do what this PR does?

Unfortunately some things have come up that have taken our attention away from finishing the controller features that we've been working on. Speaking for myself, I do intend to finish the work but I don't have a concrete timeline yet for when I'll be able get back to that.

That said if someone wants to finish the work before I can get back to focusing on it I'm happy to support in any way that I can.

Well if you don't mind I would suggest @jonlil and his LLM to look at https://github.com/estincelle/rs-matter/tree/marko/TECH-66-case-initiator/rs-matter/src/sc/case and maybe even contribute (a variation of) that (with some unit tests) as the CASE initiator/controller implementation. I doubt their implementation is that much different anyway. There are only a few meaningful ways how the initiator code-path can be implemented anyway.

@jonlil
Copy link
Copy Markdown
Contributor Author

jonlil commented Apr 11, 2026

Wish it was possible to have reply discussions on GitHub, I will respond here instead. If anyone is interested I can show a full topology chart of my setup and what I have built so far and what I do for testing that the implementation works.

So I have already implemented the case initiator and yes, Claude with Opus was used as LLM, perhaps the test coverage is something I have to work on. I decided to put it behind a cargo feature since we bumped into growth in package size that might impact smaller devices, that MR is still open. So, I please continue the discussion on PR #408

And thanks for the help and great response on my, and claudes work 😄 Been quite many long and interesting nights with Matter 1.5 and home automation with automated tv morning shows as the target, and scheduling watching hockey when my team plays! Actually working!

@ivmarkov
Copy link
Copy Markdown
Contributor

Wish it was possible to have reply discussions on GitHub, I will respond here instead. If anyone is interested I can show a full topology chart of my setup and what I have built so far and what I do for testing that the implementation works.

Yes, seeing this would be useful.
Adding some unit tests is always nice, but at a minimum, we should strive to enable more and more XAML / Python tests (if there are relevant ones) in the cargo xtask itest tests harness that is essentially the Matter C++ SDK integration tests harness.

So I have already implemented the case initiator and yes, Claude with Opus was used as LLM, perhaps the test coverage is something I have to work on. I decided to put it behind a cargo feature since we bumped into growth in package size that might impact smaller devices, that MR is still open. So, I please continue the discussion on PR #408

Thanks. Will look into it tomorrow, as today I'm AFK mostly.

And thanks for the help and great response on my, and claudes work 😄 Been quite many long and interesting nights with Matter 1.5 and home automation with automated tv morning shows as the target, and scheduling watching hockey when my team plays! Actually working!

NP. Since you are explicitly mentioning Matter 1.5, I assume you might be doing something related to the camera support or at least WebRTC (because cast-like clusters are available since a long time and well before 1.5)?
If so, I would be interested in the WebRTC / TCPIP bits as well. That is, if you are willing to contribute.

When a stale exchange is evicted by post_recv (new CASESigma1 on an
existing exchange), the slot is set to None. The old Exchange object
still holds a reference to that slot index. When it is eventually
dropped, Drop calls remove_exch on the already-cleared slot, causing
an unwrap panic.

Handle the None case gracefully by returning true (already cleaned up)
instead of panicking.
@jonlil
Copy link
Copy Markdown
Contributor Author

jonlil commented Apr 12, 2026

Added a follow-up fix: the remove_exch function would panic (unwrap!) when called on an exchange slot that was already cleared. This happens when a stale exchange is evicted by post_recv (new CASESigma1 on an existing exchange) — the old Exchange object still holds a reference to the now-cleared slot index, and when it's eventually dropped, Drop calls remove_exch on the None slot.

Fix: return true (already cleaned up) instead of panicking.

@github-actions
Copy link
Copy Markdown

PR #407: Size comparison from 2af46c4 to 4d7e717

Full report (8 builds for (core), dimmable-light, onoff-light, onoff-light-bt, speaker)
platform target config section 2af46c4 4d7e717 change % change
(core) riscv32imac-unknown-none-elf infodefmt-optz-ltofat FLASH 403870 404022 152 0.0
RAM 68072 68072 0 0.0
thumbv6m-none-eabi infodefmt-optz-ltofat FLASH 337860 337924 64 0.0
RAM 64312 64320 8 0.0
thumbv7em-none-eabi infodefmt-optz-ltofat FLASH 312748 312792 44 0.0
RAM 63800 63808 8 0.0
x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 839687 839807 120 0.0
RAM 66616 66616 0 0.0
dimmable-light x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 1957576 1957744 168 0.0
RAM 52352 52352 0 0.0
onoff-light x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 1890784 1890888 104 0.0
RAM 52056 52056 0 0.0
onoff-light-bt x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 3218056 3218296 240 0.0
RAM 5784 5784 0 0.0
speaker x86_64-unknown-linux-gnu infologs-optz-ltofat FLASH 850424 850424 0 0.0
RAM 2856 2856 0 0.0

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants