Fix stale CASE exchange blocking new session handshake #407

jonlil wants to merge 2 commits into project-chip:main
Conversation
Code Review
This pull request modifies the session handling logic to support peer handshake restarts by detecting new session requests on existing exchanges and marking the old exchanges as dropped. The review feedback points out a critical routing issue where subsequent packets might still be processed by the stale exchange unless the lookup logic is updated to ignore dropped states. It also suggests using the remove_exch method for a cleaner and more idiomatic removal of the stale exchange.
```rust
} else {
    let exch = unwrap!(self.exchanges[exch_index].as_mut());

    exch.post_recv(&rx_header.plain, &rx_header.proto, epoch)?;

    Ok(false)
}
```

```rust
} else {
    if !rx_header.proto.is_initiator()
        || !MessageMeta::from(&rx_header.proto).is_new_exchange()
    {
        // Do not create a new exchange if the peer is not an initiator, or if
        // the packet is NOT a candidate for a new exchange
        // (i.e. it is a standalone ACK or a SC status response)
        Err(ErrorCode::NoExchange)?;
        return Ok(false);
    }
```
There is a significant routing issue here. If get_exch_for_rx returns a stale exchange that was marked as dropped (but not removed from the table), this else block will be executed for subsequent packets in the handshake (e.g., CASESigma3). This will cause the packet to be processed by the stale exchange instead of the new one created for the restarted handshake. To fully address this, get_exch_for_rx should be updated to ignore exchanges in the dropped state, ensuring that the transport can find the active exchange even if a stale one with the same ID still exists in the table.
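The suggested lookup change could be sketched as follows. This is a simplified model under assumed names: `Role`, `Exchange`, and the shape of `get_exch_for_rx` here are illustrative stand-ins, not the actual rs-matter definitions.

```rust
// Hypothetical, simplified model of the exchange table. Skipping
// `Role::Dropped` entries ensures a stale exchange with the same ID
// cannot shadow the exchange created for the restarted handshake.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Role {
    Responder,
    Dropped,
}

#[derive(Clone, Copy, Debug)]
struct Exchange {
    exch_id: u16,
    role: Role,
}

struct Session {
    exchanges: Vec<Option<Exchange>>,
}

impl Session {
    /// Find the active (non-dropped) exchange for an incoming packet.
    fn get_exch_for_rx(&self, exch_id: u16) -> Option<usize> {
        self.exchanges.iter().position(|slot| {
            matches!(slot, Some(exch) if exch.exch_id == exch_id && exch.role != Role::Dropped)
        })
    }
}
```

With a filter like this, subsequent handshake packets (e.g. `CASESigma3`) resolve to the new exchange even while a dropped one with the same ID still occupies a slot.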
Agreed this is a valid concern. With the updated code using remove_exch(), the slot is cleared immediately in the common case (no pending MRP at CASESigma1 stage). This means get_exch_for_rx won't find the stale exchange on subsequent packets.
In the rare edge case where MRP is pending and the exchange is only marked as Dropped (not removed), the stale exchange could theoretically shadow the new one. However, at the CASESigma1 stage no application data has been exchanged yet, so there should be no pending retransmissions or ACKs to delay removal.
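The two-path removal behavior described here (clear the slot immediately, or only mark as dropped when MRP work is pending) can be sketched with simplified stand-in types; the real `remove_exch` operates on rs-matter's own exchange structures, so this is only an illustration of the semantics.

```rust
// Hypothetical sketch: remove_exch frees the slot right away in the
// common case, and only defers (marking Dropped) when an ACK or
// retransmission is still outstanding.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Role {
    Responder,
    Dropped,
}

#[derive(Clone, Copy, Debug)]
struct Exchange {
    role: Role,
    /// True while an ACK or retransmission is still outstanding.
    mrp_pending: bool,
}

struct Session {
    exchanges: Vec<Option<Exchange>>,
}

impl Session {
    /// Returns `true` if the slot was freed immediately.
    fn remove_exch(&mut self, index: usize) -> bool {
        let slot = &mut self.exchanges[index];
        match slot {
            Some(exch) if exch.mrp_pending => {
                // MRP still needs the exchange; defer actual removal.
                exch.role = Role::Dropped;
                false
            }
            Some(_) => {
                // Nothing pending: clear the slot so a new exchange
                // can reuse it right away.
                *slot = None;
                true
            }
            None => true, // already cleaned up
        }
    }
}
```

At the `CASESigma1` stage no application data has flowed yet, so the `mrp_pending` branch should be the rare path, matching the reasoning above.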
```rust
if MessageMeta::from(&rx_header.proto).is_new_session() {
    warn!("New session request on existing exchange — peer restarted handshake, evicting stale exchange");
    let exch = unwrap!(self.exchanges[exch_index].as_mut());
    exch.role.set_dropped_state();
    // Fall through to create a new exchange below
```
Instead of manually setting the dropped state, consider using self.remove_exch(exch_index). This method is more idiomatic as it attempts to cleanly remove the exchange from the session's exchange table, or marks it as dropped if there are pending MRP operations (ACKs or retransmissions). This also increases the chance of freeing a slot in the exchange table for the new handshake if the stale exchange can be removed immediately.
```diff
 if MessageMeta::from(&rx_header.proto).is_new_session() {
     warn!("New session request on existing exchange — peer restarted handshake, evicting stale exchange");
-    let exch = unwrap!(self.exchanges[exch_index].as_mut());
-    exch.role.set_dropped_state();
+    self.remove_exch(exch_index);
     // Fall through to create a new exchange below
```
Good suggestion — updated to use remove_exch() which either clears the slot immediately or marks as Dropped if MRP is pending. This is cleaner and frees the slot for the new exchange when possible.
PR #407: Size comparison from 2af46c4 to 5043e71 Full report (8 builds for (core), dimmable-light, onoff-light, onoff-light-bt, speaker)
jonlil force-pushed from 5043e71 to 30715b1
When a Matter client restarts and sends a new `CASESigma1` reusing the same exchange ID as an in-progress handshake, the old exchange is still awaiting `CASESigma3`. The new `Sigma1` gets routed to the stale exchange, producing "Invalid opcode: CASESigma1, expected: CASESigma3", and the handshake fails. The client must then wait through its retry/backoff cycle (typically 30-60s) before succeeding on a subsequent attempt with a different exchange ID.

Fix: in `Session::post_recv()`, when an existing exchange receives a new session request (`CASESigma1` or `PBKDFParamRequest`), remove the stale exchange via `remove_exch()` and fall through to create a fresh one. `remove_exch()` either clears the slot immediately (allowing the new exchange to reuse it) or marks it as `Dropped` if MRP operations are pending.
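The evict-and-recreate decision described in the fix can be sketched as follows. This is a simplified model under assumed names (`Opcode`, `is_new_session`, and the slot-based table are illustrative, not the real rs-matter types) that only shows the routing decision, not MRP handling.

```rust
// Hypothetical sketch: a new-session packet arriving on an already
// occupied exchange ID evicts the stale exchange; any other packet
// is routed to the existing one.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Opcode {
    CaseSigma1,
    CaseSigma3,
    PbkdfParamRequest,
}

fn is_new_session(op: Opcode) -> bool {
    matches!(op, Opcode::CaseSigma1 | Opcode::PbkdfParamRequest)
}

struct Session {
    exchanges: Vec<Option<u16>>, // each slot holds an exchange ID
}

impl Session {
    /// Returns the slot index the packet should use.
    fn post_recv(&mut self, exch_id: u16, op: Opcode) -> usize {
        if let Some(index) = self.exchanges.iter().position(|s| *s == Some(exch_id)) {
            if is_new_session(op) {
                // Peer restarted the handshake: evict the stale
                // exchange so the slot can host a fresh one.
                self.exchanges[index] = None;
            } else {
                // Normal case: route to the existing exchange.
                return index;
            }
        }
        // Create a fresh exchange in the first free slot.
        let free = self
            .exchanges
            .iter()
            .position(|s| s.is_none())
            .expect("exchange table full");
        self.exchanges[free] = Some(exch_id);
        free
    }
}
```

In the real code the eviction goes through `remove_exch()`, which additionally defers removal when MRP work is pending; this sketch models only the immediate-removal case.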
jonlil force-pushed from 30715b1 to 30dd689
PR #407: Size comparison from 2af46c4 to 30dd689 Full report (8 builds for (core), dimmable-light, onoff-light, onoff-light-bt, speaker)
Here are before/after logs from our production setup (rs-matter server with 7 Matter endpoints, rs-matter client controller).

Before fix: server restart causes a 30-60s reconnect delay.

After fix: immediate reconnect.

Updated to use `remove_exch()`.
How is this possible, given that rs-matter does not yet have a CASE initiator (controller) implementation?
Because I built that too, but have not pushed that PR yet! Is this something you would like to discuss? Perhaps I should create that PR as well.
I feel that PR would be much more valuable, so I would be interested to see it. There is also #371 for it anyway. With that said, estincelle was working on the "controller" aspect as well, so you might want to sync with them (look at the PRs merged from them so far) or at least look at their rs-matter fork, especially this branch. Unfortunately, they have not been active (if at all) since the beginning of this month, so either they are on a long leave, or they pivoted to something else...?

As for this concrete PR, sure, I can merge it once it is stable and all, though it looks a bit of a "hack". Is there anything in the Matter Core spec that mandates we should do what this PR does?
CC @mstallmo ^^^
Unfortunately, some things have come up that have taken our attention away from finishing the controller features that we've been working on. Speaking for myself, I do intend to finish the work, but I don't have a concrete timeline yet for when I'll be able to get back to it. That said, if someone wants to finish the work before I can get back to focusing on it, I'm happy to support in any way that I can.
Well, if you don't mind, I would suggest @jonlil and his LLM look at https://github.com/estincelle/rs-matter/tree/marko/TECH-66-case-initiator/rs-matter/src/sc/case and maybe even contribute (a variation of) that (with some unit tests) as the CASE initiator/controller implementation. I doubt their implementation is that much different anyway; there are only a few meaningful ways in which the initiator code path can be implemented.
Wish it were possible to have reply discussions on GitHub; I will respond here instead. If anyone is interested, I can show a full topology chart of my setup, what I have built so far, and what I do to test that the implementation works.

So, I have already implemented the CASE initiator, and yes, Claude with Opus was used as the LLM; perhaps the test coverage is something I have to work on. I decided to put it behind a cargo feature since we bumped into growth in package size that might impact smaller devices, and that MR is still open. So, please continue the discussion on PR #408.

And thanks for the help and great response on my, and Claude's, work 😄 Been quite many long and interesting nights with Matter 1.5 and home automation, with automated TV morning shows as the target, and scheduling watching hockey when my team plays! Actually working!
Yes, seeing this would be useful.
Thanks. Will look into it tomorrow, as today I'm AFK mostly.
NP. Since you are explicitly mentioning Matter 1.5, I assume you might be doing something related to camera support, or at least WebRTC (because cast-like clusters have been available for a long time, well before 1.5)?
When a stale exchange is evicted by `post_recv` (new `CASESigma1` on an existing exchange), the slot is set to `None`. The old `Exchange` object still holds a reference to that slot index; when it is eventually dropped, its `Drop` impl calls `remove_exch` on the already-cleared slot, causing an `unwrap` panic. Handle the `None` case gracefully by returning `true` (already cleaned up) instead of panicking.
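A minimal sketch of the guard, with assumed simplified types (the real code works with rs-matter's exchange table, not this toy one):

```rust
// Hypothetical sketch of the fix: treat an already-empty slot as
// successfully cleaned up. The old Exchange handle's Drop path may
// call remove_exch on a slot that the eviction in post_recv has
// already cleared, so the call must be idempotent rather than
// unwrap-ing the slot contents.
struct ExchangeTable {
    slots: Vec<Option<u16>>, // each slot holds an exchange ID
}

impl ExchangeTable {
    /// Returns `true` when the slot is cleaned up, including the case
    /// where it was already `None` (previously an unwrap panic).
    fn remove_exch(&mut self, index: usize) -> bool {
        match self.slots[index].take() {
            Some(_id) => true, // removed now
            None => true,      // already evicted; nothing left to do
        }
    }
}
```

Calling `remove_exch` a second time on the same index is now a harmless no-op instead of a panic.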
Added a follow-up fix: the `Drop` path of the old `Exchange` called `remove_exch` on a slot that the eviction had already cleared, hitting an `unwrap` panic. Fix: return `true` (already cleaned up) instead of panicking.
PR #407: Size comparison from 2af46c4 to 4d7e717 Full report (8 builds for (core), dimmable-light, onoff-light, onoff-light-bt, speaker)
Problem
When a Matter client restarts and sends a new `CASESigma1` reusing the same exchange ID as an in-progress handshake, the old exchange is still awaiting `CASESigma3`. The new `Sigma1` gets routed to the stale exchange, producing "Invalid opcode: CASESigma1, expected: CASESigma3". The handshake fails and the client must wait through its retry/backoff cycle (typically 30-60s) before succeeding with a different exchange ID.
This is easily reproducible by restarting a Matter controller that connects to an rs-matter server — the first reconnect attempt always fails.
Fix
In `Session::post_recv()`, when an existing exchange receives a new session request (`CASESigma1` or `PBKDFParamRequest`), mark the stale exchange as `Dropped` and fall through to create a fresh one. The new handshake proceeds immediately from scratch.

Uses the existing `MessageMeta::is_new_session()` check and `Role::set_dropped_state()`; no new types or APIs needed.

Testing
Tested with a Zigbee-to-Matter bridge (rs-matter server) and a Matter controller client (also rs-matter based). Before the fix, every server restart caused a 30-60s reconnect delay. After the fix, the client reconnects on its first attempt.