Skip to content

Conversation

@nefelim4ag
Copy link
Collaborator

@nefelim4ag nefelim4ag commented Aug 13, 2025

Migration of the I2C read error to the host side. So, if the host requests the data, and the MCU is unable to read it, the host raises an exception and shuts down.

There is an optional retry for the reads, as they are currently retriable by semantics.
Because there is a retry with pause underneath, I'm not sure if it is really necessary to implement the same retry pause loop in the bus.py.

Something similar could be done to i2c_write. I'm a little unhappy that it would make i2c_write chains less effective; on the other hand, I'm not aware of the code, where they would be practically reasonable.

Generally, I would assume, I2C write is retriable, in the sense that it writes the final register state.
I'm a little unsure if we can still implement i2c_write chaining and if it is really required by any device/usage.
I do think it is possible to register an Async callback, which would raise the error inside the bus.py.
But it seems a little tricky, and there is a chance that the MCU would lose the error feedback in the communication.
But if we do always respond with the error/success feedback, I do not see a non-blocking way to do so in any meaningful manner, which would allow retrying the last write.

Anyway, I'm open to suggestions.

I think that it is okay for now to just raise an exception and stop the machine from the host side, instead of the MCU.
It is a bonus if we do implement auto-retry here for all other modules.

Thanks.


Depends on: #7012, #7020
Child of: #6689

@nefelim4ag nefelim4ag added the other work first Topic waiting for other changes to complete label Aug 13, 2025
@nefelim4ag nefelim4ag force-pushed the i2c-host-error-handling branch 2 times, most recently from 2df59c2 to b8e2da9 Compare August 13, 2025 21:47
@chestwood96
Copy link
Contributor

I am currently running it with an i2c filament width sensor and so far it looks good.

@KevinOConnor
Copy link
Collaborator

Thanks for working on this. Some random comments:

  • This change will require a large number of users to reflash their micro-controllers. If we do that, we should probably make all other i2c related communication changes at the same time (so that we don't put users in a situation where they have to reflash again a few weeks later). Wouldn't the i2c_write command also need to be modified to relay its status?
  • I don't understand the _DECL_STATIC_STR() stuff. The existing DECL_ENUMERATION() was implemented for this purpose. If for some reason standard enumerations aren't working for "response" messages then I'd guess we should fix that. Maybe I'm missing something though.

Cheers,
-Kevin

@nefelim4ag
Copy link
Collaborator Author

nefelim4ag commented Aug 15, 2025

I don't understand the _DECL_STATIC_STR() stuff. The existing DECL_ENUMERATION() was implemented for this purpose.

That way, I could avoid duplicating shutdown strings to pass the message back.
Right now, we can't get rid of the shutdown strings here, as long as the MCU part of the sensors does not handle errors.

I totally could redo it with the use of enumerations.

@nefelim4ag nefelim4ag force-pushed the i2c-host-error-handling branch 2 times, most recently from d262e3e to ca9180b Compare August 16, 2025 00:40
@nefelim4ag
Copy link
Collaborator Author

nefelim4ag commented Aug 16, 2025

Now it utilizes enum.
It is backward compatible, so older firmware should still work, no enforced migration.

(I could use try_lookup_command for migration, but it seems that simple enum checking should be enough)

@restoration2699
Copy link

restoration2699 commented Aug 16, 2025

I am very glad to see that someone is solving this problem.
It is very annoying to waste kilograms of plastic because of errors with BME 280.
For myself, I completely disabled the i2c bus error monitoring and now the host does not give an error even if the sensor is disconnected. This behavior is most acceptable for me.
But since I am not a programmer, I did it as best I can.
It would be nice if the developers did something similar, but correctly.
I also really want the host to give only a warning when errors occur, without shutting down.
I did it the following way.
correct 2 files, then flash firmware.
file i2ccmds.c

void i2c_shutdown_on_err(int ret)
{
switch (ret) {
// case I2C_BUS_NACK:
// shutdown("I2C NACK");
// case I2C_BUS_START_NACK:
// shutdown("I2C START NACK");
// case I2C_BUS_START_READ_NACK: 
// shutdown("I2C START READ NACK"); 
// case I2C_BUS_TIMEOUT: 
// shutdown("I2C Timeout"); 
}
}

file stm32\i2c.c

static int
i2c_start(I2C_TypeDef *i2c, uint8_t addr, uint8_t xfer_len, 
uint32_t timeout)
{ 
i2c->CR1 = I2C_CR1_START | I2C_CR1_PE; 
i2c_wait(i2c, I2C_SR1_SB, 0, timeout); 
i2c->DR = addr; 
if (addr & 0x01) 
i2c->CR1 |= I2C_CR1_ACK; 
int ret = i2c_wait(i2c, I2C_SR1_ADDR, 0, timeout); 
irqstatus_t flag = irq_save(); 
uint32_t sr2 = i2c->SR2; 
if (addr & 0x01 && xfer_len == 1) 
i2c->CR1 = I2C_CR1_STOP | I2C_CR1_PE; 
irq_restore(flag); 
if (!(sr2 & I2C_SR2_MSL)) 
// shutdown("Failed to send i2c addr"); 
return I2C_BUS_SUCCESS; 
return ret;
}

@chestwood96
Copy link
Contributor

I ran a bit over 1kg through the initial version of this just fine so far. I did switch to sw i2c because of the shutdown in

shutdown("Failed to send i2c addr");
but since then it's been smooth sailing

@nefelim4ag nefelim4ag force-pushed the i2c-host-error-handling branch 5 times, most recently from a3f3d8b to d830157 Compare August 17, 2025 20:48
@nefelim4ag
Copy link
Collaborator Author

I've spent some time thinking about how to properly do that and where to put the complexity.
Like, how it would look if I tried to write it from scratch.

As I2C writes now returns the status, there is an inherent retry from SerialRetryCommand.
So, it is retriable. As it is retriable, there would be 2 loops of retry inside the I2C class and inside the SerialRetryCommand.

To somehow get it under control and simplify things, it seems less ugly to me.
I've reimplemented the serialhdl retry logic inside the I2C class.
So, I can avoid thinking about 2 different loops of timeouts and retries, and exceptions.

Since i2c_write is now blocking, probably i2c_write_wait_ack() could be removed.
I assume that most devices would be fine with a retry of writes, so there is an added retry flag for writes.
In case somewhere, it does not hold true somewhere, it is possible to do whatever is required.

I assume that the I2C timeout is when something really wrong happens (hang of i2c hw) (or MCU too busy with timers).
Never heard of it in the wild, so I added logging for this specifically.
(We can drop it).

As the whole transfer now exists inside the one class, and there is an async response handler, if it is required.
I can technically mimic the serial queue sequence logic, add seq=%c field, track writes inside the I2C class, schedule many writes, ensuring that they are written, and add logic on the MCU to skip writes not in order.

Thanks.

@nefelim4ag nefelim4ag force-pushed the i2c-host-error-handling branch from d830157 to a68760a Compare August 17, 2025 21:22
@KevinOConnor
Copy link
Collaborator

Thanks for working on this. In general it seems useful. I have a few comments.

If I understand this PR correctly, it's mostly a "clean up" and it is not blocking other development work or features. Let me know if I got this wrong.

If so, I wonder if we could make this change in 3 phases:

  1. Change the host i2c_write() command to be blocking. That is, remove i2c_write_wait_ack() and make all callers use the underlying send_wait_ack() mechanism. This does mess with the "auto-init" mechanism (I2C write: use connect hook for early writes #7012), but maybe we could convert those callers to a new bus.i2c_write_init() type method?
  2. Update the mcu code to return the status back and no longer directly invoke a shutdown. The host will instead always invoke a shutdown if it receives a non-success status report.
  3. Change the host code to support propagation of i2c errors for devices that can handle that. Possibly implement some kind of automatic retry system for errors that appear transient.

For phase 2, I'd guess it would be okay to default to automatically retrying i2c_write() commands in cases where there is message loss between mcu and host. For the rare device that can't handle that, a bus.i2c_write(..., retry=False) could be used.

For phase 3, I'd guess we'd want to wait a month or two after implementing phase 2. I'm open to a different timeline, but as a high-level observation, if we run into issues with phase 1&2, it would seem really tricky to fix them if we've already started phase 3. Also, I'm not sure if we need a lot of automatic retransmit support for i2c, as I'd guess most devices just don't care at all (they don't mind a missing report) or really don't want to see any errors at all on the line. I could be wrong though.

So, it is retriable. As it is retriable, there would be 2 loops of retry inside the I2C class and inside the SerialRetryCommand.

For what it is worth, using the low-level mcu.register_response() seems a bit overkill to me. In particular, the thread safety issues and challenge in aligning commands to responses can be difficult to understand. If "double looping" is a concern, have you considered always using the existing i2c_send() with retry=False?

I assume that the I2C timeout is when something really wrong happens (hang of i2c hw) (or MCU too busy with timers).

It is for something really wrong at the mcu i2c hardware level. Some mcu hw have really bad i2c implementations. In particular, if the low-level software doesn't check for a timeout and the hardware doesn't respond as expected, then it leads to hard to debug watchdog reboots. The i2c timeout is not related to mcu being busy nor timer activity.

-Kevin

@nefelim4ag
Copy link
Collaborator Author

nefelim4ag commented Aug 21, 2025

This does mess with the "auto-init" mechanism (#7012), but maybe we could convert those callers to a new bus.i2c_write_init() type method?

I'm not sure I understood what you suggested here or how it would mess with it. Callers who depend on i2c_write() inside the __init__(), have been auto-migrated with the mentioned PR.
If there is a need to refactor them, to do i2c_write() on connect in a more direct way like bus.i2c_write_on_connect (?), I could do that.

The first phase

I added the #7020 with migration to synchronous i2c_write().

The second phase

This PR.
MCU does not shut down on sensor IO fail. I2C code propagates an error. Sensor fails.
If it fails, temp -> 0 or None, min > temp || temp > max -> shutdown.
If the heater stops receiving updates -> shutdown.
All other i2c users seem not to have exception handlers, so unhandled exception -> shutdown.

Optionally, there is an NACK retry.

temp -> 0 - https://github.com/Klipper3d/klipper/blob/master/klippy/extras/ads1x1x.py#L302
temp -> None - https://github.com/Klipper3d/klipper/blob/master/klippy/extras/sht3x.py#L143

Not sure which one is better.

The 3rd phase

I'm unaware of any devices that would require that.
It seems that it would probably be fixes for corner cases?

If "double looping" is a concern, have you considered always using the existing i2c_send() with retry=False?

It just feels a little bit ugly there would be another level of try .. except: but no problem.
Probably with the i2c_transfer it would be a little better.

if the low-level software doesn't check for a timeout and the hardware doesn't respond as expected, then it leads to hard to debug watchdog reboots. The i2c timeout is not related to mcu being busy nor timer activity.

Yes, I agree there should be a check, like a timeout for that.
I only mentioned timers because they do have a higher priority -> can happen in the middle of the I2C transaction, and if I understand the timer's scheduler code correctly - can block task code for more than 5ms.
That can lead to the I2C timeout afterwards.
(Because absolute time passed between the I2C start and the check > 5ms).

This is rare from a probability point; this just seems like a possible failure path.

Thanks.


i2c write/read merged to i2c_transfer.
No custom handlers.

Probably, I can propagate the exception from serialhdl instead of converting it to a string.

@nefelim4ag nefelim4ag force-pushed the i2c-host-error-handling branch 5 times, most recently from d325ffd to e6783a3 Compare August 22, 2025 00:51
@KevinOConnor
Copy link
Collaborator

I'm not sure I understood what you suggested here ...
If there is a need to refactor them, to do i2c_write() on connect in a more direct way like bus.i2c_write_on_connect (?), I could do that.

Yeah, I was saying we might want to refactor pca9533 and mcp4728 to use an explicit bus.i2c_write_on_connect() instead of having bus.i2c_write() infer that. (Because having i2c_write() be sometimes blocking can be confusing.) It's not a big deal though.

The first phase
I added the #7020 with migration to synchronous i2c_write().

Thanks. Looks good to me.

The second phase
This PR.

My suggestion is to move the retry logic in this PR to a third phase. That is, I'm suggesting this PR have a host i2c_transfer() function similar to:

    def i2c_transfer(self, write, read_len=0, minclock=0, reqclock=0,
                     retry=True):
        params = self.i2c_transfer_cmd.send([self.oid, write, read_len], minclock=minclock,
                     reqclock=reqclock, retry=retry)
        if params["i2c_bus_status"] != "SUCCESS":
            self.mcu.get_printer().shutdown("I2C mcu %s reports error %s" % (self.mcu.get_name(), params["i2c_bus_status"]))
        return params

That is, if there's no rush, can we make this PR have less behavioural changes. It'll still change the mcu->host protocol and still introduce redundant i2c writes on mcu to host message loss, but wont fundamentally change the error handling. Let me know if that doesn't work well for rollout timing or if I'm missing something.

Thanks again,
-Kevin

@nefelim4ag nefelim4ag force-pushed the i2c-host-error-handling branch 2 times, most recently from 4b15329 to 0ab378a Compare August 23, 2025 02:00
@nefelim4ag nefelim4ag force-pushed the i2c-host-error-handling branch from 0ab378a to 610ff20 Compare August 23, 2025 02:06
@KevinOConnor
Copy link
Collaborator

Thanks. Looks fine to me. How about we give a week or two to see that #7020 doesn't cause any surprises, and then look to commit this.

-Kevin

@github-actions
Copy link

Thank you for your contribution to Klipper. Unfortunately, a reviewer has not assigned themselves to this GitHub Pull Request. All Pull Requests are reviewed before merging, and a reviewer will need to volunteer. Further information is available at: https://www.klipper3d.org/CONTRIBUTING.html

There are some steps that you can take now:

  1. Perform a self-review of your Pull Request by following the steps at: https://www.klipper3d.org/CONTRIBUTING.html#what-to-expect-in-a-review
    If you have completed a self-review, be sure to state the results of that self-review explicitly in the Pull Request comments. A reviewer is more likely to participate if the bulk of a review has already been completed.
  2. Consider opening a topic on the Klipper Discourse server to discuss this work. The Discourse server is a good place to discuss development ideas and to engage users interested in testing. Reviewers are more likely to prioritize Pull Requests with an active community of users.
  3. Consider helping out reviewers by reviewing other Klipper Pull Requests. Taking the time to perform a careful and detailed review of others work is appreciated. Regular contributors are more likely to prioritize the contributions of other regular contributors.

Unfortunately, if a reviewer does not assign themselves to this GitHub Pull Request then it will be automatically closed. If this happens, then it is a good idea to move further discussion to the Klipper Discourse server. Reviewers can reach out on that forum to let you know if they are interested and when they are available.

Best regards,
~ Your friendly GitIssueBot

PS: I'm just an automated script, not a human being.

@nefelim4ag
Copy link
Collaborator Author

nefelim4ag commented Oct 11, 2025

I added an optional write-only flag to the I2C class as PoC.
A simple idea is that we can still handle the write_noack and can still provide some error feedback, without complicating all the callers. As suggested here: #7067 (comment)

So, technically, things like PCA9*, sx1509, displays & etc, can still use the i2c_write_noack, and still have some debugging options.

@nefelim4ag nefelim4ag force-pushed the i2c-host-error-handling branch from 5b2bd8e to 2467b0d Compare October 11, 2025 17:02
@KevinOConnor
Copy link
Collaborator

I added an optional write-only flag to the I2C class as PoC.

Thanks. Seems fine to me. If we go this route though, I think i2c.i2c_read(), i2c.i2c_transfer(), and similar would need to fail if the write only flag is set. Also, as a minor comment, "wronly" seems too close to "wrongly" - perhaps "write_only" instead.

Cheers,
-Kevin

@nefelim4ag nefelim4ag force-pushed the i2c-host-error-handling branch from 2467b0d to ecfb30f Compare October 11, 2025 21:10
@nefelim4ag nefelim4ag force-pushed the i2c-host-error-handling branch 3 times, most recently from e0e09b2 to 260aafc Compare October 21, 2025 20:37
@nefelim4ag nefelim4ag force-pushed the i2c-host-error-handling branch from 260aafc to 51e6e47 Compare November 18, 2025 15:47
@nefelim4ag nefelim4ag changed the title RFC: I2C klippy's side error handling I2C klippy's side error handling Dec 3, 2025
If the bus is write only, with new i2c_transfer code
it is possible to at least provide some feedback on error

Signed-off-by: Timofey Titovets <[email protected]>
@nefelim4ag nefelim4ag force-pushed the i2c-host-error-handling branch from 51e6e47 to 54e203a Compare December 3, 2025 23:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants