fix(api): Various E-stop fixes #14929
Conversation
Force-pushed from 08d0920 to f9d3ce9.
The E-stop is a run-level fatal error, so we want to treat it like other run-level fatal errors: pass it in to ProtocolEngine.finish(), not ProtocolEngine.request_stop().
This looks great to me, as discussed
Could any of this affect other flows that are part of the ProtocolEngine? Namely, LPC using PE, and the PE calibration command.
It definitely can affect those, and we definitely should do that testing. I will test at least the things mentioned in the test plan. Feel free to add any more that you think of.
Great work! TY!
This is intended to feed into ProtocolEngine.finish()'s magic error code searching.
This reverts commit 320dfe8. That solution was insufficient because it didn't handle the case where a command started and was then asyncio `.cancel()`'d by `QueueWorker.cancel()`.
This fixes problems I introduced in this PR where certain E-stops would still try to drop tips and home at the end of the run.

I thought that inspecting the `error` argument passed in to `finish()` would be enough to determine whether the run is finishing because of an E-stop. It's not. When the physical E-stop button is pressed, our top-down interruption from `ProtocolEngine.estop()` races against any bottom-up interruption from the electronics. The top-down interruption can either:

* Raise a `RunStoppedError` when the protocol tries to run a new command
* Raise a `RunStoppedError` when the current command is asyncio `.cancel()`'d

Neither of these currently carries any sign that it came from an E-stop. Until they do, we need a separate `from_estop` flag in `CommandState`.
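To make the race concrete, here is a minimal sketch of the idea. All names (`CommandState`, `stopped_by_estop`, `finish`) are illustrative, not the real opentrons API: because a `RunStoppedError` carries no sign of its origin, the state store has to remember separately that the stop came from an E-stop, and `finish()` consults that flag rather than the error it was handed.

```python
from dataclasses import dataclass
from typing import Optional


class RunStoppedError(Exception):
    """Raised when the engine has been told to stop (origin unknown)."""


@dataclass
class CommandState:
    # True if the stop was triggered by the physical E-stop button. The
    # RunStoppedError itself cannot tell us this, so we track it here.
    stopped_by_estop: bool = False
    run_error: Optional[str] = None


def finish(state: CommandState, error: Optional[Exception]) -> str:
    """Pick the final run result, preferring the E-stop flag over the error."""
    if state.stopped_by_estop:
        # Even if `error` is a plain RunStoppedError (or None), the flag
        # tells us the real cause was the E-stop.
        state.run_error = "E-stop pressed"
        return "failed"
    return "failed" if error is not None else "succeeded"
```

For example, `finish(CommandState(stopped_by_estop=True), RunStoppedError())` reports `"failed"` with the E-stop recorded, while a clean run with no flag and no error reports `"succeeded"`.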
We permanently set the run result to `FAILED` or `STOPPED` when a stop is requested, not just when the engine is `.finish()`'d. This seems premature to me, but, taking it for granted, it means we need to make sure that we choose `FAILED` if the stop was requested because of an E-stop. Also, because of that premature setting of the run result, we need to make sure that when the E-stop-induced `FinishAction` comes in, the run result already being set doesn't stop it from setting the run error.
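The precedence rule described above can be sketched as follows (class and method names here are hypothetical, for illustration only): the result is fixed as soon as the stop is requested, so an E-stop must force `FAILED` at that moment, and the later finish step must still be allowed to attach the run error even though the result is already set.

```python
from typing import Optional


class RunResultStore:
    """Toy model of the premature result-setting behavior described above."""

    def __init__(self) -> None:
        self.result: Optional[str] = None
        self.error: Optional[str] = None

    def handle_stop_requested(self, from_estop: bool) -> None:
        # The result is set permanently here, before finish() ever runs,
        # so this is where we must choose FAILED for an E-stop.
        if self.result is None:
            self.result = "failed" if from_estop else "stopped"

    def handle_finish(self, error: Optional[str]) -> None:
        # The result already being set must not prevent the E-stop-induced
        # finish from recording the run error.
        if self.error is None and error is not None:
            self.error = error
```

With this shape, an E-stop sequence (`handle_stop_requested(from_estop=True)` followed by `handle_finish(...)`) ends with both `result == "failed"` and the error recorded.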
Overview

Fixes various issues with `ProtocolEngine`'s handling of E-stops—some that go back to when the code was initially added, and some that I accidentally introduced while working on error recovery.

Closes EXEC-382.
Test Plan
All of the following should behave reasonably (including `waitForResume`).

Given that we're not sure how much of this was working properly in v7.2, I think we can be flexible with our interpretation of "behaves reasonably." For example, it's now theoretically possible that the UI briefly shows the run as "stop requested," where it did not before. I think that's fine as long as nothing functionally bad happens, like the run hanging.
Changelog
* `ProtocolEngine.estop()` was doing low-level state manipulation for individual commands. I accidentally broke this in refactor(api): Split UpdateCommandAction #14726 when I made the command lifecycle stricter; I made it start raising `AssertionError`s. To fix this, `.estop()` is rewritten to do nothing more complicated than what `.stop()` does. This leaves the command state manipulation to `CommandStore`, where it canonically belongs and where it can be unit-tested easily. This is the change that fixes EXEC-382.
* `ProtocolEngine.estop()` was eagerly dispatching a `HardwareStoppedAction`. This is normally reserved for after the protocol has exited and the hardware has settled. But it was issuing it right away, basically lying to higher layers about the engine being "finished." This is fixed by removing all of that from `.estop()`, and relying on the existing teardown code in `.finish()` instead. This makes E-stop handling work basically the same way as other fatal run errors.
* We were naively calling `ProtocolEngine.estop()` from a hardware event callback. This did not look thread-safe to me: the hardware event callback presumably runs in the hardware API thread, and the `ProtocolEngine` lives in the main server thread. See `ProtocolEngine`'s concurrency requirements. This is fixed by wrapping the call in `loop.call_soon_threadsafe()`.
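The `call_soon_threadsafe()` fix in the last point can be sketched like this. The callback and event names are illustrative, but the asyncio API is real: instead of touching the engine directly from the hardware thread, the callback hands the call back to the event loop's own thread.

```python
import asyncio
import threading


def main() -> list:
    events: list = []

    async def run() -> None:
        loop = asyncio.get_running_loop()

        def hardware_event_callback() -> None:
            # Runs in the hardware API thread. Mutating engine state here
            # directly would not be thread-safe, so we schedule the work
            # onto the loop that owns the engine instead.
            loop.call_soon_threadsafe(
                events.append, "estop handled on loop thread"
            )

        thread = threading.Thread(target=hardware_event_callback)
        thread.start()
        thread.join()
        # Yield control so the loop runs the scheduled callback.
        await asyncio.sleep(0)

    asyncio.run(run())
    return events
```

The key property is that `events.append` executes on the loop's thread during the `await`, not on the hardware thread, which is exactly the guarantee the fix relies on.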
.Review requests
None in particular.
Risk assessment
Medium. This can only be tested manually, and it's dealing with concurrency, and touching some confusing parts of our code.