Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(api): Various E-stop fixes #14929

Merged
merged 18 commits into from
Apr 19, 2024
Merged

fix(api): Various E-stop fixes #14929

merged 18 commits into from
Apr 19, 2024

Conversation

SyntaxColoring
Copy link
Contributor

@SyntaxColoring SyntaxColoring commented Apr 16, 2024

Overview

Fixes various issues with ProtocolEngine's handling of E-stops—some that go back to when the code was initially added, and some that I accidentally introduced while working on error recovery.

Closes EXEC-382.

Test Plan

All of the following should behave reasonably.

  • Runs:
    • Hitting the E-stop before anything has been done with the run, e.g. on the Opentrons App's run setup page.
    • Hitting the E-stop in the middle of a hardware command.
    • Hitting the E-stop in the middle of a non-hardware command (like waitForResume).
    • Hitting the E-stop in between commands.
      • EXEC-382 has a protocol to help test this.
  • Maintenance runs (like Labware Position Check):
    • Hitting the E-stop before anything has been done with the maintenance run.
    • Hitting the E-stop in the middle of a hardware command.
    • Hitting the E-stop in the middle of a non-hardware command.
    • Hitting the E-stop in between commands.

Given that we're not sure how much of this was working properly in v7.2, I think we can be flexible with our interpretation of "behaves reasonably." For example, it's now theoretically possible that the UI briefly shows the run as "stop requested," where it did not before. I think that's fine as long as nothing functionally bad happens, like the run hanging.

Changelog

  • ProtocolEngine.estop() was doing low-level state manipulation for individual commands. I accidentally broke this in refactor(api): Split UpdateCommandAction #14726 when I made the command lifecycle stricter; I made it start raising AssertionErrors.

    To fix this, .estop() is rewritten to do nothing more complicated than what .stop() does. This leaves the command state manipulation to CommandStore, where it canonically belongs and where it can be unit-tested easily. This is the change that fixes EXEC-382.

  • ProtocolEngine.estop() was eagerly dispatching a HardwareStoppedAction. This is normally reserved for after the protocol has exited and the hardware has settled. But it was issuing it right away, basically lying to higher layers about the engine being "finished."

    This is fixed by removing all of that from .estop(), and relying on the existing teardown code in .finish() instead. This makes E-stop handling work basically the same way as other fatal run errors.

  • We were naively calling ProtocolEngine.estop() from a hardware event callback. This did not look thread-safe to me: the hardware event callback presumably runs in the hardware API thread, and the ProtocolEngine lives in the main server thread. See ProtocolEngine's concurrency requirements.

    This is fixed by wrapping the call in loop.call_soon_threadsafe().

Review requests

None in particular.

Risk assessment

Medium. This can only be tested manually, and it's dealing with concurrency, and touching some confusing parts of our code.

The E-stop is a run-level fatal error, so we want to treat it like other run-level fatal errors: pass it in to ProtocolEngine.finish(), not ProtocolEngine.request_stop().
@SyntaxColoring SyntaxColoring marked this pull request as ready for review April 18, 2024 13:52
@SyntaxColoring SyntaxColoring requested a review from a team as a code owner April 18, 2024 13:52
@SyntaxColoring SyntaxColoring requested review from a team April 18, 2024 13:52
Copy link
Member

@sfoster1 sfoster1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great to me, as discussed

@DerekMaggio
Copy link
Contributor

DerekMaggio commented Apr 18, 2024

Could any of this affect other flows that are part of the ProtocolEngine? Namely, LPC using PE, and the PE calibration command.
We should do a sanity check and push the e-stop during these flows to ensure it functions as expected.

@SyntaxColoring
Copy link
Contributor Author

Could any of this affect other flows that are part of the ProtocolEngine? Namely, LPC using PE, and the PE calibration command. We should do a sanity check and push the e-stop during these flows to ensure it functions as expected.

It definitely can affect those, and we definitely should do that testing. I will test at least the things mentioned in the test plan. Feel free to add any more that you think of.

Copy link
Contributor

@TamarZanzouri TamarZanzouri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! TY!

@SyntaxColoring
Copy link
Contributor Author

SyntaxColoring commented Apr 18, 2024

I'm looking into issues where, after an E-stop, the run acts as if it was cancelled in the normal way, and tries to home. Fixed, see the messages in the commits below.

This is intended to feed into ProtocolEngine.finish()'s magic error code searching.
This reverts commit 320dfe8.

This solution was insufficient because it didn't handle the case where a command started and was then asyncio .cancel()'d by QueueWorker.cancel().
This fixes problems I introduced in this PR where certain E-stops would still try to drop tips and home at the end of the run.

I thought that inspecting the `error` argument passed in to `finish()` would be enough to determine whether the run is finishing because of an E-stop. It's not. When the physical E-stop button is pressed, our top-down interruption from `ProtocolEngine.estop()` races against any bottom-up interruption from the electronics. The top-down interruption can either:

* Raise a `RunStoppedError` when the protocol tries to run a new command
* Raise a `RunStoppedError` when when the current command is asyncio `.cancel()`'d

Neither of which currently carry any sign that they came from an E-stop. Until they do, we need a separate `from_estop` flag in `CommandState`.
We permanently set the run result to `FAILED` or `STOPPED` when a stop is requested, not just when the engine is `.finish()`'d. This seems premature to me, but, taking it for granted, it means we need to make sure that we choose `FAILED` if the stop was requested because of an E-stop.

Also, because of that premature setting of the run result, we need to make sure that when the E-stop-induced `FinishAction` comes in, the run result already being set doesn't stop it from setting the run error.
@SyntaxColoring SyntaxColoring merged commit 4418128 into edge Apr 19, 2024
44 checks passed
@SyntaxColoring SyntaxColoring deleted the estop_fixes branch April 19, 2024 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants