fix(robot-server): Refactor hardware wrapper and other lifetime dependencies #8213
Conversation
Codecov Report

```
@@            Coverage Diff             @@
##             edge    #8213      +/-   ##
==========================================
+ Coverage   74.80%   74.85%   +0.05%
==========================================
  Files        1661     1663       +2
  Lines       44316    44429     +113
  Branches     4433     4433
==========================================
+ Hits        33149    33256     +107
- Misses      10414    10420       +6
  Partials      753      753
```

Flags with carried forward coverage won't be shown.
Force-pushed from 33e4e11 to 4dfa729.
```python
else:
    # Create the hardware
    thread_manager = await initialize_api()
    _init_event_watchers(thread_manager, event_publisher)
```
ℹ️ This is technically a behavioral change!

Because of this line, this `initialize()` function will only return if `_init_event_watchers()` succeeds. If initializing event watchers fails, you'll get a 500 error on any endpoint that `Depends` on hardware.
The old code did this in initialization:
opentrons/robot-server/robot_server/hardware_wrapper.py
Lines 42 to 47 in 8f297af
```python
else:
    # Create the hardware
    self._tm = await initialize_api()
    await self.init_event_watchers()
    log.info("Opentrons API initialized")
    return self._tm
```
Note the assignment to `self._tm` instead of a local `_tm` variable.
And the old code implemented retrieval like this:
opentrons/robot-server/robot_server/hardware_wrapper.py
Lines 84 to 86 in 8f297af
```python
def get_hardware(self) -> Optional[ThreadManager]:
    """Get the wrapped ThreadManager, if it exists."""
    return self._tm
```
Which meant the hardware could appear ready and successfully initialized even if `_init_event_watchers()` raised an exception.
```python
# `from None` is important: without it, this new exception would be
# interpreted as an unexpected failure while handling the InvalidStateError.
# This affects what's printed in stack traces.
```
Is it weird for this comment to exist? `raise Foo() from None` feels obscure to me, but what am I going to do, copy this comment every single place I do `raise Foo() from None`? (It's also used in this PR's rewrite of the `get_hardware()` dependency, and I don't comment on it there.)
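For readers who haven't run into the idiom, here's a minimal, standalone illustration (not code from this PR) of what `from None` changes:

```python
# Standalone illustration of `raise ... from None`; not code from this PR.
class NotReadyError(Exception):
    pass


def get_result(done: bool) -> str:
    try:
        if not done:
            raise ValueError("result not set yet")  # stand-in for asyncio's InvalidStateError
        return "ready"
    except ValueError:
        # Without `from None`, the traceback prints both exceptions, joined by
        # "During handling of the above exception, another exception occurred".
        # With `from None`, only NotReadyError appears in the traceback.
        raise NotReadyError("still initializing") from None


try:
    get_result(done=False)
except NotReadyError:
    pass  # swallowed here just so the example runs cleanly
```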
Is an exception raise the best move for us here? We're already calling the method `get_if_ready`, so if we take inspiration from `dict.get`, I think it would be appropriate for this method to return `Optional[_T]`. The interface is for getting an object that may take a while to initialize; I think `None` makes sense as a return value for "not ready yet".

(Exception still makes sense to me for "initialization failed".)
I had the same thought.

Points in favor of raising:

- It matches the design of `asyncio.Future.result()`.
- In the edge case when `_T` is `None`, raising makes it easier for the API to distinguish between the "not ready yet" and "done" states. I think it might sometimes make sense for `_T` to be `None` when a dependency represents doing something, instead of providing something. Like how we have a dependency for checking version headers. Maybe initializing global logging objects should work like this? Maybe setting up notification subscriptions should work like this? I'm not sure yet.

Points in favor of returning `Optional[_T]`:

- The typechecker will force callers to acknowledge, somehow, that the object might not be ready when they request it. They need to implement their own error handling instead of relying on the exception propagating from this method.

On balance, I end up softly preferring raising.
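A rough sketch of the two shapes being compared here (hypothetical class and method names, not the PR's actual code):

```python
# Hypothetical sketch comparing the two interfaces discussed above; names are
# illustrative and don't match the PR's actual slow_initializing module.
from typing import Generic, Optional, TypeVar

_T = TypeVar("_T")


class NotYetReadyError(Exception):
    pass


class SlowlyInitializing(Generic[_T]):
    def __init__(self) -> None:
        self._result: Optional[_T] = None

    # Option A: raise, like asyncio.Future.result() does before the future is done.
    def get_if_ready(self) -> _T:
        if self._result is None:
            raise NotYetReadyError()
        return self._result

    # Option B: return Optional[_T], like dict.get(); the typechecker forces callers
    # to handle None, but "not ready" and a legitimately-None result look the same.
    def get_if_ready_or_none(self) -> Optional[_T]:
        return self._result
```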
@@ -10,14 +10,11 @@
ℹ️ This diff got weird. I suggest viewing this file in split mode.
Haven't tested yet, but looks pretty good. Thoughts below in advance of pairing and review tomorrow
@@ -0,0 +1,250 @@
"""Dependencies (in the FastAPI sense) that last for the lifetime of the server.
Maybe this should be a submodule of `util`?
I personally dislike `util` as a module name / concept. I prefer where this sits right now.
```python
finally:
    try:
```
This nested `try` might be hard to understand, and its duplication across all these context managers is a red flag.

Possible solutions:

- Centralize this logic under `SlowInitializing`.
- Do:

  ```python
  try:
      yield slow_initializing
  finally:
      with contextlib.suppress(InitializationFailedError):
          fully_initialized_result = await slow_initializing.get_when_ready()
          await fully_initialized_result.on_shutdown()
  ```

  [Edit: No, I don't think this addresses the root problem. This layer should not be responsible for cleaning up after `SlowInitializing`. `SlowInitializing` should clean up after itself.]
- Just not suppress the `InitializationFailedError`. Let it propagate? [Edit: No, I think this would make things very confusing when the server exits with a traceback. Errors that have already been logged could get re-printed multiple times. And really, if something fails to open, it doesn't make sense to propagate that when you close it. The closing is a no-op because there's nothing to close.]
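As a sketch of the first option, with the wrapper owning its own cleanup (the method names here are assumptions; `on_shutdown()` is borrowed from the snippet above):

```python
# Hypothetical sketch of option 1: the wrapper owns its own cleanup, so individual
# context managers don't each need the nested try / contextlib.suppress dance.
import asyncio
from typing import Awaitable, Callable, Generic, Protocol, TypeVar


class SupportsShutdown(Protocol):
    async def on_shutdown(self) -> None: ...


_T = TypeVar("_T", bound=SupportsShutdown)


class SlowInitializing(Generic[_T]):
    def __init__(self, factory: Callable[[], Awaitable[_T]]) -> None:
        self._task = asyncio.ensure_future(factory())

    async def get_when_ready(self) -> _T:
        return await asyncio.shield(self._task)

    async def clean_up(self) -> None:
        # If initialization failed, there's nothing to close; swallow that case here,
        # once, instead of every caller suppressing InitializationFailedError.
        try:
            result = await asyncio.shield(self._task)
        except Exception:
            return
        await result.on_shutdown()
```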
* Fix `_set()` not being used.
* Rename `_set()` to `_store_on_app()`.
* Add `_delete_stored_on_app()` and use it on cleanup.
Partially reverts commit 686b1fa.
Sorry this has taken so long with travels, but I've been thinking about this a lot so wanted to get this review in. We definitely need to chat about this in person to resolve it, I think, or this will take forever 🙃

On the whole, I worry this PR does too much and adds a lot of hard-to-read / hard-to-follow code to solve problems that aren't quite the problems we're actually experiencing.

Acceptance criteria that I see:

- Make sure hardware (and probably the RPC server, which also spins up threads) are initialized properly
- Make sure child threads are shut down at app shutdown, mostly to make sure our test suite keeps working until we can address our mocking / lack-of-mocking problems in tests

I think `ProtocolManager` and `SessionManager` cleanups are red herrings and are mostly - if not entirely - unnecessary. They are not production facing, and arguably not necessary at all (even the file cleanup done by `ProtocolManager`, because the files are placed in a temporary directory).

Focusing on our acceptance criteria of "make sure hardware stuff can spin up in the background", I think we should take the following simplification steps:

- An individual dependency can declare its own cleanup handlers or register them with a cleanup system rather than centralizing all lifetime dependencies into one place
- We lean into process termination as cleanup, as it's the strategy we use in both dev and production
  - For further reading, I've been very influenced in this regard by the philosophy of "let it crash" and crash-only software
```python
app.include_router(router=router)


app.on_event("startup")(initialize_logging)
```
Note that this `@on_event("startup")` is really Starlette's, not FastAPI's: from reading the source, FastAPI does not define `on_event`; it's definitely a straight pass-through, because the `FastAPI` class inherits from `Starlette`.

I think using the `__init__` arguments (which FastAPI also accepts) is the best available move. If we must use `@app.on_event` (which may turn out to be the case depending on how badly we want the `app.state` dictionary in there), I think all of them should be defined in the top-level `app.py`.
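For reference, a minimal sketch of the `__init__`-based wiring being suggested; `shutdown_everything` is a placeholder handler, not something this PR defines:

```python
# Minimal sketch of constructor-based lifecycle wiring; shutdown_everything is a
# placeholder handler name, not something defined in this PR.
from fastapi import FastAPI


def initialize_logging() -> None:
    ...


async def shutdown_everything() -> None:
    ...


# FastAPI passes on_startup / on_shutdown straight through to Starlette, just like
# @app.on_event(...), but the lifecycle wiring stays visible in one place.
app = FastAPI(
    on_startup=[initialize_logging],
    on_shutdown=[shutdown_everything],
)
```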
```python
app.on_event("startup")(initialize_logging)
lifetime_dependencies.install_startup_shutdown_handlers(app)
```
Passing `app` as an argument to a function that's going to mutate `app` is a warning sign to me. From what I can tell, `lifetime_dependencies` requires two things in the current PR:

1. `app`, to mutate the startup and shutdown handlers
2. `app.state`, to attach persistent data

I think (1) should be removed in favor of the app setup being in charge of attaching handlers exposed by the `lifetime_dependencies` module, and any other methods should operate on a `State` object rather than a `FastAPI` `app`.
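A small sketch of that split (the `on_startup` / `on_shutdown` names are stand-ins for whatever `lifetime_dependencies` would actually export): the handlers only ever see `app.state`, and the app setup does the attaching.

```python
# Hypothetical sketch of the split suggested above; on_startup/on_shutdown are
# stand-ins for whatever lifetime_dependencies would actually export.
from fastapi import FastAPI
from starlette.datastructures import State


async def on_startup(state: State) -> None:
    state.lifetime_dependencies = {}  # attach persistent data to app.state only


async def on_shutdown(state: State) -> None:
    state.lifetime_dependencies = None


# The app setup, not lifetime_dependencies, is in charge of wiring the handlers up.
app = FastAPI()


@app.on_event("startup")
async def _startup() -> None:
    await on_startup(app.state)


@app.on_event("shutdown")
async def _shutdown() -> None:
    await on_shutdown(app.state)
```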
```python
* We need to clean it up when the server shuts down. Note that because we support hot
  reloading, we do need to explicitly clean up most resources; we can't rely on things
  automatically getting cleaned up when the process ends.
```
Pulling from the "Details" section:

> Note that because we support hot reloading, we do need to explicitly clean up most resources; we can't rely on things automatically getting cleaned up when the process ends.

I pulled this quote out because it feels to me like a lot of this PR comes out of this assumption, but I don't see how it can be true. Uvicorn runs the server in a subprocess and terminates the process on reload. This makes sense, because there's no other way to reliably reload Python files.

This, to me, says we can absolutely rely on process termination for cleanup. If we can rely on process termination, how much of this PR is necessary? I think cleanup may be important for our tests, but that's a test problem, not an application problem.
```python
lifetime_dependencies.install_startup_shutdown_handlers(app)


__all__ = ["app"]
```
These `__all__` variables in internal, non-init files feel a little redundant to me.
"""Handle app startup.""" | ||
initialize_logging() | ||
# Initialize api | ||
(await get_hardware_wrapper()).async_initialize() |
Ignoring the fact that it was broken, I dislike this removal. If I'm trying to establish context while reading this server code, the question "how is the hardware connection created?" is really important, and when we put it here, in the app-level startup handler, the pathway to find that answer is really obvious
@@ -0,0 +1,83 @@
"""Utilities for initializing the hardware interface."""
"Utilities" is always a bit of a warning sign for me. Something you need utilities, but if we have a hardware connection we need to initialize and hand out as a dependency, shouldn't those all be in the same place?
Is there a way we can restructure this PR to have robot_server/hardware.py
be the thing that actually exports a startup / shutdown handler for the hardware along with the get_hardware
dependency function? I'm imagining a system where any given dependency is able to use classes / functions in lifetime_dependencies
to declare themselves a lifetime dependency.
For example, if this module was to export 2 or 3 methods, I think everything would be a lot easier to follow:
def initialize_hardware(app_state: State) -> None
- Creates a backround task to initialize a ThreadManager
- Uses
slow_initializing
and/orlifetime_dependencies
to attach it to app state - (Maybe) registers cleanup with
lifetime_dependencies
def get_hardware(app_state: State) -> ThreadManager
- Retrieve the slow initializing hardware
- Raise if not initialized
- (Maybe)
def shutdown_hardware(app_state: State) -> None
- Shut down the hardware
- The necessity of this would depend on if
lifetime_dependencies
centralized cleanup or not
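A rough sketch of that shape, with hypothetical internals (the real `ThreadManager` construction is elided; the attribute and exception names are illustrative only):

```python
# Rough sketch of the proposed robot_server/hardware.py shape. Everything here is
# illustrative: the attribute name, the exception, and the stand-in factory are
# assumptions, not the PR's actual implementation.
import asyncio
from typing import Any

from starlette.datastructures import State


class HardwareNotYetInitializedError(Exception):
    """Raised when an endpoint asks for hardware before initialization finishes."""


async def _build_hardware() -> Any:
    """Stand-in for the real factory, e.g. opentrons' initialize_api()."""
    raise NotImplementedError


def initialize_hardware(app_state: State) -> None:
    """Kick off hardware initialization in a background task and attach it to app state."""
    app_state.hardware_init_task = asyncio.ensure_future(_build_hardware())


def get_hardware(app_state: State) -> Any:
    """Retrieve the slowly-initializing hardware, raising if it isn't ready yet."""
    task = getattr(app_state, "hardware_init_task", None)
    if task is None or not task.done():
        raise HardwareNotYetInitializedError()
    return task.result()  # re-raises the original error if initialization failed


async def shutdown_hardware(app_state: State) -> None:
    """Cancel initialization (or clean up the built hardware) at app shutdown."""
    task = getattr(app_state, "hardware_init_task", None)
    if task is not None:
        task.cancel()
```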
We're deeming this not worth the complexity given that #8278 solves #8023. #8051 is still a problem, but we think we'll just live with it and wait for FastAPI to give us a first-party solution instead of trying to DIY it ourselves. I need to go through this diff and see if there are any other little fixes worth pulling out, but then we can close this PR.
Overview
This PR refactors and fixes how robot-server sets up and tears down its lifetime dependencies. Lifetime dependencies are dependencies that should be initialized when the server starts, and cleaned up when the server stops.
The goal is to address the specific issues of #8051 and #8023, and also some of the structural traps (in my opinion) that allowed them to exist in the first place.
Changes
Fix #8051 and fix #8023
Before this PR, all of our dependencies were created through FastAPI dependency functions.
Some of those dependency objects needed special initialization or cleanup. To do this, our startup/shutdown code fetched the objects like this:
opentrons/robot-server/robot_server/app.py
Lines 58 to 63 in 4be5bc0
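(The embedded permalink doesn't render here; based on the same startup handler quoted earlier in this conversation, the referenced lines looked roughly like this.)

```python
# Approximate content of the referenced app.py lines, reconstructed from the snippet
# quoted earlier in this conversation; the surrounding decorator and signature are assumed.
@app.on_event("startup")
async def on_startup() -> None:
    """Handle app startup."""
    initialize_logging()
    # Initialize api
    (await get_hardware_wrapper()).async_initialize()
```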
(`get_hardware_wrapper()` was one of our FastAPI dependency functions.)

It turns out that this was an invalid thing to do, and could corrupt objects in subtle ways. See #8051 for details. We needed a way for our startup/shutdown code to do its job without calling any FastAPI dependency functions.
This PR accomplishes that by flipping the relationship between our startup code and FastAPI dependency functions.
Separate our initialization logic from the machinery that runs that logic in the background
Before this PR, we had a `HardwareWrapper` class that acted as two things: it held the hardware initialization logic itself, and it owned the background task and provided access to the hardware API whenever it became available.

This PR separates those concerns into two independent units.
The first new unit is a `hardware_initialization` module. This has an `async def initialize()` function with a simple interface: it either returns a fully-formed, fully-usable hardware object, or it raises an exception.

The second unit is a `slow_initializing` module. This has a wrapper class that wraps any factory function, runs it in the background, and provides access to its result whenever it becomes available.
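Roughly, the two units compose like this at startup (the function name `start_initializing` and the import paths are assumptions for illustration, not necessarily the PR's exact API):

```python
# Hypothetical composition of the two new units; start_initializing and the import
# paths are illustrative assumptions, not necessarily the PR's exact names.
from robot_server import hardware_initialization, slow_initializing


async def on_startup(app_state) -> None:
    # slow_initializing knows nothing about hardware: it just runs the factory in the
    # background and hands out the fully-formed result once it becomes available.
    app_state.hardware = slow_initializing.start_initializing(
        factory=hardware_initialization.initialize,
    )
```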
Separating these units and fleshing them out wins us:

- Better encapsulation: `HardwareWrapper` exposed initialization methods to endpoint-level code, whereas now we just expose the fully-formed, ready-to-use result. We also remove windows for partial initialization, simplifying the number of possible states that the object could be in, hopefully making it easier to reason about.
- A cleaner split between the initialization logic and the background-running machinery; `HardwareWrapper` conflated the two.
- Ownership of the background task: `HardwareWrapper` orphaned its background task, so exceptions wouldn't propagate anywhere, among other potential problems.

Lifetime dependencies other than the hardware API also now use the `slow_initializing` machinery. This has to do with the changes explained in the section above. Formerly, if dependency `get_foo()` depended on `get_hardware()`, and something tried accessing `get_foo()` before `get_hardware()` was ready, FastAPI's dependency resolution magic would make `get_foo()` automatically raise a "hardware not ready" exception. But now, since we're initializing lifetime dependencies outside of FastAPI, we have to implement that ourselves.

Set up a pattern of defining lifetime dependencies as context managers
- All lifetime dependencies are now collected into a single `dataclass` containing every lifetime dependency.
- Each lifetime dependency is defined by `@contextlib.contextmanager`-like factories. So, if all of this is just a holdover until FastAPI adds proper support, our holdover will at least be structurally similar to the ultimate solution.
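A minimal sketch of what one of these context-manager-style factories looks like (the dependency and field names here are hypothetical, not the PR's actual dataclass):

```python
# Minimal sketch of the context-manager pattern described above; the field and
# factory names are hypothetical, not the PR's actual code.
import contextlib
from dataclasses import dataclass
from typing import AsyncIterator


@dataclass(frozen=True)
class LifetimeDependencySet:
    """One attribute per lifetime dependency, built at startup, torn down at shutdown."""
    rpc_server: object
    protocol_manager: object


@contextlib.asynccontextmanager
async def _protocol_manager() -> AsyncIterator[object]:
    manager = object()  # build the dependency here
    try:
        yield manager
    finally:
        pass  # clean the dependency up here
```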
Review requests

Code review requests

Manual test requests

- Run `make test` and make sure it doesn't hang for you. At some of the intermediate commits in this PR, I thought I was experiencing mysterious hanging tests, but I can't reproduce it now.
- On a robot, watch `journalctl -f opentrons-robot-server`; run `systemctl stop opentrons-robot-server`; and make sure the service stops cleanly, except for the expected errors listed below.
- Run `make dev` and make sure...`^C`, except for the expected errors listed below.

I will also do all of these before merging.
Expected errors

- Errors about active RPC connections (`Cannot call "send" once a close message has been sent`). I don't know exactly what this is, but it's not new. See bug: robot-server errors during shutdown if there's an active RPC connection #8155.
- `Task was destroyed but it is pending!` errors relating to `opentrons/hardware_control/poller.py`. This is bug: Module-polling tasks leak on server shutdown #8239.

Risk assessment
Medium.
Risk is brought up because this is critical to everything that the OT-2 does.
Risk is brought down because problems with this code should be obvious with a smoke test.