
fix(robot-server): Refactor hardware wrapper and other lifetime dependencies #8213

Closed
wants to merge 49 commits

Conversation

@SyntaxColoring (Contributor) commented Aug 9, 2021

Overview

This PR refactors and fixes how robot-server sets up and tears down its lifetime dependencies. Lifetime dependencies are dependencies that should be initialized when the server starts, and cleaned up when the server stops.

The goal is to address the specific issues of #8051 and #8023, and also some of the structural traps (in my opinion) that allowed them to exist in the first place.

Changes

Fix #8051 and fix #8023

Before this PR, all of our dependencies were created through FastAPI dependency functions.

Some of those dependency objects needed special initialization or cleanup. To do this, our startup/shutdown code fetched the objects like this:

@app.on_event("startup")
async def on_startup() -> None:
    """Handle app startup."""
    initialize_logging()
    # Initialize api
    (await get_hardware_wrapper()).async_initialize()

(get_hardware_wrapper() was one of our FastAPI dependency functions.)

It turns out that this was an invalid thing to do, and it could corrupt objects in subtle ways. See #8051 for details. We needed a way for our startup/shutdown code to do its job without calling any FastAPI dependency functions.

This PR accomplishes that by flipping the relationship between our startup code and our FastAPI dependency functions, as sketched below:

  • Our startup code now initializes lifetime dependencies on its own. It places the result objects in a dedicated place.
  • While the server is running, our FastAPI dependency functions act as simple getters, fetching from that dedicated place.
  • Our shutdown code fetches the lifetime dependency objects from that dedicated place and runs each one's cleanup code.
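
In rough terms, the pattern looks something like this. It's an illustration only, not the code in this PR: the "dedicated place" is shown as FastAPI's app.state, and FakeHardware and clean_up are stand-in names.

from fastapi import FastAPI, Request

app = FastAPI()

class FakeHardware:
    """Stand-in for a real lifetime dependency that needs cleanup."""
    async def clean_up(self) -> None:
        pass

@app.on_event("startup")
async def on_startup() -> None:
    # Startup initializes the lifetime dependency itself, without calling
    # any FastAPI dependency function, and stashes it in a dedicated place.
    app.state.hardware = FakeHardware()

def get_hardware(request: Request) -> FakeHardware:
    # While the server runs, the dependency function is just a getter.
    return request.app.state.hardware

@app.on_event("shutdown")
async def on_shutdown() -> None:
    # Shutdown fetches from the same place and runs the object's cleanup.
    await app.state.hardware.clean_up()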

Separate our initialization logic from the machinery that runs that logic in the background

Before this PR, we had a HardwareWrapper class that acted as two things:

  1. A factory function. It coordinated things to initialize the hardware API however we needed it to be initialized.
  2. A handle for doing that in the background. It let us start initializing as soon as the server started, allowing other things to happen concurrently. HardwareWrapper owned the background task and provided access to the hardware API whenever it became available.

This PR separates those concerns into two independent units.

The first new unit is a hardware_initialization module. This has an async def initialize() function with a simple interface: it either returns a fully-formed, fully-usable hardware object, or it raises an exception.

The second unit is a slow_initializing module. This has a wrapper class that wraps any factory function, runs it in the background, and provides access to its result whenever it becomes available.
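
As a rough sketch of that wrapper's shape (the method names get_if_ready and get_when_ready appear in the review discussion below; everything else here, including InitializationOngoingError, is assumed for illustration and may not match the real module):

import asyncio
from typing import Any, Callable, Coroutine, Generic, TypeVar

_T = TypeVar("_T")

class InitializationOngoingError(Exception):
    """The factory function hasn't finished yet (name assumed)."""

class SlowInitializing(Generic[_T]):
    """Run a factory function in the background and expose its result."""

    def __init__(self, factory: Callable[[], Coroutine[Any, Any, _T]]) -> None:
        # Must be constructed inside a running event loop. Keeping a reference
        # to the task means it isn't orphaned, so exceptions have somewhere
        # to propagate.
        self._task = asyncio.create_task(factory())

    def get_if_ready(self) -> _T:
        if not self._task.done():
            # Distinguish "not ready yet" from "initialization failed."
            raise InitializationOngoingError()
        return self._task.result()  # re-raises the factory's exception on failure

    async def get_when_ready(self) -> _T:
        # Wait for the factory to finish, propagating any failure.
        return await self._task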

Separating these units and fleshing them out wins us:

  • The ability to test and document these two concerns independently, and hopefully more thoroughly.
  • A simplified hardware API object, with fewer windows for misuse. The old HardwareWrapper exposed initialization methods to endpoint-level code, whereas now we just expose the fully-formed, ready-to-use result. We also remove windows for partial initialization, reducing the number of possible states the object can be in and hopefully making it easier to reason about.
  • The ability for endpoint-level code to distinguish between the hardware API having failed to initialize, versus simply not being ready yet. The old HardwareWrapper conflated the two.
  • No orphaned background tasks. The old HardwareWrapper orphaned its background task, so exceptions wouldn't propagate anywhere, among other potential problems.

Lifetime dependencies other than the hardware API also now use the slow_initializing machinery. This has to do with the changes explained in the section above. Formerly, if dependency get_foo() depended on get_hardware(), and something tried accessing get_foo() before get_hardware() was ready, FastAPI's dependency resolution magic would make get_foo() automatically raise a "hardware not ready" exception. But now, since we're initializing lifetime dependencies outside of FastAPI, we have to implement that ourselves.
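
As an illustration of what that looks like (names, the attribute on app.state, and the 503 status are all assumed; the PR's real code uses the slow_initializing wrapper rather than a bare asyncio task):

import asyncio
from fastapi import Depends, FastAPI, HTTPException, Request

app = FastAPI()

def get_hardware(request: Request) -> object:
    # The background initialization is assumed to have been stashed on
    # app.state at startup; a bare asyncio.Task stands in for the wrapper.
    task: "asyncio.Task[object]" = request.app.state.hardware_init_task
    if not task.done():
        # We have to surface "not ready yet" ourselves now.
        raise HTTPException(status_code=503, detail="Hardware not ready yet.")
    return task.result()  # re-raises if initialization failed

def get_foo(hardware: object = Depends(get_hardware)) -> dict:
    # get_foo no longer gets "hardware not ready" handling for free from
    # FastAPI's dependency resolution; it relies on get_hardware raising.
    return {"uses": hardware}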

Set up a pattern of defining lifetime dependencies as context managers

  • It's natural to express each individual lifetime dependency as a context manager: it's a resource that needs cleanup, and context managers are Python's idiom for resources that need cleanup.
    • Doing this as a pattern also made it more obvious that we were forgetting to clean up certain lifetime dependencies, so that's fixed now.
  • We use standard Python tools to compose our many small context managers into one big monolithic one. The monolithic one returns a dataclass containing every lifetime dependency. (See the sketch after this list.)
    • This simplifies the module's interface: it stashes one big object for external code to retrieve later, instead of many small objects.
    • It's also convenient to transform this into a startup/shutdown function pair.
  • FastAPI and Starlette seem to be figuring out how to support lifetime dependencies in a more integrated way. The exact API still seems up in the air, but, according to some GitHub threads that I've linked in the code, it seems very likely that it will somehow be based on context managers or @contextlib.contextmanager-like factories. So, if all of this is just a holdover until FastAPI adds proper support, our holdover will at least be structurally similar to the ultimate solution.
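
One way that composition can look with the standard library (a sketch under assumed names; the real module's fields and context managers differ):

import contextlib
from dataclasses import dataclass
from typing import AsyncIterator

@dataclass(frozen=True)
class LifetimeDependencySet:
    """Every lifetime dependency, bundled into one object (fields illustrative)."""
    hardware: object
    rpc_server: object

@contextlib.asynccontextmanager
async def _hardware() -> AsyncIterator[object]:
    hardware = object()  # stand-in for the real initialization
    try:
        yield hardware
    finally:
        pass  # real cleanup would go here

@contextlib.asynccontextmanager
async def _rpc_server(hardware: object) -> AsyncIterator[object]:
    server = object()  # stand-in
    try:
        yield server
    finally:
        pass  # real cleanup would go here

@contextlib.asynccontextmanager
async def lifetime_dependencies() -> AsyncIterator[LifetimeDependencySet]:
    # Compose the small context managers into one big one. Exiting the
    # AsyncExitStack runs each cleanup in reverse order of entry.
    async with contextlib.AsyncExitStack() as stack:
        hardware = await stack.enter_async_context(_hardware())
        rpc_server = await stack.enter_async_context(_rpc_server(hardware))
        yield LifetimeDependencySet(hardware=hardware, rpc_server=rpc_server)

Entering the big context manager in a startup handler and exiting it in a shutdown handler is roughly what turning it into a startup/shutdown function pair means here.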

Review requests

Code review requests

  • After getting a feel for the production code, please look over the changes to tests and test fixtures. Those are the parts I'm least confident about.

Manual test requests

  • Please run make test and make sure it doesn't hang for you. At some of the intermediate commits in this PR, I thought I was experiencing mysterious hanging tests, but I can't reproduce it now.
  • If you have a real OT-2 handy...
    • Please push to it and do a basic smoke test like homing or running a protocol. I've already done this with PE feature flags disabled, so it would be helpful if someone else did it with the feature flags enabled.
    • If you want: SSH in; tail logs with journalctl -f -u opentrons-robot-server; run systemctl stop opentrons-robot-server; and make sure the service stops cleanly, except for the expected errors listed below.
  • If you want, run make dev and make sure...
    • You can still connect to the virtual robot.
    • The dev server exits promptly and cleanly when you interrupt with ^C, except for the expected errors listed below.
  • If you want, run through the repro steps in #8023 (bug: notify-server cannot send hardware events) and make sure that it is indeed fixed.

I will also do all of these before merging.

Expected errors

Risk assessment

Medium.

Risk is brought up because this is critical to everything that the OT-2 does.

Risk is brought down because problems with this code should be obvious with a smoke test.

codecov bot commented Aug 9, 2021

Codecov Report

Merging #8213 (5343b36) into edge (8f297af) will increase coverage by 0.05%.
The diff coverage is 90.24%.


@@            Coverage Diff             @@
##             edge    #8213      +/-   ##
==========================================
+ Coverage   74.80%   74.85%   +0.05%     
==========================================
  Files        1661     1663       +2     
  Lines       44316    44429     +113     
  Branches     4433     4433              
==========================================
+ Hits        33149    33256     +107     
- Misses      10414    10420       +6     
  Partials      753      753              
Flag Coverage Δ
api 85.82% <ø> (ø)
robot-server 93.50% <90.24%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
robot-server/robot_server/service/dependencies.py 79.24% <68.00%> (-12.14%) ⬇️
robot-server/robot_server/lifetime_dependencies.py 91.66% <91.66%> (ø)
...bot-server/robot_server/hardware_initialization.py 94.11% <94.11%> (ø)
robot-server/robot_server/slow_initializing.py 96.77% <96.77%> (ø)
robot-server/robot_server/__init__.py 100.00% <100.00%> (ø)
robot-server/robot_server/app_setup.py 100.00% <100.00%> (ø)
...obot-server/robot_server/service/legacy/rpc/rpc.py 92.34% <0.00%> (+1.09%) ⬆️
... and 1 more

else:
    # Create the hardware
    thread_manager = await initialize_api()
    _init_event_watchers(thread_manager, event_publisher)
@SyntaxColoring (Contributor, Author) commented Aug 9, 2021

ℹ️ This is technically a behavioral change!

Because of this line, this initialize() function will only return if _init_event_watchers() succeeds. If initializing event watchers fails, you'll get a 500 error on any endpoint that Depends on hardware.

The old code did this in initialization:

else:
    # Create the hardware
    self._tm = await initialize_api()
    await self.init_event_watchers()
    log.info("Opentrons API initialized")
    return self._tm

Note the assignment to self._tm instead of a local _tm variable.

And the old code implemented retrieval like this:

def get_hardware(self) -> Optional[ThreadManager]:
    """Get the wrapped ThreadManager, if it exists."""
    return self._tm

Which meant the hardware could appear ready and successfully initialized even if _init_event_watchers() raised an exception.

Comment on lines +57 to +59
# `from None` is important: without it, this new exception would be
# interpreted as an unexpected failure while handling the InvalidStateError.
# This affects what's printed in stack traces.
Contributor Author

Is it weird for this comment to exist? raise Foo() from None feels obscure to me, but, what am I going to do, copy this comment every single place I do raise Foo() from None? (It's also used in this PR's rewrite of the get_hardware() dependency, and I don't comment on it there.)
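
For reference, a minimal illustration of the behavior being described (not code from this PR; InitializationOngoingError is an invented name):

import asyncio

class InitializationOngoingError(Exception):
    """Illustrative stand-in for the 'not ready yet' error."""

def result_or_not_ready(future: "asyncio.Future[str]") -> str:
    try:
        return future.result()
    except asyncio.InvalidStateError:
        # With `from None`, a traceback shows only InitializationOngoingError.
        # Without it, Python also prints the InvalidStateError, under
        # "During handling of the above exception, another exception occurred".
        raise InitializationOngoingError() from None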

Contributor

Is an exception raise the best move for us here? We're already calling the method get_if_ready, so if we take inspiration from dict.get, I think it would be appropriate for this method to return Optional[_T]. The interface is for getting an object that may take a while to initialize; I think None makes sense as a return value for "not ready yet".

(Exception still makes sense to me for "initialization failed")

@SyntaxColoring (Contributor, Author) commented Aug 10, 2021

I had the same thought.

Points in favor of raising:

  • It matches the design of asyncio.Future.result().
  • In the edge case when _T is None, raising makes it easier for the API to distinguish between the "not ready yet" and "done" states. I think it might sometimes make sense for _T to be None when a dependency represents doing something, instead of providing something. Like how we have a dependency for checking version headers. Maybe initializing global logging objects should work like this? Maybe setting up notification subscriptions should work like this? I'm not sure yet.

Points in favor of returning Optional[_T]:

  • The typechecker will force callers to acknowledge, somehow, that the object might not be ready when they request it. They need to implement their own error handling instead of relying on the exception propagating from this method.

On balance, I end up softly preferring raising.
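
For concreteness, the two shapes being weighed (signature sketches only, not code from this PR):

from typing import Generic, Optional, TypeVar

_T = TypeVar("_T")

class RaisesWhenNotReady(Generic[_T]):
    def get_if_ready(self) -> _T:
        """Return the result; raise if initialization is ongoing or has failed."""
        raise NotImplementedError

class ReturnsOptional(Generic[_T]):
    def get_if_ready(self) -> Optional[_T]:
        """Return the result, or None if not ready yet; raise only on failure."""
        raise NotImplementedError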

@@ -10,14 +10,11 @@

@SyntaxColoring (Contributor, Author) commented Aug 9, 2021

ℹ️ This diff got weird. I suggest viewing this file in split mode.

@SyntaxColoring changed the title from "refactor(robot-server): Refactor hardware wrapper" to "fix(robot-server): Refactor hardware wrapper" on Aug 9, 2021
@mcous (Contributor) left a comment

Haven't tested yet, but looks pretty good. Thoughts below in advance of pairing and review tomorrow


@@ -0,0 +1,250 @@
"""Dependencies (in the FastAPI sense) that last for the lifetime of the server.
Contributor Author

Maybe this should be a submodule of util?

Contributor

I personally dislike util as a module name / concept. I prefer where this sits right now

@SyntaxColoring marked this pull request as ready for review August 14, 2021 07:36
@SyntaxColoring requested review from a team as code owners August 14, 2021 07:36
Comment on lines +100 to +101
finally:
    try:
@SyntaxColoring (Contributor, Author) commented Aug 16, 2021

This nested try might be hard to understand, and its duplication across all these context managers is a red flag.

Possible solutions:

  1. Centralize this logic under SlowInitializing.
  2. Do:
    try:
        yield slow_initializing
    finally:
        with contextlib.suppress(InitializationFailedError):
            fully_initialized_result = await slow_initializing.get_when_ready()
            await fully_initialized_result.on_shutdown()
    [Edit: No, I don't think this addresses the root problem. This layer should not be responsible for cleaning up after SlowInitializing. SlowInitializing should clean up after itself.]
  3. Just not suppress the InitializationFailedError. Let it propagate? [Edit: No, I think this would make things very confusing when the server exits with a traceback. Errors that have already been logged could get re-printed multiple times. And really, if something fails to open, it doesn't make sense to propagate that when you close it. The closing is a no-op because there's nothing to close.]

@SyntaxColoring changed the title from "fix(api,robot-server): Refactor hardware wrapper and other lifetime dependencies" to "fix(robot-server): Refactor hardware wrapper and other lifetime dependencies" on Aug 18, 2021
@SyntaxColoring added this to the CPX Sprint 38 milestone Aug 19, 2021
@mcous (Contributor) left a comment

Sorry this has taken so long with travels, but I've been thinking about this a lot, so I wanted to get this review in. We definitely need to chat about this in person to resolve it, I think, or this will take forever 🙃

On the whole, I worry this PR does too much and adds a lot of hard to read / follow code to solve problems that aren't quite the problems we're actually experiencing.

Acceptance criteria that I see:

  • Make sure the hardware (and probably the RPC server, which also spins up threads) is initialized properly
  • Make sure child threads are shut down at app shutdown, mostly to make sure our test suite keeps working until we can address our mocking / lack of mocking problems in tests

I think ProtocolManager and SessionManager cleanups are red herrings and are mostly, if not entirely, unnecessary. They are not production facing, and arguably not necessary at all (even the file cleanup done by ProtocolManager, because the files are placed in a temporary directory).

Focusing on our acceptance criteria of "make sure hardware stuff can spin up in the background", I think we should take the following simplification steps:

  • An individual dependency can declare its own cleanup handlers or register them with a cleanup system rather than centralizing all lifetime dependencies into one place
  • We lean into process termination as cleanup, as it's the strategy we use in both dev and production

app.include_router(router=router)


app.on_event("startup")(initialize_logging)
Contributor

This is FastAPI's @on_event("startup"), not Starlette's.

From reading the source, FastAPI does not define on_event; it's definitely a straight pass-through because the FastAPI class inherits from Starlette.

I think using the __init__ methods (which FastAPI also accepts) is the best available move. If we must use @app.on_event (which may turn out to be the case depending on how badly we want the app.state dictionary in there), I think all of them should be defined in the top-level app.py
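
For reference, the constructor-based registration being suggested might look like this (a sketch; assumes initialize_logging is the zero-argument callable from the excerpt above):

from fastapi import FastAPI

def initialize_logging() -> None:
    ...  # stand-in for the real logging setup

# Registering the handler through FastAPI's __init__ keyword arguments...
app = FastAPI(on_startup=[initialize_logging])

# ...has the same effect as registering it after construction:
# app.on_event("startup")(initialize_logging)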



app.on_event("startup")(initialize_logging)
lifetime_dependencies.install_startup_shutdown_handlers(app)
Contributor

Passing app as an argument to a function that's going to mutate app is a warning sign to me. From what I can tell, lifetime_dependencies requires two things in the current PR:

  1. app, to mutate the startup and shutdown handlers
  2. app.state, to attach persistent data

I think (1) should be removed in favor of the app setup being in charge of attaching handlers exposed by the lifetime_dependencies module, and any other methods should operate on a State object rather than a FastAPI app.


Comment on lines +11 to +13
* We need to clean it up when the server shuts down. Note that because we support hot
reloading, we do need to explicitly clean up most resources; we can't rely on things
automatically getting cleaned up when the process ends.
Contributor

Pulling from the "Details" section:

Note that because we support hot reloading, we do need to explicitly clean up most resources; we can't rely on things automatically getting cleaned up when the process ends.

I pulled this quote out because it feels to me like a lot of this PR comes out of this assumption, but I don't see how it can be true. Uvicorn runs the server in a subprocess and terminates the process on reload. This makes sense, because there's no other way to reliably reload Python files.

This, to me, says we can absolutely rely on process termination for cleanup. If we can rely on process termination, how much of this PR is necessary? I think cleanup may be important for our tests, but that's a test problem, not an application problem.

lifetime_dependencies.install_startup_shutdown_handlers(app)


__all__ = ["app"]
Contributor

These __all__ variables in internal, non-init files feel a little redundant to me

"""Handle app startup."""
initialize_logging()
# Initialize api
(await get_hardware_wrapper()).async_initialize()
Contributor

Ignoring the fact that it was broken, I dislike this removal. If I'm trying to establish context while reading this server code, the question "how is the hardware connection created?" is really important, and when we put it here, in the app-level startup handler, the pathway to find that answer is really obvious

@@ -0,0 +1,83 @@
"""Utilities for initializing the hardware interface."""
Contributor

"Utilities" is always a bit of a warning sign for me. Something you need utilities, but if we have a hardware connection we need to initialize and hand out as a dependency, shouldn't those all be in the same place?

Is there a way we can restructure this PR to have robot_server/hardware.py be the thing that actually exports a startup / shutdown handler for the hardware along with the get_hardware dependency function? I'm imagining a system where any given dependency is able to use classes / functions in lifetime_dependencies to declare themselves a lifetime dependency.

For example, if this module were to export 2 or 3 methods, I think everything would be a lot easier to follow (a rough sketch follows the list):

  1. def initialize_hardware(app_state: State) -> None
    • Creates a background task to initialize a ThreadManager
    • Uses slow_initializing and/or lifetime_dependencies to attach it to app state
    • (Maybe) registers cleanup with lifetime_dependencies
  2. def get_hardware(app_state: State) -> ThreadManager
    • Retrieve the slow initializing hardware
    • Raise if not initialized
  3. (Maybe) def shutdown_hardware(app_state: State) -> None
    • Shut down the hardware
    • The necessity of this would depend on if lifetime_dependencies centralized cleanup or not
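
A rough sketch of that proposed module shape, with stand-ins for anything not named in the comment (the fake hardware object, the task-on-app-state detail, and the 503 status are assumptions, not code from this PR):

import asyncio
from fastapi import HTTPException
from starlette.datastructures import State  # the object behind app.state

async def _make_hardware() -> object:
    # Stand-in for the real, slow ThreadManager initialization.
    await asyncio.sleep(0)
    return object()

def initialize_hardware(app_state: State) -> None:
    # Kick off initialization in a background task and attach it to app state.
    # (Must be called from within the running event loop, e.g. a startup handler.)
    app_state.hardware_task = asyncio.create_task(_make_hardware())

def get_hardware(app_state: State) -> object:
    # Retrieve the slowly initializing hardware; raise if it isn't ready yet.
    task = app_state.hardware_task
    if not task.done():
        raise HTTPException(status_code=503, detail="Hardware not ready yet.")
    return task.result()

async def shutdown_hardware(app_state: State) -> None:
    # Wait for initialization to settle, then run whatever cleanup the real
    # hardware object needs (omitted for this stand-in).
    await app_state.hardware_task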

@SyntaxColoring (Contributor, Author)

We're deeming this not worth the complexity given that #8278 solves #8023. #8051 is still a problem, but we think we'll just live with it and wait for FastAPI to give us a first-party solution instead of trying to DIY it ourselves.

I need to go through this diff and see if there are any other little fixes worth pulling out, but then we can close this PR.
