
refactor(robot_server,api): do not use ThreadManager with ProtocolEngine #8278

Merged (14 commits into edge, Sep 7, 2021)

Conversation

@mcous (Contributor) commented Sep 1, 2021

Overview

I set out to do two things with this PR:

  1. Allow the server/API to bypass ThreadManager if the "Enable Protocol Engine" feature flag is enabled.
  2. Incorporate some of my feedback from fix(robot-server): Refactor hardware wrapper and other lifetime dependencies #8213 and replace hardware_wrapper.py with a robot_server.hardware module containing initialization and dependency logic

My intention was to avoid touching the larger questions of lifetime dependencies raised in #8213, but in consolidating the initialization / cleanup logic of the hardware API, it looks to me like this code fixes #8023. It affects #8051, but I don't think it resolves it in spirit, so let's leave it open.

In touching the get_hardware dependency, I was forced to touch the get_rpc_server dependency, and it turns out those are the only two things that need explicit cleanup. I'm not sure where this leaves #8213, but I sort of think any general lifetime dependency solutions should probably wait until:

Changelog

  • Do not use ThreadManager with ProtocolEngine
    • This change breaks RPC if the feature flag is set
    • This wasn't intentional, but it's a natural result of bypassing the ThreadManager and its SyncAdapter
    • I don't think it's a problem and do not intend to fix it

Review requests

TL;DR: run protocols with the "Enable Protocol Engine" feature flag on and off, see if anything breaks. If you're not familiar with the engine stuff, smoke testing for no regressions while the ThreadManager is still in play is very helpful for this PR.

  • Feature flag off - still using ThreadManager
    • RPC API is still good [tested by @SyntaxColoring]
    • Hardware HTTP APIs are still good
      • One-off requests, like the "Home" and "Lights" buttons [tested by @SyntaxColoring]
      • Existing session flows, like Deck Calibration and Pipette Offset Calibration [tested by @SyntaxColoring]
  • Feature flag on - ThreadManager disabled
    • RPC API will not work
    • One-off HTTP APIs may work, but if they don't it's ok
      • [The app grays them out, except for feature flag switches. –@SyntaxColoring]
    • Running protocols through the engine-based /protocols and /sessions endpoints works [tested by @SyntaxColoring]

Risk assessment

Medium, due to touching hardware initialization code with a nearly complete lack of test coverage.

@mcous added the refactor and robot-svcs (Falls under the purview of the Robot Services squad, formerly CPX, Core Platform Experience) labels on Sep 1, 2021
codecov bot commented Sep 1, 2021

Codecov Report

Merging #8278 (9dd9ce3) into edge (40de1e4) will increase coverage by 0.00%.
The diff coverage is 87.09%.

Impacted file tree graph

@@           Coverage Diff           @@
##             edge    #8278   +/-   ##
=======================================
  Coverage   74.89%   74.89%           
=======================================
  Files        1683     1685    +2     
  Lines       44835    44913   +78     
  Branches     4494     4494           
=======================================
+ Hits        33577    33639   +62     
- Misses      10495    10511   +16     
  Partials      763      763           
Flag Coverage Δ
api 85.74% <100.00%> (-0.04%) ⬇️
robot-server 93.40% <86.92%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...er/robot_server/service/legacy/rpc/dependencies.py 63.63% <63.63%> (ø)
robot-server/robot_server/hardware.py 87.20% <87.20%> (ø)
robot-server/robot_server/app.py 97.50% <92.30%> (+6.07%) ⬆️
api/src/opentrons/hardware_control/api.py 84.69% <100.00%> (ø)
robot-server/robot_server/app_state.py 100.00% <100.00%> (ø)
robot-server/robot_server/health/router.py 100.00% <100.00%> (ø)
...obot-server/robot_server/protocols/dependencies.py 55.17% <100.00%> (-1.50%) ⬇️
robot-server/robot_server/service/__init__.py 100.00% <100.00%> (ø)
robot-server/robot_server/service/dependencies.py 93.75% <100.00%> (+2.37%) ⬆️
...-server/robot_server/service/legacy/routers/rpc.py 100.00% <100.00%> (ø)
... and 8 more

@mcous mcous marked this pull request as ready for review September 1, 2021 17:21
@mcous mcous requested a review from a team as a code owner September 1, 2021 17:21
Comment on lines -71 to -74
# Remove all sessions
await (await get_session_manager()).remove_all()
# Remove all uploaded protocols
(await get_protocol_manager()).remove_all()
@mcous (Contributor, Author) commented:
These weren't doing anything, nor were they necessary for the intended behavior, based on my testing. Removed them accordingly

@mcous (Contributor, Author) commented Sep 1, 2021:

Oof, looks like I might be wrong here. Tests are hanging on my machine in a way that reminds me of our previous experience with ulimit. Put the protocol manager cleanup back in place

Comment on lines +68 to +72
shutdown_results = await asyncio.gather(
cleanup_rpc_server(app.state),
cleanup_hardware(app.state),
return_exceptions=True,
)
Contributor commented:

This gather() makes me nervous.

  • The interface of the matched factory function, get_rpc_server(), takes a hardware parameter. As the caller, I would assume that that hardware would need to stay valid for the entire lifetime of the RPC server, including its cleanup code.
  • cleanup_rpc_server() in fact does not touch the hardware, and it does appear safe to call it concurrently with cleanup_hardware(). But you have to read the implementation of cleanup_rpc_server() to know that. It feels like a breach of encapsulation.

I'd be more comfortable with a simple:

await cleanup_rpc_server(app.state)
await cleanup_hardware(app.state)

Note that this would still be buggy because an exception in cleanup_rpc_server() will prevent cleanup_hardware() from running. However, I think that's okay for this PR, because:

  • That interrupted shutdown bug existed before.
  • That interrupted shutdown bug still exists with the gather() code, because of the await get_protocol_manager() and protocol_manager.remove_all() above the gather().

I think solving the interrupted cleanup bug is in the domain of #8051. It probably requires either a horrible try/except mess or contextlib.ExitStack.
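
For reference, a minimal sketch of what the contextlib route might look like, using AsyncExitStack since both cleanups are coroutines and assuming the cleanup_rpc_server / cleanup_hardware signatures from the diff above (an illustration, not code from this PR):

import contextlib

async def on_shutdown() -> None:
    async with contextlib.AsyncExitStack() as stack:
        # Callbacks run in LIFO order on exit, and every registered callback
        # runs even if another one raises, so hardware cleanup can't be
        # skipped by an RPC cleanup failure.
        stack.push_async_callback(cleanup_hardware, app.state)
        stack.push_async_callback(cleanup_rpc_server, app.state)

Unlike gather(return_exceptions=True), any exception raised by a callback still propagates (chained) after the other callbacks have run, so it would still need to be caught or logged somewhere.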

@mcous (Contributor, Author) commented Sep 1, 2021:

I hear you and agree with your concerns, but I'd rather keep the gather for this PR because:

  • We know protocol_manager.remove_all(), as written, won't raise
  • We know RPC cleanup will raise in a common condition (connection present during shutdown)
  • We know that hardware cleanup and RPC cleanup, as written, can run concurrently
  • If something changes in the future that means they can't run concurrently, there will be a log statement

Removing the gather makes me more nervous than keeping it does. Let's not let perfect be the enemy of good.

Comment on lines +11 to +12
request: Request = None,
websocket: WebSocket = None,
Contributor commented:

Wow. I'm amazed that this works...? Is this in the docs or discussed in a GitHub issue anywhere?

@mcous (Contributor, Author) commented:

I found it by reading the Starlette source, where it is present in the public API of the HTTPConnection base class:

class HTTPConnection(Mapping):
    """
    A base class for incoming HTTP connections, that is used to provide
    any functionality that is common to both `Request` and `WebSocket`.
    """

    # ...

    @property
    def app(self) -> typing.Any:
        return self.scope["app"]

Probably worth a docs PR into both FastAPI and Starlette

Contributor commented:

I'm more surprised by the = None thing working.

@mcous (Contributor, Author) commented:

Oh shoot, yeah, that needs a comment. It needs to be request: Request and websocket: WebSocket in order for FastAPI to magically pick up the type annotations and inject the object. request: Optional[Request] will blow up DI.

From there, our mypy config is set to be a little more lenient than I'd like, where you're allowed to do foo: SomeType = None instead of foo: Optional[SomeType] = None. Will add comments to that effect because it's a gotcha
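
For reference, a minimal sketch of the pattern being described here; the function name, return type, and assertion are illustrative, not necessarily what this PR ships:

from fastapi import Request, WebSocket
from starlette.datastructures import State


def get_app_state(
    # The annotations must be the bare Request / WebSocket types so FastAPI's
    # DI recognizes and injects them; Optional[Request] breaks the injection.
    # The "= None" default is what lets the one that isn't present be skipped,
    # even though a stricter mypy config would demand Optional[...] here.
    request: Request = None,
    websocket: WebSocket = None,
) -> State:
    # Request and WebSocket both subclass Starlette's HTTPConnection, which
    # exposes .app, so either one leads to app.state.
    connection = request or websocket
    assert connection is not None, "Expected a request or a websocket"
    return connection.app.state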


if initialize_task is not None:
    initialize_task.cancel()
    await asyncio.gather(initialize_task, return_exceptions=True)
@SyntaxColoring (Contributor) commented Sep 1, 2021:

I think this can just be this?

Suggested change:
- await asyncio.gather(initialize_task, return_exceptions=True)
+ await initialize_task

@mcous (Contributor, Author) commented Sep 1, 2021:

This is to ensure shutdown happens cleanly (read: without raising) if:

  • Initialization raised for some reason
  • Shutdown is requested before the hardware initialization finishes

If the stuff that needs to be cleaned up is gone, then cleanup is satisfied and a raise would be counterproductive.

Contributor commented:

Aha. I think I'd find a try/except clearer, but this works for me if you want to keep it as-is.
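
For reference, a minimal sketch of the try/except form being discussed (illustrative only; the PR keeps the gather() form):

import asyncio
import logging

log = logging.getLogger(__name__)


async def _await_cancelled_init(initialize_task: "asyncio.Task[None]") -> None:
    initialize_task.cancel()
    try:
        await initialize_task
    except asyncio.CancelledError:
        # Shutdown was requested before hardware initialization finished;
        # there's nothing left to clean up.
        pass
    except Exception:
        # Initialization itself failed earlier; shutdown should stay quiet
        # apart from a log line.
        log.exception("Hardware initialization failed before shutdown")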

Comment on lines 58 to 60
initialize_task = getattr(app_state, _HARDWARE_API_INIT_TASK_KEY, None)
hardware_api = getattr(app_state, _HARDWARE_API_KEY, None)
unsubscribe_from_events = getattr(app_state, _HARDWARE_EVENT_UNSUBSCRIBE_KEY, None)
Contributor commented:

I think this is good for this PR, but one of the things I was trying to work out in #8213 was how to avoid too much indirection through keys, because it subverts typechecking.

What ended up in that PR was you'd access one DependencySet dataclass through a key, and then access the actual things you cared about in a type-safe way through that dataclass. So the indirection was still there, but it was consolidated and mitigated.

That strategy was messy in other ways, so I don't think it's the final solution, but I do hope we can find some other way to avoid this.

@mcous (Contributor, Author) commented Sep 1, 2021:

I think this would be better if I knew how to get some sort of immutable, unique Symbol-like object in Python to use as the key rather than a string, since a remote but real possibility of key collisions exists. To be honest, though, that's the extent of my concern with this.

I don't share the concern about subverting typechecking, for a few reasons:

  • The keys and value storage on AppState are entirely private implementation details of this module
  • The primary value I derive from typing in scripted languages is code documentation and maybe some refactor safety
    • I think the object keys satisfy the documentation value, and I think this module is too small for any serious refactor safety concerns
    • Some type annotations on the variable declarations probably couldn't hurt, though!

In order to strictly type AppState, you end up needing to:

  • Create an object that has to know about the implementation details of every module in the system that may want to store arbitrary global data, breaking encapsulation
  • Introduce circular (type) dependencies

Those, to me, are serious downsides that introduce brittleness and coupling into the entire application, and I don't see what value it gives us in exchange. As we currently work with AppState, all data attached to it is private to the module that creates it.

Contributor commented:

In order to strictly type AppState, you end up needing to:

  • Create an object that has to know about the implementation details of every module in the system that may want to store arbitrary global data, breaking encapsulation
  • Introduce circular (type) dependencies

Here's another approach that avoids those downsides.

import asyncio
import typing

T = typing.TypeVar("T")


class TypedStateValue(typing.Generic[T]):
    def __init__(self, unique_key: str) -> None:
        # Alternatively, we might be able to have this autogenerate unique keys.
        self._unique_key = unique_key

    def get_from(self, app_state: AppState) -> T:
        return getattr(app_state, self._unique_key)

    def set_on(self, app_state: AppState, value: T) -> None:
        setattr(app_state, self._unique_key, value)


_initialize_task = TypedStateValue[asyncio.Task]("_initialize_task")
_hardware_api = TypedStateValue[API]("_hardware_api")
_unsubscribe_from_events = TypedStateValue[typing.Callable[[], None]](
    "_unsubscribe_from_events"
)


async def cleanup_hardware(app_state: AppState) -> None:
    initialize_task = _initialize_task.get_from(app_state)
    hardware_api = _hardware_api.get_from(app_state)
    unsubscribe_from_events = _unsubscribe_from_events.get_from(app_state)

    # ...

I agree that the severity of not having typechecking is lessened by this module being small and these attributes being internal. I do still think we should strive to have typechecking when it doesn't take "too much" machinery, and I hope the sketch above falls under the threshold for "too much"?

@mcous (Contributor, Author) commented Sep 2, 2021:

That's a pretty neat little helper, will add it to robot_server/app_state.py

robot-server/robot_server/hardware.py (outdated conversation, resolved)
"CALL_NACK_MESSAGE",
"PONG_MESSAGE",
"RPCServer",
"get_rpc_server",
Contributor commented:

Requesting changes (or at least discussion) for this.

This makes me nervous because you can now do, for example, from service.legacy import get_rpc_server. And then, unless you read the docstring, it looks like get_rpc_server is just a normal function that you can call like get_rpc_server().

I'm hoping for a strong naming convention that sets our FastAPI dependency functions apart from other functions. Possibilities:

  • Making sure dependencies is always in the import path somewhere, like what we had before, so it's service.legacy.rpc.dependencies.get_rpc_server. I think we discussed this before, but I don't remember your thoughts on it. I personally don't mind it.
  • Making it something like service.legacy.rpc.dep_rpc_server.

Ditto for robot_server.hardware.get_hardware.

@mcous (Contributor, Author) commented Sep 2, 2021:

If there's one thing I've learned working with FastAPI, it's that nothing is ever just a normal function you can call 🙃. Mostly, though, I guess I don't feel nervous about this? If someone comes in thinking they can call the get_rpc_server function outside of Depends or without passing in AppState explicitly, they'll be disabused of that notion when whatever they were trying to do doesn't work at all.

(I'm not counting the server setup and teardown bugs that we definitely did not notice for a while because a) teardown doesn't really matter because the process is stopped, and b) the part of setup that was broken wasn't a production codepath nor was it ever subject to actual tests. In anything even remotely production-facing, I think we'd notice pretty quickly.)

My TL;DR here is that I feel it's reasonable to expect a developer in this codebase to understand how and where the dependency injection system is used. That's my 🌶️ take, so with that out of the way:

  • I dislike putting "dependencies" in a public module name
    • This opinion generally holds for any module name that describes the type of thing it is rather than the scope of its logic in the application, including models, constants, types, etc.
  • I could live with get_hardware_dependency, etc.

Contributor commented:

If there's one thing I've learned working with FastAPI, it's that nothing is ever just a normal function you can call

Magic is good because Everything Always Just Works, right? 🙃

Mostly, though, I guess I don't feel nervous about this?...

  • It's certainly possible I'm just traumatized by the server setup/teardown bugs, but I'm still a bit nervous about the same thing happening again.
  • Also, even if FastAPI dependency functions weren't special and error-prone in this way, I think I can see a separation-of-layers argument for keeping them apart?

I think let's keep this as-is for this PR, and open a naming convention discussion in #python or something.

@SyntaxColoring (Contributor) left a review comment:

Code changes since my last review look good, though there are still some open conversations.

Smoke tests (according to the plan in the description) passed. 🎉

@sanni-t (Member) left a review comment:

Very cool!!
Just the one comment about making sure the hardware API initializes correctly on a refresh robot.

api/src/opentrons/__init__.py (outdated conversation, resolved)
log.info(f"Robot Name: {name()}")
systemdd_notify()
Member commented:

This might have to go after the hardware API initialization. Putting it here might result in a GPIO access conflict on a refresh robot. Have we tested this on an OT2R?

@mcous (Contributor, Author) commented:

I've only tested on non-refresh, I believe. I put it here because I thought I saw it come before hardware init in the existing code. Def worth double and triple checking

Member commented:

Cool. I can go over the service files and test on the refresh electronics.

Member commented:

Tested that refresh electronics start up correctly. PR is good to merge!

Labels: refactor, robot-svcs (Falls under the purview of the Robot Services squad, formerly CPX, Core Platform Experience)
Projects: None yet
Development: Successfully merging this pull request may close these issues: bug: notify-server cannot send hardware events
3 participants