
refactor(robot_server,api): do not use ThreadManager with ProtocolEngine #8278

Merged (14 commits into edge, Sep 7, 2021)

Conversation

@mcous (Contributor) commented Sep 1, 2021

Overview

I set out to do two things with this PR:

  1. Allow the server/API to bypass ThreadManager if the "Enable Protocol Engine" feature flag is enabled.
  2. Incorporate some of my feedback from fix(robot-server): Refactor hardware wrapper and other lifetime dependencies #8213 and replace hardware_wrapper.py with a robot_server.hardware module containing initialization and dependency logic

My intention was to avoid touching the larger questions of lifetime dependencies raised in #8213, but in consolidating the initialization / cleanup logic of the hardware API, it looks to me like this code fixes #8023. It affects #8051, but I don't think it resolves it in spirit, so let's leave it open.

In touching the get_hardware dependency, I was forced to touch the get_rpc_server dependency, and it turns out those are the only two things that need explicit cleanup. I'm not sure where this leaves #8213, but I sort of think any general lifetime dependency solutions should probably wait until:

Changelog

  • Do not use ThreadManager with ProtocolEngine
    • This change breaks RPC if the feature flag is set
    • This wasn't intentional, but it's a natural result of bypassing the ThreadManager and its SyncAdapter
    • I don't think it's a problem and do not intend to fix it

Review requests

TL;DR: run protocols with the "Enable Protocol Engine" feature flag on and off, see if anything breaks. If you're not familiar with the engine stuff, smoke testing for no regressions while the ThreadManager is still in play is very helpful for this PR.

  • Feature flag off - still using ThreadManager
    • RPC API is still good [tested by @SyntaxColoring]
    • Hardware HTTP APIs are still good
      • One-off requests, like the "Home" and "Lights" buttons [tested by @SyntaxColoring]
      • Existing session flows, like Deck Calibration and Pipette Offset Calibration [tested by @SyntaxColoring]
  • Feature flag on - ThreadManager disabled
    • RPC API will not work
    • One-off HTTP APIs may work, but if they don't it's ok
      • [The app grays them out, except for feature flag switches. –@SyntaxColoring]
    • Running protocols through the engine-based /protocols and /sessions endpoints works [tested by @SyntaxColoring]

Risk assessment

Medium, due to touching hardware initialization code with a nearly complete lack of test coverage.

@mcous added the refactor and robot-svcs (Falls under the purview of the Robot Services squad, formerly CPX, Core Platform Experience) labels on Sep 1, 2021
codecov bot commented Sep 1, 2021

Codecov Report

Merging #8278 (9dd9ce3) into edge (40de1e4) will increase coverage by 0.00%.
The diff coverage is 87.09%.

Impacted file tree graph

@@           Coverage Diff           @@
##             edge    #8278   +/-   ##
=======================================
  Coverage   74.89%   74.89%           
=======================================
  Files        1683     1685    +2     
  Lines       44835    44913   +78     
  Branches     4494     4494           
=======================================
+ Hits        33577    33639   +62     
- Misses      10495    10511   +16     
  Partials      763      763           
Flag Coverage Δ
api 85.74% <100.00%> (-0.04%) ⬇️
robot-server 93.40% <86.92%> (-0.09%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...er/robot_server/service/legacy/rpc/dependencies.py 63.63% <63.63%> (ø)
robot-server/robot_server/hardware.py 87.20% <87.20%> (ø)
robot-server/robot_server/app.py 97.50% <92.30%> (+6.07%) ⬆️
api/src/opentrons/hardware_control/api.py 84.69% <100.00%> (ø)
robot-server/robot_server/app_state.py 100.00% <100.00%> (ø)
robot-server/robot_server/health/router.py 100.00% <100.00%> (ø)
...obot-server/robot_server/protocols/dependencies.py 55.17% <100.00%> (-1.50%) ⬇️
robot-server/robot_server/service/__init__.py 100.00% <100.00%> (ø)
robot-server/robot_server/service/dependencies.py 93.75% <100.00%> (+2.37%) ⬆️
...-server/robot_server/service/legacy/routers/rpc.py 100.00% <100.00%> (ø)
... and 8 more

@mcous mcous marked this pull request as ready for review September 1, 2021 17:21
@mcous mcous requested a review from a team as a code owner September 1, 2021 17:21
Comment on lines -71 to -74
# Remove all sessions
await (await get_session_manager()).remove_all()
# Remove all uploaded protocols
(await get_protocol_manager()).remove_all()
@mcous (Contributor, Author) commented:
These weren't doing anything, nor were they necessary for the intended behavior, based on my testing. Removed them accordingly

@mcous (Contributor, Author) commented Sep 1, 2021:

Oof, looks like I might be wrong here. Tests are hanging on my machine in a way that reminds me of our previous experience with ulimit. Put the protocol manager cleanup back in place

Comment on lines +68 to +72
shutdown_results = await asyncio.gather(
cleanup_rpc_server(app.state),
cleanup_hardware(app.state),
return_exceptions=True,
)
Contributor commented:

This gather() makes me nervous.

  • The interface of the matched factory function, get_rpc_server(), takes a hardware parameter. As the caller, I would assume that that hardware would need to stay valid for the entire lifetime of the RPC server, including its cleanup code.
  • cleanup_rpc_server() in fact does not touch the hardware, and it does appear safe to call it concurrently with cleanup_hardware(). But you have to read the implementation of cleanup_rpc_server() to know that. It feels like a breach of encapsulation.

I'd be more comfortable with a simple:

await cleanup_rpc_server(app.state)
await cleanup_hardware(app.state)

Note that this would still be buggy because an exception in cleanup_rpc_server() will prevent cleanup_hardware() from running. However, I think that's okay for this PR, because:

  • That interrupted shutdown bug existed before.
  • That interrupted shutdown bug still exists with the gather() code, because of the await get_protocol_manager() and protocol_manager.remove_all() above the gather().

I think solving the interrupted cleanup bug is in the domain of #8051. It probably requires either a horrible try/except mess or contextlib.ExitStack.
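
For reference, a minimal sketch of what the contextlib route might look like, using AsyncExitStack since both cleanups are coroutines and assuming the cleanup_rpc_server / cleanup_hardware signatures from the diff above (an illustration, not code from this PR):

import contextlib

async def on_shutdown() -> None:
    async with contextlib.AsyncExitStack() as stack:
        # Callbacks run in LIFO order on exit, and every registered callback
        # runs even if another one raises, so hardware cleanup can't be
        # skipped by an RPC cleanup failure.
        stack.push_async_callback(cleanup_hardware, app.state)
        stack.push_async_callback(cleanup_rpc_server, app.state)

Unlike gather(return_exceptions=True), any exception raised by a callback still propagates (chained) after the other callbacks have run, so it would still need to be caught or logged somewhere.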

@mcous (Contributor, Author) commented Sep 1, 2021:

I hear you and agree with your concerns, but I'd rather keep the gather for this PR because:

  • We know protocol_manager.remove_all(), as written, won't raise
  • We know RPC cleanup will raise in a common condition (connection present during shutdown)
  • We know that hardware cleanup and RPC cleanup, as written, can run concurrently
  • If something changes in the future that means they can't run concurrently, there will be a log statement

Removing the gather makes me more nervous than keeping it does. Let's not let perfect be the enemy of good.

Comment on lines +11 to +12
request: Request = None,
websocket: WebSocket = None,
Contributor commented:

Wow. I'm amazed that this works...? Is this in the docs or discussed in a GitHub issue anywhere?

@mcous (Contributor, Author) commented:

I found it by reading the Starlette source, where it is present in the public API of the HTTPConnection base class:

class HTTPConnection(Mapping):
    """
    A base class for incoming HTTP connections, that is used to provide
    any functionality that is common to both `Request` and `WebSocket`.
    """

    # ...

    @property
    def app(self) -> typing.Any:
        return self.scope["app"]

Probably worth a docs PR into both FastAPI and Starlette

Contributor commented:

I'm more surprised by the = None thing working.

@mcous (Contributor, Author) commented:

Oh shoot, yeah, that needs a comment. It needs to be request: Request and websocket: WebSocket in order for FastAPI to magically pick up the type annotations and inject the object. request: Optional[Request] will blow up DI.

From there, our mypy config is set to be a little more lenient than I'd like, where you're allowed to do foo: SomeType = None instead of foo: Optional[SomeType] = None. Will add comments to that effect because it's a gotcha
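
For reference, a minimal sketch of the pattern being described here; the function name, return type, and assertion are illustrative, not necessarily what this PR ships:

from fastapi import Request, WebSocket
from starlette.datastructures import State


def get_app_state(
    # The annotations must be the bare Request / WebSocket types so FastAPI's
    # DI recognizes and injects them; Optional[Request] breaks the injection.
    # The "= None" default is what lets the one that isn't present be skipped,
    # even though a stricter mypy config would demand Optional[...] here.
    request: Request = None,
    websocket: WebSocket = None,
) -> State:
    # Request and WebSocket both subclass Starlette's HTTPConnection, which
    # exposes .app, so either one leads to app.state.
    connection = request or websocket
    assert connection is not None, "Expected a request or a websocket"
    return connection.app.state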


if initialize_task is not None:
    initialize_task.cancel()
    await asyncio.gather(initialize_task, return_exceptions=True)
@SyntaxColoring (Contributor) commented Sep 1, 2021:

I think this can just be this?

Suggested change:
- await asyncio.gather(initialize_task, return_exceptions=True)
+ await initialize_task

@mcous (Contributor, Author) commented Sep 1, 2021:

This is to ensure shutdown happens cleanly (read: without raising) if:

  • Initialization raised for some reason
  • Shutdown is requested before the hardware initialization finishes

If the stuff that needs to be cleaned up is gone, then cleanup is satisfied and a raise would be counterproductive.

Contributor commented:

Aha. I think I'd find a try/except clearer, but this works for me if you want to keep it as-is.
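
For reference, a minimal sketch of the try/except form being discussed (illustrative only; the PR keeps the gather() form):

import asyncio
import logging

log = logging.getLogger(__name__)


async def _await_cancelled_init(initialize_task: "asyncio.Task[None]") -> None:
    initialize_task.cancel()
    try:
        await initialize_task
    except asyncio.CancelledError:
        # Shutdown was requested before hardware initialization finished;
        # there's nothing left to clean up.
        pass
    except Exception:
        # Initialization itself failed earlier; shutdown should stay quiet
        # apart from a log line.
        log.exception("Hardware initialization failed before shutdown")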

Comment on lines 58 to 60
initialize_task = getattr(app_state, _HARDWARE_API_INIT_TASK_KEY, None)
hardware_api = getattr(app_state, _HARDWARE_API_KEY, None)
unsubscribe_from_events = getattr(app_state, _HARDWARE_EVENT_UNSUBSCRIBE_KEY, None)
Contributor commented:

I think this is good for this PR, but one of the things I was trying to work out in #8213 was how to avoid too much indirection through keys, because it subverts typechecking.

What ended up in that PR was you'd access one DependencySet dataclass through a key, and then access the actual things you cared about in a type-safe way through that dataclass. So the indirection was still there, but it was consolidated and mitigated.

That strategy was messy in other ways, so I don't think it's the final solution, but I do hope we can find some other way to avoid this.

@mcous (Contributor, Author) commented Sep 1, 2021:

I think this would be better if I knew how to get some sort of immutable, unique Symbol-like object in Python to use as the key rather than a string, since a remote but real possibility of key collisions exists. To be honest, though, that's the extent of my concern with this.

I don't share the concern about subverting typechecking, for a few reasons:

  • The keys and value storage on AppState are entirely private implementation details of this module
  • The primary value I derive from typing in scripted languages is code documentation and maybe some refactor safety
    • I think the object keys satisfy the documentation value, and I think this module is too small for any serious refactor safety concerns
    • Some type annotations on the variable declarations probably couldn't hurt, though!

In order to strictly type AppState, you end up needing to:

  • Create an object that has to know about the implementation details of every module in the system that may want to store arbitrary global data, breaking encapsulation
  • Introduce circular (type) dependencies

Those, to me, are serious downsides that introduce brittleness and coupling into the entire application, and I don't see what value it gives us in exchange. As we currently work with AppState, all data attached to it is private to the module that creates it.

Contributor commented:

In order to strictly type AppState, you end up needing to:

  • Create an object that has to know about the implementation details of every module in the system that may want to store arbitrary global data, breaking encapsulation
  • Introduce circular (type) dependencies

Here's another approach that avoids those downsides.

import asyncio
import typing

T = typing.TypeVar("T")


class TypedStateValue(typing.Generic[T]):
    def __init__(self, unique_key: str) -> None:
        # Alternatively, we might be able to have this autogenerate unique keys.
        self._unique_key = unique_key

    def get_from(self, app_state: AppState) -> T:
        return getattr(app_state, self._unique_key)

    def set_on(self, app_state: AppState, value: T) -> None:
        setattr(app_state, self._unique_key, value)


_initialize_task = TypedStateValue[asyncio.Task]("_initialize_task")
_hardware_api = TypedStateValue[API]("_hardware_api")
_unsubscribe_from_events = TypedStateValue[typing.Callable[[], None]](
    "_unsubscribe_from_events"
)


async def cleanup_hardware(app_state: AppState) -> None:
    initialize_task = _initialize_task.get_from(app_state)
    hardware_api = _hardware_api.get_from(app_state)
    unsubscribe_from_events = _unsubscribe_from_events.get_from(app_state)

    # ...

I agree that the severity of not having typechecking is lessened by this module being small and these attributes being internal. I do still think we should strive to have typechecking when it doesn't take "too much" machinery, and I hope the sketch above falls under the threshold for "too much"?

@mcous (Contributor, Author) commented Sep 2, 2021:

That's a pretty neat little helper, will add it to robot_server/app_state.py

robot-server/robot_server/hardware.py (outdated conversation, resolved)
"CALL_NACK_MESSAGE",
"PONG_MESSAGE",
"RPCServer",
"get_rpc_server",
Contributor commented:

Requesting changes (or at least discussion) for this.

This makes me nervous because you can now do, for example, from service.legacy import get_rpc_server. And then, unless you read the docstring, it looks like get_rpc_server is just a normal function that you can call like get_rpc_server().

I'm hoping for a strong naming convention that sets our FastAPI dependency functions apart from other functions. Possibilities:

  • Making sure dependencies is always in the import path somewhere, like what we had before, so it's service.legacy.rpc.dependencies.get_rpc_server. I think we discussed this before, but I don't remember your thoughts on it. I personally don't mind it.
  • Making it something like service.legacy.rpc.dep_rpc_server.

Ditto for robot_server.hardware.get_hardware.

@mcous (Contributor, Author) commented Sep 2, 2021:

If there's one thing I've learned working with FastAPI, it's that nothing is ever just a normal function you can call 🙃. Mostly, though, I guess I don't feel nervous about this? If someone comes in thinking they can call the get_rpc_server function outside of Depends or without passing in AppState explicitly, they'll be disabused of that notion when whatever they were trying to do doesn't work at all.

(I'm not counting the server setup and teardown bugs that we definitely did not notice for a while because a) teardown doesn't really matter because the process is stopped, and b) the part of setup that was broken wasn't a production codepath nor was it ever subject to actual tests. In anything even remotely production-facing, I think we'd notice pretty quickly.)

My TL;DR here is that I feel it's reasonable to expect a developer in this codebase to understand how and where the dependency injection system is used. That's my 🌶️ take, so with that out of the way:

  • I dislike putting "dependencies" in a public module name
    • This opinion generally holds for any module name that describes the type of thing it is rather than the scope of its logic in the application, including models, constants, types, etc.
  • I could live with get_hardware_dependency, etc.

Contributor commented:

If there's one thing I've learned working with FastAPI, it's that nothing is ever just a normal function you can call

Magic is good because Everything Always Just Works, right? 🙃

Mostly, though, I guess I don't feel nervous about this?...

  • It's certainly possible I'm just traumatized by the server setup/teardown bugs, but I'm still a bit nervous about the same thing happening again.
  • Also, even if FastAPI dependency functions weren't special and error-prone in this way, I think I can see a separation-of-layers argument for keeping them apart?

I think let's keep this as-is for this PR, and open a naming convention discussion in #python or something.

@SyntaxColoring (Contributor) left a review comment:

Code changes since my last review look good, though there are still some open conversations.

Smoke tests (according to the plan in the description) passed. 🎉

@sanni-t (Member) left a review comment:

Very cool!!
Just the one comment about making sure the hardware API initializes correctly on a refresh robot.

api/src/opentrons/__init__.py (outdated conversation, resolved)
log.info(f"Robot Name: {name()}")
systemdd_notify()
Member commented:

This might have to go after the hardware API initialization. Putting it here might result in a GPIO access conflict on a refresh robot. Have we tested this on an OT2R?

@mcous (Contributor, Author) commented:

I've only tested on non-refresh, I believe. I put it here because I thought I saw it come before hardware init in the existing code. Def worth double and triple checking

Member commented:

Cool. I can go over the service files and test on the refresh electronics.

Member commented:

Tested that refresh electronics start up correctly. PR is good to merge!

Labels: refactor, robot-svcs (Falls under the purview of the Robot Services squad, formerly CPX, Core Platform Experience)
Projects: None yet
Development: Successfully merging this pull request may close these issues: bug: notify-server cannot send hardware events
3 participants