refactor(robot_server,api): do not use ThreadManager with ProtocolEngine #8278
Conversation
Codecov Report
```
@@           Coverage Diff           @@
##            edge    #8278    +/-  ##
=======================================
  Coverage   74.89%   74.89%
=======================================
  Files        1683     1685     +2
  Lines       44835    44913    +78
  Branches     4494     4494
=======================================
+ Hits        33577    33639    +62
- Misses      10495    10511    +16
  Partials      763      763
```
Flags with carried forward coverage won't be shown.
```python
# Remove all sessions
await (await get_session_manager()).remove_all()
# Remove all uploaded protocols
(await get_protocol_manager()).remove_all()
```
These weren't doing anything, nor were they necessary for the intended behavior, based on my testing. Removed them accordingly
Oof, looks like I might be wrong here. Tests are hanging on my machine in a way that reminds me of our previous experience with `ulimit`. Put the protocol manager cleanup back in place.
```python
shutdown_results = await asyncio.gather(
    cleanup_rpc_server(app.state),
    cleanup_hardware(app.state),
    return_exceptions=True,
)
```
This `gather()` makes me nervous.

- The interface of the matched factory function, `get_rpc_server()`, takes a hardware parameter. As the caller, I would assume that that hardware would need to stay valid for the entire lifetime of the RPC server, including its cleanup code. `cleanup_rpc_server()` in fact does not touch the hardware, and it does appear safe to call it concurrently with `cleanup_hardware()`. But you have to read the implementation of `cleanup_rpc_server()` to know that. It feels like a breach of encapsulation.
I'd be more comfortable with a simple:

```python
await cleanup_rpc_server()
await cleanup_hardware()
```
Note that this would still be buggy, because an exception in `cleanup_rpc_server()` will prevent `cleanup_hardware()` from running. However, I think that's okay for this PR, because:

- That interrupted shutdown bug existed before.
- That interrupted shutdown bug still exists with the `gather()` code, because of the `await get_protocol_manager()` and `protocol_manager.remove_all()` above the `gather()`.

I think solving the interrupted cleanup bug is in the domain of #8051. It probably requires either a horrible `try`/`except` mess or `contextlib.ExitStack`.
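For illustration, a sketch of the `contextlib` route (the function and parameter names here are hypothetical stand-ins, not the actual robot_server code): `AsyncExitStack` unwinds every registered callback even if an earlier one raises, so a failing RPC cleanup can't skip hardware cleanup.

```python
import asyncio
import contextlib


async def shutdown(cleanup_rpc_server, cleanup_hardware) -> None:
    # Hypothetical stand-ins are passed as parameters to keep this sketch
    # self-contained; the real cleanup functions take app.state.
    async with contextlib.AsyncExitStack() as stack:
        # Callbacks run LIFO on exit: cleanup_rpc_server first, then
        # cleanup_hardware. If one raises, the remaining callbacks still
        # run, and the exception propagates after the stack unwinds.
        stack.push_async_callback(cleanup_hardware)
        stack.push_async_callback(cleanup_rpc_server)
```

Exceptions from multiple callbacks are chained rather than swallowed, so nothing fails silently.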
I hear you and agree with your concerns, but I'd rather keep the `gather` for this PR because:

- We know `protocol_manager.remove_all()`, as written, won't raise
- We know RPC cleanup will raise in a common condition (a connection present during shutdown)
- We know that hardware cleanup and RPC cleanup, as written, can run concurrently
- If something changes in the future that means they can't run concurrently, there will be a log statement

Removing the `gather` makes me more nervous than keeping it. Let's not let perfect be the enemy of good.
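A minimal sketch of that arrangement (hypothetical names; the real code lives in the PR's shutdown handler): `return_exceptions=True` keeps one failing cleanup from cancelling the other, and the collected results are scanned so any failure at least hits the log.

```python
import asyncio
import logging

log = logging.getLogger("shutdown")


async def shutdown_concurrently(*cleanups) -> None:
    # Run all cleanups concurrently; collect exceptions instead of raising.
    results = await asyncio.gather(
        *(cleanup() for cleanup in cleanups), return_exceptions=True
    )
    for cleanup, result in zip(cleanups, results):
        if isinstance(result, BaseException):
            # Surface the failure without interrupting the other cleanups.
            log.warning("%s raised during shutdown: %r", cleanup.__name__, result)
```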
```python
request: Request = None,
websocket: WebSocket = None,
```
Wow. I'm amazed that this works...? Is this in the docs or discussed in a GitHub issue anywhere?
I found it by reading the Starlette source, where it is present in the public API of the `HTTPConnection` base class:

```python
class HTTPConnection(Mapping):
    """
    A base class for incoming HTTP connections, that is used to provide
    any functionality that is common to both `Request` and `WebSocket`.
    """

    # ...

    @property
    def app(self) -> typing.Any:
        return self.scope["app"]
```

Probably worth a docs PR into both FastAPI and Starlette.
I'm more surprised by the `= None` thing working.
Oh shoot, yeah, that needs a comment. It needs to be `request: Request` and `websocket: WebSocket` in order for FastAPI to magically pick up the type annotations and inject the object; `request: Optional[Request]` will blow up DI.

From there, our mypy config is set to be a little more lenient than I'd like, where you're allowed to do `foo: SomeType = None` instead of `foo: Optional[SomeType] = None`. Will add comments to that effect because it's a gotcha.
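For reference, the leniency being described is mypy's implicit-Optional behavior. Assuming a standard `mypy.ini` setup (this fragment is illustrative, not the project's actual config), it can be tightened with one flag:

```ini
[mypy]
# Reject `foo: SomeType = None`; require the explicit
# `foo: Optional[SomeType] = None` spelling instead.
no_implicit_optional = True
```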
```python
if initialize_task is not None:
    initialize_task.cancel()
    await asyncio.gather(initialize_task, return_exceptions=True)
```
I think this can just be this?

```diff
- await asyncio.gather(initialize_task, return_exceptions=True)
+ await initialize_task
```
This is to ensure shutdown happens cleanly (read: without raising) if:

- Initialization raised for some reason
- Shutdown is requested before the hardware initialization finishes

If the stuff that needs to be cleaned up is gone, then cleanup is already satisfied, and a raise would be counterproductive.
Aha. I think I'd find a `try`/`except` clearer, but this works for me if you want to keep it as-is.
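A sketch of what that `try`/`except` might look like (the helper name is hypothetical; the actual PR code uses `gather(..., return_exceptions=True)` as quoted above):

```python
import asyncio
from typing import Optional


async def reap_initialize_task(initialize_task: Optional[asyncio.Task]) -> None:
    # Hypothetical helper: cancel a possibly-unfinished init task and
    # swallow the two expected failure modes so shutdown stays clean.
    if initialize_task is not None:
        initialize_task.cancel()
        try:
            await initialize_task
        except asyncio.CancelledError:
            # Shutdown was requested before initialization finished.
            pass
        except Exception:
            # Initialization itself raised; nothing left to clean up.
            pass
```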
```python
initialize_task = getattr(app_state, _HARDWARE_API_INIT_TASK_KEY, None)
hardware_api = getattr(app_state, _HARDWARE_API_KEY, None)
unsubscribe_from_events = getattr(app_state, _HARDWARE_EVENT_UNSUBSCRIBE_KEY, None)
```
I think this is good for this PR, but one of the things I was trying to work out in #8213 was how to avoid too much indirection through keys, because it subverts typechecking.
What ended up in that PR was that you'd access one `DependencySet` dataclass through a key, and then access the actual things you cared about in a type-safe way through that dataclass. So the indirection was still there, but it was consolidated and mitigated.
That strategy was messy in other ways, so I don't think it's the final solution, but I do hope we can find some other way to avoid this.
I think this would be better if I knew how to get some sort of immutable, unique Symbol-like object in Python for use as the key, rather than a string, where a remote but real possibility of key collisions exists. But to be honest, that's the extent of my concern with this.

I don't share the concern about subverting typechecking, for a few reasons:

- The keys and value storage on `AppState` are entirely private implementation details of this module
- The primary value I derive from typing in scripted languages is code documentation and maybe some refactor safety
  - I think the object keys satisfy the documentation value, and I think this module is too small for any serious refactor safety concerns
  - Some type annotations on the variable declarations probably couldn't hurt, though!

In order to strictly type `AppState`, you end up needing to:

- Create an object that has to know about the implementation details of every module in the system that may want to store arbitrary global data, breaking encapsulation
- Introduce circular (type) dependencies

Those, to me, are serious downsides that introduce brittleness and coupling into the entire application, and I don't see what value it gives us in exchange. As we work with `AppState`, all data attached to `AppState` is private to the module that creates it.
> In order to strictly type `AppState`, you end up needing to:
>
> - Create an object that has to know about the implementation details of every module in the system that may want to store arbitrary global data, breaking encapsulation
> - Introduce circular (type) dependencies
Here's another approach that avoids those downsides.
```python
class TypedStateValue(typing.Generic[T]):
    def __init__(self, unique_key: str) -> None:
        # Alternatively, we might be able to have this autogenerate unique keys.
        self._unique_key = unique_key

    def get_from(self, app_state: AppState) -> T:
        return getattr(app_state, self._unique_key)

    def set_on(self, app_state: AppState, value: T) -> None:
        setattr(app_state, self._unique_key, value)


_initialize_task = TypedStateValue[asyncio.Task]("_initialize_task")
_hardware_api = TypedStateValue[API]("_hardware_api")
_unsubscribe_from_events = TypedStateValue[typing.Callable[[], None]](
    "_unsubscribe_from_events"
)


async def cleanup_hardware(app_state: AppState) -> None:
    initialize_task = _initialize_task.get_from(app_state)
    hardware_api = _hardware_api.get_from(app_state)
    unsubscribe_from_events = _unsubscribe_from_events.get_from(app_state)
    # ...
```
I agree that the severity of not having typechecking is lessened by this module being small and these attributes being internal. I do still think we should strive to have typechecking when it doesn't take "too much" machinery, and I hope the sketch above falls under the threshold for "too much"?
That's a pretty neat little helper; will add it to `robot_server/app_state.py`.
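For illustration, a self-contained sketch of how the helper might be exercised (the `SimpleNamespace` is a stand-in for Starlette's `app.state`, and `TypedStateValue` is repeated here so the example runs on its own):

```python
import typing
from types import SimpleNamespace

T = typing.TypeVar("T")


class TypedStateValue(typing.Generic[T]):
    """Typed accessor for one value stored on an app-state-like object."""

    def __init__(self, unique_key: str) -> None:
        self._unique_key = unique_key

    def get_from(self, app_state: typing.Any) -> T:
        return getattr(app_state, self._unique_key)

    def set_on(self, app_state: typing.Any, value: T) -> None:
        setattr(app_state, self._unique_key, value)


# SimpleNamespace stands in for the real AppState object.
app_state = SimpleNamespace()
_robot_name = TypedStateValue[str]("_robot_name")
_robot_name.set_on(app_state, "OT-2")
assert _robot_name.get_from(app_state) == "OT-2"
```

The string key is still there under the hood, but callers only touch the typed accessor, so mypy sees `str` in and `str` out.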
```python
"CALL_NACK_MESSAGE",
"PONG_MESSAGE",
"RPCServer",
"get_rpc_server",
```
Requesting changes (or at least discussion) for this.
This makes me nervous because you can now do, for example, `from service.legacy import get_rpc_server`. And then, unless you read the docstring, it looks like `get_rpc_server` is just a normal function that you can call like `get_rpc_server()`.

I'm hoping for a strong naming convention that sets our FastAPI dependency functions apart from other functions. Possibilities:

- Making sure `dependencies` is always in the import path somewhere, like what we had before, so it's `service.legacy.rpc.dependencies.get_rpc_server`. I think we discussed this before, but I don't remember your thoughts on it. I personally don't mind it.
- Making it something like `service.legacy.rpc.dep_rpc_server`.

Ditto for `robot_server.hardware.get_hardware`.
If there's one thing I've learned working with FastAPI, it's that nothing is ever just a normal function you can call 🙃. Mostly, though, I guess I don't feel nervous about this? If someone comes in thinking they can call the `get_rpc_server` function outside of `Depends` or without passing in `AppState` explicitly, they'll be disabused of that notion when whatever it was they were trying to do doesn't work at all.

(I'm not counting the server setup and teardown bugs that we definitely did not notice for a while, because a) teardown doesn't really matter because the process is stopped, and b) the part of setup that was broken wasn't a production codepath, nor was it ever subject to actual tests. In anything even remotely production-facing, I think we'd notice pretty quickly.)

My TL;DR here is that I feel it's reasonable to expect a developer in this codebase to understand how and where the dependency injection system is used. That's my 🌶️ take, so with that out of the way:

- I dislike putting `dependencies` in a public module name
  - This opinion generally holds for any module name that describes the type of thing it is rather than the scope of its logic in the application, including `models`, `constants`, `types`, etc.
- I could live with `get_hardware_dependency`, etc.
> If there's one thing I've learned working with FastAPI, it's that nothing is ever just a normal function you can call

Magic is good because Everything Always Just Works, right? 🙃

> Mostly, though, I guess I don't feel nervous about this?...

- It's certainly possible I'm just traumatized by the server setup/teardown bugs, but I'm still a bit nervous about the same thing happening again.
- Also, even if FastAPI dependency functions weren't special and error-prone in this way, I think I can see a separation-of-layers argument for keeping them apart?

I think let's keep this as-is for this PR, and open a naming convention discussion in #python or something.
Code changes since my last review look good, though there are still some open conversations.
Smoke tests (according to the plan in the description) passed. 🎉
Very cool!!
Just the comment about making sure hardware api initializes correctly on a refresh robot.
```python
log.info(f"Robot Name: {name()}")
systemdd_notify()
```
This might have to go after the hardware api initialization. Putting it here might result in GPIO access conflict on a refresh robot. Have we tested this on an OT2R?
I've only tested on non-refresh, I believe. I put it here because I thought I saw it come before hardware init in the existing code. Definitely worth double- and triple-checking.
Cool. I can go over the service files and test on the refresh electronics.
Tested that refresh electronics start up correctly. PR is good to merge!
Co-authored-by: Sanniti Pimpley <[email protected]>
Overview
What I set out to do with this PR is two things:

- Do not use `ThreadManager` if the "Enable Protocol Engine" feature flag is enabled.
- Replace `hardware_wrapper.py` with a `robot_server.hardware` module containing initialization and dependency logic.

My intention was to avoid touching the larger questions of lifetime dependencies raised in #8213, but in consolidating the initialization / cleanup logic of the hardware API, it looks to me like this code fixes #8023. It affects #8051, but I don't think it resolves it in spirit, so let's leave it open.

In touching the `get_hardware` dependency, I was forced to touch the `get_rpc_server` dependency, and it turns out those are the only two things that need explicit cleanup. I'm not sure where this leaves #8213, but I sort of think any general lifetime dependency solutions should probably wait until:

Changelog
- `ThreadManager` and its `SyncAdapter`

Review requests

TL;DR: run protocols with the "Enable Protocol Engine" feature flag on and off, and see if anything breaks. If you're not familiar with the engine stuff, smoke testing for no regressions while the `ThreadManager` is still in play is very helpful for this PR.

- `/protocols` and `/sessions` endpoints work [tested by @SyntaxColoring]

Risk assessment

Medium, due to touching hardware initialization code and a nearly complete lack of test coverage.