-
Notifications
You must be signed in to change notification settings - Fork 178
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(robot-server): Stop opentrons-gpio-setup
before trying to control the front button light
#8580
Conversation
Codecov Report
@@ Coverage Diff @@
## edge #8580 +/- ##
==========================================
- Coverage 74.40% 74.39% -0.02%
==========================================
Files 1685 1686 +1
Lines 45471 45467 -4
Branches 4551 4551
==========================================
- Hits 33831 33823 -8
- Misses 10847 10851 +4
Partials 793 793
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty big behavioral regression and I don't think we should do it.
What if instead we let the hardware controller retry a couple times? We could even make the behavior conditional on an environment variable that is only set by the systemd service. The changes would then be entirely within the robot server code and its unit file, which are both in the same repo.
I think in the long run you're correct that the real solution is actual communication between the two programs, and ideally a rehoming of the blinkenlights service into the monorepo to facilitate it, while keeping the blinkenlights service light enough that it starts quickly. But adding a retry or poll to the robot server gpio takeover is a much better fix in the short term because while the code might be a bit ugly, it presents the same or better behavior to the user.
(Reviewed in person) I agree with @sfoster1 that this introduces behavioral regression. And definitely agreed that actual communication between the two services/ tasks would be the goal. But to get things working again on edge quickly, I would be fine with both Seth's approach of using env variable, or @SyntaxColoring 's approach of doing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much bigger fan of this, nice! Let's use an asyncio subprocess, but I think this is the actual correct thing to do - the systemctl stop
should be synchronous to the other unit actually coming down, so the race condition should go away.
It turns out we already do this, deeper inside the stack. opentrons/api/src/opentrons/drivers/rpi_drivers/gpio.py Lines 70 to 73 in d108c1e
And I guess it makes sense that this doesn't work. Per comments above, I've pushed a couple of commits to switch approaches. [Edit: From your comment directly above, looks like you've already seen this, lol. Our GitHub comments raced. Leaving this here for posterity.] Now, the sequence is:
This approach:
|
robot-server
fully mutually exclusive with opentrons-gpio-setup
opentrons-gpio-setup
before trying to control the front button light
@@ -108,37 +108,73 @@ async def get_hardware(app_state: AppState = Depends(get_app_state)) -> Hardware | |||
return hardware_api | |||
|
|||
|
|||
async def _initialize_hardware_api(app_state: AppState) -> None: | |||
async def _initialize_hardware_api(app_state: AppState) -> None: # noqa: C901 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
noqa: C901
overrides flake8 saying this function is too complex, which it definitely is. I think another PR should change this function to -> HardwareAPI
, drop the gigantic try
/except
, and refactor calling code to take over for exception logging.
#8213's SlowInitializing
helper would have avoided this, but we tossed that PR because it felt complex in other ways. Maybe we should revisit that part of it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. I think we could probably at least fix the try/accept by adding another level of function call but that's just bikeshedding at this point, you're right about the refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome!
Overview
This PR is a quick workaround for a problem on our
edge
branch builds whererobot-server
fails to start up.Background on the problem
opentrons-gpio-setup
, akaot-blinkenlights
, is a standalone service that blinks the OT-2's front button light during startup, independently of other services.Later,
opentrons-robot-server
starts up, and it also wants to blink the button. The intent seems to have been for it to work like this:opentrons-gpio-setup
is running.opentrons-robot-server
begins starting up. Meanwhile,opentrons-gpio-setup
continues blinking the light in the background.Eventually, when
opentrons-robot-server
is ready, it takes over blinking duty fromopentrons-gpio-setup
.Importantly, things are arranged so that
opentrons-gpio-setup
is shut down just before this happens, to avoid a "device busy" conflict from the two processes attempting to acquire the same GPIO line at once.opentrons-robot-server
does some slow hardware initialization stuff and homes the gantry, while it does its own blinking.Finally,
opentrons-robot-server
sets the button solid blue when it's done initializing.This had been working for a while.
But #8536 affected step (3). It changed the relative timing so that, at the point where
opentrons-robot-server
starts blinking the light,opentrons-gpio-setup
is still running and trying to blink it itself. This causes a "device busy" error insideopentrons-robot-server
, preventing it from fully initializing.And all along, even when this was working, it was kind of dicey. We did the mutual exclusion in step (3) by configuring systemd to shut down
opentrons-gpio-setup
whenopentrons-robot-server
tells systemd it's done initializing. The problems that I see with this are:opentrons-gpio-setup
whenopentrons-robot-server
is "done initializing" doesn't guarantee thatopentrons-gpio-setup
will be totally dead by the timeopentrons-robot-server
attempts to acquire the GPIO line. If it's not, we'll get the same "device busy" conflict.Changelog
Makeopentrons-robot-server
andopentrons-gpio-setup
mutually exclusive at the systemd service level. systemd now will not let the two services run at the same time. Before startingopentrons-robot-server
, systemd will automatically stopopentrons-gpio-setup
and wait for it to exit.Edit: The approach has changed. See the thread below.
Caveat
Unfortunately, in practice, this workaround makes things look like this:You flip on the OT-2's power switch. The OT-2 starts booting.Ifopentrons-gpio-setup
even gets the chance to run at all,systemd
will very quickly stop it, because it wants to startopentrons-robot-server
now.The robot will spend ~1m45s without the light blinking. Most of this is spent insideopentrons-robot-server
, after it's started, but before it's gotten to the point where it's able to blink the light.opentrons-robot-server
will finally blink the light for a few seconds as it does some hardware initialization stuff and homes the gantry.opentrons-robot-server
will finally turn the button solid blue.This is certainly something we'll have to improve before the next release. 1m45s without the light blinking makes it look like the boot has completely failed.Options for real solutions
I'd like to kick further investigation of these into another issue/PR, but here are some options off the top of my head for fixing the above caveat:
opentrons-robot-server
so it doesn't take 1m45s to get to the point where it blinks the front button. If this is possible, it would even let us get rid of the separateopentrons-gpio-setup
service.opentrons-gpio-setup
, but find a more robust way foropentrons-robot-server
to signal it to back off when it's ready to take over button-blinking. Maybeopentrons-robot-server
would sendopentrons-gpio-setup
a custom dbus message, or something.opentrons-gpio-setup
, but makeopentrons-robot-server
synchronously stop it before taking over button-blinking. Doingsubprocess.run(["systemctl", "stop", "opentrons-gpio-setup"])
could I guess, but that feels icky to me.opentrons-robot-server
's "done initializing" update earlier, soopentrons-gpio-setup
has time to stop, in practice. I think this would be a bummer; for reasons described above, this was dicey even though it happened to work.Review requests
Are we okay with the 1m45s caveat described above?Manual testing:
I've tested on a non-refresh OT-2 and confirmed that this works as described above.
If you want to test yourself, you need to SSH in,mount -o remount,rw /
, and manually edit/etc/systemd/system/opentrons-robot-server.service
(e.g. withvi
).make push
will not update this file.Risk assessment
Low-ish. For now, I think this is the simplest, smallest change that's guaranteed to solve the immediate problem. But it does affect systemd service dependencies, which are inter-repo and difficult to reason about.