refactor(robot-server): Stop `opentrons-gpio-setup` before trying to control the front button light #8580

SyntaxColoring · 2021-10-21T20:25:55Z

Overview

This PR is a quick workaround for a problem on our edge branch builds where robot-server fails to start up.

Background on the problem

opentrons-gpio-setup, aka ot-blinkenlights, is a standalone service that blinks the OT-2's front button light during startup, independently of other services.

Later, opentrons-robot-server starts up, and it also wants to blink the button. The intent seems to have been for it to work like this:

opentrons-gpio-setup is running.
opentrons-robot-server begins starting up. Meanwhile, opentrons-gpio-setup continues blinking the light in the background.
Eventually, when opentrons-robot-server is ready, it takes over blinking duty from opentrons-gpio-setup.

Importantly, things are arranged so that opentrons-gpio-setup is shut down just before this happens, to avoid a "device busy" conflict from the two processes attempting to acquire the same GPIO line at once.
opentrons-robot-server does some slow hardware initialization stuff and homes the gantry, while it does its own blinking.
Finally, opentrons-robot-server sets the button solid blue when it's done initializing.

This had been working for a while.

But #8536 affected step (3). It changed the relative timing so that, at the point where opentrons-robot-server starts blinking the light, opentrons-gpio-setup is still running and trying to blink it itself. This causes a "device busy" error inside opentrons-robot-server, preventing it from fully initializing.

And all along, even when this was working, it was kind of dicey. We did the mutual exclusion in step (3) by configuring systemd to shut down opentrons-gpio-setup when opentrons-robot-server tells systemd it's done initializing. The problems that I see with this are:

It's inherently racy. Stopping opentrons-gpio-setup when opentrons-robot-server is "done initializing" doesn't guarantee that opentrons-gpio-setup will be totally dead by the time opentrons-robot-server attempts to acquire the GPIO line. If it's not, we'll get the same "device busy" conflict.
Even if that worked, it's unintuitive to me for a process to send a "done initializing" update as a synchronization mechanism for making it safe for that process to continue initializing.

Changelog

Make opentrons-robot-server and opentrons-gpio-setup mutually exclusive at the systemd service level. systemd now will not let the two services run at the same time. Before starting opentrons-robot-server, systemd will automatically stop opentrons-gpio-setup and wait for it to exit.

Edit: The approach has changed. See the thread below.

Caveat

~~Unfortunately, in practice, this workaround makes things look like this:~~

~~You flip on the OT-2's power switch. The OT-2 starts booting.~~
~~If opentrons-gpio-setup even gets the chance to run at all, systemd will very quickly stop it, because it wants to start opentrons-robot-server now.~~
The robot will spend ~1m45s without the light blinking. Most of this is spent inside opentrons-robot-server, after it's started, but before it's gotten to the point where it's able to blink the light.
~~opentrons-robot-server will finally blink the light for a few seconds as it does some hardware initialization stuff and homes the gantry.~~
~~opentrons-robot-server will finally turn the button solid blue.~~

~~This is certainly something we'll have to improve before the next release. 1m45s without the light blinking makes it look like the boot has completely failed.~~

Options for real solutions

I'd like to kick further investigation of these into another issue/PR, but here are some options off the top of my head for fixing the above caveat:

Somehow rework opentrons-robot-server so it doesn't take 1m45s to get to the point where it blinks the front button. If this is possible, it would even let us get rid of the separate opentrons-gpio-setup service.
Keep opentrons-gpio-setup, but find a more robust way for opentrons-robot-server to signal it to back off when it's ready to take over button-blinking. Maybe opentrons-robot-server would send opentrons-gpio-setup a custom dbus message, or something.
Keep opentrons-gpio-setup, but make opentrons-robot-server synchronously stop it before taking over button-blinking. Doing subprocess.run(["systemctl", "stop", "opentrons-gpio-setup"]) could I guess, but that feels icky to me.
- [Edit: This PR has changed to do this. See the thread below.]
Go back closer to the way things were before refactor(api,robot-server): sd_notify in server #8536. Move opentrons-robot-server's "done initializing" update earlier, so opentrons-gpio-setup has time to stop, in practice. I think this would be a bummer; for reasons described above, this was dicey even though it happened to work.

Review requests

~~Are we okay with the 1m45s caveat described above?~~
Manual testing:

I've tested on a non-refresh OT-2 and confirmed that this works as described above. If you want to test yourself, you need to SSH in, mount -o remount,rw /, and manually edit /etc/systemd/system/opentrons-robot-server.service (e.g. with vi). make push will not update this file.

Risk assessment

Low-ish. For now, I think this is the simplest, smallest change that's guaranteed to solve the immediate problem. But it does affect systemd service dependencies, which are inter-repo and difficult to reason about.

codecov · 2021-10-21T20:26:08Z

Codecov Report

Merging #8580 (7a9d31c) into edge (1a68d66) will decrease coverage by 0.01%.
The diff coverage is 50.00%.

@@            Coverage Diff             @@
##             edge    #8580      +/-   ##
==========================================
- Coverage   74.40%   74.39%   -0.02%     
==========================================
  Files        1685     1686       +1     
  Lines       45471    45467       -4     
  Branches     4551     4551              
==========================================
- Hits        33831    33823       -8     
- Misses      10847    10851       +4     
  Partials      793      793

Flag	Coverage Δ
api	`84.57% <ø> (+0.01%)`	⬆️
robot-server	`93.45% <50.00%> (-0.17%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
api/src/opentrons/drivers/rpi_drivers/gpio.py	`5.20% <ø> (ø)`
robot-server/robot_server/hardware.py	`75.23% <50.00%> (-7.38%)`	⬇️
api/src/opentrons/__init__.py	`45.91% <0.00%> (-1.96%)`	⬇️
api/src/opentrons/protocol_engine/state/state.py	`98.70% <0.00%> (-1.30%)`	⬇️
...pentrons/protocol_api/paired_instrument_context.py	`81.25% <0.00%> (-0.17%)`	⬇️
api/src/opentrons/commands/types.py	`99.40% <0.00%> (-0.01%)`	⬇️
api/src/opentrons/commands/helpers.py	`86.36% <0.00%> (ø)`
api/src/opentrons/commands/commands.py	`90.54% <0.00%> (ø)`
api/src/opentrons/commands/paired_commands.py	`83.82% <0.00%> (ø)`
...pi/src/opentrons/protocol_engine/state/__init__.py	`100.00% <0.00%> (ø)`
... and 10 more

sfoster1

This is a pretty big behavioral regression and I don't think we should do it.

What if instead we let the hardware controller retry a couple times? We could even make the behavior conditional on an environment variable that is only set by the systemd service. The changes would then be entirely within the robot server code and its unit file, which are both in the same repo.

I think in the long run you're correct that the real solution is actual communication between the two programs, and ideally a rehoming of the blinkenlights service into the monorepo to facilitate it, while keeping the blinkenlights service light enough that it starts quickly. But adding a retry or poll to the robot server gpio takeover is a much better fix in the short term because while the code might be a bit ugly, it presents the same or better behavior to the user.

sanni-t · 2021-10-22T15:56:02Z

(Reviewed in person) I agree with @sfoster1 that this introduces behavioral regression. And definitely agreed that actual communication between the two services/ tasks would be the goal. But to get things working again on edge quickly, I would be fine with both Seth's approach of using env variable, or @SyntaxColoring 's approach of doing subprocess.run(["systemctl", "stop", "opentrons-gpio-setup"]) before button blinking.

sfoster1

Much bigger fan of this, nice! Let's use an asyncio subprocess, but I think this is the actual correct thing to do - the systemctl stop should be synchronous to the other unit actually coming down, so the race condition should go away.

robot-server/robot_server/hardware.py

SyntaxColoring · 2021-10-22T19:32:35Z

What if instead we let the hardware controller retry a couple times?

It turns out we already do this, deeper inside the stack.

opentrons/api/src/opentrons/drivers/rpi_drivers/gpio.py

Lines 70 to 73 in d108c1e

 if name == "blue_button": 

 return _retry_request_line(3) 

 else: 

 return _retry_request_line()

And I guess it makes sense that this doesn't work. opentrons-robot-server's notify() was moved after its GPIO acquisition, and opentrons-gpio-setup's GPIO release was predicated on opentrons-robot-server's notify(). So opentrons-robot-server could retry forever but never make progress.

Per comments above, I've pushed a couple of commits to switch approaches. [Edit: From your comment directly above, looks like you've already seen this, lol. Our GitHub comments raced. Leaving this here for posterity.]

Now, the sequence is:

OT-2 starts booting.
opentrons-gpio-setup starts blinking the light.
opentrons-robot-server starts. opentrons-gpio-setup remains in the background, continuing to blink the light.
opentrons-robot-server reaches the point where it wants to take over light-blinking.
opentrons-robot-server calls systemctl stop opentrons-gpio-setup and waits for that to complete.
- We could have used systemd's dbus API to programmatically stop opentrons-gpio-setup instead of using subprocess.run(), but this seemed hard and not worthwhile.
opentrons-robot-server acquires GPIOs and finishes initializing.
opentrons-robot-server calls notify() to inform systemd it's done initializing.

This approach:

Preserves the main improvement of refactor(api,robot-server): sd_notify in server #8536, which is notify()ing from the robot-server service instead of from the opentrons library.
Solves the underlying race condition.
Decouples robot-server's "done initializing" point from its "about to acquire GPIOs and initialize hardware" point, which I think is important.
Is slightly ugly in that robot-server's Python code needs to be aware of the magical name opentrons-gpio-setup. IMO this is acceptable, especially in the short term.

SyntaxColoring · 2021-10-22T22:03:02Z

robot-server/robot_server/hardware.py

@@ -108,37 +108,73 @@ async def get_hardware(app_state: AppState = Depends(get_app_state)) -> Hardware
 return hardware_api


-async def _initialize_hardware_api(app_state: AppState) -> None:
+async def _initialize_hardware_api(app_state: AppState) -> None: # noqa: C901


noqa: C901 overrides flake8 saying this function is too complex, which it definitely is. I think another PR should change this function to -> HardwareAPI, drop the gigantic try/except, and refactor calling code to take over for exception logging.

#8213's SlowInitializing helper would have avoided this, but we tossed that PR because it felt complex in other ways. Maybe we should revisit that part of it.

sfoster1

Looks good to me. I think we could probably at least fix the try/accept by adding another level of function call but that's just bikeshedding at this point, you're right about the refactor.

sanni-t

Awesome!

Stop opentrons-gpio-setup before starting opentrons-robot-server.

d108c1e

SyntaxColoring added robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). robot server Affects the `robot-server` project labels Oct 21, 2021

SyntaxColoring requested a review from a team as a code owner October 21, 2021 20:25

SyntaxColoring mentioned this pull request Oct 21, 2021

refactor(api): Small improvements to button blinking during hardware initialization #8582

Merged

SyntaxColoring requested a review from sanni-t October 21, 2021 22:00

sfoster1 requested changes Oct 22, 2021

View reviewed changes

SyntaxColoring added 2 commits October 22, 2021 14:56

Stop opentrons-gpio-service from within robot-server's Python code.

9b5b510

Leave todo for eliminating blue button GPIO line retries.

fc6eac3

SyntaxColoring requested a review from a team as a code owner October 22, 2021 18:56

sfoster1 requested changes Oct 22, 2021

View reviewed changes

robot-server/robot_server/hardware.py Outdated Show resolved Hide resolved

Use an asyncio subprocess, and "fix" exceptions not being logged.

7a9d31c

SyntaxColoring changed the title ~~refactor(robot-server): Make robot-server fully mutually exclusive with opentrons-gpio-setup~~ refactor(robot-server): Stop opentrons-gpio-setup before trying to control the front button light Oct 22, 2021

SyntaxColoring commented Oct 22, 2021

View reviewed changes

SyntaxColoring requested a review from sfoster1 October 25, 2021 13:49

sfoster1 approved these changes Oct 25, 2021

View reviewed changes

sanni-t approved these changes Oct 25, 2021

View reviewed changes

SyntaxColoring merged commit fcf7cdb into edge Oct 25, 2021

SyntaxColoring deleted the fix_gpio_conflict_on_boot branch October 25, 2021 16:21

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(robot-server): Stop `opentrons-gpio-setup` before trying to control the front button light #8580

refactor(robot-server): Stop `opentrons-gpio-setup` before trying to control the front button light #8580

SyntaxColoring commented Oct 21, 2021 •

edited

Loading

codecov bot commented Oct 21, 2021 •

edited

Loading

sfoster1 left a comment

sanni-t commented Oct 22, 2021

sfoster1 left a comment

SyntaxColoring commented Oct 22, 2021 •

edited

Loading

SyntaxColoring Oct 22, 2021 •

edited

Loading

sfoster1 left a comment

sanni-t left a comment

refactor(robot-server): Stop opentrons-gpio-setup before trying to control the front button light #8580

refactor(robot-server): Stop opentrons-gpio-setup before trying to control the front button light #8580

Conversation

SyntaxColoring commented Oct 21, 2021 • edited Loading

Overview

Background on the problem

Changelog

Caveat

Options for real solutions

Review requests

Risk assessment

codecov bot commented Oct 21, 2021 • edited Loading

Codecov Report

sfoster1 left a comment

Choose a reason for hiding this comment

sanni-t commented Oct 22, 2021

sfoster1 left a comment

Choose a reason for hiding this comment

SyntaxColoring commented Oct 22, 2021 • edited Loading

SyntaxColoring Oct 22, 2021 • edited Loading

Choose a reason for hiding this comment

sfoster1 left a comment

Choose a reason for hiding this comment

sanni-t left a comment

Choose a reason for hiding this comment

refactor(robot-server): Stop `opentrons-gpio-setup` before trying to control the front button light #8580

refactor(robot-server): Stop `opentrons-gpio-setup` before trying to control the front button light #8580

SyntaxColoring commented Oct 21, 2021 •

edited

Loading

codecov bot commented Oct 21, 2021 •

edited

Loading

SyntaxColoring commented Oct 22, 2021 •

edited

Loading

SyntaxColoring Oct 22, 2021 •

edited

Loading