Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(robot-server): Stop opentrons-gpio-setup before trying to control the front button light #8580

Merged
merged 4 commits into from
Oct 25, 2021

Conversation

SyntaxColoring
Copy link
Contributor

@SyntaxColoring SyntaxColoring commented Oct 21, 2021

Overview

This PR is a quick workaround for a problem on our edge branch builds where robot-server fails to start up.

Background on the problem

opentrons-gpio-setup, aka ot-blinkenlights, is a standalone service that blinks the OT-2's front button light during startup, independently of other services.

Later, opentrons-robot-server starts up, and it also wants to blink the button. The intent seems to have been for it to work like this:

  1. opentrons-gpio-setup is running.

  2. opentrons-robot-server begins starting up. Meanwhile, opentrons-gpio-setup continues blinking the light in the background.

  3. Eventually, when opentrons-robot-server is ready, it takes over blinking duty from opentrons-gpio-setup.

    Importantly, things are arranged so that opentrons-gpio-setup is shut down just before this happens, to avoid a "device busy" conflict from the two processes attempting to acquire the same GPIO line at once.

  4. opentrons-robot-server does some slow hardware initialization stuff and homes the gantry, while it does its own blinking.

  5. Finally, opentrons-robot-server sets the button solid blue when it's done initializing.

This had been working for a while.

But #8536 affected step (3). It changed the relative timing so that, at the point where opentrons-robot-server starts blinking the light, opentrons-gpio-setup is still running and trying to blink it itself. This causes a "device busy" error inside opentrons-robot-server, preventing it from fully initializing.

And all along, even when this was working, it was kind of dicey. We did the mutual exclusion in step (3) by configuring systemd to shut down opentrons-gpio-setup when opentrons-robot-server tells systemd it's done initializing. The problems that I see with this are:

  • It's inherently racy. Stopping opentrons-gpio-setup when opentrons-robot-server is "done initializing" doesn't guarantee that opentrons-gpio-setup will be totally dead by the time opentrons-robot-server attempts to acquire the GPIO line. If it's not, we'll get the same "device busy" conflict.
  • Even if that worked, it's unintuitive to me for a process to send a "done initializing" update as a synchronization mechanism for making it safe for that process to continue initializing.

Changelog

Make opentrons-robot-server and opentrons-gpio-setup mutually exclusive at the systemd service level. systemd now will not let the two services run at the same time. Before starting opentrons-robot-server, systemd will automatically stop opentrons-gpio-setup and wait for it to exit.

Edit: The approach has changed. See the thread below.

Caveat

Unfortunately, in practice, this workaround makes things look like this:

  1. You flip on the OT-2's power switch. The OT-2 starts booting.
  2. If opentrons-gpio-setup even gets the chance to run at all, systemd will very quickly stop it, because it wants to start opentrons-robot-server now.
  3. The robot will spend ~1m45s without the light blinking. Most of this is spent inside opentrons-robot-server, after it's started, but before it's gotten to the point where it's able to blink the light.
  4. opentrons-robot-server will finally blink the light for a few seconds as it does some hardware initialization stuff and homes the gantry.
  5. opentrons-robot-server will finally turn the button solid blue.

This is certainly something we'll have to improve before the next release. 1m45s without the light blinking makes it look like the boot has completely failed.

Options for real solutions

I'd like to kick further investigation of these into another issue/PR, but here are some options off the top of my head for fixing the above caveat:

  • Somehow rework opentrons-robot-server so it doesn't take 1m45s to get to the point where it blinks the front button. If this is possible, it would even let us get rid of the separate opentrons-gpio-setup service.
  • Keep opentrons-gpio-setup, but find a more robust way for opentrons-robot-server to signal it to back off when it's ready to take over button-blinking. Maybe opentrons-robot-server would send opentrons-gpio-setup a custom dbus message, or something.
  • Keep opentrons-gpio-setup, but make opentrons-robot-server synchronously stop it before taking over button-blinking. Doing subprocess.run(["systemctl", "stop", "opentrons-gpio-setup"]) could I guess, but that feels icky to me.
    • [Edit: This PR has changed to do this. See the thread below.]
  • Go back closer to the way things were before refactor(api,robot-server): sd_notify in server #8536. Move opentrons-robot-server's "done initializing" update earlier, so opentrons-gpio-setup has time to stop, in practice. I think this would be a bummer; for reasons described above, this was dicey even though it happened to work.

Review requests

  • Are we okay with the 1m45s caveat described above?

  • Manual testing:

    I've tested on a non-refresh OT-2 and confirmed that this works as described above. If you want to test yourself, you need to SSH in, mount -o remount,rw /, and manually edit /etc/systemd/system/opentrons-robot-server.service (e.g. with vi). make push will not update this file.

Risk assessment

Low-ish. For now, I think this is the simplest, smallest change that's guaranteed to solve the immediate problem. But it does affect systemd service dependencies, which are inter-repo and difficult to reason about.

@SyntaxColoring SyntaxColoring added robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience). robot server Affects the `robot-server` project labels Oct 21, 2021
@SyntaxColoring SyntaxColoring requested a review from a team as a code owner October 21, 2021 20:25
@codecov
Copy link

codecov bot commented Oct 21, 2021

Codecov Report

Merging #8580 (7a9d31c) into edge (1a68d66) will decrease coverage by 0.01%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##             edge    #8580      +/-   ##
==========================================
- Coverage   74.40%   74.39%   -0.02%     
==========================================
  Files        1685     1686       +1     
  Lines       45471    45467       -4     
  Branches     4551     4551              
==========================================
- Hits        33831    33823       -8     
- Misses      10847    10851       +4     
  Partials      793      793              
Flag Coverage Δ
api 84.57% <ø> (+0.01%) ⬆️
robot-server 93.45% <50.00%> (-0.17%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
api/src/opentrons/drivers/rpi_drivers/gpio.py 5.20% <ø> (ø)
robot-server/robot_server/hardware.py 75.23% <50.00%> (-7.38%) ⬇️
api/src/opentrons/__init__.py 45.91% <0.00%> (-1.96%) ⬇️
api/src/opentrons/protocol_engine/state/state.py 98.70% <0.00%> (-1.30%) ⬇️
...pentrons/protocol_api/paired_instrument_context.py 81.25% <0.00%> (-0.17%) ⬇️
api/src/opentrons/commands/types.py 99.40% <0.00%> (-0.01%) ⬇️
api/src/opentrons/commands/helpers.py 86.36% <0.00%> (ø)
api/src/opentrons/commands/commands.py 90.54% <0.00%> (ø)
api/src/opentrons/commands/paired_commands.py 83.82% <0.00%> (ø)
...pi/src/opentrons/protocol_engine/state/__init__.py 100.00% <0.00%> (ø)
... and 10 more

Copy link
Member

@sfoster1 sfoster1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty big behavioral regression and I don't think we should do it.

What if instead we let the hardware controller retry a couple times? We could even make the behavior conditional on an environment variable that is only set by the systemd service. The changes would then be entirely within the robot server code and its unit file, which are both in the same repo.

I think in the long run you're correct that the real solution is actual communication between the two programs, and ideally a rehoming of the blinkenlights service into the monorepo to facilitate it, while keeping the blinkenlights service light enough that it starts quickly. But adding a retry or poll to the robot server gpio takeover is a much better fix in the short term because while the code might be a bit ugly, it presents the same or better behavior to the user.

@sanni-t
Copy link
Member

sanni-t commented Oct 22, 2021

(Reviewed in person) I agree with @sfoster1 that this introduces behavioral regression. And definitely agreed that actual communication between the two services/ tasks would be the goal. But to get things working again on edge quickly, I would be fine with both Seth's approach of using env variable, or @SyntaxColoring 's approach of doing subprocess.run(["systemctl", "stop", "opentrons-gpio-setup"]) before button blinking.

@SyntaxColoring SyntaxColoring requested a review from a team as a code owner October 22, 2021 18:56
Copy link
Member

@sfoster1 sfoster1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much bigger fan of this, nice! Let's use an asyncio subprocess, but I think this is the actual correct thing to do - the systemctl stop should be synchronous to the other unit actually coming down, so the race condition should go away.

robot-server/robot_server/hardware.py Outdated Show resolved Hide resolved
@SyntaxColoring
Copy link
Contributor Author

SyntaxColoring commented Oct 22, 2021

What if instead we let the hardware controller retry a couple times?

It turns out we already do this, deeper inside the stack.

if name == "blue_button":
return _retry_request_line(3)
else:
return _retry_request_line()

And I guess it makes sense that this doesn't work. opentrons-robot-server's notify() was moved after its GPIO acquisition, and opentrons-gpio-setup's GPIO release was predicated on opentrons-robot-server's notify(). So opentrons-robot-server could retry forever but never make progress.


Per comments above, I've pushed a couple of commits to switch approaches. [Edit: From your comment directly above, looks like you've already seen this, lol. Our GitHub comments raced. Leaving this here for posterity.]

Now, the sequence is:

  1. OT-2 starts booting.
  2. opentrons-gpio-setup starts blinking the light.
  3. opentrons-robot-server starts. opentrons-gpio-setup remains in the background, continuing to blink the light.
  4. opentrons-robot-server reaches the point where it wants to take over light-blinking.
  5. opentrons-robot-server calls systemctl stop opentrons-gpio-setup and waits for that to complete.
    • We could have used systemd's dbus API to programmatically stop opentrons-gpio-setup instead of using subprocess.run(), but this seemed hard and not worthwhile.
  6. opentrons-robot-server acquires GPIOs and finishes initializing.
  7. opentrons-robot-server calls notify() to inform systemd it's done initializing.

This approach:

  • Preserves the main improvement of refactor(api,robot-server): sd_notify in server #8536, which is notify()ing from the robot-server service instead of from the opentrons library.
  • Solves the underlying race condition.
  • Decouples robot-server's "done initializing" point from its "about to acquire GPIOs and initialize hardware" point, which I think is important.
  • Is slightly ugly in that robot-server's Python code needs to be aware of the magical name opentrons-gpio-setup. IMO this is acceptable, especially in the short term.

@SyntaxColoring SyntaxColoring changed the title refactor(robot-server): Make robot-server fully mutually exclusive with opentrons-gpio-setup refactor(robot-server): Stop opentrons-gpio-setup before trying to control the front button light Oct 22, 2021
@@ -108,37 +108,73 @@ async def get_hardware(app_state: AppState = Depends(get_app_state)) -> Hardware
return hardware_api


async def _initialize_hardware_api(app_state: AppState) -> None:
async def _initialize_hardware_api(app_state: AppState) -> None: # noqa: C901
Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Oct 22, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

noqa: C901 overrides flake8 saying this function is too complex, which it definitely is. I think another PR should change this function to -> HardwareAPI, drop the gigantic try/except, and refactor calling code to take over for exception logging.

#8213's SlowInitializing helper would have avoided this, but we tossed that PR because it felt complex in other ways. Maybe we should revisit that part of it.

Copy link
Member

@sfoster1 sfoster1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. I think we could probably at least fix the try/accept by adding another level of function call but that's just bikeshedding at this point, you're right about the refactor.

Copy link
Member

@sanni-t sanni-t left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!

@SyntaxColoring SyntaxColoring merged commit fcf7cdb into edge Oct 25, 2021
@SyntaxColoring SyntaxColoring deleted the fix_gpio_conflict_on_boot branch October 25, 2021 16:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
robot server Affects the `robot-server` project robot-svcs Falls under the purview of the Robot Services squad (formerly CPX, Core Platform Experience).
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants