Add HEALTHCHECK to recover from dump1090 failures #29

Closed
mik3y opened this issue Jun 1, 2020 · 10 comments

mik3y commented Jun 1, 2020

Issue

I seem to have a slightly unreliable SDR (or perhaps a flaky Pi/power supply). Once every week or two, the USB device disconnects and promptly reconnects. This can be seen in dmesg:

[2413977.098905] usb 1-1.1: USB disconnect, device number 19
[2413977.393225] usb 1-1.1: new high-speed USB device number 20 using xhci_hcd
[2413977.534938] usb 1-1.1: New USB device found, idVendor=0bda, idProduct=2832, bcdDevice= 1.00
[2413977.534945] usb 1-1.1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[2413977.534951] usb 1-1.1: Product: RTL2832U
[2413977.534956] usb 1-1.1: Manufacturer: Realtek
[2413977.534961] usb 1-1.1: SerialNumber: 00001000

This seems to leave dump1090 in a bad state, in a loop like this:

$ docker logs -t --tail 5 piaware
2020-06-01T02:59:53.483919157Z [dump1090-fa] rtlsdr: error querying device #0: No such file or directory
2020-06-01T02:59:54.436927358Z [dump1090-fa] Mon Jun  1 02:59:54 2020 UTC  dump1090-fa  starting up.
2020-06-01T02:59:54.456436933Z [dump1090-fa] rtlsdr: error querying device #0: No such file or directory
2020-06-01T02:59:55.460161178Z [dump1090-fa] Mon Jun  1 02:59:55 2020 UTC  dump1090-fa  starting up.
2020-06-01T02:59:55.479666569Z [dump1090-fa] rtlsdr: error querying device #0: No such file or directory

Bouncing the container, i.e. docker-compose restart, does the trick every time.

Possible solutions

To me there's no question this is a bug in dump1090's device management, i.e. it doesn't reinitialize/re-probe after the disconnect or otherwise grab a handle to the 'new' device. But I don't have any familiarity with that code.

As a blunt instrument, we could instead add a Docker HEALTHCHECK directive so that Docker can detect and restart the unhealthy container.

One idea would be to run and parse the piaware-status command. It looks like it detects both conditions:

Unhealthy state:

$ docker exec -i -t piaware /bin/sh -c 'piaware-status | egrep ^dump1090'
dump1090 is NOT producing data on localhost:30005.

Healthy:

$ docker exec -i -t piaware /bin/sh -c 'piaware-status | egrep ^dump1090'
dump1090 (pid 269) is listening for ES connections on port 30005.
dump1090 is producing data on localhost:30005.
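
For example, a rough sketch of what such a directive might look like (the interval/retry values are placeholders and the grep pattern is only based on the output above, so treat this as an illustration rather than a tested check):

# Sketch only: check that piaware-status reports dump1090 producing data
HEALTHCHECK --interval=60s --timeout=10s --retries=3 \
  CMD piaware-status | grep -q '^dump1090 is producing data' || exit 1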

I started down the road of putting together a PR, but realized I don't know enough to do this correctly without breaking other setups (e.g. whether any of the other checks in piaware-status are relevant). Let me know what you think!

mikenye self-assigned this Jun 2, 2020

mikenye commented Jun 2, 2020

Hi @mik3y, thanks for the idea and for the detail. I'm looking to implement this over the next day or so. When your container is unhealthy, can you let me know the exit code of piaware-status?

You can issue the following:

piaware-status
echo $?

I'm interested in the number printed by the echo $? command.

Something else to consider: when your SDR disconnects/reconnects, does it come back with the same USB device path? From your dmesg output it looks like the radio gets a different device number... How are you passing the radio through to the container?

Thanks.


mikenye commented Jun 2, 2020

OK, I've added the first iteration of the healthcheck script in 34bda93.

I'm just waiting on the build to finish; hopefully by the time you read this you'll be able to try mikenye/piaware:development, which contains the healthcheck.
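
If you're running via docker-compose, switching over should just be a matter of pointing your service's image at the development tag and recreating the container; roughly (assuming the service is named piaware):

# after changing the compose file's image line to mikenye/piaware:development
docker-compose pull piaware
docker-compose up -d piaware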

I'm also running this, but I use an external Beast provider (no USB radio mapped, as per https://github.com/mikenye/docker-readsb/wiki/Guide-to-ADS-B-Data-Receiving,-Decoding-and-Sharing,-Leveraging-RTLSDR-and-Docker), so I can't test your exact use case.

If you could run this in anger for a couple of days and let me know if this is stable and fixes your problem, it would be appreciated.

Assuming all is good, I'll merge the changes into the master branch.

Thanks!


mik3y commented Jun 3, 2020

Awesome! I will give this a try.

I'm interested in the number printed by the echo $? command.

Exit status is 0 when piaware is wedged:

/ # piaware-status
PiAware master process (piaware) is running with pid 244.
PiAware ADS-B client (faup1090) is not running.
PiAware ADS-B UAT client (faup978) is not running (disabled by configuration settings)
PiAware mlat client (fa-mlat-client) is not running.
Local ADS-B receiver (dump1090) is not running.

no program appears to be listening for ES connections on port 30005.
faup1090 is NOT connected to the ADS-B receiver.
piaware is connected to FlightAware.

got 'couldn't open socket: cannot assign requested address'
dump1090 is NOT producing data on localhost:30005.

Your feeder ID is [redacted] (configured at /etc/piaware.conf:1)
/ # echo $?
0

Something else to consider: when your SDR disconnects/reconnects, does it come back with the same USB device path? From your dmesg output it looks like the radio gets a different device number... How are you passing the radio through to the container?

Good question! I'm stepping around that by giving the container the whole USB device tree:

    devices:
      #- /dev/sdr:/dev/bus/usb/001/004
      - /dev/bus/usb:/dev/bus/usb


mik3y commented Jun 3, 2020

Worked like a charm - almost!

After updating to your tag, I restarted the services and then manually pulled the USB device. That triggered the wedge. Soon after, the health check reported its first failure:

$ docker inspect --format "{{json .State.Health }}" piaware
{"Status":"healthy","FailingStreak":1,"Log":[{"Start":"2020-06-02T20:21:40.534570358-04:00","End":"2020-06-02T20:21:41.317836883-04:00","ExitCode":0,"Output":"Death count of 'beastproxy' service since last check: 0 - healthy\nDeath count of 'beastrelay' service since last check: 0 - healthy\nDeath count of 'dump1090' service since last check: 0 - healthy\nDeath count of 'piaware' service since last check: 0 - healthy\nDeath count of 'skyaware' service since last check: 0 - healthy\npiaware-status - healthy\nContainer is HEALTHY :-)\n"},{"Start":"2020-06-02T20:22:11.3533449-04:00","End":"2020-06-02T20:22:11.929719897-04:00","ExitCode":1,"Output":"Death count of 'beastproxy' service since last check: 0 - healthy\nDeath count of 'beastrelay' service since last check: 0 - healthy\nDeath count of 'dump1090' service since last check: 11 - unhealthy\nDeath count of 'piaware' service since last check: 0 - healthy\nDeath count of 'skyaware' service since last check: 0 - healthy\npiaware-status reports: dump1090 is NOT producing data - unhealthy\nContainer is UNHEALTHY :-(\n"}]}

Why it's only almost: unfortunately I had a slight misunderstanding about this feature; Docker does not automatically restart an unhealthy container. It seems a ~simple workaround folks have devised is to kill init (or otherwise crash the service) in the failure path of the health check.

I guess that would work, but it does mean the first failure will immediately restart the container, as opposed to waiting for a retries streak.
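
For illustration, the sort of thing I mean is a wrapper used as the HEALTHCHECK command (the script path is hypothetical, and whether PID 1 actually exits on a signal depends on the init system in the image):

#!/usr/bin/env sh
# Hypothetical wrapper: run the real health check; if it fails, signal PID 1
# so the container exits and a restart policy (e.g. restart: always) can
# bring it back up.
if ! /scripts/healthcheck.sh; then
  kill -TERM 1
  exit 1
fi
exit 0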


mikenye commented Jun 3, 2020

Instead of having the container kill itself, may I suggest willfarrell/autoheal (as per the article you linked; I also use this personally).
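
From memory, running it looks roughly like this (double-check the flags against the autoheal README); with AUTOHEAL_CONTAINER_LABEL=all it watches every container that defines a health check and restarts the ones that go unhealthy:

# sketch of running autoheal alongside your other containers
docker run -d \
  --name autoheal \
  --restart=always \
  -e AUTOHEAL_CONTAINER_LABEL=all \
  -v /var/run/docker.sock:/var/run/docker.sock \
  willfarrell/autoheal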

I'm not sure that killing the init process is a good idea, and it may break existing deployments if a user doesn't have restart=always etc.

Let me know your thoughts.


mik3y commented Jun 3, 2020

Oh yeah, that's a way better approach and keeps things nicely decoupled. I'll give that a shot!

In the meantime I'll let this run for the next few weeks and report back if the health check doesn't do its thing.


mikenye commented Jun 3, 2020

Thanks very much!


mikenye commented Jun 6, 2020

Hi @mik3y, how's it been running for you?


mik3y commented Jun 6, 2020

No issues thus far!


mikenye commented Jun 7, 2020

OK cool. I'm going to merge these changes into master so they'll get built into the latest image overnight. Thanks for your help with this one.

mikenye closed this as completed Jun 7, 2020
mikenye added a commit that referenced this issue Jun 7, 2020