Add HEALTHCHECK to recover from dump1090 failures #29

Closed
mik3y opened this issue Jun 1, 2020 · 10 comments

mik3y commented Jun 1, 2020

Issue

I seem to have a slightly unreliable SDR (or perhaps a flaky Pi/power supply). Once every week or two, the USB device disconnects and promptly reconnects. This can be seen in dmesg:

[2413977.098905] usb 1-1.1: USB disconnect, device number 19
[2413977.393225] usb 1-1.1: new high-speed USB device number 20 using xhci_hcd
[2413977.534938] usb 1-1.1: New USB device found, idVendor=0bda, idProduct=2832, bcdDevice= 1.00
[2413977.534945] usb 1-1.1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[2413977.534951] usb 1-1.1: Product: RTL2832U
[2413977.534956] usb 1-1.1: Manufacturer: Realtek
[2413977.534961] usb 1-1.1: SerialNumber: 00001000

This seems to leave dump1090 in a bad state, in a loop like this:

$ docker logs -t --tail 5 piaware
2020-06-01T02:59:53.483919157Z [dump1090-fa] rtlsdr: error querying device #0: No such file or directory
2020-06-01T02:59:54.436927358Z [dump1090-fa] Mon Jun  1 02:59:54 2020 UTC  dump1090-fa  starting up.
2020-06-01T02:59:54.456436933Z [dump1090-fa] rtlsdr: error querying device #0: No such file or directory
2020-06-01T02:59:55.460161178Z [dump1090-fa] Mon Jun  1 02:59:55 2020 UTC  dump1090-fa  starting up.
2020-06-01T02:59:55.479666569Z [dump1090-fa] rtlsdr: error querying device #0: No such file or directory

Bouncing the container, i.e. docker-compose restart, does the trick every time.

Possible solutions

To me there's no question this is a bug in dump1090's device management, i.e. it doesn't reinitialize/re-probe after the disconnect or otherwise grab a handle to the 'new' device. But I don't have any familiarity with that code.

As a blunt instrument, we could instead add a Docker HEALTHCHECK directive so that Docker can detect and restart the unhealthy container.

One idea would be to run and parse the piaware-status command. It looks like it detects both conditions:

Unhealthy state:

$ docker exec -i -t piaware /bin/sh -c 'piaware-status | egrep ^dump1090'
dump1090 is NOT producing data on localhost:30005.

Healthy:

$ docker exec -i -t piaware /bin/sh -c 'piaware-status | egrep ^dump1090'
dump1090 (pid 269) is listening for ES connections on port 30005.
dump1090 is producing data on localhost:30005.
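
For example, a rough sketch of what such a directive might look like (the interval/retry values are placeholders and the grep pattern is only based on the output above, so treat this as an illustration rather than a tested check):

# Sketch only: check that piaware-status reports dump1090 producing data
HEALTHCHECK --interval=60s --timeout=10s --retries=3 \
  CMD piaware-status | grep -q '^dump1090 is producing data' || exit 1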

I started down the road of putting together a PR, but realized I don't know enough to do this correctly without breaking other setups (e.g. whether any of the other checks in piaware-status are relevant). Let me know what you think!

mikenye self-assigned this Jun 2, 2020

mikenye commented Jun 2, 2020

Hi @mik3y, thanks for the idea and for the detail. I'm looking to implement this over the next day or so. When your container is unhealthy, can you let me know the exit code of piaware-status?

You can issue the following:

piaware-status
echo $?

I'm interested in the number printed by the echo $? command.

Something else to consider: when your SDR disconnects/reconnects, does it come back with the same USB device path? From your dmesg output it looks like the radio gets a different device number... How are you passing the radio through to the container?

Thanks.


mikenye commented Jun 2, 2020

OK, I've added the first iteration of the healthcheck script in 34bda93.

I'm just waiting on the build to finish; hopefully by the time you read this you'll be able to try mikenye/piaware:development, which contains the healthcheck.
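
If you're running via docker-compose, switching over should just be a matter of pointing your service's image at the development tag and recreating the container; roughly (assuming the service is named piaware):

# after changing the compose file's image line to mikenye/piaware:development
docker-compose pull piaware
docker-compose up -d piaware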

I'm also running this, but I use an external Beast provider (no USB radio mapped, as per https://github.com/mikenye/docker-readsb/wiki/Guide-to-ADS-B-Data-Receiving,-Decoding-and-Sharing,-Leveraging-RTLSDR-and-Docker), so I can't test your exact use case.

If you could run this in anger for a couple of days and let me know if this is stable and fixes your problem, it would be appreciated.

Assuming all is good, I'll merge the changes into the master branch.

Thanks!


mik3y commented Jun 3, 2020

Awesome! I will give this a try.

I'm interested in the number printed by the echo $? command.

Exit status is 0 when piaware is wedged:

/ # piaware-status
PiAware master process (piaware) is running with pid 244.
PiAware ADS-B client (faup1090) is not running.
PiAware ADS-B UAT client (faup978) is not running (disabled by configuration settings)
PiAware mlat client (fa-mlat-client) is not running.
Local ADS-B receiver (dump1090) is not running.

no program appears to be listening for ES connections on port 30005.
faup1090 is NOT connected to the ADS-B receiver.
piaware is connected to FlightAware.

got 'couldn't open socket: cannot assign requested address'
dump1090 is NOT producing data on localhost:30005.

Your feeder ID is [redacted] (configured at /etc/piaware.conf:1)
/ # echo $?
0

Something else to consider: when your SDR disconnects/reconnects, does it come back with the same USB device path? From your dmesg output it looks like the radio gets a different device number... How are you passing the radio through to the container?

Good question! I'm stepping around that by giving the container the whole USB device tree:

    devices:
      #- /dev/sdr:/dev/bus/usb/001/004
      - /dev/bus/usb:/dev/bus/usb


mik3y commented Jun 3, 2020

Worked like a charm - almost!

After updating to your tag, I restarted the services and then manually pulled the USB device. That triggered the wedge. Soon after, the health check reported its first failure:

$ docker inspect --format "{{json .State.Health }}" piaware
{"Status":"healthy","FailingStreak":1,"Log":[{"Start":"2020-06-02T20:21:40.534570358-04:00","End":"2020-06-02T20:21:41.317836883-04:00","ExitCode":0,"Output":"Death count of 'beastproxy' service since last check: 0 - healthy\nDeath count of 'beastrelay' service since last check: 0 - healthy\nDeath count of 'dump1090' service since last check: 0 - healthy\nDeath count of 'piaware' service since last check: 0 - healthy\nDeath count of 'skyaware' service since last check: 0 - healthy\npiaware-status - healthy\nContainer is HEALTHY :-)\n"},{"Start":"2020-06-02T20:22:11.3533449-04:00","End":"2020-06-02T20:22:11.929719897-04:00","ExitCode":1,"Output":"Death count of 'beastproxy' service since last check: 0 - healthy\nDeath count of 'beastrelay' service since last check: 0 - healthy\nDeath count of 'dump1090' service since last check: 11 - unhealthy\nDeath count of 'piaware' service since last check: 0 - healthy\nDeath count of 'skyaware' service since last check: 0 - healthy\npiaware-status reports: dump1090 is NOT producing data - unhealthy\nContainer is UNHEALTHY :-(\n"}]}

Why it's only almost: unfortunately I had a slight misunderstanding about this feature; Docker does not automatically restart an unhealthy container. It seems a ~simple workaround folks have devised is to kill init (or otherwise crash the service) in the failure path of the health check.

I guess that would work, but it does mean the first failure will immediately restart the container, as opposed to waiting for a retries streak.
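
For illustration, the sort of thing I mean is a wrapper used as the HEALTHCHECK command (the script path is hypothetical, and whether PID 1 actually exits on a signal depends on the init system in the image):

#!/usr/bin/env sh
# Hypothetical wrapper: run the real health check; if it fails, signal PID 1
# so the container exits and a restart policy (e.g. restart: always) can
# bring it back up.
if ! /scripts/healthcheck.sh; then
  kill -TERM 1
  exit 1
fi
exit 0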


mikenye commented Jun 3, 2020

Instead of having the container kill itself, may I suggest willfarrell/autoheal (as per the article you linked; I also use this personally).
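
From memory, running it looks roughly like this (double-check the flags against the autoheal README); with AUTOHEAL_CONTAINER_LABEL=all it watches every container that defines a health check and restarts the ones that go unhealthy:

# sketch of running autoheal alongside your other containers
docker run -d \
  --name autoheal \
  --restart=always \
  -e AUTOHEAL_CONTAINER_LABEL=all \
  -v /var/run/docker.sock:/var/run/docker.sock \
  willfarrell/autoheal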

I'm not sure that killing the init process is a good idea, and it may break existing deployments if a user doesn't have restart=always etc.

Let me know your thoughts.


mik3y commented Jun 3, 2020

Oh yeah, that's a way better approach and keeps things nicely decoupled. I'll give that a shot!

In the meantime I'll let this run for the next few weeks and report back if the health check doesn't do its thing.


mikenye commented Jun 3, 2020

Thanks very much!


mikenye commented Jun 6, 2020

Hi @mik3y, how's it been running for you?


mik3y commented Jun 6, 2020

No issues thus far!


mikenye commented Jun 7, 2020

OK cool. I'm going to merge these changes into master so they'll get built into the latest image overnight. Thanks for your help with this one.

mikenye closed this as completed Jun 7, 2020
mikenye added a commit that referenced this issue Jun 7, 2020