Fix bug that prevented dispatcher exit with downed DB #14469

AlanCoding · 2023-09-20T16:38:33Z

SUMMARY

We have an issue where the dispatcher would go down in the middle of a job, which left the entire service in a deadlock.

Previously, we did a bunch of work to prevent the dispatcher service from exiting if the database went away temporarily (default tolerance of 40 seconds). We also moved to a new job canceling system, which tells a running job to cancel via a SIGTERM signal.

The problem is what happens when we exceed that 40 second threshold while a job is running. In that case:

- The parent process correctly measures that it has been more than 40 seconds since the database went away and, not being able to read anything from pg_notify, decides to exit.
- The multiprocessing library correctly forwards the exit signal to child processes, including the control process that is still active for a running job.
- The control process correctly processes the signal and terminates the job over receptor.
- The problem comes after the job finishes. Because the control process overrides the signal handler of the normal worker loop, the kill flag was set for the control process but not for the worker's main read loop. The worker therefore keeps reading for new work from its parent forever, which deadlocks because the parent is waiting for the worker to exit.
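To make the last step concrete, here is a minimal sketch of a read loop that only stops when its own kill flag is set. The names `worker_read_loop`, `should_exit`, and `max_polls` are hypothetical and are not taken from the AWX code; the poll cap exists only so that the example terminates.

```python
import queue


def worker_read_loop(work_queue, should_exit, max_polls=5):
    """Hypothetical worker loop: poll for work until should_exit() is true.

    max_polls exists only so this sketch terminates. A real loop has no cap,
    which is why a kill flag that never gets set means the loop never ends.
    """
    polls = 0
    while not should_exit():
        if polls >= max_polls:
            return "still running"  # a real worker would keep waiting here
        polls += 1
        try:
            # Block briefly for new work, then re-check the exit condition.
            work_queue.get(timeout=0.01)
        except queue.Empty:
            continue
    return "exited"
```

If the flag checked by `should_exit` is never set, the loop keeps polling; in this sketch that shows up as the `"still running"` return value.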

In summary, we have two layers of signal handling, and the inner layer was misbehaving in that it did not call the original signal handling method that it overrode. This patch adds calls to do that.
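The two layers can be sketched roughly as follows. This is an illustration under assumptions, not the code in this patch: `WorkerLoop`, `JobControl`, and `run_job_with_signal` are hypothetical names, and the handler is called directly instead of being installed with `signal.signal`, so the example does not touch process-wide signal state.

```python
import signal


class WorkerLoop:
    """Hypothetical stand-in for the worker's main read loop (outer layer)."""

    def __init__(self):
        self.kill_now = False

    def exit_gracefully(self, signum, frame):
        # Outer handler: tells the read loop to stop asking for new work.
        self.kill_now = True


class JobControl:
    """Hypothetical stand-in for the per-job control code (inner layer)."""

    def __init__(self, parent_handler=None):
        self.cancel_requested = False
        self.parent_handler = parent_handler

    def handle(self, signum, frame):
        # Inner handler: request cancellation of the running job.
        self.cancel_requested = True
        # Forward to the outer handler, if any, so its kill flag is set too.
        if callable(self.parent_handler):
            self.parent_handler(signum, frame)


def run_job_with_signal(worker, chain):
    """Simulate a SIGTERM arriving while the job's handler is in place."""
    parent = worker.exit_gracefully if chain else None
    control = JobControl(parent_handler=parent)
    control.handle(signal.SIGTERM, None)  # direct call stands in for delivery
    return control
```

With `chain=False` the job is cancelled but `worker.kill_now` stays `False`, which mirrors the deadlock described above. With `chain=True` both flags are set.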

In testing, I was able to see the dispatcher exit with this patch applied.

ISSUE TYPE

Bug, Docs Fix or other nominal change

COMPONENT NAME

API

AlanCoding · 2023-09-21T18:39:26Z

I just realized that this is incorrect in that it makes assumptions about the original context. If it gets a SIGTERM signal, it also does what it should only do on SIGINT, and this should not be the case. I think it would be proper to pick them apart.
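One way to pick them apart is to remember a separate original handler for each signal and forward only to the matching one. The sketch below assumes a hypothetical `SignalState` class with hypothetical attribute and method names; it is not presented as the change that was committed.

```python
import signal


class SignalState:
    """Hypothetical inner-layer state with one saved handler per signal."""

    def __init__(self, original_sigterm=None, original_sigint=None):
        # In real code these could come from signal.getsignal(); here they are
        # passed in so the sketch never touches process-wide handlers.
        self.original_sigterm = original_sigterm
        self.original_sigint = original_sigint
        self.sigterm_flag = False
        self.sigint_flag = False

    def set_sigterm_flag(self, signum=signal.SIGTERM, frame=None):
        self.sigterm_flag = True
        # Forward only to the handler originally registered for SIGTERM.
        # callable() skips SIG_DFL and SIG_IGN, which are not callable.
        if callable(self.original_sigterm):
            self.original_sigterm(signum, frame)

    def set_sigint_flag(self, signum=signal.SIGINT, frame=None):
        self.sigint_flag = True
        # SIGINT gets its own original handler, never the SIGTERM one.
        if callable(self.original_sigint):
            self.original_sigint(signum, frame)
```

Keeping the two saved handlers separate means a SIGTERM never triggers behavior that was only meant for SIGINT, and the reverse.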

TheRealHaoLiu · 2023-10-24T17:56:28Z

tested this functionally in both docker-compose devel env and on kubernetes

seems to do the right thing

Fix bug that prevented dispatcher exit with downed DB

a681c7f

AlanCoding requested a review from TheRealHaoLiu September 20, 2023 16:38

github-actions bot added the component:api label Sep 20, 2023

Fix test crash and add asserts for parent call

163b165

AlanCoding requested a review from fosterseth September 21, 2023 17:06

Separate handling of original sitTERM and sigINT

2837361

TheRealHaoLiu self-assigned this Sep 25, 2023

AlanCoding requested a review from relrod September 26, 2023 13:49

AlanCoding requested a review from chrismeyersfsu October 4, 2023 18:16

thedoubl3j mentioned this pull request Oct 5, 2023

AWX Community Meeting Agenda - Oct 2023 #14546

Closed

TheRealHaoLiu approved these changes Oct 24, 2023

chrismeyersfsu approved these changes Oct 24, 2023

AlanCoding merged commit fc0b58f into ansible:devel Oct 26, 2023
19 checks passed

AlanCoding added a commit to AlanCoding/awx that referenced this pull request Jan 22, 2024

Fix bug that prevented dispatcher exit with downed DB (ansible#14469) (…

8ccd136

…ansible#6516) * Separate handling of original sitTERM and sigINT

kdelee pushed a commit to kdelee/awx that referenced this pull request May 8, 2024

Fix bug that prevented dispatcher exit with downed DB (ansible#14469) (…

e577121

…ansible#6514) * Separate handling of original sitTERM and sigINT

djyasin pushed a commit to djyasin/awx that referenced this pull request Sep 16, 2024

Fix bug that prevented dispatcher exit with downed DB (ansible#14469)

854d88c

* Separate handling of original sitTERM and sigINT
