Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug that prevented dispatcher exit with downed DB #14469

Merged
merged 3 commits into from
Oct 26, 2023

Conversation

AlanCoding
Copy link
Member

SUMMARY

We have an issue where we the dispatcher would go down during the middle of the job, and this left the entire service in a deadlock in the end.

Previously, we did a bunch of stuff to prevent the dispatcher service from exiting if the database went out temporarily (default tolerance set to 40). We also moved to a new job canceling system which would tell it to cancel via a SIGTERM signal.

The problem is what happens when we exceed that 40 second threshold while a job is running. In that case:

  • the parent process correctly measures that it has been >40 seconds since the database went away, and... not being able to read anything from pg_notify, decides it will exit
  • the multiprocessing library correctly forwards the exit signal onto child processes, including the control process which is still active for a running job
  • the control process correctly processes the signal and terminates the job over receptor
  • the problem comes after the job finishes, since the control process over-rides the signal handler of the normal worker loop, the kill flag got set for the control process but not for the worker main read loop and because of that the worker continues to read for new work from its parent forever, which then deadlocks, because the parent is waiting for the worker to exit.

So in summary, we have 2 layers of signal processing, and the inner layer was misbehaving in that it did not call that parent process signal handling method. This adds calls to do that.

Testing, I was able to see the dispatcher exit with this patch applied.

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API

@AlanCoding
Copy link
Member Author

I just realized that this is incorrect in that it makes assumptions about the original context. If it gets SIGTERM signal, it does what it should do on SIGINT as well, and this should not be the case. I think it would be proper to pick them apart.

@TheRealHaoLiu
Copy link
Member

tested this functionally in both docker-compose devel env and on kubernetes

seems to do the right thing

@AlanCoding AlanCoding merged commit fc0b58f into ansible:devel Oct 26, 2023
19 checks passed
AlanCoding added a commit to AlanCoding/awx that referenced this pull request Jan 22, 2024
kdelee pushed a commit to kdelee/awx that referenced this pull request May 8, 2024
djyasin pushed a commit to djyasin/awx that referenced this pull request Sep 16, 2024
* Separate handling of original sitTERM and sigINT
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants