Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Send notifications for dependency failures #14603

Merged
merged 3 commits into from
Oct 30, 2023

Conversation

AlanCoding
Copy link
Member

@AlanCoding AlanCoding commented Oct 25, 2023

SUMMARY

I had a report that notifications were not firing for cases when an update-on-launch dependency failed. This is fairly easily verifiable. It's somewhat of a stupidly simple case where handle_work_error sent websockets, but not notifications.

But then you get into the weeds of how to fix it, and realize there are race conditions, consider how the callback receiver handles this:

        # Update host_status_counts while holding the row lock
        with transaction.atomic():
            uj = UnifiedJob.objects.select_for_update().get(pk=job_identifier)
            uj.host_status_counts = host_status_counts
            uj.save(update_fields=['host_status_counts'])

This solves a contention problem between the callback receiver and the dispatcher.

But wait - here, we have a job that was created, spawned a project update, and then is getting marked failed by the dispatcher running the project update. At no point did either the dispatcher or the callback receiver go through the run path for that job. There shouldn't be any contention problem!

With the errback/callback methods as they are, there is still at least the vague possibility of the job starting because the project update fails after updating its final status... but this is going too far beyond realistic concerns. The much more realistic concern is that we set "Previous task failed task:" twice so sending notifications here would introduce a new contention problem.

Systems-wise, why are we even dealing with the contention problems? What if, instead, we only triggered the failure logic in a reliable system that has a periodic fallback... this would be the task manager. I have already argued that the errback methods are not needed, so this is a good time to delete them.

Missing notification problem solved, contention problems are not a risk with this solution.

ISSUE TYPE
  • Bug, Docs Fix or other nominal change
COMPONENT NAME
  • API

@AlanCoding AlanCoding marked this pull request as ready for review October 26, 2023 14:12
Copy link
Member

@fosterseth fosterseth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good code cleanup

task_cls = task._get_task_class()
task_cls.apply_async(
[task.pk],
opts,
queue=task.get_queue_name(),
uuid=task.celery_task_id,
callbacks=[{'task': handle_work_success.name, 'kwargs': {'task_actual': task_actual}}],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯

@AlanCoding AlanCoding merged commit 333ef76 into ansible:devel Oct 30, 2023
21 checks passed
djyasin pushed a commit to djyasin/awx that referenced this pull request Sep 16, 2024
* Send notifications for dependency failures

* Delete tests for deleted method

* Remove another test for removed method
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants