pod deleted due to delayed cleanup #8022
Comments
OK. So this will only happen if two workflows have the same name, and (I assume) the second workflow is created very soon after the first workflow was deleted. Pod names are deterministic, so they will get the same name each time. Most users will not be affected, because they do not re-use workflow names. When the pod is deleted we could have the clean-up queue forget about it.
argo-workflows/workflow/controller/controller.go, line 1030 in 9573303
I think we should remove the call to … Would you like to submit a PR to fix?
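A minimal sketch of that suggestion, assuming a plain client-go pod informer and rate-limiting work queue (the wiring and variable names here are illustrative, not the actual argo-workflows controller code): when a pod's delete event arrives, `Forget` the key so a future pod with the same namespace/name is not subject to the accumulated back-off.

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

func main() {
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// Stand-in for the controller's podCleanupQueue.
	podCleanupQueue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())

	factory := informers.NewSharedInformerFactory(clientset, 30*time.Second)
	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			key, err := cache.DeletionHandlingMetaNamespaceKeyFunc(obj)
			if err != nil {
				return
			}
			// The pod is really gone: drop its key from the rate limiter so a
			// later pod with the same namespace/name starts with no back-off.
			podCleanupQueue.Forget(key)
		},
	})

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	<-stop
}
```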
Yes, I think it only happens because the pod names are the same. I can submit a PR to fix this 👍
I think removing the … Also, while working on the fix I remembered that after the old action which marks the pod as deleted is triggered, the controller tries to recreate the pod. This is because the node phase is still Pending. I think this behaviour is itself incorrect, as the node should be in Running status since the pod was already created. Do you think we should look at it in a separate PR?
argo-workflows/workflow/controller/operator.go, lines 1067 to 1071 in e7ff3f5
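A rough illustration of the behaviour being described, using hypothetical names rather than the real operator.go logic: when the pod backing a node is missing but the node is still Pending, the controller assumes the pod was never created and schedules it again, instead of treating it as a deleted pod.

```go
package main

import "fmt"

// NodePhase mirrors the idea of a workflow node's lifecycle phase.
type NodePhase string

const (
	NodePending NodePhase = "Pending"
	NodeRunning NodePhase = "Running"
)

// reconcileMissingPod is an illustration only (not the actual operator.go code):
// a missing pod behind a Pending node leads to the pod being (re)created, while
// a missing pod behind a Running node marks the node as errored with "pod deleted".
func reconcileMissingPod(phase NodePhase, podExists bool) string {
	if podExists {
		return "nothing to do"
	}
	if phase == NodePending {
		return "create pod again" // the behaviour questioned in this comment
	}
	return "mark node as Error: pod deleted"
}

func main() {
	fmt.Println(reconcileMissingPod(NodePending, false))
	fmt.Println(reconcileMissingPod(NodeRunning, false))
}
```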
Summary
What happened/what you expected to happen?
I have a cron workflow in the `app-launch` namespace which does the following every 30 minutes: it deletes the existing workflows in `app-launch` (if they exist) and re-creates them; these workflows complete within a few minutes of launching. My expectation is that the workflows created in the `app-launch` namespace run normally without any issue. But ever since I upgraded argo-workflows from `2.12.11` to `3.2.4`, I'm noticing that workflows in the `app-launch` namespace fail with one of the nodes having a status of `pod deleted`.
Upon investigation of the controller logs, I realised that the pod in question is not actually deleted, but the controller thinks it is. When combing through the controller logs, I can see that the controller itself marks the pod as completed (actually labels it as completed), which in turn removes the pod from the pod informer cache.
NOTE: the logs use workflows created in the `argo` namespace.
If you look at the timestamps, the pod is marked deleted a good 5 seconds after the workflow is updated to Running. And in between those 5 seconds, there is no pod reconciliation happening (which I confirmed separately). So I suspected that this pod cleanup is running later than when it was scheduled.
That's when I noticed that `podCleanupQueue` is implemented as a rate-limited queue which, when it sees the same key again, only re-adds the key to the queue with an incremental back-off: https://github.com/kubernetes/client-go/blob/master/util/workqueue/default_rate_limiters.go#L89
In my use case, since I'm deleting and re-creating the workflows, the pod name and namespace don't change. Hence what I think is happening is that the rate limiter is unnecessarily throttling the cleanup because the key is the same.
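To make the throttling concrete, here is a small standalone example against client-go's workqueue package (the 1s/60s delays are placeholders, not necessarily what the controller uses): re-adding the same key returns an ever-growing delay until `Forget` is called for that key.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Exponential back-off per key: 1s, 2s, 4s, 8s, ... capped at 60s.
	rl := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 60*time.Second)

	// Same namespace/pod name every iteration, as in the cron use case above.
	key := "app-launch/my-pod"

	for i := 0; i < 5; i++ {
		fmt.Printf("attempt %d: delay %v\n", i+1, rl.When(key))
	}

	// Forget resets the per-key failure count, so the next add is prompt again.
	rl.Forget(key)
	fmt.Printf("after Forget: delay %v\n", rl.When(key))
}
```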
How to fix this?
I'm able to fix this behaviour by ensuring that the key is removed from the rate limiter's tracking, by adding the following.
NOTE: I've not run any thorough tests other than letting the test run for a while and not observing this issue come up.
https://github.com/argoproj/argo-workflows/blob/release-3.2/workflow/controller/controller.go#L479
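A sketch of the kind of change described above (the struct and worker method are stand-ins, not the actual WorkflowController code): once a cleanup item has been handled, `Forget` the key so the rate limiter's failure history for that namespace/pod name is cleared.

```go
package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

// podCleanupController is a stand-in for the workflow controller; only the
// pieces needed to illustrate the idea are included.
type podCleanupController struct {
	podCleanupQueue workqueue.RateLimitingInterface
}

// processNextPodCleanupItem mimics the worker loop around controller.go#L479:
// Done releases the item, and Forget clears the rate limiter's memory of the
// key so a later pod with the same namespace/name is not delayed.
func (c *podCleanupController) processNextPodCleanupItem() bool {
	key, quit := c.podCleanupQueue.Get()
	if quit {
		return false
	}
	defer func() {
		c.podCleanupQueue.Forget(key) // the essence of the proposed fix
		c.podCleanupQueue.Done(key)
	}()

	fmt.Printf("cleaning up pod %v\n", key) // the real code labels/deletes the pod here
	return true
}

func main() {
	c := &podCleanupController{
		podCleanupQueue: workqueue.NewRateLimitingQueue(
			workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 60*time.Second)),
	}
	c.podCleanupQueue.Add("app-launch/my-pod")
	c.processNextPodCleanupItem()
	c.podCleanupQueue.ShutDown()
}
```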
What executor are you using? PNS/Emissary
Diagnostics
In order to simulate this, you can run the following script with the workflow given below it. You might have to wait for a few iterations for the problem to appear. In the test below, you'll notice that the pod is marked as completed but the workflow status is still Running. In my production use case (which I think is caused by the same issue), the workflow is marked as failed because it notices the pod is deleted in the informer cache.
What executor are you running? k8sapi
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.