
pod deleted due to delayed cleanup #8022

Closed
rvignesh89 opened this issue Feb 28, 2022 · 3 comments · Fixed by #8061
Labels
area/controller · type/bug

Comments

@rvignesh89
Contributor

rvignesh89 commented Feb 28, 2022

Summary

What happened/what you expected to happen?

I have a cron workflow in namespace app-launch which does the following every 30 minutes,

  1. check if namespace app-launch exists,
  2. if it exists, delete all resources (pods, deployments, workflows, etc.),
  3. launch a new set of pods, deployments & workflows.

Workflows in app-launch complete within a few minutes of launching. My expectation is that the workflows created in the app-launch namespace run normally without any issue. But ever since I upgraded argo-workflows from 2.12.11 to 3.2.4, I'm noticing that workflows in the app-launch namespace fail with one of the nodes having a status of pod deleted.

Upon investigation of the controller logs, I realised that the pod in question is not actually deleted, but the controller thinks it is. Combing through the controller logs, I can see that the controller itself sets the pod as completed (actually labels it as completed), which in turn removes the pod from the pod informer cache.
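To make that concrete, here is a minimal sketch of a label-filtered pod informer. This is my assumption of roughly how the controller's pod watch behaves, given the labelPodCompleted action; the label name below is assumed. Once the completed label is patched on, the pod stops matching the selector, drops out of the informer cache, and the informer fires a delete event even though the pod still exists in the cluster.

package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Watch only pods that are NOT labelled as completed (label name assumed).
	factory := informers.NewSharedInformerFactoryWithOptions(client, 30*time.Second,
		informers.WithNamespace("argo"),
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "workflows.argoproj.io/completed!=true"
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// Fires both when the pod is really deleted and when it merely stops
			// matching the selector, e.g. after being labelled completed.
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}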

NOTE: logs use workflows created in argo namespace

workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:21.004Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=2009 workflow=hello
...
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:26.041Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/hello/labelPodCompleted

If you notice the timestamps, the pod is cleaned up (labelled completed) a good 5 seconds after the workflow is updated as Running, and in between those 5 seconds there is no pod reconciliation happening (which I confirmed separately). So I suspected that this pod cleanup runs later than when it was scheduled.

That's when I noticed that the podCleanupQueue is implemented as a rate-limited queue, which, when it sees the same key again, only adds the key back into the queue after an exponential back-off:

func (r *ItemExponentialFailureRateLimiter) When(item interface{}) time.Duration {
	..
	exp := r.failures[item]
	r.failures[item] = r.failures[item] + 1

	// The backoff is capped such that 'calculated' value never overflows.
	backoff := float64(r.baseDelay.Nanoseconds()) * math.Pow(2, float64(exp))
	if backoff > math.MaxInt64 {
		return r.maxDelay
	}
...
}

https://github.com/kubernetes/client-go/blob/master/util/workqueue/default_rate_limiters.go#L89

In my use case, since I'm deleting and re-creating the workflows, the pod name and namespace don't change. So what I think is happening is that the rate limiter is unnecessarily throttling the cleanup because the key is the same.
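To illustrate with client-go directly (illustrative base/max delays, not necessarily what the controller configures):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Illustrative delays only; the controller's real base/max delays may differ.
	rl := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 60*time.Second)

	key := "argo/hello/labelPodCompleted"
	for i := 0; i < 4; i++ {
		// Re-adding the same key without Forget doubles the delay: 1s, 2s, 4s, 8s, ...
		fmt.Printf("attempt %d: delay %v\n", i+1, rl.When(key))
	}

	// Forget resets the per-key failure count, so the next add waits only the base delay.
	rl.Forget(key)
	fmt.Printf("after Forget: delay %v\n", rl.When(key))
}

So if the clean-up key for a re-created pod keeps the same namespace/name, each new clean-up gets scheduled further into the future unless the key is forgotten.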

How to fix this?

I'm able to fix this behaviour by ensuring that the key is removed from the rate limiter's bookkeeping, by adding the following:

NOTE: I've not run any thorough tests other than letting the test run for a while and not observing this issue come up.

defer func() {
	wfc.podCleanupQueue.Forget(key)
	wfc.podCleanupQueue.Done(key)
}()

https://github.com/argoproj/argo-workflows/blob/release-3.2/workflow/controller/controller.go#L479
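For reference, this is the standard client-go worker pattern that the defer above follows (a sketch, not the actual controller handler): Forget clears the per-key back-off and Done releases the in-flight item.

package controller

import "k8s.io/client-go/util/workqueue"

// Sketch of a workqueue worker loop body; the real controller handler does more.
func processNextPodCleanupItem(queue workqueue.RateLimitingInterface) bool {
	key, quit := queue.Get()
	if quit {
		return false
	}
	defer func() {
		queue.Forget(key) // reset the exponential back-off for this key
		queue.Done(key)   // release the in-flight item
	}()

	// ... perform the clean-up action encoded in the key (e.g. labelPodCompleted) ...
	return true
}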

What executor are you using? PNS/Emissary

Diagnostics

In order to simulate this, you can run the following script with the workflow given below it. You might have to wait a few iterations for the problem to appear. In the test below, you'll notice that the pod is marked as completed but the workflow status is still Running. In my production use case (which I think is caused by the same issue), the workflow is marked as Failed because the controller sees the pod as deleted in the informer cache.

set -eu

while true; do
  kubectl apply -n argo -f workflow.yaml
  kubectl wait -n argo --for=jsonpath='{.status.phase}'=Succeeded --timeout=600s workflow/hello

  sleep 5
  kubectl delete wf hello -n argo
done
# workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: hello
  namespace: argo
spec:
  entrypoint: whalesay        # Defines "whalesay" as the "main" template
  templates:
  - name: whalesay            # Defining the "whalesay" template
    container:
      image: ubuntu
      command: [sleep]
      args: ["1"]   # This template runs "cowsay" in the "whalesay" image with arguments "hello world"

What executor are you running? k8sapi

# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow} 

workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:20.998Z" level=info msg="Created pod: hello (hello)" namespace=argo workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:20.999Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:20.999Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:21.004Z" level=info msg="Update workflows 200"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:21.004Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=2009 workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:22.663Z" level=info msg="Get leases 200"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:22.666Z" level=info msg="Update leases 200"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:26.022Z" level=info msg="cleaning up pod" action=deletePod key=argo/hello-1340600742-agent/deletePod
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:26.030Z" level=info msg="Delete pods 404"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:26.041Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/hello/labelPodCompleted
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:26.046Z" level=info msg="Patch pods 200"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:27.675Z" level=info msg="Get leases 200"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:27.679Z" level=info msg="Update leases 200"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.897Z" level=info msg="Processing workflow" namespace=argo workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.898Z" level=info msg="Workflow pod is missing" namespace=argo nodeName=hello nodePhase=Pending recentlyStarted=false workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.916Z" level=info msg="Create pods 409"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.918Z" level=info msg="Failed pod hello (hello) creation: already exists" namespace=argo workflow=hello

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec
Contributor

alexec commented Feb 28, 2022

OK. So this will only happen if two workflows have the same name, and (I assume) the second workflow is created very soon after the first workflow was deleted. Pod names are deterministic, so they will get the same name each time.

Most users will not be affected, because they do not re-use workflow names.

When the pod is deleted we could have the clean-up queue forget about it.

_ = wfc.enqueueWfFromPodLabel(obj)

I think we should remove the call to enqueueWfFromPodLabel as I think that would be a no-op and replace it with forget/done in the clean-up queue.
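A minimal, hypothetical sketch of that change (the helper name and the "namespace/name/action" key format are guesses based on the logs above, not the controller's actual code):

package controller

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

// forgetPodCleanupKeys could be called from the pod informer's DeleteFunc so the
// clean-up queue stops tracking keys for a pod that no longer exists; otherwise
// the exponential back-off carries over to a future pod with the same name.
// The "namespace/name/action" key format is assumed from the controller logs.
func forgetPodCleanupKeys(queue workqueue.RateLimitingInterface, pod *apiv1.Pod) {
	for _, action := range []string{"deletePod", "labelPodCompleted"} {
		key := fmt.Sprintf("%s/%s/%s", pod.Namespace, pod.Name, action)
		queue.Forget(key)
		queue.Done(key)
	}
}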

Would you like to submit a PR to fix?

alexec removed the triage label Feb 28, 2022
@rvignesh89
Contributor Author

Yes, I think it only happens because the pod names are the same. I can submit a PR to fix this 👍

rvignesh89 added commits to rvignesh89/argo-workflows that referenced this issue Mar 3, 2022
@rvignesh89
Contributor Author

I think removing enqueueWfFromPodLabel would break legitimate use cases, like when the pod is deleted due to eviction from a node, in which case the controller would lose the ability to do what it needs to do. So I'm skeptical about removing that in my PR.

Also, while working on the fix, I remembered that after the stale clean-up action that labels the pod as completed is triggered, the controller tries to recreate the pod.

workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.898Z" level=info msg="Workflow pod is missing" namespace=argo nodeName=hello nodePhase=Pending recentlyStarted=false workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.916Z" level=info msg="Create pods 409"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.918Z" level=info msg="Failed pod hello (hello) creation: already exists" namespace=argo workflow=hello

This is because the node phase is still Pending. I think this behaviour is itself incorrect, as the node should be in Running status since the pod has already been created. Do you think we should look at it in a separate PR?

// If the node is pending and the pod does not exist, it could be the case that we want to try to submit it
// again instead of marking it as an error. Check if that's the case.
if node.Pending() {
	continue
}

alexec pushed a commit that referenced this issue Mar 7, 2022
sarabala1979 mentioned this issue Mar 18, 2022
sarabala1979 pushed a commit that referenced this issue Mar 18, 2022
sarabala1979 mentioned this issue Apr 14, 2022