
pod deleted due to delayed cleanup #8022

Closed
rvignesh89 opened this issue Feb 28, 2022 · 3 comments · Fixed by #8061
Labels
area/controller · type/bug

Comments

@rvignesh89
Contributor

rvignesh89 commented Feb 28, 2022

Summary

What happened/what you expected to happen?

I have a cron workflow in namespace app-launch which does the following every 30 minutes,

  1. check if namespace app-launch exists,
  2. if it exists, delete all resources (pods, deployments, workflows, etc.),
  3. launch a new set of pods, deployments & workflows.

Workflows in app-launch complete within a few minutes of launching. My expectation is that the workflows created in the app-launch namespace run normally without any issue. But ever since I upgraded argo-workflows from 2.12.11 to 3.2.4, I'm noticing that workflows in the app-launch namespace fail with one of the nodes having a status of pod deleted.

Upon investigation of the controller logs, I realised that the pod in question is not actually deleted, but the controller thinks it is. Combing through the controller logs, I can see that the controller itself sets the pod as completed (actually labels it as completed), which in turn removes the pod from the pod informer cache.
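To make that concrete, here is a minimal sketch of a label-filtered pod informer. This is my assumption of roughly how the controller's pod watch behaves, given the labelPodCompleted action; the label name below is assumed. Once the completed label is patched on, the pod stops matching the selector, drops out of the informer cache, and the informer fires a delete event even though the pod still exists in the cluster.

package main

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// Watch only pods that are NOT labelled as completed (label name assumed).
	factory := informers.NewSharedInformerFactoryWithOptions(client, 30*time.Second,
		informers.WithNamespace("argo"),
		informers.WithTweakListOptions(func(opts *metav1.ListOptions) {
			opts.LabelSelector = "workflows.argoproj.io/completed!=true"
		}),
	)

	podInformer := factory.Core().V1().Pods().Informer()
	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: func(obj interface{}) {
			// Fires both when the pod is really deleted and when it merely stops
			// matching the selector, e.g. after being labelled completed.
		},
	})

	stop := make(chan struct{})
	factory.Start(stop)
	factory.WaitForCacheSync(stop)
	<-stop
}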

NOTE: logs use workflows created in argo namespace

workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:21.004Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=2009 workflow=hello
...
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:26.041Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/hello/labelPodCompleted

If you notice the timestamps, the pod is cleaned up (labelled completed) a good 5 seconds after the workflow is updated as Running, and in between those 5 seconds there is no pod reconciliation happening (which I confirmed separately). So I suspected that this pod cleanup runs later than when it was scheduled.

That's when I noticed that the podCleanupQueue is implemented as a rate-limited queue, which, when it sees the same key again, only adds the key back into the queue after an exponential back-off:

func (r *ItemExponentialFailureRateLimiter) When(item interface{}) time.Duration {
	..
	exp := r.failures[item]
	r.failures[item] = r.failures[item] + 1

	// The backoff is capped such that 'calculated' value never overflows.
	backoff := float64(r.baseDelay.Nanoseconds()) * math.Pow(2, float64(exp))
	if backoff > math.MaxInt64 {
		return r.maxDelay
	}
...
}

https://github.com/kubernetes/client-go/blob/master/util/workqueue/default_rate_limiters.go#L89

In my use case, since I'm deleting and re-creating the workflows, the pod name and namespace don't change. So what I think is happening is that the rate limiter is unnecessarily throttling the cleanup because the key is the same.
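To illustrate with client-go directly (illustrative base/max delays, not necessarily what the controller configures):

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	// Illustrative delays only; the controller's real base/max delays may differ.
	rl := workqueue.NewItemExponentialFailureRateLimiter(1*time.Second, 60*time.Second)

	key := "argo/hello/labelPodCompleted"
	for i := 0; i < 4; i++ {
		// Re-adding the same key without Forget doubles the delay: 1s, 2s, 4s, 8s, ...
		fmt.Printf("attempt %d: delay %v\n", i+1, rl.When(key))
	}

	// Forget resets the per-key failure count, so the next add waits only the base delay.
	rl.Forget(key)
	fmt.Printf("after Forget: delay %v\n", rl.When(key))
}

So if the clean-up key for a re-created pod keeps the same namespace/name, each new clean-up gets scheduled further into the future unless the key is forgotten.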

How to fix this?

I'm able to fix this behaviour by ensuring that the key is removed from the rate limiter's bookkeeping, by adding the following:

NOTE: I've not run any thorough tests other than letting the test run for a while and not observing this issue come up.

defer func() {
	wfc.podCleanupQueue.Forget(key)
	wfc.podCleanupQueue.Done(key)
}()

https://github.com/argoproj/argo-workflows/blob/release-3.2/workflow/controller/controller.go#L479
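For reference, this is the standard client-go worker pattern that the defer above follows (a sketch, not the actual controller handler): Forget clears the per-key back-off and Done releases the in-flight item.

package controller

import "k8s.io/client-go/util/workqueue"

// Sketch of a workqueue worker loop body; the real controller handler does more.
func processNextPodCleanupItem(queue workqueue.RateLimitingInterface) bool {
	key, quit := queue.Get()
	if quit {
		return false
	}
	defer func() {
		queue.Forget(key) // reset the exponential back-off for this key
		queue.Done(key)   // release the in-flight item
	}()

	// ... perform the clean-up action encoded in the key (e.g. labelPodCompleted) ...
	return true
}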

What executor are you using? PNS/Emissary

Diagnostics

In order to simulate this, you can run the following script with the workflow given below it. You might have to wait a few iterations for the problem to appear. In the test below, you'll notice that the pod is marked as completed but the workflow status is still Running. In my production use case (which I think is caused by the same issue), the workflow is marked as Failed because the controller sees the pod as deleted in the informer cache.

set -eu

while true; do
  kubectl apply -n argo -f workflow.yaml
  kubectl wait -n argo --for=jsonpath='{.status.phase}'=Succeeded --timeout=600s workflow/hello

  sleep 5
  kubectl delete wf hello -n argo
done
# workflow.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: hello
  namespace: argo
spec:
  entrypoint: whalesay        # Defines "whalesay" as the "main" template
  templates:
  - name: whalesay            # Defining the "whalesay" template
    container:
      image: ubuntu
      command: [sleep]
      args: ["1"]   # This template runs "cowsay" in the "whalesay" image with arguments "hello world"

What executor are you running? k8sapi

# Logs from the workflow controller:
kubectl logs -n argo deploy/workflow-controller | grep ${workflow} 

workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:20.998Z" level=info msg="Created pod: hello (hello)" namespace=argo workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:20.999Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:20.999Z" level=info msg=reconcileAgentPod namespace=argo workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:21.004Z" level=info msg="Update workflows 200"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:21.004Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=2009 workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:22.663Z" level=info msg="Get leases 200"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:22.666Z" level=info msg="Update leases 200"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:26.022Z" level=info msg="cleaning up pod" action=deletePod key=argo/hello-1340600742-agent/deletePod
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:26.030Z" level=info msg="Delete pods 404"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:26.041Z" level=info msg="cleaning up pod" action=labelPodCompleted key=argo/hello/labelPodCompleted
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:26.046Z" level=info msg="Patch pods 200"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:27.675Z" level=info msg="Get leases 200"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:27.679Z" level=info msg="Update leases 200"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.897Z" level=info msg="Processing workflow" namespace=argo workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.898Z" level=info msg="Workflow pod is missing" namespace=argo nodeName=hello nodePhase=Pending recentlyStarted=false workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.916Z" level=info msg="Create pods 409"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.918Z" level=info msg="Failed pod hello (hello) creation: already exists" namespace=argo workflow=hello

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@alexec
Contributor

alexec commented Feb 28, 2022

OK. So this will only happen if two workflows have the same name, and (I assume) the second workflow is created very soon after the first workflow was deleted. Pod names are deterministic, so they will get the same name each time.

Most users will not be affected, because they do not re-use workflow names.

When the pod is deleted we could have the clean-up queue forget about it.

_ = wfc.enqueueWfFromPodLabel(obj)

I think we should remove the call to enqueueWfFromPodLabel as I think that would be a no-op and replace it with forget/done in the clean-up queue.
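A minimal, hypothetical sketch of that change (the helper name and the "namespace/name/action" key format are guesses based on the logs above, not the controller's actual code):

package controller

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/client-go/util/workqueue"
)

// forgetPodCleanupKeys could be called from the pod informer's DeleteFunc so the
// clean-up queue stops tracking keys for a pod that no longer exists; otherwise
// the exponential back-off carries over to a future pod with the same name.
// The "namespace/name/action" key format is assumed from the controller logs.
func forgetPodCleanupKeys(queue workqueue.RateLimitingInterface, pod *apiv1.Pod) {
	for _, action := range []string{"deletePod", "labelPodCompleted"} {
		key := fmt.Sprintf("%s/%s/%s", pod.Namespace, pod.Name, action)
		queue.Forget(key)
		queue.Done(key)
	}
}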

Would you like to submit a PR to fix?

alexec removed the triage label Feb 28, 2022
@rvignesh89
Contributor Author

Yes, I think it only happens because the pod names are the same. I can submit a PR to fix this 👍

rvignesh89 added commits to rvignesh89/argo-workflows that referenced this issue Mar 3, 2022
@rvignesh89
Contributor Author

I think removing enqueueWfFromPodLabel would break legitimate use cases, like when the pod is deleted due to eviction from a node, in which case the controller would lose the ability to do what it needs to do. So I'm skeptical about removing that in my PR.

Also, while working on the fix, I remembered that after the stale clean-up action that labels the pod as completed is triggered, the controller tries to recreate the pod.

workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.898Z" level=info msg="Workflow pod is missing" namespace=argo nodeName=hello nodePhase=Pending recentlyStarted=false workflow=hello
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.916Z" level=info msg="Create pods 409"
workflow-controller-759fcd5fdb-f5mvs workflow-controller time="2022-02-28T14:48:30.918Z" level=info msg="Failed pod hello (hello) creation: already exists" namespace=argo workflow=hello

This is because the node phase is still Pending. I think this behaviour is itself incorrect, as the node should be in Running status since the pod has already been created. Do you think we should look at it in a separate PR?

// If the node is pending and the pod does not exist, it could be the case that we want to try to submit it
// again instead of marking it as an error. Check if that's the case.
if node.Pending() {
	continue
}

alexec pushed a commit that referenced this issue Mar 7, 2022
sarabala1979 mentioned this issue Mar 18, 2022
sarabala1979 pushed a commit that referenced this issue Mar 18, 2022
sarabala1979 mentioned this issue Apr 14, 2022