controller sending many pod delete requests that result in 404 response #12659

Closed
3 of 4 tasks
tooptoop4 opened this issue Feb 12, 2024 · 6 comments · Fixed by #13294
Labels
area/agent Argo Agent that runs for HTTP and Plugin templates area/controller Controller issues, panics area/executor P3 Low priority solution/suggested A solution to the bug has been suggested. Someone needs to implement it. type/bug

Comments

@tooptoop4
Contributor

tooptoop4 commented Feb 12, 2024

Pre-requisites

  • I have double-checked my configuration
  • I can confirm the issue exists when I tested with :latest
  • I have searched existing issues and could not find a match for this bug
  • I'd like to contribute the fix myself (see contributing guide)

What happened/what did you expect to happen?

Although this seems to have no effect on the functioning of Argo Workflows, it could be a stability/performance/k8s log cost issue at scale. It must also be filling up the cleanup queue, delaying cleanup of pods that really do exist.

From checking the k8s API server, the workflow controller seems to be sending a delete pod request for <podname from a step>-agent and getting a not found response. I am using standard workflows like the whalesay example. Not sure what the significance of the agent suffix is (I did see the following in the code):

return woc.wf.NodeID("agent") + "-agent"

I'm seeing thousands of these. It seems that for every pod run it sends this unnecessary delete request for a pod with the -agent suffix?
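For anyone trying to match these requests in API server logs, here is a rough sketch of how the name could be reproduced offline. It assumes NodeID appends a 32-bit FNV-1a hash of the node name to the workflow name, which is my understanding of the upstream helper and not something confirmed in this thread. The functions nodeID and agentPodName are hypothetical stand-ins, not the controller's API.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// nodeID is a hypothetical stand-in for wf.NodeID. ASSUMPTION: the node ID is
// the workflow name followed by a 32-bit FNV-1a hash of the node name.
func nodeID(workflowName, nodeName string) string {
	if nodeName == workflowName {
		return workflowName
	}
	h := fnv.New32a()
	_, _ = h.Write([]byte(nodeName))
	return fmt.Sprintf("%s-%d", workflowName, h.Sum32())
}

// agentPodName mirrors the quoted expression woc.wf.NodeID("agent") + "-agent".
func agentPodName(workflowName string) string {
	return nodeID(workflowName, "agent") + "-agent"
}

func main() {
	// "hello-world-abcde" is a made-up workflow name for illustration.
	fmt.Println(agentPodName("hello-world-abcde"))
}
```

If that assumption holds, the name depends only on the workflow name, so there would be one candidate agent pod name per workflow, not one per step.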

Version

3.4.11

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

n/a

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}
n/a

Logs from your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
n/a
@agilgur5 agilgur5 added area/controller Controller issues, panics P3 Low priority labels Feb 12, 2024
@agilgur5
Member

For reference, this was previously noted in Slack

@agilgur5
Member

agilgur5 commented Feb 12, 2024

Not sure what the significance of the agent suffix is

As far as I understand, the "agent" is a piece of the Executor that runs for certain built-in, non-container template types (e.g. resource, http, data, script, etc). Anyone else please correct me if I'm wrong; the historical references to agent didn't quite have a standard definition.

I'm seeing thousands of these. It seems that for every pod run it sends this unnecessary delete request for a pod with the -agent suffix?

That sounds like it might be accidentally assuming that each Pod has an agent, when there are only certain types that do 🤔

@tooptoop4
Contributor Author

woc.controller.queuePodForCleanup(woc.wf.Namespace, woc.getAgentPodName(), deletePod)
seems to be the line assuming each pod has an agent.

@jswxstw
Member

jswxstw commented Feb 21, 2024

The agent pod will only be created if taskSet is not empty. Each workflow can have at most one agent pod.

func (woc *wfOperationCtx) reconcileAgentPod(ctx context.Context) error {
	woc.log.Infof("reconcileAgentPod")
	if len(woc.taskSet) == 0 {
		return nil
	}
	pod, err := woc.createAgentPod(ctx)
	if err != nil {
		return err
	}
	// Check Pod is just created
	if pod.Status.Phase != "" {
		woc.updateAgentPodStatus(ctx, pod)
	}
	return nil
}

Only http and plugin templates are put into taskSet right now.
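To make the rule concrete, here is a minimal, self-contained sketch of that decision. The types and the functions runsOnAgent and needsAgentPod are simplified, hypothetical stand-ins written for this comment, not the controller's real types; the only thing taken from the code above is the idea that an empty task set means no agent pod.

```go
package main

import "fmt"

// templateType is a simplified, hypothetical stand-in for Argo's template kinds.
type templateType string

const (
	typeContainer templateType = "container"
	typeScript    templateType = "script"
	typeResource  templateType = "resource"
	typeHTTP      templateType = "http"
	typePlugin    templateType = "plugin"
)

// runsOnAgent reports whether a template type is handed to the agent through
// the task set. Per the comment above, that is only http and plugin.
func runsOnAgent(t templateType) bool {
	switch t {
	case typeHTTP, typePlugin:
		return true
	default:
		return false
	}
}

// needsAgentPod mirrors the len(woc.taskSet) == 0 guard: an agent pod is only
// needed when at least one template would land in the task set.
func needsAgentPod(templates []templateType) bool {
	for _, t := range templates {
		if runsOnAgent(t) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsAgentPod([]templateType{typeContainer, typeScript})) // false
	fmt.Println(needsAgentPod([]templateType{typeContainer, typeHTTP}))   // true
}
```

A whalesay-style workflow only has container templates, so under this rule it never gets an agent pod, and a later delete for one can only return not found.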

@tooptoop4
Contributor Author

@jswxstw do you want to create a PR?

@agilgur5 agilgur5 added the solution/suggested A solution to the bug has been suggested. Someone needs to implement it. label Mar 16, 2024
@jswxstw
Member

jswxstw commented Mar 16, 2024

@jswxstw do you want to create a PR?

I see you have created a PR, and it basically looks good to me.
My only suggestion is to use the existing function woc.hasTaskSetNodes() to determine whether deleting the agent pod is necessary, rather than creating a new function woc.hasNodeWithAgentPod().
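As a rough illustration of the effect of such a guard, the sketch below runs the same cleanup with and without it against an in-memory fake. Everything here is hypothetical and simplified for illustration: fakeCluster stands in for the Kubernetes API, the hasTaskSetNodes field stands in for woc.hasTaskSetNodes(), and the pod name is shortened to the workflow name plus -agent. It is not the code from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the 404 returned by the API server.
var errNotFound = errors.New("pod not found")

// fakeCluster is a hypothetical in-memory stand-in for the Kubernetes API.
type fakeCluster struct {
	pods     map[string]bool
	notFound int // number of delete requests that hit a missing pod
}

func (c *fakeCluster) deletePod(name string) error {
	if !c.pods[name] {
		c.notFound++
		return errNotFound
	}
	delete(c.pods, name)
	return nil
}

// workflow is a simplified stand-in for the operation context.
type workflow struct {
	name            string
	hasTaskSetNodes bool // stands in for woc.hasTaskSetNodes()
}

// cleanupAgentPods requests deletion of each workflow's agent pod. With
// guarded set, workflows without task-set nodes are skipped entirely.
func cleanupAgentPods(c *fakeCluster, wfs []workflow, guarded bool) {
	for _, wf := range wfs {
		if guarded && !wf.hasTaskSetNodes {
			continue
		}
		// The error is ignored in this sketch; the fake counts misses itself.
		_ = c.deletePod(wf.name + "-agent")
	}
}

// newScenario builds two container-only workflows and one http workflow,
// with an agent pod existing only for the http one.
func newScenario() (*fakeCluster, []workflow) {
	wfs := []workflow{
		{name: "whalesay-1", hasTaskSetNodes: false},
		{name: "whalesay-2", hasTaskSetNodes: false},
		{name: "http-1", hasTaskSetNodes: true},
	}
	return &fakeCluster{pods: map[string]bool{"http-1-agent": true}}, wfs
}

func main() {
	c, wfs := newScenario()
	cleanupAgentPods(c, wfs, false)
	fmt.Println("unguarded not-found deletes:", c.notFound) // 2

	c, wfs = newScenario()
	cleanupAgentPods(c, wfs, true)
	fmt.Println("guarded not-found deletes:", c.notFound) // 0
}
```

In both runs the real agent pod is still deleted; the guard only removes the requests that could never succeed.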

@agilgur5 agilgur5 added the area/agent Argo Agent that runs for HTTP and Plugin templates label Apr 26, 2024
jswxstw added a commit to jswxstw/argo-workflows that referenced this issue May 7, 2024
jswxstw added a commit to jswxstw/argo-workflows that referenced this issue May 9, 2024
jswxstw added a commit to jswxstw/argo-workflows that referenced this issue Jun 25, 2024
jswxstw added a commit to jswxstw/argo-workflows that referenced this issue Jul 2, 2024
agilgur5 pushed a commit that referenced this issue Jul 25, 2024
- merge conflicts with tests removed in backport by agilgur5

Signed-off-by: oninowang <[email protected]>
Signed-off-by: jswxstw <[email protected]>
Co-authored-by: jswxstw <[email protected]>
Co-authored-by: agilgur5 <[email protected]>
(cherry picked from commit 825aacf)
@argoproj argoproj locked as resolved and limited conversation to collaborators Oct 7, 2024