-
Notifications
You must be signed in to change notification settings - Fork 3.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Webhook Drops Some Workflows Under Load (Silently) #6981
Comments
This is because the event is currently handled asynchronously, which means the API returns the response immediately when the event is enqueued. See: argo-workflows/server/event/event_server.go Lines 76 to 86 in 6384e5f
I think we should probably just change this endpoint to a synchronous call? Or we could add a parameter in the |
+1 to @dinever 's statement. I think I make a mistake to implement it async. I think we should do them sync. I wrote this, so I'll fix it. |
I'd like to fix it. Should only take 1h. |
There are problems here:
A fix is not as easy as I thought. One web hook event can create many workflows, but the creation of each workflow can independently fail - so we cannot just retry them all. Retry could be improved, by using idempotent names. I will noodle on this. |
…roj#6981 and argoproj#6732 Signed-off-by: Alex Collins <[email protected]>
…proj#6981 and argoproj#6732 Signed-off-by: Alex Collins <[email protected]>
…proj#6981 and argoproj#6732 Signed-off-by: Alex Collins <[email protected]>
…and #6732 (#6995) Signed-off-by: Alex Collins <[email protected]>
Thank you @alexec ! Your work is much appreciated. |
…roj#6981 and argoproj#6732 (argoproj#6995) Signed-off-by: Alex Collins <[email protected]> Signed-off-by: kriti-sc <[email protected]>
…and #6732 (#6995) Signed-off-by: Alex Collins <[email protected]>
Summary
I have a
WorkflowEventBinding
that starts a workflow when I curl an endpoint like so:Normally this works correctly. However, sometimes I want to create several workflows at the same time. If I run the above curl command 10 times in quick succession, each returns a
200 OK
. However, only 7 or 8 workflows are actually created. I can see in the argo server logs that there is some "transient error":I would expect for this transient error to cause the webhook to return a
500
status code, not a200
. Alternative expected behavior would be for the argo server to retry the POST until success.Using Argo v.3.2.1
Note: I have also tried increasing the replicas of the argo-server to 2, as well as increasing the values fo
--event-worker-count
and--event-operation-queue-size
to 8 and 32, respectively. However, even if this fixed the dropped workflows, I would still be concerned about the lack of error reporting from the endpoint.Diagnostics
What Kubernetes provider are you using?
Coreweave
What executor are you running?
k8sapi
The following logs are from a time period in which the above curl command was executed 10 times, but only 8 workflows were started.
Message from the maintainers:
Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.
The text was updated successfully, but these errors were encountered: