Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Webhook Drops Some Workflows Under Load (Silently) #6981

Closed
mitchellgordon95 opened this issue Oct 19, 2021 · 6 comments · Fixed by #6995
Closed

Webhook Drops Some Workflows Under Load (Silently) #6981

mitchellgordon95 opened this issue Oct 19, 2021 · 6 comments · Fixed by #6995

Comments

@mitchellgordon95
Copy link

Summary

I have a WorkflowEventBinding that starts a workflow when I curl an endpoint like so:

curl https://argo-workflows.tenant-aidungeon.ord1.ingress.coreweave.cloud/api/v1/events/tenant-aidungeon/pixeldraw -H "Authorization: $ARGO_TOKEN" -d '{"caption": "Tokyo Skyline #pixelart", "awsName": "neotokyo.png", "style": "quickMode"}'

Normally this works correctly. However, sometimes I want to create several workflows at the same time. If I run the above curl command 10 times in quick succession, each returns a 200 OK. However, only 7 or 8 workflows are actually created. I can see in the argo server logs that there is some "transient error":

time="2021-10-19T22:46:53.898Z" level=error msg="failed to dispatch from event" error="timed out waiting for the condition: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again." event=pixeldraw-consumer namespace=tenant-aidungeon"

I would expect for this transient error to cause the webhook to return a 500 status code, not a 200. Alternative expected behavior would be for the argo server to retry the POST until success.

Using Argo v.3.2.1

Note: I have also tried increasing the replicas of the argo-server to 2, as well as increasing the values fo --event-worker-count and --event-operation-queue-size to 8 and 32, respectively. However, even if this fixed the dropped workflows, I would still be concerned about the lack of error reporting from the endpoint.

Diagnostics

apiVersion: argoproj.io/v1alpha1
kind: WorkflowEventBinding
metadata:
  name: pixeldraw-consumer
spec:
  event:
    # metadata header name must be lowercase to match in selector
    selector: payload.caption != "" && payload.awsName != "" && discriminator == "pixeldraw"
  submit:
    workflowTemplateRef:
      name: pixeldraw-workflow-template
    arguments:
      parameters:
      - name: awsName
        valueFrom:
          event: payload.awsName
      - name: caption
        valueFrom:
          event: payload.caption
      - name: style
        valueFrom:
          event: payload.style

What Kubernetes provider are you using?
Coreweave

What executor are you running?
k8sapi

The following logs are from a time period in which the above curl command was executed 10 times, but only 8 workflows were started.

(base) Mitchells-MacBook-Pro:~ mitchg$ kubectl logs --tail=0 --follow deploy/argo-workflows-argo-server
time="2021-10-19T23:04:44.999Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ReceiveEvent grpc.service=event.EventService grpc.start_time="2021-10-19T23:04:44Z" grpc.time_ms=4.275 span.kind=server system=grpc
time="2021-10-19T23:04:45.758Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ReceiveEvent grpc.service=event.EventService grpc.start_time="2021-10-19T23:04:45Z" grpc.time_ms=4.098 span.kind=server system=grpc
time="2021-10-19T23:04:46.468Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ReceiveEvent grpc.service=event.EventService grpc.start_time="2021-10-19T23:04:46Z" grpc.time_ms=3.088 span.kind=server system=grpc
time="2021-10-19T23:04:47.265Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ReceiveEvent grpc.service=event.EventService grpc.start_time="2021-10-19T23:04:47Z" grpc.time_ms=4.884 span.kind=server system=grpc
time="2021-10-19T23:04:47.978Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ReceiveEvent grpc.service=event.EventService grpc.start_time="2021-10-19T23:04:47Z" grpc.time_ms=3.77 span.kind=server system=grpc
time="2021-10-19T23:04:47.986Z" level=info msg="Transient error: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again."
time="2021-10-19T23:04:48.003Z" level=info msg="Transient error: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again."
time="2021-10-19T23:04:48.022Z" level=info msg="Transient error: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again."
time="2021-10-19T23:04:48.040Z" level=info msg="Transient error: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again."
time="2021-10-19T23:04:48.058Z" level=info msg="Transient error: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again."
time="2021-10-19T23:04:48.058Z" level=error msg="failed to dispatch from event" error="timed out waiting for the condition: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again." event=pixeldraw-consumer namespace=tenant-aidungeon
E1019 23:04:48.059624       1 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"pixeldraw-consumer.16af91101783655a", GenerateName:"", Namespace:"tenant-aidungeon", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"WorkflowEventBinding", Namespace:"tenant-aidungeon", Name:"pixeldraw-consumer", UID:"e8cd434d-8d4b-4d08-bc09-30672954c3d7", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"7365105034", FieldPath:""}, Reason:"WorkflowEventBindingError", Message:"failed to dispatch event: timed out waiting for the condition: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again.", Source:v1.EventSource{Component:"workflow-controller", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc053f1a40380c55a, ext:2118166262612, loc:(*time.Location)(0x4a11380)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc053f1a40380c55a, ext:2118166262612, loc:(*time.Location)(0x4a11380)}}, Count:1, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events is forbidden: User "system:serviceaccount:tenant-aidungeon:argo-workflows-argo-server" cannot create resource "events" in API group "" in the namespace "tenant-aidungeon"' (will not retry!)
time="2021-10-19T23:04:48.768Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ReceiveEvent grpc.service=event.EventService grpc.start_time="2021-10-19T23:04:48Z" grpc.time_ms=3.633 span.kind=server system=grpc
time="2021-10-19T23:04:49.644Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ReceiveEvent grpc.service=event.EventService grpc.start_time="2021-10-19T23:04:49Z" grpc.time_ms=4.683 span.kind=server system=grpc
time="2021-10-19T23:04:50.332Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ReceiveEvent grpc.service=event.EventService grpc.start_time="2021-10-19T23:04:50Z" grpc.time_ms=4.328 span.kind=server system=grpc
time="2021-10-19T23:04:51.163Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ReceiveEvent grpc.service=event.EventService grpc.start_time="2021-10-19T23:04:51Z" grpc.time_ms=4.716 span.kind=server system=grpc
time="2021-10-19T23:04:51.868Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ReceiveEvent grpc.service=event.EventService grpc.start_time="2021-10-19T23:04:51Z" grpc.time_ms=4.414 span.kind=server system=grpc
time="2021-10-19T23:04:51.875Z" level=info msg="Transient error: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again."
time="2021-10-19T23:04:51.894Z" level=info msg="Transient error: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again."
time="2021-10-19T23:04:51.913Z" level=info msg="Transient error: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again."
time="2021-10-19T23:04:51.931Z" level=info msg="Transient error: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again."
time="2021-10-19T23:04:51.949Z" level=info msg="Transient error: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again."
time="2021-10-19T23:04:51.950Z" level=error msg="failed to dispatch from event" error="timed out waiting for the condition: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again." event=pixeldraw-consumer namespace=tenant-aidungeon
E1019 23:04:51.950810       1 event.go:264] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"pixeldraw-consumer.16af91101783655a", GenerateName:"", Namespace:"tenant-aidungeon", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:"", ManagedFields:[]v1.ManagedFieldsEntry(nil)}, InvolvedObject:v1.ObjectReference{Kind:"WorkflowEventBinding", Namespace:"tenant-aidungeon", Name:"pixeldraw-consumer", UID:"e8cd434d-8d4b-4d08-bc09-30672954c3d7", APIVersion:"argoproj.io/v1alpha1", ResourceVersion:"7365105034", FieldPath:""}, Reason:"WorkflowEventBindingError", Message:"failed to dispatch event: timed out waiting for the condition: failed to create workflow: The POST operation against Workflow.argoproj.io could not be completed at this time, please try again.", Source:v1.EventSource{Component:"workflow-controller", Host:""}, FirstTimestamp:v1.Time{Time:time.Time{wall:0xc053f1a40380c55a, ext:2118166262612, loc:(*time.Location)(0x4a11380)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xc053f1a4f8a02876, ext:2122057512048, loc:(*time.Location)(0x4a11380)}}, Count:2, Type:"Warning", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "pixeldraw-consumer.16af91101783655a" is forbidden: User "system:serviceaccount:tenant-aidungeon:argo-workflows-argo-server" cannot patch resource "events" in API group "" in the namespace "tenant-aidungeon"' (will not retry!)
time="2021-10-19T23:05:09.076Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-10-19T23:05:09Z" grpc.time_ms=27.506 span.kind=server system=grpc
time="2021-10-19T23:05:09.076Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-10-19T23:05:09Z" grpc.time_ms=17.854 span.kind=server system=grpc
time="2021-10-19T23:05:09.077Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-10-19T23:05:09Z" grpc.time_ms=19.128 span.kind=server system=grpc
time="2021-10-19T23:05:09.078Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-10-19T23:05:09Z" grpc.time_ms=19.421 span.kind=server system=grpc
time="2021-10-19T23:05:09.078Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-10-19T23:05:09Z" grpc.time_ms=20.052 span.kind=server system=grpc
time="2021-10-19T23:05:09.079Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-10-19T23:05:09Z" grpc.time_ms=20.318 span.kind=server system=grpc
time="2021-10-19T23:05:09.079Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-10-19T23:05:09Z" grpc.time_ms=21.172 span.kind=server system=grpc
time="2021-10-19T23:05:09.079Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=DeleteWorkflow grpc.service=workflow.WorkflowService grpc.start_time="2021-10-19T23:05:09Z" grpc.time_ms=21.351 span.kind=server system=grpc
time="2021-10-19T23:05:09.233Z" level=info msg="finished streaming call with code OK" grpc.code=OK grpc.method=WatchWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-10-19T23:04:33Z" grpc.time_ms=35495.832 span.kind=server system=grpc
time="2021-10-19T23:05:09.639Z" level=info msg="finished unary call with code OK" grpc.code=OK grpc.method=ListWorkflows grpc.service=workflow.WorkflowService grpc.start_time="2021-10-19T23:05:09Z" grpc.time_ms=420.2 span.kind=server system=grpc
(base) Mitchells-MacBook-Pro:~ mitchg$ kubectl logs --tail=0 --follow deploy/argo-workflows-workflow-controller
time="2021-10-19T23:04:40.866Z" level=info msg="Get leases 200"
time="2021-10-19T23:04:40.869Z" level=info msg="Update leases 200"
time="2021-10-19T23:04:44.992Z" level=info msg="Processing workflow" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684684
time="2021-10-19T23:04:44.997Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:45.001Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:45.004Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:45.008Z" level=info msg="Get configmaps 404"
time="2021-10-19T23:04:45.008Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2021-10-19T23:04:45.008Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2021-10-19T23:04:45.008Z" level=info msg="Updated phase  -> Running" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684684
time="2021-10-19T23:04:45.009Z" level=info msg="Pod node pixeldraw-workflow-template-1634684684 initialized Pending" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684684
time="2021-10-19T23:04:45.011Z" level=info msg="Create events 201"
time="2021-10-19T23:04:45.447Z" level=info msg="Create pods 201"
time="2021-10-19T23:04:45.449Z" level=info msg="Created pod: pixeldraw-workflow-template-1634684684 (pixeldraw-workflow-template-1634684684)" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684684
time="2021-10-19T23:04:45.449Z" level=info msg="TaskSet Reconciliation" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684684
time="2021-10-19T23:04:45.449Z" level=info msg=reconcileAgentPod namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684684
time="2021-10-19T23:04:45.461Z" level=info msg="Update workflows 200"
time="2021-10-19T23:04:45.462Z" level=info msg="Workflow update successful" namespace=tenant-aidungeon phase=Running resourceVersion=8166019566 workflow=pixeldraw-workflow-template-1634684684
time="2021-10-19T23:04:45.751Z" level=info msg="Processing workflow" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684685
time="2021-10-19T23:04:45.755Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:45.767Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:45.777Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:45.788Z" level=info msg="Get configmaps 404"
time="2021-10-19T23:04:45.788Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2021-10-19T23:04:45.788Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2021-10-19T23:04:45.788Z" level=info msg="Updated phase  -> Running" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684685
time="2021-10-19T23:04:45.788Z" level=info msg="Pod node pixeldraw-workflow-template-1634684685 initialized Pending" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684685
time="2021-10-19T23:04:45.800Z" level=info msg="Create events 201"
time="2021-10-19T23:04:45.872Z" level=info msg="Get leases 200"
time="2021-10-19T23:04:45.875Z" level=info msg="Update leases 200"
time="2021-10-19T23:04:46.226Z" level=info msg="Create pods 201"
time="2021-10-19T23:04:46.227Z" level=info msg="Created pod: pixeldraw-workflow-template-1634684685 (pixeldraw-workflow-template-1634684685)" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684685
time="2021-10-19T23:04:46.227Z" level=info msg="TaskSet Reconciliation" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684685
time="2021-10-19T23:04:46.227Z" level=info msg=reconcileAgentPod namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684685
time="2021-10-19T23:04:46.239Z" level=info msg="Update workflows 200"
time="2021-10-19T23:04:46.239Z" level=info msg="Workflow update successful" namespace=tenant-aidungeon phase=Running resourceVersion=8166019729 workflow=pixeldraw-workflow-template-1634684685
time="2021-10-19T23:04:46.459Z" level=info msg="Processing workflow" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684686
time="2021-10-19T23:04:46.463Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:46.466Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:46.470Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:46.473Z" level=info msg="Get configmaps 404"
time="2021-10-19T23:04:46.473Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2021-10-19T23:04:46.473Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2021-10-19T23:04:46.473Z" level=info msg="Updated phase  -> Running" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684686
time="2021-10-19T23:04:46.473Z" level=info msg="Pod node pixeldraw-workflow-template-1634684686 initialized Pending" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684686
time="2021-10-19T23:04:46.476Z" level=info msg="Create events 201"
time="2021-10-19T23:04:46.908Z" level=info msg="Create pods 201"
time="2021-10-19T23:04:46.909Z" level=info msg="Created pod: pixeldraw-workflow-template-1634684686 (pixeldraw-workflow-template-1634684686)" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684686
time="2021-10-19T23:04:46.909Z" level=info msg="TaskSet Reconciliation" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684686
time="2021-10-19T23:04:46.909Z" level=info msg=reconcileAgentPod namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684686
time="2021-10-19T23:04:46.921Z" level=info msg="Update workflows 200"
time="2021-10-19T23:04:46.922Z" level=info msg="Workflow update successful" namespace=tenant-aidungeon phase=Running resourceVersion=8166019852 workflow=pixeldraw-workflow-template-1634684686
time="2021-10-19T23:04:47.257Z" level=info msg="Processing workflow" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684687
time="2021-10-19T23:04:47.262Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:47.266Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:47.270Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:47.272Z" level=info msg="Get configmaps 404"
time="2021-10-19T23:04:47.272Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2021-10-19T23:04:47.272Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2021-10-19T23:04:47.273Z" level=info msg="Updated phase  -> Running" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684687
time="2021-10-19T23:04:47.273Z" level=info msg="Pod node pixeldraw-workflow-template-1634684687 initialized Pending" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684687
time="2021-10-19T23:04:47.276Z" level=info msg="Create events 201"
time="2021-10-19T23:04:48.001Z" level=info msg="Create pods 201"
time="2021-10-19T23:04:48.001Z" level=info msg="Created pod: pixeldraw-workflow-template-1634684687 (pixeldraw-workflow-template-1634684687)" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684687
time="2021-10-19T23:04:48.001Z" level=info msg="TaskSet Reconciliation" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684687
time="2021-10-19T23:04:48.001Z" level=info msg=reconcileAgentPod namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684687
time="2021-10-19T23:04:48.014Z" level=info msg="Update workflows 200"
time="2021-10-19T23:04:48.015Z" level=info msg="Workflow update successful" namespace=tenant-aidungeon phase=Running resourceVersion=8166020258 workflow=pixeldraw-workflow-template-1634684687
time="2021-10-19T23:04:48.760Z" level=info msg="Processing workflow" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684688
time="2021-10-19T23:04:48.765Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:48.769Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:48.772Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:48.775Z" level=info msg="Get configmaps 404"
time="2021-10-19T23:04:48.775Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2021-10-19T23:04:48.775Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2021-10-19T23:04:48.775Z" level=info msg="Updated phase  -> Running" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684688
time="2021-10-19T23:04:48.776Z" level=info msg="Pod node pixeldraw-workflow-template-1634684688 initialized Pending" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684688
time="2021-10-19T23:04:48.779Z" level=info msg="Create events 201"
time="2021-10-19T23:04:49.290Z" level=info msg="Create pods 201"
time="2021-10-19T23:04:49.300Z" level=info msg="Created pod: pixeldraw-workflow-template-1634684688 (pixeldraw-workflow-template-1634684688)" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684688
time="2021-10-19T23:04:49.300Z" level=info msg="TaskSet Reconciliation" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684688
time="2021-10-19T23:04:49.300Z" level=info msg=reconcileAgentPod namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684688
time="2021-10-19T23:04:49.329Z" level=info msg="Update workflows 200"
time="2021-10-19T23:04:49.330Z" level=info msg="Workflow update successful" namespace=tenant-aidungeon phase=Running resourceVersion=8166020680 workflow=pixeldraw-workflow-template-1634684688
time="2021-10-19T23:04:49.462Z" level=info msg="List workflows 200"
time="2021-10-19T23:04:49.463Z" level=info msg=healthz age=5m0s err="<nil>" instanceID= labelSelector="!workflows.argoproj.io/phase,!workflows.argoproj.io/controller-instanceid" managedNamespace=tenant-aidungeon
time="2021-10-19T23:04:49.638Z" level=info msg="Processing workflow" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684689
time="2021-10-19T23:04:49.644Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:49.648Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:49.652Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:49.656Z" level=info msg="Get configmaps 404"
time="2021-10-19T23:04:49.656Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2021-10-19T23:04:49.656Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2021-10-19T23:04:49.656Z" level=info msg="Updated phase  -> Running" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684689
time="2021-10-19T23:04:49.657Z" level=info msg="Pod node pixeldraw-workflow-template-1634684689 initialized Pending" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684689
time="2021-10-19T23:04:49.661Z" level=info msg="Create events 201"
time="2021-10-19T23:04:50.145Z" level=info msg="Create pods 201"
time="2021-10-19T23:04:50.146Z" level=info msg="Created pod: pixeldraw-workflow-template-1634684689 (pixeldraw-workflow-template-1634684689)" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684689
time="2021-10-19T23:04:50.146Z" level=info msg="TaskSet Reconciliation" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684689
time="2021-10-19T23:04:50.146Z" level=info msg=reconcileAgentPod namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684689
time="2021-10-19T23:04:50.158Z" level=info msg="Update workflows 200"
time="2021-10-19T23:04:50.158Z" level=info msg="Workflow update successful" namespace=tenant-aidungeon phase=Running resourceVersion=8166021227 workflow=pixeldraw-workflow-template-1634684689
time="2021-10-19T23:04:50.182Z" level=info msg="insignificant pod change" key=tenant-aidungeon/pixeldraw-workflow-template-1634684689
time="2021-10-19T23:04:50.327Z" level=info msg="Processing workflow" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684690
time="2021-10-19T23:04:50.332Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:50.336Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:50.340Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:50.343Z" level=info msg="Get configmaps 404"
time="2021-10-19T23:04:50.343Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2021-10-19T23:04:50.343Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2021-10-19T23:04:50.343Z" level=info msg="Updated phase  -> Running" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684690
time="2021-10-19T23:04:50.343Z" level=info msg="Pod node pixeldraw-workflow-template-1634684690 initialized Pending" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684690
time="2021-10-19T23:04:50.346Z" level=info msg="Create events 201"
time="2021-10-19T23:04:50.879Z" level=info msg="Get leases 200"
time="2021-10-19T23:04:50.882Z" level=info msg="Update leases 200"
time="2021-10-19T23:04:50.977Z" level=info msg="Create pods 201"
time="2021-10-19T23:04:50.977Z" level=info msg="Created pod: pixeldraw-workflow-template-1634684690 (pixeldraw-workflow-template-1634684690)" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684690
time="2021-10-19T23:04:50.977Z" level=info msg="TaskSet Reconciliation" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684690
time="2021-10-19T23:04:50.977Z" level=info msg=reconcileAgentPod namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684690
time="2021-10-19T23:04:50.989Z" level=info msg="Update workflows 200"
time="2021-10-19T23:04:50.989Z" level=info msg="Workflow update successful" namespace=tenant-aidungeon phase=Running resourceVersion=8166021565 workflow=pixeldraw-workflow-template-1634684690
time="2021-10-19T23:04:51.016Z" level=info msg="insignificant pod change" key=tenant-aidungeon/pixeldraw-workflow-template-1634684690
time="2021-10-19T23:04:51.156Z" level=info msg="Processing workflow" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684691
time="2021-10-19T23:04:51.161Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:51.164Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:51.168Z" level=info msg="Get workflowtemplates 200"
time="2021-10-19T23:04:51.171Z" level=info msg="Get configmaps 404"
time="2021-10-19T23:04:51.171Z" level=warning msg="Non-transient error: configmaps \"artifact-repositories\" not found"
time="2021-10-19T23:04:51.171Z" level=info msg="resolved artifact repository" artifactRepositoryRef=default-artifact-repository
time="2021-10-19T23:04:51.171Z" level=info msg="Updated phase  -> Running" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684691
time="2021-10-19T23:04:51.171Z" level=info msg="Pod node pixeldraw-workflow-template-1634684691 initialized Pending" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684691
time="2021-10-19T23:04:51.174Z" level=info msg="Create events 201"
time="2021-10-19T23:04:51.607Z" level=info msg="Create pods 201"
time="2021-10-19T23:04:51.608Z" level=info msg="Created pod: pixeldraw-workflow-template-1634684691 (pixeldraw-workflow-template-1634684691)" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684691
time="2021-10-19T23:04:51.608Z" level=info msg="TaskSet Reconciliation" namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684691
time="2021-10-19T23:04:51.608Z" level=info msg=reconcileAgentPod namespace=tenant-aidungeon workflow=pixeldraw-workflow-template-1634684691
time="2021-10-19T23:04:51.619Z" level=info msg="Update workflows 200"
time="2021-10-19T23:04:51.620Z" level=info msg="Workflow update successful" namespace=tenant-aidungeon phase=Running resourceVersion=8166021767 workflow=pixeldraw-workflow-template-1634684691
time="2021-10-19T23:04:51.640Z" level=info msg="insignificant pod change" key=tenant-aidungeon/pixeldraw-workflow-template-1634684691
time="2021-10-19T23:04:55.885Z" level=info msg="Get leases 200"
time="2021-10-19T23:04:55.888Z" level=info msg="Update leases 200"
time="2021-10-19T23:04:57.183Z" level=info msg="List workflowtasksets 403"
E1019 23:04:57.183641       1 reflector.go:138] pkg/mod/k8s.io/[email protected]/tools/cache/reflector.go:167: Failed to watch *v1alpha1.WorkflowTaskSet: failed to list *v1alpha1.WorkflowTaskSet: workflowtasksets.argoproj.io is forbidden: User "system:serviceaccount:tenant-aidungeon:argo-workflows-argo" cannot list resource "workflowtasksets" in API group "argoproj.io" in the namespace "tenant-aidungeon"
time="2021-10-19T23:05:02.450Z" level=info msg="Get leases 200"
time="2021-10-19T23:05:02.644Z" level=info msg="Update leases 200"
time="2021-10-19T23:05:07.648Z" level=info msg="Get leases 200"
time="2021-10-19T23:05:07.652Z" level=info msg="Update leases 200"
time="2021-10-19T23:05:12.656Z" level=info msg="Get leases 200"
time="2021-10-19T23:05:12.659Z" level=info msg="Update leases 200"

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@dinever
Copy link
Member

dinever commented Oct 20, 2021

This is because the event is currently handled asynchronously, which means the API returns the response immediately when the event is enqueued.

See:

operation, err := dispatch.NewOperation(ctx, s.instanceIDService, s.eventRecorderManager.Get(req.Namespace), list.Items, req.Namespace, req.Discriminator, req.Payload)
if err != nil {
return nil, err
}
select {
case s.operationQueue <- *operation:
return &eventpkg.EventResponse{}, nil
default:
return nil, apierrors.NewServiceUnavailable("operation queue full")
}

I think we should probably just change this endpoint to a synchronous call?

Or we could add a parameter in the EventRequest to support both behaviors if we'd like to make sure it's backward compatible.

@alexec
Copy link
Contributor

alexec commented Oct 20, 2021

+1 to @dinever 's statement. I think I make a mistake to implement it async. I think we should do them sync.

I wrote this, so I'll fix it.

@dinever
Copy link
Member

dinever commented Oct 20, 2021

+1 to @dinever 's statement. I think I make a mistake to implement it async. I think we should do them sync.

I wrote this, so I'll fix it.

Thanks @alexec. I'm happy to work on this as well. Feel free to assign this to me.

@alexec
Copy link
Contributor

alexec commented Oct 20, 2021

I'd like to fix it. Should only take 1h.

@alexec alexec self-assigned this Oct 20, 2021
@alexec alexec added this to To do in Run The Business (incl. bugs) via automation Oct 20, 2021
@alexec
Copy link
Contributor

alexec commented Oct 20, 2021

There are problems here:

  • Errors are reported, but async. This normal for webhooks, but unhelpful for users.
  • Failures don't get re-queued, instead they get dropped.
  • The created workflow name is meant to prevent duplicates, but uses timestamp, and actually doesn't prevent it. Argo event generate workflow name using epoch time cause duplicate name. #6732
  • The current solution makes the Argo Server stateful, which make it harder to manage.

A fix is not as easy as I thought. One web hook event can create many workflows, but the creation of each workflow can independently fail - so we cannot just retry them all.

Retry could be improved, by using idempotent names.

I will noodle on this.

@alexec alexec removed the triage label Oct 20, 2021
alexec added a commit to alexec/argo-workflows that referenced this issue Oct 20, 2021
alexec added a commit to alexec/argo-workflows that referenced this issue Oct 20, 2021
alexec added a commit to alexec/argo-workflows that referenced this issue Oct 20, 2021
@alexec alexec moved this from To do to In progress in Run The Business (incl. bugs) Oct 21, 2021
alexec added a commit that referenced this issue Oct 22, 2021
@mitchellgordon95
Copy link
Author

Thank you @alexec ! Your work is much appreciated.

kriti-sc pushed a commit to kriti-sc/argo-workflows that referenced this issue Oct 24, 2021
alexec added a commit that referenced this issue Oct 24, 2021
@sarabala1979 sarabala1979 mentioned this issue Oct 26, 2021
25 tasks
@alexec alexec moved this from In progress to Done in Run The Business (incl. bugs) Oct 26, 2021
@sarabala1979 sarabala1979 mentioned this issue Nov 4, 2021
25 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 participants