
semaphore lost #6110

Closed
2qif49lt opened this issue Jun 9, 2021 · 13 comments · Fixed by #6418

Comments


2qif49lt commented Jun 9, 2021

Summary

Argo does not release the semaphore when a pod completes, which causes the remaining steps to run one at a time.

Diagnostics

apiVersion: v1
kind: ConfigMap
metadata:
  name: conf
data:
  maxbuilder: "60"
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-sema
spec:
  entrypoint: main
  serviceAccountName: std
  templates:
  - name: main
    dag:
      tasks:
      - name: build-all
        template: build-all
  - name: build-all
    steps:
    - - name: split
        template: split
    - - name: build
        template: build
        arguments:
          parameters:
          - name: baseid
            value: "{{item.baseid}}"
          - name: tarid
            value: "{{item.tarid}}"
        withParam: "{{steps.split.outputs.result}}"
  - name: split
    script: 
      image: python:3.9-slim
      command: ["python"]
      source: |
        import json
        import sys

        output = []
        for id in range(120):
            output.append({'baseid': id, 'tarid': id + 10000})

        json.dump(output, sys.stdout)
  - name: build
    synchronization:
      semaphore:
        configMapKeyRef:
          name: conf
          key: maxbuilder
    inputs:
      parameters:
      - name: baseid
      - name: tarid
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 10; echo acquired lock"]
std@master:/data/samba/std/aes-argo$ date && argo get @latest |grep Waiting |wc -l
Wed Jun  9 14:24:36 CST 2021
51
std@master:/data/samba/std/aes-argo$ date && argo get @latest |grep Waiting |wc -l
Wed Jun  9 14:25:02 CST 2021
49
std@master:/data/samba/std/aes-argo$ date && argo get @latest |grep Waiting |wc -l
Wed Jun  9 14:26:34 CST 2021
41
std@master:/data/samba/std/aes-argo$ date && argo get @latest |grep Waiting |wc -l
Wed Jun  9 14:27:58 CST 2021
33
std@master:/data/samba/std/aes-argo$ argo get @latest
Name:                test-semajwls7
Namespace:           argo
ServiceAccount:      std
Status:              Running
Conditions:          
 PodRunning          True
Created:             Wed Jun 09 14:22:32 +0800 (9 minutes ago)
Started:             Wed Jun 09 14:22:32 +0800 (9 minutes ago)
Duration:            9 minutes 28 seconds
Progress:            111/121
ResourcesDuration:   39m30s*(100Mi memory),39m30s*(1 cpu)

STEP                                        TEMPLATE   PODNAME                    DURATION  MESSAGE
 ● test-semajwls7                           main                                                                                                                 
 └─● build-all                              build-all                                                                                                            
   ├───✔ split                              split      test-semajwls7-834350695   3s                                                                             
   └─┬─✔ build(0:baseid:0,tarid:10000)      build      test-semajwls7-1431628376  13s                                                                            
     ├─✔ ...                                                                           
     ├─✔ build(60:baseid:60,tarid:10060)    build      test-semajwls7-1756982882  30s                                                                            
     ├─◷ build(61:baseid:61,tarid:10061)    build      test-semajwls7-1784892703  9m        Waiting for argo/ConfigMap/conf/maxbuilder lock. Lock status: 0/60   
     ├─◷ ...                                                                                Waiting for argo/ConfigMap/conf/maxbuilder lock. Lock status: 0/60
     ├─◷ build(69:baseid:69,tarid:10069)    build      test-semajwls7-4001026551  9m        Waiting for argo/ConfigMap/conf/maxbuilder lock. Lock status: 0/60   
     ├─● build(70:baseid:70,tarid:10070)    build      test-semajwls7-110265593   9m                                                                             
     ├─✔ build(71:baseid:71,tarid:10071)    build      test-semajwls7-3079006344  8m                                                                             
     ├─✔ ...                                                                            
     └─✔ build(119:baseid:119,tarid:10119)  build      test-semajwls7-2179176665  30s      
    
std@master:/data/samba/std/argo$ date && kubectl get pods -n argo |grep Completed |wc -l  && kubectl get pods -n argo |grep test-sema |grep Running |wc -l
Wed Jun  9 14:26:27 CST 2021
80
1
std@master:/data/samba/std/argo$ date && kubectl get pods -n argo |grep Completed |wc -l  && kubectl get pods -n argo |grep test-sema |grep Running |wc -l
Wed Jun  9 14:26:29 CST 2021
80
1
std@master:/data/samba/std/argo$ date && kubectl get pods -n argo |grep Completed |wc -l  && kubectl get pods -n argo |grep test-sema |grep Running |wc -l
Wed Jun  9 14:28:21 CST 2021
91
1
std@master:/data/samba/std/argo$ date && kubectl get pods -n argo |grep Completed |wc -l  && kubectl get pods -n argo |grep test-sema |grep Running |wc -l
Wed Jun  9 14:31:33 CST 2021
110
1

std@master:/data/samba/std/aes-argo$ argo list
NAME             STATUS      AGE   DURATION   PRIORITY
test-semajwls7   Succeeded   11m   10m        0
 
std@master:/data/samba/std/argo$ argo version
argo: v3.1.0-rc12
  BuildDate: 2021-06-03T01:30:42Z
  GitCommit: f529b2b5ccc779ecbc133deb02f130fafa40e810
  GitTreeState: clean
  GitTag: v3.1.0-rc12
  GoVersion: go1.15.7
  Compiler: gc
  Platform: linux/amd64                                 
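
For anyone debugging this, the lock state the controller records on the workflow object can be inspected directly; a sketch, assuming the workflow name from the output above and that this Argo version exposes holding/waiting lists under status.synchronization:

# Nodes currently holding the semaphore (field paths are an assumption; adjust to your version)
kubectl -n argo get wf test-semajwls7 -o jsonpath='{.status.synchronization.semaphore.holding}'
# Nodes queued behind it
kubectl -n argo get wf test-semajwls7 -o jsonpath='{.status.synchronization.semaphore.waiting}'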
@sarabala1979
Member

@2qif49lt Are you referring to the status message "Waiting for argo/ConfigMap/conf/maxbuilder lock. Lock status: 0/60", or did the step not run?


2qif49lt commented Jun 9, 2021

@sarabala1979
The first 60 builds in the steps run in parallel, but the others do not; only one build runs at a time. I think the completed pods don't release their semaphores.


2qif49lt commented Jun 9, 2021

[attached GIF: sema]


sarabala1979 commented Jun 9, 2021

@2qif49lt which workflow controller version are you running?


2qif49lt commented Jun 9, 2021

@sarabala1979

containers:
      - args:
        - --configmap
        - workflow-controller-configmap
        - --executor-image
        - argoproj/argoexec:v3.1.0-rc12
        - --namespaced
        command:
        - workflow-controller
        env:
        - name: LEADER_ELECTION_IDENTITY
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: argoproj/workflow-controller:v3.1.0-rc12
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  containerRuntimeExecutor: pns
std@master:/data/samba/std/aes-argo$ kubectl logs workflow-controller-5ff9d7d64b-xfb2k -n argo
time="2021-06-09T22:56:30Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2021-06-09T22:56:30Z" level=info msg="cron config" cronSyncPeriod=10s
time="2021-06-09T22:56:30.086Z" level=info msg="config map" name=workflow-controller-configmap
time="2021-06-09T22:56:30.100Z" level=info msg="Get configmaps 200"
time="2021-06-09T22:56:30.102Z" level=info msg="Configuration:\nartifactRepository: {}\ncontainerRuntimeExecutor: pns\ninitialDelay: 0s\nmetricsConfig: {}\nnodeEvents: {}\npodSpecLogStrategy: {}\ntelemetryConfig: {}\n"
time="2021-06-09T22:56:30.102Z" level=info msg="Persistence configuration disabled"
time="2021-06-09T22:56:30.103Z" level=info msg="Starting Workflow Controller" version=v3.1.0-rc12
time="2021-06-09T22:56:30.103Z" level=info msg="Workers: workflow: 32, pod: 32, pod cleanup: 4"
time="2021-06-09T22:56:30.105Z" level=info msg="Watch configmaps 200"
time="2021-06-09T22:56:30.106Z" level=info msg="List workflowtemplates 200"
time="2021-06-09T22:56:30.107Z" level=info msg="List pods 200"
time="2021-06-09T22:56:30.107Z" level=info msg="List configmaps 200"
time="2021-06-09T22:56:30.107Z" level=info msg="Watch workflowtemplates 200"
time="2021-06-09T22:56:30.110Z" level=info msg="Watch configmaps 200"
time="2021-06-09T22:56:30.114Z" level=info msg="Watch pods 200"
time="2021-06-09T22:56:30.114Z" level=info msg="List workflows 200"
time="2021-06-09T22:56:30.119Z" level=info msg="Watch workflows 200"
time="2021-06-09T22:56:30.206Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2021-06-09T22:56:30.209Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2021-06-09T22:56:30.211Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2021-06-09T22:56:30.213Z" level=info msg="List clusterworkflowtemplates 200"
time="2021-06-09T22:56:30.214Z" level=info msg="Watch clusterworkflowtemplates 200"
time="2021-06-09T22:56:30.221Z" level=info msg="List workflows 200"
time="2021-06-09T22:56:30.226Z" level=info msg="Manager initialized successfully"
I0609 22:56:30.226733       1 leaderelection.go:243] attempting to acquire leader lease  argo/workflow-controller...
time="2021-06-09T22:56:30.226Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
time="2021-06-09T22:56:30.230Z" level=info msg="Get leases 200"
time="2021-06-09T22:56:30.230Z" level=info msg="new leader" id=workflow-controller-5ff9d7d64b-xfb2k leader=workflow-controller-5ff9d7d64b-smr9g

It behaves the same after restarting the workflow-controller pod.
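
(For reference, a controller restart can be done with something like the following; the deployment name is inferred from the pod name in the logs above:)

# Restart the workflow controller deployment
kubectl -n argo rollout restart deployment workflow-controller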

@sarabala1979
Member

@2qif49lt I didn't see any "waiting for lock" message for any step in the log. I am trying to reproduce this locally.


2qif49lt commented Jun 10, 2021

@sarabala1979 Sorry, here are the full log and the YAMLs corresponding to the gif.

workflow-controller.log

apiVersion: v1
kind: ConfigMap
metadata:
  name: conf
data:
  maxbuilder: "60"
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-sema
spec:
  entrypoint: build-all
  serviceAccountName: std
  templates:
  - name: build-all
    steps:
    - - name: split
        template: split
    - - name: build
        template: build
        arguments:
          parameters:
          - name: baseid
            value: "{{item.baseid}}"
          - name: tarid
            value: "{{item.tarid}}"
        withParam: "{{steps.split.outputs.result}}"
  - name: split
    script: 
      image: python:3.9-slim
      command: ["python"]
      source: |
        import json
        import sys

        output = []
        for id in range(70):
            output.append({'baseid': id, 'tarid': id + 10000})

        json.dump(output, sys.stdout)
  - name: build
    synchronization:
      semaphore:
        configMapKeyRef:
          name: conf
          key: maxbuilder
    inputs:
      parameters:
      - name: baseid
      - name: tarid
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 10; echo acquired lock"]

sarabala1979 self-assigned this Jun 14, 2021
sarabala1979 modified the milestones: v3.1, v3.0 Jun 14, 2021
@sarabala1979
Member

I am able to reproduce this issue in my environment. I will work on it.


snoe925 commented Jul 21, 2021

I also have this same problem on v3.0.8. Once it enters the run-one-at-a-time mode, the controller logs and argo watch disagree on the number of free locks: argo watch says 0/16, but the controller logs "15 free of 16".
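
A quick way to compare the two views is something like the following (a sketch; the deployment name and the exact log wording vary by install and version):

# Controller-side view: look for lock acquire/release messages in the controller logs
kubectl -n argo logs deploy/workflow-controller | grep -i lock
# Workflow-side view: the node status messages reported by the CLI
argo watch @latest -n argo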

@sarabala1979
Member

@2qif49lt @snoe925 I have fixed the issue and created a dev build for testing: sarabala1979/workflow-controller:6110.
Can you try it and let me know whether the issue is fixed in your scenario?
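
(Swapping in the dev build might look like the following; this is a sketch, and the deployment and container names are assumed to both be workflow-controller in the argo namespace:)

# Point the controller at the dev image (container name is an assumption)
kubectl -n argo set image deployment/workflow-controller workflow-controller=sarabala1979/workflow-controller:6110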


iatzmon commented Jul 25, 2021

@sarabala1979 Hi, I've been experiencing the same issue and tried testing your fix on v3.1.2, but I'm getting an error whenever a container task is executed:
Error (exit code 2): open /argo/podmetadata/annotations: no such file or directory


iatzmon commented Jul 25, 2021

@sarabala1979 Never mind: setting my argoexec tag to latest instead of v3.1.2 fixed that error. So far the fix seems to be working; I will update if I encounter any issues.
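
(For context, the executor image the controller injects into pods is set by the --executor-image argument shown in the deployment snippet earlier in this thread; changing the argoexec tag is a matter of editing that argument, e.g.:)

# Edit the controller deployment and change --executor-image
# from argoproj/argoexec:v3.1.0-rc12 to the tag that matches the controller build
kubectl -n argo edit deployment workflow-controller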


ebr commented Nov 18, 2021

We are seeing this exact issue on v3.2.3 (all components); K8s v1.20/EKS.

  • using a template semaphore global-max-pods
  • workflows scale up to N pods, up to the template semaphore value (17000 in this case)
  • then all subsequent workflow nodes are stuck with status: Waiting for default/ConfigMap/argo-sync/global-max-pods lock. Lock status: 0/17000
  • Pods complete and succeed, but no new sibling pods (from the same step template) are launched. We see a very small trickle of pods from child steps only.
  • Status is not updated for any of the existing nodes
  • In controller logs we see messages like acquired synchronization lock for a small number of nodes, and we confirm that some pods are starting, but at a very slow rate or in short bursts
  • Also in controller logs we see messages like Lock has been released by <namespace/workflowname/podname>. Available locks: 16495 semaphore=default/ConfigMap/argo-sync/global-max-pods.
  • when a workflow completes, nothing interesting happens, i.e. locks don't magically become available, etc.

So clearly the controller sees that plenty of locks are available, but this status isn't reconciled with the workflow nodes.

I will try to create a succinct repro workflow, but hopefully the above description paints a picture.

Aside from an actual fix, is there any way I could edit existing workflows to allow them to proceed? Can I somehow edit the status.synchronization.semaphore.waiting key - it has references to other workflow nodes, presumably the ones that are blocking this one from proceeding?

I can open another issue if that helps.

EDIT: I patched each affected workflow, removing .spec.templates[*].synchronization and .status.synchronization, and all workflows came back to life.
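
(For reference, a patch along those lines might look like the following. This is only a sketch: the workflow name is hypothetical, the namespace is assumed from the lock key above, removing template-level synchronization has to be done per template index, and it assumes the Workflow CRD does not use a status subresource, so status can be patched directly.)

# Drop the recorded semaphore state from a stuck workflow (name is hypothetical)
kubectl -n default patch workflow my-stuck-workflow --type=json -p='[{"op":"remove","path":"/status/synchronization"}]'
# Remove the template-level semaphore, one template index at a time
kubectl -n default patch workflow my-stuck-workflow --type=json -p='[{"op":"remove","path":"/spec/templates/0/synchronization"}]'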
