
semaphore lost #6110

Closed
2qif49lt opened this issue Jun 9, 2021 · 13 comments · Fixed by #6418

Comments


2qif49lt commented Jun 9, 2021

Summary

Argo does not release the semaphore when a pod completes, which causes the remaining steps to run one at a time.

Diagnostics

apiVersion: v1
kind: ConfigMap
metadata:
  name: conf
data:
  maxbuilder: "60"
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-sema
spec:
  entrypoint: main
  serviceAccountName: std
  templates:
  - name: main
    dag:
      tasks:
      - name: build-all
        template: build-all
  - name: build-all
    steps:
    - - name: split
        template: split
    - - name: build
        template: build
        arguments:
          parameters:
          - name: baseid
            value: "{{item.baseid}}"
          - name: tarid
            value: "{{item.tarid}}"
        withParam: "{{steps.split.outputs.result}}"
  - name: split
    script: 
      image: python:3.9-slim
      command: ["python"]
      source: |
        import json
        import sys

        output = []
        for id in range(120):
            output.append({'baseid': id, 'tarid': id + 10000})

        json.dump(output, sys.stdout)
  - name: build
    synchronization:
      semaphore:
        configMapKeyRef:
          name: conf
          key: maxbuilder
    inputs:
      parameters:
      - name: baseid
      - name: tarid
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 10; echo acquired lock"]
std@master:/data/samba/std/aes-argo$ date && argo get @latest |grep Waiting |wc -l
Wed Jun  9 14:24:36 CST 2021
51
std@master:/data/samba/std/aes-argo$ date && argo get @latest |grep Waiting |wc -l
Wed Jun  9 14:25:02 CST 2021
49
std@master:/data/samba/std/aes-argo$ date && argo get @latest |grep Waiting |wc -l
Wed Jun  9 14:26:34 CST 2021
41
std@master:/data/samba/std/aes-argo$ date && argo get @latest |grep Waiting |wc -l
Wed Jun  9 14:27:58 CST 2021
33
std@master:/data/samba/std/aes-argo$ argo get @latest
Name:                test-semajwls7
Namespace:           argo
ServiceAccount:      std
Status:              Running
Conditions:          
 PodRunning          True
Created:             Wed Jun 09 14:22:32 +0800 (9 minutes ago)
Started:             Wed Jun 09 14:22:32 +0800 (9 minutes ago)
Duration:            9 minutes 28 seconds
Progress:            111/121
ResourcesDuration:   39m30s*(100Mi memory),39m30s*(1 cpu)

STEP                                        TEMPLATE   PODNAME                    DURATION  MESSAGE
 ● test-semajwls7                           main                                                                                                                 
 └─● build-all                              build-all                                                                                                            
   ├───✔ split                              split      test-semajwls7-834350695   3s                                                                             
   └─┬─✔ build(0:baseid:0,tarid:10000)      build      test-semajwls7-1431628376  13s                                                                            
     ├─✔ ...                                                                           
     ├─✔ build(60:baseid:60,tarid:10060)    build      test-semajwls7-1756982882  30s                                                                            
     ├─◷ build(61:baseid:61,tarid:10061)    build      test-semajwls7-1784892703  9m        Waiting for argo/ConfigMap/conf/maxbuilder lock. Lock status: 0/60   
     ├─◷ ...                                                                                Waiting for argo/ConfigMap/conf/maxbuilder lock. Lock status: 0/60
     ├─◷ build(69:baseid:69,tarid:10069)    build      test-semajwls7-4001026551  9m        Waiting for argo/ConfigMap/conf/maxbuilder lock. Lock status: 0/60   
     ├─● build(70:baseid:70,tarid:10070)    build      test-semajwls7-110265593   9m                                                                             
     ├─✔ build(71:baseid:71,tarid:10071)    build      test-semajwls7-3079006344  8m                                                                             
     ├─✔ ...                                                                            
     └─✔ build(119:baseid:119,tarid:10119)  build      test-semajwls7-2179176665  30s      
    
std@master:/data/samba/std/argo$ date && kubectl get pods -n argo |grep Completed |wc -l  && kubectl get pods -n argo |grep test-sema |grep Running |wc -l
Wed Jun  9 14:26:27 CST 2021
80
1
std@master:/data/samba/std/argo$ date && kubectl get pods -n argo |grep Completed |wc -l  && kubectl get pods -n argo |grep test-sema |grep Running |wc -l
Wed Jun  9 14:26:29 CST 2021
80
1
std@master:/data/samba/std/argo$ date && kubectl get pods -n argo |grep Completed |wc -l  && kubectl get pods -n argo |grep test-sema |grep Running |wc -l
Wed Jun  9 14:28:21 CST 2021
91
1
std@master:/data/samba/std/argo$ date && kubectl get pods -n argo |grep Completed |wc -l  && kubectl get pods -n argo |grep test-sema |grep Running |wc -l
Wed Jun  9 14:31:33 CST 2021
110
1

std@master:/data/samba/std/aes-argo$ argo list
NAME             STATUS      AGE   DURATION   PRIORITY
test-semajwls7   Succeeded   11m   10m        0
 
std@master:/data/samba/std/argo$ argo version
argo: v3.1.0-rc12
  BuildDate: 2021-06-03T01:30:42Z
  GitCommit: f529b2b5ccc779ecbc133deb02f130fafa40e810
  GitTreeState: clean
  GitTag: v3.1.0-rc12
  GoVersion: go1.15.7
  Compiler: gc
  Platform: linux/amd64                                 
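
For anyone debugging this, the lock state the controller records on the workflow object can be inspected directly; a sketch, assuming the workflow name from the output above and that this Argo version exposes holding/waiting lists under status.synchronization:

# Nodes currently holding the semaphore (field paths are an assumption; adjust to your version)
kubectl -n argo get wf test-semajwls7 -o jsonpath='{.status.synchronization.semaphore.holding}'
# Nodes queued behind it
kubectl -n argo get wf test-semajwls7 -o jsonpath='{.status.synchronization.semaphore.waiting}'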
@sarabala1979
Member

@2qif49lt Are you referring to the status message "Waiting for argo/ConfigMap/conf/maxbuilder lock. Lock status: 0/60", or did the step not run?


2qif49lt commented Jun 9, 2021

@sarabala1979
The first 60 builds in the steps run in parallel, but the others do not; only one build runs at a time. I think the completed pods don't release their semaphores.


2qif49lt commented Jun 9, 2021

[attached GIF: sema]


sarabala1979 commented Jun 9, 2021

@2qif49lt which workflow controller version are you running?


2qif49lt commented Jun 9, 2021

@sarabala1979

containers:
      - args:
        - --configmap
        - workflow-controller-configmap
        - --executor-image
        - argoproj/argoexec:v3.1.0-rc12
        - --namespaced
        command:
        - workflow-controller
        env:
        - name: LEADER_ELECTION_IDENTITY
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        image: argoproj/workflow-controller:v3.1.0-rc12
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  containerRuntimeExecutor: pns
std@master:/data/samba/std/aes-argo$ kubectl logs workflow-controller-5ff9d7d64b-xfb2k -n argo
time="2021-06-09T22:56:30Z" level=info msg="index config" indexWorkflowSemaphoreKeys=true
time="2021-06-09T22:56:30Z" level=info msg="cron config" cronSyncPeriod=10s
time="2021-06-09T22:56:30.086Z" level=info msg="config map" name=workflow-controller-configmap
time="2021-06-09T22:56:30.100Z" level=info msg="Get configmaps 200"
time="2021-06-09T22:56:30.102Z" level=info msg="Configuration:\nartifactRepository: {}\ncontainerRuntimeExecutor: pns\ninitialDelay: 0s\nmetricsConfig: {}\nnodeEvents: {}\npodSpecLogStrategy: {}\ntelemetryConfig: {}\n"
time="2021-06-09T22:56:30.102Z" level=info msg="Persistence configuration disabled"
time="2021-06-09T22:56:30.103Z" level=info msg="Starting Workflow Controller" version=v3.1.0-rc12
time="2021-06-09T22:56:30.103Z" level=info msg="Workers: workflow: 32, pod: 32, pod cleanup: 4"
time="2021-06-09T22:56:30.105Z" level=info msg="Watch configmaps 200"
time="2021-06-09T22:56:30.106Z" level=info msg="List workflowtemplates 200"
time="2021-06-09T22:56:30.107Z" level=info msg="List pods 200"
time="2021-06-09T22:56:30.107Z" level=info msg="List configmaps 200"
time="2021-06-09T22:56:30.107Z" level=info msg="Watch workflowtemplates 200"
time="2021-06-09T22:56:30.110Z" level=info msg="Watch configmaps 200"
time="2021-06-09T22:56:30.114Z" level=info msg="Watch pods 200"
time="2021-06-09T22:56:30.114Z" level=info msg="List workflows 200"
time="2021-06-09T22:56:30.119Z" level=info msg="Watch workflows 200"
time="2021-06-09T22:56:30.206Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2021-06-09T22:56:30.209Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2021-06-09T22:56:30.211Z" level=info msg="Create selfsubjectaccessreviews 201"
time="2021-06-09T22:56:30.213Z" level=info msg="List clusterworkflowtemplates 200"
time="2021-06-09T22:56:30.214Z" level=info msg="Watch clusterworkflowtemplates 200"
time="2021-06-09T22:56:30.221Z" level=info msg="List workflows 200"
time="2021-06-09T22:56:30.226Z" level=info msg="Manager initialized successfully"
I0609 22:56:30.226733       1 leaderelection.go:243] attempting to acquire leader lease  argo/workflow-controller...
time="2021-06-09T22:56:30.226Z" level=info msg="Starting prometheus metrics server at localhost:9090/metrics"
time="2021-06-09T22:56:30.230Z" level=info msg="Get leases 200"
time="2021-06-09T22:56:30.230Z" level=info msg="new leader" id=workflow-controller-5ff9d7d64b-xfb2k leader=workflow-controller-5ff9d7d64b-smr9g

It behaves the same after restarting the workflow-controller pod.
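
(For reference, a controller restart can be done with something like the following; the deployment name is inferred from the pod name in the logs above:)

# Restart the workflow controller deployment
kubectl -n argo rollout restart deployment workflow-controller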

@sarabala1979
Member

@2qif49lt I didn't see any "waiting for lock" message for any step in the log. I am trying to reproduce this locally.


2qif49lt commented Jun 10, 2021

@sarabala1979 Sorry, here are the full log and the YAMLs corresponding to the gif.

workflow-controller.log

apiVersion: v1
kind: ConfigMap
metadata:
  name: conf
data:
  maxbuilder: "60"
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-sema
spec:
  entrypoint: build-all
  serviceAccountName: std
  templates:
  - name: build-all
    steps:
    - - name: split
        template: split
    - - name: build
        template: build
        arguments:
          parameters:
          - name: baseid
            value: "{{item.baseid}}"
          - name: tarid
            value: "{{item.tarid}}"
        withParam: "{{steps.split.outputs.result}}"
  - name: split
    script: 
      image: python:3.9-slim
      command: ["python"]
      source: |
        import json
        import sys

        output = []
        for id in range(70):
            output.append({'baseid': id, 'tarid': id + 10000})

        json.dump(output, sys.stdout)
  - name: build
    synchronization:
      semaphore:
        configMapKeyRef:
          name: conf
          key: maxbuilder
    inputs:
      parameters:
      - name: baseid
      - name: tarid
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 10; echo acquired lock"]

sarabala1979 self-assigned this Jun 14, 2021
sarabala1979 modified the milestones: v3.1, v3.0 Jun 14, 2021
@sarabala1979
Member

I am able to reproduce this issue in my environment. I will work on it.


snoe925 commented Jul 21, 2021

I also have this same problem on v3.0.8. Once it enters the run-one-at-a-time mode, the controller logs and argo watch disagree on the number of free locks: argo watch says 0/16, but the controller logs "15 free of 16".
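
A quick way to compare the two views is something like the following (a sketch; the deployment name and the exact log wording vary by install and version):

# Controller-side view: look for lock acquire/release messages in the controller logs
kubectl -n argo logs deploy/workflow-controller | grep -i lock
# Workflow-side view: the node status messages reported by the CLI
argo watch @latest -n argo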

@sarabala1979
Member

@2qif49lt @snoe925 I have fixed the issue and created a dev build for testing: sarabala1979/workflow-controller:6110.
Can you try it and let me know whether the issue is fixed in your scenario?
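
(Swapping in the dev build might look like the following; this is a sketch, and the deployment and container names are assumed to both be workflow-controller in the argo namespace:)

# Point the controller at the dev image (container name is an assumption)
kubectl -n argo set image deployment/workflow-controller workflow-controller=sarabala1979/workflow-controller:6110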


iatzmon commented Jul 25, 2021

@sarabala1979 Hi, I've been experiencing the same issue and tried testing your fix on v3.1.2, but I'm getting an error whenever a container task is executed:
Error (exit code 2): open /argo/podmetadata/annotations: no such file or directory


iatzmon commented Jul 25, 2021

@sarabala1979 Never mind: setting my argoexec tag to latest instead of v3.1.2 fixed that error. So far the fix seems to be working; I will update if I encounter any issues.
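
(For context, the executor image the controller injects into pods is set by the --executor-image argument shown in the deployment snippet earlier in this thread; changing the argoexec tag is a matter of editing that argument, e.g.:)

# Edit the controller deployment and change --executor-image
# from argoproj/argoexec:v3.1.0-rc12 to the tag that matches the controller build
kubectl -n argo edit deployment workflow-controller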


ebr commented Nov 18, 2021

We are seeing this exact issue on v3.2.3 (all components); K8s v1.20/EKS.

  • using a template semaphore global-max-pods
  • workflows scale up to N pods, up to the template semaphore value (17000 in this case)
  • then all subsequent workflow nodes are stuck with status: Waiting for default/ConfigMap/argo-sync/global-max-pods lock. Lock status: 0/17000
  • Pods complete and succeed, but no new sibling pods (from the same step template) are launched. We see a very small trickle of pods from child steps only.
  • Status is not updated for any of the existing nodes
  • In controller logs we see messages like acquired synchronization lock for a small number of nodes, and we confirm that some pods are starting, but at a very slow rate or in short bursts
  • Also in controller logs we see messages like Lock has been released by <namespace/workflowname/podname>. Available locks: 16495 semaphore=default/ConfigMap/argo-sync/global-max-pods.
  • when a workflow completes, nothing interesting happens, i.e. locks don't magically become available, etc.

So clearly the controller sees that plenty of locks are available, but this status isn't reconciled with the workflow nodes.

I will try to create a succinct repro workflow, but hopefully the above description paints a picture.

Aside from an actual fix, is there any way I could edit existing workflows to allow them to proceed? Can I somehow edit the status.synchronization.semaphore.waiting key - it has references to other workflow nodes, presumably the ones that are blocking this one from proceeding?

I can open another issue if that helps.

EDIT: I patched each affected workflow, removing .spec.templates[*].synchronization and .status.synchronization, and all workflows came back to life.
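
(For reference, a patch along those lines might look like the following. This is only a sketch: the workflow name is hypothetical, the namespace is assumed from the lock key above, removing template-level synchronization has to be done per template index, and it assumes the Workflow CRD does not use a status subresource, so status can be patched directly.)

# Drop the recorded semaphore state from a stuck workflow (name is hypothetical)
kubectl -n default patch workflow my-stuck-workflow --type=json -p='[{"op":"remove","path":"/status/synchronization"}]'
# Remove the template-level semaphore, one template index at a time
kubectl -n default patch workflow my-stuck-workflow --type=json -p='[{"op":"remove","path":"/spec/templates/0/synchronization"}]'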
