semaphore lost #6110
Comments
@2qif49lt are you referring to the status message?
@sarabala1979
@2qif49lt which workflow controller version are you running?
containers:
- args:
  - --configmap
  - workflow-controller-configmap
  - --executor-image
  - argoproj/argoexec:v3.1.0-rc12
  - --namespaced
  command:
  - workflow-controller
  env:
  - name: LEADER_ELECTION_IDENTITY
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: metadata.name
  image: argoproj/workflow-controller:v3.1.0-rc12
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
data:
  containerRuntimeExecutor: pns
It behaves the same after restarting the workflow controller pod.
@2qif49lt I didn't see any "step is waiting for lock" message in the log. I am trying to reproduce it locally.
@sarabala1979 Sorry, here are the full log and the YAMLs corresponding to the gif.

apiVersion: v1
kind: ConfigMap
metadata:
  name: conf
data:
  maxbuilder: "60"
---
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-sema
spec:
  entrypoint: build-all
  serviceAccountName: std
  templates:
  - name: build-all
    steps:
    - - name: split
        template: split
    - - name: build
        template: build
        arguments:
          parameters:
          - name: baseid
            value: "{{item.baseid}}"
          - name: tarid
            value: "{{item.tarid}}"
        withParam: "{{steps.split.outputs.result}}"
  - name: split
    script:
      image: python:3.9-slim
      command: ["python"]
      source: |
        import json
        import sys
        output = []
        for id in range(70):
            output.append({'baseid': id, 'tarid': id + 10000})
        json.dump(output, sys.stdout)
  - name: build
    synchronization:
      semaphore:
        configMapKeyRef:
          name: conf
          key: maxbuilder
    inputs:
      parameters:
      - name: baseid
      - name: tarid
    container:
      image: alpine:latest
      command: [sh, -c]
      args: ["sleep 10; echo acquired lock"]
I am able to reproduce this issue in my environment. I will work on this.
I also have this same problem on v3.0.8. Once it enters the run-one-at-a-time mode, the controller logs and argo watch disagree on the number of free locks: argo watch says 0/16, but the controller logs "15 free of 16".
@sarabala1979 Hi, I've been experiencing the same issue and tried testing your fix on v3.1.2, but I'm getting an error whenever a container task is being executed:
@sarabala1979 Never mind, setting my argoexec tag to latest instead of v3.1.2 fixed the error, and so far the fix seems to be working. I will update if I encounter any issues.
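For anyone following along, the workaround above amounts to pointing the controller at a newer executor image. This is only a rough sketch, assuming the executor image is set via the --executor-image flag as in the controller snippet earlier in this thread; the controller image tag and the use of latest come from the comment above and are not a general recommendation.

# Same controller container spec as earlier in the thread, with only the
# executor image changed (assumed to be set via --executor-image).
containers:
- command:
  - workflow-controller
  args:
  - --configmap
  - workflow-controller-configmap
  - --executor-image
  - argoproj/argoexec:latest   # was argoproj/argoexec:v3.1.0-rc12
  - --namespaced
  image: argoproj/workflow-controller:v3.1.2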
We are seeing this exact issue on v3.2.3 (all components); K8s v1.20/EKS.
So clearly the controller sees that we have plenty of locks available, but this status isn't reconciled with the workflow nodes. I will try to create a succinct repro workflow, but hopefully the above description paints a picture. Aside from an actual fix, is there any way I could edit existing workflows to allow them to proceed? Can I somehow edit the […]? I can open another issue if that helps. EDIT: I patched each affected workflow, removing […].
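For anyone else stuck in this state: the block the patch above most likely refers to is the workflow's synchronization status. Below is a hedged sketch of what that section can look like on an affected Workflow object; the field layout, semaphore path, and node IDs are my own reconstruction (not copied from this issue), and deleting it by hand (e.g. via kubectl edit) is a workaround reported here, not a documented recovery path.

# Hypothetical excerpt of an affected Workflow's status; the exact schema,
# semaphore path, and node IDs below are assumptions for illustration only.
status:
  synchronization:
    semaphore:
      holding:
      - semaphore: default/ConfigMap/conf/maxbuilder
        holders:
        - test-sema-xxxxx-1111111111   # node that finished but still appears to hold a slot
      waiting:
      - semaphore: default/ConfigMap/conf/maxbuilder
        holders:
        - test-sema-xxxxx-2222222222   # node stuck pending even though slots should be free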
Summary
Argo doesn't increase the semaphore count (release the lock) when a pod completes, which causes the flow to keep running one step at a time.
Diagnostics