Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutex not being released on step completion #4832

Closed
davidcollom opened this issue Jan 5, 2021 · 24 comments · Fixed by #4847
Closed

Mutex not being released on step completion #4832

davidcollom opened this issue Jan 5, 2021 · 24 comments · Fixed by #4847
Assignees
Labels

Comments

@davidcollom
Copy link
Contributor

davidcollom commented Jan 5, 2021

Summary

When using a variable within the mutex name, Mutex's aren't released until workflow-controller is restarted.
Within the example below, the workflow never completes and waits pending on the final unlock.

Additionally all other workflows aren't able to continue from the initial mutex locked (from gen-number-list).

Restarting the workflow-controller brings the workflow to life and all locks are released.

Diagnostics

GKE - v1.17.12-gke.1504

Argo version: v2.12.3

Workflow: https://gist.github.com/davidcollom/9c6f7d9d1819fe922d21b3a69e561754#file-workflow-yaml

Logs: https://gist.github.com/davidcollom/9c6f7d9d1819fe922d21b3a69e561754#file-logs


Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritise the issues with the most 👍.

@simster7
Copy link
Member

simster7 commented Jan 5, 2021

@sarabala1979 ?

@sarabala1979 sarabala1979 self-assigned this Jan 5, 2021
@sarabala1979
Copy link
Member

sarabala1979 commented Jan 5, 2021

I will look

@davidcollom
Copy link
Contributor Author

@sarabala1979 Something I didn't highlight in my initial report was that one/some of the steps are daemon'd steps, I have a feeling that this could be a good line of investigation of this issue. (I haven't confirmed this)

@simster7
Copy link
Member

simster7 commented Jan 6, 2021

Possibly related to #4835?

@sarabala1979
Copy link
Member

Let me verify that. I am able to reproduce locally. I am investigating

@sarabala1979
Copy link
Member

@davidcollom I was able to reproduce my old codebase. I found a fix to do AddAfter on the nextWorkflow function. But This is already fixed in the latest code and I am not able to reproduce it with the latest 2.12.3. Can you try with 2.12.3?

https://github.com/argoproj/argo/blob/6b3ce504508707472d4d31c6c522d1af02104b05/workflow/controller/controller.go#L295

@davidcollom
Copy link
Contributor Author

@sarabala1979 We upgraded to 2.12.3 Upon its release, we came in this morning to see that the same workflow was locked with similar issues.

We are in the process of investigating if we can remove the daemon step to further rule out that this is the cause.

Something that we've noticed and put down to the daemon step/template was that the duration of the workflow continues to increase within the UI, however this issue has been more of a problem than that (will raise as separate issue if we find more info)

@sarabala1979
Copy link
Member

@davidcollom I have a fix. But I couldn't reproduce on my local. I build docker image docker pull sarabala1979/workflow-controller:mutex.
Is it possible to try it on your env and let me the details and logs?

@davidcollom
Copy link
Contributor Author

I can take a look first thing Monday and see if we find / have this issue and report back after a few days? Thanks for taking a look, it is greatly appreciated!

@davidcollom
Copy link
Contributor Author

@sarabala1979 I've deployed this release to the workflow-controller only in our cluster, will see how things get on over the next few days and report back.

@davidcollom
Copy link
Contributor Author

@sarabala1979 We had another instance of the locks not being released this morning I'm afraid.

Unfortunately, I'm not able to provide examples without spending some considerable time to redact the output, I'll spend some time this morning to see if the example can reproduce the same issue.

@davidcollom
Copy link
Contributor Author

davidcollom commented Jan 12, 2021

If its of any use, the following was present in the status:

  synchronization:
    mutex:
      waiting:
      - holder: xxxxx/create-xxxxxxxxx-ztkw9/create-xxxxxxxx-ztkw9-3743669695
        mutex: xxxxx/Mutex/terraform-xxxxxx

with the workflow being delete-xxxxxxxxx-sknxg (within the same namespace) create-xxxxxxxxx-ztkw9 successfully completed a few hours before the delete-xxxxxxxxx-sknxg workflow was initialized.

Workflow controller logs referencing this mutext:

time="2021-01-11T16:37:16.386Z" level=info msg="xxxxx/Mutex/terraform-xxxxxx acquired by xxxxx/create-xxxxxxxxx-ztkw9/create-xxxxxxxxx-ztkw9-3743669695 " mutex=xxxxx/Mutex/terraform-xxxxxx
time="2021-01-12T10:02:55.893Z" level=info msg="xxxxx/Mutex/terraform-xxxxxx acquired by xxxxx/delete-xxxxxxxxx-sknxg/delete-xxxxxxxxx-sknxg-576744040 " mutex=xxxxx/Mutex/terraform-xxxxxx

The second, line is after restarting the workflow controller.

@sarabala1979
Copy link
Member

@sarabala1979 We had another instance of the locks not being released this morning I'm afraid.

Is this instance using my eng build? Can you provide the workflow controller lock for this workflow?

@sarabala1979
Copy link
Member

@davidcollom

@davidcollom
Copy link
Contributor Author

Yes, we're using your eng build.

@sarabala1979
Copy link
Member

@davidcollom ok. can I get the entire workflow controller log for this workflow?

@davidcollom
Copy link
Contributor Author

@sarabala1979
Copy link
Member

@davidcollom I have added below extra log statement on eng build. Can you include it on your log. grep?

        s.log.Infof("Current Pending: %v", s.pending)
	s.log.Infof("Current Holder: %v", s.lockHolder)

@davidcollom
Copy link
Contributor Author

davidcollom commented Jan 13, 2021

@sarabala1979 Is this in a new release/image? as I'll need to restart the pod, I've not removed any lines in the previous log, only redacted infromation.

@sarabala1979
Copy link
Member

@davidcollom it is only in my eng build. Just I want to debug who is waiting and who is holding the lock.
Are you sure you are using my eng build?

@davidcollom
Copy link
Contributor Author

$ k get deploy -n argo workflow-controller -o yaml | grep image:
        image: sarabala1979/workflow-controller:mutex

Yes, confirmed we're using your eng build.

@davidcollom
Copy link
Contributor Author

@sarabala1979 As discussed in slack, the latest Eng Build shows some good signs of this being resolved.

The example workflow provided initially (which became stuck) completed from end to end without introversion from myself or the team.

We will continue running this build for a few days to ensure our production workflows don't show previous symptoms or any other issues ahead and report back (Monday?) with any findings.

@davidcollom
Copy link
Contributor Author

@sarabala1979 Sorry for not getting back to you yesterday. We've had no issues over the last few days and not had to restart the workflow-controller so looks like the latest fix is working as expected! Thank you so much!

@sarabala1979
Copy link
Member

@davidcollom Thanks for your update and help. PR is already in review. It will be released next upcoming release

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants