Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(controller): Preserve the original slice when removing string #4835

Merged
merged 1 commit into from
Jan 7, 2021
Merged

fix(controller): Preserve the original slice when removing string #4835

merged 1 commit into from
Jan 7, 2021

Conversation

juchaosong
Copy link
Contributor

Checklist:

@juchaosong
Copy link
Contributor Author

The previous RemoveString() will change the origin slice. When releasing all semaphores, it will change original wf.Status.Synchronization.Semaphore.Holding referring to https://github.com/argoproj/argo/blob/master/workflow/sync/sync_manager.go#L182. This will cause the iterated array to change, and some semaphores does't be released.
You can change https://github.com/argoproj/argo/blob/master/examples/synchronization-tmpl-level.yaml#L37 sleep 10 to more time to reproduce the bug.

kubectl create -f examples/synchronization-tmpl-level.yaml
kubectl delete wf $(kubectl get wf | grep synchronization-tmpl-level | awk '{print$1}')

The controller logs:

time="2021-01-06T10:35:54.659Z" level=info msg="Processing workflow" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.659Z" level=info msg="Updated phase  -> Running" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.659Z" level=info msg="Steps node synchronization-tmpl-level-gfzqp initialized Running" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.659Z" level=info msg="StepGroup node synchronization-tmpl-level-gfzqp-3634344595 initialized Running" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.665Z" level=info msg="default/ConfigMap/my-config/template acquired by default/synchronization-tmpl-level-gfzqp/synchronization-tmpl-level-gfzqp-397032944 " semaphore=default/ConfigMap/my-config/template
time="2021-01-06T10:35:54.665Z" level=info msg="Node synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(0:1) acquired synchronization lock" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.665Z" level=info msg="Pod node synchronization-tmpl-level-gfzqp-397032944 initialized Pending" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.683Z" level=info msg="Created pod: synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(0:1) (synchronization-tmpl-level-gfzqp-397032944)" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.688Z" level=info msg="default/ConfigMap/my-config/template acquired by default/synchronization-tmpl-level-gfzqp/synchronization-tmpl-level-gfzqp-1627841290 " semaphore=default/ConfigMap/my-config/template
time="2021-01-06T10:35:54.688Z" level=info msg="Node synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(1:2) acquired synchronization lock" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.688Z" level=info msg="Pod node synchronization-tmpl-level-gfzqp-1627841290 initialized Pending" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.703Z" level=info msg="Created pod: synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(1:2) (synchronization-tmpl-level-gfzqp-1627841290)" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.708Z" level=info msg="default/ConfigMap/my-config/template acquired by default/synchronization-tmpl-level-gfzqp/synchronization-tmpl-level-gfzqp-614236760 " semaphore=default/ConfigMap/my-config/template
time="2021-01-06T10:35:54.708Z" level=info msg="Node synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(2:3) acquired synchronization lock" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.708Z" level=info msg="Pod node synchronization-tmpl-level-gfzqp-614236760 initialized Pending" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.723Z" level=info msg="Created pod: synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(2:3) (synchronization-tmpl-level-gfzqp-614236760)" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.728Z" level=info msg="Pod node synchronization-tmpl-level-gfzqp-4180462278 initialized Pending" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.733Z" level=info msg="Pod node synchronization-tmpl-level-gfzqp-2103118504 initialized Pending" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.733Z" level=info msg="Workflow step group node synchronization-tmpl-level-gfzqp-3634344595 not yet completed" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:54.742Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=2415604849 workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:55.683Z" level=info msg="Processing workflow" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:55.683Z" level=info msg="Updating node synchronization-tmpl-level-gfzqp-614236760 message: ContainerCreating"
time="2021-01-06T10:35:55.683Z" level=info msg="Updating node synchronization-tmpl-level-gfzqp-1627841290 message: ContainerCreating"
time="2021-01-06T10:35:55.683Z" level=info msg="Updating node synchronization-tmpl-level-gfzqp-397032944 message: ContainerCreating"
time="2021-01-06T10:35:55.689Z" level=info msg="Node synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(0:1) acquired synchronization lock" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:55.689Z" level=info msg="Skipped pod synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(0:1) (synchronization-tmpl-level-gfzqp-397032944) creation: already exists" namespace=default podPhase=Pending workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:55.695Z" level=info msg="Node synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(1:2) acquired synchronization lock" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:55.695Z" level=info msg="Skipped pod synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(1:2) (synchronization-tmpl-level-gfzqp-1627841290) creation: already exists" namespace=default podPhase=Pending workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:55.700Z" level=info msg="Node synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(2:3) acquired synchronization lock" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:55.700Z" level=info msg="Skipped pod synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(2:3) (synchronization-tmpl-level-gfzqp-614236760) creation: already exists" namespace=default podPhase=Pending workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:55.711Z" level=info msg="Workflow step group node synchronization-tmpl-level-gfzqp-3634344595 not yet completed" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:55.720Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=2415604911 workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:57.540Z" level=info msg="Processing workflow" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:57.540Z" level=info msg="Updating node synchronization-tmpl-level-gfzqp-614236760 status Pending -> Running"
time="2021-01-06T10:35:57.540Z" level=info msg="Updating node synchronization-tmpl-level-gfzqp-397032944 status Pending -> Running"
time="2021-01-06T10:35:57.540Z" level=info msg="Updating node synchronization-tmpl-level-gfzqp-1627841290 status Pending -> Running"
time="2021-01-06T10:35:57.546Z" level=info msg="Node synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(0:1) acquired synchronization lock" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:57.547Z" level=info msg="Skipped pod synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(0:1) (synchronization-tmpl-level-gfzqp-397032944) creation: already exists" namespace=default podPhase=Running workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:57.554Z" level=info msg="Node synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(1:2) acquired synchronization lock" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:57.554Z" level=info msg="Skipped pod synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(1:2) (synchronization-tmpl-level-gfzqp-1627841290) creation: already exists" namespace=default podPhase=Running workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:57.560Z" level=info msg="Node synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(2:3) acquired synchronization lock" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:57.560Z" level=info msg="Skipped pod synchronization-tmpl-level-gfzqp[0].synchronization-acquire-lock(2:3) (synchronization-tmpl-level-gfzqp-614236760) creation: already exists" namespace=default podPhase=Running workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:57.572Z" level=info msg="Workflow step group node synchronization-tmpl-level-gfzqp-3634344595 not yet completed" namespace=default workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:35:57.581Z" level=info msg="Workflow update successful" namespace=default phase=Running resourceVersion=2415605037 workflow=synchronization-tmpl-level-gfzqp
time="2021-01-06T10:36:01.405Z" level=info msg="Lock has been released by default/synchronization-tmpl-level-gfzqp/synchronization-tmpl-level-gfzqp-397032944. Available locks: 1" semaphore=default/ConfigMap/my-config/template
time="2021-01-06T10:36:01.405Z" level=info msg="default/synchronization-tmpl-level-gfzqp/synchronization-tmpl-level-gfzqp-397032944 released a lock from default/ConfigMap/my-config/template"
time="2021-01-06T10:36:01.405Z" level=info msg="Lock has been released by default/synchronization-tmpl-level-gfzqp/synchronization-tmpl-level-gfzqp-614236760. Available locks: 2" semaphore=default/ConfigMap/my-config/template
time="2021-01-06T10:36:01.405Z" level=info msg="default/synchronization-tmpl-level-gfzqp/synchronization-tmpl-level-gfzqp-614236760 released a lock from default/ConfigMap/my-config/template"
time="2021-01-06T10:36:01.406Z" level=info msg="default/synchronization-tmpl-level-gfzqp/synchronization-tmpl-level-gfzqp-614236760 released a lock from default/ConfigMap/my-config/template"
time="2021-01-06T10:39:43.880Z" level=info msg="Alloc=10960 TotalAlloc=99747 Sys=70592 NumGC=15 Goroutines=176"

You will find default/synchronization-tmpl-level-gfzqp/synchronization-tmpl-level-gfzqp-614236760 release lock twice, and only 2 locks available, but 3 locks acquired.

@juchaosong juchaosong changed the title Preserve the original slice when removing string fix: Preserve the original slice when removing string Jan 6, 2021
Copy link
Member

@simster7 simster7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see how the original slice could be affected by this function. I played around a bit and found a case where it is actually malformed: https://play.golang.org/p/TGuB3gcWxDc

However, as this pattern is commonly used (and as it is used in this case) is to immediatley reassign the parameter slice with the returning value. This doesn't cause an issue: https://play.golang.org/p/zxJDU8MSt5N

This is the only way this function is used in our codebase:

https://github.com/argoproj/argo/blob/945106e3dc395055bac2034d633ebce86d1be749/pkg/apis/workflow/v1alpha1/workflow_types.go#L2595

Is there something I'm missing here? Otherwise it seems like this change would have no effect

@simster7 simster7 self-assigned this Jan 6, 2021
@alexec alexec changed the title fix: Preserve the original slice when removing string fix(controller): Preserve the original slice when removing string Jan 6, 2021
@alexec
Copy link
Contributor

alexec commented Jan 6, 2021

@juchaosong
Copy link
Contributor Author

I see how the original slice could be affected by this function. I played around a bit and found a case where it is actually malformed: https://play.golang.org/p/TGuB3gcWxDc

However, as this pattern is commonly used (and as it is used in this case) is to immediatley reassign the parameter slice with the returning value. This doesn't cause an issue: https://play.golang.org/p/zxJDU8MSt5N

This is the only way this function is used in our codebase:

https://github.com/argoproj/argo/blob/945106e3dc395055bac2034d633ebce86d1be749/pkg/apis/workflow/v1alpha1/workflow_types.go#L2595

Is there something I'm missing here? Otherwise it seems like this change would have no effect

https://github.com/argoproj/argo/blob/945106e3dc395055bac2034d633ebce86d1be749/workflow/sync/sync_manager.go#L179-L184
After invoking wf.Status.Synchronization.Semaphore.LockReleased(holderKey, holding.Semaphore), the original holding.Holders will be changed, so the iterating holding.Holders will get wrong holderKey. You can add log here to print holderKey and holding.Holders when reproduce the bug.

@juchaosong
Copy link
Contributor Author

https://github.com/argoproj/argo/blob/adfa988f9df64b629e08687737a80f2f6e0a6289/workflow/sync/sync_manager.go#L179
Or the iterating can be changed to for _, holderKey := range holding.DeepCopy().Holders, it will ensure that the iterated holders will not be changed.
But I don't think it's a good way to change the original slice in RemoveString(), which may cause unexpected results to the caller.

@@ -3,7 +3,9 @@ package slice
func RemoveString(slice []string, element string) []string {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this method should either modify the argument (and not return anything) or not modify the argument - it should not do both

@alexec alexec merged commit 53110b6 into argoproj:master Jan 7, 2021
@simster7
Copy link
Member

simster7 commented Jan 7, 2021

I see, thanks @juchaosong

@simster7 simster7 mentioned this pull request Jan 12, 2021
15 tasks
simster7 pushed a commit that referenced this pull request Jan 12, 2021
@simster7
Copy link
Member

Fix for this is out now in https://github.com/argoproj/argo/releases/tag/v2.12.4

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants