Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Correct usage of wait.ExponentialBackoff #4962

Merged
merged 2 commits into from
Jan 28, 2021
Merged

Conversation

alexec
Copy link
Contributor

@alexec alexec commented Jan 28, 2021

/ ExponentialBackoff repeats a condition check with exponential backoff.
//
// It repeatedly checks the condition and then sleeps, using `backoff.Step()`
// to determine the length of the sleep and adjust Duration and Steps.
// Stops and returns as soon as:
// 1. the condition check returns true or an error,
// 2. `backoff.Steps` checks of the condition have been done, or
// 3. a sleep truncated by the cap on duration has been completed.
// In case (1) the returned error is what the condition function returned.
// In all other cases, ErrWaitTimeout is returned.
func ExponentialBackoff(backoff Backoff, condition ConditionFunc) error {

The take-away here is that - if you want a retry, you must not return an error. However, we were doing that a LOT.

Frustratingly, this will obfuscate the underlying error when we have many transient errors- you'll get ErrWaitTimeout.

if err != nil {
log.Warnf("Failed to wait for container id '%s': %v", mainContainerID, err)
return false, err
return false, nil
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

big change here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

previously this code would have NEVER retried

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After this exponentialBackoff this function needs to return the original error instead of ErrWaitTimeout

@@ -395,7 +395,7 @@ func (p *PNSExecutor) GetTerminatedContainerStatus(ctx context.Context, containe
containerStatus = &containerStatusRes
return containerStatus.State.Terminated != nil, nil
}
return false, errors.New(errors.CodeNotFound, fmt.Sprintf("containerID %q is not found in the pod %s", containerID, p.podName))
return false, nil
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've noted that it can take a bit longer to get the container IDs in the pod status - so we should not error - we should retry

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as the previous comment, the original error needs to return from this function instead of ErrWaitTimeout

@alexec alexec marked this pull request as ready for review January 28, 2021 18:33
@@ -59,7 +60,7 @@ func (o *Operation) Dispatch(ctx context.Context) {
nameSuffix := fmt.Sprintf("%v", time.Now().Unix())
err := wait.ExponentialBackoff(retry.DefaultRetry, func() (bool, error) {
_, err := o.dispatch(ctx, event, nameSuffix)
return err == nil, err
return errorsutil.Done(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice

@alexec alexec merged commit a00aa32 into argoproj:master Jan 28, 2021
@alexec alexec deleted the done branch January 28, 2021 23:14
@simster7 simster7 mentioned this pull request Feb 1, 2021
32 tasks
@simster7 simster7 mentioned this pull request Feb 8, 2021
38 tasks
copierrj pushed a commit to PDOK/argo-workflows that referenced this pull request Mar 22, 2021
copierrj pushed a commit to PDOK/argo-workflows that referenced this pull request Mar 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants