[FLINK-7960] [tests] Fix race conditions in ExecutionGraphRestartTest#completeCancellingForAllVertices #4933
Conversation
Force-pushed 7ec9949 to ae15481.

Commit message: [FLINK-7960] [tests] Fix race conditions in ExecutionGraphRestartTest#completeCancellingForAllVertices

One race condition is between waitUntilJobStatus(eg, JobStatus.FAILING, 1000) and the subsequent completeCancellingForAllVertices, where not all executions are in state CANCELLING. The other race condition is between completeCancellingForAllVertices and the fixed delay restart without a delay. The problem is that the 10th task could have failed. In order to restart, we would have to complete the cancellation for the first 9 tasks. This is enough for the restart strategy to restart the job. If this happens before completeCancellingForAllVertices has also cancelled the execution of the 10th task, it could happen that we cancel a fresh execution.

[hotfix] Make WaitForTasks use an AtomicInteger

Force-pushed ae15481 to 2c146ca.
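The hotfix commit mentions switching WaitForTasks to an AtomicInteger. As a minimal sketch of that pattern (the class name, fields, and methods here are simplified stand-ins, not Flink's actual test utility), counting task arrivals with an atomic increment ensures that concurrent callers can never race on the counter and miss completing the waiting future:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

public class WaitForTasksSketch {
    private final int expected;
    private final AtomicInteger count = new AtomicInteger(0);
    private final CompletableFuture<Boolean> future = new CompletableFuture<>();

    WaitForTasksSketch(int expected) {
        this.expected = expected;
    }

    // Called once per task. incrementAndGet is atomic, so two concurrent
    // callers can never observe the same counter value and thereby either
    // double-complete or never complete the future.
    void taskArrived() {
        if (count.incrementAndGet() == expected) {
            future.complete(true);
        }
    }

    boolean isComplete() {
        return future.isDone();
    }

    public static void main(String[] args) {
        WaitForTasksSketch waiter = new WaitForTasksSketch(3);
        for (int i = 0; i < 3; i++) {
            waiter.taskArrived();
        }
        System.out.println(waiter.isComplete()); // prints true
    }
}
```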
On your branch
Edit: Problem is fixed if you initialize the
@@ -48,6 +48,8 @@
 	private Optional<Consumer<ExecutionAttemptID>> optSubmitCondition;
+
+	private Optional<Consumer<ExecutionAttemptID>> optCancelCondition;
This will always be null-initialized. Isn't that a problem?
Yep, it is. Fixed it.
@@ -844,7 +844,7 @@ else if (current == CANCELING || current == RUNNING || current == DEPLOYING) {
 			// failing in the meantime may happen and is no problem.
 			// anything else is a serious problem !!!
 			if (current != FAILED) {
-				String message = String.format("Asynchronous race: Found state %s after successful cancel call.", state);
+				String message = String.format("Asynchronous race: Found %s in state %s after successful cancel call.", vertex.getTaskNameWithSubtaskIndex(), state);
 				LOG.error(message);
nit: slf4j's {} placeholders should be used.
In this case the message has been created deliberately, because we reuse the message in the line below. Moreover, the logging statement is on error level and thus will be evaluated in almost all cases. What one could argue is whether normal string concatenation wouldn't be faster than String.format.
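To illustrate the point about reusing the message (a minimal sketch; the helper name buildRaceMessage is hypothetical, while the format string mirrors the diff above): the text is built eagerly with String.format because the same string is needed twice, once for the log statement and once for the exception, so slf4j's lazy {} placeholders would not save the formatting work:

```java
public class MessageReuse {
    // Built eagerly because the identical text is needed in two places:
    // LOG.error(message) and the exception constructed right after it.
    static String buildRaceMessage(String taskName, String state) {
        return String.format(
                "Asynchronous race: Found %s in state %s after successful cancel call.",
                taskName, state);
    }

    public static void main(String[] args) {
        String message = buildRaceMessage("Map (3/10)", "RUNNING");
        // In the real code, this message would be passed to both the
        // logger and the exception constructor.
        System.out.println(message);
    }
}
```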
Ok
While you are at it:
could be improved to
so that the failure reason is more obvious.
Thanks for the review @GJL. I've addressed your comments.
LGTM 👍 Tests didn't fail after ~1000 local executions.
👍
Thanks a lot for your review @GJL. Merging this PR.
What is the purpose of the change
One race condition is between waitUntilJobStatus(eg, JobStatus.FAILING, 1000) and the
subsequent completeCancellingForAllVertices, where not all executions are in state
CANCELLING.
The other race condition is between completeCancellingForAllVertices and the fixed
delay restart without a delay. The problem is that the 10th task could have failed.
In order to restart, we would have to complete the cancellation for the first 9 tasks.
This is enough for the restart strategy to restart the job. If this happens before
completeCancellingForAllVertices has also cancelled the execution of the 10th task,
it could happen that we cancel a fresh execution.
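The second race above can be sketched with a toy state machine (a simplified illustration under assumed names, not Flink's actual Execution class): completing a cancellation is only valid for an execution observed in CANCELLING, and the fix is to drive every execution into CANCELLING before completing any of them, so a restart cannot interleave and hand us a fresh execution to cancel:

```java
import java.util.Arrays;
import java.util.List;

public class WaitBeforeCancelSketch {

    enum State { RUNNING, CANCELLING, CANCELED }

    static class Execution {
        State state = State.RUNNING;

        void startCancelling() {
            state = State.CANCELLING;
        }

        void completeCancelling() {
            // Completing a cancellation is only valid from CANCELLING; a
            // fresh execution created by a restart must not be cancelled here.
            if (state != State.CANCELLING) {
                throw new IllegalStateException("Found state " + state + " after cancel call.");
            }
            state = State.CANCELED;
        }
    }

    public static void main(String[] args) {
        List<Execution> executions =
                Arrays.asList(new Execution(), new Execution(), new Execution());
        // The fix: first observe ALL executions in CANCELLING ...
        executions.forEach(Execution::startCancelling);
        // ... and only then complete the cancellation for each of them.
        executions.forEach(Execution::completeCancelling);
        executions.forEach(e -> System.out.println(e.state));
    }
}
```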
R @GJL
Does this pull request potentially affect one of the following parts:
@Public(Evolving): (no)

Documentation