[FLINK-7960] [tests] Fix race conditions in ExecutionGraphRestartTest#completeCancellingForAllVertices #4933

tillrohrmann · 2017-11-01T15:59:03Z

What is the purpose of the change

One race condition is between waitUntilJobStatus(eg, JobStatus.FAILING, 1000) and the
subsequent completeCancellingForAllVertices call: when the job status switches to
FAILING, not all executions are necessarily in state CANCELING yet.

The other race condition is between completeCancellingForAllVertices and the fixed-delay
restart strategy configured without a delay. Suppose the 10th task is the one that failed.
To restart, the cancellation of the first 9 tasks has to complete, and that alone is
enough for the restart strategy to restart the job. If the restart happens before
completeCancellingForAllVertices has also cancelled the execution of the 10th task,
we may end up cancelling a fresh execution.
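
One way to close a window like this in a test is to block until every task has actually received its cancel call before completing the cancellation. The following is a minimal, self-contained sketch of that idea, not the code from this PR: `CancelCallCounter` and `CancelCallCounterDemo` are hypothetical names, and plain strings stand in for execution attempt IDs.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical helper: completes a future once it has been notified about N tasks. */
final class CancelCallCounter implements Consumer<String> {
    private final int expected;
    private final AtomicInteger seen = new AtomicInteger();
    private final CompletableFuture<Void> allSeen = new CompletableFuture<>();

    CancelCallCounter(int expected) {
        this.expected = expected;
    }

    @Override
    public void accept(String attemptId) {
        // incrementAndGet is atomic, so concurrent callers cannot lose an update.
        if (seen.incrementAndGet() == expected) {
            allSeen.complete(null);
        }
    }

    CompletableFuture<Void> allSeen() {
        return allSeen;
    }

    int seen() {
        return seen.get();
    }
}

final class CancelCallCounterDemo {
    /** Simulates cancel calls arriving from several threads and waits for all of them. */
    static int run(int tasks) {
        CancelCallCounter counter = new CancelCallCounter(tasks);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            for (int i = 0; i < tasks; i++) {
                final String attemptId = "attempt-" + i;
                pool.execute(() -> counter.accept(attemptId));
            }
            // Only proceed (e.g. complete the cancellation) after every call was seen.
            counter.allSeen().get(5, TimeUnit.SECONDS);
            return counter.seen();
        } catch (Exception e) {
            throw new IllegalStateException("Not all cancel calls arrived in time", e);
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("cancel calls seen: " + run(10));
    }
}
```

A plain `int` counter incremented from several threads could miss updates, in which case the future would never complete and the test would time out.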

R @GJL

Does this pull request potentially affect one of the following parts:

Dependencies (does it add or upgrade a dependency): (no)
The public API, i.e., is any changed class annotated with @Public(Evolving): (no)
The serializers: (no)
The runtime per-record code paths (performance sensitive): (no)
Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (no)

Documentation

Does this pull request introduce a new feature? (no)
If yes, how is the feature documented? (not applicable)

[hotfix] Make WaitForTasks using an AtomicInteger

GJL · 2017-11-01T16:10:52Z

On your branch, ExecutionGraphRestartTest#testRestartWithEagerSchedulingAndSlotSharing is failing consistently:

java.util.concurrent.TimeoutException
	at org.apache.flink.runtime.executiongraph.ExecutionGraphTestUtils.waitUntilJobStatus(ExecutionGraphTestUtils.java:116)
	at org.apache.flink.runtime.executiongraph.ExecutionGraphRestartTest.testRestartWithEagerSchedulingAndSlotSharing(ExecutionGraphRestartTest.java:776)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:55)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)

Edit: The problem is fixed if you initialize the Optional field (see comment below).

GJL · 2017-11-01T16:12:52Z

.../test/java/org/apache/flink/runtime/executiongraph/utils/SimpleAckingTaskManagerGateway.java

@@ -48,6 +48,8 @@

 private Optional<Consumer<ExecutionAttemptID>> optSubmitCondition;

+ private Optional<Consumer<ExecutionAttemptID>> optCancelCondition;


This will always be initialized to null. Isn't that a problem?

Yup, it is. Fixed it.
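
To illustrate the problem: a field of type Optional that has no initializer holds null, not Optional.empty(), so the first call on it throws a NullPointerException. The sketch below uses a hypothetical `GatewaySketch` class with String attempt IDs; it is not the real SimpleAckingTaskManagerGateway.

```java
import java.util.Optional;
import java.util.function.Consumer;

/** Hypothetical stand-in for a test gateway with an optional cancel callback. */
final class GatewaySketch {
    // Without this initializer the field would be null, and cancelTask would throw
    // a NullPointerException on the ifPresent call.
    private Optional<Consumer<String>> optCancelCondition = Optional.empty();

    void setCancelCondition(Consumer<String> condition) {
        optCancelCondition = Optional.of(condition);
    }

    void cancelTask(String attemptId) {
        // Safe only because the field is never null.
        optCancelCondition.ifPresent(condition -> condition.accept(attemptId));
    }

    public static void main(String[] args) {
        GatewaySketch gateway = new GatewaySketch();
        gateway.cancelTask("attempt-0"); // no callback registered: does nothing
        gateway.setCancelCondition(id -> System.out.println("cancel called for " + id));
        gateway.cancelTask("attempt-1");
    }
}
```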

GJL · 2017-11-01T16:14:49Z

flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java

@@ -844,7 +844,7 @@ else if (current == CANCELING || current == RUNNING || current == DEPLOYING) {
 // failing in the meantime may happen and is no problem.
 // anything else is a serious problem !!!
 if (current != FAILED) {
- String message = String.format("Asynchronous race: Found state %s after successful cancel call.", state);
+ String message = String.format("Asynchronous race: Found %s in state %s after successful cancel call.", vertex.getTaskNameWithSubtaskIndex(), state);
 LOG.error(message);


nit: slf4j's {} placeholders should be used.

In this case the message is created deliberately because it is reused in the line below. Moreover, the logging statement is at error level and will therefore be evaluated in almost all cases. What one could argue is whether plain string concatenation would be faster than String.format.
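
For reference, the two ways of building the message produce the same string; which one is faster is not measured here. This is a small sketch with a hypothetical `CancelRaceMessage` class and made-up task and state values, and standard error stands in for the logger.

```java
final class CancelRaceMessage {
    /** Builds the message with String.format, as in the diff above. */
    static String withFormat(String task, String state) {
        return String.format(
            "Asynchronous race: Found %s in state %s after successful cancel call.", task, state);
    }

    /** Builds the same message with plain concatenation. */
    static String withConcat(String task, String state) {
        return "Asynchronous race: Found " + task + " in state " + state
            + " after successful cancel call.";
    }

    public static void main(String[] args) {
        // The message is built once and reused for both the log line and the exception.
        String message = withFormat("Map (1/10)", "FINISHED");
        System.err.println(message);
        Exception failure = new Exception(message);
        System.err.println(failure.getMessage().equals(withConcat("Map (1/10)", "FINISHED")));
    }
}
```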

GJL · 2017-11-01T16:20:28Z

While you are at it:

ExecutionGraphTestUtils#switchToRunning

public static void switchToRunning(ExecutionGraph eg) {
	// check that all execution are in state DEPLOYING
	for (ExecutionVertex ev : eg.getAllExecutionVertices()) {
		final Execution exec = ev.getCurrentExecutionAttempt();
		assert(exec.getState() == ExecutionState.DEPLOYING);
	}

	// switch executions to RUNNING
	for (ExecutionVertex ev : eg.getAllExecutionVertices()) {
		final Execution exec = ev.getCurrentExecutionAttempt();
		exec.switchToRunning();
	}
}

could be improved to

public static void switchToRunning(ExecutionGraph eg) {
	// check that all execution are in state DEPLOYING
	for (ExecutionVertex ev : eg.getAllExecutionVertices()) {
		final Execution exec = ev.getCurrentExecutionAttempt();
		final ExecutionState executionState = exec.getState();
		assert executionState == ExecutionState.DEPLOYING
			: "Expected executionState to be DEPLOYING, was: " + executionState;
	}

	// switch executions to RUNNING
	for (ExecutionVertex ev : eg.getAllExecutionVertices()) {
		final Execution exec = ev.getCurrentExecutionAttempt();
		exec.switchToRunning();
	}
}

so that the failure reason is more obvious.
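
Note that a Java `assert` statement is only evaluated when the JVM runs with assertions enabled (`-ea`). As a possible variant, not something this PR does, an explicit check reports the same message regardless of that flag. The sketch below is self-contained; `StateCheckSketch` and its `State` enum are hypothetical stand-ins for the Flink types.

```java
final class StateCheckSketch {
    /** Hypothetical stand-in for ExecutionState. */
    enum State { CREATED, DEPLOYING, RUNNING }

    /** Fails with a descriptive message whether or not -ea is set. */
    static void requireState(State actual, State expected) {
        if (actual != expected) {
            throw new AssertionError(
                "Expected executionState to be " + expected + ", was: " + actual);
        }
    }

    public static void main(String[] args) {
        requireState(State.DEPLOYING, State.DEPLOYING); // passes silently
        try {
            requireState(State.RUNNING, State.DEPLOYING);
        } catch (AssertionError e) {
            System.out.println(e.getMessage());
        }
    }
}
```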

tillrohrmann · 2017-11-01T22:18:20Z

Thanks for the review @GJL. I've addressed your comments.

GJL · 2017-11-02T10:58:02Z

LGTM 👍

Tests didn't fail after ~1000 local executions.

GJL

👍

tillrohrmann · 2017-11-02T14:01:01Z

Thanks a lot for your review @GJL. Merging this PR.

tillrohrmann force-pushed the hardenExecutionGraphRestartTest branch 2 times, most recently from 7ec9949 to ae15481 Compare November 1, 2017 16:00

tillrohrmann force-pushed the hardenExecutionGraphRestartTest branch from ae15481 to 2c146ca Compare November 1, 2017 16:01

GJL reviewed Nov 1, 2017

tillrohrmann added 2 commits November 1, 2017 23:05

[hotfix] Set optCancelCondition to Optional.empty() in SimpleAckingTaskManagerGateway

aea1145

Add assertion message to ExecutionGraphTestUtils#switchToRunning

ec59dc9

GJL approved these changes Nov 2, 2017

asfgit closed this in f9b475f Nov 2, 2017

tillrohrmann deleted the hardenExecutionGraphRestartTest branch May 29, 2018 10:18

rmetzger added component=Runtime/Coordination component=Tests labels Mar 18, 2019
