Fix testRelocationFailureNotRetriedForever #109855

Conversation

idegtiarenko
Contributor

The test failure happens in the following scenario:

  1. the exception is thrown and org.elasticsearch.cluster.routing.allocation.AllocationService#applyFailedShards is executed
  2. applyFailedShards updates the shard with failed_attempts=1, moves it back to the STARTED state, and submits a new desired balance computation
  3. the new computation is delayed; the test observes the current state with no shard movements and ensureGreen exits
  4. the desired balance computation completes, and reconciliation starts and completes
  5. the test observes a RELOCATING shard

We can also exit before all retry attempts are exhausted, without even noticing, because the error count is not asserted:

  1. the exception is thrown and org.elasticsearch.cluster.routing.allocation.AllocationService#applyFailedShards is executed
  2. applyFailedShards updates the shard with failed_attempts=1, moves it back to the STARTED state, and submits a new desired balance computation
  3. the new computation is delayed; the test observes the current state with no shard movements and ensureGreen exits
  4. the test observes a STARTED shard (while the relocation failure counter is not exhausted yet)
  5. the desired balance computation completes, and reconciliation starts and completes, starting another relocation attempt

Until we have functionality to await the absence of a pending computation, the best option is to assertBusy that the computation is complete, as sketched below.
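For illustration, a minimal sketch of that approach, written as a fragment of the test method body (the index name "test-idx" and the node1 variable are placeholders for this sketch, not the exact code in this PR):

// Keep polling the cluster state until the shard has exhausted all relocation retries
// and has settled back into STARTED on the original node.
final int maxRetries = MaxRetryAllocationDecider.SETTING_ALLOCATION_MAX_RETRY.get(Settings.EMPTY);
assertBusy(() -> {
    var state = clusterAdmin().prepareState().get().getState();
    var shard = state.routingTable().index("test-idx").shard(0).primaryShard();
    assertThat(shard, notNullValue());
    // all retry attempts must be used up before the test is allowed to exit
    assertThat(shard.relocationFailureInfo().failedRelocations(), equalTo(maxRetries));
    // and the shard must still be STARTED on the original node
    assertThat(shard.state(), equalTo(ShardRoutingState.STARTED));
    assertThat(state.nodes().get(shard.currentNodeId()).getName(), equalTo(node1));
});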

Closes: #108951

@idegtiarenko idegtiarenko added >test Issues or PRs that are addressing/adding tests :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Meta label for distributed team v8.15.0 labels Jun 18, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

assertThat(shard, notNullValue());
assertThat(shard.state(), equalTo(ShardRoutingState.STARTED));
assertThat(state.nodes().get(shard.currentNodeId()).getName(), equalTo(node1));
assertThat(shard.relocationFailureInfo().failedRelocations(), equalTo(5));// see SETTING_ALLOCATION_MAX_RETRY
Contributor Author

Added this assertion to ensure we have exhausted all attempts before exiting.

Contributor

Why hardcode this rather than use the actual max retry value?

Contributor

I looked at this test before and I'm still trying to understand how these assertions prove that we don't retry forever, as the test name indicates. Since ensureGreen is not reliable, as you mentioned in the description, reaching maxRetries and still being STARTED on the current node does not necessarily mean we will not retry in the future. I guess this assumption holds as long as we don't reset retries and MaxRetryAllocationDecider is still there, but that is implicit in this test.

Contributor

++ let's say SETTING_ALLOCATION_MAX_RETRY.get(Settings.EMPTY) rather than the literal 5.

Also, can we assertBusy() that the relocation counter reaches this value (or, even better, use org.elasticsearch.test.ClusterServiceUtils#addTemporaryStateListener to await that cluster state), and then assert that the shard remains STARTED?
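A rough sketch of this suggestion, under the assumption that addTemporaryStateListener takes a ClusterService and a ClusterState predicate and returns a listener that can be awaited with safeAwait (the index name "test-idx" is a placeholder):

final int maxRetries = MaxRetryAllocationDecider.SETTING_ALLOCATION_MAX_RETRY.get(Settings.EMPTY);
final var clusterService = internalCluster().getCurrentMasterNodeInstance(ClusterService.class);
// wait until the relocation failure counter on the primary reaches the configured maximum
safeAwait(
    ClusterServiceUtils.addTemporaryStateListener(
        clusterService,
        state -> state.routingTable()
            .index("test-idx")
            .shard(0)
            .primaryShard()
            .relocationFailureInfo()
            .failedRelocations() >= maxRetries
    )
);
// then assert that the shard has stayed STARTED rather than attempting yet another relocation
var shard = clusterAdmin().prepareState().get().getState().routingTable().index("test-idx").shard(0).primaryShard();
assertThat(shard.state(), equalTo(ShardRoutingState.STARTED));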

Contributor Author

Will update to rely on the default rather than a hardcoded constant.

that the relocation counter reaches this value

Do you mean the relocation failure counter (the one already asserted on L152), or something else?

and then assert that the shard remains STARTED?

I guess this part is tricky. I can add another assert at the end, but theoretically there can always be a scheduled task that issues another cluster state change right after we perform the last assertion. This cannot even be excluded by waiting for languid tasks, as long as we are not using a deterministic task queue here.

Contributor

Do you mean the relocation failure counter (the one already asserted on L152)

Yeah, but wait only for that to reach 5, not for the rest of the conditions in the assertBusy().

I can add another assert at the end, but theoretically there can always be a scheduled task that issues another cluster state change right after we perform the last assertion.

Sure, but it shouldn't be relocating that shard once it has reached 5 failures, should it?

Contributor Author

Sure, but it shouldn't be relocating that shard once it has reached 5 failures, should it?

That is correct, assuming we know the implementation details.
With a black-box approach (or inventive ways to break the code without breaking the test), this is not completely reliable.

Contributor

Maybe I'm missing something, but I think the whole point of SETTING_ALLOCATION_MAX_RETRY is to limit this value to 5, blocking further recovery attempts once it reaches the limit.
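For reference, a small snippet showing how that limit can be read in a test rather than hardcoded (the setting behind SETTING_ALLOCATION_MAX_RETRY is index.allocation.max_retries, default 5; once the failure counter reaches it, MaxRetryAllocationDecider blocks further attempts until retries are explicitly reset, e.g. via a reroute with retry_failed):

// reads the default value (5) of index.allocation.max_retries
int maxRetries = MaxRetryAllocationDecider.SETTING_ALLOCATION_MAX_RETRY.get(Settings.EMPTY);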

Contributor Author

Correct; as long as we can rely on it, this test is valid.

Contributor

@mhl-b mhl-b left a comment

LGTM

@idegtiarenko idegtiarenko merged commit 35efffd into elastic:main Jul 5, 2024
15 checks passed
@idegtiarenko idegtiarenko deleted the fix_testRelocationFailureNotRetriedForever branch July 5, 2024 08:52
Labels
:Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Meta label for distributed team >test Issues or PRs that are addressing/adding tests v8.16.0
Development

Successfully merging this pull request may close these issues.

[CI] IndicesLifecycleListenerIT testRelocationFailureNotRetriedForever failing
4 participants