Fix testRelocationFailureNotRetriedForever #109855

Conversation

idegtiarenko
Contributor

The test failure happens in the following scenario:

  1. the exception is thrown and org.elasticsearch.cluster.routing.allocation.AllocationService#applyFailedShards is executed
  2. applyFailedShards updates the shard with failed_attempts=1, moves it back to the STARTED state, and submits a new desired balance computation
  3. the new computation is delayed; the test observes the current state with no shard movements and ensureGreen exits
  4. the desired balance computation completes, and reconciliation starts and completes
  5. the test observes a RELOCATING shard

We can also exit before all retry attempts are exhausted, without even noticing, because the error count is not asserted:

  1. the exception is thrown and org.elasticsearch.cluster.routing.allocation.AllocationService#applyFailedShards is executed
  2. applyFailedShards updates the shard with failed_attempts=1, moves it back to the STARTED state, and submits a new desired balance computation
  3. the new computation is delayed; the test observes the current state with no shard movements and ensureGreen exits
  4. the test observes a STARTED shard (while the relocation failure counter is not exhausted yet)
  5. the desired balance computation completes, and reconciliation starts and completes, starting another relocation attempt

Until we have functionality to await the absence of a pending computation, the best option is to assertBusy that the computation is complete, as sketched below.
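For illustration, a minimal sketch of that approach, written as a fragment of the test method body (the index name "test-idx" and the node1 variable are placeholders for this sketch, not the exact code in this PR):

// Keep polling the cluster state until the shard has exhausted all relocation retries
// and has settled back into STARTED on the original node.
final int maxRetries = MaxRetryAllocationDecider.SETTING_ALLOCATION_MAX_RETRY.get(Settings.EMPTY);
assertBusy(() -> {
    var state = clusterAdmin().prepareState().get().getState();
    var shard = state.routingTable().index("test-idx").shard(0).primaryShard();
    assertThat(shard, notNullValue());
    // all retry attempts must be used up before the test is allowed to exit
    assertThat(shard.relocationFailureInfo().failedRelocations(), equalTo(maxRetries));
    // and the shard must still be STARTED on the original node
    assertThat(shard.state(), equalTo(ShardRoutingState.STARTED));
    assertThat(state.nodes().get(shard.currentNodeId()).getName(), equalTo(node1));
});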

Closes: #108951

@idegtiarenko idegtiarenko added >test Issues or PRs that are addressing/adding tests :Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Meta label for distributed team v8.15.0 labels Jun 18, 2024
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

assertThat(shard, notNullValue());
assertThat(shard.state(), equalTo(ShardRoutingState.STARTED));
assertThat(state.nodes().get(shard.currentNodeId()).getName(), equalTo(node1));
assertThat(shard.relocationFailureInfo().failedRelocations(), equalTo(5));// see SETTING_ALLOCATION_MAX_RETRY
Contributor Author

Added this assertion to ensure we have exhausted all attempts before exiting.

Contributor

Why hardcode this rather than use the actual max retry value?

Contributor

I looked at this test before and I'm still trying to understand how these assertions prove that we don't retry forever, as the test name indicates. Since ensureGreen is not reliable, as you mentioned in the description, reaching maxRetries and still being STARTED on the current node does not necessarily mean we will not retry in the future. I guess this assumption holds as long as we don't reset retries and MaxRetryAllocationDecider is still there, but that is implicit in this test.

Contributor

++ let's say SETTING_ALLOCATION_MAX_RETRY.get(Settings.EMPTY) rather than the literal 5.

Also, can we assertBusy() that the relocation counter reaches this value (or, even better, use org.elasticsearch.test.ClusterServiceUtils#addTemporaryStateListener to await that cluster state), and then assert that the shard remains STARTED?
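A rough sketch of this suggestion, under the assumption that addTemporaryStateListener takes a ClusterService and a ClusterState predicate and returns a listener that can be awaited with safeAwait (the index name "test-idx" is a placeholder):

final int maxRetries = MaxRetryAllocationDecider.SETTING_ALLOCATION_MAX_RETRY.get(Settings.EMPTY);
final var clusterService = internalCluster().getCurrentMasterNodeInstance(ClusterService.class);
// wait until the relocation failure counter on the primary reaches the configured maximum
safeAwait(
    ClusterServiceUtils.addTemporaryStateListener(
        clusterService,
        state -> state.routingTable()
            .index("test-idx")
            .shard(0)
            .primaryShard()
            .relocationFailureInfo()
            .failedRelocations() >= maxRetries
    )
);
// then assert that the shard has stayed STARTED rather than attempting yet another relocation
var shard = clusterAdmin().prepareState().get().getState().routingTable().index("test-idx").shard(0).primaryShard();
assertThat(shard.state(), equalTo(ShardRoutingState.STARTED));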

Contributor Author

Will update to rely on the default rather than a hardcoded constant.

that the relocation counter reaches this value

Do you mean the relocation failure counter (the one already asserted on L152), or something else?

and then assert that the shard remains STARTED?

I guess this part is tricky. I can add another assert at the end, but theoretically there can always be a scheduled task that issues another cluster state change right after we perform the last assertion. This cannot even be excluded by waiting for languid tasks, as long as we are not using a deterministic task queue here.

Contributor

Do you mean the relocation failure counter (the one already asserted on L152)

Yeah, but wait only for that to reach 5, not for the rest of the conditions in the assertBusy().

I can add another assert at the end, but theoretically there can always be a scheduled task that issues another cluster state change right after we perform the last assertion.

Sure, but it shouldn't be relocating that shard once it has reached 5 failures, should it?

Contributor Author

Sure, but it shouldn't be relocating that shard once it has reached 5 failures, should it?

That is correct, assuming we know the implementation details.
With a black-box approach (or inventive ways to break the code without breaking the test), this is not completely reliable.

Contributor

Maybe I'm missing something, but I think the whole point of SETTING_ALLOCATION_MAX_RETRY is to limit this value to 5, blocking further recovery attempts once it reaches the limit.
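For reference, a small snippet showing how that limit can be read in a test rather than hardcoded (the setting behind SETTING_ALLOCATION_MAX_RETRY is index.allocation.max_retries, default 5; once the failure counter reaches it, MaxRetryAllocationDecider blocks further attempts until retries are explicitly reset, e.g. via a reroute with retry_failed):

// reads the default value (5) of index.allocation.max_retries
int maxRetries = MaxRetryAllocationDecider.SETTING_ALLOCATION_MAX_RETRY.get(Settings.EMPTY);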

Contributor Author

Correct; as long as we can rely on it, this test is valid.

Contributor

@mhl-b mhl-b left a comment

LGTM

@idegtiarenko idegtiarenko merged commit 35efffd into elastic:main Jul 5, 2024
15 checks passed
@idegtiarenko idegtiarenko deleted the fix_testRelocationFailureNotRetriedForever branch July 5, 2024 08:52
Labels
:Distributed/Allocation All issues relating to the decision making around placing a shard (both master logic & on the nodes) Team:Distributed Meta label for distributed team >test Issues or PRs that are addressing/adding tests v8.16.0
Development

Successfully merging this pull request may close these issues.

[CI] IndicesLifecycleListenerIT testRelocationFailureNotRetriedForever failing
4 participants