Fix testRelocationFailureNotRetriedForever #109855
Conversation
Pinging @elastic/es-distributed (Team:Distributed)
assertThat(shard, notNullValue());
assertThat(shard.state(), equalTo(ShardRoutingState.STARTED));
assertThat(state.nodes().get(shard.currentNodeId()).getName(), equalTo(node1));
assertThat(shard.relocationFailureInfo().failedRelocations(), equalTo(5)); // see SETTING_ALLOCATION_MAX_RETRY
Added this assertion to ensure we exhausted all attempts before exiting
Why hardcode 5 instead of using the actual max-retry value?
I looked at this test before and I'm still trying to understand how these assertions prove that we don't retry forever, as the test name indicates. Since ensureGreen is not reliable, as you mentioned in the description, reaching maxRetries and still being STARTED on the current node does not necessarily mean we will not retry in the future. I guess this assumption holds as long as we don't reset retries and MaxRetryAllocationDecider is still there, but that is implicit in this test.
++ Let's say SETTING_ALLOCATION_MAX_RETRY.get(Settings.EMPTY) rather than the literal 5. Also, can we assertBusy() that the relocation counter reaches this value (or, even better, use org.elasticsearch.test.ClusterServiceUtils#addTemporaryStateListener to await that cluster state) and then assert that the shard remains STARTED?
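The pattern being suggested here, poll until the failure counter reaches the limit and only then make a single final assertion on the shard state, can be sketched with a generic busy-assert helper. This is a self-contained illustration, not Elasticsearch's actual ESTestCase#assertBusy; the AssertBusySketch class, the background relocator thread, and the AtomicInteger counter are all hypothetical stand-ins for the shard's relocation failure count.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AssertBusySketch {
    // Poll an assertion until it passes or a timeout elapses.
    // A minimal stand-in for the shape of ESTestCase#assertBusy,
    // not Elasticsearch's actual implementation.
    static void assertBusy(Runnable assertion, long timeoutMillis) throws InterruptedException {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        AssertionError last = null;
        while (true) {
            try {
                assertion.run();
                return; // condition reached
            } catch (AssertionError e) {
                last = e;
            }
            if (System.currentTimeMillis() >= deadline) {
                throw last != null ? last : new AssertionError("timed out");
            }
            Thread.sleep(10); // back off briefly before retrying
        }
    }

    public static void main(String[] args) throws InterruptedException {
        final int maxRetries = 5; // hypothetical cap, mirroring the default of 5
        final AtomicInteger failedRelocations = new AtomicInteger();

        // Simulate relocation failures accumulating in the background.
        Thread relocator = new Thread(() -> {
            while (failedRelocations.get() < maxRetries) {
                failedRelocations.incrementAndGet();
                try {
                    Thread.sleep(5);
                } catch (InterruptedException ignored) {
                    return;
                }
            }
        });
        relocator.start();

        // Step 1: wait only for the failure counter to reach the limit ...
        assertBusy(() -> {
            if (failedRelocations.get() != maxRetries) {
                throw new AssertionError("still retrying: " + failedRelocations.get());
            }
        }, 5_000L);
        relocator.join();

        // Step 2: ... then make the final, single-shot assertion on the steady state.
        System.out.println("failedRelocations=" + failedRelocations.get());
    }
}
```

The key design point is in the two steps: the busy loop waits only for the counter, and the state check afterwards is a plain one-shot assertion, matching the reviewer's "wait only for that to reach 5" remark later in the thread.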
Will update to rely on the default rather than a constant.

> that the relocation counter reaches this value

Do you mean the relocation failure counter (the one already asserted on L152), or something else?

> and then assert that the shard remains STARTED?

I guess this part is tricky. I can add another assert at the end, but theoretically there can always be a scheduled task that issues another cluster state change right after we perform the last assertion. This cannot be excluded even by waiting for languid tasks, as long as we are not using a deterministic queue here.
> Do you mean the relocation failure counter (the one already asserted on L152)

Yeah, but wait only for that to reach 5, not for the rest of the conditions in the assertBusy().

> I can add another assert at the end, but theoretically there can always be a scheduled task that issues another cluster state change right after we perform the last assertion.

Sure, but it shouldn't be relocating that shard once it's reached 5 failures, should it?
> Sure, but it shouldn't be relocating that shard once it's reached 5 failures, should it?

That is correct, assuming we know the implementation details. Assuming a black-box approach (or inventive ways to break the code without breaking the test), this is not completely reliable.
Maybe I'm missing something, but I think the whole point of SETTING_ALLOCATION_MAX_RETRY is to limit this value to 5, blocking further recovery attempts once it reaches the limit.
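The limiting behaviour described above can be illustrated with a minimal decider: once the recorded failure count reaches the configured maximum, further allocation attempts are vetoed. This is a simplified sketch of the idea behind MaxRetryAllocationDecider, not its actual implementation; the class name, the Decision enum, and the canAllocate signature are all made up for illustration.

```java
public class MaxRetrySketch {
    enum Decision { YES, NO }

    // Veto further allocation attempts once the failure count reaches the cap,
    // a simplified sketch of the idea behind MaxRetryAllocationDecider.
    static Decision canAllocate(int failedRelocations, int maxRetries) {
        return failedRelocations >= maxRetries ? Decision.NO : Decision.YES;
    }

    public static void main(String[] args) {
        int maxRetries = 5; // the default cap discussed in this thread
        for (int failures = 0; failures <= 6; failures++) {
            System.out.println(failures + " failures -> " + canAllocate(failures, maxRetries));
        }
    }
}
```

Under this sketch, attempts 0 through 4 are allowed and everything from the fifth failure onwards is blocked, which is exactly why asserting failedRelocations == 5 plus STARTED is taken as evidence that retries have stopped.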
Correct; as long as we can rely on it, this test is valid.
LGTM
The test failure happens in the following scenario:
We can also exit before all retry attempts are exhausted without even noticing, as the error count is not asserted:
Until we have functionality to await the absence of ongoing computation, the best we can do is assertBusy that the computation is complete.
Closes: #108951