Update checkpoints after post-replication actions, even on failure #109908
Conversation
A failed post-write refresh should not prevent advancing the local checkpoint if the translog operations have been fsynced correctly, so we should update the checkpoints in all situations. On the other hand, if the fsync failed, the local checkpoint won't advance anyway and the engine will fail during the next indexing operation.
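A minimal sketch of the intended failure handling, for illustration (the updateCheckPoints callback shape is an assumption made for this sketch; only finishAsFailed appears in the diff further down):

@Override
public void onFailure(Exception e) {
    logger.trace("[{}] op [{}] post replication actions failed for [{}]",
        primary.routingEntry().shardId(), opType, request);
    // Advance the persisted local/global checkpoints first: a failed refresh does not
    // imply the translog fsync failed, so the persisted operations can still be tracked.
    // (updateCheckPoints is a hypothetical helper used only in this sketch.)
    updateCheckPoints(primary.routingEntry(), primary::localCheckpoint, primary::globalCheckpoint,
        () -> finishAsFailed(e)); // still surface the refresh failure to the caller
}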
Pinging @elastic/es-distributed (Team:Distributed)
Hi @fcofdez, I've created a changelog YAML for you.
I have a question.
Also, I wonder whether we need a test here, or can you point me to an existing test that verifies the fsync-failure behavior?
// TODO: fail shard? This will otherwise have the local / global checkpoint info lagging, or possibly have replicas
// go out of sync with the primary
I wonder about the last part of this TODO. I see how this improves things and it seems ok to me. But should we leave the TODO here for the "replicas go out of sync" question? It is not entirely clear to me what is meant by that, do you know?
I'm not entirely sure what they meant with this TODO; my understanding is that we already fail the shard if the refresh fails or after an fsync fails. I'll leave the comment as is just in case.
I'll write a test for that scenario
…elasticsearch into advance-checkpoints-after-write
@henningandersen could you take a look into this when you have a chance? thanks!
A couple questions/comments.
final LongSupplier globalCheckpointSupplier,
final LongSupplier primaryTermSupplier,
final LongConsumer persistedSequenceNumberConsumer,
ChannelFactory channelFactory
I think this is only added to inject failures on fsync. But we already have PathUtilsForTesting.installMock, could that be used instead?
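For illustration, a hedged sketch of how fsync failures could be injected through PathUtilsForTesting.installMock with Lucene's mockfile test utilities, inside a test method (the FilterFileSystemProvider/FilterFileChannel wiring is an assumption, not code from this PR, and package locations may vary by version):

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.attribute.FileAttribute;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;
import org.apache.lucene.tests.mockfile.FilterFileChannel;
import org.apache.lucene.tests.mockfile.FilterFileSystemProvider;
import org.elasticsearch.common.io.PathUtilsForTesting;
import org.elasticsearch.core.PathUtils;

AtomicBoolean failFsync = new AtomicBoolean(false);
FilterFileSystemProvider provider = new FilterFileSystemProvider("failfsync://", PathUtils.getDefaultFileSystem()) {
    @Override
    public FileChannel newFileChannel(Path path, Set<? extends OpenOption> options, FileAttribute<?>... attrs)
        throws IOException {
        return new FilterFileChannel(super.newFileChannel(path, options, attrs)) {
            @Override
            public void force(boolean metaData) throws IOException {
                if (failFsync.get()) {
                    throw new IOException("simulated fsync failure"); // the injected tragic event
                }
                super.force(metaData);
            }
        };
    }
};
PathUtilsForTesting.installMock(provider.getFileSystem(null)); // route PathUtils through the wrapper
try {
    failFsync.set(true);
    // ... index a document so the post-write fsync hits the injected failure ...
} finally {
    PathUtilsForTesting.teardown(); // restore the default filesystem
}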
TIL, I've changed it to use that instead 👍
ensureGreen(indexName);

var bulkResponse2 = client().prepareBulk().add(prepareIndex(indexName).setId("2").setSource("key", "bar", "val", 20)).get();
assertFalse(bulkResponse2.hasFailures());
Would we not expect this to also fail due to the tragic failure above?
The TransportReplicationAction will retry on AlreadyClosedException:
elasticsearch/server/src/main/java/org/elasticsearch/action/support/TransportActions.java, line 28 in 12ad399:
|| actual instanceof AlreadyClosedException);
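For context, that line is part of TransportActions#isShardNotAvailableException, which looks roughly like this (reconstructed from memory, so the exact exception list may differ by version):

public static boolean isShardNotAvailableException(final Throwable e) {
    final Throwable actual = ExceptionsHelper.unwrapCause(e);
    // replication actions treat all of these as retryable "shard not available" failures
    return (actual instanceof ShardNotFoundException
        || actual instanceof IndexNotFoundException
        || actual instanceof IllegalIndexShardStateException
        || actual instanceof NoShardAvailableActionException
        || actual instanceof UnavailableShardsException
        || actual instanceof AlreadyClosedException);
}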
oh, can we add a comment about that to the test? Seems the test concerns the engine level, so this retry is not obvious when reading it.
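As a sketch, the requested comment could read something like this (wording assumed, not taken from the PR):

// Note: the tragic translog failure above fails the shard's engine, but this bulk
// still succeeds because TransportReplicationAction retries operations that fail
// with AlreadyClosedException (see TransportActions#isShardNotAvailableException)
// once the shard becomes available again.
var bulkResponse2 = client().prepareBulk().add(prepareIndex(indexName).setId("2").setSource("key", "bar", "val", 20)).get();
assertFalse(bulkResponse2.hasFailures());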
@@ -189,7 +189,10 @@ public void onFailure(Exception e) {
    logger.trace("[{}] op [{}] post replication actions failed for [{}]", primary.routingEntry().shardId(), opType, request);
    // TODO: fail shard? This will otherwise have the local / global checkpoint info lagging, or possibly have replicas
    // go out of sync with the primary
    finishAsFailed(e);
    // We update the checkpoints since a refresh might fail but the operations could be safely persisted, in the case that the
    // fsync failed the local checkpoint won't advance and the engine will be marked as failed when the next indexing operation
The test seems to say that the engine just survives, which confuses me a bit, perhaps you can clarify?
Technically the engine survives until the next operation comes in and needs to be appended to the translog, since the post-write operation just closes the TranslogWriter and that doesn't bubble up to the Engine. That's why I added an extra indexing operation in the test: to see that the Engine indeed fails after that operation.
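A simplified illustration of that sequence (schematic, not the PR's code; Translog#add and maybeFailEngine are the engine internals involved):

// 1. The post-write fsync fails tragically: the TranslogWriter is closed,
//    but nothing bubbles up to the Engine at this point.
// 2. The next indexing operation tries to append to the closed writer:
try {
    translog.add(indexOp); // throws AlreadyClosedException (writer was closed tragically)
} catch (Exception e) {
    maybeFailEngine("index", e); // only now is the engine actually marked as failed
    throw e;
}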
LGTM.