Update checkpoints after post-replication actions, even on failure #109908

Merged: 12 commits into elastic:main on Jun 27, 2024

Conversation


@fcofdez fcofdez commented Jun 19, 2024

A failed post-write refresh should not prevent advancing the local checkpoint if the translog operations have been fsynced correctly, so we should update the checkpoints in all situations. On the other hand, if the fsync failed, the local checkpoint won't advance anyway and the engine will fail during the next indexing operation.
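In code terms, the fix boils down to updating the checkpoints before marking the operation as failed in the post-replication onFailure handler. The sketch below is a paraphrase (the updateCheckPoints call and its arguments are illustrative; the real hunk appears later in the review):

@Override
public void onFailure(Exception e) {
    logger.trace("[{}] op [{}] post replication actions failed for [{}]", primary.routingEntry().shardId(), opType, request);
    // Update the local/global checkpoints even though a post-write action (e.g. the refresh) failed:
    // if the translog fsync succeeded, the operations are durable and the checkpoints may safely advance;
    // if the fsync failed, the local checkpoint won't advance anyway and the engine fails on the next
    // indexing operation.
    updateCheckPoints(primary.routingEntry(), primary::localCheckpoint, primary::globalCheckpoint); // illustrative call
    finishAsFailed(e);
}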

@fcofdez fcofdez added the >bug, :Distributed/CRUD (a catch-all label for issues around indexing, updating and getting a doc by id; not search), Team:Distributed (meta label for the distributed team), and v8.15.0 labels on Jun 19, 2024
@elasticsearchmachine

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine

Hi @fcofdez, I've created a changelog YAML for you.

@henningandersen henningandersen left a comment

I have a question.

Also, I wonder if we'd need a test or if you can point me to a test that verifies the fsync failing behavior?

Comment on lines 190 to 191
// TODO: fail shard? This will otherwise have the local / global checkpoint info lagging, or possibly have replicas
// go out of sync with the primary
Contributor

I wonder about the last part of this TODO. I see how this improves things, and it seems OK to me. But should we leave the TODO here for the "replicas go out of sync" question? It is not entirely clear to me what is meant by that, do you know?

Contributor Author

I'm not entirely sure what they meant by this TODO; my understanding is that we already fail the shard if the refresh fails or after an fsync fails. I'll leave the comment as is, just in case.


fcofdez commented Jun 19, 2024

Also, I wonder if we'd need a test or if you can point me to a test that verifies the fsync failing behavior?

I'll write a test for that scenario.


fcofdez commented Jun 24, 2024

@henningandersen could you take a look at this when you have a chance? Thanks!

@henningandersen henningandersen left a comment

A couple of questions/comments.

final LongSupplier globalCheckpointSupplier,
final LongSupplier primaryTermSupplier,
final LongConsumer persistedSequenceNumberConsumer,
ChannelFactory channelFactory
Contributor

I think this is only added to inject failures on fsync. But we already have PathUtilsForTesting.installMock; could that be used instead?

Contributor Author

TIL, I've changed it to use that instead 👍
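For reference, here is a minimal sketch of how fsync failures can be injected through PathUtilsForTesting.installMock, assuming Lucene's mockfile test helpers (FilterFileSystemProvider / FilterFileChannel) are on the test classpath; the class name and the exact wiring are illustrative and may differ from the test added in this PR.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.FileSystem;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.attribute.FileAttribute;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.lucene.tests.mockfile.FilterFileChannel;
import org.apache.lucene.tests.mockfile.FilterFileSystemProvider;

// Wraps the default filesystem and lets a test toggle fsync failures for translog (.tlog) files.
public class FsyncFailingFileSystemProvider extends FilterFileSystemProvider {

    private final AtomicBoolean failFsyncs = new AtomicBoolean();

    public FsyncFailingFileSystemProvider(FileSystem delegate) {
        super("fsyncfailing://", delegate);
    }

    public void failFsyncs(boolean fail) {
        failFsyncs.set(fail);
    }

    @Override
    public FileChannel newFileChannel(Path path, Set<? extends OpenOption> options, FileAttribute<?>... attrs) throws IOException {
        return new FilterFileChannel(super.newFileChannel(path, options, attrs)) {
            @Override
            public void force(boolean metaData) throws IOException {
                // Fail only fsyncs of translog generation files while the flag is set.
                if (failFsyncs.get() && path.getFileName().toString().endsWith(".tlog")) {
                    throw new IOException("simulated fsync failure");
                }
                super.force(metaData);
            }
        };
    }
}

A test would install it once for the JVM, e.g. PathUtilsForTesting.installMock(provider.getFileSystem(null)) in a @BeforeClass hook and PathUtilsForTesting.teardown() afterwards (again assuming the ES test-framework helpers by those names).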

ensureGreen(indexName);

var bulkResponse2 = client().prepareBulk().add(prepareIndex(indexName).setId("2").setSource("key", "bar", "val", 20)).get();
assertFalse(bulkResponse2.hasFailures());
Contributor

Would we not expect this to also fail due to the tragic failure above?

Contributor Author

The TransportReplicationAction will retry on AlreadyClosedException.

Contributor

Oh, can we add a comment about that to the test? It seems the test concerns the engine level, so this retry is not obvious when reading it.

@@ -189,7 +189,10 @@ public void onFailure(Exception e) {
logger.trace("[{}] op [{}] post replication actions failed for [{}]", primary.routingEntry().shardId(), opType, request);
// TODO: fail shard? This will otherwise have the local / global checkpoint info lagging, or possibly have replicas
// go out of sync with the primary
finishAsFailed(e);
// We update the checkpoints since a refresh might fail but the operations could be safely persisted, in the case that the
// fsync failed the local checkpoint won't advance and the engine will be marked as failed when the next indexing operation
Contributor

The test seems to say that the engine just survives, which confuses me a bit; perhaps you can clarify?

@fcofdez fcofdez commented Jun 27, 2024

Technically, the engine survives until the next operation comes in and needs to be appended to the translog, since the post-write operation just closes the TranslogWriter and that doesn't bubble up to the Engine. That's why I added an extra indexing operation to the test: to verify that the Engine does indeed fail after that operation.
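To make that concrete, the follow-up in the test has roughly the shape of the snippet quoted above, annotated here with the comment henningandersen asked for (approximate, not the exact test code):

// The fsync failure above closed the TranslogWriter but did not fail the Engine; this next
// indexing operation has to append to the translog, hits the closed writer, and fails the engine.
// TransportReplicationAction then retries on AlreadyClosedException, the shard recovers, and the
// retried bulk ultimately succeeds.
ensureGreen(indexName);

var bulkResponse2 = client().prepareBulk().add(prepareIndex(indexName).setId("2").setSource("key", "bar", "val", 20)).get();
assertFalse(bulkResponse2.hasFailures());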

@henningandersen henningandersen left a comment

LGTM.


@fcofdez fcofdez merged commit ca2ea69 into elastic:main Jun 27, 2024
15 checks passed