Update checkpoints after post-replication actions, even on failure #109908

Merged: 12 commits into elastic:main on Jun 27, 2024

Conversation


@fcofdez fcofdez commented Jun 19, 2024

A failed post-write refresh should not prevent advancing the local checkpoint if the translog operations have been fsynced correctly, so we should update the checkpoints in all situations. On the other hand, if the fsync failed, the local checkpoint won't advance anyway and the engine will fail during the next indexing operation.
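In code terms, the fix boils down to updating the checkpoints before marking the operation as failed in the post-replication onFailure handler. The sketch below is a paraphrase (the updateCheckPoints call and its arguments are illustrative; the real hunk appears later in the review):

@Override
public void onFailure(Exception e) {
    logger.trace("[{}] op [{}] post replication actions failed for [{}]", primary.routingEntry().shardId(), opType, request);
    // Update the local/global checkpoints even though a post-write action (e.g. the refresh) failed:
    // if the translog fsync succeeded, the operations are durable and the checkpoints may safely advance;
    // if the fsync failed, the local checkpoint won't advance anyway and the engine fails on the next
    // indexing operation.
    updateCheckPoints(primary.routingEntry(), primary::localCheckpoint, primary::globalCheckpoint); // illustrative call
    finishAsFailed(e);
}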

@fcofdez fcofdez added the >bug, :Distributed/CRUD (a catch-all label for issues around indexing, updating and getting a doc by id; not search), Team:Distributed (meta label for the distributed team), and v8.15.0 labels on Jun 19, 2024
@elasticsearchmachine

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine

Hi @fcofdez, I've created a changelog YAML for you.

@henningandersen henningandersen left a comment

I have a question.

Also, I wonder if we'd need a test or if you can point me to a test that verifies the fsync failing behavior?

Comment on lines 190 to 191
// TODO: fail shard? This will otherwise have the local / global checkpoint info lagging, or possibly have replicas
// go out of sync with the primary
Contributor

I wonder about the last part of this TODO. I see how this improves things, and it seems OK to me. But should we leave the TODO here for the "replicas go out of sync" question? It is not entirely clear to me what is meant by that, do you know?

Contributor Author

I'm not entirely sure what they meant by this TODO; my understanding is that we already fail the shard if the refresh fails or after an fsync fails. I'll leave the comment as is, just in case.


fcofdez commented Jun 19, 2024

Also, I wonder if we'd need a test or if you can point me to a test that verifies the fsync failing behavior?

I'll write a test for that scenario.


fcofdez commented Jun 24, 2024

@henningandersen could you take a look at this when you have a chance? Thanks!

@henningandersen henningandersen left a comment

A couple of questions/comments.

final LongSupplier globalCheckpointSupplier,
final LongSupplier primaryTermSupplier,
final LongConsumer persistedSequenceNumberConsumer,
ChannelFactory channelFactory
Contributor

I think this is only added to inject failures on fsync. But we already have PathUtilsForTesting.installMock; could that be used instead?

Contributor Author

TIL, I've changed it to use that instead 👍
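For reference, here is a minimal sketch of how fsync failures can be injected through PathUtilsForTesting.installMock, assuming Lucene's mockfile test helpers (FilterFileSystemProvider / FilterFileChannel) are on the test classpath; the class name and the exact wiring are illustrative and may differ from the test added in this PR.

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.FileSystem;
import java.nio.file.OpenOption;
import java.nio.file.Path;
import java.nio.file.attribute.FileAttribute;
import java.util.Set;
import java.util.concurrent.atomic.AtomicBoolean;

import org.apache.lucene.tests.mockfile.FilterFileChannel;
import org.apache.lucene.tests.mockfile.FilterFileSystemProvider;

// Wraps the default filesystem and lets a test toggle fsync failures for translog (.tlog) files.
public class FsyncFailingFileSystemProvider extends FilterFileSystemProvider {

    private final AtomicBoolean failFsyncs = new AtomicBoolean();

    public FsyncFailingFileSystemProvider(FileSystem delegate) {
        super("fsyncfailing://", delegate);
    }

    public void failFsyncs(boolean fail) {
        failFsyncs.set(fail);
    }

    @Override
    public FileChannel newFileChannel(Path path, Set<? extends OpenOption> options, FileAttribute<?>... attrs) throws IOException {
        return new FilterFileChannel(super.newFileChannel(path, options, attrs)) {
            @Override
            public void force(boolean metaData) throws IOException {
                // Fail only fsyncs of translog generation files while the flag is set.
                if (failFsyncs.get() && path.getFileName().toString().endsWith(".tlog")) {
                    throw new IOException("simulated fsync failure");
                }
                super.force(metaData);
            }
        };
    }
}

A test would install it once for the JVM, e.g. PathUtilsForTesting.installMock(provider.getFileSystem(null)) in a @BeforeClass hook and PathUtilsForTesting.teardown() afterwards (again assuming the ES test-framework helpers by those names).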

ensureGreen(indexName);

var bulkResponse2 = client().prepareBulk().add(prepareIndex(indexName).setId("2").setSource("key", "bar", "val", 20)).get();
assertFalse(bulkResponse2.hasFailures());
Contributor

Would we not expect this to also fail due to the tragic failure above?

Contributor Author

The TransportReplicationAction will retry on AlreadyClosedException.

Contributor

Oh, can we add a comment about that to the test? It seems the test concerns the engine level, so this retry is not obvious when reading it.

@@ -189,7 +189,10 @@ public void onFailure(Exception e) {
logger.trace("[{}] op [{}] post replication actions failed for [{}]", primary.routingEntry().shardId(), opType, request);
// TODO: fail shard? This will otherwise have the local / global checkpoint info lagging, or possibly have replicas
// go out of sync with the primary
finishAsFailed(e);
// We update the checkpoints since a refresh might fail but the operations could be safely persisted, in the case that the
// fsync failed the local checkpoint won't advance and the engine will be marked as failed when the next indexing operation
Contributor

The test seems to say that the engine just survives, which confuses me a bit; perhaps you can clarify?

@fcofdez fcofdez commented Jun 27, 2024

Technically, the engine survives until the next operation comes in and needs to be appended to the translog, since the post-write operation just closes the TranslogWriter and that doesn't bubble up to the Engine. That's why I added an extra indexing operation to the test: to verify that the Engine does indeed fail after that operation.
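To make that concrete, the follow-up in the test has roughly the shape of the snippet quoted above, annotated here with the comment henningandersen asked for (approximate, not the exact test code):

// The fsync failure above closed the TranslogWriter but did not fail the Engine; this next
// indexing operation has to append to the translog, hits the closed writer, and fails the engine.
// TransportReplicationAction then retries on AlreadyClosedException, the shard recovers, and the
// retried bulk ultimately succeeds.
ensureGreen(indexName);

var bulkResponse2 = client().prepareBulk().add(prepareIndex(indexName).setId("2").setSource("key", "bar", "val", 20)).get();
assertFalse(bulkResponse2.hasFailures());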

@henningandersen henningandersen left a comment

LGTM.


@fcofdez fcofdez merged commit ca2ea69 into elastic:main Jun 27, 2024
15 checks passed