
[Bug] During heavy indexing load it's possible for lazy rollover to trigger multiple rollovers #109636

Merged

Conversation

@gmarouli (Contributor) commented Jun 12, 2024

Let’s say we have a `my-metrics` data stream which is receiving a lot of indexing requests. The following scenario can result in multiple unnecessary rollovers:

  1. We update the mapping and mark the data stream to be lazily rolled over.
  2. We receive 5 bulk index requests that all contain a write request for this data stream.
  3. Each of these requests is picked up “at the same time”; each sees that the data stream needs to be rolled over and issues a lazy rollover request.
  4. Now data stream `my-metrics` has 5 tasks executing an unconditional rollover.
  5. The data stream gets rolled over 5 times instead of once.

This scenario is captured in `LazyRolloverDuringDisruptionIT`.

We have also witnessed this in the wild, where a data stream was rolled over 281 extra times, resulting in 281 empty indices.

This PR proposes:

  • Creating a new task queue with a more efficient executor that further batches/deduplicates the requests (a sketch of the idea follows the example response below).
  • Adding two safeguards: the first ensures we do not enqueue the rollover task if we see that a rollover has already occurred; the second applies during task execution, where we skip the rollover if the data stream does not have the `rolloverOnWrite` flag set to `true`.
  • When we skip the rollover, we return the following response:
```
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "old_index": ".ds-my-data-stream-2099.05.07-000002",
  "new_index": ".ds-my-data-stream-2099.05.07-000002",
  "rolled_over": false,
  "dry_run": false,
  "lazy": false
}
```
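For illustration, here is a minimal, self-contained toy sketch of the idea behind the batched executor and the two safeguards. The class and method names are made up and this is not the actual Elasticsearch implementation: the flag stands in for `rolloverOnWrite`, the pending list stands in for the lazy-rollover task queue, and `executeBatch()` stands in for the batched cluster-state executor.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.atomic.AtomicBoolean;

// Toy model (not Elasticsearch code) of the deduplicating executor plus the two safeguards.
class LazyRolloverSketch {

    private final AtomicBoolean rolloverOnWrite = new AtomicBoolean(true);
    private final List<Runnable> pendingTasks = new CopyOnWriteArrayList<>();
    private int generation = 2; // e.g. .ds-my-data-stream-2099.05.07-000002

    // Safeguard 1: do not even enqueue a task if the flag was already cleared.
    void submitLazyRollover() {
        if (rolloverOnWrite.get() == false) {
            return; // a rollover already happened, nothing to enqueue
        }
        pendingTasks.add(this::rolloverIfStillNeeded);
    }

    // The batched executor drains every queued task in a single pass.
    void executeBatch() {
        pendingTasks.forEach(Runnable::run);
        pendingTasks.clear();
    }

    // Safeguard 2: at execution time, re-check the flag; only the first task rolls over,
    // the rest respond with "rolled_over": false as in the response above.
    private void rolloverIfStillNeeded() {
        if (rolloverOnWrite.compareAndSet(true, false)) {
            generation++; // create exactly one new backing index
        }
    }

    public static void main(String[] args) {
        LazyRolloverSketch sketch = new LazyRolloverSketch();
        for (int i = 0; i < 5; i++) {
            sketch.submitLazyRollover(); // five concurrent bulk requests all ask for a rollover
        }
        sketch.executeBatch();
        System.out.println("generation after batch: " + sketch.generation); // 3, not 7
    }
}
```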

@elasticsearchmachine (Collaborator) commented

Hi @gmarouli, I've created a changelog YAML for you.

@gmarouli gmarouli requested a review from dakrone June 13, 2024 09:14
@gmarouli gmarouli marked this pull request as ready for review June 13, 2024 09:26
@elasticsearchmachine added the Team:Data Management label Jun 13, 2024
@elasticsearchmachine (Collaborator) commented

Pinging @elastic/es-data-management (Team:Data Management)

@gmarouli gmarouli added v8.14.2 and removed v8.14.1 labels Jun 13, 2024
@gmarouli (Contributor, Author) commented

@elasticmachine update branch

A review comment was left on the following diff hunk:

```java
// We use high priority to not block writes for too long
this.lazyRolloverTaskQueue = clusterService.createTaskQueue(
    "lazy-rollover",
    Priority.HIGH,
```

Reviewer (Member) commented:

I'm not sure about the HIGH priority for this, I think it'd be good to have someone on the distributed side weigh in about potential side-effects.

@gmarouli (Contributor, Author) replied:

Discussed with distributed; we do not have enough evidence that this was an issue, and this is quite a drastic change, so they recommended not doing it. I have reverted this.
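For readers following the priority discussion, here is a tiny self-contained illustration (plain Java, not Elasticsearch code) of why the choice matters: pending cluster-state tasks are drained in priority order, so a HIGH-priority lazy-rollover task would jump ahead of NORMAL-priority work, and that reordering is the kind of side effect being weighed here.

```java
import java.util.Comparator;
import java.util.concurrent.PriorityBlockingQueue;

// Toy illustration of priority-ordered task processing (not the Elasticsearch master service).
public class PriorityOrderingDemo {

    record PendingTask(int priority, String description) {} // lower number = more urgent

    public static void main(String[] args) {
        PriorityBlockingQueue<PendingTask> queue =
            new PriorityBlockingQueue<>(11, Comparator.comparingInt(PendingTask::priority));
        queue.add(new PendingTask(3, "NORMAL: put-mapping"));
        queue.add(new PendingTask(1, "HIGH: lazy-rollover"));
        queue.add(new PendingTask(3, "NORMAL: create-index"));
        while (queue.isEmpty() == false) {
            System.out.println(queue.poll().description()); // the HIGH task drains first
        }
    }
}
```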

@gmarouli (Contributor, Author) commented

@elasticmachine update branch

@nielsbauman (Contributor) left a comment

I think this LGTM. I'd like to give it a second round of thought to ensure I'm not missing anything. Added some comments in the meantime.

@gmarouli (Contributor, Author) commented Jun 19, 2024

Update:" Ignore this, it was not applicable and it has been reverted.

@nielsbauman & @dakrone I realised today that we need a node feature since we are introducing a new task.

Initially, I thought it wasn't necessary because this is a master node action, but if there is a master failover from an newer node to the older node, we might end up with an older node needing to deal with the lazy rollover task which is effectively an unknown task.

The latest commit is fixing this, would you mind taking one more look on this change as well, also do you agree with this?
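A minimal sketch of the compatibility guard being described, using hypothetical names and a simplified fallback rather than the real Elasticsearch node-feature API: the new lazy-rollover task is only used when every node in the cluster advertises support for it, so a master failover to an older node can never hand that node an unknown task.

```java
import java.util.Set;

// Sketch only: hypothetical feature id and a simplified fallback path.
class LazyRolloverFeatureGuard {

    static final String LAZY_ROLLOVER_TASK_FEATURE = "rollover.lazy_rollover_task"; // hypothetical id

    static void submitRollover(Set<String> featuresSupportedByAllNodes,
                               Runnable newBatchedTask,
                               Runnable legacyRolloverPath) {
        if (featuresSupportedByAllNodes.contains(LAZY_ROLLOVER_TASK_FEATURE)) {
            newBatchedTask.run();     // safe: even after a failover, the master knows this task
        } else {
            legacyRolloverPath.run(); // mixed-version cluster: fall back to the old code path
        }
    }
}
```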

@gmarouli (Contributor, Author) commented

@elasticmachine update branch

@nielsbauman (Contributor) left a comment

One more approval for the sake of completeness :)

@gmarouli added the auto-merge (automatically merge the pull request when CI checks pass; NB: doesn't wait for reviews) and auto-backport (automatically create backport pull requests when merged) labels Jun 20, 2024
@gmarouli (Contributor, Author) commented

@elasticmachine update branch

@gmarouli (Contributor, Author) commented

@elasticmachine update branch

@gmarouli (Contributor, Author) commented

@elasticmachine update branch

@gmarouli (Contributor, Author) commented

@elasticmachine update branch

@elasticsearchmachine elasticsearchmachine merged commit c370d27 into elastic:main Jun 21, 2024
15 checks passed
@gmarouli gmarouli deleted the fix-multiple-lazy-rollovers branch June 21, 2024 07:15
@elasticsearchmachine (Collaborator) commented

💔 Backport failed

Branch 8.14: the commit could not be cherry-picked due to conflicts.

You can use sqren/backport to backport manually by running `backport --upstream elastic/elasticsearch --pr 109636`.

javanna pushed a commit to javanna/elasticsearch that referenced this pull request Jun 21, 2024
…rigger multiple rollovers (elastic#109636)

@gmarouli (Contributor, Author) commented

💚 All backports created successfully

Branch 8.14: backport created successfully.

Questions? Please refer to the Backport tool documentation.

gmarouli added a commit to gmarouli/elasticsearch that referenced this pull request Jun 21, 2024
…rigger multiple rollovers (elastic#109636)

gmarouli added a commit that referenced this pull request Jun 21, 2024
…rigger multiple rollovers (#109636) (#110031)

Labels
auto-backport, auto-merge, backport pending, >bug, :Data Management/Data streams, Team:Data Management, v8.14.2, v8.15.0
6 participants