[Bug] During heavy indexing load it's possible for lazy rollover to trigger multiple rollovers #109636
Conversation
Hi @gmarouli, I've created a changelog YAML for you.
Pinging @elastic/es-data-management (Team:Data Management)
@elasticmachine update branch
```java
// We use high priority to not block writes for too long
this.lazyRolloverTaskQueue = clusterService.createTaskQueue(
    "lazy-rollover",
    Priority.HIGH,
```
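The snippet above creates the dedicated task queue. As a rough, Elasticsearch-independent sketch of the enqueue-side deduplication this enables (all names here are hypothetical, not the PR's actual API):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (not the PR's actual code): ensure that many bulk
// requests racing on the same data stream enqueue at most one lazy
// rollover task at a time.
class LazyRolloverDeduplicator {
    private final Set<String> pending = ConcurrentHashMap.newKeySet();

    // Returns true only for the first caller; concurrent callers see the
    // marker already present and skip enqueueing a duplicate task.
    boolean maybeEnqueueRollover(String dataStream) {
        return pending.add(dataStream);
    }

    // Called when the rollover task finishes, so a later mapping update
    // can trigger a fresh lazy rollover.
    void onRolloverComplete(String dataStream) {
        pending.remove(dataStream);
    }
}
```

With this shape, five concurrent bulk requests for `my-metrics` would enqueue a single rollover task instead of five.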
I'm not sure about the `HIGH` priority for this; I think it'd be good to have someone on the distributed side weigh in about potential side-effects.
Discussed with the distributed team: we do not have enough evidence that this was an issue, and it is quite a drastic change, so they recommended against it. I have reverted this.
Co-authored-by: Lee Hinman <[email protected]>
@elasticmachine update branch
I think this LGTM. I'd like to give it a second round of thought to ensure I'm not missing anything. Added some comments in the meantime.
Update: Ignore this, it was not applicable and it has been reverted.

@nielsbauman & @dakrone I realised today that we need a node feature since we are introducing a new task. Initially, I thought it wasn't necessary because this is a master node action, but if there is a master failover from a newer node to an older node, we might end up with an older node needing to deal with the lazy rollover task, which is effectively an unknown task to it. The latest commit fixes this. Would you mind taking one more look at this change as well? Do you agree with this approach?
@elasticmachine update branch
One more approval for the sake of completeness :)
@elasticmachine update branch
@elasticmachine update branch
@elasticmachine update branch
@elasticmachine update branch
💔 Backport failed
You can use sqren/backport to manually backport by running
…rigger multiple rollovers (elastic#109636)

Let's say we have a `my-metrics` data stream which is receiving a lot of indexing requests. The following scenario can result in multiple unnecessary rollovers:

1. We update the mapping and mark the data stream to be lazily rolled over.
2. We receive 5 bulk index requests that all contain a write request for this data stream.
3. Each of these requests is picked up "at the same time"; each sees that the data stream needs to be rolled over and issues a lazy rollover request.
4. Now the `my-metrics` data stream has 5 tasks executing an unconditional rollover.
5. The data stream gets rolled over 5 times instead of once.

This scenario is captured in `LazyRolloverDuringDisruptionIT`. We have also witnessed this in the wild, where a data stream was rolled over 281 extra times, resulting in 281 empty indices.

This PR proposes:

- Creating a new task queue with a more efficient executor that further batches/deduplicates the requests.
- Adding two safeguards: the first ensures we will not enqueue the rollover task if we see that a rollover has already occurred; the second, during task execution, skips the rollover if the data stream does not have the `rolloverOnWrite` flag set to `true`.
- When we skip the rollover, we return the following response:

```
{
  "acknowledged": true,
  "shards_acknowledged": true,
  "old_index": ".ds-my-data-stream-2099.05.07-000002",
  "new_index": ".ds-my-data-stream-2099.05.07-000002",
  "rolled_over": false,
  "dry_run": false,
  "lazy": false
}
```
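The executor-side safeguard described above can be modeled as follows. This is a simplified sketch, not Elasticsearch's actual `ClusterStateTaskExecutor` API; the types and method names are hypothetical:

```java
import java.util.List;

// Hypothetical model of the executor-side safeguard: however many lazy
// rollover tasks were batched for one data stream, at most one rollover
// happens, and none happens if rolloverOnWrite is already cleared.
class LazyRolloverBatchExecutor {

    record DataStreamState(boolean rolloverOnWrite, int generation) {}

    // Processes a whole batch in one cluster-state update: roll over once
    // if the flag is still set, then clear it so the remaining tasks in
    // this batch (and any later duplicates) become no-ops.
    static DataStreamState execute(DataStreamState state, List<String> batchedTasks) {
        if (!batchedTasks.isEmpty() && state.rolloverOnWrite()) {
            return new DataStreamState(false, state.generation() + 1);
        }
        return state; // skipped: response would report "rolled_over": false
    }
}
```

Under this model, a batch of 5 queued tasks advances the generation exactly once, and a task that arrives after the flag was cleared leaves the state untouched, matching the `"rolled_over": false` response shown above.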
💚 All backports created successfully
Questions? Please refer to the Backport tool documentation
…rigger multiple rollovers (#109636) (#110031)