
raftstore: optimize the batching mechanism when enabling async-ios. #17029

Open · wants to merge 22 commits into base: master
Conversation

@LykxSassinator (Contributor) commented May 17, 2024

What is changed and how it works?

Issue Number: Ref #16907

What's Changed:

Previously, #16906 had already lowered the compression threshold to alleviate the strain on IO resources. However, it was found that this configuration adjustment had limited impact on reducing IO resource costs.

Therefore, this PR introduces a simple algorithm (using a sliding window) to track the real-time size of the WriteBatch and initiate a yield operation if the WriteBatch size is under the desired threshold. This mechanism aims to grow the WriteBatch size as needed, a strategy that has proven effective in saving IO resources.

Introduce a simple algorithm that enlarges the WriteBatch produced by the batching mechanism as expected, reducing the cost of IO resources.
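The following is a minimal, self-contained sketch of this idea, not the actual TiKV implementation: the `SampleWindow` and `recommend_yield` names, the trend formula, and all constants are hypothetical illustrations of a sliding window that tracks recent WriteBatch sizes and suggests a yield duration when the current batch is below the threshold.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Fixed-capacity sliding window over recent WriteBatch sizes
/// (hypothetical simplification of the mechanism described above).
struct SampleWindow {
    samples: VecDeque<usize>,
    capacity: usize,
}

impl SampleWindow {
    fn new(capacity: usize) -> Self {
        Self { samples: VecDeque::with_capacity(capacity), capacity }
    }

    fn observe(&mut self, batch_size: usize) {
        if self.samples.len() == self.capacity {
            self.samples.pop_front();
        }
        self.samples.push_back(batch_size);
    }

    fn avg(&self) -> f64 {
        if self.samples.is_empty() {
            return 0.0;
        }
        self.samples.iter().sum::<usize>() as f64 / self.samples.len() as f64
    }
}

/// Decide whether the write worker should yield to wait for more writes.
/// Returns `Some(duration)` if the current batch is still under the desired
/// threshold, `None` if it is already large enough to flush now.
fn recommend_yield(
    window: &SampleWindow,
    current_batch_size: usize,
    threshold: usize,
    base_yield: Duration,
) -> Option<Duration> {
    if current_batch_size >= threshold {
        return None;
    }
    // Illustrative trend: recent average relative to the current batch,
    // clamped to [0.5, 2.0]; the exact formula in the PR may differ.
    let avg = window.avg().max(1.0);
    let trend = (avg / current_batch_size.max(1) as f64).clamp(0.5, 2.0);
    Some(base_yield.mul_f64(trend))
}

fn main() {
    let mut window = SampleWindow::new(8);
    for size in [4096_usize, 8192, 16384] {
        window.observe(size);
    }
    // A small batch below the threshold gets a recommended yield duration.
    let hint = recommend_yield(&window, 2048, 32 * 1024, Duration::from_nanos(50));
    println!("yield hint: {:?}", hint);
}
```

In the actual change, the trend is derived from how the observed batch size changes over time and is clamped to [0.5, 2.0], as discussed in the review thread below.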

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

By testing on oltp-read-write workload:

  • It greatly reduces the IOPS cost, by up to 30%. The corresponding bandwidth cost drops from 260 MB/s to ~170 MB/s, which significantly reduces IO resource costs in Cloud environments.
  • The performance degradation introduced by the new batching mechanism is low, < 1% as expected.


Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

Introduce a simple algorithm that enlarges the WriteBatch produced by the batching mechanism as expected, saving IO resource costs.

ti-chi-bot bot (Contributor) commented May 17, 2024

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

ti-chi-bot bot (Contributor) commented May 17, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

ti-chi-bot bot (Contributor) commented Jun 11, 2024

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from lykxsassinator, ensuring that each of them provides their approval before proceeding. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added size/M and removed size/S labels Jun 11, 2024
@ti-chi-bot ti-chi-bot bot added size/L and removed size/M labels Jun 12, 2024
@LykxSassinator LykxSassinator marked this pull request as ready for review June 19, 2024 02:50
@LykxSassinator (Contributor, Author)

/test

ti-chi-bot bot (Contributor) commented Jun 19, 2024

@LykxSassinator: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

  • /test pull-unit-test

Use /test all to run all jobs.

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@LykxSassinator (Contributor, Author)

/test pull-unit-test

Signed-off-by: lucasliang <[email protected]>
@glorv (Contributor) commented Jun 20, 2024

An intuitive idea: what about moving the write task batching to the raft fsm side? Instead of sending a write message after handling each region, we could first cache each region's write message in the PollCtx, and after handling several regions, once the total write message size exceeds a certain threshold or the current poll round is over, send the cached write tasks to the write thread. It seems easier to handle this batching logic on the raft fsm side than on the write IO thread side.
/cc @Connor1996 @overvenus @zhangjinpeng87 What do you think?
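For illustration only, a rough sketch of what that proposal could look like; `WriteTask`, `PollCtx`, and the flush threshold below are hypothetical simplifications, not TiKV's actual types or API.

```rust
/// Hypothetical write task produced after handling one region.
struct WriteTask {
    region_id: u64,
    data: Vec<u8>,
}

/// Hypothetical per-poll context that caches write tasks instead of
/// sending one write message per region.
struct PollCtx {
    pending_writes: Vec<WriteTask>,
    pending_bytes: usize,
    flush_threshold: usize,
}

impl PollCtx {
    fn cache_write(&mut self, task: WriteTask) {
        self.pending_bytes += task.data.len();
        self.pending_writes.push(task);
        // Flush early if the cached writes are already large enough.
        if self.pending_bytes >= self.flush_threshold {
            self.flush_to_write_thread();
        }
    }

    /// Called when the threshold is exceeded or the poll round ends.
    fn flush_to_write_thread(&mut self) {
        if self.pending_writes.is_empty() {
            return;
        }
        // In a real implementation this batch would be sent to the
        // async-io write thread; here we just drain the cache.
        let batch = std::mem::take(&mut self.pending_writes);
        self.pending_bytes = 0;
        let regions: Vec<u64> = batch.iter().map(|t| t.region_id).collect();
        println!("flushing {} cached write tasks for regions {:?}", batch.len(), regions);
    }
}

fn main() {
    let mut ctx = PollCtx {
        pending_writes: Vec::new(),
        pending_bytes: 0,
        flush_threshold: 64 * 1024,
    };
    for region_id in 0..3 {
        ctx.cache_write(WriteTask { region_id, data: vec![0u8; 4096] });
    }
    // End of the poll round: flush whatever is still cached.
    ctx.flush_to_write_thread();
}
```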

/// If the batch size is smaller than the threshold, it will return a
/// recommended yield duration for the caller as a hint to wait for more writes.
/// The yield interval is calculated based on the trend of the change of the
/// batch size. The range of the trend is [0.5, 2.0]. If the batch size is
Member

Can you explain what 2.0 means and what 0.5 means? What is the quantitative relationship with the wait time?

Contributor (Author)

Just a heuristic ratio based on testing records: the yield time ranges over [yield_time * 0.5, yield_time * 2.0] nanoseconds. By default, yield_time is 50 nanoseconds, so the overall yield range is [25 nanoseconds, 100 nanoseconds].
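To make that relationship concrete, a trivial sketch (assuming the trend is simply clamped to [0.5, 2.0] and multiplies the base yield time):

```rust
use std::time::Duration;

fn main() {
    let yield_time = Duration::from_nanos(50); // default base yield time
    for trend in [0.5_f64, 1.0, 2.0] {
        // The recommended yield is the base time scaled by the clamped
        // trend, i.e. between 25 ns and 100 ns with the defaults above.
        println!("trend {}: {:?}", trend, yield_time.mul_f64(trend));
    }
}
```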

@Connor1996 (Member)

An intuitive idea: what about moving the write task batching to the raft fsm side? Instead of sending a write message after handling each region, we could first cache each region's write message in the PollCtx, and after handling several regions, once the total write message size exceeds a certain threshold or the current poll round is over, send the cached write tasks to the write thread. It seems easier to handle this batching logic on the raft fsm side than on the write IO thread side. /cc @Connor1996 @overvenus @zhangjinpeng87 What do you think?

Looks promising; that's just how the sync path batches the writes within one poll. Maybe you can make a quick PoC.

Signed-off-by: lucasliang <[email protected]>
Signed-off-by: lucasliang <[email protected]>
@zhangjinpeng87 (Member)

An intuitive idea: what about moving the write task batching to the raft fsm side? Instead of sending a write message after handling each region, we could first cache each region's write message in the PollCtx, and after handling several regions, once the total write message size exceeds a certain threshold or the current poll round is over, send the cached write tasks to the write thread. It seems easier to handle this batching logic on the raft fsm side than on the write IO thread side. /cc @Connor1996 @overvenus @zhangjinpeng87 What do you think?

@LykxSassinator @glorv This is a good idea. I think we can merge the current solution first, because it is well tested and has promising benefits in reducing IOPS and IO bandwidth, and then evaluate the new idea. What do you think?

@glorv (Contributor) left a comment

rest LGTM

components/raftstore/src/store/async_io/write.rs (outdated, resolved)
components/raftstore/src/store/async_io/write.rs (outdated, resolved)
components/tikv_util/src/time.rs (resolved)
Signed-off-by: lucasliang <[email protected]>
@LykxSassinator (Contributor, Author) commented Jun 24, 2024

Based on the most recent performance comparisons with version 8.1.0 (the default LTS version), most typical workloads have passed the testing standards. However, it appears that the larger batch flushing introduced by this PR may lead to performance regressions in the oltp_update_non_index workload (IOPS decreases by > 50% as expected, but with ~8% performance regression).

Therefore, this PR will be put on hold until a more suitable configuration for optimizing the WriteBatch size is identified.

Signed-off-by: lucasliang <[email protected]>
Signed-off-by: lucasliang <[email protected]>
Signed-off-by: lucasliang <[email protected]>