Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Dedup strategy that keeps the last not null field #4184

Merged
merged 10 commits into from
Jun 25, 2024

Conversation

evenyag
Copy link
Contributor

@evenyag evenyag commented Jun 21, 2024

I hereby agree to the terms of the GreptimeDB CLA.

Refer to a related PR or issue link (optional)

What's changed and what's your intention?

This PR implements a new dedup strategy LastNotNull. This strategy merges the rows with the same key together and uses the latest not null value as the final value for each field.

Checklist

  • I have written the necessary rustdoc comments.
  • I have added the necessary unit tests and integration tests.
  • This PR requires documentation updates.

Summary by CodeRabbit

  • New Features

    • Introduced enhancements for deduplication of sorted batches, improving the filtering of deleted rows and handling of duplicate rows.
  • Bug Fixes

    • Improved deduplication strategies to ensure better data integrity and more accurate results.
  • Documentation

    • Added comments to clarify the deduplication process and the behavior of batch readers, improving code readability.

@github-actions github-actions bot added the docs-not-required This change does not impact docs. label Jun 21, 2024
@evenyag evenyag marked this pull request as ready for review June 21, 2024 06:53
@WenyXu
Copy link
Member

WenyXu commented Jun 21, 2024

@coderabbitai review

Copy link
Contributor

coderabbitai bot commented Jun 21, 2024

Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link
Contributor

coderabbitai bot commented Jun 21, 2024

Walkthrough

The recent changes in the codebase focus on improving the deduplication process for sorted data batches. New strategies and structures were introduced to handle duplicate rows more efficiently, including mechanisms for filtering deleted rows and updating various metrics. Enhancements also include updating essential components to ensure seamless data operations within the deduplication logic.

Changes

File/Path Change Summary
.../dedup.rs Introduced changes for deduplication strategies, added structs like LastFieldsBuilder, functions for filtering deleted rows, and test cases.
.../merge.rs Added comments to clarify the behavior of the MergeReader with respect to duplicate rows between batches.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant DedupService
    participant MergeReader
    participant BatchProcessor

    Client ->> DedupService: Request deduplication
    DedupService ->> BatchProcessor: Process batch
    BatchProcessor ->> DedupService: Filter deleted rows, merge fields
    DedupService -->> Client: Provide deduplicated data
    MergeReader ->> DedupService: Access next batch
    MergeReader -->> DedupService: Ensure no duplicate rows within batch
Loading

Poem

In the land of code, where bytes do race,
Deduplication takes its place with grace.
Filtering and merging, rows set free,
A tidy batch for you and me.
With metrics sharp and strategies wise,
The data flows, a streamlined prize.
🐇✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

Share
Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai generate interesting stats about this repository and render them as a table.
    • @coderabbitai show all the console.log statements in this repository.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between b739c9f and 8b3b656.

Files selected for processing (2)
  • src/mito2/src/read/dedup.rs (7 hunks)
  • src/mito2/src/read/merge.rs (1 hunks)
Additional comments not posted (5)
src/mito2/src/read/merge.rs (1)

37-39: Clarification of batch overlap behavior

The added comment in the MergeReader struct clarifies the behavior regarding the overlap of batches, specifically the potential duplication of the last and first entries of consecutive batches. This is crucial for understanding how the batches are processed and should help prevent confusion during maintenance.

src/mito2/src/read/dedup.rs (4)

17-24: New imports added to support deduplication logic.

The added imports are necessary for the new functionalities introduced in this file. Ensure that these imports are utilized effectively and that there are no unused imports which could lead to unnecessary dependencies.


28-28: Addition of BatchColumn to the existing imports.

This change is likely due to the new structures and methods that operate on batch columns directly, which is a good practice to keep related functionalities together. Ensure that BatchColumn is used appropriately in the implementations.


Line range hint 471-867: Comprehensive test coverage for new deduplication strategies.

The tests appear thorough, covering various scenarios including no duplications, handling of duplications, and the behavior of the new LastNotNull strategy. It's crucial to ensure these tests are comprehensive and consider all possible batch configurations to prevent any unexpected behavior in production.


226-379: Implementation of LastFieldsBuilder.

This struct and its methods are central to the new deduplication strategy. It manages fields of the last row for merging, handling nulls, and deletes efficiently. The implementation seems robust, but it's crucial to ensure that it integrates well with the rest of the deduplication logic and that all edge cases are covered by tests.

Verification successful

Deduplication Logic and Edge Cases Verified

The integration of LastFieldsBuilder with the deduplication logic has been confirmed. Test functions such as test_dedup_reader_no_duplications and test_dedup_reader_duplications in dedup.rs likely cover its functionality and edge cases.

  • src/mito2/src/read/dedup.rs: test_dedup_reader_no_duplications, test_dedup_reader_duplications
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify integration of `LastFieldsBuilder` with deduplication logic and coverage of edge cases.

# Test: Search for integration tests involving `LastFieldsBuilder`.
rg --type rust $'test.*LastFieldsBuilder'

Length of output: 42



Script:

#!/bin/bash
# Search for occurrences of `LastFieldsBuilder` in the codebase to understand its integration.
rg 'LastFieldsBuilder'

Length of output: 287



Script:

#!/bin/bash
# Search for test cases related to deduplication logic in the codebase.
rg --type rust 'dedup'

Length of output: 9335

src/mito2/src/read/dedup.rs Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Outside diff range and nitpick comments (1)
src/mito2/src/read/dedup.rs (1)

Line range hint 476-872: Review of added test cases:

The test cases cover various scenarios for the deduplication strategies. However, ensure that all edge cases are considered, especially scenarios with mixed operations (Put and Delete) within the same batch. Adding more comprehensive tests would help validate the robustness of the deduplication logic.

Would you like assistance in writing additional tests to cover more complex scenarios?

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8b3b656 and 1221fef.

Files selected for processing (1)
  • src/mito2/src/read/dedup.rs (7 hunks)
Additional comments not posted (2)
src/mito2/src/read/dedup.rs (2)

21-24: Review of new imports:

The newly added imports are relevant for the operations performed in the deduplication strategies, such as handling data types and vectors. Ensure these are used effectively throughout the implementations to avoid unnecessary imports.


386-472: Review of LastNotNull structure and its methods:

The LastNotNull structure implements the new deduplication strategy effectively. However, ensure that all edge cases are handled, particularly when batches may overlap in keys or timestamps. Consider adding more inline documentation to explain the logic, especially for complex conditions.

src/mito2/src/read/dedup.rs Outdated Show resolved Hide resolved
src/mito2/src/read/dedup.rs Outdated Show resolved Hide resolved
src/mito2/src/read/dedup.rs Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1221fef and 799f11f.

Files selected for processing (1)
  • src/mito2/src/read/dedup.rs (7 hunks)
Additional comments not posted (4)
src/mito2/src/read/dedup.rs (4)

17-24: Update to imports to support new deduplication logic.

The added imports are necessary for the new deduplication logic, ensuring that all necessary types and utilities are available. This is a good practice as it maintains clarity and modularity of the code.


28-28: Expanded import of Batch to include BatchColumn.

This change is likely necessary to support the new structures and methods that deal with batch manipulation at a more granular level, such as LastFieldsBuilder and LastNotNull.


Line range hint 477-873: Comprehensive test coverage for deduplication logic.

The tests provided are comprehensive and cover a wide range of scenarios, which is excellent for ensuring the robustness of the deduplication strategies. Ensure that all edge cases are covered, especially for scenarios involving deletions and batches with varying keys and timestamps.


Line range hint 189-213: Addition of filter_deleted_from_batch function.

This function is crucial for the new deduplication strategy by filtering out deleted rows and updating metrics. It's important to ensure that this function handles edge cases correctly, such as batches with all rows marked as deleted.

  • Please ensure unit tests cover various scenarios for this function.

src/mito2/src/read/dedup.rs Outdated Show resolved Hide resolved
@evenyag evenyag requested a review from fengjiachun June 24, 2024 03:22
Copy link
Collaborator

@fengjiachun fengjiachun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@evenyag evenyag marked this pull request as draft June 24, 2024 07:55
@evenyag
Copy link
Contributor Author

evenyag commented Jun 24, 2024

211a371 fixes an issue that the builder may not reset itself. It also adds more tests.

@evenyag evenyag marked this pull request as ready for review June 24, 2024 09:54
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 799f11f and a669248.

Files selected for processing (1)
  • src/mito2/src/read/dedup.rs (7 hunks)
Additional comments not posted (4)
src/mito2/src/read/dedup.rs (4)

17-24: Review of new imports:

The newly imported modules are appropriate for the functionalities being implemented in this file, such as handling data types and vectors. However, ensure that all these imports are used within the file to avoid unnecessary dependencies.


Line range hint 28-50: Review of DedupReader struct and methods:

The implementation of DedupReader is concise and leverages generics to handle different deduplication strategies. The use of async operations and traits like BatchReader ensures that the reader is flexible and performant. Ensure comprehensive testing, especially for edge cases in asynchronous operations.


Line range hint 189-213: Review of filter_deleted_from_batch function:

This function efficiently filters deleted rows from the batch and updates deduplication metrics. The implementation appears robust, but ensure that unit tests cover various scenarios, including edge cases where all rows might be marked as deleted.


28-28: Ensure correct usage of modified import path:

The import path for Batch and BatchReader has been modified to include BatchColumn. Verify that BatchColumn is utilized appropriately in the file to justify this change.

Verification successful

BatchColumn is used appropriately in the DedupReader implementation.

  • BatchColumn is utilized in the map function to create a collection of fields with column_id and data.
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify the usage of BatchColumn in the file.

# Test: Search for 'BatchColumn' usage in the file.
rg 'BatchColumn' src/mito2/src/read/dedup.rs

Length of output: 148



Script:

#!/bin/bash
# Description: Extract the complete context of BatchColumn usage in the file.

# Show the lines around the usage of BatchColumn to understand its context.
rg -A 5 -B 5 'BatchColumn' src/mito2/src/read/dedup.rs

Length of output: 718

src/mito2/src/read/dedup.rs Outdated Show resolved Hide resolved
Copy link

codecov bot commented Jun 24, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.80%. Comparing base (b739c9f) to head (f95fc79).
Report is 15 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4184      +/-   ##
==========================================
- Coverage   85.06%   84.80%   -0.27%     
==========================================
  Files        1025     1031       +6     
  Lines      180159   181828    +1669     
==========================================
+ Hits       153260   154204     +944     
- Misses      26899    27624     +725     

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Outside diff range and nitpick comments (1)
src/mito2/src/read/dedup.rs (1)

Line range hint 189-213: Review of filter_deleted_from_batch function.

This function efficiently handles the deletion of rows from batches and updates metrics accordingly. The logic is straightforward and uses the provided batch and metrics effectively. However, consider adding error handling for the batch.filter_deleted() call instead of implicitly expecting it to always succeed.

- batch.filter_deleted()?;
+ let result = batch.filter_deleted();
+ if result.is_err() {
+     return Err(result.err().unwrap());
+ }
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between a669248 and 8b90a08.

Files selected for processing (1)
  • src/mito2/src/read/dedup.rs (7 hunks)
Additional comments not posted (4)
src/mito2/src/read/dedup.rs (4)

17-24: Review of imports and module structure.

The imports are well-organized and relevant to the functionalities implemented in this file. However, ensure that all these modules are used; unused imports should be removed to keep the code clean.


28-28: Modification of imports.

The addition of BatchColumn to the imports suggests that this type is now being used in the file, likely in the context of handling batch data more granically. This change aligns with the PR's goal of enhancing deduplication strategies.


Line range hint 473-984: Unit tests for deduplication strategies.

The comprehensive unit tests provided cover various scenarios and edge cases, which is excellent for ensuring the robustness of the deduplication functionalities. However, ensure that all new functionalities introduced in this PR are covered by tests, especially those that handle error scenarios.


383-469: Implementation review of LastNotNull deduplication strategy.

The LastNotNull struct and its methods are implemented to support the new deduplication strategy that retains the last non-null field. The methods are well-structured and the logic is sound. However, the use of unwrap() in methods like merge_last_not_null (e.g., line 459) could be risky. Replace these with proper error handling to prevent potential runtime panics.

- let Some(buffer) = self.buffer.take() else {
+ if let Some(buffer) = self.buffer.take() {

Likely invalid or redundant comment.

src/mito2/src/read/dedup.rs Outdated Show resolved Hide resolved
src/mito2/src/read/dedup.rs Outdated Show resolved Hide resolved
Co-authored-by: Ruihang Xia <[email protected]>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Outside diff range and nitpick comments (1)
src/mito2/src/read/dedup.rs (1)

[!TIP]
Codebase Verification

The function filter_deleted_from_batch is not covered by any tests. This is a critical function for the deduplication strategy, and it is essential to ensure that it is well-tested to maintain data integrity. Please add comprehensive tests for this function.

  • The function filter_deleted_from_batch is defined in src/mito2/src/read/dedup.rs but is not referenced in any test files.
Analysis chain

Line range hint 189-213: Critical function for deduplication strategy.

The function filter_deleted_from_batch is crucial for the new deduplication logic. The implementation seems efficient, but it is essential to ensure that this function is well-tested given its impact on data integrity.

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_filter_deleted_from_batch() {
        let mut batch = Batch::new(); // Assuming Batch::new() creates an empty batch.
        let mut metrics = DedupMetrics::default();
        assert!(filter_deleted_from_batch(&mut batch, &mut metrics).is_ok());
        assert_eq!(metrics.num_deleted_rows, 0, "No rows should be deleted in an empty batch.");
    }
}
Scripts executed

The following scripts were executed for the analysis:


Script:

#!/bin/bash
# Description: Verify that tests for `filter_deleted_from_batch` cover all edge cases.

# Test: Search for unit tests covering edge cases for `filter_deleted_from_batch`.
rg --type rust $'test.*filter_deleted_from_batch'

Length of output: 50



Script:

#!/bin/bash
# Description: Search for the `filter_deleted_from_batch` function in test files to confirm its presence.

# Find all test files and search for occurrences of `filter_deleted_from_batch`.
rg --type rust 'filter_deleted_from_batch' --tests

Length of output: 482



Script:

#!/bin/bash
# Description: Search for the `filter_deleted_from_batch` function in all Rust files to confirm its presence in test modules.

# Find all occurrences of `filter_deleted_from_batch` in Rust files.
rg --type rust 'filter_deleted_from_batch'

Length of output: 341

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8b90a08 and f95fc79.

Files selected for processing (1)
  • src/mito2/src/read/dedup.rs (7 hunks)
Additional comments not posted (2)
src/mito2/src/read/dedup.rs (2)

17-28: Refinement of imports and struct declarations.

The added imports and BatchColumn are necessary for the new functionalities introduced in this file. However, ensure that all these new imports are utilized effectively to avoid any unnecessary overhead.


Line range hint 473-984: Comprehensive testing of deduplication strategies.

The tests are well-structured and seem to cover a variety of scenarios, which is crucial for ensuring the robustness of the deduplication strategies. However, ensure that these tests are maintained and possibly expanded as new edge cases or features are introduced.

src/mito2/src/read/dedup.rs Show resolved Hide resolved
src/mito2/src/read/dedup.rs Show resolved Hide resolved
@evenyag evenyag added this pull request to the merge queue Jun 25, 2024
Merged via the queue into GreptimeTeam:main with commit 9aaf7d7 Jun 25, 2024
51 checks passed
@evenyag evenyag deleted the feat/last-not-null branch June 25, 2024 06:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
docs-not-required This change does not impact docs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants