
[BUG] Duplicates loaded into existing satellite #221

Closed
khalidhussein138 opened this issue Jan 5, 2024 · 5 comments
Labels
bug Something isn't working

Comments


khalidhussein138 commented Jan 5, 2024

Describe the bug
When loading into a Satellite table where Stage contains two or more duplicate records (the same hash_key, hash_diff and load_timestamp) and the records have not been previously loaded into the satellite, both records are loaded into the satellite rather than just one.

Environment

dbt version: 1.5.2
automate_dv version: 0.10.1
Database/Platform: Snowflake

To Reproduce

Scenario: Duplicates loaded into Satellite
Given: Stage contains two exact duplicate records.
And: A satellite already exists.
And: The satellite does not contain the records.
And: apply_source_filter is set to TRUE.

Stage:
| hash_key_column | arbitrary_context_columns | hash_diff                 | load_ts                   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |

When: I load into the existing satellite.
Then: The satellite is loaded with both records.

Satellite:
| hash_key_column | arbitrary_context_columns | hash_diff                 | load_ts                   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |

Expected behavior

Scenario: Duplicates not loaded into Satellite
Given: Stage contains two exact duplicate records.
And: A satellite already exists.
And: The satellite does not contain the records.
And: apply_source_filter is set to TRUE.

Stage:
| hash_key_column | arbitrary_context_columns | hash_diff                 | load_ts                   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |

When: I load into the existing satellite.
Then: The satellite should be loaded with a single record.

Satellite:
| hash_key_column | arbitrary_context_columns | hash_diff                 | load_ts                   |
| 6cf8            | arbitrary data            | md5('arbitrary data')     | 2022-05-22 00:13:48.901   |

The root cause
When debugging the satellite script, we noticed that in this CTE the RANK() window function assigns rank 1 to every exact duplicate, so all of them pass the "rank = 1" filter and are loaded. As a suggestion, changing this to ROW_NUMBER() would solve the issue, since it assigns each duplicate a distinct number and only one copy survives the filter.
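The behaviour can be reproduced outside dbt. This is a minimal sketch (table and column names are assumed for illustration, using SQLite in place of Snowflake): two exact duplicates in stage both receive RANK() = 1 and pass a "keep position 1" filter, while ROW_NUMBER() keeps exactly one.

```python
import sqlite3  # window functions require SQLite >= 3.25

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE stage (hash_key TEXT, hash_diff TEXT, load_ts TEXT)")
# Two exact duplicates, as in the bug report's Stage table.
con.executemany(
    "INSERT INTO stage VALUES (?, ?, ?)",
    [("6cf8", "md5('arbitrary data')", "2022-05-22 00:13:48.901")] * 2,
)

def rows_passing_filter(window_fn):
    """Count stage rows that survive a 'position = 1' dedup filter."""
    return con.execute(f"""
        SELECT hash_key FROM (
            SELECT *, {window_fn} OVER (
                PARTITION BY hash_key, hash_diff ORDER BY load_ts
            ) AS pos
            FROM stage
        ) WHERE pos = 1
    """).fetchall()

# RANK() ties both duplicates at 1, so both are loaded.
print(len(rows_passing_filter("RANK()")))        # 2
# ROW_NUMBER() numbers them 1 and 2, so only one is loaded.
print(len(rows_passing_filter("ROW_NUMBER()")))  # 1
```

The same distinction holds on Snowflake: RANK() produces ties for identical ordering values, whereas ROW_NUMBER() is guaranteed unique within a partition.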

khalidhussein138 added the bug label Jan 5, 2024

DVAlexHiggs commented Jan 5, 2024

Hi,

Thanks for this report!

This is something we've noticed internally on our own projects; we've recently developed a fix and hope to release it in due course.

We've also decided to replace RANK() with ROW_NUMBER() everywhere, both for the reason you describe and for performance.

@drewsonne

Hi @DVAlexHiggs, that's good to know. When you say "due course", what sort of time frame are you looking at? Are we talking a month, 6 months, 2 years?


DVAlexHiggs commented Feb 15, 2024

> Hi @DVAlexHiggs, that's good to know. When you say "due course", what sort of time frame are you looking at? Are we talking a month, 6 months, 2 years?

Apologies for the vagueness! The target is this month; we've got a few QoL and performance improvements, plus various other things, coming down the line 😄

@drewsonne

Amazing, thank you!

@DVAlexHiggs

Fixed in v0.10.2 😄 Thanks for your patience while we got this released! Please let us know if you experience any issues by responding here or opening a new issue.
