Concurrent writes reconciliation in Delta Lake #16985

Open
Tracked by #11296
findepi opened this issue Apr 12, 2023 · 6 comments
Labels: delta-lake (Delta Lake connector), enhancement (New feature or request)

findepi (Member) commented Apr 12, 2023

Support transactions that, e.g., insert data concurrently, or transactions that modify disjoint data sets (e.g. different partitions).
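
As a concrete illustration of the request, a minimal sketch of such a workload: two sessions inserting into the same Delta table through the Trino JDBC driver at the same time. The table, schema, and endpoint are hypothetical, not from this issue; with reconciliation in place, neither insert should fail with a conflict.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class ConcurrentInserts
{
    public static void main(String[] args)
            throws InterruptedException
    {
        // Two threads insert concurrently, each targeting a different partition
        // of a hypothetical table orders(id, order_date) partitioned by order_date
        Thread first = new Thread(() -> run("INSERT INTO orders VALUES (1, DATE '2023-04-11')"));
        Thread second = new Thread(() -> run("INSERT INTO orders VALUES (2, DATE '2023-04-12')"));
        first.start();
        second.start();
        first.join();
        second.join();
    }

    private static void run(String sql)
    {
        // Hypothetical Trino endpoint with a Delta Lake catalog named "delta"
        try (Connection connection = DriverManager.getConnection(
                "jdbc:trino://localhost:8080/delta/default", "admin", null);
                Statement statement = connection.createStatement()) {
            statement.execute(sql);
        }
        catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```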

Thelin90 commented Apr 17, 2023

Hello.

I have seen this behaviour for a while now.

You don't see this behaviour in Spark or delta-rs when you do concurrent appends.

The Delta connector does not seem to lock in the correct Delta metadata version the way other engines do.

I have seen this behaviour for:

  • appends (works fine in Spark since it commits the delta log on the driver and writes on the workers, as sketched below; will collide when using multiple clusters)
    • with delta-rs you can use the native DynamoDB lock
  • deletes (works fine in Spark for the same reason; will collide when using multiple clusters)
    • with delta-rs you can use the native DynamoDB lock
  • merge (works fine in Spark for the same reason; will collide when using multiple clusters)
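
For context on why multi-cluster writers collide, here is a minimal sketch (illustrative only, not connector or Spark code) of Delta's optimistic commit: each writer tries to atomically create the next numbered log file, so only one writer can win a given version. On object stores without an atomic create-if-absent operation (e.g. S3), that atomicity has to come from an external coordination mechanism, which is what the DynamoDB lock provides.

```java
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;

// Illustrative sketch of Delta's optimistic commit protocol, assuming a
// filesystem with an atomic create-if-absent operation
class OptimisticCommit
{
    void commit(Path deltaLogDir, long lastVersion, List<String> actionsJson)
            throws Exception
    {
        // Each commit is a JSON file named after its zero-padded version number
        Path next = deltaLogDir.resolve(String.format("%020d.json", lastVersion + 1));
        try {
            // CREATE_NEW fails atomically if another writer already took this version
            Files.write(next, actionsJson, StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        }
        catch (FileAlreadyExistsException e) {
            // A concurrent writer won version lastVersion + 1: read its actions,
            // then either reconcile and retry with a higher version, or abort
        }
    }
}
```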

findinpath (Contributor) commented

Scenario for realizing that there is a conflict and the query needs to be aborted (the error key plays an important role here in notifying the client accordingly):

Start:

0001.json
  table has file A with rows 1-10

User1 updates row 5:

0002.json
  remove entry: delete A
  add entry: add file C (rows 1-4 and 6-10 same, row 5 modified)

User2 updates row 2:

0003.json
  remove entry: delete A
  add entry: add file B (rows 1 and 3-10 same, row 2 modified)

The connector must realize the conflict here: both commits remove file A, so the later transaction is based on stale state.

Therefore, when writing to different partitions of the table, the connector should be able to cope with concurrent inserts transparently for the user: there are no shared resources being modified by the concurrently running queries.
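
A minimal sketch (illustrative only; the action and partition sets are hypothetical) of the two checks described above: the file-level overlap that makes the scenario conflict, and the partition-level disjointness that lets concurrent inserts both commit.

```java
import java.util.Collections;
import java.util.Set;

class ReconciliationChecks
{
    // Scenario above: the winning commit (0002.json) removed file A, and our
    // pending commit also removes A, so our transaction must be aborted
    static boolean removedFilesConflict(Set<String> removedByWinner, Set<String> removedByUs)
    {
        return !Collections.disjoint(removedByWinner, removedByUs);
    }

    // Concurrent INSERTs touching disjoint partitions modify no shared
    // resources, e.g. {"date=2023-04-11"} vs {"date=2023-04-12"}
    static boolean insertPartitionsDisjoint(Set<String> writtenByWinner, Set<String> writtenByUs)
    {
        return Collections.disjoint(writtenByWinner, writtenByUs);
    }
}
```

In the scenario, removedFilesConflict(Set.of("A"), Set.of("A")) is true, so the second update is rejected; two inserts into different partitions pass the second check and can both commit.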

findinpath self-assigned this Jul 25, 2023
findinpath (Contributor) commented

Basic algorithm (shared from an answer by @findepi), sketched in code after this list:

  • detect the concurrent transaction (already there; see TransactionConflictException)
  • see which transactions were committed in the meantime and whether the transaction being committed still makes sense and can continue
    • if yes, re-commit with a higher transaction number
    • rinse & repeat in case of another concurrent commit
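
A minimal sketch of that loop; the helper methods and the local TransactionConflictException stub are hypothetical stand-ins, not the connector's actual API.

```java
import java.util.List;

class CommitRetryLoop
{
    void commitWithReconciliation(String tableLocation, List<String> pendingActions)
    {
        long version = latestVersion(tableLocation);
        while (true) {
            try {
                // Atomic create of the (version + 1) log entry, as sketched earlier
                writeCommit(tableLocation, version + 1, pendingActions);
                return;
            }
            catch (TransactionConflictException e) {
                // See which transactions were committed in the meantime and whether
                // the transaction being committed still makes sense and can continue
                List<String> winningActions = readCommitsSince(tableLocation, version);
                if (conflictsWith(winningActions, pendingActions)) {
                    throw e; // genuine conflict: abort the query
                }
                // Still valid: rinse & repeat with a higher transaction number
                version = latestVersion(tableLocation);
            }
        }
    }

    // Hypothetical stand-ins for log access and conflict analysis
    long latestVersion(String tableLocation) { throw new UnsupportedOperationException(); }
    void writeCommit(String tableLocation, long version, List<String> actions) {}
    List<String> readCommitsSince(String tableLocation, long sinceVersion) { throw new UnsupportedOperationException(); }
    boolean conflictsWith(List<String> winningActions, List<String> pendingActions) { return false; }

    // Stub mirroring the connector's TransactionConflictException
    static class TransactionConflictException extends RuntimeException {}
}
```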

findinpath (Contributor) commented Mar 7, 2024

Overview of the PRs covering the concurrent reconciliation functionality:

findepi (Member, Author) commented Jun 17, 2024

@findinpath awesome progress!

  • TODO concurrent reconciliation support for the optimize procedure

By any chance, is this in the works?

findinpath (Contributor) commented Jun 18, 2024

> By any chance, is this in the works?

I plan to create a PR for handling optimize this week.

#22443
