Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for left semi and left anti joins #1182

Merged
merged 15 commits into from
Jun 30, 2023

Conversation

ChrisJar
Copy link
Collaborator

@ChrisJar ChrisJar commented Jun 14, 2023

To-do:

  • Fix expected return columns mismatch
  • Fix incorrect merge column name
  • Create CPU implementation for leftanti

Copy link
Collaborator

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here's a (hack-y) way of handling left anti joins that at least works with a very basic example. It only involves modifying the Python side and just slicing out the irrelevant righthand side columns. Not sure if this is the best way to do things or if it needs building upon to work for more complicated queries, but let me know what you think. Happy to help explore the Rust side more if this solution is insufficient.

dask_sql/physical/rel/logical/join.py Outdated Show resolved Hide resolved
dask_sql/physical/rel/logical/join.py Show resolved Hide resolved
dask_sql/physical/rel/logical/join.py Show resolved Hide resolved
dask_sql/physical/rel/logical/join.py Show resolved Hide resolved
dask_sql/physical/rel/logical/join.py Show resolved Hide resolved
@ChrisJar ChrisJar changed the title Add beggining of support for left anti joins Add support for left semi and left anti joins Jun 22, 2023
@ChrisJar
Copy link
Collaborator Author

@sarahyurick thanks for the suggestions!

@ChrisJar
Copy link
Collaborator Author

What's the current best practice for checking if a dask-sql dataframe is cpu or gpu backed without importing cudf?

@sarahyurick
Copy link
Collaborator

What's the current best practice for checking if a dask-sql dataframe is cpu or gpu backed without importing cudf?

I think is_cudf_type should work:

def is_cudf_type(obj):

@ChrisJar ChrisJar marked this pull request as ready for review June 22, 2023 21:05
@codecov-commenter
Copy link

codecov-commenter commented Jun 22, 2023

Codecov Report

Merging #1182 (5ab4b0c) into main (5421bbf) will increase coverage by 0.17%.
The diff coverage is 100.00%.

❗ Your organization is not using the GitHub App Integration. As a result you may experience degraded service beginning May 15th. Please install the Github App Integration for your organization. Read more.

@@            Coverage Diff             @@
##             main    #1182      +/-   ##
==========================================
+ Coverage   81.72%   81.89%   +0.17%     
==========================================
  Files          78       78              
  Lines        4519     4535      +16     
  Branches      831      837       +6     
==========================================
+ Hits         3693     3714      +21     
+ Misses        643      634       -9     
- Partials      183      187       +4     
Impacted Files Coverage Δ
dask_sql/context.py 94.02% <100.00%> (+0.02%) ⬆️
dask_sql/physical/rel/base.py 93.61% <100.00%> (+0.75%) ⬆️
dask_sql/physical/rel/logical/join.py 96.32% <100.00%> (+0.29%) ⬆️

... and 1 file with indirect coverage changes

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Collaborator

@sarahyurick sarahyurick left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Collaborator

@charlesbluca charlesbluca left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ChrisJar! Some minor comments around type annotations / follow up work

if select_names:
cc = dc.column_container

select_names = select_names[: len(cc.columns)]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why getFieldList and getFieldNames are returning an extra field that we need to filter out here (and in fix_column_to_row_type)? I want to believe there's something we could do on the Rust end to avoid needing these workarounds in the Python code, but okay with filing that as a follow up issue to address.

cc @jdye64 if you have any idea what could be going on here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it seems like it's treating left semi and left anti joins like left joins where the column merged on from both left and right tables is included in the result whereas in leftsemi and leftanti joins only the left hand column should be returned

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind filing an issue around this and linking it here so we could follow up on this later if possible?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dask_sql/physical/rel/base.py Outdated Show resolved Hide resolved
dask_sql/physical/rel/base.py Outdated Show resolved Hide resolved
dask_sql/physical/rel/logical/join.py Outdated Show resolved Hide resolved
Co-authored-by: Charles Blackmon-Luca <[email protected]>
@charlesbluca charlesbluca merged commit 2a0fec3 into dask-contrib:main Jun 30, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants