Add MemoryDataset entries to free_outputs #3475

Merged
merged 9 commits into from
Jan 8, 2024

@SajidAlamQB SajidAlamQB commented Jan 3, 2024

Description

Context: #1900

The free_outputs output from the session isn't very clear. We'll change it to return all free outputs and, additionally, any MemoryDataSets that are defined in the catalog.

Development notes

Developer Certificate of Origin

We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a Signed-off-by line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.

Checklist

  • Read the contributing guidelines
  • Signed off each commit with a Developer Certificate of Origin (DCO)
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the RELEASE.md file
  • Added tests to cover my changes
  • Checked if this change will affect Kedro-Viz, and if so, communicated that with the Viz team

@SajidAlamQB SajidAlamQB self-assigned this Jan 3, 2024
@noklam noklam self-requested a review January 4, 2024 11:34
noklam commented Jan 4, 2024

The free_outputs output from session isn't very clear we'll change it to return all free outputs and additionally any MemoryDataSets that are defined in the catalog.

Is this what we intended? I expected the change to be minimal. It should return datasets that satisfy the following conditions:

  1. It is already in Memory (thus the original "free_output")
  2. It is not defined in catalog

The assumption in 2. is faulty because a user can define a MemoryDataset in the catalog (rare but possible). In that case we should still return the dataset because there is no additional I/O.

Cc @merelcht In #1900, it is written that

Instead we'll change it to return all free outputs and additionally any MemoryDataSets that are defined in the catalog.

This can cause huge memory consumption. It will most likely fail too, because the Runner releases intermediate MemoryDataset entries, so we cannot return every MemoryDataset in the catalog.
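To illustrate the concern about released intermediates, here is a toy model in plain Python. It is not Kedro's runner code: `run_sequentially`, the `(func, inputs, output)` tuple layout, and the `store` dict are hypothetical stand-ins, and the only assumption carried over from the discussion is that a runner drops an intermediate dataset once no later node needs it.

```python
def run_sequentially(nodes, final_outputs):
    """Run nodes in order and drop intermediates once nothing else needs them.

    nodes: list of (func, input_names, output_name) tuples in execution order
           (a hypothetical layout, chosen only for this sketch).
    final_outputs: names that must survive the run, like pipeline outputs.
    """
    store = {}  # stand-in for in-memory datasets

    # Count how many times each dataset will still be loaded.
    load_counts = {}
    for _, inputs, _ in nodes:
        for name in inputs:
            load_counts[name] = load_counts.get(name, 0) + 1

    for func, inputs, output in nodes:
        args = [store[name] for name in inputs]
        store[output] = func(*args)

        # Release inputs that no later node will load again.
        for name in inputs:
            load_counts[name] -= 1
            if load_counts[name] < 1 and name not in final_outputs:
                store.pop(name, None)

    return store


nodes = [
    (lambda: [1, 2, 3], [], "raw"),
    (lambda raw: [x * 2 for x in raw], ["raw"], "doubled"),
    (lambda doubled: sum(doubled), ["doubled"], "total"),
]

remaining = run_sequentially(nodes, final_outputs={"total"})
print(remaining)  # only "total" is left; "raw" and "doubled" were released
```

In this model, asking for "raw" or "doubled" after the run would fail, which is why only the pipeline's final outputs are safe candidates to return.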

SajidAlamQB commented Jan 4, 2024

This can cause huge memory consumption, it will most likely fail too because Runner will release intermediate MemoryDataset so we cannot return all MemoryDataSet in catalog.

I agree with @noklam, the initial approach included intermediate MemoryDataset entries in free_outputs, which could lead to potential memory errors or unexpected behavior.

Based on that, I've updated the implementation to do the following:

free_outputs = (pipeline.outputs() - (set(registered_ds) - memory_datasets))

@noklam comment from DMs:

In any case the superset will be pipeline.outputs(). If a dataset is registered in the catalog but is not a MemoryDataset, then we remove it from the outputs.
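The set arithmetic above can be tried out with plain Python sets. This is a minimal sketch, not the Kedro implementation: `compute_free_outputs` is a hypothetical helper, and the dataset names are made up to stand in for what `pipeline.outputs()` and the catalog would provide.

```python
def compute_free_outputs(pipeline_outputs, registered_ds, memory_datasets):
    """Return pipeline outputs that can be handed back without extra I/O.

    pipeline_outputs: stand-in for the result of pipeline.outputs()
    registered_ds:    names of datasets declared in the catalog
    memory_datasets:  the subset of catalog entries that are MemoryDataset
    """
    # Catalog entries that are persisted somewhere other than memory.
    persisted = set(registered_ds) - set(memory_datasets)
    # Everything the pipeline outputs, minus what is persisted.
    return set(pipeline_outputs) - persisted


# Hypothetical example values.
pipeline_outputs = {"model", "metrics", "report"}
registered_ds = {"raw_data", "metrics", "report"}
memory_datasets = {"report"}

# Previous behaviour: any catalog entry is excluded.
old_free_outputs = pipeline_outputs - set(registered_ds)

# Updated behaviour: catalog entries that are MemoryDataset are kept.
new_free_outputs = compute_free_outputs(
    pipeline_outputs, registered_ds, memory_datasets
)

print(sorted(old_free_outputs))  # ['model']
print(sorted(new_free_outputs))  # ['model', 'report']
```

Here "report" is declared in the catalog as an in-memory entry, so the updated expression returns it, while "metrics" stays excluded because it is persisted.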

Signed-off-by: Sajid Alam <[email protected]>
@noklam noklam left a comment

approved with a minor comment. Nice work!

- # in the catalog.
- free_outputs = pipeline.outputs() - set(registered_ds)
+ # in the catalog and include MemoryDataset.
+ free_outputs = pipeline.outputs() - (set(registered_ds) - memory_datasets)

nit: The definition of free_outputs has always been confusing to me. I think what is returned here is really "in_memory_dataset", since we are trying to return the dataset as long as there is no I/O penalty.

Feel free to come up with other names.

@merelcht merelcht left a comment

Totally agree with this approach. Great work @SajidAlamQB 👍

@SajidAlamQB SajidAlamQB enabled auto-merge (squash) January 8, 2024 16:56
@SajidAlamQB SajidAlamQB merged commit bda3751 into main Jan 8, 2024
36 checks passed
@SajidAlamQB SajidAlamQB deleted the Change-free-outputs-to-also-return-MemoryDataSet branch January 8, 2024 17:13