Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make Dataloader/Datasets Type Safe, Consistent and 100% Compatible with HF Datasets #929

Open
wants to merge 18 commits into
base: main
Choose a base branch
from

Conversation

KCaverly
Copy link
Collaborator

Hi Team,

In an effort to increase transparency and type safety, the below PR does the following:

  1. Make Dataloader/Dataset objects Pydantic & Type Safe
  2. Make Dataloader/Dataset consistent. Some provided datasets were not identified as Dataset objects specifically.
  3. Make Dataloaders 100% compatible with HuggingFace Datasets

As we are currently not using Datasets directly downstream (we only use list[Example]) this change should be fairly inconsequential, but opens up the opportunity for us to harden the interaction between Data/DSPy in the future.

@krypticmouse Let me know what you think.

@KCaverly KCaverly marked this pull request as ready for review April 29, 2024 18:35
dspy/datasets/colors.py Show resolved Hide resolved
dspy/datasets/colors.py Show resolved Hide resolved
dspy/datasets/dataset.py Show resolved Hide resolved
@arnavsinghvi11
Copy link
Collaborator

Hi @KCaverly , thanks for the PR! took a stab at it and left some comments. The changes look great and make sense to me but I just want to confirm that these are compatible with existing DSPy example notebooks (particularly intro.ipynb and any others using datasets (GSM8K).

as an aside, I feel like these specific datasets (colors, gsm8k, hotpotqa) can likely be removed with some refactoring to ensure users can call the Dataloader abstraction to retrieve them. We could instead add them as examples/documentation (but that can be for a separate PR).

@KCaverly
Copy link
Collaborator Author

KCaverly commented May 6, 2024

Thanks @arnavsinghvi11, for the review. I can double check on the examples, and I agree with the comment on future refactors.

I think two improvements for following up, would be:

  1. Move examples out of the core library.
  2. Collapse Dataloader functionality into Dataset.
  • I dont see the need for a separate Dataloader class, it feels to me like these should just be moved to factory functions on the Dataset object.

@KCaverly
Copy link
Collaborator Author

hey @arnavsinghvi11. I've incorporated the changes above, this should be good to merge.
Let me know if there are any outstanding concerns.

@krypticmouse
Copy link
Collaborator

krypticmouse commented May 13, 2024

@KCaverly Sorry about the delay, just catching up to this. Why did we remove the sample and split functionalities 🤔

Ik we have Dataset now so we can technically pull those out but the functionality of split won't still be as clean and seamless for other formats IMO. WDYT?

@KCaverly
Copy link
Collaborator Author

Hey @krypticmouse!

Weve still got ‘split_existing_split’ and ‘sample_split’. Which offer functionality to split existing splits, and sample/shuffle. I tried to make the base functionality more deterministic and consistent, but all the existing functionality should be possible still.

Not quite sure what your comment means on other formats. Does the two outline methods above have all the functionality we need? How can we improve this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants