Make Dataloader/Datasets Type Safe, Consistent and 100% Compatible with HF Datasets #929

KCaverly · 2024-04-29T17:23:15Z

Hi Team,

In an effort to increase transparency and type safety, the below PR does the following:

Make Dataloader/Dataset objects Pydantic & Type Safe
Make Dataloader/Dataset consistent. Some provided datasets were not identified as Dataset objects specifically.
Make Dataloaders 100% compatible with HuggingFace Datasets

As we are currently not using Datasets directly downstream (we only use list[Example]), this change should be fairly inconsequential, but it opens up the opportunity for us to harden the interaction between Data/DSPy in the future.
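To make the idea concrete, here is a minimal sketch of a typed, validated dataset container. It is not the code in this PR: `Example` and `Dataset` below are hypothetical stand-ins, and stdlib dataclasses replace Pydantic so the snippet has no dependencies. It only illustrates validating contents at construction while still handing a plain list[Example] to downstream code.

```python
from dataclasses import dataclass, field
from typing import Any, Iterator


@dataclass(frozen=True)
class Example:
    """Hypothetical stand-in for dspy.Example: a bag of named fields."""
    fields: dict[str, Any]

    def __getitem__(self, key: str) -> Any:
        return self.fields[key]


@dataclass
class Dataset:
    """Hypothetical typed container that validates its contents on construction."""
    examples: list[Example]
    input_keys: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        for i, ex in enumerate(self.examples):
            # Reject anything that is not an Example (e.g. a raw dict).
            if not isinstance(ex, Example):
                raise TypeError(f"examples[{i}] is {type(ex).__name__}, expected Example")
            # Every declared input key must be present on every example.
            missing = [k for k in self.input_keys if k not in ex.fields]
            if missing:
                raise ValueError(f"examples[{i}] is missing input keys: {missing}")

    def __len__(self) -> int:
        return len(self.examples)

    def __iter__(self) -> Iterator[Example]:
        return iter(self.examples)

    def __getitem__(self, index: int) -> Example:
        return self.examples[index]


ds = Dataset(
    examples=[Example({"question": "What is 2 + 2?", "answer": "4"})],
    input_keys=["question"],
)
# Downstream code that only needs list[Example] keeps working.
as_list: list[Example] = list(ds)
```

Because the container is iterable and indexable, existing call sites that expect a list of examples would not need to change under these assumptions.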

@krypticmouse Let me know what you think.

dspy/datasets/colors.py

dspy/datasets/dataset.py

arnavsinghvi11 · 2024-05-06T00:44:44Z

Hi @KCaverly, thanks for the PR! I took a stab at it and left some comments. The changes look great and make sense to me, but I just want to confirm that they are compatible with the existing DSPy example notebooks (particularly intro.ipynb and any others using datasets such as GSM8K).

As an aside, I feel like these specific datasets (colors, gsm8k, hotpotqa) can likely be removed with some refactoring, so that users call the Dataloader abstraction to retrieve them. We could instead add them as examples/documentation (but that can be for a separate PR).

KCaverly · 2024-05-06T16:49:11Z

Thanks for the review, @arnavsinghvi11. I can double-check the examples, and I agree with the comment on future refactors.

I think two improvements to follow up on would be:

Move examples out of the core library.
Collapse Dataloader functionality into Dataset.

I don't see the need for a separate Dataloader class. It feels to me like its methods should just be moved to factory functions on the Dataset object.
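A rough sketch of what "factory functions on the Dataset object" could look like is below. All names here (`from_records`, `from_csv`, the `fields` parameter) are hypothetical and not taken from the PR; the point is only that loaders become classmethods, so no separate Dataloader instance is needed.

```python
import csv
import io
from dataclasses import dataclass
from typing import Any, Iterable, Mapping, Optional, TextIO


@dataclass
class Dataset:
    """Hypothetical Dataset that owns its loaders instead of a separate Dataloader."""
    records: list[dict[str, Any]]

    @classmethod
    def from_records(
        cls,
        rows: Iterable[Mapping[str, Any]],
        fields: Optional[list[str]] = None,
    ) -> "Dataset":
        """Build from any iterable of mappings, optionally keeping only `fields`."""
        out = []
        for row in rows:
            keys = fields if fields is not None else list(row.keys())
            out.append({k: row[k] for k in keys})
        return cls(out)

    @classmethod
    def from_csv(cls, handle: TextIO, fields: Optional[list[str]] = None) -> "Dataset":
        """Build from an open CSV file; each row becomes one record."""
        return cls.from_records(csv.DictReader(handle), fields=fields)

    def __len__(self) -> int:
        return len(self.records)


csv_text = "question,answer\nWhat is 2 + 2?,4\nCapital of France?,Paris\n"
from_file = Dataset.from_csv(io.StringIO(csv_text), fields=["question"])
from_dicts = Dataset.from_records([{"question": "q1", "answer": "a1"}])
```

Any other source that yields dict-like rows could be routed through the same `from_records` entry point, which keeps the format-specific code small.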


KCaverly · 2024-05-13T13:47:31Z

Hey @arnavsinghvi11, I've incorporated the changes above, so this should be good to merge.
Let me know if there are any outstanding concerns.

krypticmouse · 2024-05-13T17:17:16Z

@KCaverly Sorry about the delay, just catching up to this. Why did we remove the sample and split functionalities 🤔

I know we have Dataset now, so we can technically pull those out, but the split functionality still won't be as clean and seamless for other formats IMO. WDYT?

KCaverly · 2024-05-14T02:29:56Z

Hey @krypticmouse!

We've still got ‘split_existing_split’ and ‘sample_split’, which offer functionality to split existing splits and to sample/shuffle. I tried to make the base functionality more deterministic and consistent, but all the existing functionality should still be possible.

I'm not quite sure what your comment about other formats means. Do the two methods outlined above have all the functionality we need? How can we improve this?
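For reference, a minimal sketch of what deterministic splitting and sampling can look like. The function names mirror the two methods mentioned above, but the signatures and behaviour here are assumptions for illustration, not the PR's implementation: a seeded local RNG makes sampling repeatable, and splitting cuts a split into contiguous, non-overlapping parts.

```python
import random
from typing import Sequence, TypeVar

T = TypeVar("T")


def sample_split(data: Sequence[T], n: int, seed: int = 0) -> list[T]:
    """Return n items in shuffled order; the same seed always gives the same sample."""
    if n > len(data):
        raise ValueError("n is larger than the split")
    rng = random.Random(seed)  # local RNG, leaves the global random state untouched
    return rng.sample(list(data), n)


def split_existing_split(
    data: Sequence[T], fractions: dict[str, float]
) -> dict[str, list[T]]:
    """Cut one split into named, contiguous, non-overlapping parts, preserving order."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    items = list(data)
    names = list(fractions)
    out: dict[str, list[T]] = {}
    start = 0
    for i, name in enumerate(names):
        # The last part takes the remainder so no item is lost to rounding.
        if i == len(names) - 1:
            end = len(items)
        else:
            end = start + int(len(items) * fractions[name])
        out[name] = items[start:end]
        start = end
    return out


train = list(range(10))
parts = split_existing_split(train, {"train": 0.8, "dev": 0.2})
picked = sample_split(train, 3, seed=42)
```

Since both helpers accept any sequence, the same calls would apply regardless of the format the data was loaded from.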

KCaverly added 12 commits April 23, 2024 14:33

wip, working typed dataset api

a625308

updated hotpotqa, gsm8k, and colors to new dataset api

4da55ae

catchup up with main

d45e51e

wip: progress towards type safe dataloader

423a676

wip progress towards dataloader

4bfb94b

progress towards type safe dataloader

ad5ad50

working dataloader for huggingface datasets

ce5db06

working dataloader

6b46a00

progress on general huggingface dataloaders

506ff9a

cleaned up dataloader

e3a83e2

fix: update provided datasets for new dataset api

c401687

chore: ruff fixes

e09d1c9

KCaverly marked this pull request as ready for review April 29, 2024 18:35

arnavsinghvi11 reviewed May 6, 2024


KCaverly added 6 commits May 13, 2024 09:38

fix: update intro notebook to accomodate for difference in hotpotqa loading

175b207

fix: fixed recursive getattr

f6fa23d

fix: added randomness back into Colors example

9588e0e

removed intro.ipynb outputs

37e3b77

catchup with main

3d01d1b

chore: ruff fixes

e5464ac
