Dataset hosting, data cards and previews #139

Open
lhoestq opened this issue Mar 15, 2023 · 4 comments
@lhoestq
lhoestq commented Mar 15, 2023

Hi, I'm Quentin from Hugging Face :)

I know hosting datasets on GitHub is not always practical: Git LFS is required, there is no data preview, storage is limited (maybe not for you haha), and there is no standard for data documentation. So I was wondering:

Have you considered hosting alternatives that are better suited to datasets and would let researchers explore the eval datasets?

This way researchers can know in depth what data is used for evaluation, along with its goals and limitations, and in particular better understand which domains and structures their models perform well or badly on.

E.g. the Hugging Face datasets hub shows data cards for documentation and previews for each dataset. Also, loading and caching a dataset is one line of Python, saving you from wget and GitHub hosting. It also supports pull requests so the community can contribute.

It can even make it possible to use those datasets in other well-known eval frameworks, such as lm-evaluation-harness.
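For illustration, the "one line of Python" could look roughly like the sketch below. It assumes the `datasets` library is installed and uses GLUE as a stand-in dataset; the repo id `nyu-mll/glue`, the `mrpc` subset, and the helper name `load_hub_split` are assumptions chosen for the example, not something the evals repo provides.

```python
def load_hub_split(repo_id: str = "nyu-mll/glue",
                   subset: str = "mrpc",
                   split: str = "validation"):
    """Download (on first call) and cache one split of a Hub dataset."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # This is the "one line": download, cache, and return the requested split.
    return load_dataset(repo_id, subset, split=split)


if __name__ == "__main__":
    # Requires network access and `pip install datasets`.
    ds = load_hub_split()
    print(ds[0])
```

Later calls with the same arguments should read from the local cache instead of downloading again.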

Let me know what you think !

@logankilpatrick
Contributor

I love Hugging Face! Worth considering at some point soon; we will explore internally over the next few weeks.

@andrew-openai
Contributor

Thanks for stopping by! HuggingFace datasets is great.

Many of our evals are only a few samples long (10-20), which we worried would be too small to host as individual datasets on Hugging Face Datasets. We needed a platform that supports lots of small datasets, which is why LFS seemed to work OK for our task.

If you think this is still a reasonable use case for HuggingFace Datasets, I'd be happy to help any efforts in mirroring them onto HuggingFace!

@polinaeterna
polinaeterna commented Mar 30, 2023

@logankilpatrick @andrew-openai hi, I'm Polina from the HF datasets team :) Regarding your worry about many small datasets: in datasets it's possible to host more than one dataset as a single dataset with many subsets, as is done for benchmarks like glue.
Also feel free to ping me if you have any questions about adding datasets to the Hub :)
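As a rough sketch of the "single dataset with many subsets" idea, the snippet below writes each small eval to its own JSONL file and generates a README front matter block that declares one subset per eval. The helper name `write_eval_subsets`, the `data/<name>/samples.jsonl` layout, and the sample fields are hypothetical, and the `configs` / `config_name` / `data_files` keys reflect my understanding of the Hub's manual configuration format, so check the Hub documentation before relying on it.

```python
import json
import tempfile
from pathlib import Path


def write_eval_subsets(evals: dict[str, list[dict]], root: Path) -> str:
    """Write one JSONL file per eval and return README YAML front matter
    that lists each eval as a separate subset (config)."""
    lines = ["---", "configs:"]
    for name in sorted(evals):
        rel_path = f"data/{name}/samples.jsonl"  # hypothetical layout
        out_file = root / rel_path
        out_file.parent.mkdir(parents=True, exist_ok=True)
        with out_file.open("w", encoding="utf-8") as f:
            for sample in evals[name]:
                f.write(json.dumps(sample) + "\n")
        lines += [f"- config_name: {name}", f"  data_files: {rel_path}"]
    lines.append("---")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two tiny, made-up evals standing in for the 10-20 sample evals.
    demo = {
        "arithmetic": [{"input": "1+1", "ideal": "2"}],
        "capitals": [{"input": "France", "ideal": "Paris"}],
    }
    with tempfile.TemporaryDirectory() as tmp:
        print(write_eval_subsets(demo, Path(tmp)))
```

With a layout like this, each eval stays a small file while the whole collection lives in one dataset repository, and a single subset could then be requested by name when loading.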

@EwoutH
EwoutH commented Jun 6, 2023

Has there been any further consideration of a Hugging Face dataset mirror?
