Dataset hosting, data cards and previews #139

lhoestq · 2023-03-15T11:42:23Z

Hi, I'm Quentin from Hugging Face :)

I know hosting datasets on GitHub is not always practical: Git LFS is required, there is no data preview, storage is limited (maybe not for you haha), and there is no standard for data documentation. So I was wondering:

Have you considered hosting alternatives that are better suited to datasets and would let researchers explore the eval datasets?

This way researchers can know in depth what data is used for evaluation, along with its goals and limitations, in particular to better understand which domains and structures their models perform well or poorly on.

For example, the Hugging Face Datasets Hub shows a data card for documentation and a preview for each dataset. Loading and caching a dataset is also one line of Python, which saves you from wget and GitHub hosting. The Hub also supports pull requests so the community can contribute.

It would even make it possible to use those datasets in other well-known eval frameworks, such as lm-evaluation-harness.
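As a rough illustration of the "one line of Python" point, here is a minimal sketch using the `datasets` library. The dataset name (`glue`), subset (`mrpc`) and split are only an illustrative public example, not one of the eval datasets discussed here, and the `load_subset` wrapper is a hypothetical helper added for readability.

```python
def load_subset(name: str, subset: str, split: str):
    """Download (on first call) and cache one subset of a Hub dataset."""
    # Imported lazily so this module can be imported without `datasets` installed.
    from datasets import load_dataset

    # This single call is the "one line": it fetches the files from the Hub,
    # caches them locally, and returns the requested split.
    return load_dataset(name, subset, split=split)


if __name__ == "__main__":
    # Illustrative names only; requires `pip install datasets` and network access.
    ds = load_subset("glue", "mrpc", "validation")
    print(ds[0])  # first example of the split, as a dict
```

Later calls with the same arguments should reuse the local cache instead of downloading again.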

Let me know what you think !

logankilpatrick · 2023-03-20T21:18:52Z

I love Hugging Face! Worth considering at some point soon; we will explore this internally over the next few weeks.

andrew-openai · 2023-03-30T00:29:12Z

Thanks for stopping by! HuggingFace datasets is great.

Many of our evals are only a few samples long (10-20), which we worried was too small to host as individual datasets on Hugging Face Datasets. We needed a platform that supports lots of small datasets, which is why LFS seemed to work OK for our task.

If you think this is still a reasonable use case for HuggingFace Datasets, I'd be happy to help any efforts in mirroring them onto HuggingFace!

polinaeterna · 2023-03-30T12:35:37Z

@logankilpatrick @andrew-openai hi, I'm Polina from the HF datasets team :) Regarding your concern about many small datasets: in datasets it's possible to host more than one dataset as a single dataset with many subsets, as is done for benchmarks like glue.
Also feel free to ping me if you have any questions about adding datasets to the Hub :)
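To make the "one dataset with many subsets" idea concrete, here is a minimal sketch that declares one subset per small eval file. It assumes the Hub's dataset-card YAML `configs` field (with `config_name` and `data_files` entries) is used to define subsets; the `build_configs_yaml` helper and the `.jsonl` file layout are hypothetical and not taken from this thread.

```python
from pathlib import Path


def build_configs_yaml(data_dir: Path) -> str:
    """Return a YAML front-matter block declaring one subset per *.jsonl file."""
    lines = ["---", "configs:"]
    # Sort so the generated card is stable across runs.
    for path in sorted(data_dir.glob("*.jsonl")):
        # The file stem becomes the subset name (assumed layout: one eval per file).
        lines.append(f"- config_name: {path.stem}")
        lines.append(f'  data_files: "{path.name}"')
    lines.append("---")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical directory holding many small eval files.
    print(build_configs_yaml(Path("evals_data")))
```

The generated block would go at the top of the dataset repository's README.md, so that each small eval can be loaded by its subset name.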

EwoutH · 2023-06-06T18:16:15Z

Has any further consideration taken place over a Hugging Face dataset mirror?
