Dataset hosting, data cards and previews #139
I love Hugging Face! Worth considering at some point soon, will explore internally over the next few weeks.
Thanks for stopping by! Hugging Face Datasets is great. Many of our evals are only a few samples long (10-20), which we worried might be too small to host as individual datasets on Hugging Face Datasets. We needed a platform that supports lots of small datasets, which is why LFS seemed to work OK for our task. If you think this is still a reasonable use case for Hugging Face Datasets, I'd be happy to help any efforts to mirror them onto Hugging Face!
@logankilpatrick @andrew-openai hi, I'm Polina from the HF datasets team :) regarding your concern about many small datasets - in
Has there been any further consideration of a Hugging Face dataset mirror?
Hi, I'm Quentin from Hugging Face :)
I know hosting datasets on GitHub is not always practical: Git LFS is required, there is no data preview, storage is limited (maybe not for you haha), and there is no standard for data documentation. So I was wondering: have you considered hosting alternatives better suited for datasets, which would let researchers explore the datasets in `evals`? That way researchers can know in depth what data is used for evaluation, along with its goals and limitations, and in particular better understand which domains and structures their models perform well or poorly on.
For example, the Hugging Face datasets hub shows a data card for documentation and a preview for each dataset. Also, loading and caching a dataset is one line of Python, saving you from `wget` and GitHub hosting. It also supports pull requests, so the community can contribute. It would even make it possible to use those datasets in other well-known eval frameworks, such as lm-evaluation-harness.
Let me know what you think!