Adding GermanQuAD and GermanDPR to HF datasets #1073

Closed
Timoeller opened this issue May 19, 2021 · 3 comments

@Timoeller
Contributor

Let's add our data to HF datasets, with info from our landing page and links to Haystack.

@julian-risch
Member

@Timoeller could you please review this so that we can close this issue?

Both datasets are uploaded, have dataset cards with links to our landing page and to Haystack, and are linked to the models we trained on them. The main effort was writing a _generate_examples method that yields records from the datasets, e.g., here: https://huggingface.co/datasets/deepset/germanquad/blob/main/germanquad.py
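
For readers unfamiliar with loading scripts, here is a minimal sketch of the kind of generator described above. It assumes a SQuAD-style nested JSON layout (data, then paragraphs, then qas), which this thread does not confirm for GermanQuAD. The standalone function name `generate_examples` and the one-record sample input are invented for illustration; the linked germanquad.py remains the authoritative version.

```python
import json
from typing import Any, Dict, Iterator, Tuple


def generate_examples(squad_json: str) -> Iterator[Tuple[int, Dict[str, Any]]]:
    """Yield (key, record) pairs from a SQuAD-style JSON string.

    Assumption: the layout is data -> paragraphs -> qas, as in SQuAD.
    """
    data = json.loads(squad_json)
    for article in data["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                # Flatten the list of answer dicts into two parallel lists,
                # so each record carries "text" and "answer_start" side by side.
                yield qa["id"], {
                    "id": qa["id"],
                    "context": context,
                    "question": qa["question"],
                    "answers": {
                        "text": [a["text"] for a in qa["answers"]],
                        "answer_start": [a["answer_start"] for a in qa["answers"]],
                    },
                }


# Hypothetical one-record input, invented for illustration only.
SAMPLE = json.dumps({
    "data": [{
        "paragraphs": [{
            "context": "Berlin ist die Hauptstadt von Deutschland.",
            "qas": [{
                "id": 1,
                "question": "Was ist die Hauptstadt von Deutschland?",
                "answers": [{"text": "Berlin", "answer_start": 0}],
            }],
        }],
    }],
})

records = list(generate_examples(SAMPLE))
```

Yielding a key together with each record mirrors how generator-based loading scripts usually hand examples to the library, but the exact keys and fields used for GermanQuAD should be checked against the linked script.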

Usage after installing HF datasets via pip install datasets:

from datasets import load_dataset
dataset = load_dataset("deepset/germanquad", split="train")
dataset[0]
{'answers': {'answer_start': [146], 'text': ['britischen Common Laws']}, 'context': "Recht_der_Vereinigten_Staaten\n\n=== Amerikanisches Common Law ===\nObwohl die ...", 'id': 51870, 'question': 'Von welchem Gesetzt stammt das Amerikanische ab? '}
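
Assuming the record follows the SQuAD convention, each `answer_start` is a character offset into `context`. The following is a minimal sketch of checking that alignment. It uses a hand-written record rather than real dataset output, and the helper name `answer_spans_match` is hypothetical.

```python
from typing import Any, Dict


def answer_spans_match(record: Dict[str, Any]) -> bool:
    """Return True if every answer text occurs in context at its answer_start."""
    context = record["context"]
    answers = record["answers"]
    return all(
        context[start:start + len(text)] == text
        for start, text in zip(answers["answer_start"], answers["text"])
    )


# Hand-written record with the same keys as a GermanQuAD record; not real data.
example = {
    "id": 0,
    "context": "Berlin ist die Hauptstadt von Deutschland.",
    "question": "Was ist die Hauptstadt von Deutschland?",
    "answers": {"answer_start": [30], "text": ["Deutschland"]},
}

is_aligned = answer_spans_match(example)
```

The same helper could be applied to `dataset[0]` after loading, as a quick sanity check that the offsets survived the conversion.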

Limitations
There are two ways to share datasets on HF datasets: community-provided datasets and canonical datasets.
I chose the first option so that the dataset is identified under our organization's namespace: deepset/germanquad
However, the dataset viewer/explorer seems to be limited to the second option: https://huggingface.co/datasets/viewer/?dataset=deepset/germanquad does not work.
See details here: https://huggingface.co/docs/datasets/share_dataset.html

@Timoeller
Contributor Author

Let's have the datasets under our namespace. I love the summaries.

Two optional improvements could be: adding our GermanQuAD picture to the model card, and removing the second question in the example JSON for GermanQuAD (the JSON is pretty hard to read).

But I'm fine with closing this issue as well.

@julian-risch
Member

julian-risch commented May 21, 2021

Good points, thanks. I added the picture and removed the second question.
