Add WikiANN NER dataset #1080

Merged on Dec 6, 2020 (11 commits)

lewtun (Member) commented on Dec 3, 2020

This PR adds the full set of 176 languages from the balanced train/dev/test splits of WikiANN / PAN-X from: https://github.com/afshinrahimi/mmner

Until now, only 40 of these languages were available in `datasets`, as part of the XTREME benchmark.

Courtesy of the dataset author, we can now download this dataset from a Dropbox URL without needing a manual download anymore 🥳, so at some point it would be worth updating the PAN-X subset of XTREME as well 😄
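For readers who want to try the new per-language configs, here is a minimal sketch. It assumes the dataset is registered under the name `wikiann` with one config per language code (for example `en`), that each example has a `tokens` list and an integer `ner_tags` list, and that the label order is the IOB2 list shown below. None of these names are stated in this thread, so check them against the dataset card. `load_wikiann` and `tag_histogram` are illustrative helpers, not part of `datasets`.

```python
from collections import Counter

# Assumed IOB2 label order for WikiANN; verify against the dataset's features.
LABELS = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG", "B-LOC", "I-LOC"]


def load_wikiann(lang):
    """Load one language config, e.g. load_wikiann("en"). Needs network access."""
    # Imported lazily so the offline helper below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("wikiann", lang)  # dataset/config names are assumptions


def tag_histogram(examples, labels=LABELS):
    """Count how often each NER label occurs across an iterable of examples."""
    counts = Counter()
    for example in examples:
        counts.update(labels[tag_id] for tag_id in example["ner_tags"])
    return counts


# Offline demo on two hand-written examples in the assumed schema.
demo = [
    {"tokens": ["Ada", "Lovelace", "lived", "in", "London"], "ner_tags": [1, 2, 0, 0, 5]},
    {"tokens": ["Hugging", "Face"], "ner_tags": [3, 4]},
]
print(tag_histogram(demo))
```

With network access, something like `tag_histogram(load_wikiann("en")["train"])` should give the label distribution for one language, provided the assumed schema holds.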

Link to gist with some snippets for producing dummy data: https://gist.github.com/lewtun/5b93294ab6dbcf59d1493dbe2cfd6bb9
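The gist itself is not reproduced here. As a rough illustration of the general idea, the sketch below trims a file of blank-line-separated sentences (one token per line) down to its first few sentences, which is one common way to build small dummy files. The `lang:token<TAB>tag` line layout in the sample and the helper name `head_sentences` are assumptions for illustration, not taken from the gist.

```python
def head_sentences(text, n=5):
    """Return the first n blank-line-separated sentence blocks of `text`."""
    # Drop empty blocks and stray newlines so the output is cleanly delimited.
    blocks = [b.strip("\n") for b in text.split("\n\n") if b.strip()]
    return "\n\n".join(blocks[:n]) + "\n"


# Hypothetical sample in an assumed "lang:token<TAB>tag" layout.
sample = "en:Ada\tB-PER\nen:Lovelace\tI-PER\n\nen:London\tB-LOC\n\nen:hello\tO\n"
print(head_sentences(sample, n=2))
```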

P.S. @yjernite I think I was confused about needing to generate a set of YAML tags per config, so ended up just adding a single one in the README.

lewtun changed the title from "WIP: Add WikiANN NER dataset" to "Add WikiANN NER dataset" on Dec 3, 2020

lewtun (Member, Author) commented on Dec 3, 2020

Dataset card added, so ready for review!

lhoestq (Member) left a comment

Really awesome, thank you!
I just left a comment about the language tags.

Review thread on datasets/wikiann/README.md (outdated, resolved)

lhoestq (Member) left a comment

Looks all good, thanks!

lhoestq merged commit d843a82 into huggingface:master on Dec 6, 2020
abecadel pushed a commit to abecadel/datasets that referenced this pull request Dec 7, 2020
* Add first pass at WikiANN

* Add dataset_infos for WikiANN

* ✨ Replace manual download with URL!

* Promote version number to variable

* Add comment about Dropbox download URL

* Add dummy data

* Update dataset_infos.json

* Fix style

* Add dataset card

* Fix language tags and add bibtex citations to dataset card
ophelielacroix pushed a commit to ophelielacroix/datasets that referenced this pull request Dec 8, 2020