Add multilingual tokenization for ROUGE #79

Draft: wants to merge 1 commit into master

@jon-tow (Collaborator) commented Jun 1, 2022

  • Adds support for multilingual ROUGE scoring by providing language-specific tokenization via nltk.

  • Adds a code_to_pycountry_lang utility that maps ISO codes to pycountry.db.Language objects for robust language name parsing.

  • Removes rougeLsum from the default rouge_types argument, because sentences are not separated by newlines, which breaks the assumption rouge_scorer makes.
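To illustrate the newline assumption behind rougeLsum, here is a minimal sketch using only the standard library. The helper name `split_for_rouge_lsum` and the regex-based splitter are assumptions for illustration only; the regex is a naive stand-in for a real sentence tokenizer and is not part of this PR.

```python
import re


def split_for_rouge_lsum(text: str) -> str:
    """Return the text with one sentence per line (hypothetical helper)."""
    # Naive stand-in for a real sentence tokenizer: break after '.', '!'
    # or '?' when followed by whitespace. This is an assumption made for
    # the sketch and will mis-split abbreviations such as "e.g. this".
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(s for s in sentences if s)


summary = "The cat sat. The dog barked! Did it rain?"

# Without newlines, a newline-based splitter sees a single "sentence".
sentences_before = summary.split("\n")

# After preprocessing, each sentence sits on its own line.
sentences_after = split_for_rouge_lsum(summary).split("\n")

print(len(sentences_before), len(sentences_after))
```

With the input above, the unprocessed summary is treated as one sentence, while the preprocessed one yields three lines, which is the shape a newline-based sentence splitter expects.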

TODO

  • Add sentence-level tokenization (possibly with nltk.sent_tokenize?). As mentioned above, rouge-score==0.0.4 (the latest package release) expects sentences to be split by newlines when computing the rougeLsum score. The latest version on their master branch supports automatic sentence splitting. Unfortunately, that repo is not pip-installable: a module named tokenize.py at the project root overrides a module of the same name in pip's setuptools dependency, which breaks the installation.

  • Find a clean abstraction for tagging non-English PromptSourceTasks with their language. This tag could then be used to construct the multilingual NltkWordTokenizer that gets passed into rouge and other metrics that may need multilingual support in the future. Possibly use promptsource's language tagging: Language tags promptsource#771
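One possible shape for the language-tagging abstraction is sketched below. Everything here is hypothetical: `Task`, `LanguageAwareTokenizer`, `tokenizer_for`, and the two backend functions are illustrative names, not the classes in this PR or in promptsource. The only real library call is `nltk.word_tokenize(text, language=...)`, which is imported lazily so the sketch runs without nltk installed as long as that backend is not used.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


def nltk_word_tokenize(text: str, language: str) -> List[str]:
    """Backend that defers to nltk (needs nltk and its Punkt data)."""
    import nltk  # lazy import so the rest of the sketch has no hard dependency

    return nltk.word_tokenize(text, language=language)


def whitespace_tokenize(text: str, language: str) -> List[str]:
    """Dependency-free fallback backend; ignores the language."""
    return text.split()


@dataclass
class LanguageAwareTokenizer:
    """Hypothetical stand-in for a multilingual word tokenizer."""

    language: str
    backend: Callable[[str, str], List[str]]

    def tokenize(self, text: str) -> List[str]:
        return self.backend(text, self.language)


@dataclass
class Task:
    """Hypothetical task carrying an optional language tag."""

    name: str
    language: Optional[str] = None


def tokenizer_for(
    task: Task,
    backend: Callable[[str, str], List[str]] = nltk_word_tokenize,
    default_language: str = "english",
) -> LanguageAwareTokenizer:
    """Build a tokenizer from the task's tag, falling back to English."""
    return LanguageAwareTokenizer(task.language or default_language, backend)


# Usage with the dependency-free backend.
german_task = Task(name="example_de", language="german")
tokenizer = tokenizer_for(german_task, backend=whitespace_tokenize)
print(tokenizer.language, tokenizer.tokenize("Guten Morgen Welt"))
```

Keeping the tag on the task and the tokenizer construction in one factory means other metrics that need multilingual support could reuse the same lookup.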

@Muennighoff

Can we still use the current ROUGE score in LMEVAL for languages that don't separate words with spaces?
It seems to me that PaLM (https://arxiv.org/pdf/2204.02311.pdf) used it for many languages other than English.

Also related: ROUGE scores are on a 0-1 scale and BLEU on a 0-100 scale in LMEVAL, right?
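
For context on the non-space-language question, the sketch below shows why tokenization matters for a unigram-overlap score. It is not the LMEVAL or rouge-score implementation: `unigram_f1` is a hypothetical simplified ROUGE-1-style F1, whitespace splitting stands in for a space-based tokenizer, and character splitting stands in for a language-aware one.

```python
from collections import Counter
from typing import List


def unigram_f1(reference_tokens: List[str], candidate_tokens: List[str]) -> float:
    """Simplified ROUGE-1-style F1 over clipped unigram counts (0-1 scale)."""
    overlap = sum((Counter(reference_tokens) & Counter(candidate_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


# Two Chinese sentences that differ in a single character.
reference = "猫坐在垫子上"
candidate = "猫躺在垫子上"

# Whitespace splitting treats each sentence as one token, so nothing overlaps.
whitespace_score = unigram_f1(reference.split(), candidate.split())

# Character-level splitting exposes the five shared characters out of six.
char_score = unigram_f1(list(reference), list(candidate))

print(whitespace_score, round(char_score, 4))
```

Under these assumptions the whitespace-based score is 0.0 while the character-based score is 5/6, so a space-based tokenizer can hide nearly all of the overlap for such languages.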
