Add multilingual tokenization for ROUGE #79
Draft
- Adds support for multilingual ROUGE scoring by providing language-specific tokenization via `nltk`.
- Adds a `code_to_pycountry_lang` utility that maps ISO codes to `pycountry.db.Language` objects for robust language name parsing.
- Removes `rougeLsum` from the default `rouge_types` arg, as sentences are not separated by newlines, which breaks the `rouge_scorer` assumption.

TODO
- Add sentence-level tokenization (possibly use `nltk.sent_tokenize`?). As mentioned above, `rouge-score==0.0.4` (the latest package release) expects sentences to be split by newlines to compute the `rougeLsum` score. The latest version on their master branch contains automatic sentence splitting support. Unfortunately, that repo is not pip installable because a module named `tokenize.py` at the project root level overrides a module of the same name in pip's `setuptools` dependency, breaking the installation.
- Find a clean abstraction for tagging non-English `PromptSourceTask`s with their language. This tag could then be used to construct the multilingual `NltkWordTokenizer` that gets passed into ROUGE and other metrics that may need multilingual support in the future. Possibly use `promptsource`'s language tagging: Language tags promptsource#771
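
For the sentence-level tokenization item, a minimal sketch of the preprocessing step might look like the following. It is not code from this PR: `split_sentences` and `prepare_for_rouge_lsum` are hypothetical names, and the regex splitter is only a naive stand-in for a real sentence tokenizer such as `nltk.sent_tokenize`. The idea is simply to put one sentence per line before scoring, since `rougeLsum` treats newlines as sentence boundaries.

```python
import re

# Naive sentence boundary: '.', '!' or '?' followed by whitespace.
# ASSUMPTION: this heuristic is only a placeholder for a real,
# language-aware splitter such as nltk.sent_tokenize.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences using a naive punctuation heuristic."""
    pieces = _SENTENCE_BOUNDARY.split(text.strip())
    return [piece.strip() for piece in pieces if piece.strip()]


def prepare_for_rouge_lsum(text, sentence_splitter=split_sentences):
    """Return text with one sentence per line.

    `sentence_splitter` is any callable mapping a string to a list of
    sentences, so a language-specific splitter can be swapped in.
    """
    return "\n".join(sentence_splitter(text))
```

Keeping the splitter as a parameter means a language-aware one could be plugged in later, for example `lambda t: nltk.sent_tokenize(t, language="german")`, assuming the Punkt models for that language are installed. Both the prediction and the reference would need the same treatment before being passed to the scorer.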