feat: pass `use_fast` parameter to `get_tokenizer` #106

nikitajz · 2021-07-16T13:32:38Z

A proposed solution for issue #105.
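For readers without the diff at hand, here is a minimal sketch of the kind of change the title describes: forwarding a `use_fast` flag to the tokenizer loader, with the version guard named in the later commit "fix: assert not using fast tokenizer for version < 4.0.0". The function name `get_tokenizer_sketch` and the `library_version` and `loader` parameters are hypothetical and exist only so the example runs without downloading a model; this is not the merged code.

```python
from typing import Any, Callable


def major_version(version: str) -> int:
    """Return the leading integer of a dotted version string, e.g. '4.8.2' -> 4."""
    return int(version.split(".")[0])


def get_tokenizer_sketch(
    model_type: str,
    use_fast: bool = False,
    *,
    library_version: str,        # hypothetical: injected instead of read from the library
    loader: Callable[..., Any],  # hypothetical: stands in for the real tokenizer loader
) -> Any:
    """Forward `use_fast` to `loader` when the library is new enough."""
    if major_version(library_version) >= 4:
        return loader(model_type, use_fast=use_fast)
    # Older releases: refuse a fast tokenizer rather than silently ignoring the flag.
    assert not use_fast, "Fast tokenizer is not available for version < 4.0.0"
    return loader(model_type)
```

In real use, `loader` would presumably be `transformers.AutoTokenizer.from_pretrained`, which accepts a `use_fast` keyword. The PR also adds `packaging>=20.9`, which suggests the actual version comparison relies on that package rather than on the simple integer parse used here.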

felixgwu

Looks great! There are only a few tiny changes that need to be made.

bert_score/utils.py

tests/test_score_function.py

nikitajz · 2021-07-19T19:46:55Z

@felixgwu after a few attempts, the tests are still failing with OOM (OSError: [Errno 12] Cannot allocate memory), even the old tests that don't use a fast tokenizer. I saw a related issue (but lost the link) reporting that Travis CI decreased its memory limit from 7 GB to 3.5 GB, which might be the cause. The tests pass on my laptop, though (except for one unrelated test that fails because the model 'scibert-scivocab-uncased' is commented out, as I mentioned earlier).
Apart from that, get_idf_dict seems worth refactoring (at least for fast tokenizers) to use tokenizer.encode_batch instead of parallelizing tokenizer.encode; see the related issue.
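The refactoring idea in the comment above can be sketched roughly as follows: encode the whole corpus in one batched call and count document frequencies, instead of spreading single-sentence `encode` calls across worker processes. The names `idf_from_batch` and `toy_batch_encode` are hypothetical, and the smoothed formula `log((N + 1) / (df + 1))` is an assumption about what `get_idf_dict` computes, not a quote from the repository.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence


def idf_from_batch(
    texts: Sequence[str],
    batch_encode: Callable[[List[str]], List[List[int]]],
) -> Dict[int, float]:
    """Compute smoothed IDF weights from a single batched tokenization call."""
    # One batched call replaces a pool of per-sentence encode() calls.
    encoded = batch_encode(list(texts))
    doc_freq: Counter = Counter()
    for token_ids in encoded:
        # Count each token at most once per document.
        doc_freq.update(set(token_ids))
    num_docs = len(encoded)
    # Smoothed IDF; the exact formula is an assumption for this sketch.
    return {tok: math.log((num_docs + 1) / (df + 1)) for tok, df in doc_freq.items()}


def toy_batch_encode(batch: List[str]) -> List[List[int]]:
    """Stand-in tokenizer: whitespace split, ids assigned in order of first appearance."""
    vocab: Dict[str, int] = {}
    return [[vocab.setdefault(word, len(vocab)) for word in text.split()] for text in batch]
```

With a Hugging Face tokenizer, `batch_encode` could plausibly be `lambda batch: tokenizer(batch)["input_ids"]`; with the standalone `tokenizers` library, `encode_batch` returns encodings whose `.ids` attribute holds the token ids. Either way, a fast tokenizer handles batches natively, which avoids the per-process memory overhead of a multiprocessing pool.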

felixgwu

Everything looks good to me now. I can either merge it or wait if you would like to spend more time on fixing the Travis CI. I really appreciate your help!

nikitajz · 2021-07-19T20:52:33Z

OK, let me see if I can devote more time to fixing the tests; otherwise, feel free to merge it.
Also, I'd recommend changing the model in the tests from roberta-large to something lighter, e.g. distilroberta-base.

nikitajz · 2021-07-20T19:10:00Z

I've refactored the tests a bit. I also added the missing assert statement to test_multi_refs_working; please check that it is correct. I temporarily added a skip to test_score_en_sci, which fails because its model is commented out in utils. All tests pass locally.
Unfortunately, I couldn't quickly fix the failing tests on Travis CI. As far as I can tell, there are two options:

migrate to another CI host with more powerful VMs in its free tier
replace roberta-large in the tests with a lightweight model such as distilroberta-base; this also requires changing the behaviour of some tests, e.g. test_multi_refs, where the model is not specified explicitly but chosen from the specified language. Hence I didn't proceed with this option.
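As an illustration of the temporary skip mentioned in this comment, a test can be disabled with a recorded reason so that it still shows up in the report. This is a sketch assuming the suite is built on `unittest`; the class name `ScoreTestSketch`, the placeholder test, and the `run_sketch` helper are hypothetical.

```python
import unittest


class ScoreTestSketch(unittest.TestCase):
    @unittest.skip("scibert-scivocab-uncased is commented out in utils")
    def test_score_en_sci(self):
        # Not reached while the skip decorator is in place.
        self.fail("should have been skipped")

    def test_placeholder(self):
        # Stands in for the tests that keep running.
        self.assertTrue(True)


def run_sketch() -> unittest.TestResult:
    """Run the sketch suite and return the collected result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ScoreTestSketch)
    result = unittest.TestResult()
    suite.run(result)
    return result
```

The skip reason is stored in the result, so the disabled test stays visible and is easy to re-enable once the model is restored.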

felixgwu · 2021-07-21T04:37:03Z

It looked good to me, so I just merged them. I would like to keep using roberta-large in the tests for now since it is the default model. Thanks again for your help!

nikitajz added 4 commits July 16, 2021 16:28

feat: pass use_fast parameter to get_tokenizer

06afee9

feat: use_fast tokenizer, remaining funcs + test

4207125

chore: change installation packages order

9275572

chore: add packaging>=20.9

4c7aa1b

felixgwu requested changes Jul 19, 2021

bert_score/utils.py

tests/test_score_function.py

nikitajz added 3 commits July 19, 2021 21:34

chore: disable tokenizers parallelism for travis CI

489640f

fix: assert not using fast tokenizer for version < 4.0.0

6bc4e09

test: revert back original test and add a new one for fast tokenizer

c9b2c33

felixgwu marked this pull request as ready for review July 19, 2021 19:59

felixgwu approved these changes Jul 19, 2021

nikitajz added 3 commits July 20, 2021 21:08

test: decrease value, nthreads=2

f104068

Revert "test: decrease value, nthreads=2"

33e4402

This reverts commit f104068

tests: refactor asserts

db10677

felixgwu merged commit 2a40716 into Tiiiger:master Jul 21, 2021

felixgwu mentioned this pull request Jul 21, 2021

Slow tokenizer is used by default #105

Closed
