
Optimizing bert_cos_score_idf #69

Merged
2 commits merged into Tiiiger:master on Jul 4, 2020
Conversation

@ethanjperez (Contributor) commented on Jul 3, 2020

  1. Pad BERT embeddings on the GPU instead of the CPU. Padding on the CPU is the bottleneck in computing the greedy matching, so padding on the GPU speeds up the matching by ~3x for me. Moving tensors to the GPU then becomes the bottleneck, but moving the un-padded tensors also takes ~2x less time, I think because you don't have to transfer a bunch of padding values. So overall I get a ~6x speed-up on the sequences I'm evaluating.
  2. Use `torch.no_grad()` when computing the greedy matching to save memory. I was able to increase the batch size for the greedy matching by 2x after doing this. I'm not sure whether increasing the batch size here will cause OOMs for others, though, so it might be worth someone else checking/trying it out (or just removing the batch size increase). (A sketch of both changes is below.)

Edit: I ran into some OOMs with the batch size increase, so I removed that part.
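
For illustration, here is a minimal PyTorch sketch of both ideas. The tensor names, shapes, and example inputs are assumptions for this sketch, not the actual `bert_cos_score_idf` code, and a real implementation would also mask padded positions (and apply IDF weights) when scoring:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

device = "cuda" if torch.cuda.is_available() else "cpu"

# Hypothetical per-sentence BERT embeddings: a list of (seq_len_i, hidden) CPU tensors.
cand_embs = [torch.randn(5, 768), torch.randn(9, 768)]
ref_embs = [torch.randn(7, 768), torch.randn(6, 768)]

# (1) Move the un-padded tensors to the GPU first, then pad there.
#     Transferring the smaller un-padded tensors is cheaper than transferring
#     padded ones, and the padding itself runs faster on the GPU.
cand = pad_sequence([e.to(device) for e in cand_embs], batch_first=True)
ref = pad_sequence([e.to(device) for e in ref_embs], batch_first=True)

# (2) Wrap the greedy matching in torch.no_grad() so no autograd graph is kept,
#     which saves memory during scoring.
with torch.no_grad():
    cand = cand / cand.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    ref = ref / ref.norm(dim=-1, keepdim=True).clamp(min=1e-8)
    sim = torch.bmm(cand, ref.transpose(1, 2))      # (batch, cand_len, ref_len)
    precision = sim.max(dim=2).values.mean(dim=1)   # greedy match: candidate -> reference
    recall = sim.max(dim=1).values.mean(dim=1)      # greedy match: reference -> candidate
```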

@Tiiiger (Owner) commented on Jul 3, 2020

hi @ethanjperez, thank you for the contribution! I'll test this and merge afterwards.

@Tiiiger (Owner) commented on Jul 3, 2020

@ethanjperez actually I tested on the 3003 reference-candidate pairs (examples/hyps_long.txt and examples/refs_long.txt) and I didn't observe any significant speedup.

What are your test sequences like?

I wonder if you are testing on many more pairs?

@ethanjperez (Contributor, Author) commented on Jul 4, 2020

I'm using long sequences (many sentences), and I'm also doing leave-one-out reference evaluation. E.g., I have 10 references, and I want to evaluate each reference against all the others (10 x 9 = 90 pairs). So in my situation I need many more pairwise evaluations than BERT forward passes, which made the matching the slowest part. (The ~6x speed-up I found was only for the matching step specifically; I think that part is already pretty fast for normal MT evaluation.)
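
For concreteness, a toy sketch of that leave-one-out pairing (the reference strings here are placeholders, not real data):

```python
from itertools import permutations

# 10 hypothetical references; score each one against every other reference.
refs = [f"reference_{i}" for i in range(10)]
pairs = list(permutations(refs, 2))   # ordered (candidate, reference) pairs
assert len(pairs) == 10 * 9           # 90 pairwise evaluations
```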

This change is just a suggestion that helped me (it's mostly useful when you have a lot of pairs, which isn't the standard case), so feel free to ignore the PR too :)

Tiiiger merged commit 4c10f36 into Tiiiger:master on Jul 4, 2020
@Tiiiger (Owner) commented on Jul 4, 2020

I see. I think these are reasonable changes. I am going to merge it.
