[SCRIPT] Improve BERT fine-tuning script with AdamW optimizer, bucketing and gradient accumulation #482
Conversation
Job PR-482/3 is complete.
Codecov Report
@@ Coverage Diff @@
## master #482 +/- ##
==========================================
- Coverage 71.51% 71.19% -0.32%
==========================================
Files 119 118 -1
Lines 9998 9979 -19
==========================================
- Hits 7150 7105 -45
- Misses 2848 2874 +26
Codecov Report
@@ Coverage Diff @@
## master #482 +/- ##
==========================================
+ Coverage 66.98% 71.11% +4.13%
==========================================
Files 122 121 -1
Lines 10667 10040 -627
==========================================
- Hits 7145 7140 -5
+ Misses 3522 2900 -622
@eric-haibin-lin could you check this out-of-memory error? (tested on my K80 GPU)
Wow, it's really fast to train. Great!
@haven-jeon Thanks for pointing that out! I can reproduce the same OOM error on K80. If I do not use the
@eric-haibin-lin
parser.add_argument('--max_len', type=int, default=128,
                    help='Maximum length of the sentence pairs, default is 128')
parser.add_argument('--seed', type=int, default=2, help='Random seed, default is 2')
parser.add_argument('--accumulate', type=int, default=None, help='The number of batches for '
consider adding a message for the actual effective batch size based on this option, when it's set.
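A minimal sketch of what such a message could look like (the `--batch_size` flag and the exact wording here are illustrative, not taken from the actual script):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--batch_size', type=int, default=32)
parser.add_argument('--accumulate', type=int, default=None,
                    help='The number of batches for gradient accumulation')
args = parser.parse_args(['--batch_size', '32', '--accumulate', '4'])

# Report the effective batch size when gradient accumulation is enabled,
# so users understand what batch size the optimizer actually sees.
if args.accumulate:
    effective = args.batch_size * args.accumulate
    print('Using gradient accumulation: effective batch size = %d' % effective)
```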
Added
scripts/bert/finetune_classifier.py
Outdated
                                     args.max_len, pad=False)
dev_trans = ClassificationTransform(tokenizer, MRPCDataset.get_labels(), args.max_len)

data_train = MRPCDataset('train').transform(train_trans, lazy=False)
consider modularizing the data pipeline by making some parts into a function. this helps avoid leaking variables into global context.
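A rough sketch of the suggested refactor, using stand-in transforms and data instead of the real `tokenizer`/`ClassificationTransform`/`MRPCDataset` objects (the function name is hypothetical):

```python
def build_data_pipeline(max_len):
    """Build train/dev datasets inside a function so that intermediate
    objects (tokenizer, transforms, raw datasets) stay local and do not
    leak into the module's global namespace."""
    # Stand-ins for the real transform objects
    train_trans = lambda s: s[:max_len]
    dev_trans = lambda s: s[:max_len]
    # Stand-ins for the real MRPC train/dev splits
    raw_train = ['a b c', 'd e']
    raw_dev = ['f g']
    data_train = [train_trans(s) for s in raw_train]
    data_dev = [dev_trans(s) for s in raw_dev]
    return data_train, data_dev

data_train, data_dev = build_data_pipeline(max_len=128)
```

Only the returned datasets are visible at module scope; everything else is garbage-collected when the function returns.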
# Set grad_req if gradient accumulation is required
if accumulate:
    for p in model.collect_params().values():
        p.grad_req = 'add'
this seems wrong... it's setting all parameters to add even for those that had grad_req = 'null'
try
params = [p for p in model.collect_params().values() if p.grad_req != 'null']
if accumulate:
    for p in params:
        p.grad_req = 'add'
You're right. Constant params have grad_req='null' and their req cannot be updated. But if the user sets some grad_req to 'null', those would be overridden. Updated now.
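One subtlety worth noting with `grad_req = 'add'`: gradients keep accumulating across batches and never get overwritten, so they must be reset to zero manually after each optimizer step. A minimal sketch of the pattern, with a plain float standing in for a parameter's gradient buffer:

```python
# Accumulate-then-step pattern required when grad_req = 'add'.
accumulate = 4
grad = 0.0                    # stand-in for one parameter's gradient buffer
steps = 0
for batch_id in range(8):     # 8 micro-batches -> 2 optimizer steps
    grad += 1.0               # backward() adds into the buffer under 'add'
    if (batch_id + 1) % accumulate == 0:
        steps += 1            # trainer.step(...) would go here
        grad = 0.0            # zero the buffers; 'add' never clears them
```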
scripts/bert/finetune_classifier.py
Outdated
# update
if not accumulate or (batch_id + 1) % accumulate == 0:
    grads = [p.grad(ctx) for p in params]
    gluon.utils.clip_global_norm(grads, 1)
this needs to be done properly. consider using the clip norm function from #470
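For reference, global-norm clipping rescales all gradients jointly by the total L2 norm across every array, rather than clipping each array on its own. A self-contained sketch of that behaviour using plain Python lists (not the actual helper from #470):

```python
import math

def clip_grad_global_norm(grads, max_norm):
    """Rescale all gradients in place so their combined L2 norm is at most
    max_norm; returns the pre-clipping global norm."""
    total_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = max_norm / (total_norm + 1e-12)
    if scale < 1.0:               # only shrink, never amplify
        for grad in grads:
            for i in range(len(grad)):
                grad[i] *= scale
    return total_norm

grads = [[3.0, 4.0], [0.0, 0.0]]  # global norm = 5
norm = clip_grad_global_norm(grads, 1.0)
```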
scripts/bert/finetune_classifier.py
Outdated
if __name__ == '__main__':
    pool_type = os.environ.get('MXNET_GPU_MEM_POOL_TYPE', '')
Given that static_alloc is on in the model, the mem pool type would not affect the model's internal workspaces, only the input and output arrays. In this case, you may want to tune MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF=x so that 2^x is on the same scale as the mode of the input/output array sizes.
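To make the suggestion concrete, here is a hedged sketch of how one might pick the cutoff for a typical MRPC batch; the sizes are illustrative, and the environment variable would need to be set before MXNet is imported for it to take effect:

```python
import math
import os

# Illustrative: typical input array size for batch 32, seq len 128, float32
typical_bytes = 32 * 128 * 4          # = 16384 bytes
# Choose x so that 2**x is on the same scale as typical_bytes
cutoff = round(math.log2(typical_bytes))
os.environ['MXNET_GPU_MEM_POOL_ROUND_LINEAR_CUTOFF'] = str(cutoff)
```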
It might be too much for users to understand how to set it properly... Would it be better to remove this option for now to keep the example simple?
Job PR-482/6 is complete.
Job PR-482/8 is complete.
This reverts commit 60b8fdd.
Job PR-482/10 is complete.
[SCRIPT] Improve BERT fine-tuning script with AdamW optimizer, bucketing and gradient accumulation (dmlc#482)
* use adam_w and bucketing
* add missing code
* add helper msg for OOM error
* move optimizer def to gluonnlp. also fix padding token
* update documentation
* fix lint
* address CR comments
* revert unintended changes
* remove mem pool env var
* Revert "remove mem pool env var" (this reverts commit 60b8fdd)
* remove mem pool env var
Description
Other ongoing BERT efforts may also benefit from this PR @fierceX @kenjewu @haven-jeon
Checklist
Essentials
Changes
Comments