Multitask finetuning #676

Closed. Wants to merge 146 commits.

Commits (146)
8789450
first commit, adding non causal mlm dataset
lintangsutawika Jun 8, 2022
646578d
Added prefix_len to get_batch
lintangsutawika Jun 8, 2022
8e726cb
changes
lintangsutawika Jun 13, 2022
25dec8e
added enums.py
lintangsutawika Jun 13, 2022
b7d6783
resolved merge conflicts
lintangsutawika Jun 13, 2022
035ae8b
added prefix attention calculation to utils
lintangsutawika Jun 14, 2022
a1c14a7
removed enums
lintangsutawika Jun 14, 2022
7d95765
added config for mlm
lintangsutawika Jun 14, 2022
b528969
added new args for mlm and added mlm dataset to data_utils
lintangsutawika Jun 14, 2022
c3830f7
added tokenizer as an arg
lintangsutawika Jun 14, 2022
9e3458c
remove get_tokenizer
lintangsutawika Jun 14, 2022
d1fabea
prefix_len to prefix
lintangsutawika Jun 14, 2022
9059902
added NonCausalMLMDataset to build_train_valid_test_datasets
lintangsutawika Jun 14, 2022
cd22016
fixed typo in conditional
lintangsutawika Jun 14, 2022
6d4be58
forgot conditional
lintangsutawika Jun 14, 2022
b729c0b
change mpu functions from Megatron to GPTNeoX
lintangsutawika Jun 14, 2022
dca68d0
tokenizer
lintangsutawika Jun 14, 2022
0be06f1
removed unused args
lintangsutawika Jun 14, 2022
51bcc46
removed bos_id
lintangsutawika Jun 14, 2022
c4bb330
added prefix key in get_batch_pipe
lintangsutawika Jun 14, 2022
835a0f7
set attention_mask to have batch size same as input batch if using pr…
lintangsutawika Jun 14, 2022
b37f5d9
fixed the sampling
lintangsutawika Jun 14, 2022
6be1b67
fixed the sampling arg
lintangsutawika Jun 14, 2022
86d0bd0
minor fix
lintangsutawika Jun 14, 2022
b43f35c
fixed typo
lintangsutawika Jun 14, 2022
d23d333
edit mlm dataset to not need input-seq-length and just rely on the or…
lintangsutawika Jun 23, 2022
964fe83
[cleanup] remove args in buld_train_valid_test_datasets
haileyschoelkopf Jul 13, 2022
7676b53
store seed in gpt2dataset class
haileyschoelkopf Jul 14, 2022
46fccb7
rename: NonCausalMLMDataset -> MLMDataset
haileyschoelkopf Jul 14, 2022
fd658e3
partial first pass: add GPT2Dataset dependency
haileyschoelkopf Jul 14, 2022
47869f7
change neox args, get up to error on CE loss calc
haileyschoelkopf Jul 14, 2022
b094982
initial MTFDataset commit
haileyschoelkopf Jul 14, 2022
33fc379
implement get_batch for MLM. loss goes down!
haileyschoelkopf Jul 14, 2022
97c9bcb
add extra-sentinel-tokens argument
haileyschoelkopf Jul 14, 2022
d3e76df
cleanup todos and set correct sentinel ids in dataset
haileyschoelkopf Jul 14, 2022
5282def
modify mlm.yml
haileyschoelkopf Jul 14, 2022
1b610e0
initial commit: MTFDataset class from Meg-DS
haileyschoelkopf Jul 15, 2022
064edf0
update MLMdataset return value
haileyschoelkopf Jul 25, 2022
587be1a
add decoder packed dataset class + refactor
haileyschoelkopf Jul 26, 2022
498231b
add train_mtf flag and refactor get_batch
haileyschoelkopf Jul 26, 2022
96af4cc
update attn mask creation
haileyschoelkopf Jul 26, 2022
8e7a2c2
use training_objective flag correctly
haileyschoelkopf Jul 26, 2022
9b9e313
Merge pull request #1 from lintangsutawika/hailey_add_packing
haileyschoelkopf Jul 29, 2022
3a03c20
update sample config w/ new neox args
haileyschoelkopf Jul 29, 2022
7231777
add pad() property to tokenizer classes
haileyschoelkopf Jul 29, 2022
bb60d49
cleanup + allow for packed MTFDataset to be selected
haileyschoelkopf Jul 29, 2022
68c27b3
full packed attn mask + pos ids impl.
haileyschoelkopf Aug 3, 2022
6a862b2
update datasets + add up-to-date mtf utils
haileyschoelkopf Aug 4, 2022
c725c38
Merge pull request #2: Add packed attention calculations + Multi-task…
haileyschoelkopf Aug 4, 2022
494b870
make sure code is all synced
haileyschoelkopf Aug 4, 2022
9a1047c
Update MLM w/ most recent version of code
haileyschoelkopf Aug 4, 2022
8e53bfd
preliminary p3 dataloading
haileyschoelkopf Aug 5, 2022
f592f25
change FIM config name
haileyschoelkopf Aug 5, 2022
d85af0d
resolve some todos + cleanup attn masking a bit
haileyschoelkopf Aug 5, 2022
ee7ab4d
change the way p3 is processed
haileyschoelkopf Aug 5, 2022
00146e4
log how many sentinel vs. dummy toks are added
haileyschoelkopf Aug 5, 2022
ed4e01c
clean up MTFDataset file
haileyschoelkopf Aug 7, 2022
7596e65
move MTF data code into single file
haileyschoelkopf Aug 7, 2022
fa1273e
delete temp_utils file; change imports in MTF file
haileyschoelkopf Aug 7, 2022
6011f32
emulate seqio max mixing rate
haileyschoelkopf Aug 7, 2022
b0ed699
update requirements to EleutherAI/lm_dataformat fork
haileyschoelkopf Aug 10, 2022
24c8a6a
update p3 download + preproc; allow for separate train+valid datasets
haileyschoelkopf Aug 10, 2022
e61196d
update datapaths on sample config
haileyschoelkopf Aug 10, 2022
d4229a3
Configure P3 download + dataloading
haileyschoelkopf Aug 10, 2022
d7838a7
Merge pull request #630 from lintangsutawika/mlm_and_mtf_adaptation
haileyschoelkopf Aug 21, 2022
694d7a3
initial t5 packing attempt
haileyschoelkopf Sep 13, 2022
93df634
added SuperGLUE DataDownloader object
Sep 13, 2022
f4c54b6
download all subsets and extract only the train.jsonl file
Sep 13, 2022
9be62e7
added preprocess for all sglue datasets
Sep 14, 2022
52aac11
moved preprocessing to tools/sglue_utils.py
Sep 15, 2022
4f0c158
added zip processing
Sep 15, 2022
5963e3c
corpora.py can load and process at the same time
Sep 16, 2022
6590c19
revert to main
Sep 16, 2022
61550f6
super_glue process both text and target
Sep 16, 2022
f0fc9ee
merged MTF with SGLUE implementation
Sep 18, 2022
f8d63f0
changed dataset selection process and added new dataset object
Sep 18, 2022
afdbcb7
changed field names
Sep 19, 2022
8a3b722
merged with tt-packing
Sep 20, 2022
686b444
solve merge conflict
Sep 20, 2022
92d5adf
changed fields
Sep 20, 2022
cb08955
added superglue yml
Sep 20, 2022
a6fceb6
test T5MTF packed
Sep 20, 2022
27a13d8
forgot comma
Sep 20, 2022
a64b362
removed duplicate line
Sep 20, 2022
eb6ba00
removed duplicate line
Sep 20, 2022
989c25b
forgot to mention merge-file
Sep 20, 2022
f6f710b
fixed data loading in prepare_data
Sep 20, 2022
83cb51a
temp
Sep 20, 2022
0be327a
removed print
Sep 20, 2022
db2bb1e
changes to enable data process with fields other than text
lintangsutawika Sep 20, 2022
13db037
removed text field
lintangsutawika Sep 20, 2022
b072007
can process arbitrary key from a jsonl file
lintangsutawika Sep 20, 2022
edba60e
runs process for SGLUE
lintangsutawika Sep 20, 2022
0a4c524
minor fix on targets for wsc
lintangsutawika Sep 20, 2022
b694c4b
comment line for now
lintangsutawika Sep 20, 2022
9682a24
import t5_mtf_dataset
lintangsutawika Sep 20, 2022
ed73ec9
fix extra_sentinel_tokens issue
lintangsutawika Sep 20, 2022
75d6097
removed line
lintangsutawika Sep 20, 2022
383de82
adjustments for mtf
lintangsutawika Sep 20, 2022
21cce69
changed tokenizer
lintangsutawika Sep 20, 2022
8a7c1d1
tok_len includes input_tokens only
lintangsutawika Sep 20, 2022
42d4859
fixed typos, tokenizer pad set to 0
lintangsutawika Sep 20, 2022
231a069
removed print
lintangsutawika Sep 20, 2022
049a27b
position_ids and added neox_args to get_ltor_masks_and_position_ids
lintangsutawika Sep 20, 2022
7220768
for training
lintangsutawika Sep 20, 2022
59ca374
renamed dataset class
lintangsutawika Sep 21, 2022
bf8c384
yml split to two
lintangsutawika Sep 21, 2022
0613f68
updated with latest changes in improved-t5-2.0
lintangsutawika Sep 21, 2022
42cd8fa
enable t5 packing and non-packing versions
lintangsutawika Sep 21, 2022
26968b7
adapted to run both packed and non-packed versions
lintangsutawika Sep 21, 2022
caf0d3d
differentiate between make_segment_mask and get_full_mask
lintangsutawika Sep 21, 2022
2873d40
differentiate between make_segment_mask and get_full_mask
lintangsutawika Sep 21, 2022
58c4f10
typo, wsc should be wic in the wic prompt function
lintangsutawika Sep 23, 2022
a810f6c
commit modifications from stability cluster
haileyschoelkopf Sep 23, 2022
afa8caa
moved few configs args
lintangsutawika Sep 23, 2022
9506f7f
changed train_mtf to packing
lintangsutawika Sep 23, 2022
93ba059
changed train_mtf to packing
lintangsutawika Sep 23, 2022
5752e00
fix attention_mask
lintangsutawika Sep 23, 2022
a16bcda
minor adjustment
lintangsutawika Sep 23, 2022
007d6d0
Merge branch 'multitask-finetuning' into stability_multitask
lintangsutawika Sep 23, 2022
f29538e
Merge pull request #5 from lintangsutawika/stability_multitask
lintangsutawika Sep 23, 2022
93e13f9
added new args to use for get_ltor_masks_and_position_ids
lintangsutawika Sep 23, 2022
d6ba631
update configs
lintangsutawika Sep 23, 2022
72f0e27
changes in making loss_mask
lintangsutawika Sep 23, 2022
0d00539
changed in line 57
lintangsutawika Sep 25, 2022
216eb3f
both conditions return the same args
lintangsutawika Sep 25, 2022
870ec5b
edit _build_index_mappings
lintangsutawika Sep 25, 2022
bdaca21
closer implementation of packing to tensor2tensor library
lintangsutawika Sep 25, 2022
a408186
combined indexes converted to string to enable mmap
lintangsutawika Sep 25, 2022
b26b4c7
combined indexes converted to string to enable mmap
lintangsutawika Sep 25, 2022
351cffe
skip inputs that are longer than seq_length
lintangsutawika Sep 25, 2022
452bd3a
removed unused variables, fixed wrong logic
lintangsutawika Sep 25, 2022
faf64ce
added break in the loop
lintangsutawika Sep 25, 2022
58d9cf8
re-added commented line
lintangsutawika Sep 25, 2022
2a3d5bc
forgot to add a start token, so offsetting target tokens by 1
lintangsutawika Sep 25, 2022
6a246b9
fixed label and token_dec shifting
lintangsutawika Sep 26, 2022
bee300a
include eod in loss
lintangsutawika Sep 26, 2022
1023968
added generate_samples_from_prompt
lintangsutawika Sep 28, 2022
13d31f3
edits for encdec generation
lintangsutawika Sep 28, 2022
5778e25
change position of bs calcs
lintangsutawika Sep 28, 2022
b12e079
remove stray lines
lintangsutawika Sep 28, 2022
f8faaf3
added set batch pipe based on arch
lintangsutawika Sep 28, 2022
b9c730e
pipe_batch_fn in setup_model_and_optimizer
lintangsutawika Sep 28, 2022
718faf8
remove unused line
lintangsutawika Sep 28, 2022
a5a79a0
temp hack for seq2seq generation
lintangsutawika Sep 28, 2022
bc605e7
Update training.py
lintangsutawika Oct 4, 2022
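
Several of the commits above ("added prefix attention calculation to utils", "set attention_mask to have batch size same as input batch if using pr…") add a prefix-LM style attention mask. The helper below is a minimal sketch of that idea, not the PR's implementation: positions inside the prefix attend bidirectionally, while the remaining positions attend causally. The function name and shapes are assumptions for illustration.

```python
import torch

def prefix_lm_attention_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Hypothetical sketch: build a [seq_len, seq_len] boolean mask where True
    means "may attend". Prefix tokens see each other bidirectionally; tokens
    after the prefix attend causally."""
    # Start from a standard lower-triangular (causal) mask.
    mask = torch.tril(torch.ones(seq_len, seq_len)).bool()
    # Allow full (bidirectional) attention among the first `prefix_len` tokens.
    mask[:prefix_len, :prefix_len] = True
    return mask

# Example: an 8-token sequence with a 3-token prefix.
print(prefix_lm_attention_mask(8, 3).int())
```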
110 changes: 110 additions & 0 deletions configs/FIM-160M.yml
@@ -0,0 +1,110 @@
# GPT-2 pretraining setup
{
# parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
# across the node boundaries )
"pipe-parallel-size": 1,
"model-parallel-size": 1,

# model settings
"num-layers": 12,
"hidden-size": 768,
"num-attention-heads": 12,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,

# these should provide some speedup but takes a while to build, set to true if desired
"scaled-upper-triang-masked-softmax-fusion": false,
"bias-gelu-fusion": false,


# optimizer settings
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0006,
"betas": [0.9, 0.999],
"eps": 1.0e-8,
}
},
"zero_optimization": {
"stage": 0,
"allgather_partitions": True,
"allgather_bucket_size": 500000000,
"overlap_comm": True,
"reduce_scatter": True,
"reduce_bucket_size": 500000000,
"contiguous_gradients": True,
"cpu_offload": False
},

# batch / data settings
"train_micro_batch_size_per_gpu": 4,
"gradient_accumulation_steps": 8,
"data-impl": "mmap",
"split": "995,4,1",

# activation checkpointing
"checkpoint-activations": true,
"checkpoint-num-layers": 1,
"partition-activations": true,
"synchronize-each-layer": true,

# regularization
"gradient_clipping": 1.0,
"weight-decay": 0.01,
"hidden-dropout": 0.0,
"attention-dropout": 0.0,

# precision settings
"fp16": {
"fp16": true,
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 12,
"hysteresis": 2,
"min_loss_scale": 1
},

# misc. training settings
"train-iters": 320000,
"lr-decay-iters": 320000,
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"eval-interval": 1000,
"eval-iters": 10,

# logging
"log-interval": 100,
"steps_per_print": 10,
"keep-last-n-checkpoints": 4,
"wall_clock_breakdown": true,


# Tokenizer / checkpoint settings - you will need to change these to the location you have them saved in
"vocab-file": "/fsx/pile/20B_tokenizer.json",
# "merge-file": "./20b_checkpoints/merges.txt",
"save": "./mlm_125m_checkpoints",
"load": "./mlm_125m_checkpoints",

# If finetuning, edit the following to the location of your finetuning dataset:
"data-path": "/fsx/pile/pile_20B_tokenizer_text_document",

"extra-sentinel-tokens": 100,
"training-objective": "mlm",
# "train_mtf": True,
# "use_prefix_attention": True,

### NEW DATA: ####
"tokenizer_type": "HFTokenizer",
"tensorboard-dir": "./tensorboard",
"log-dir": "./logs",

"hostfile": "./hostfile",
"launcher": "OpenMPI",
}
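
The `"extra-sentinel-tokens": 100` and `"training-objective": "mlm"` settings in this config point at T5-style span corruption: masked spans in the input are replaced by sentinel tokens, and the target spells out each sentinel followed by the span it replaced. Below is a rough, self-contained sketch of that transformation, not the code in this PR; the sentinel id scheme, helper name, and span-sampling details are assumptions.

```python
import random
from typing import List, Tuple

def span_corrupt(tokens: List[int], vocab_size: int, num_sentinels: int = 100,
                 mask_prob: float = 0.15, mean_span: int = 3,
                 seed: int = 0) -> Tuple[List[int], List[int]]:
    """Sketch of T5-style span corruption. Sentinel ids are assumed to sit at
    the top of the vocabulary: vocab_size - 1, vocab_size - 2, ..."""
    rng = random.Random(seed)
    inputs, targets = [], []
    sentinel = 0
    i = 0
    while i < len(tokens):
        # Start a new masked span with probability mask_prob / mean_span, so
        # roughly mask_prob of all tokens end up corrupted on average.
        if sentinel < num_sentinels and rng.random() < mask_prob / mean_span:
            span = max(1, int(rng.expovariate(1.0 / mean_span)))
            sentinel_id = vocab_size - 1 - sentinel
            inputs.append(sentinel_id)            # sentinel replaces the span
            targets.append(sentinel_id)           # target repeats the sentinel...
            targets.extend(tokens[i:i + span])    # ...followed by the hidden span
            sentinel += 1
            i += span
        else:
            inputs.append(tokens[i])
            i += 1
    return inputs, targets

# Example: corrupt a toy sequence with a 1000-token vocabulary.
inp, tgt = span_corrupt(list(range(20)), vocab_size=1000)
print(inp)
print(tgt)
```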
39 changes: 39 additions & 0 deletions configs/finetune-sglue.yml
@@ -0,0 +1,39 @@
# Suggested data paths when finetuning T5 on Super Glue locally
{

"finetune": True,
"packing": True,
"data-path": "data/super_glue/super_glue",

"tokenizer-type": "HFTokenizer",
"vocab-file": "data/tokenizer.json",

#"save": "data/improved-t5-test",
#"load": "data/improved-t5-test",
"load": "ckpts/pretrain",
"save": "ckpts/sglue",

# batch / data settings
"train_micro_batch_size_per_gpu": 8,
"gradient-accumulation-steps": 1,
"data-impl": "mmap",
"split": "949,50,1",

# misc. training settings
"train-iters": 5000,
"lr-decay-iters": 5000,
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 1000,
"eval-interval": 5001,
"eval-iters": 10,

"tensorboard-dir": "/tmp/improved-t5/tensorboard",
"log-dir": "/tmp/improved-t5/logs",
# "use_wandb": True,
# "wandb_group": "T5-770M-9-3-22-testppl",
# "wandb_team": "eleutherai",
# "wandb_project": "improved-t5",

}
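
`"packing": True` in this config corresponds to the commits that pack several (input, target) examples into one fixed-length sequence and skip examples longer than the sequence length ("skip inputs that are longer than seq_length"). A simplified greedy packer in that spirit might look like the following; the exact data layout in the PR differs, so treat the field names here as assumptions.

```python
from typing import Dict, Iterable, List

def pack_examples(examples: Iterable[Dict[str, List[int]]],
                  seq_length: int) -> List[List[Dict[str, List[int]]]]:
    """Greedy packing sketch: append examples to the current pack until the
    next one would overflow seq_length, then start a new pack. Examples that
    are individually longer than seq_length are skipped."""
    packs, current, used = [], [], 0
    for ex in examples:
        length = len(ex["input_tokens"]) + len(ex["target_tokens"])
        if length > seq_length:
            continue  # skip over-length examples, as in the commit notes
        if used + length > seq_length:
            packs.append(current)
            current, used = [], 0
        current.append(ex)
        used += length
    if current:
        packs.append(current)
    return packs

# Example: pack toy examples into sequences of length 8.
examples = [{"input_tokens": [1, 2, 3], "target_tokens": [4]},
            {"input_tokens": [5, 6], "target_tokens": [7, 8]},
            {"input_tokens": [9] * 12, "target_tokens": [10]}]  # skipped: too long
print(pack_examples(examples, seq_length=8))
```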
106 changes: 106 additions & 0 deletions configs/mlm.yml
@@ -0,0 +1,106 @@
# GPT-2 pretraining setup
{
# parallelism settings ( you will want to change these based on your cluster setup, ideally scheduling pipeline stages
# across the node boundaries )
"pipe-parallel-size": 1,
"model-parallel-size": 1,

# model settings
"num-layers": 12,
"hidden-size": 768,
"num-attention-heads": 12,
"seq-length": 2048,
"max-position-embeddings": 2048,
"norm": "layernorm",
"pos-emb": "rotary",
"no-weight-tying": true,

# these should provide some speedup but takes a while to build, set to true if desired
"scaled-upper-triang-masked-softmax-fusion": false,
"bias-gelu-fusion": false,


# optimizer settings
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0006,
"betas": [0.9, 0.999],
"eps": 1.0e-8,
}
},
"zero_optimization": {
"stage": 0,
"allgather_partitions": True,
"allgather_bucket_size": 500000000,
"overlap_comm": True,
"reduce_scatter": True,
"reduce_bucket_size": 500000000,
"contiguous_gradients": True,
"cpu_offload": False
},

# batch / data settings
"train_micro_batch_size_per_gpu": 4,
"data-impl": "mmap",
"split": "949,50,1",

# activation checkpointing
"checkpoint-activations": true,
"checkpoint-num-layers": 1,
"partition-activations": true,
"synchronize-each-layer": true,

# regularization
"gradient_clipping": 1.0,
"weight-decay": 0.0,
"hidden-dropout": 0.0,
"attention-dropout": 0.0,

# precision settings
"fp16": {
"enabled": true,
"type": "bfloat16", # set bf16 as precision
"loss_scale": 0,
"loss_scale_window": 1000,
"hysteresis": 2,
"min_loss_scale": 1
},

"fp32_allreduce": True, # without a patch to torch, bf16 models have to do the allreduce in fp32
# misc. training settings
"train-iters": 320000,
"lr-decay-iters": 320000,
"distributed-backend": "nccl",
"lr-decay-style": "cosine",
"warmup": 0.01,
"save-interval": 10000,
"eval-interval": 1000,
"eval-iters": 10,

# logging
"log-interval": 100,
"steps_per_print": 10,
"keep-last-n-checkpoints": 4,
"wall_clock_breakdown": true,


# Tokenizer / checkpoint settings - you will need to change these to the location you have them saved in
"vocab-file": "./data/20b_checkpoints/125m-tokenizer/",
# "merge-file": "./20b_checkpoints/merges.txt",
"save": "./data/20b_checkpoints",
"load": "./data/20b_checkpoints",

# If finetuning, edit the following to the location of your finetuning dataset:
"data-path": "./data/enron/enron_text_document",

"extra-sentinel-tokens": 100,
"train-mlm": True,
# "use_prefix_attention": True,
"seq-length": 1024,

### NEW DATA: ####
"tokenizer_type": "HFGPT2Tokenizer",
"tensorboard-dir": "./tensorboard",
"log-dir": "./logs",
}
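
Both MLM configs above rely on `"extra-sentinel-tokens"`, and the commit history adds a `pad()` property to the tokenizer classes. One plausible way to wire that up with a HuggingFace tokenizer is sketched below; the token strings and the decision to place sentinels at the end of the vocabulary are assumptions for illustration, not the PR's tokenizer code.

```python
from transformers import AutoTokenizer

# Hypothetical setup: extend a pretrained tokenizer with 100 sentinel tokens
# (<extra_id_0> ... <extra_id_99>, mirroring T5's naming) plus a pad token.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
sentinels = [f"<extra_id_{i}>" for i in range(100)]
tokenizer.add_special_tokens({
    "additional_special_tokens": sentinels,
    "pad_token": "<pad>",
})

# The sentinel and pad ids now occupy the top of the vocabulary.
print(len(tokenizer), tokenizer.convert_tokens_to_ids("<extra_id_0>"))
print(tokenizer.pad_token_id)
```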
30 changes: 30 additions & 0 deletions configs/small_bf16.yml
@@ -83,4 +83,34 @@
"steps_per_print": 10,
"keep-last-n-checkpoints": 4,
"wall_clock_breakdown": true,


# Tokenizer / checkpoint settings - you will need to change these to the location you have them saved in
"vocab-file": "./data/20b_checkpoints/125m-tokenizer/",
# "merge-file": "./20b_checkpoints/merges.txt",
"save": "./data/20b_checkpoints",
"load": "./data/20b_checkpoints",

# If finetuning, edit the following to the location of your finetuning dataset:
# "data-path": "./data/p3/p3",
"train-data-paths": ["./data/p3/p3"],
"valid-data-paths": ["./data/p3_valid/p3_valid"],
"test-data-paths": ["./data/p3_valid/p3_valid"],

"train-data-weights": [1],
"valid-data-weights": [1],
"test-data-weights": [1],


# "extra-sentinel-tokens": 100,
# "training-objective": "prefixlm",
"train_mtf": True,
"loss_on_targets_only": True,
# "use_prefix_attention": True,
"seq-length": 1024,

### NEW DATA: ####
"tokenizer_type": "HFGPT2Tokenizer",
"tensorboard-dir": "./tensorboard",
"log-dir": "./logs",
}
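
`"train_mtf": True` together with `"loss_on_targets_only": True` above implies a loss mask that covers only the target portion of each example. A minimal sketch of that masking, assuming the batch carries per-token target flags (the tensor names are not the PR's field names), is:

```python
import torch

def targets_only_loss_mask(is_target: torch.Tensor,
                           tokens: torch.Tensor,
                           pad_id: int) -> torch.Tensor:
    """Sketch: keep loss on target tokens, zero it on prompt/input tokens and
    on padding. `is_target` is a [batch, seq] 0/1 tensor flagging target
    positions."""
    loss_mask = is_target.float()
    loss_mask = loss_mask * (tokens != pad_id).float()  # never train on padding
    return loss_mask

# Example: one 6-token sequence, last 3 tokens are targets, last token is padding (id 0).
tokens = torch.tensor([[5, 6, 7, 8, 9, 0]])
is_target = torch.tensor([[0, 0, 0, 1, 1, 1]])
print(targets_only_loss_mask(is_target, tokens, pad_id=0))
```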
87 changes: 87 additions & 0 deletions configs/t5-base.yml
@@ -0,0 +1,87 @@
# example config for an encoder-decoder model
# (not optimized, just containing all relevant neox args)
{
"pipe-parallel-size": 1,
"model-parallel-size": 1,

# model settings
"model-arch": "t5",
"num-encoder-layers": 12,
"num-layers": 12,
"hidden-size": 768,
"num-attention-heads": 12,
"seq-length": 512,
"decoder-seq-length": 114,
"max-position-embeddings": 626,
"norm": "layernorm",
"pos-emb": "rotary",
"rotary-pct": 0.25,
"activation": "geglu",
"no-weight-tying": true,
"gpt-j-residual": false,
"output-layer-parallelism": "column",

"init_method": "small_init",
"output_layer_init_method": "wang_init",

"extra-sentinel-tokens": 100,
# "masked-lm-prob": 0.15,
# "mean_noise_span_length": 3,

# fusion ops (STILL UNTESTED WITH T5)
"scaled-upper-triang-masked-softmax-fusion": false,
# can we do upper triang fusion for certain layers only? would that be a speedup?
"bias-gelu-fusion": false,

# optimizer settings
"optimizer": {
"type": "Adam",
"params": {
"lr": 0.0001,
"betas": [0.9, 0.999],
"eps": 1.0e-8,
}
},
"min_lr": 0.0001,

"zero_optimization": {
"stage": 1,
"allgather_partitions": True,
"allgather_bucket_size": 500000000,
"overlap_comm": True,
"reduce_scatter": True,
"reduce_bucket_size": 500000000,
"contiguous_gradients": True,
"cpu_offload": False
},

# activation checkpointing
"checkpoint-activations": false,
"checkpoint-num-layers": 1,
"partition-activations": false,
"synchronize-each-layer": true,

# regularization
"gradient_clipping": 1.0,
"weight-decay": 0.01,
"hidden-dropout": 0.0,
"attention-dropout": 0.0,

# precision settings
"fp16": {
"enabled": true,
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 12,
"hysteresis": 2,
"min_loss_scale": 1
},

# logging
"log-interval": 100,
"steps_per_print": 10,
"wall_clock_breakdown": true,

# "launcher": "openmpi",
# "deepspeed_mpi": true,
}
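
The commit "differentiate between make_segment_mask and get_full_mask" suggests a block-diagonal attention mask built from segment ids, so that packed examples cannot attend across example boundaries in this encoder-decoder setup. A hedged sketch of such a mask, with assumed tensor names and the convention that segment id 0 means padding, is:

```python
import torch

def make_segment_mask(query_segment_ids: torch.Tensor,
                      key_segment_ids: torch.Tensor) -> torch.Tensor:
    """Sketch: [batch, q_len, k_len] boolean mask that is True only where the
    query and key tokens belong to the same packed example (same segment id)
    and neither position is padding."""
    same_segment = query_segment_ids.unsqueeze(-1) == key_segment_ids.unsqueeze(-2)
    not_padding = (query_segment_ids.unsqueeze(-1) != 0) & (key_segment_ids.unsqueeze(-2) != 0)
    return same_segment & not_padding

# Example: two packed examples (segments 1 and 2) followed by padding (0).
seg = torch.tensor([[1, 1, 2, 2, 0]])
print(make_segment_mask(seg, seg).int())
```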