Increase test coverage #289

Merged on May 12, 2021 (67 commits)

Commits
41d3ca0 requirements for test coverage (May 2, 2021)
427bebd cleanup tensorboard dir when testing (May 2, 2021)
fe7916c simplify using subtests (May 3, 2021)
119c6c1 fix clear test dirs in subtests (May 3, 2021)
9ac6e03 test update to try and run tests with a worldsize > 1 (May 3, 2021)
65b497f fix test model instantiation for world size > 1 (May 3, 2021)
ca05fd9 neox args test with import in function (May 3, 2021)
69ae18c test readme update (May 3, 2021)
fb60a97 test model checkpoint with forward option (May 3, 2021)
616ba01 test model checkpoint in inference mode (May 3, 2021)
d343dbd todo for config data_impl (May 3, 2021)
2472bd2 upate test configs (May 3, 2021)
92251df add docstrings to testcases (May 3, 2021)
532c982 test models with overwrite in neox_args (May 3, 2021)
81086aa update tests readme (May 3, 2021)
528686f test config include sm3 optimizer (May 3, 2021)
48c7a3e test config adjustments (May 3, 2021)
b460776 add cpu and gpu testing in checkpoint test (May 3, 2021)
023579b add test for train / backwards step (May 3, 2021)
aa0dc64 requirements for test coverage (May 2, 2021)
c517c69 cleanup tensorboard dir when testing (May 2, 2021)
132abfc simplify using subtests (May 3, 2021)
8b7a9a7 fix clear test dirs in subtests (May 3, 2021)
93c986a test update to try and run tests with a worldsize > 1 (May 3, 2021)
49ef6d0 fix test model instantiation for world size > 1 (May 3, 2021)
ce66309 neox args test with import in function (May 3, 2021)
3673d1b test readme update (May 3, 2021)
e849d40 test model checkpoint with forward option (May 3, 2021)
a83198f test model checkpoint in inference mode (May 3, 2021)
baeb88e todo for config data_impl (May 3, 2021)
ec81841 upate test configs (May 3, 2021)
5ac730a add docstrings to testcases (May 3, 2021)
3cd056b test models with overwrite in neox_args (May 3, 2021)
c0469f0 update tests readme (May 3, 2021)
44431d5 test config include sm3 optimizer (May 3, 2021)
51c9bc1 test config adjustments (May 3, 2021)
05fbd5d add cpu and gpu testing in checkpoint test (May 3, 2021)
2f827a3 add test for train / backwards step (May 3, 2021)
eefff73 Merge branch 'increase_test_coverage' of github.com:EleutherAI/gpt-ne… (May 4, 2021)
f1f40cf test model train with right vocab size (May 4, 2021)
7b0ccf2 modified test configs (May 4, 2021)
dc53a5c test train with nan handling of losses (May 4, 2021)
9e4c31a test model train comment out config 2 (no error, no termination) (May 4, 2021)
c8c6e97 text generation utils - create dir fix (May 4, 2021)
1cdea36 test model generation init (May 4, 2021)
a391ae1 changed model tests to allow for init from dict (kipgparker, May 4, 2021)
dc88b12 Merge branch 'increase_test_coverage' of https://github.com/EleutherA… (kipgparker, May 4, 2021)
c756696 Merge branch 'main' into increase_test_coverage (May 7, 2021)
48171c7 fix use recompute kwarg in generation instead of neox_args.recompute (May 7, 2021)
7239d3a adjust tests for generation to new main branch (May 7, 2021)
83978f2 test text generation with multiple configs (May 10, 2021)
f915114 test model generation with input file (May 10, 2021)
0fce24f adding config comparer and figured out what's causing test error (May 10, 2021)
9042dab Merge branch 'main' into increase_test_coverage (May 11, 2021)
88d4a35 updated config comparer and config to meet new format (kipgparker, May 11, 2021)
14a84c9 fix / make loss dict naming consistent (May 11, 2021)
b3b3d5c disable fp32 in testing (May 11, 2021)
97c0c62 fix error message for unknown activation (May 11, 2021)
6f76823 add train_batch_size to known parameters in neox_args used testcase (May 11, 2021)
89863b3 fix comment with new variable name (May 11, 2021)
053e70c add train_batch_size] to known properties in neox_args usage testcase (May 11, 2021)
774bc2c updated config comparer (kipgparker, May 11, 2021)
dc2806a Merge branch 'main' into increase_test_coverage (May 12, 2021)
02ccedc Merge branch 'main' into increase_test_coverage (May 12, 2021)
d405354 compare arg value in neox args load test (May 12, 2021)
af0b1f1 mark testcases for cpu (May 12, 2021)
a2d2e2b readme for tests on cpu (May 12, 2021)
Commit dc53a5c712f0f62258bc43264ac2c6e189967ce0
test train with nan handling of losses
Samuel Weinbach committed May 4, 2021
34 changes: 19 additions & 15 deletions tests/model/test_model_train.py
@@ -14,29 +14,29 @@
 import torch
 
 @distributed_test(world_size=1)
-def test_model_checkpoint_small_0():
+def test_model_train_small_0():
     yaml_list = get_test_configs_with_path(["test_local_setup.yml", "test_small_0.yml"])
     run_train_test(yaml_list)
 
 @distributed_test(world_size=1)
-def test_model_checkpoint_small_1():
+def test_model_train_small_1():
     yaml_list = get_test_configs_with_path(["test_local_setup.yml", "test_small_1.yml"])
     run_train_test(yaml_list)
 
-# for some reason this testcase is running way to long
-# potentially the optimizer problem?
-# @distributed_test(world_size=2)
-# def test_model_checkpoint_small_2():
-#     yaml_list = get_test_configs_with_path(["test_local_setup.yml", "test_small_2.yml"])
-#     run_train_test(yaml_list)
+@distributed_test(world_size=2)
+def test_model_train_small_2():
+    yaml_list = get_test_configs_with_path(["test_local_setup.yml", "test_small_2.yml"])
+    run_train_test(yaml_list)
 
 @distributed_test(world_size=1)
-def test_model_checkpoint_small_3():
+def test_model_train_small_3():
     yaml_list = get_test_configs_with_path(["test_local_setup.yml", "test_small_3.yml"])
     run_train_test(yaml_list)
 
 @distributed_test(world_size=2)
-def test_model_checkpoint_small_4():
+def test_model_train_small_4():
     yaml_list = get_test_configs_with_path(["test_local_setup.yml", "test_small_4.yml"])
     run_train_test(yaml_list)
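Each test above builds its config list by resolving YAML filenames against the repository's test-config directory. A minimal sketch of what a helper like `get_test_configs_with_path` could look like (the `base_dir` default here is an assumption for illustration, not the repository's actual layout):

```python
import os

def get_test_configs_with_path(config_names, base_dir="tests/model/test_configs"):
    # Hypothetical helper: prefix each YAML filename with the directory
    # that holds the test configs; base_dir is an assumed location.
    return [os.path.join(base_dir, name) for name in config_names]

yaml_list = get_test_configs_with_path(["test_local_setup.yml", "test_small_0.yml"])
```

Merging a shared local-setup config with a per-test model config keeps each test file short while still exercising a distinct model configuration.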

@@ -75,8 +75,9 @@ def run_train_test(yaml_list):
     timers = Timers(use_wandb=False, tensorboard_writer=None)
 
     # generate some random data on which we can overfit
+    # context size of data is model seq_len + 1 in order to compute loss
     data_list = list()
-    context_tokens_tensor = torch.randint(0, args_loaded.padded_vocab_size, (4, args_loaded.seq_length)).to(torch.int64)
+    context_tokens_tensor = torch.randint(0, args_loaded.padded_vocab_size, (4, args_loaded.seq_length + 1)).to(torch.int64)
     for i in range(max_steps):
         data_list.append({ "text": context_tokens_tensor.clone() })
     data_iterator = iter(data_list)
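The `seq_length + 1` change reflects how language-model training data is consumed: inputs and labels are the same sequence shifted by one position, so producing `seq_length` input/label pairs requires one extra token. A plain-Python illustration (not the repository's code):

```python
# A sample must hold seq_length + 1 tokens so that both the model input
# and the next-token labels can each be seq_length long.
seq_length = 4
tokens = list(range(seq_length + 1))  # [0, 1, 2, 3, 4]

inputs = tokens[:-1]  # what the model sees: positions 0 .. seq_length - 1
labels = tokens[1:]   # what the loss compares against: the next token at each position
```

With only `seq_length` tokens per sample, the final position would have no label, which is why the original tensor shape broke the loss computation.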
@@ -92,12 +93,15 @@ def run_train_test(yaml_list):
             optimizer=optimizer,
             lr_scheduler=lr_scheduler
         )
-        losses.append(loss_dict["lm loss"].item())
-        if losses[-1] < losses[0]:
-            return # all good
+        losses.append(loss_dict["lm loss"])
+        if len(losses) >= 2:
+            if torch.isnan(losses[-1]): continue
+            if torch.isnan(losses[-2]): continue
+            if losses[-1] < losses[-2]:
+                return # all good
 
     # loss should have decreased by now (otherwise increasing the max_steps parameter could have the testcase pass)
-    assert losses[-1] < losses[0], "run_train_test() loss going down within "+str(max_steps)+" steps"
+    assert losses[-1] < losses[-2], "run_train_test() loss going down within "+str(max_steps)+" steps"
 
     if torch.distributed.get_world_size() == 1 or torch.distributed.get_rank() == 0:
         clear_test_dirs()
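The success rule introduced by this commit can be restated standalone: compare each loss to its immediate predecessor, skip any comparison involving a NaN, and declare success as soon as one loss drops below the previous one. A plain-Python sketch of that rule (an illustration, not the test's actual code):

```python
import math

def training_succeeded(losses):
    # Walk consecutive loss pairs; a pair containing NaN is skipped rather
    # than treated as failure, mirroring the NaN handling in the diff above.
    for prev, cur in zip(losses, losses[1:]):
        if math.isnan(prev) or math.isnan(cur):
            continue
        if cur < prev:
            return True  # loss went down at least once: all good
    return False
```

Comparing against the previous loss instead of `losses[0]` means a transient NaN early in training no longer poisons every later comparison.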