
Increase test coverage #289

Merged
67 commits merged on May 12, 2021
Changes from 1 commit
41d3ca0
requirements for test coverage
May 2, 2021
427bebd
cleanup tensorboard dir when testing
May 2, 2021
fe7916c
simplify using subtests
May 3, 2021
119c6c1
fix clear test dirs in subtests
May 3, 2021
9ac6e03
test update to try and run tests with a world size > 1
May 3, 2021
65b497f
fix test model instantiation for world size > 1
May 3, 2021
ca05fd9
neox args test with import in function
May 3, 2021
69ae18c
test readme update
May 3, 2021
fb60a97
test model checkpoint with forward option
May 3, 2021
616ba01
test model checkpoint in inference mode
May 3, 2021
d343dbd
todo for config data_impl
May 3, 2021
2472bd2
update test configs
May 3, 2021
92251df
add docstrings to testcases
May 3, 2021
532c982
test models with overwrite in neox_args
May 3, 2021
81086aa
update tests readme
May 3, 2021
528686f
test config include sm3 optimizer
May 3, 2021
48c7a3e
test config adjustments
May 3, 2021
b460776
add cpu and gpu testing in checkpoint test
May 3, 2021
023579b
add test for train / backwards step
May 3, 2021
aa0dc64
requirements for test coverage
May 2, 2021
c517c69
cleanup tensorboard dir when testing
May 2, 2021
132abfc
simplify using subtests
May 3, 2021
8b7a9a7
fix clear test dirs in subtests
May 3, 2021
93c986a
test update to try and run tests with a world size > 1
May 3, 2021
49ef6d0
fix test model instantiation for world size > 1
May 3, 2021
ce66309
neox args test with import in function
May 3, 2021
3673d1b
test readme update
May 3, 2021
e849d40
test model checkpoint with forward option
May 3, 2021
a83198f
test model checkpoint in inference mode
May 3, 2021
baeb88e
todo for config data_impl
May 3, 2021
ec81841
update test configs
May 3, 2021
5ac730a
add docstrings to testcases
May 3, 2021
3cd056b
test models with overwrite in neox_args
May 3, 2021
c0469f0
update tests readme
May 3, 2021
44431d5
test config include sm3 optimizer
May 3, 2021
51c9bc1
test config adjustments
May 3, 2021
05fbd5d
add cpu and gpu testing in checkpoint test
May 3, 2021
2f827a3
add test for train / backwards step
May 3, 2021
eefff73
Merge branch 'increase_test_coverage' of github.com:EleutherAI/gpt-ne…
May 4, 2021
f1f40cf
test model train with right vocab size
May 4, 2021
7b0ccf2
modified test configs
May 4, 2021
dc53a5c
test train with nan handling of losses
May 4, 2021
9e4c31a
test model train comment out config 2 (no error, no termination)
May 4, 2021
c8c6e97
text generation utils - create dir fix
May 4, 2021
1cdea36
test model generation init
May 4, 2021
a391ae1
changed model tests to allow for init from dict
kipgparker May 4, 2021
dc88b12
Merge branch 'increase_test_coverage' of https://github.com/EleutherA…
kipgparker May 4, 2021
c756696
Merge branch 'main' into increase_test_coverage
May 7, 2021
48171c7
fix use recompute kwarg in generation instead of neox_args.recompute
May 7, 2021
7239d3a
adjust tests for generation to new main branch
May 7, 2021
83978f2
test text generation with multiple configs
May 10, 2021
f915114
test model generation with input file
May 10, 2021
0fce24f
adding config comparer and figured out what's causing test error
May 10, 2021
9042dab
Merge branch 'main' into increase_test_coverage
May 11, 2021
88d4a35
updated config comparer and config to meet new format
kipgparker May 11, 2021
14a84c9
fix / make loss dict naming consistent
May 11, 2021
b3b3d5c
disable fp32 in testing
May 11, 2021
97c0c62
fix error message for unknown activation
May 11, 2021
6f76823
add train_batch_size to known parameters in neox_args used testcase
May 11, 2021
89863b3
fix comment with new variable name
May 11, 2021
053e70c
add train_batch_size to known properties in neox_args usage testcase
May 11, 2021
774bc2c
updated config comparer
kipgparker May 11, 2021
dc2806a
Merge branch 'main' into increase_test_coverage
May 12, 2021
02ccedc
Merge branch 'main' into increase_test_coverage
May 12, 2021
d405354
compare arg value in neox args load test
May 12, 2021
af0b1f1
mark testcases for cpu
May 12, 2021
a2d2e2b
readme for tests on cpu
May 12, 2021
test update to try and run tests with a world size > 1
Samuel Weinbach committed May 4, 2021
commit 93c986a58bae68bdfc72fb0c8ca1f49a40e45c58
1 change: 1 addition & 0 deletions requirements/requirements-dev.txt
@@ -1,3 +1,4 @@
pytest==6.2.3
pytest-cov==2.11.1
pytest-forked==1.3.0
autopep8==1.5.6
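As a hedged sketch (the exact invocation is not shown in this PR), the dev requirements added above would typically be installed and the suite run with coverage and per-test process isolation along these lines:

```shell
# Install the test/dev dependencies added by this PR
pip install -r requirements/requirements-dev.txt

# Run the suite with coverage reporting; --forked (from pytest-forked)
# runs each test in its own subprocess, isolating CUDA/distributed state
pytest --cov --forked tests/
```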
26 changes: 0 additions & 26 deletions run_tests.py

This file was deleted.

6 changes: 0 additions & 6 deletions tests/__init__.py
@@ -1,6 +0,0 @@
"""
Testcases for GPT NeoX
"""

from .model import *
from .neox_args import *
126 changes: 106 additions & 20 deletions tests/common.py
@@ -1,16 +1,25 @@
"""
collection of reusable functions in the context of testing
"""

import os
import time
import shutil
import itertools
from pathlib import Path

import pytest

import torch
import torch.distributed as dist
from torch.multiprocessing import Process

import deepspeed

TEST_CHECKPOINT_DIR = "test_checkpoint"
TEST_LOG_DIR = "test_logs"
TEST_TENSORBOARD_DIR = "test_tensorboard"

# Worker timeout *after* the first worker has completed.
DEEPSPEED_UNIT_WORKER_TIMEOUT = 120


def get_root_directory():
    return Path(__file__).parents[1]

@@ -21,23 +30,9 @@ def get_configs_with_path(configs):
    return [str(get_config_directory() / cfg) for cfg in configs]

def get_test_configs_with_path(configs):
    test_config_dir = Path(__file__).parent / "model" / "test_configs"
    test_config_dir = Path(__file__).parent / "test_configs"
    return [str((test_config_dir / cfg).absolute()) for cfg in configs]

def iterate_all_test_configs_with_path():
    test_config_dir = Path(__file__).parent / "model" / "test_configs"

    model_configs = list((test_config_dir / "model").glob("*.yml"))
    sparsity_configs = list((test_config_dir / "sparsity").glob("*.yml"))

    for model_config, sparsity_config in itertools.product(model_configs, sparsity_configs):
        yield [
            str(test_config_dir / "test_local_setup.yml"),
            str(model_config),
            str(sparsity_config)
        ]

def clear_test_dirs():
    log_dir = os.path.join(get_root_directory(), TEST_LOG_DIR)
    if os.path.isdir(log_dir):
@@ -50,4 +45,95 @@ def clear_test_dirs():
    tensorboard_dir = os.path.join(get_root_directory(), TEST_TENSORBOARD_DIR)
    if os.path.isdir(tensorboard_dir):
        shutil.rmtree(tensorboard_dir)


def distributed_test(world_size=2, backend='nccl'):
    """A decorator for executing a function (e.g., a unit test) in a distributed manner.
    This decorator manages the spawning and joining of processes, initialization of
    torch.distributed, and catching of errors.

    This function is copied from: https://github.com/EleutherAI/DeeperSpeed/blob/24026e5bb37c528a222b8635c46256b1e1825d2e/tests/unit/common.py#L16

    Usage example:
        @distributed_test(world_size=[2,3])
        def my_test():
            rank = dist.get_rank()
            world_size = dist.get_world_size()
            assert rank < world_size

    Arguments:
        world_size (int or list): number of ranks to spawn. Can be a list to spawn
            multiple tests.
    """
    def dist_wrap(run_func):
        """Second-level decorator for dist_test. This actually wraps the function."""
        def dist_init(local_rank, num_procs, *func_args, **func_kwargs):
            """Initialize torch.distributed and execute the user function."""
            os.environ['MASTER_ADDR'] = '127.0.0.1'
            os.environ['MASTER_PORT'] = '29503'
            os.environ['LOCAL_RANK'] = str(local_rank)
            # NOTE: unit tests don't support multi-node so local_rank == global rank
            os.environ['RANK'] = str(local_rank)
            os.environ['WORLD_SIZE'] = str(num_procs)

            deepspeed.init_distributed(dist_backend=backend)

            if torch.cuda.is_available():
                torch.cuda.set_device(local_rank)

            run_func(*func_args, **func_kwargs)

        def dist_launcher(num_procs, *func_args, **func_kwargs):
            """Launch processes and gracefully handle failures."""

            # Spawn all workers on subprocesses.
            processes = []
            for local_rank in range(num_procs):
                p = Process(target=dist_init,
                            args=(local_rank, num_procs, *func_args),
                            kwargs=func_kwargs)
                p.start()
                processes.append(p)

            # Now loop and wait for a test to complete. The spin-wait here isn't a big
            # deal because the number of processes will be O(#GPUs) << O(#CPUs).
            any_done = False
            while not any_done:
                for p in processes:
                    if not p.is_alive():
                        any_done = True
                        break

            # Wait for all other processes to complete
            for p in processes:
                p.join(DEEPSPEED_UNIT_WORKER_TIMEOUT)

            failed = [(rank, p) for rank, p in enumerate(processes) if p.exitcode != 0]
            for rank, p in failed:
                # If it still hasn't terminated, kill it because it hung.
                if p.exitcode is None:
                    p.terminate()
                    pytest.fail(f'Worker {rank} hung.', pytrace=False)
                if p.exitcode < 0:
                    pytest.fail(f'Worker {rank} killed by signal {-p.exitcode}',
                                pytrace=False)
                if p.exitcode > 0:
                    pytest.fail(f'Worker {rank} exited with code {p.exitcode}',
                                pytrace=False)

        def run_func_decorator(*func_args, **func_kwargs):
            """Entry point for @distributed_test()."""

            if isinstance(world_size, int):
                dist_launcher(world_size, *func_args, **func_kwargs)
            elif isinstance(world_size, list):
                for procs in world_size:
                    dist_launcher(procs, *func_args, **func_kwargs)
                    time.sleep(0.5)
            else:
                raise TypeError('world_size must be an integer or a list of integers.')

        return run_func_decorator

    return dist_wrap
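The spawn/join/exit-code pattern used by `dist_launcher` above can be illustrated with a minimal, framework-free sketch (plain `multiprocessing`, no torch or deepspeed; the function names here are illustrative and not part of the PR):

```python
import multiprocessing as mp


def _worker(rank, num_procs):
    # Stand-in for dist_init: a real worker would set up torch.distributed
    # here; we only check the invariant from the decorator's usage example.
    assert rank < num_procs


def launch_workers(num_procs, timeout=30):
    """Spawn one process per rank, join them, and return their exit codes."""
    procs = [mp.Process(target=_worker, args=(rank, num_procs))
             for rank in range(num_procs)]
    for p in procs:
        p.start()
    for p in procs:
        p.join(timeout)
    # Mirror the failure handling above: a None exitcode means the worker
    # hung, a negative one means it was killed by a signal, and a positive
    # one means it exited with an error.
    return [p.exitcode for p in procs]


if __name__ == "__main__":
    print(launch_workers(2))
```

In the real decorator, the exit codes feed `pytest.fail(...)` calls so that a hung or crashed rank fails the test instead of deadlocking the suite.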
6 changes: 0 additions & 6 deletions tests/model/__init__.py
@@ -1,6 +0,0 @@
"""
Tests concerning the GPT2Model class
"""

from .test_model_checkpoint import TestModelCheckpoint
from .test_model_instantiation import TestModelInstantiation
12 changes: 0 additions & 12 deletions tests/model/test_configs/sparsity/test_sparsity_default.yml

This file was deleted.
