fix python version and pytest install #1234

Open · jahatef wants to merge 71 commits into main

Conversation

jahatef (Collaborator) commented Jun 6, 2024

Possibly fix workflow issues. Needs to be tested in PR.

jahatef marked this pull request as draft on June 7, 2024 at 01:33

jahatef (Collaborator, Author) commented Jun 17, 2024

Fixed the workflows by specifying Python versions and installing packages before running tests. pip install exits with "requirement already satisfied" when a package is already installed, which should be fine. I also updated some requirements in requirements.txt: I removed the pinned commit hash from DeeperSpeed (which I'm not sure we want), and I capped the numpy requirement at <2.0, which is required because numpy 2.x breaks DeepSpeed.

Tests will run, although some currently fail because the runner has no GPU access, and some fail for reasons seemingly unrelated to the workflows. See https://github.com/EleutherAI/gpt-neox/actions/runs/9555032138/job/26337367665
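
For illustration, the fix amounts to workflow steps along these lines (a minimal sketch, not the repository's actual workflow file; the Python version, action versions, and requirements path are assumptions):

    # Hypothetical excerpt of a GitHub Actions test job.
    jobs:
      tests:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.10"   # pin an explicit interpreter version
          - name: Install dependencies before running tests
            run: |
              pip install -r requirements/requirements.txt
              pip install pytest
          - name: Run tests
            run: pytest tests/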

jahatef marked this pull request as ready for review on June 17, 2024 at 21:42
Quentin-Anthony (Member) commented

Here's the relevant trace from the runner, for future reference.

____________________________ test_main_constructor _____________________________
def test_main_constructor():
        input_args = ["train.py", "tests/config/test_setup.yml"]
>       neox_args = NeoXArgs.consume_deepy_args(input_args)

tests/unit/test_arguments.py:21: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
megatron/neox_arguments/arguments.py:371: in consume_deepy_args
    neox_args = cls.from_ymls(
megatron/neox_arguments/arguments.py:229: in from_ymls
    return cls(**config)
<string>:266: in __init__
    ???
megatron/neox_arguments/arguments.py:134: in __post_init__
    self.calculate_derived()
megatron/neox_arguments/arguments.py:836: in calculate_derived
    resources = obtain_resource_pool(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
hostfile_path = 'None', include_arg = 'localhost:1', exclude_arg = ''

    def obtain_resource_pool(
        hostfile_path, include_arg, exclude_arg
    ) -> Dict[str, List[int]]:
        """
        Get dict of `resource_pool[hostname] = [list of GPU ranks]` using hostfile, include and exclude args.
        Modified from: `deepspeed.launcher.runner.main`
        """
        resource_pool = fetch_hostfile(hostfile_path)
        if not resource_pool:
            resource_pool = {}
            device_count = torch.cuda.device_count()
            if device_count == 0:
>               raise RuntimeError("Unable to proceed, no GPU resources available")
E               RuntimeError: Unable to proceed, no GPU resources available

megatron/utils.py:201: RuntimeError
----------------------------- Captured stdout call -----------------------------
NeoXArgs.from_ymls() ['tests/config/test_setup.yml']
[2024-06-17 21:32:57,005] [WARNING] [runner.py:217:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
__________________________ test_constructor_from_ymls __________________________
def test_constructor_from_ymls():
        t1 = test_constructor_from_ymls_class()
>       t1.test()

tests/unit/test_arguments.py:37: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/unit/test_arguments.py:31: in test
    neox_args = NeoXArgs.from_ymls(["tests/config/test_setup.yml"])
megatron/neox_arguments/arguments.py:229: in from_ymls
    return cls(**config)
<string>:266: in __init__
    ???
megatron/neox_arguments/arguments.py:134: in __post_init__
    self.calculate_derived()
megatron/neox_arguments/arguments.py:836: in calculate_derived
    resources = obtain_resource_pool(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

hostfile_path = 'None', include_arg = 'localhost:1', exclude_arg = ''

    def obtain_resource_pool(
        hostfile_path, include_arg, exclude_arg
    ) -> Dict[str, List[int]]:
        """
        Get dict of `resource_pool[hostname] = [list of GPU ranks]` using hostfile, include and exclude args.
        Modified from: `deepspeed.launcher.runner.main`
        """
        resource_pool = fetch_hostfile(hostfile_path)
        if not resource_pool:
            resource_pool = {}
            device_count = torch.cuda.device_count()
            if device_count == 0:
>               raise RuntimeError("Unable to proceed, no GPU resources available")
E               RuntimeError: Unable to proceed, no GPU resources available
megatron/utils.py:201: RuntimeError
----------------------------- Captured stdout call -----------------------------
NeoXArgs.from_ymls() ['tests/config/test_setup.yml']
[2024-06-17 21:32:57,294] [WARNING] [runner.py:217:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
__________________________ test_constructor_from_dict __________________________
def test_constructor_from_dict():
        t1 = test_constructor_from_dict_class()
>       t1.test()

tests/unit/test_arguments.py:49: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/unit/test_arguments.py:44: in test
    neox_args = NeoXArgs.from_dict(BASE_CONFIG)
megatron/neox_arguments/arguments.py:236: in from_dict
    return cls(**args_dict)
<string>:266: in __init__
    ???
megatron/neox_arguments/arguments.py:134: in __post_init__
    self.calculate_derived()
megatron/neox_arguments/arguments.py:836: in calculate_derived
    resources = obtain_resource_pool(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

hostfile_path = 'None', include_arg = 'localhost:1', exclude_arg = ''
    def obtain_resource_pool(
        hostfile_path, include_arg, exclude_arg
    ) -> Dict[str, List[int]]:
        """
        Get dict of `resource_pool[hostname] = [list of GPU ranks]` using hostfile, include and exclude args.
        Modified from: `deepspeed.launcher.runner.main`
        """
        resource_pool = fetch_hostfile(hostfile_path)
        if not resource_pool:
            resource_pool = {}
            device_count = torch.cuda.device_count()
            if device_count == 0:
>               raise RuntimeError("Unable to proceed, no GPU resources available")
E               RuntimeError: Unable to proceed, no GPU resources available

megatron/utils.py:201: RuntimeError
----------------------------- Captured stdout call -----------------------------
[2024-06-17 21:32:57,574] [WARNING] [runner.py:217:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
_________________________ test_gpt_neox_to_huggingface _________________________
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f278be35b70>
tmpdir = local('/tmp/pytest-of-root/pytest-1/test_gpt_neox_to_huggingface0')
tmp_path = PosixPath('/tmp/pytest-of-root/pytest-1/test_gpt_neox_to_huggingface0')

    def test_gpt_neox_to_huggingface(monkeypatch, tmpdir, tmp_path):
        # Generate random GPT-NEOX model, check we can convert to hf format
        model_dir = str(tmpdir)
        input_args = ["train.py", "tests/config/test_setup.yml"]
>       deepspeed_main_args = simulate_deepy_env(monkeypatch, input_args)

tests/unit/test_format_conversion_scripts.py:11: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/common.py:523: in simulate_deepy_env
    neox_args = NeoXArgs.consume_deepy_args(input_args)
megatron/neox_arguments/arguments.py:371: in consume_deepy_args
    neox_args = cls.from_ymls(
megatron/neox_arguments/arguments.py:229: in from_ymls
    return cls(**config)
<string>:266: in __init__
    ???
megatron/neox_arguments/arguments.py:134: in __post_init__
    self.calculate_derived()
megatron/neox_arguments/arguments.py:836: in calculate_derived
    resources = obtain_resource_pool(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

hostfile_path = 'None', include_arg = 'localhost:1', exclude_arg = ''

    def obtain_resource_pool(
        hostfile_path, include_arg, exclude_arg
    ) -> Dict[str, List[int]]:
        """
        Get dict of `resource_pool[hostname] = [list of GPU ranks]` using hostfile, include and exclude args.
        Modified from: `deepspeed.launcher.runner.main`
        """
        resource_pool = fetch_hostfile(hostfile_path)
        if not resource_pool:
            resource_pool = {}
            device_count = torch.cuda.device_count()
            if device_count == 0:
>               raise RuntimeError("Unable to proceed, no GPU resources available")
E               RuntimeError: Unable to proceed, no GPU resources available

megatron/utils.py:201: RuntimeError
----------------------------- Captured stdout call -----------------------------
NeoXArgs.from_ymls() ['tests/config/test_setup.yml']
[2024-06-17 21:32:58,104] [WARNING] [runner.py:217:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
=============================== warnings summary ===============================
<string>:8
  <string>:8: PytestDeprecationWarning: A private pytest class or function was used.

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/neox_args/test_neoxargs_usage.py::test_neoxargs_usage
FAILED tests/unit/test_arguments.py::test_main_constructor
FAILED tests/unit/test_arguments.py::test_constructor_from_ymls
FAILED tests/unit/test_arguments.py::test_constructor_from_dict
FAILED tests/unit/test_format_conversion_scripts.py::test_gpt_neox_to_huggingface
======= 5 failed, 24 passed, 92 skipped, 80 xfailed, 1 warning in 28.89s =======
Error: Process completed with exit code 1.
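
On CPU-only runners, failures like these can be avoided with a standard pytest skip marker rather than a hard RuntimeError. A minimal sketch (the marker name and test body are illustrative, not something this PR adds):

    import pytest
    import torch

    # Skip GPU-dependent tests when the runner exposes no CUDA devices.
    requires_gpu = pytest.mark.skipif(
        torch.cuda.device_count() == 0,
        reason="no GPU resources available on this runner",
    )

    @requires_gpu
    def test_main_constructor():
        ...  # placeholder; the real test builds NeoXArgs as in the trace above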

Quentin-Anthony (Member) commented

@jahatef -- Why remove the commit hash from deeperspeed, but leave it for lm_dataformat?

jahatef (Collaborator, Author) commented Jun 17, 2024

No good reason; the pin was to a four-month-old commit, which I'm not sure we want to be the default for users. I can add the hash back, or remove the pin from the other package too. I don't believe the pin caused the issues I saw; I think it was the numpy version that broke DeepSpeed.
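
Concretely, the two requirements.txt changes under discussion look something like this (hypothetical lines, not the file's exact contents):

    numpy<2.0   # numpy 2.x breaks DeepSpeed, so cap below 2.0
    # DeeperSpeed was previously pinned to a specific commit hash; this PR drops
    # that pin, while lm_dataformat keeps its pinned hash.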
