Cannot load the checkpoint #782

Open
jmlongriver12 opened this issue Feb 6, 2023 · 5 comments
Labels
bug Something isn't working

Comments

@jmlongriver12

Describe the bug
Running the generate script raises the error below.
To Reproduce
Steps to reproduce the behavior:

  1. run "./deepy.py generate.py ./configs/20B.yml -i prompt.txt -o sample_outputs.txt"
  2. The following error is raised:
    Loading extension module utils...
    Loading extension module utils...
    Loading extension module utils...
    Loading extension module utils...
    Traceback (most recent call last):
      File "generate.py", line 91, in <module>
        main()
      File "generate.py", line 33, in main
        model, neox_args = setup_for_inference_or_eval(use_cache=True)
      File "/work/c272987/gpt-neox/megatron/utils.py", line 440, in setup_for_inference_or_eval
        model, _, _ = setup_model_and_optimizer(
      File "/work//gpt-neox/megatron/training.py", line 447, in setup_model_and_optimizer
        neox_args.iteration = load_checkpoint(
      File "/work//gpt-neox/megatron/checkpointing.py", line 239, in load_checkpoint
        checkpoint_name, state_dict = model.load_checkpoint(
      File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1523, in load_checkpoint
        load_path, client_states = self._load_checkpoint(load_dir,
      File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1558, in _load_checkpoint
        self.load_module_state_dict(state_dict=checkpoint['module'],
      File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/engine.py", line 1278, in load_module_state_dict
        self.module.load_state_dir(load_dir=self._curr_ckpt_path, strict=strict)
      File "/work//gpt-neox/venv/lib/python3.8/site-packages/deepspeed/runtime/pipe/module.py", line 571, in load_state_dir
        layer.load_state_dict(torch.load(model_ckpt_path,
      File "/work//gpt-neox/venv/lib/python3.8/site-packages/torch/serialization.py", line 778, in load
        with _open_zipfile_reader(opened_file) as opened_zipfile:
      File "/work//gpt-neox/venv/lib/python3.8/site-packages/torch/serialization.py", line 282, in __init__
        super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
    RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
    Traceback (most recent call last):
      File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
        return _run_code(code, main_globals, None,
      File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
        exec(code, run_globals)
      File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 179, in <module>
        main()
      File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 169, in main
        sigkill_handler(signal.SIGTERM, None)  # not coming back
      File "/work/c272987/gpt-neox/venv/lib/python3.8/site-packages/deepspeed/launcher/launch.py", line 147, in sigkill_handler
        raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)

Expected behavior
The generation script runs to completion without errors.

Environment (please complete the following information):

  • GPUs: 4
  • Configs: 20B
jmlongriver12 added the bug label on Feb 6, 2023
@syskn

syskn commented Feb 11, 2023

RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

When this message shows up, it usually implies that one of the checkpoint files is incomplete (e.g. broken during transfer). Can you check the local files?
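
A quick way to do that check is sketched below. This is not part of gpt-neox; it just walks a checkpoint directory (the path is a placeholder) and tries to open every .pt file with PyTorch, flagging any that fail the way the traceback above does.

    # Sketch: verify that every .pt shard in a checkpoint directory is readable.
    # ckpt_dir is a placeholder -- point it at your own checkpoint folder.
    import glob
    import os
    import zipfile

    import torch

    ckpt_dir = "./20B_checkpoints/global_step150000"  # hypothetical path

    for path in sorted(glob.glob(os.path.join(ckpt_dir, "**", "*.pt"), recursive=True)):
        ok_zip = zipfile.is_zipfile(path)  # recent torch.save files are zip archives
        try:
            torch.load(path, map_location="cpu")
            status = "OK"
        except Exception as exc:  # e.g. "failed finding central directory"
            status = f"CORRUPT ({exc})"
        print(f"{os.path.basename(path)}: zip={ok_zip} load={status}")

Any file reported as corrupt should be re-downloaded or re-copied, and its size or checksum compared against the source.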

@cywjava

cywjava commented Mar 22, 2023

I get this error:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 23.70 GiB total capacity; 7.89 GiB already allocated; 39.19 MiB free; 7.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

@StellaAthena
Member

StellaAthena commented Mar 24, 2023

I get this error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 288.00 MiB (GPU 0; 23.70 GiB total capacity; 7.89 GiB already allocated; 39.19 MiB free; 7.89 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

How big is your GPU? You need a rather large GPU to load a 20B model, and it seems you simply don’t have enough VRAM.
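
For context, here is a rough back-of-envelope estimate (an illustrative sketch, not a measurement) of the weight memory alone for a 20B-parameter model:

    # Rough weight-memory estimate for a 20B-parameter model.
    # Ignores activations, optimizer state, and the KV cache used during
    # generation, which all add more on top.
    params = 20e9

    for name, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1)]:
        gib = params * bytes_per_param / 2**30
        print(f"{name}: ~{gib:.0f} GiB of weights")

    # fp16/bf16 comes out to roughly 37 GiB of weights, which already
    # exceeds a single 24 GB card before any runtime overhead.

So a single consumer-class GPU cannot hold the model; the weights have to be split across several GPUs or offloaded.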

@heukirne

heukirne commented Apr 5, 2023

Hi @StellaAthena, I'm trying to run inference and fine-tuning with 20B on 8 x NVIDIA A10G GPUs (23 GB VRAM each) and I still get:

RuntimeError: CUDA out of memory. Tried to allocate 9.59 GiB (GPU 0; 22.04 GiB total capacity; 14.39 GiB already allocated; 7.00 GiB free; 14.39 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

None of the following configs work:

"pipe-parallel-size": 8|4|2|1,
"model-parallel-size": 1|2|4|8,

I'm running version 2.0 of GPT-NeoX.
Do you have any tips on how to adjust the config so it can run?
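
For what it's worth, a rough sketch (assumed numbers, not a measurement) of the per-GPU weight footprint under model parallelism alone, ignoring activations, the KV cache, and framework overhead:

    # Per-GPU weight footprint under model (tensor) parallelism only.
    # Assumes fp16/bf16 weights; activations and the KV cache are not counted.
    params = 20e9
    bytes_per_param = 2
    total_gib = params * bytes_per_param / 2**30  # ~37 GiB of weights in total

    for mp in (1, 2, 4, 8):
        print(f"model-parallel-size={mp}: ~{total_gib / mp:.1f} GiB of weights per GPU")

With 8-way model parallelism the weights alone would occupy under 5 GiB per GPU, so an allocation failure like the 9.59 GiB request above more likely comes from activations, the KV cache, or batch-size and sequence-length settings than from the weights themselves.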

@heukirne

heukirne commented Apr 6, 2023

I was able to run it using the HF version:
https://github.com/mallorbc/GPTNeoX20B_HuggingFace
