pytorch 1.4 can not load model saved by 1.7 #48915

Closed
Light-- opened this issue Dec 7, 2020 · 2 comments

Comments

@Light--

Light-- commented Dec 7, 2020

🐛 Bug

The model was trained with PyTorch 1.7.0 (CUDA 11.0.221), but it cannot be loaded with PyTorch 1.4.0 (CUDA 10.0.130).

To Reproduce

Steps to reproduce the behavior:

  1. Train the model and save it with 1.7
  2. Load it with 1.4
torch.load('/home/user1/model_best_b.pth.tar')
Traceback (most recent call last):
  File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/IPython/core/interactiveshell.py", line 3417, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-10-13d633918c2f>", line 1, in <module>
    torch.load('/home/wangjunchu/pjs/fae/paper/ckpt/to_test/20201204175958/Arcface50_t4_bs50_bslr_0.001_fclr_0.01/model_best_bacc.pth.tar')
  File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/serialization.py", line 527, in load
    with _open_zipfile_reader(f) as opened_zipfile:
  File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/serialization.py", line 224, in __init__
    super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: version_ <= kMaxSupportedFileFormatVersion INTERNAL ASSERT FAILED at /pytorch/caffe2/serialize/inline_container.cc:132, please report a bug to PyTorch. Attempted to read a PyTorch file with version 3, but the maximum supported version for reading is 2. Your PyTorch installation may be too old. (init at /pytorch/caffe2/serialize/inline_container.cc:132)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f3deaa57193 in /data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: caffe2::serialize::PyTorchStreamReader::init() + 0x1f5b (0x7f3d447949eb in /data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #2: caffe2::serialize::PyTorchStreamReader::PyTorchStreamReader(std::string const&) + 0x64 (0x7f3d44795c04 in /data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/lib/libtorch.so)
frame #3: <unknown function> + 0x6c6536 (0x7f3dcc2d4536 in /data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x295a74 (0x7f3dcbea3a74 in /data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #5: _PyMethodDef_RawFastCallDict + 0x24d (0x55ba98d39bfd in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #6: _PyCFunction_FastCallDict + 0x21 (0x55ba98d39d81 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #7: _PyObject_Call_Prepend + 0x63 (0x55ba98d37a73 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #8: PyObject_Call + 0x6e (0x55ba98d29fde in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #9: <unknown function> + 0xabddd (0x55ba98cadddd in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #10: _PyObject_FastCallKeywords + 0x128 (0x55ba98d7ff78 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #11: _PyEval_EvalFrameDefault + 0x5389 (0x55ba98dd2a39 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #12: _PyEval_EvalCodeWithName + 0x5da (0x55ba98d1766a in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #13: _PyFunction_FastCallDict + 0x1d5 (0x55ba98d184c5 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #14: _PyObject_Call_Prepend + 0x63 (0x55ba98d37a73 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #15: <unknown function> + 0x17d1ba (0x55ba98d7f1ba in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #16: _PyObject_FastCallKeywords + 0x128 (0x55ba98d7ff78 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #17: _PyEval_EvalFrameDefault + 0x4a96 (0x55ba98dd2146 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #18: _PyEval_EvalCodeWithName + 0x2f9 (0x55ba98d17389 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #19: _PyFunction_FastCallKeywords + 0x387 (0x55ba98d6b2b7 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #20: _PyEval_EvalFrameDefault + 0x4b39 (0x55ba98dd21e9 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #21: _PyEval_EvalCodeWithName + 0x2f9 (0x55ba98d17389 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #22: PyEval_EvalCodeEx + 0x44 (0x55ba98d182b4 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #23: PyEval_EvalCode + 0x1c (0x55ba98d182dc in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #24: <unknown function> + 0x1db30d (0x55ba98ddd30d in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #25: _PyMethodDef_RawFastCallKeywords + 0xe9 (0x55ba98d6b939 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #26: _PyCFunction_FastCallKeywords + 0x21 (0x55ba98d6bbd1 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #27: _PyEval_EvalFrameDefault + 0x47a4 (0x55ba98dd1e54 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #28: _PyGen_Send + 0x2a2 (0x55ba98d80f82 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #29: _PyEval_EvalFrameDefault + 0x1a76 (0x55ba98dcf126 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #30: _PyGen_Send + 0x2a2 (0x55ba98d80f82 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #31: _PyEval_EvalFrameDefault + 0x1a76 (0x55ba98dcf126 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #32: _PyGen_Send + 0x2a2 (0x55ba98d80f82 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #33: _PyMethodDef_RawFastCallKeywords + 0x8d (0x55ba98d6b8dd in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #34: _PyMethodDescr_FastCallKeywords + 0x4f (0x55ba98d7fdbf in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #35: _PyEval_EvalFrameDefault + 0x4c9d (0x55ba98dd234d in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #36: _PyFunction_FastCallKeywords + 0xfb (0x55ba98d6b02b in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #37: _PyEval_EvalFrameDefault + 0x416 (0x55ba98dcdac6 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #38: _PyFunction_FastCallKeywords + 0xfb (0x55ba98d6b02b in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #39: _PyEval_EvalFrameDefault + 0x690 (0x55ba98dcdd40 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #40: _PyEval_EvalCodeWithName + 0x2f9 (0x55ba98d17389 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #41: _PyFunction_FastCallKeywords + 0x387 (0x55ba98d6b2b7 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #42: _PyEval_EvalFrameDefault + 0x14d4 (0x55ba98dceb84 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #43: _PyFunction_FastCallKeywords + 0xfb (0x55ba98d6b02b in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #44: _PyEval_EvalFrameDefault + 0x690 (0x55ba98dcdd40 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #45: _PyFunction_FastCallKeywords + 0xfb (0x55ba98d6b02b in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #46: _PyEval_EvalFrameDefault + 0x690 (0x55ba98dcdd40 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #47: _PyEval_EvalCodeWithName + 0x2f9 (0x55ba98d17389 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #48: _PyFunction_FastCallKeywords + 0x325 (0x55ba98d6b255 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #49: _PyEval_EvalFrameDefault + 0x690 (0x55ba98dcdd40 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #50: _PyFunction_FastCallKeywords + 0xfb (0x55ba98d6b02b in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #51: _PyEval_EvalFrameDefault + 0x416 (0x55ba98dcdac6 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #52: _PyFunction_FastCallKeywords + 0xfb (0x55ba98d6b02b in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #53: _PyEval_EvalFrameDefault + 0x4b39 (0x55ba98dd21e9 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #54: _PyEval_EvalCodeWithName + 0x2f9 (0x55ba98d17389 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #55: PyEval_EvalCodeEx + 0x44 (0x55ba98d182b4 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #56: PyEval_EvalCode + 0x1c (0x55ba98d182dc in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #57: <unknown function> + 0x22c664 (0x55ba98e2e664 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #58: PyRun_FileExFlags + 0xa1 (0x55ba98e38a91 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #59: PyRun_SimpleFileExFlags + 0x1c3 (0x55ba98e38c83 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #60: <unknown function> + 0x237db5 (0x55ba98e39db5 in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #61: _Py_UnixMain + 0x3c (0x55ba98e39edc in /data/user1/pkgs/conda/envs/drc/bin/python)
frame #62: __libc_start_main + 0xf0 (0x7f3df6c6e830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #63: <unknown function> + 0x1db3e0 (0x55ba98ddd3e0 in /data/user1/pkgs/conda/envs/drc/bin/python)
  3. Load it with 1.7

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user1/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 594, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/user1/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 853, in _load
    result = unpickler.load()
  File "/home/user1/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 845, in persistent_load
    load_tensor(data_type, size, key, _maybe_decode_ascii(location))
  File "/home/user1/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 834, in load_tensor
    loaded_storages[key] = restore_location(storage, location)
  File "/home/user1/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 175, in default_restore_location
    result = fn(storage, location)
  File "/home/user1/anaconda3/lib/python3.8/site-packages/torch/serialization.py", line 157, in _cuda_deserialize
    return obj.cuda(device)
  File "/home/user1/anaconda3/lib/python3.8/site-packages/torch/_utils.py", line 79, in _cuda
    return new_type(self.size()).copy_(self, non_blocking)
  File "/home/user1/anaconda3/lib/python3.8/site-packages/torch/cuda/__init__.py", line 462, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA error: out of memory

Expected behavior

The model loads normally.

Environment

env of 1.7:

PyTorch version: 1.7.0
Is debug build: True
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.1 LTS (x86_64)
GCC version: (Ubuntu 9.3.0-17ubuntu1~20.04) 9.3.0
Clang version: Could not collect
CMake version: version 3.16.3
Python version: 3.8 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: Could not collect
GPU models and configuration:
GPU 0: TITAN RTX
GPU 1: TITAN RTX

Nvidia driver version: 455.38
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.0.5
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.0.5
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_adv_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8
/usr/local/cuda-11.0/targets/x86_64-linux/lib/libcudnn_ops_train.so.8
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.0.5
/usr/local/cuda-11.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.0.5
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.16.2
[pip3] torch==1.4.0
[pip3] torchvision==0.5.0
[pip3] torchviz==0.0.1
[conda] blas                      1.0                         mkl    defaults
[conda] cudatoolkit               10.1.243             h6bb024c_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] mkl                       2020.1                      217    defaults
[conda] mkl-service               2.3.0            py38he904b0f_0    defaults
[conda] mkl_fft                   1.1.0            py38h23d657b_0    defaults
[conda] mkl_random                1.1.1            py38h0573a6f_0    defaults
[conda] numpy                     1.18.5           py38ha1c710e_0    defaults
[conda] numpy-base                1.18.5           py38hde5b4d6_0    defaults
[conda] numpydoc                  1.1.0                      py_0    defaults
[conda] pytorch                   1.7.0           py3.8_cuda10.1.243_cudnn7.6.3_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchaudio                0.7.0                      py38    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch
[conda] torchvision               0.8.1                py38_cu101    https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch

env of 1.4:

PyTorch version: 1.4.0
Is debug build: False
CUDA used to build PyTorch: 10.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 16.04.3 LTS (x86_64)
GCC version: (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Clang version: Could not collect
CMake version: Could not collect
Python version: 3.7 (64-bit runtime)
Is CUDA available: True
CUDA runtime version: 10.0.130
GPU models and configuration:
GPU 0: TITAN RTX
GPU 1: TITAN RTX
GPU 2: TITAN RTX
GPU 3: TITAN RTX

Nvidia driver version: 440.44
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.18.3
[pip3] torch==1.4.0
[pip3] torchvision==0.6.0
[conda] cudatoolkit               10.1.243             h6bb024c_0    https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main
[conda] numpy                     1.16.2                   pypi_0    pypi
[conda] torch                     1.4.0                    pypi_0    pypi
[conda] torchvision               0.5.0                    pypi_0    pypi
[conda] torchviz                  0.0.1                    pypi_0    pypi

Additional context

A model trained and loaded with 1.4 works fine.
This is a weird bug and I don't know why it happens.
It's urgent, please help.

@ailzhang
Contributor

ailzhang commented Dec 7, 2020

PyTorch doesn't guarantee forward compatibility, but for this particular issue, saving with torch.save(..., _use_new_zipfile_serialization=False) in 1.7 and loading the file in 1.4 might work. Please feel free to reopen if that doesn't fix it. Thanks!
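
For context on the "version 3" error: since 1.6, torch.save writes a zip-based container by default, and the 1.4 reader rejects its format version. The following is a minimal sketch for checking which container a checkpoint file uses before handing it to an older install. The helper name checkpoint_format is hypothetical, and the check is a heuristic on the file's zip structure, not an official PyTorch API.

```python
import zipfile


def checkpoint_format(path):
    """Classify a checkpoint file as 'zip' (newer container) or 'legacy'."""
    # zipfile.is_zipfile only looks for a ZIP end-of-central-directory record;
    # it reads raw bytes and never unpickles anything from the file.
    return "zip" if zipfile.is_zipfile(path) else "legacy"
```

A file reported as 'zip' was most likely written with the newer serialization and would need to be re-saved with _use_new_zipfile_serialization=False before 1.4 can read it.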

@ailzhang ailzhang closed this as completed Dec 7, 2020
@Light--
Author

Light-- commented Dec 10, 2020

torch.save(_use_new_zipfile_serialization=False) in 1.7 and load it in 1.4 might work.

@ailzhang

Your solution fixed it, thanks! But the model also has to be loaded on the CPU (another flag: map_location).

First, I tried your solution. With only the _use_new_zipfile_serialization flag, the file still could not be loaded, and 1.4 reported:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/serialization.py", line 702, in _legacy_load
    result = unpickler.load()
  File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/serialization.py", line 665, in persistent_load
    deserialized_objects[root_key] = restore_location(obj, location)
  File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/serialization.py", line 156, in default_restore_location
    result = fn(storage, location)
  File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/serialization.py", line 136, in _cuda_deserialize
    return storage_type(obj.size())
  File "/data/user1/pkgs/conda/envs/drc/lib/python3.7/site-packages/torch/cuda/__init__.py", line 480, in _lazy_new
    return super(_CudaBase, cls).__new__(cls, *args, **kwargs)
RuntimeError: CUDA out of memory. Tried to allocate 512.00 MiB (GPU 0; 23.65 GiB total capacity; 37.00 KiB already allocated; 53.00 MiB free; 2.00 MiB reserved in total by PyTorch)

No matter which GPU I select with:

os.environ['CUDA_VISIBLE_DEVICES']='3'
torch.cuda.set_device(3)

it still reports an OOM error on GPU 0. Why must it use GPU 0 to load the model?
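
On the GPU 0 question: by default torch.load restores each tensor to the device tag stored in the checkpoint, so tensors saved from cuda:0 are placed back on cuda:0 regardless of torch.cuda.set_device. Separately, CUDA_VISIBLE_DEVICES only takes effect if it is set before CUDA is initialized in the process. The following is a minimal sketch of redirecting saved device tags through the dict form of map_location. The helper name remap_saved_gpus and the default limit of 8 saved GPUs are assumptions made for illustration.

```python
def remap_saved_gpus(target, max_saved_gpus=8):
    """Build a map_location dict sending every saved 'cuda:N' tag to `target`."""
    # Keys are the device tags stored in the checkpoint; values are where
    # torch.load should place those tensors instead.
    return {f"cuda:{i}": target for i in range(max_saved_gpus)}


# Hypothetical usage on the loading side (needs torch installed):
#   state = torch.load("model_best_bacc.pth.tar",
#                      map_location=remap_saved_gpus("cuda:3"))
```

When every tensor should land in one place, passing a plain string such as map_location='cpu' is simpler and avoids touching any GPU during loading.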

Finally, with the help of this post, I got it working.

in 1.7:

torch.save(model_.state_dict(), 'model_best_bacc.pth.tar', _use_new_zipfile_serialization=False)

then in 1.4:

torch.load('model_best_bacc.pth.tar',map_location='cpu')
