Errors with non-CUDA machine #136

Open
eloralopez opened this issue Apr 13, 2024 Discussed in #81 · 1 comment

@eloralopez

Discussed in #81

I am having a similar problem to the one described in the post quoted below, even though it appears that MapClassifier.py has been updated to incorporate the fix that the other user described.

When I try to use "Train Your Network", I get this error:

```
Traceback (most recent call last):
  File "TagLab.py", line 4139, in trainNewNetwork
    dataset_train_info, train_loss_values, val_loss_values = training.trainingNetwork(images_dir_train, labels_dir_train,
  File "/Users/eln/TagLab/models/training.py", line 297, in trainingNetwork
    state = torch.load("models/deeplab-resnet.pth.tar")
  File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 1040, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 1268, in _legacy_load
    result = unpickler.load()
  File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 1205, in persistent_load
    wrap_storage=restore_location(obj, location),
  File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 391, in default_restore_location
    result = fn(storage, location)
  File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 266, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "/Users/eln/miniconda3/lib/python3.8/site-packages/torch/serialization.py", line 250, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
```

I looked at source/MapClassifier.py, since it was mentioned in the previous discussion, and lines 98-101 appear to load the network onto the CPU when torch.cuda.is_available() is False. So this does not seem to be the same problem that the previous user ran into and fixed.

The problem appears to arise in training.py, but I haven't figured out the cause yet. Any assistance would be appreciated!
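The traceback ends at line 297 of training.py, where `torch.load` is called without `map_location`, so PyTorch tries to restore storages that were saved on a CUDA device. A minimal sketch of a device-aware load, assuming the checkpoint is an ordinary `torch.save` file; `load_checkpoint`, `DEVICE` and the temporary file are hypothetical names for illustration, not TagLab code:

```python
import os
import tempfile

import torch

# Use the GPU when CUDA is usable, otherwise fall back to the CPU.
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


def load_checkpoint(path):
    """Load a torch.save() file, remapping every storage onto DEVICE.

    Hypothetical helper: map_location tells torch.load where to place
    storages that were saved on another device (e.g. 'cuda:0').
    """
    return torch.load(path, map_location=DEVICE)


if __name__ == "__main__":
    # Stand-in for models/deeplab-resnet.pth.tar: a tiny state dict.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "checkpoint.pth.tar")
        torch.save({"weight": torch.ones(2, 2)}, path)
        state = load_checkpoint(path)
        print(state["weight"].device)
```

Because `DEVICE` is chosen at runtime, the same line keeps working on a machine that does have a CUDA GPU.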

Originally posted by andieich January 26, 2023
Hi,
I successfully installed TagLab on a Windows computer, but had some issues. I cannot use the GPU because it is made by Intel, and CUDA requires an NVIDIA GPU.
I therefore tried to install the CPU versions of torch and torchvision. When I use the install.py script, I get this error:

```
append() takes exactly one argument (2 given)
```

This is caused by line 234 of install.py. I commented out lines 232-236 and installed both packages manually with the following commands:

```shell
pip install torch --extra-index-url https://download.pytorch.org/whl/cpu
pip install torchvision --extra-index-url https://download.pytorch.org/whl/cpu
```

Afterwards, I could install TagLab flawlessly.
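The `append() takes exactly one argument (2 given)` message is Python's standard `TypeError` for passing two values to `list.append`. Without the contents of install.py, a plausible cause is a line that appends an option and its value in one call; the sketch below is hypothetical (`build_pip_command` and the argument layout are assumptions, not TagLab's installer) and shows the usual fix, `list.extend`:

```python
import sys

CPU_INDEX = "https://download.pytorch.org/whl/cpu"


def build_pip_command(package, extra_index_url=None):
    """Return the argv list for a pip install (hypothetical helper)."""
    cmd = [sys.executable, "-m", "pip", "install", package]
    if extra_index_url:
        # cmd.append("--extra-index-url", extra_index_url) would raise
        # TypeError: append() takes exactly one argument (2 given).
        # extend() adds each item of the iterable as a separate element.
        cmd.extend(["--extra-index-url", extra_index_url])
    return cmd


if __name__ == "__main__":
    for pkg in ("torch", "torchvision"):
        print(" ".join(build_pip_command(pkg, CPU_INDEX)))
```

The resulting list can be passed directly to `subprocess.check_call`.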

However, when trying to run an automatic segmentation, I got this error:

```
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
```

So I did as the message suggests and changed line 98 of source/MapClassifier.py from:

```python
classifier.load_state_dict(torch.load(network_name))
```

to

```python
classifier.load_state_dict(torch.load(network_name, map_location=torch.device('cpu')))
```

Now, everything works fine. Maybe that's something to consider for the next TagLab version.
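Hard-coding `torch.device('cpu')` fixes CPU-only machines but would also keep the classifier off the GPU on machines that have one. A device-agnostic variant of the same line, sketched with a stand-in model: `nn.Linear` is only a placeholder for the network MapClassifier.py actually builds, and `network_name` here points at a temporary file:

```python
import os
import tempfile

import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Placeholder for the real network built in MapClassifier.py.
classifier = nn.Linear(4, 3)

with tempfile.TemporaryDirectory() as tmp:
    network_name = os.path.join(tmp, "classifier.net")
    torch.save(classifier.state_dict(), network_name)

    # Load the weights onto whichever device is available, then move the
    # module there too so parameters and inputs share a device.
    classifier.load_state_dict(torch.load(network_name, map_location=device))
    classifier.to(device)
    classifier.eval()

print(next(classifier.parameters()).device)
```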
Thanks for this amazing software!

@eloralopez
Author

I fixed the issue by editing both training.py and losses.py so that they run on the CPU. The hard-coded CUDA calls appear in several places in both scripts:

In losses.py, at line 28 add:

```python
USE_CUDA = torch.cuda.is_available()
if USE_CUDA:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
    net.to(device)
```

At lines 40 and 62, change `dist_maps_tensor = dist_maps_tensor.to(device='cuda:0')` to `dist_maps_tensor = dist_maps_tensor.to(device)`.

In the surface_loss function, add:

```python
USE_CUDA = torch.cuda.is_available()
if USE_CUDA:
    device = torch.device("cuda")
else:
    device = torch.device("cpu")
```

At line 80, change `one_hot = one_hot.to('cuda:0')` to `one_hot = one_hot.to('cpu')`.
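The losses.py edits replace `'cuda:0'` with a module-level `device` or a fixed `'cpu'`. An alternative is to place helper tensors on the device of the prediction the loss receives, which works unchanged on both kinds of machine. A minimal sketch under that assumption: `one_hot_like` and `surface_loss_sketch` are hypothetical names, not the functions in TagLab's losses.py, and the loss is a simplified stand-in (softmax probabilities weighted by precomputed distance maps, averaged):

```python
import torch
import torch.nn.functional as F


def one_hot_like(labels, logits):
    """One-hot encode (N, H, W) integer labels as (N, C, H, W) values on the
    same device and dtype as logits, instead of a hard-coded 'cuda:0'."""
    num_classes = logits.shape[1]
    one_hot = F.one_hot(labels.long(), num_classes)  # (N, H, W, C)
    one_hot = one_hot.permute(0, 3, 1, 2)            # (N, C, H, W)
    return one_hot.to(device=logits.device, dtype=logits.dtype)


def surface_loss_sketch(logits, dist_maps):
    """Simplified boundary-style loss: probabilities weighted by distance maps.

    dist_maps is moved to wherever the logits already live.
    """
    dist_maps = dist_maps.to(device=logits.device, dtype=logits.dtype)
    probs = torch.softmax(logits, dim=1)
    return (probs * dist_maps).mean()


if __name__ == "__main__":
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    logits = torch.randn(2, 3, 4, 4, device=device)
    labels = torch.randint(0, 3, (2, 4, 4))  # created on the CPU
    print(one_hot_like(labels, logits).shape)
    print(surface_loss_sketch(logits, torch.zeros(2, 3, 4, 4)).item())
```

Deriving the device from `logits` means the loss module no longer needs to know which device the network was moved to.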

In training.py, at line 103 add:

```python
else:
    device = torch.device("cpu")
    net.to(device)
    torch.cpu.synchronize()
```

At line 297, change `state = torch.load("models/deeplab-resnet.pth.tar")` to `state = torch.load("models/deeplab-resnet.pth.tar", map_location=torch.device("cpu"))`.

At line 333, change `class_weights = torch.FloatTensor(weights).cuda()` to `class_weights = torch.FloatTensor(weights).cpu()`.

At line 445, remove `torch.cuda.empty_cache()`.

At line 479, change `net.load_state_dict(torch.load(network_filename))` to `net.load_state_dict(torch.load(network_filename, map_location=torch.device("cpu")))`.
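These training.py edits swap CUDA-only calls for CPU ones, so the script would no longer use a GPU where one exists. A sketch of the same spots written device-agnostically, assuming `weights` is a list of per-class floats and the loss is `nn.CrossEntropyLoss` (both are assumptions; `release_cached_memory` is a hypothetical helper, and this is not TagLab's code):

```python
import torch
import torch.nn as nn

# Chosen once, near where training.py checks for CUDA.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the per-class weights computed in training.py (line 333).
weights = [0.2, 0.3, 0.5]
class_weights = torch.tensor(weights, dtype=torch.float32, device=device)
criterion = nn.CrossEntropyLoss(weight=class_weights)


def release_cached_memory():
    """Free cached GPU memory only when CUDA is in use (stand-in for line 445)."""
    if device.type == "cuda":
        torch.cuda.empty_cache()


if __name__ == "__main__":
    logits = torch.randn(4, 3, device=device)
    targets = torch.tensor([0, 1, 2, 1], device=device)
    print(criterion(logits, targets).item())
    release_cached_memory()
```

The same `device` can be passed as `map_location` at lines 297 and 479 instead of `torch.device("cpu")`.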
