Shape cannot match the size during training #3

cqbu · 2023-09-19T17:17:34Z

During training, I got this error in the backbone:

File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 694, in forward
value = value.reshape(B, self.num_heads, self.value_channels//self.num_heads, n_l)
RuntimeError: shape '[24, 1, 96, 40]' is invalid for input of size 368640

This happens in SpatialImageLanguageAttention. I found that num_heads is 1, so this is not really multi-head attention, right?
I can't tell whether the target shape or the tensor size is wrong. What is the expected shape or size?

The full error message is below:
Traceback (most recent call last):
File "train_net_lmpm.py", line 318, in <module>
launch(
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/launch.py", line 69, in launch
mp.start_processes(
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 1 terminated with the following error:
Traceback (most recent call last):
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/launch.py", line 123, in _distributed_worker
main_func(*args)
File "/root/MeViS/train_net_lmpm.py", line 312, in main
return trainer.train()
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 484, in train
super().train(self.start_iter, self.max_iter)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 155, in train
self.run_step()
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/defaults.py", line 494, in run_step
self._trainer.run_step()
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 494, in run_step
loss_dict = self.model(data)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/lmpm/lmpm_model.py", line 281, in forward
return self.train_model(batched_inputs)
File "/root/MeViS/lmpm/lmpm_model.py", line 312, in train_model
features = self.backbone(images.tensor, lang_feat_sentence, lang_mask)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 785, in forward
y = super().forward(x, l, l_mask)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 470, in forward
x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww, l, l_mask)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 590, in forward
x_residual = self.fusion(x, l, l_mask)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 627, in forward
lang = self.image_lang_att(x, l, l_mask) # (B, H*W, dim)
File "/root/anaconda3/envs/torch1/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/root/MeViS/mask2former/modeling/backbone/swin.py", line 694, in forward
value = value.reshape(B, self.num_heads, self.value_channels//self.num_heads, n_l)
RuntimeError: shape '[24, 1, 96, 40]' is invalid for input of size 368640
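
As a quick sanity check on the numbers in the error, here is a minimal sketch that uses only the values printed in the traceback. Nothing in it is taken from the MeViS code itself, and the variable names are illustrative. It compares the element count the reshape asks for with the element count the tensor actually has.

```python
from math import prod

# Numbers copied from the RuntimeError in the traceback above.
# Per the failing line, the target shape is
# (B, num_heads, value_channels // num_heads, n_l).
target_shape = (24, 1, 96, 40)
actual_numel = 368640  # elements actually held by `value`

# Elements the reshape would need.
expected_numel = prod(target_shape)

# How many times larger the real tensor is than the requested shape.
ratio, remainder = divmod(actual_numel, expected_numel)

print(f"reshape needs {expected_numel} elements, tensor has {actual_numel}")
print(f"ratio = {ratio}, remainder = {remainder}")
```

The reshape needs 92160 elements and the tensor holds exactly 4 times that many, with no remainder. That suggests one dimension of value is four times larger than the reshape expects. The traceback alone does not say which dimension it is.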

cqbu · 2023-09-19T17:21:48Z

By the way, I found that when build_batch_data_loader is called, the parameter 'prefetch_factor' is not passed. In detectron2 the default value of prefetch_factor is None, which causes an error in torch's DataLoader at assert prefetch_factor > 0, because None cannot be compared with an int.
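
One possible workaround, sketched here under assumptions: only forward prefetch_factor to the DataLoader when it is an actual integer, so the installed torch version falls back to its own default. The helper make_loader_kwargs is hypothetical and is not part of detectron2 or MeViS. The snippet also reproduces the failing comparison in plain Python.

```python
def make_loader_kwargs(batch_size, num_workers, prefetch_factor=None):
    """Build keyword arguments intended for torch.utils.data.DataLoader.

    Hypothetical helper: `prefetch_factor` is only included when it is not
    None, so a None default never reaches a `prefetch_factor > 0` check.
    """
    kwargs = {"batch_size": batch_size, "num_workers": num_workers}
    if prefetch_factor is not None:
        kwargs["prefetch_factor"] = prefetch_factor
    return kwargs


# The comparison described above fails in Python 3 when the value is None.
try:
    None > 0
    comparison_failed = False
except TypeError:
    comparison_failed = True

print(make_loader_kwargs(batch_size=1, num_workers=4))
print(make_loader_kwargs(batch_size=1, num_workers=4, prefetch_factor=2))
print("None > 0 raises TypeError:", comparison_failed)
```

The resulting dictionary could then be unpacked into the DataLoader call. Whether that is the right place to patch depends on the detectron2 and torch versions installed.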

heshuting555 · 2023-09-21T09:32:18Z

Try running with multiple GPUs and the error will go away!

cilinyan · 2023-10-29T06:36:05Z

> You can try to use multiple gpus to run! And the error will go away!

One simple approach is to ensure that only one video is trained on each GPU.

If you want to train multiple videos per GPU, you may need to modify several parts of the code, such as this.
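
To make the "one video per GPU" condition concrete, here is a rough sketch. It assumes the training script follows the usual detectron2 convention, where SOLVER.IMS_PER_BATCH is the total batch size across all GPUs and must be divisible by the number of GPUs. The helper videos_per_gpu is hypothetical, and the batch sizes in the example are illustrative, not values read from the MeViS configs.

```python
def videos_per_gpu(ims_per_batch, num_gpus):
    """Return how many samples each GPU receives per iteration.

    Assumes `ims_per_batch` is the total batch size over all GPUs,
    as with detectron2's SOLVER.IMS_PER_BATCH.
    """
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if ims_per_batch % num_gpus != 0:
        raise ValueError("total batch size must be divisible by num_gpus")
    return ims_per_batch // num_gpus


# Illustrative values only.
print(videos_per_gpu(ims_per_batch=8, num_gpus=2))  # 4 videos on each GPU
print(videos_per_gpu(ims_per_batch=8, num_gpus=8))  # 1 video on each GPU
print(videos_per_gpu(ims_per_batch=2, num_gpus=2))  # 1 video on each GPU
```

Under that assumption, adding GPUs only helps if the total batch size divided by the GPU count comes out to 1. With fewer GPUs, lowering the total batch size to match the GPU count would have the same effect.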

wwyy1234 · 2024-04-28T02:30:41Z

Excuse me, have you solved this problem? I encountered the same issue. I'm using two GPUs. Could you please let me know how you resolved it?
