CUDA error when I apply my own dataset. #4

Open
Liu-SD opened this issue Jun 19, 2024 · 26 comments

Liu-SD commented Jun 19, 2024

The resolution of my dataset is 5236x3909. I scale down the resolution by 4 and the actual render resolution is 1309x977.

Now I get the following runtime error:

cameras extent: 381.5180541992188 [19/06 15:31:45]
Loading Training Cameras: 10 . [19/06 15:56:00]
0it [00:00, ?it/s]
Loading Test Cameras: 0 . [19/06 15:56:00]
Number of points at initialisation : 23947 [19/06 15:56:00]
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/liu/nerf/RaDe-GS/train.py", line 312, in
training(dataset=lp.extract(args),
File "/home/liu/nerf/RaDe-GS/train.py", line 115, in training
render_pkg = render(viewpoint_cam, gaussians, pipe, background)
File "/home/liu/nerf/RaDe-GS/gaussian_renderer/init.py", line 87, in render
"visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered

What's the reason, and how can I solve it? Thanks a lot!

brianneoberson commented Jun 19, 2024

Hi,
I get this error even when training on the DTU (scan24) dataset. I would also appreciate some help with this. :)

Edit: I am using an RTX 6000 with CUDA 11.8.

BaowenZ (Owner) commented Jun 19, 2024

Hi! It seems the error happens in the CUDA part, but currently I don't have any idea what causes it. I tested the code on two machines with different GPUs (H800 and 4080) but can't reproduce this error. I would appreciate it if you could provide more information. Thank you!

@LinzhouLi

Hi!
I encountered the same issue on an RTX 3090 with CUDA 11.8:

Traceback (most recent call last):
  File "/home/code/RaDe-GS/train.py", line 317, in <module>
    training(dataset=lp.extract(args),
  File "/home/code/RaDe-GS/train.py", line 160, in training
    distortion_loss = torch.tensor([0],dtype=torch.float32,device="cuda")
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
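(A side note for anyone trying to localize this: the hint in the error message can be followed with a minimal sketch like the one below. It assumes the lines go at the very top of train.py, before torch is imported, so the flag is set before the CUDA context is created; with synchronous launches the stack trace points at the kernel call that actually faulted rather than a later, unrelated line.)

# Debugging sketch; assumption: placed before any torch import in train.py.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make CUDA kernel launches synchronous

import torch  # imported only after the environment variable is set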

zhouilu commented Jun 19, 2024

Same error. I checked the render inputs; scale, rotation, and opacity contain NaN values. Why?
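For reference, a minimal sketch of that kind of check, assuming the model exposes the usual 3DGS-style accessors (get_xyz, get_scaling, get_rotation, get_opacity); the exact attribute names in this repository are an assumption:

import torch

def report_non_finite(gaussians, iteration):
    # Print which Gaussian attributes contain NaN/Inf right before render() is called.
    for name, tensor in [("xyz", gaussians.get_xyz),
                         ("scaling", gaussians.get_scaling),
                         ("rotation", gaussians.get_rotation),
                         ("opacity", gaussians.get_opacity)]:
        bad = (~torch.isfinite(tensor)).sum().item()
        if bad:
            print(f"[iter {iteration}] {name}: {bad} non-finite values")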

BaowenZ (Owner) commented Jun 19, 2024

Thank you for the info. This issue seems to be machine-related. Currently, an RTX 4080 with CUDA 12.1 works well. I'm looking for other machines to reproduce this error and fix it.

@LinzhouLi

I found that this issue still exists with CUDA 12.1 on an RTX 3090. It occasionally happens during training:

Training progress:  85%|████████████████████████████████████████████████████▉         | 25630/30000 [25:43<03:22, 21.63it/s, Loss=0.0226, loss_dep=0.0000, loss_normal=0.1220]
Traceback (most recent call last):
  File "/home/code/RaDe-GS/train.py", line 317, in <module>
    training(dataset=lp.extract(args),
  File "/home/code/RaDe-GS/train.py", line 150, in training
    depth_middepth_normal, _ = depth_double_to_normal(viewpoint_cam, rendered_depth, rendered_middepth)
  File "/home/code/RaDe-GS/utils/graphics_utils.py", line 118, in depth_double_to_normal
    points1, points2 = depths_double_to_points(view, depth1, depth2)
  File "/home/code/RaDe-GS/utils/graphics_utils.py", line 105, in depths_double_to_points
    ).float().cuda()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

MrNeRF commented Jun 19, 2024

Can confirm! When it does not crash, it consistently produces results like the attached rendering output on custom data (I did not try any of the official data). I tried deactivating the appearance embedding, but it does not help. Might it be due to the distortion loss? Not sure. But apparently there is a bug in the rasterizer implementation.
[Attached screenshot: 2024-06-19 10-24-54]

WUMINGCHAzero commented Jun 20, 2024

Gradients become NaN after the backward pass on custom data. Need help. Thanks!
Env: torch 1.13.1+cu117, A800 GPU

A quick test: this gradient error still exists after updating forward.cu from your PR.
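One way to localize where the NaN gradients first appear is standard PyTorch anomaly detection. A hedged sketch follows (nothing specific to this repository; the tensor dictionary below is hypothetical). Anomaly detection is slow, so it is only meant for a short debugging run:

import torch

# Raise an error at the exact backward op that first produces NaN (noticeable slowdown).
torch.autograd.set_detect_anomaly(True)

def check_grads(named_tensors):
    # named_tensors: hypothetical dict of leaf tensors, e.g. {"xyz": gaussians._xyz};
    # call this right after loss.backward() to see which parameters received bad gradients.
    for name, t in named_tensors.items():
        if t.grad is not None and not torch.isfinite(t.grad).all():
            print(f"non-finite gradient in {name}")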

@RongLiu-Leo (Contributor)

Same error. It only happens occasionally, like running the experiment five times and succeeding once.

@MELANCHOLY828

I've encountered the same issue with CUDA 12.1.

@zhanghaoyu816

I have also encountered the same issue on an RTX 4090 with CUDA 11.8, PyTorch 2.1.2, and Ubuntu 22.04. As others mentioned earlier, this error occurs randomly during the training process.

Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/Project/Gaussians/RaDe-GS/train.py", line 312, in <module>
    training(dataset=lp.extract(args),
  File "/home/ubuntu/Project/Gaussians/RaDe-GS/train.py", line 160, in training
    distortion_loss = torch.tensor([0],dtype=torch.float32,device="cuda")
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

One solution that might help is issues/41, but I haven't tried it...

tkuye commented Jun 20, 2024

Same error as well. NaN gradients on two different datasets.

BaowenZ (Owner) commented Jun 20, 2024

Thank you for the important information. I have fixed the problem. Please update the code.

MrNeRF commented Jun 20, 2024

Thanks, seems to be fixed. However, the quality is similar to the image posted above. Any idea where this might come from?

@Li-colonel

> Thanks, seems to be fixed. However, the quality is similar to the image posted above. Any idea where this might come from?

Have you verified whether it is due to the distortion loss? An issue was reported in 2DGS, and they then changed the default value of the corresponding hyperparameter to 0.0.

MrNeRF commented Jun 21, 2024

Hmm, the results are already extremely poor after 7k iterations. The distortion and normal consistency losses kick in at 15k, so that's not the reason. My guess is that something in the rasterizer is broken. Strangely, it reports quite good PSNR.

[attached image]

BaowenZ (Owner) commented Jun 21, 2024

> Hmm, the results are already extremely poor after 7k iterations. The distortion and normal consistency losses kick in at 15k, so that's not the reason. My guess is that something in the rasterizer is broken. Strangely, it reports quite good PSNR.
>
> [attached image]

Are you using the viewer in this repository?

MrNeRF commented Jun 21, 2024

I printed every 100th image. The images are very good, quite different from what I see in the viewer. Maybe there is some conversion issue when saving the .ply file?

MrNeRF commented Jun 21, 2024

> Are you using the viewer in this repository?

No. Might that be the reason? What did you change? Maybe it's caused by the mip filtering?

BaowenZ (Owner) commented Jun 21, 2024

> Are you using the viewer in this repository?
>
> No. Might that be the reason? What did you change? Maybe it's caused by the mip filtering?

Yes, I made some modifications for the 3D filters. You can use it in the same way as the original viewer. I think we've found the reason, and I'll update the README for the viewer. Looking forward to good news.

MrNeRF commented Jun 21, 2024

Obviously that was the issue. The rendering is actually quite nice and confirms the reported PSNR. Thanks for the help.

MELANCHOLY828 commented Jun 21, 2024

[attached image]
The same issue: the Gaussians don't look good, but when I check the rendered images and the extracted mesh, the results are actually very good. Why is that?

BaowenZ (Owner) commented Jun 21, 2024

> [attached image] The same issue: the Gaussians don't look good, but when I check the rendered images and the extracted mesh, the results are actually very good. Why is that?

Please use the viewer.

@WUMINGCHAzero

I'm curious why the 3D filter has such a large influence on the rendering results. Could you please explain a bit more? Thanks!

BaowenZ (Owner) commented Jun 22, 2024

> I'm curious why the 3D filter has such a large influence on the rendering results. Could you please explain a bit more? Thanks!

I can't open the .ply files with the original viewer, so I can't reproduce it. But I suspect the .ply files are parsed incorrectly because other code doesn't know my format (the meaning or order of the variables in the file).
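For what it's worth, a small sketch of how to inspect what a saved point cloud actually declares, using the plyfile package that 3DGS-style training code already depends on (the file path below is hypothetical):

from plyfile import PlyData

ply = PlyData.read("output/point_cloud/iteration_30000/point_cloud.ply")  # hypothetical path
vertex = ply["vertex"]
# Print the declared properties in file order; a viewer that assumes a different
# meaning or ordering for these fields will mis-render the Gaussians.
print([prop.name for prop in vertex.properties])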

@Mikael-Spotscale

I can confirm that the latest updates fixed the CUDA error for me.

For anybody else in the same situation: don't forget to reinstall the module with `pip uninstall diff-gaussian-rasterization -y && pip install submodules/diff-gaussian-rasterization` after pulling the code.
