CUDA error when I apply my own dataset. #4

Open
Liu-SD opened this issue Jun 19, 2024 · 26 comments

Liu-SD commented Jun 19, 2024

The resolution of my dataset is 5236x3909. I scale down the resolution by 4 and the actual render resolution is 1309x977.

Now I get the following runtime error:

cameras extent: 381.5180541992188 [19/06 15:31:45]
Loading Training Cameras: 10 . [19/06 15:56:00]
0it [00:00, ?it/s]
Loading Test Cameras: 0 . [19/06 15:56:00]
Number of points at initialisation : 23947 [19/06 15:56:00]
Training progress: 0%| | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
File "/home/liu/nerf/RaDe-GS/train.py", line 312, in
training(dataset=lp.extract(args),
File "/home/liu/nerf/RaDe-GS/train.py", line 115, in training
render_pkg = render(viewpoint_cam, gaussians, pipe, background)
File "/home/liu/nerf/RaDe-GS/gaussian_renderer/init.py", line 87, in render
"visibility_filter" : radii > 0,
RuntimeError: CUDA error: an illegal memory access was encountered

What's the reason, and how can I solve it? Thanks a lot!

brianneoberson commented Jun 19, 2024

Hi,
I get this error even when training on the DTU (scan24) dataset. I would also appreciate some help with this. :)

Edit: I am using an RTX 6000 with CUDA 11.8.

BaowenZ (Owner) commented Jun 19, 2024

Hi! It seems the error happens in the CUDA part, but currently I don't have any idea what causes it. I tested the code on two machines with different GPUs (H800 and 4080) but can't reproduce this error. I would appreciate it if you could provide more information. Thank you!

@LinzhouLi

Hi!
I encountered the same issue on an RTX 3090 with CUDA 11.8:

Traceback (most recent call last):
  File "/home/code/RaDe-GS/train.py", line 317, in <module>
    training(dataset=lp.extract(args),
  File "/home/code/RaDe-GS/train.py", line 160, in training
    distortion_loss = torch.tensor([0],dtype=torch.float32,device="cuda")
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
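(A side note for anyone trying to localize this: the hint in the error message can be followed with a minimal sketch like the one below. It assumes the lines go at the very top of train.py, before torch is imported, so the flag is set before the CUDA context is created; with synchronous launches the stack trace points at the kernel call that actually faulted rather than a later, unrelated line.)

# Debugging sketch; assumption: placed before any torch import in train.py.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # make CUDA kernel launches synchronous

import torch  # imported only after the environment variable is set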

zhouilu commented Jun 19, 2024

Same error. I checked the render inputs; scale, rotation, and opacity contain NaN values. Why?
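For reference, a minimal sketch of that kind of check, assuming the model exposes the usual 3DGS-style accessors (get_xyz, get_scaling, get_rotation, get_opacity); the exact attribute names in this repository are an assumption:

import torch

def report_non_finite(gaussians, iteration):
    # Print which Gaussian attributes contain NaN/Inf right before render() is called.
    for name, tensor in [("xyz", gaussians.get_xyz),
                         ("scaling", gaussians.get_scaling),
                         ("rotation", gaussians.get_rotation),
                         ("opacity", gaussians.get_opacity)]:
        bad = (~torch.isfinite(tensor)).sum().item()
        if bad:
            print(f"[iter {iteration}] {name}: {bad} non-finite values")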

BaowenZ (Owner) commented Jun 19, 2024

Thank you for the info. This issue seems to be machine-related. Currently, an RTX 4080 with CUDA 12.1 works well. I'm looking for other machines to reproduce this error and fix it.

@LinzhouLi

I found that this issue still exists with CUDA 12.1 on an RTX 3090. It occasionally happens during training:

Training progress:  85%|████████████████████████████████████████████████████▉         | 25630/30000 [25:43<03:22, 21.63it/s, Loss=0.0226, loss_dep=0.0000, loss_normal=0.1220]
Traceback (most recent call last):
  File "/home/code/RaDe-GS/train.py", line 317, in <module>
    training(dataset=lp.extract(args),
  File "/home/code/RaDe-GS/train.py", line 150, in training
    depth_middepth_normal, _ = depth_double_to_normal(viewpoint_cam, rendered_depth, rendered_middepth)
  File "/home/code/RaDe-GS/utils/graphics_utils.py", line 118, in depth_double_to_normal
    points1, points2 = depths_double_to_points(view, depth1, depth2)
  File "/home/code/RaDe-GS/utils/graphics_utils.py", line 105, in depths_double_to_points
    ).float().cuda()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

MrNeRF commented Jun 19, 2024

Can confirm! When it does not crash, it consistently produces results like the attached rendering output on custom data (I did not try any of the official data). I tried deactivating the appearance embedding, but it does not help. Might it be due to the distortion loss? Not sure. But apparently there is a bug in the rasterizer implementation.
[Attached screenshot: 2024-06-19 10-24-54]

WUMINGCHAzero commented Jun 20, 2024

Gradients become NaN after the backward pass on custom data. Need help. Thanks!
Env: torch 1.13.1+cu117, A800 GPU

A quick test: this gradient error still exists after updating forward.cu from your PR.
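One way to localize where the NaN gradients first appear is standard PyTorch anomaly detection. A hedged sketch follows (nothing specific to this repository; the tensor dictionary below is hypothetical). Anomaly detection is slow, so it is only meant for a short debugging run:

import torch

# Raise an error at the exact backward op that first produces NaN (noticeable slowdown).
torch.autograd.set_detect_anomaly(True)

def check_grads(named_tensors):
    # named_tensors: hypothetical dict of leaf tensors, e.g. {"xyz": gaussians._xyz};
    # call this right after loss.backward() to see which parameters received bad gradients.
    for name, t in named_tensors.items():
        if t.grad is not None and not torch.isfinite(t.grad).all():
            print(f"non-finite gradient in {name}")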

@RongLiu-Leo (Contributor)

Same error. It only happens occasionally, like running the experiment five times and succeeding once.

@MELANCHOLY828

I've encountered the same issue with CUDA 12.1.

@zhanghaoyu816

I have also encountered the same issue on an RTX 4090 with CUDA 11.8, PyTorch 2.1.2, and Ubuntu 22.04. As others mentioned earlier, this error occurs randomly during the training process.

Training progress:   0%|          | 0/30000 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/home/ubuntu/Project/Gaussians/RaDe-GS/train.py", line 312, in <module>
    training(dataset=lp.extract(args),
  File "/home/ubuntu/Project/Gaussians/RaDe-GS/train.py", line 160, in training
    distortion_loss = torch.tensor([0],dtype=torch.float32,device="cuda")
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

One solution that might help is issues/41, but I haven't tried it...

tkuye commented Jun 20, 2024

Same error as well. NaN gradients on two different datasets.

BaowenZ (Owner) commented Jun 20, 2024

Thank you for the important information. I have fixed the problem. Please update the code.

MrNeRF commented Jun 20, 2024

Thanks, seems to be fixed. However, the quality is similar to the image posted above. Any idea where this might come from?

@Li-colonel

> Thanks, seems to be fixed. However, the quality is similar to the image posted above. Any idea where this might come from?

Have you verified whether it is due to the distortion loss? An issue was reported in 2DGS, and they then changed the default value of the corresponding hyperparameter to 0.0.

MrNeRF commented Jun 21, 2024

Hmm, the results are already extremely poor after 7k iterations. The distortion and normal consistency losses kick in at 15k, so that's not the reason. My guess is that something in the rasterizer is broken. Strangely, it reports quite good PSNR.

[attached image]

BaowenZ (Owner) commented Jun 21, 2024

> Hmm, the results are already extremely poor after 7k iterations. The distortion and normal consistency losses kick in at 15k, so that's not the reason. My guess is that something in the rasterizer is broken. Strangely, it reports quite good PSNR.
>
> [attached image]

Are you using the viewer in this repository?

MrNeRF commented Jun 21, 2024

I printed every 100th image. The images are very good, quite different from what I see in the viewer. Maybe there is some conversion issue when saving the .ply file?

MrNeRF commented Jun 21, 2024

> Are you using the viewer in this repository?

No. Might that be the reason? What did you change? Maybe it's caused by the mip filtering?

BaowenZ (Owner) commented Jun 21, 2024

> Are you using the viewer in this repository?
>
> No. Might that be the reason? What did you change? Maybe it's caused by the mip filtering?

Yes, I made some modifications for the 3D filters. You can use it in the same way as the original viewer. I think we've found the reason, and I'll update the README for the viewer. Looking forward to good news.

MrNeRF commented Jun 21, 2024

Obviously that was the issue. The rendering is actually quite nice and confirms the reported PSNR. Thanks for the help.

MELANCHOLY828 commented Jun 21, 2024

[attached image]
The same issue: the Gaussians don't look good, but when I check the rendered images and the extracted mesh, the results are actually very good. Why is that?

BaowenZ (Owner) commented Jun 21, 2024

> [attached image] The same issue: the Gaussians don't look good, but when I check the rendered images and the extracted mesh, the results are actually very good. Why is that?

Please use the viewer.

@WUMINGCHAzero

I'm curious why the 3D filter has such a large influence on the rendering results. Could you please explain a bit more? Thanks!

BaowenZ (Owner) commented Jun 22, 2024

> I'm curious why the 3D filter has such a large influence on the rendering results. Could you please explain a bit more? Thanks!

I can't open the .ply files with the original viewer, so I can't reproduce it. But I suspect the .ply files are parsed incorrectly because other code doesn't know my format (the meaning or order of the variables in the file).
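For what it's worth, a small sketch of how to inspect what a saved point cloud actually declares, using the plyfile package that 3DGS-style training code already depends on (the file path below is hypothetical):

from plyfile import PlyData

ply = PlyData.read("output/point_cloud/iteration_30000/point_cloud.ply")  # hypothetical path
vertex = ply["vertex"]
# Print the declared properties in file order; a viewer that assumes a different
# meaning or ordering for these fields will mis-render the Gaussians.
print([prop.name for prop in vertex.properties])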

@Mikael-Spotscale

I can confirm that the latest updates fixed the CUDA error for me.

For anybody else in the same situation: don't forget to reinstall the module with `pip uninstall diff-gaussian-rasterization -y && pip install submodules/diff-gaussian-rasterization` after pulling the code.
