
Fix errors on CUDA device #83

Merged
merged 1 commit into KindXiaoming:master on May 6, 2024
Conversation

@X3NNY (Contributor) commented May 5, 2024

Set parameters on the correct device.
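A hypothetical sketch of the kind of fix this PR makes (the class and parameter names here are illustrative, not the repo's actual code): parameters must be created on the layer's target device, otherwise tensors constructed with PyTorch's CPU default end up on a different device than the inputs.

```python
import torch
import torch.nn as nn

class ToyLayer(nn.Module):
    """Illustrative layer showing device-aware parameter creation."""

    def __init__(self, in_dim, out_dim, device="cpu"):
        super().__init__()
        # Before a fix like this, a tensor created without a device
        # argument stays on CPU even when device="cuda", and the
        # forward pass then fails with a device-mismatch error.
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim, device=device))

    def forward(self, x):
        return x @ self.weight.T

layer = ToyLayer(4, 2, device="cpu")  # pass "cuda" on a GPU machine
out = layer(torch.randn(3, 4))
print(out.shape)  # torch.Size([3, 2])
```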

@KindXiaoming KindXiaoming merged commit 9c00ccc into KindXiaoming:master May 6, 2024
@AlessandroFlati (Contributor)
I made the exact same changes to make CUDA work. Nonetheless, it seems slightly less performant on "science" tasks, probably because of the time taken to move memory to the CPU for some critical operations (the ones you found needed a .to("cpu")).
@KindXiaoming, would you mind explaining why that is necessary? I'm working hard on this repo because I believe in its potential, but I haven't had time to look at all the low-level details.

@KindXiaoming (Owner) commented May 6, 2024

Hi AlessandroFlati, do you mean why some operations need to be performed on the CPU? In some cases: (1) I want to use sklearn (I actually never checked whether sklearn supports GPU, so I just went with CPU). For example, sklearn linear regression is called when you want to extract symbolic equations (which is needed if you want interpretability for science). (2) The GPU version of torch.lstsq seems to me more unstable than the CPU version; stable in the sense that the GPU may sometimes return NaN while the CPU doesn't. For example, torch.lstsq is used in curve2coef and hence in initialize_grid_from_another_model (which is needed if you want to do grid extension to aim for accuracy for science).
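The CPU round-trip pattern described above could be sketched like this (the function and variable names are hypothetical; modern PyTorch exposes least squares as torch.linalg.lstsq):

```python
import torch

def lstsq_on_cpu(A, B):
    """Solve the least-squares problem on CPU, where the solver tends
    to be more numerically stable, then move the solution back to the
    tensors' original device. Illustrative sketch only."""
    device = A.device
    solution = torch.linalg.lstsq(A.to("cpu"), B.to("cpu")).solution
    return solution.to(device)

A = torch.randn(10, 3)
B = torch.randn(10, 2)
X = lstsq_on_cpu(A, B)
print(X.shape)  # torch.Size([3, 2])
```

The extra host/device transfer is the likely source of the slowdown on "science" tasks mentioned above: each .to("cpu") forces a synchronization and a memory copy.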

@AlessandroFlati (Contributor)

Then I think I can help you with that. I'll open a fork and try to prepare a PR demonstrating that those two things can be performed on CUDA, with dedicated examples. Maybe I should open another issue for that, but NaNs are also easily propagated if one sets the last activation to exp for large numbers in the dataset (~1e6 is enough), and I was wondering why. I'll keep an eye on that too.

@mw66 commented May 6, 2024

@KindXiaoming

I think this fix deserves a new release (right now the version is v0.0.2).
