Triangular solve fails on batches of matrices of size > (*, 524280) #79191
I can reproduce this.
For the passing case, the profile shows:

so with a bigger batch the gridDim.y becomes 65536, which is more than the maximum of 65535. (That matches the threshold: 524280 = 8 × 65535, so each y-block apparently covers 8 right-hand sides.)
Smaller repro:
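The repro snippet itself did not survive the scrape. A minimal sketch of what it likely looked like, with shapes inferred from the 524280 threshold rather than taken from the thread:

```python
import torch

# Hypothetical minimal repro (shapes are inferred, not the original code): a
# batched triangular solve with a very wide right-hand side fails with
# CUBLAS_STATUS_EXECUTION_FAILED on CUDA < 12.1.
d = 2
A = torch.eye(d, device="cuda").expand(1, d, d)       # batch of triangular matrices
B_ok = torch.randn(1, d, 524280, device="cuda")       # at the limit: works
B_bad = torch.randn(1, d, 524288, device="cuda")      # past the limit: fails
torch.linalg.solve_triangular(A, B_ok, upper=False)   # fine
torch.linalg.solve_triangular(A, B_bad, upper=False)  # raises the cuBLAS error
```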
I believe we already encountered this in other operations and we resolved that there's not much we can do about it? @IvanYashchuk @xwang233
It's the number of RHSs you have for the triangular solve, so you could split the computation to handle the RHSs in chunks?
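A hedged sketch of that suggestion (the helper name and default chunk size are assumptions, not from the thread): solve against column slices of the right-hand side so that each cuBLAS launch stays within the grid-dimension limit.

```python
import torch

def solve_triangular_chunked(A, B, chunk=524280, upper=False):
    # Solve A X = B by splitting B's trailing (RHS) dimension into chunks,
    # sidestepping the gridDim.y overflow for very wide right-hand sides.
    cols = [
        torch.linalg.solve_triangular(A, B[..., i : i + chunk], upper=upper)
        for i in range(0, B.shape[-1], chunk)
    ]
    return torch.cat(cols, dim=-1)
```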
So what is the cause of the problem? Are there any intuitive explanations? Is it fixable?
Inspired by this discussion, I tried an approach based on repeating the mean and covariance matrix as a batch. My approach seems to work reliably for data at least 128 times bigger than @CloudyDory's example, i.e.
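The code did not survive the scrape; the following is a hedged reconstruction of the repeating idea, with the dimensions assumed:

```python
import torch
from torch.distributions import MultivariateNormal

n, d = 128 * 524280, 2           # assumed: 128x the size of the original example
x = torch.randn(n, d, device="cuda")
# Repeat mean and covariance across the sample batch so the internal solve
# sees n small systems with one RHS each instead of one system with n RHSs.
# Note this is memory-hungry and slow: it implies n tiny Cholesky factorizations.
mean = torch.zeros(d, device="cuda").expand(n, d)
cov = torch.eye(d, device="cuda").expand(n, d, d)
logp = MultivariateNormal(mean, covariance_matrix=cov).log_prob(x)
```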
@lezcano @IvanYashchuk @xwang233 do you think this repeating idea could be incorporated into the MultivariateNormal class so that OP's code works?
@timwaite alas, that's just a hack, and it'll probably be very slow. We should implement #79191 (comment) or #97211 (comment) as a workaround before this is fixed in cusolver (cc @xwang233).
@lezcano thanks for the reply. The following slightly better workaround worked for my use case (computing the log probability of a low-dimensional MVN distribution with many samples). I have posted it in case it is useful to others. The code needs some more input checking etc., but aside from that I don't think there is any drawback to doing the MVN calculation this way on a CUDA device. As far as I can tell:
I wonder if a simple fix along these lines would then be good enough for the MVN (regardless of what happens with triangular_solve)? Of course, it is possible I made a mistake, so it would be good for others to check. The main difference, in my use case, is:
Reading the CUDA documentation, the latter approach seems closer to how cublasStrsmBatched is intended to be used, so I wonder whether this is actually the correct approach.
I don't think this code would ever hit the same cuBLAS failure mode, as that failure seems to be due to wide right-hand sides, whereas this method has an RHS of width 1. I tested high-dimensional distributions until my VRAM ran out. Code for the log probability density:
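The original snippet is missing from the scrape; below is a hedged reconstruction of the approach described, i.e. a batched triangular solve with one width-1 RHS per sample. The function name and formula arrangement are mine:

```python
import math
import torch

def mvn_log_prob(x, mean, scale_tril):
    # x: (n, d) samples; mean: (d,); scale_tril: (d, d) lower Cholesky factor L.
    n, d = x.shape
    diff = (x - mean).unsqueeze(-1)                   # (n, d, 1): one RHS per sample
    z = torch.linalg.solve_triangular(
        scale_tril.expand(n, d, d), diff, upper=False
    )                                                 # batched solve, RHS width 1
    maha = z.squeeze(-1).pow(2).sum(-1)               # squared Mahalanobis distance
    half_log_det = scale_tril.diagonal().log().sum()  # log|L| = 0.5 * log|Sigma|
    return -0.5 * (maha + d * math.log(2 * math.pi)) - half_log_det
```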
Comparison results:
Code to compare performance:
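This snippet is also missing; here is a hedged sketch of what such a comparison could look like, using CUDA events for timing (sizes are assumptions, kept below the failing threshold so that both paths run):

```python
import torch
from torch.distributions import MultivariateNormal

d, n = 3, 500_000                          # assumed sizes, below the failing threshold
mean = torch.zeros(d, device="cuda")
L = torch.linalg.cholesky(2.0 * torch.eye(d, device="cuda"))
x = torch.randn(n, d, device="cuda")

def time_cuda(fn):
    # Time a GPU callable with CUDA events, synchronizing around the call.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    out = fn()
    end.record()
    torch.cuda.synchronize()
    return out, start.elapsed_time(end)    # milliseconds

# mvn_log_prob is assumed to be in scope from the snippet above.
ref, t_ref = time_cuda(lambda: MultivariateNormal(mean, scale_tril=L).log_prob(x))
ours, t_ours = time_cuda(lambda: mvn_log_prob(x, mean, L))
print(f"builtin: {t_ref:.2f} ms, workaround: {t_ours:.2f} ms")
print("max abs diff:", (ref - ours).abs().max().item())
```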
PS: apologies if there are any glaring errors; I am quite new to Python and PyTorch!
This looks like it's been fixed for CUDA >= 12.1. I'll implement a fix for previous versions, though.
Fix #79191 cc jianyuh nikitaved pearu mruberry walterddr xwang233 Lezcano [ghstack-poisoned]
🐛 Describe the bug
The error "CUBLAS_STATUS_EXECUTION_FAILED when calling 'cublasStrsmBatched'" is triggered when calculating the log probabilities of a MultivariateNormal distribution on the GPU with more than 524280 data samples.
Code to reproduce the problem:
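The snippet was lost in the scrape; this hedged reconstruction matches the description above and the data1/data2 names mentioned below, with the event dimension assumed:

```python
import torch
from torch.distributions import MultivariateNormal

device = "cuda"
d = 2                                                # assumed event dimension
dist = MultivariateNormal(torch.zeros(d, device=device),
                          covariance_matrix=torch.eye(d, device=device))

data1 = torch.randn(524280, d, device=device)        # runs fine
data2 = torch.randn(524281, d, device=device)        # one more sample: fails
print(dist.log_prob(data1).sum())
print(dist.log_prob(data2).sum())                    # CUBLAS_STATUS_EXECUTION_FAILED
```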
Result:
Calculating the log_prob of data1 runs without problems. Calculating the log_prob of data2 produces the following error:
The code also runs without problems when the device is switched to CPU.
Versions
The error can be reproduced on the following two systems.
System 1:
System 2:
cc @ezyang @gchanan @zou3519 @fritzo @neerajprad @alicanb @nikitaved @ngimel @jianyuh @pearu @mruberry @walterddr @IvanYashchuk @xwang233 @lezcano