Fix numerical instability in vector_norm when receiving large size tensor #123416
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123416
Note: Links to docs will display an error until the docs builds have been completed. ❌ 13 New Failures as of commit e55d019 with merge base 16cb5d4. NEW FAILURES - The following jobs have failed:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@maybeLee Thanks for your fix. If we perform a group reduction on such large-size cases, accuracy should improve without introducing too much overhead.
Hi @CaoE, thanks for your reply. I think the original code already performs such a group reduction? pytorch/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp Lines 246 to 253 in 7b6e354
@maybeLee This is also a group reduce, but it is only divided by the vec size. Even with a vec size of 32, each group is still large: 256901120 / 32 = 8028160. We can set a parameter, e.g. group_size = 32768, and further group the reduction externally by group_size (within each group, the 32768 elements are reduced in the current way).
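A rough sketch of the externally grouped reduction suggested above, written in NumPy rather than the ATen vectorized kernel (the `grouped_sum` name is hypothetical; group_size = 32768 mirrors the value suggested in the comment):

```python
import numpy as np

def grouped_sum(x, group_size=32768):
    """Sum in fixed-size groups, then combine the partial sums.

    Keeping each accumulator no larger than one group's worth of data
    limits the big-accumulator / small-addend round-off that a single
    float32 running sum suffers.
    """
    x = np.asarray(x, dtype=np.float32).ravel()
    partials = [x[i:i + group_size].sum(dtype=np.float32)
                for i in range(0, x.size, group_size)]
    # Combine the (few) partial sums in float64, then cast back.
    return np.float32(np.sum(partials, dtype=np.float64))
```

For example, summing one million float32 copies of 0.1 with a strictly sequential float32 accumulator drifts by several hundred, while the grouped version stays within one float32 ulp of 100000.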
Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as
I made this pull request because I encountered numerical instability in the API
torch.linalg.vector_norm
when I used it to process a tensor with a large shape such as (1, 32, 224, 224, 160). The code to reproduce the issue is given under Detailed Code below.
When checking the code, I found that the issue may lie here:
pytorch/aten/src/ATen/native/cpu/ReduceOpsKernel.cpp
Line 181 in 3d20cc1
where precision loss occurs during the accumulation of acc_vec: once acc_vec grows large, each data_vec * data_vec addend is relatively small and its low-order bits are rounded away. The fix I am applying uses the Kahan summation algorithm (https://en.wikipedia.org/wiki/Kahan_summation_algorithm).
Please note that after this fix is applied, the average execution time changes from 0.08817670345306397 s to 0.15041379928588866 s.
Detailed Code
import numpy as np
import time

import torch

x1 = np.ones((1, 32, 224, 224, 160))
ord = 2
time_list = []
for i in range(10):
    s_time = time.time()
    res1 = torch.linalg.vector_norm(torch.tensor(x1, dtype=torch.float32), ord=ord)
    time_list.append(time.time() - s_time)
print(np.mean(time_list), np.std(time_list))
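As a side note on why accumulation order matters at this scale: a strictly sequential float32 sum of ones saturates once the accumulator reaches 2**24, well below the 256,901,120 elements of the shape above. A small NumPy illustration (using a smaller n so it runs quickly; this is not the exact ATen kernel, which reduces in vectorized groups and so errs less severely):

```python
import numpy as np

n = 2 ** 25                      # 33_554_432 elements (the PR's tensor has ~2.57e8)
x = np.ones(n, dtype=np.float32)

# Strictly left-to-right float32 accumulation: once the running sum
# reaches 2**24 = 16_777_216, adding 1.0 no longer changes it.
seq = float(np.add.accumulate(x)[-1])

# float64 accumulation as a reference.
ref = float(x.sum(dtype=np.float64))

print(seq, ref)   # 16777216.0 33554432.0
```

Since vector_norm with ord=2 squares the elements and then sums, the same accumulator saturation shows up as an understated norm.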
Please let me know if the fix can be improved.
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10