
Fix numerical instability in vector_norm when receiving large size tensor #123416

Closed
wants to merge 1 commit

Conversation


@maybeLee maybeLee commented Apr 5, 2024

I made this pull request because I encountered numerical instability in torch.linalg.vector_norm when using it to process a tensor with a large shape such as (1, 32, 224, 224, 160).

Here is the code to reproduce the issue:

import numpy as np
import torch
x1 = np.ones((1, 32, 224, 224, 160))
ord = 2
print(np.size(x1))  # 256901120
res1 = torch.linalg.vector_norm(torch.tensor(x1, dtype=torch.float32), ord=ord)
res2 = torch.linalg.vector_norm(torch.tensor(x1, dtype=torch.float64), ord=ord)

print(res1, res2)  # tensor(11585.2373) tensor(16028.1353, dtype=torch.float64)
print(f"Expected result: {np.sqrt(np.size(x1))}")  # 16028.135262718493

When checking the code, I found that the issue may lie here:

acc_vec += data_vec * data_vec;

Precision loss occurs during the accumulation into acc_vec: once acc_vec grows large while each data_vec * data_vec increment stays relatively small, the low-order bits of the increment fall below the accumulator's precision and are rounded away.
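For intuition, here is a minimal standalone C++ snippet (illustrative only, not PyTorch code) showing the absorption effect in float32: once an accumulator reaches 2^24 = 16777216, adding 1.0f no longer changes it. Notably, the observed result 11585.2373 equals sqrt(2^27), which is consistent with eight float accumulator lanes (e.g., one 256-bit AVX2 vector) each saturating at 2^24 while summing squares of ones.

#include <cstdio>

int main() {
  // In float32, 2^24 + 1 is not representable; the addition rounds
  // back down to 2^24, so the accumulator stops growing from here on.
  float acc = 16777216.0f;  // 2^24
  float bumped = acc + 1.0f;
  std::printf("%.1f\n", bumped);  // prints 16777216.0, not 16777217.0
  return 0;
}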

The fix I am applying is to use the Kahan summation algorithm (https://en.wikipedia.org/wiki/Kahan_summation_algorithm).
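For reference, a minimal scalar sketch of Kahan (compensated) summation applied to the sum of squares; the actual change targets the vectorized acc_vec path, and the function name here is illustrative only:

#include <cstddef>

// Scalar Kahan summation of squares: a compensation term carries the
// low-order bits that a plain float accumulator would discard.
// (Compile without -ffast-math, which may optimize the compensation away.)
float kahan_sum_of_squares(const float* data, std::size_t n) {
  float sum = 0.0f;  // running sum
  float c = 0.0f;    // compensation for lost low-order bits
  for (std::size_t i = 0; i < n; ++i) {
    float y = data[i] * data[i] - c;  // corrected increment
    float t = sum + y;                // low-order bits of y may be lost here
    c = (t - sum) - y;                // algebraically zero; captures the loss
    sum = t;                          // the next iteration corrects for it
  }
  return sum;
}

The extra subtractions per element are what account for the slowdown measured below.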

Please note that after this fix is applied, the average execution time rises from about 0.0882 s to 0.1504 s (roughly 1.7x) on the benchmark below.

Detailed code:

import numpy as np
import time
import torch

x1 = np.ones((1, 32, 224, 224, 160))
ord = 2
time_list = []
for i in range(10):
    s_time = time.time()
    res1 = torch.linalg.vector_norm(torch.tensor(x1, dtype=torch.float32), ord=ord)
    time_list.append(time.time() - s_time)
print(np.mean(time_list), np.std(time_list))

Please let me know if the fix can be improved.

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10


pytorch-bot bot commented Apr 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/123416

Note: Links to docs will display an error until the docs builds have been completed.

❌ 13 New Failures

As of commit e55d019 with merge base 16cb5d4.

This comment was automatically generated by Dr. CI and updates every 15 minutes.


CLA Not Signed

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Apr 5, 2024
@ezyang ezyang requested a review from mingfeima April 10, 2024 11:36
@ezyang ezyang added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 10, 2024
@mingfeima mingfeima requested a review from CaoE April 11, 2024 01:38
@CaoE
Collaborator

CaoE commented Apr 23, 2024

@maybeLee Thanks for your fix. If we perform a group reduction for such large sizes, accuracy should improve without introducing too much overhead.
For example (pseudocode):

acc_t acc_buffer[group_size];
for (int64_t g = 0; g < group_size; g++) {
  // Do the reduction for each group
  acc_buffer[g] = group_reduce(...);
}

// Do the final reduction across groups
double acc_value = reduce_finally(acc_buffer);
result = scalar_t(std::sqrt(acc_value));

@maybeLee
Author

Hi @CaoE, thanks for your reply. I think the original code already does such a group reduction?

for (; d < size - (size % Vec::size()); d += Vec::size()) {
  Vec data_vec = Vec::loadu(self_data + d);
  norm_two_reduce_step(acc_vec, data_vec);
}
acc_vec.store(buffer);
for (int j = 1; j < fVec::size(); j++) {
  buffer[0] = buffer[0] + buffer[j];
}

@CaoE
Copy link
Collaborator

CaoE commented Apr 23, 2024

@maybeLee This is also a group reduction, but it only divides the work by the vector size. Even with a vector size of 32, each group is still large: 256901120 / 32 = 8028160 elements. We can introduce a parameter, e.g., group_size = 32768, and further group the reduction externally by group_size (within each group, 32768 elements are reduced in the current way), as sketched below.
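As an illustration of this proposal, a hedged standalone C++ sketch (the function and parameter names are not PyTorch internals): fixed-size chunks are reduced separately, and the per-chunk partial sums are then combined at double precision, so each float partial stays small enough to avoid absorption.

#include <algorithm>
#include <cmath>
#include <cstddef>

// Externally grouped reduction: reduce fixed-size chunks separately,
// then combine the per-chunk partial sums at double precision.
float grouped_norm2(const float* data, std::size_t size,
                    std::size_t group_size = 32768) {
  double total = 0.0;
  for (std::size_t begin = 0; begin < size; begin += group_size) {
    std::size_t end = std::min(begin + group_size, size);
    float partial = 0.0f;  // stands in for the vectorized inner loop
    for (std::size_t i = begin; i < end; ++i) {
      partial += data[i] * data[i];
    }
    total += static_cast<double>(partial);
  }
  return static_cast<float>(std::sqrt(total));
}

With group_size = 32768, each partial sum over squared ones reaches only 32768, far below the 2^24 absorption threshold, while the outer loop adds just one extra accumulation per 32768 elements.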


Looks like this PR hasn't been updated in a while so we're going to go ahead and mark this as Stale.
Feel free to remove the Stale label if you feel this was a mistake.
If you are unable to remove the Stale label please contact a maintainer in order to do so.
If you want the bot to never mark this PR stale again, add the no-stale label.
Stale pull requests will automatically be closed after 30 days of inactivity.

@github-actions github-actions bot added the Stale label Jun 22, 2024
@github-actions github-actions bot closed this Jul 22, 2024
Labels
module: cpu CPU specific problem (e.g., perf, algorithm) · open source · Stale · triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module