fix torch.prod vectorized path for bool #128009
base: gh/zhuhaozhe/37/base
Conversation
🔗 Helpful links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128009
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit bcfd4d4 with merge base 92ca17d. This comment was automatically generated by Dr. CI and updates every 15 minutes.
ghstack-source-id: 2ea0f5b2b1bd81d4b2e56e397468d7f00d201287 Pull Request resolved: #128009
ghstack-source-id: b51aaff77a0ababb6fac15c142179da35e749e50 Pull Request resolved: #128009
A nit on the test code. Others LGTM.
test/test_reductions.py (outdated)
```diff
@@ -1501,6 +1501,11 @@ def test_prod_bool(self, device):
             result = torch.prod(torch.tensor(val, device=device)).item()
             expect = np.prod(np.array(val))
             self.assertEqual(result, expect)
+        # https://github.com/pytorch/pytorch/issues/127866
+        val = [False] * 256
```
Why not add this `val` to `vals` so they are tested together?
Thanks for the advice, updated.
```cpp
const __m512i* self_ = reinterpret_cast<const __m512i*>(self.as_bytes());
const __m512i* other_ = reinterpret_cast<const __m512i*>(other.as_bytes());
__m512i out = _mm512_and_si512(*self_, *other_);
Vectorized<bool> ret;
std::memcpy(ret, &out, ret.size() * sizeof(bool));
```
A bit confused about why you want to do memory copies here. What's the problem with calling `_mm512_and_si512` directly on the two vectors? And would your implementation ever be faster than the scalar version?
`_mm512_and_si512` operates on two `__m512i` values and returns a `__m512i`. We can use `reinterpret_cast` to convert a `Vectorized<bool>` to `__m512i`, but there is currently no constructor from `__m512i` back to `Vectorized<bool>`, so I followed the reference here https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec.h#L19 to convert the result back to a `Vectorized<bool>`.

As for performance, I observed a 4x speedup for the shape `x = torch.ones((10240), dtype=torch.bool)` on ICX. (A larger shape keeps overhead from becoming the dominant factor.)
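To make the pattern concrete, here is a minimal, self-contained sketch. `VecBool` and `vec_and` are hypothetical stand-ins for `Vectorized<bool>` and the operator under review (the real class lives in vec_base.h and likewise stores its lanes as plain bytes). Compile with, e.g., `g++ -O2 -mavx512f`:

```cpp
#include <immintrin.h>
#include <cstring>
#include <cstdio>

// Hypothetical stand-in for Vectorized<bool>: lanes stored as plain bytes,
// not as a native SIMD register.
struct VecBool {
  static constexpr int kSize = 64;   // 64 one-byte bool lanes = 512 bits
  alignas(64) bool values[kSize];    // aligned so the __m512i deref is safe
  const void* as_bytes() const { return values; }
};

VecBool vec_and(const VecBool& self, const VecBool& other) {
  // View each operand's byte storage as a 512-bit integer vector.
  const __m512i* self_ = reinterpret_cast<const __m512i*>(self.as_bytes());
  const __m512i* other_ = reinterpret_cast<const __m512i*>(other.as_bytes());
  // Bitwise AND on 0/1 bytes is exactly per-lane logical AND.
  __m512i out = _mm512_and_si512(*self_, *other_);
  // No constructor takes __m512i, so copy the register back into the lanes.
  VecBool ret;
  std::memcpy(ret.values, &out, sizeof(ret.values));
  return ret;
}

int main() {
  VecBool a, b;
  for (int i = 0; i < VecBool::kSize; ++i) {
    a.values[i] = true;
    b.values[i] = (i % 2 == 0);
  }
  VecBool c = vec_and(a, b);
  std::printf("lane0=%d lane1=%d\n", c.values[0], c.values[1]);  // lane0=1 lane1=0
}
```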
I'm curious why no other function here needed this reinterpret_cast/memcpy but this one does?
> `_mm512_and_si512` operates on two `__m512i` values and returns a `__m512i`. […] As for performance, I observed a 4x speedup […] on ICX.

OK. The code you referred to tries to make sure the boolean values are valid, but here the result should already be valid given that the inputs are valid booleans. In fact, we don't have an intrinsic-based implementation for boolean vectors, and supporting `__m512i` as a constructor argument seems to require more code changes. I'm open to either doing the memcpy here or adding boolean intrinsic classes. If you choose the former, as you are doing here, perhaps add a comment explaining why the memcpy is needed.
> I'm curious why no other function here needed this reinterpret_cast/memcpy but this one does?

Hi, @albanD. By "other functions", do you mean functions like this one in pytorch/aten/src/ATen/cpu/vec/vec512/vec512_int.h (lines 292 to 294 at f7eae27)?
```cpp
Vectorized<int32_t> operator<=(const Vectorized<int32_t>& other) const {
  auto mask = _mm512_cmple_epi32_mask(values, other.values);
  return _mm512_mask_set1_epi32(zero_vector, mask, 0xFFFFFFFF);
}
```
This function takes a `Vectorized<int32_t>` and returns a `Vectorized<int32_t>` without any `reinterpret_cast`/`memcpy` because of the constructors here https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec512/vec512_int.h#L27 and here https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec512/vec512_int.h#L19.

We could likewise avoid the `reinterpret_cast`/`memcpy` by adding an intrinsic-based `Vectorized<bool>` class. But that would introduce more code changes, and since we currently have no intrinsic-based implementation for boolean vectors and the vectorized path here is already faster, we may not need to add one. Do you have a preference?
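For illustration, here is a hypothetical register-backed wrapper in the style of vec512_int.h (the names are made up, not the actual PyTorch API): because the wrapper stores a native register and has a constructor from `__m512i`, the intrinsic's result converts back implicitly, with no `reinterpret_cast`/`memcpy` needed.

```cpp
#include <immintrin.h>

// Hypothetical sketch of the constructor-based alternative discussed above.
struct VecBool512 {
  __m512i values;                                   // native register storage
  VecBool512() : values(_mm512_setzero_si512()) {}
  /* implicit */ VecBool512(__m512i v) : values(v) {}
};

inline VecBool512 operator&(const VecBool512& a, const VecBool512& b) {
  return _mm512_and_si512(a.values, b.values);  // __m512i -> VecBool512
}

int main() {
  VecBool512 a, b;
  VecBool512 c = a & b;  // all-zero AND all-zero
  (void)c;
}
```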
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we expect these vectorized implementations to be needed in other places, then yes, we should add them. Otherwise, what we have here is OK.
For now, I don't see other cases that need an intrinsic-based `Vectorized<bool>`. And given this note in pytorch/aten/src/ATen/cpu/vec/vec_base.h (line 127 at 8a2fed7):

// NOTE: If you specialize on a type, you must define all operations!

adding a `Vectorized<bool>` specialization would mean a lot of code changes. In my opinion we can leave this PR as it is and implement `Vectorized<bool>` once we find more cases that need it.
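To give a feel for why that is a lot of changes, here is a deliberately incomplete, hypothetical stub of the surface area such a specialization would have to start covering — per the NOTE above, every operation generic kernels may call, not just `operator&`:

```cpp
#include <immintrin.h>

template <typename T> class Vectorized;  // primary template, declaration only

// Hypothetical, deliberately incomplete stub.
template <>
class Vectorized<bool> {
  __m512i values;
 public:
  Vectorized() : values(_mm512_setzero_si512()) {}
  Vectorized(__m512i v) : values(v) {}
  static Vectorized<bool> loadu(const void* ptr) {
    return _mm512_loadu_si512(ptr);
  }
  void store(void* ptr) const { _mm512_storeu_si512(ptr, values); }
  Vectorized<bool> operator&(const Vectorized<bool>& other) const {
    return _mm512_and_si512(values, other.values);
  }
  // ...plus operator|/^, all comparisons, blend, set, arange, map, etc.
};

int main() {}
```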
> OK. The code you referred to tries to make sure the boolean values are valid […] perhaps add a comment explaining why the memcpy is needed.

Thanks for the advice, comments added.
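For reference, the explanatory comment could read along these lines (a sketch only; the wording actually added in the PR may differ):

```cpp
// Vectorized<bool> stores its lanes as plain bool bytes rather than a native
// SIMD register, and currently has no constructor from __m512i. So we view
// the operands' byte storage as __m512i for the intrinsic, then memcpy the
// 512-bit result back into the returned vector's lanes.
const __m512i* self_ = reinterpret_cast<const __m512i*>(self.as_bytes());
const __m512i* other_ = reinterpret_cast<const __m512i*>(other.as_bytes());
__m512i out = _mm512_and_si512(*self_, *other_);
Vectorized<bool> ret;
std::memcpy(ret, &out, ret.size() * sizeof(bool));
```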
ghstack-source-id: 2f14b93a2575f28eac5d83482e01c51af5b58e12 Pull Request resolved: #128009
ghstack-source-id: af2ccfde8876098aef151dd4db38f75208ee3b2f Pull Request resolved: #128009
ghstack-source-id: 9974c368b0641a6fb6b913a0a44751b3e1f8f91a Pull Request resolved: #128009
@pytorchbot rebase

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here.

Successfully rebased
ghstack-source-id: b113de9bd60c6d7df8e5cce17d49deb98f08e173 Pull Request resolved: #128009
@pytorchbot merge

Merge failed. Reason: Approvers from one of the following sets are needed:
Fix #127866.
Stack from ghstack (oldest at bottom):
cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10