fix torch.prod vectorized path for bool #128009

Open
wants to merge 6 commits into base: gh/zhuhaozhe/37/base

Conversation


@zhuhaozhe zhuhaozhe commented Jun 5, 2024

@zhuhaozhe zhuhaozhe requested a review from mruberry as a code owner June 5, 2024 07:29

pytorch-bot bot commented Jun 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128009

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bcfd4d4 with merge base 92ca17d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Jun 5, 2024
zhuhaozhe added a commit that referenced this pull request Jun 5, 2024
ghstack-source-id: 2ea0f5b2b1bd81d4b2e56e397468d7f00d201287
Pull Request resolved: #128009
@zhuhaozhe zhuhaozhe added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 5, 2024
[ghstack-poisoned]
@zhuhaozhe zhuhaozhe added the release notes: python_frontend release notes category label Jun 5, 2024
@zhuhaozhe zhuhaozhe marked this pull request as draft June 5, 2024 07:31
zhuhaozhe added a commit that referenced this pull request Jun 6, 2024
ghstack-source-id: b51aaff77a0ababb6fac15c142179da35e749e50
Pull Request resolved: #128009
[ghstack-poisoned]

@jgong5 jgong5 left a comment


A nit on the test code. Others LGTM.

@@ -1501,6 +1501,11 @@ def test_prod_bool(self, device):
result = torch.prod(torch.tensor(val, device=device)).item()
expect = np.prod(np.array(val))
self.assertEqual(result, expect)
# https://github.com/pytorch/pytorch/issues/127866
val = [False] * 256
Collaborator

Why not add this val to vals and test them together?

Contributor Author

Thanks for the advice, updated.

Comment on lines 280 to 284
const __m512i* self_ = reinterpret_cast<const __m512i*>(self.as_bytes());
const __m512i* other_ = reinterpret_cast<const __m512i*>(other.as_bytes());
__m512i out = _mm512_and_si512(*self_, *other_);
Vectorized<bool> ret;
std::memcpy(ret, &out, ret.size() * sizeof(bool));
Collaborator

A bit confused why you want to do memory copies here. What's the problem if we do _mm512_and_si512 directly on the two vectors? Would your implementation ever be faster than the scalar version?

Contributor Author

_mm512_and_si512 operates on two __m512i values and returns a __m512i. We can use reinterpret_cast to convert a Vectorized<bool> to a __m512i.

Currently there is no constructor from __m512i to Vectorized<bool>, so I followed the approach here https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec.h#L19 to convert the result back to a Vectorized<bool>.

As for performance, I observed a 4x speedup for the shape x = torch.ones((10240), dtype=torch.bool) on ICX. (I used a larger shape so that overhead does not become the dominant factor.)
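
For anyone following along, a minimal standalone sketch (not part of this PR; all names and the printed output are illustrative only) of the pattern described above — reinterpret two 64-byte bool buffers as __m512i, AND them with _mm512_and_si512, and memcpy the raw result back into bool storage — could look like this, assuming an AVX-512-capable CPU and a compiler flag such as -mavx512f:

#include <immintrin.h>
#include <cstdio>
#include <cstring>

int main() {
  // 64 bools == 512 bits, i.e. exactly one __m512i lane.
  alignas(64) bool a[64], b[64], out[64];
  for (int i = 0; i < 64; ++i) { a[i] = true; b[i] = (i % 2 == 0); }

  // Reinterpret the contiguous bool bytes as 512-bit integer vectors.
  const __m512i* av = reinterpret_cast<const __m512i*>(a);
  const __m512i* bv = reinterpret_cast<const __m512i*>(b);

  // Bitwise AND of the two lanes; for 0x00/0x01-valued bools this is also logical AND.
  __m512i anded = _mm512_and_si512(*av, *bv);

  // No __m512i -> bool-vector constructor is assumed to exist,
  // so copy the raw bytes back into bool storage, as the PR does.
  std::memcpy(out, &anded, sizeof(out));

  for (int i = 0; i < 8; ++i) std::printf("%d ", out[i] ? 1 : 0);  // 1 0 1 0 1 0 1 0
  std::printf("\n");
  return 0;
}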

Collaborator

I'm curious why no other function here needs this reinterpret_cast/memcpy but this one does?

Collaborator

OK. The code you referred to (vec.h#L19) tries to make sure the boolean values are valid, but here the result should already be valid, given that the inputs are valid booleans. In fact, we don't have an intrinsics-based implementation for boolean vectors, and supporting a __m512i constructor seems to require more code changes. I'm open to either doing the memcpy here or adding boolean intrinsic classes. If you choose the former, as you are doing here, perhaps you can add a comment explaining why you are doing the memcpy.

Contributor Author

Hi, @albanD.
For "other functions", do you mean functions like this one?

Vectorized<int32_t> operator<=(const Vectorized<int32_t>& other) const {
  auto mask = _mm512_cmple_epi32_mask(values, other.values);
  return _mm512_mask_set1_epi32(zero_vector, mask, 0xFFFFFFFF);
}

This function takes a Vectorized<int32_t> and returns a Vectorized<int32_t> without reinterpret_cast/memcpy because of the constructors here https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec512/vec512_int.h#L27 and here https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec512/vec512_int.h#L19.
We could also avoid the reinterpret_cast/memcpy by adding an intrinsics-based Vectorized<bool> class, but that would introduce more code changes. Since we don't currently have an intrinsics-based implementation for boolean vectors, and the vectorized implementation here already gives a speedup, we may not need to add it.

Do you have a preference?
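
For illustration only, a hypothetical sketch of the alternative being discussed — a boolean vector class backed directly by __m512i, so that operator& needs no memcpy round-trip — might look like the following. The name VecBool512 and every detail here are made up for this discussion; nothing like it exists in ATen today:

#include <immintrin.h>

// Hypothetical: a minimal __m512i-backed boolean vector.
struct VecBool512 {
  __m512i values;

  VecBool512() : values(_mm512_setzero_si512()) {}
  // The constructor under discussion: wrap a raw __m512i directly.
  explicit VecBool512(__m512i v) : values(v) {}

  static constexpr int size() { return 64; }  // 64 bools per 512-bit lane

  static VecBool512 loadu(const bool* ptr) {
    return VecBool512(_mm512_loadu_si512(ptr));
  }
  void store(bool* ptr) const {
    _mm512_storeu_si512(ptr, values);
  }

  // With the constructor above, no reinterpret_cast/memcpy is needed here.
  VecBool512 operator&(const VecBool512& other) const {
    return VecBool512(_mm512_and_si512(values, other.values));
  }
};

As discussed later in the thread, a real Vectorized<bool> specialization would have to define every operation, which is why the memcpy approach is the smaller change here.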

Collaborator

If we expect these vectorized implementations might be needed in other places, we should add them, yes.
Otherwise, what we have here is OK.

Contributor Author

For now, I cannot see other cases that need a Vectorized<bool>.
And given this note:

// NOTE: If you specialize on a type, you must define all operations!

adding a Vectorized<bool> specialization would mean a lot of code changes. In my opinion we can leave this PR as it is and implement Vectorized<bool> once we find more cases that need it.

Contributor Author

Thanks for the advice; comments added.
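
For reference, a sketch of the kind of explanatory comment discussed above (the exact wording committed in the PR is not shown in this thread) might look like:

// Vectorized<bool> is not backed by a __m512i and has no constructor from one,
// so reinterpret the underlying bool bytes as __m512i for _mm512_and_si512 and
// memcpy the raw result back, following the pattern in aten/src/ATen/cpu/vec/vec.h.
const __m512i* self_ = reinterpret_cast<const __m512i*>(self.as_bytes());
const __m512i* other_ = reinterpret_cast<const __m512i*>(other.as_bytes());
__m512i out = _mm512_and_si512(*self_, *other_);
Vectorized<bool> ret;
std::memcpy(ret, &out, ret.size() * sizeof(bool));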

zhuhaozhe added a commit that referenced this pull request Jun 9, 2024
ghstack-source-id: 2f14b93a2575f28eac5d83482e01c51af5b58e12
Pull Request resolved: #128009
[ghstack-poisoned]
@zhuhaozhe zhuhaozhe requested a review from jgong5 June 9, 2024 09:46

@zhuhaozhe zhuhaozhe requested a review from albanD June 18, 2024 05:31
zhuhaozhe added a commit that referenced this pull request Jun 18, 2024
ghstack-source-id: af2ccfde8876098aef151dd4db38f75208ee3b2f
Pull Request resolved: #128009
[ghstack-poisoned]
@zhuhaozhe zhuhaozhe marked this pull request as ready for review June 19, 2024 05:18
zhuhaozhe added a commit that referenced this pull request Jun 19, 2024
ghstack-source-id: 9974c368b0641a6fb6b913a0a44751b3e1f8f91a
Pull Request resolved: #128009
[ghstack-poisoned]
@zhuhaozhe (Contributor Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
@pytorchmergebot (Collaborator)

Successfully rebased gh/zhuhaozhe/37/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/128009)

pytorchmergebot pushed a commit that referenced this pull request Jun 22, 2024
ghstack-source-id: b113de9bd60c6d7df8e5cce17d49deb98f08e173
Pull Request resolved: #128009
@zhuhaozhe (Contributor Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet)
Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@zhuhaozhe (Contributor Author)

Hi @mruberry, @albanD, could you kindly help review this PR?

Labels
ciflow/trunk (Trigger trunk jobs on your pull request), module: cpu (CPU specific problem, e.g., perf, algorithm), open source, release notes: python_frontend (release notes category)

5 participants