fix torch.prod vectorized path for bool #128009

Open
wants to merge 6 commits into base: gh/zhuhaozhe/37/base

Conversation


@zhuhaozhe zhuhaozhe commented Jun 5, 2024

@zhuhaozhe zhuhaozhe requested a review from mruberry as a code owner June 5, 2024 07:29

pytorch-bot bot commented Jun 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/128009

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bcfd4d4 with merge base 92ca17d:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Jun 5, 2024
zhuhaozhe added a commit that referenced this pull request Jun 5, 2024
ghstack-source-id: 2ea0f5b2b1bd81d4b2e56e397468d7f00d201287
Pull Request resolved: #128009
@zhuhaozhe zhuhaozhe added the ciflow/trunk Trigger trunk jobs on your pull request label Jun 5, 2024
[ghstack-poisoned]
@zhuhaozhe zhuhaozhe added the release notes: python_frontend release notes category label Jun 5, 2024
@zhuhaozhe zhuhaozhe marked this pull request as draft June 5, 2024 07:31
zhuhaozhe added a commit that referenced this pull request Jun 6, 2024
ghstack-source-id: b51aaff77a0ababb6fac15c142179da35e749e50
Pull Request resolved: #128009
[ghstack-poisoned]

@jgong5 jgong5 left a comment


A nit on the test code. Others LGTM.

@@ -1501,6 +1501,11 @@ def test_prod_bool(self, device):
result = torch.prod(torch.tensor(val, device=device)).item()
expect = np.prod(np.array(val))
self.assertEqual(result, expect)
# https://github.com/pytorch/pytorch/issues/127866
val = [False] * 256
Collaborator

Why not add this val to vals and test them together?

Contributor Author

Thanks for the advice, updated.

Comment on lines 280 to 284
const __m512i* self_ = reinterpret_cast<const __m512i*>(self.as_bytes());
const __m512i* other_ = reinterpret_cast<const __m512i*>(other.as_bytes());
__m512i out = _mm512_and_si512(*self_, *other_);
Vectorized<bool> ret;
std::memcpy(ret, &out, ret.size() * sizeof(bool));
Collaborator

A bit confused why you want to do memory copies here. What's the problem if we do _mm512_and_si512 directly on the two vectors? Would your implementation ever be faster than the scalar version?

Contributor Author

_mm512_and_si512 operates on two __m512i values and returns a __m512i. We can use reinterpret_cast to convert a Vectorized<bool> to a __m512i.

Currently there is no constructor from __m512i to Vectorized<bool>, so I followed the approach here https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec.h#L19 to convert the result back to a Vectorized<bool>.

As for performance, I observed a 4x speedup for the shape x = torch.ones((10240), dtype=torch.bool) on ICX. (I used a larger shape so that overhead does not become the dominant factor.)
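
For anyone following along, a minimal standalone sketch (not part of this PR; all names and the printed output are illustrative only) of the pattern described above — reinterpret two 64-byte bool buffers as __m512i, AND them with _mm512_and_si512, and memcpy the raw result back into bool storage — could look like this, assuming an AVX-512-capable CPU and a compiler flag such as -mavx512f:

#include <immintrin.h>
#include <cstdio>
#include <cstring>

int main() {
  // 64 bools == 512 bits, i.e. exactly one __m512i lane.
  alignas(64) bool a[64], b[64], out[64];
  for (int i = 0; i < 64; ++i) { a[i] = true; b[i] = (i % 2 == 0); }

  // Reinterpret the contiguous bool bytes as 512-bit integer vectors.
  const __m512i* av = reinterpret_cast<const __m512i*>(a);
  const __m512i* bv = reinterpret_cast<const __m512i*>(b);

  // Bitwise AND of the two lanes; for 0x00/0x01-valued bools this is also logical AND.
  __m512i anded = _mm512_and_si512(*av, *bv);

  // No __m512i -> bool-vector constructor is assumed to exist,
  // so copy the raw bytes back into bool storage, as the PR does.
  std::memcpy(out, &anded, sizeof(out));

  for (int i = 0; i < 8; ++i) std::printf("%d ", out[i] ? 1 : 0);  // 1 0 1 0 1 0 1 0
  std::printf("\n");
  return 0;
}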

Collaborator

I'm curious why no other function here needs this reinterpret_cast/memcpy but this one does?

Collaborator

OK. The code you referred to (vec.h#L19) tries to make sure the boolean values are valid, but here the result should already be valid, given that the inputs are valid booleans. In fact, we don't have an intrinsics-based implementation for boolean vectors, and supporting a __m512i constructor seems to require more code changes. I'm open to either doing the memcpy here or adding boolean intrinsic classes. If you choose the former, as you are doing here, perhaps you can add a comment explaining why you are doing the memcpy.

Contributor Author

Hi, @albanD.
For "other functions", do you mean functions like this one?

Vectorized<int32_t> operator<=(const Vectorized<int32_t>& other) const {
  auto mask = _mm512_cmple_epi32_mask(values, other.values);
  return _mm512_mask_set1_epi32(zero_vector, mask, 0xFFFFFFFF);
}

This function takes a Vectorized<int32_t> and returns a Vectorized<int32_t> without reinterpret_cast/memcpy because of the constructors here https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec512/vec512_int.h#L27 and here https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cpu/vec/vec512/vec512_int.h#L19.
We could also avoid the reinterpret_cast/memcpy by adding an intrinsics-based Vectorized<bool> class, but that would introduce more code changes. Since we don't currently have an intrinsics-based implementation for boolean vectors, and the vectorized implementation here already gives a speedup, we may not need to add it.

Do you have a preference?
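
For illustration only, a hypothetical sketch of the alternative being discussed — a boolean vector class backed directly by __m512i, so that operator& needs no memcpy round-trip — might look like the following. The name VecBool512 and every detail here are made up for this discussion; nothing like it exists in ATen today:

#include <immintrin.h>

// Hypothetical: a minimal __m512i-backed boolean vector.
struct VecBool512 {
  __m512i values;

  VecBool512() : values(_mm512_setzero_si512()) {}
  // The constructor under discussion: wrap a raw __m512i directly.
  explicit VecBool512(__m512i v) : values(v) {}

  static constexpr int size() { return 64; }  // 64 bools per 512-bit lane

  static VecBool512 loadu(const bool* ptr) {
    return VecBool512(_mm512_loadu_si512(ptr));
  }
  void store(bool* ptr) const {
    _mm512_storeu_si512(ptr, values);
  }

  // With the constructor above, no reinterpret_cast/memcpy is needed here.
  VecBool512 operator&(const VecBool512& other) const {
    return VecBool512(_mm512_and_si512(values, other.values));
  }
};

As discussed later in the thread, a real Vectorized<bool> specialization would have to define every operation, which is why the memcpy approach is the smaller change here.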

Collaborator

If we expect these vectorized implementations might be needed in other places, we should add them, yes.
Otherwise, what we have here is OK.

Contributor Author

For now, I cannot see other cases that need a Vectorized<bool>.
And given this note:

// NOTE: If you specialize on a type, you must define all operations!

adding a Vectorized<bool> specialization would mean a lot of code changes. In my opinion we can leave this PR as it is and implement Vectorized<bool> once we find more cases that need it.

Contributor Author

Thanks for the advice; comments added.
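
For reference, a sketch of the kind of explanatory comment discussed above (the exact wording committed in the PR is not shown in this thread) might look like:

// Vectorized<bool> is not backed by a __m512i and has no constructor from one,
// so reinterpret the underlying bool bytes as __m512i for _mm512_and_si512 and
// memcpy the raw result back, following the pattern in aten/src/ATen/cpu/vec/vec.h.
const __m512i* self_ = reinterpret_cast<const __m512i*>(self.as_bytes());
const __m512i* other_ = reinterpret_cast<const __m512i*>(other.as_bytes());
__m512i out = _mm512_and_si512(*self_, *other_);
Vectorized<bool> ret;
std::memcpy(ret, &out, ret.size() * sizeof(bool));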

zhuhaozhe added a commit that referenced this pull request Jun 9, 2024
ghstack-source-id: 2f14b93a2575f28eac5d83482e01c51af5b58e12
Pull Request resolved: #128009
[ghstack-poisoned]
@zhuhaozhe zhuhaozhe requested a review from jgong5 June 9, 2024 09:46

@zhuhaozhe zhuhaozhe requested a review from albanD June 18, 2024 05:31
zhuhaozhe added a commit that referenced this pull request Jun 18, 2024
ghstack-source-id: af2ccfde8876098aef151dd4db38f75208ee3b2f
Pull Request resolved: #128009
[ghstack-poisoned]
@zhuhaozhe zhuhaozhe marked this pull request as ready for review June 19, 2024 05:18
zhuhaozhe added a commit that referenced this pull request Jun 19, 2024
ghstack-source-id: 9974c368b0641a6fb6b913a0a44751b3e1f8f91a
Pull Request resolved: #128009
[ghstack-poisoned]
@zhuhaozhe (Contributor Author)

@pytorchbot rebase

@pytorchmergebot (Collaborator)

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

[ghstack-poisoned]
@pytorchmergebot (Collaborator)

Successfully rebased gh/zhuhaozhe/37/orig onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via ghstack checkout https://github.com/pytorch/pytorch/pull/128009)

pytorchmergebot pushed a commit that referenced this pull request Jun 22, 2024
ghstack-source-id: b113de9bd60c6d7df8e5cce17d49deb98f08e173
Pull Request resolved: #128009
@zhuhaozhe (Contributor Author)

@pytorchbot merge

@pytorchmergebot (Collaborator)

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet)
Details for Dev Infra team (raised by workflow job)

Failing merge rule: Core Maintainers

@zhuhaozhe (Contributor Author)

Hi @mruberry, @albanD, could you kindly help review this PR?

Labels
ciflow/trunk (Trigger trunk jobs on your pull request), module: cpu (CPU specific problem, e.g., perf, algorithm), open source, release notes: python_frontend (release notes category)

5 participants