-
Notifications
You must be signed in to change notification settings - Fork 319
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make Lanes
constexpr
on arm-sve
target
#1593
Comments
Here is how the above code should be corrected:
|
Thanks @johnplatts , I agree with your comments. Finally, we could consider adding a HWY_SVE_512 target which corresponds to A64FX. Then, Lanes could be constexpr. We should have good evidence that this is worthwhile in general, though - it costs compile time and binary size for everyone (except perhaps if it is opt-in). |
Thank you both for the quick responses. In the end I used the |
Sounds good :)
No worries, this would not be a breaking change. A new target means we recompile the code one more time for that target, with corresponding binary size increase (not major).
It's a bit more subtle. Static dispatch only uses the best target enabled via compiler flags. That may or may not be But before we get into that, the question remains: would you see a significant benefit from a target that knows/hard-codes the vector size? You can see examples of the optimizations we can do at occurrences of |
This is a hard question because I don't have the data to back up my answer right now. I will continue working with what highway offers, for now, and come back to you once I have a reasonable example. For this reason, I will leave this issue open, if it's ok for you. |
Sure, makes sense. BTW I'm curious what kind of project you are working on? |
I'm working on an HPC application of computational chemistry inside the EUPEX project. |
Cool, thanks for sharing. We're happy to support your work, please do not hesitate to ask questions or raise issues. |
Hello again, sorry for the late reply. I tried this simple dot product code on a Fujitsu A64FX and noticed that, for Would this problem be solved by a new |
No worries :) I think the issue here is using MaxLanes. That can be ok if we want an upper bound, but the actual number of lanes is Lanes(d). This can be non-constexpr, but in this context that's fine. Usually we have a local variable size_t N = Lanes(d), and the loop is over [0, N). Would that work for you? If you really require constexpr N, then yes, we would have to add an SVE_512, that is doable. We can also consider whether Fujitsu's upcoming Monaka chip supports SVE2, though it is still quite far in the future. |
Thank you very much, I never checked the contrib sections.
Sorry, my bad. From the previous comments by @johnplatts, I thought that
therefore I stopped using Lanes and replaced it with MaxLanes in my code. I thought that MaxLanes returned the maximum number of available/usable lanes on the current variable-width-vector-capable CPU, while it actually returns the maximum supported by the ISA.
Yes, of course that would work, but the best solution would be to have the |
You're welcome :) Please let us know if the dot library can be improved for your use case.
That's right :) I understand you want to pre-bake masks. As an input to the decision whether we create an SVE_512 already, would you be able to gather some evidence about the speedup vs runtime init for example by running on Graviton3 (SVE_256)? |
Hello again, sorry for the delay.
Sure. I have access to an AWS Graviton3 instance which I can use for this. What would you suggest as benchmark for the initialization? |
Sounds good :) I figured we might use your existing(?) benchmark to measure two versions of code:
It is quite possible that mask init might be able to 'hide' behind other latencies, or fill unused pipeline slots? |
I have this "prologue" at the beginning of a function
This code compiles fine on
x86-256
but onarm-sve
outputs:I think it would be useful to have
Lanes
as a constexpr function so that one can explicit some constraints throughstatic_assert
s.I am using clang 16.0.6 on an ARM A64FX.
The text was updated successfully, but these errors were encountered: