Convert Dot33 to SSE2 #17584

Merged: 1 commit, merged Jun 15, 2023

Conversation

@fp64 (Contributor) commented Jun 15, 2023

Simpler, lower requirements, and doesn't seem to hurt speed. See #17571.

Also removed && !PPSSPP_ARCH(X86). If there was a reason for it being there, please do tell.

@hrydgard added this to the v1.16.0 milestone Jun 15, 2023
@hrydgard merged commit 6d4e5a0 into hrydgard:master Jun 15, 2023 (19 checks passed)
@fp64 deleted the sse2-dot33 branch June 15, 2023 18:12

Review thread on the changed Dot33 code:

if (useSSE4)
	return _mm_cvtss_f32(Dot33SSE4(a.vec, b.vec));
#if defined(_M_SSE)
__m128 v = _mm_mul_ps(a.vec, b.vec); // [X, Y, Z, W]
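
As a point of reference, the diff context cuts off right after the multiply; a minimal SSE2-only Dot33 could finish the three-lane sum along these lines. The function name and the exact shuffle sequence are illustrative assumptions, not necessarily what the PR merged:

#include <xmmintrin.h>

static inline float Dot33SSE2Sketch(__m128 a, __m128 b) {
	__m128 v = _mm_mul_ps(a, b);                              // [X, Y, Z, W] products
	__m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast the Y product
	__m128 z = _mm_movehl_ps(v, v);                           // move the Z product into lane 0
	return _mm_cvtss_f32(_mm_add_ss(_mm_add_ss(v, y), z));    // X + Y + Z from lane 0
}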

Collaborator commented:

This might crash MSVC builds on x86, especially Debug. Not really a huge issue, but vec is not guaranteed to be aligned if passed on the stack (the stack is only aligned to 16 on x86_64). I guess we could add a loadu and hope the compiler optimizes it out, but otherwise our "solve" has been to avoid intrinsics for x86_32 when they access XMM args directly.

-[Unknown]
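
For reference, the loadu workaround mentioned above might look roughly like this sketch; the helper name is hypothetical, and whether the compiler folds the load away is left to the optimizer:

#include <xmmintrin.h>

// Hypothetical helper: load a potentially misaligned vector payload safely.
static inline __m128 LoadVecUnaligned(const void *p) {
	return _mm_loadu_ps(reinterpret_cast<const float *>(p));
}

// At the spot in question it would be used as:
//   __m128 v = _mm_mul_ps(LoadVecUnaligned(&a.vec), LoadVecUnaligned(&b.vec));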

@hrydgard (Owner) commented Jun 16, 2023

Yeah, I forgot about it too when merging this. Funnily enough, it might not even be noticeable on recent CPUs, since Intel relaxed the alignment requirement for SSE memory operands a while back, but it will affect old ones, which are the more likely ones to run in 32-bit.

@hrydgard (Owner) commented:

I'll revert just the ifdef change.

Collaborator commented:

> Intel relaxed the alignment requirement for SSE memory operands a while back, but it will affect old ones, which are the more likely ones to run in 32-bit.

I could be wrong, as I've heard other people say this same thing recently. But I was pretty sure this was ONLY when a VEX prefix is used (i.e. encoded as AVX/AVX2, regardless of whether YMMs are used), which none of the non-JIT code in PPSSPP is, at least on Windows. At least that's the case on Coffee Lake for sure, which isn't that old.

-[Unknown]
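
To make the encoding distinction concrete, a rough illustration follows; the instruction sequences in the comments are assumptions about typical codegen, which depends on the compiler and flags:

#include <xmmintrin.h>

__m128 AddFromMem(__m128 acc, const float *p) {
	// Without AVX (e.g. /arch:SSE2 or -msse2) this typically folds to
	//     addps xmm0, XMMWORD PTR [mem]
	// and the legacy encoding faults on a misaligned memory operand.
	// With AVX enabled (/arch:AVX or -mavx) the folded form is VEX-encoded,
	//     vaddps xmm0, xmm0, XMMWORD PTR [mem]
	// which does not check alignment; only explicit vmovaps/vmovdqa still do.
	// (_mm_load_ps itself still documents a 16-byte alignment requirement.)
	return _mm_add_ps(acc, _mm_load_ps(p));
}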

@hrydgard (Owner) commented:

Huh, OK. I guess I remembered that one wrong, then. I've avoided those accesses whenever I can anyway, of course.

@fp64 (Contributor, PR author) commented Jun 17, 2023

> I guess we could add a loadu and hope the compiler optimizes it out

I'll just mention that movups is as fast as movaps for aligned data starting with Nehalem (Core i7). In fact, compilers often use movups even when implementing _mm_load_ps or equivalent (copying, dereferencing). I think the guarantee is even stronger: you would need to cross a cache line boundary to get a slowdown.
See e.g.:
https://stackoverflow.com/questions/52147378/choice-between-aligned-vs-unaligned-x86-simd-instructions
https://stackoverflow.com/questions/42697118/visual-studio-2017-mm-load-ps-often-compiled-to-movups

Adding _mm_loadu_ps can also change addps reg,mem into movups reg,mem + addps reg,reg, which... might not actually be slower. MSVC (and GCC) seem rather decent at eliminating movups reg,reg, judging from the simple examples I tried.

So, the 'pepper everything with _mm_loadu_ps' approach doesn't seem to have significant drawbacks, other than visual litter.
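
To put the codegen comparison in code, the two shapes under discussion are roughly the following; the function names are made up and the instruction sequences in the comments are assumptions about typical output, not measurements:

#include <xmmintrin.h>

// Direct use of a (presumed aligned) operand: can fold to mulps xmm, [mem].
__m128 MulDirect(__m128 a, const __m128 &b) {
	return _mm_mul_ps(a, b);
}

// Via _mm_loadu_ps: typically movups xmm, [mem] + mulps reg,reg, the case
// argued above to be no slower; after inlining, a redundant movups reg,reg
// is usually eliminated.
__m128 MulViaLoadu(__m128 a, const float *b) {
	return _mm_mul_ps(a, _mm_loadu_ps(b));
}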

Collaborator commented:

Definitely, compilers often optimize out the loadu, but I have trust issues.

For example, here: on x86_64, ideally I don't want this on the stack/memory; I want it from a register. An example is Dot33<useSSE4>(H.NormalizedOr001(useSSE4), worldnormal). Realistically, worldnormal is probably spilled, but at least the normalized H (and H itself) could've been XMMs directly. I've seen cases where a loadu convinces MSVC that it ought to spill the normalized H (although to an aligned slot) just to load it later. I mean, the code DID say to load it from memory, so I guess it's not "wrong".

But some of that may have improved. I just feel like every time I look at a hot function that uses SIMD and that MSVC is optimizing poorly, it's because it spills things it should NOT spill like crazy, or even unvectorizes things nonsensically. So I don't want to give it excuses to do that to me.

-[Unknown]

@fp64 (Contributor, PR author) commented:

Makes sense.

I suppose it's possible to

#if PPSSPP_ARCH(X86) // Make sure MSVC loads aren't borked.
#define MAGIC_LOAD(v) _mm_loadu_ps(reinterpret_cast<const float*>(&(v)))
#else // Works on non-x86 too!
#define MAGIC_LOAD(v) (v)
#endif

assuming we care enough.
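
If something like that macro were adopted, the SSE path's call site from the review thread above would presumably end up as (a sketch, assuming the macro exactly as written):

__m128 v = _mm_mul_ps(MAGIC_LOAD(a.vec), MAGIC_LOAD(b.vec)); // [X, Y, Z, W]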

hrydgard added a commit that referenced this pull request Jun 16, 2023