Convert Dot33 to SSE2 #17584

Merged: 1 commit, merged Jun 15, 2023

Conversation

@fp64 (Contributor) commented Jun 15, 2023

Simpler, lower requirements, and doesn't seem to hurt speed. See #17571.

Also removed && !PPSSPP_ARCH(X86). If there was a reason for it being there, please do tell.

@hrydgard added this to the v1.16.0 milestone Jun 15, 2023
@hrydgard merged commit 6d4e5a0 into hrydgard:master Jun 15, 2023 (19 checks passed)
@fp64 deleted the sse2-dot33 branch June 15, 2023 18:12

Review thread on the changed Dot33 code:

if (useSSE4)
	return _mm_cvtss_f32(Dot33SSE4(a.vec, b.vec));
#if defined(_M_SSE)
__m128 v = _mm_mul_ps(a.vec, b.vec); // [X, Y, Z, W]
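
As a point of reference, the diff context cuts off right after the multiply; a minimal SSE2-only Dot33 could finish the three-lane sum along these lines. The function name and the exact shuffle sequence are illustrative assumptions, not necessarily what the PR merged:

#include <xmmintrin.h>

static inline float Dot33SSE2Sketch(__m128 a, __m128 b) {
	__m128 v = _mm_mul_ps(a, b);                              // [X, Y, Z, W] products
	__m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1)); // broadcast the Y product
	__m128 z = _mm_movehl_ps(v, v);                           // move the Z product into lane 0
	return _mm_cvtss_f32(_mm_add_ss(_mm_add_ss(v, y), z));    // X + Y + Z from lane 0
}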

Collaborator commented:

This might crash MSVC builds on x86, especially Debug. Not really a huge issue, but vec is not guaranteed to be aligned if passed on the stack (the stack is only aligned to 16 on x86_64). I guess we could add a loadu and hope the compiler optimizes it out, but otherwise our "solve" has been to avoid intrinsics for x86_32 when they access XMM args directly.

-[Unknown]
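
For reference, the loadu workaround mentioned above might look roughly like this sketch; the helper name is hypothetical, and whether the compiler folds the load away is left to the optimizer:

#include <xmmintrin.h>

// Hypothetical helper: load a potentially misaligned vector payload safely.
static inline __m128 LoadVecUnaligned(const void *p) {
	return _mm_loadu_ps(reinterpret_cast<const float *>(p));
}

// At the spot in question it would be used as:
//   __m128 v = _mm_mul_ps(LoadVecUnaligned(&a.vec), LoadVecUnaligned(&b.vec));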

@hrydgard (Owner) commented Jun 16, 2023

Yeah, I forgot about it too when merging this. Funnily enough, it might not even be noticeable on recent CPUs, since Intel relaxed the alignment requirement for SSE memory operands a while back, but it will affect old ones, which are the more likely ones to run in 32-bit.

@hrydgard (Owner) commented:

I'll revert just the ifdef change.

Collaborator commented:

> Intel relaxed the alignment requirement for SSE memory operands a while back, but it will affect old ones, which are the more likely ones to run in 32-bit.

I could be wrong, as I've heard other people say this same thing recently. But I was pretty sure this was ONLY when a VEX prefix is used (i.e. encoded as AVX/AVX2, regardless of whether YMMs are used), which none of the non-JIT code in PPSSPP is, at least on Windows. At least that's the case on Coffee Lake for sure, which isn't that old.

-[Unknown]
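
To make the encoding distinction concrete, a rough illustration follows; the instruction sequences in the comments are assumptions about typical codegen, which depends on the compiler and flags:

#include <xmmintrin.h>

__m128 AddFromMem(__m128 acc, const float *p) {
	// Without AVX (e.g. /arch:SSE2 or -msse2) this typically folds to
	//     addps xmm0, XMMWORD PTR [mem]
	// and the legacy encoding faults on a misaligned memory operand.
	// With AVX enabled (/arch:AVX or -mavx) the folded form is VEX-encoded,
	//     vaddps xmm0, xmm0, XMMWORD PTR [mem]
	// which does not check alignment; only explicit vmovaps/vmovdqa still do.
	// (_mm_load_ps itself still documents a 16-byte alignment requirement.)
	return _mm_add_ps(acc, _mm_load_ps(p));
}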

@hrydgard (Owner) commented:

Huh, OK. I guess I remembered that one wrong, then. I've avoided those accesses whenever I can anyway, of course.

@fp64 (Contributor, PR author) commented Jun 17, 2023

> I guess we could add a loadu and hope the compiler optimizes it out

I'll just mention that movups is as fast as movaps for aligned data starting with Nehalem (Core i7). In fact, compilers often use movups even when implementing _mm_load_ps or equivalent (copying, dereferencing). I think the guarantee is even stronger: you would need to cross a cache line boundary to get a slowdown.
See e.g.:
https://stackoverflow.com/questions/52147378/choice-between-aligned-vs-unaligned-x86-simd-instructions
https://stackoverflow.com/questions/42697118/visual-studio-2017-mm-load-ps-often-compiled-to-movups

Adding _mm_loadu_ps can also change addps reg,mem into movups reg,mem + addps reg,reg, which... might not actually be slower. MSVC (and GCC) seem rather decent at eliminating movups reg,reg, judging from the simple examples I tried.

So, the 'pepper everything with _mm_loadu_ps' approach doesn't seem to have significant drawbacks, other than visual litter.
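
To put the codegen comparison in code, the two shapes under discussion are roughly the following; the function names are made up and the instruction sequences in the comments are assumptions about typical output, not measurements:

#include <xmmintrin.h>

// Direct use of a (presumed aligned) operand: can fold to mulps xmm, [mem].
__m128 MulDirect(__m128 a, const __m128 &b) {
	return _mm_mul_ps(a, b);
}

// Via _mm_loadu_ps: typically movups xmm, [mem] + mulps reg,reg, the case
// argued above to be no slower; after inlining, a redundant movups reg,reg
// is usually eliminated.
__m128 MulViaLoadu(__m128 a, const float *b) {
	return _mm_mul_ps(a, _mm_loadu_ps(b));
}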

Collaborator commented:

Definitely, compilers often optimize out the loadu, but I have trust issues.

For example, here: on x86_64, ideally I don't want this on the stack/memory; I want it from a register. An example is Dot33<useSSE4>(H.NormalizedOr001(useSSE4), worldnormal). Realistically, worldnormal is probably spilled, but at least the normalized H (and H itself) could've been XMMs directly. I've seen cases where a loadu convinces MSVC that it ought to spill the normalized H (although to an aligned slot) just to load it later. I mean, the code DID say to load it from memory, so I guess it's not "wrong".

But some of that may have improved. I just feel like every time I look at a hot function that uses SIMD and that MSVC is optimizing poorly, it's because it spills things it should NOT spill like crazy, or even unvectorizes things nonsensically. So I don't want to give it excuses to do that to me.

-[Unknown]

@fp64 (Contributor, PR author) commented:

Makes sense.

I suppose it's possible to

#if PPSSPP_ARCH(X86) // Make sure MSVC loads aren't borked.
#define MAGIC_LOAD(v) _mm_loadu_ps(reinterpret_cast<const float*>(&(v)))
#else // Works on non-x86 too!
#define MAGIC_LOAD(v) (v)
#endif

assuming we care enough.
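
If something like that macro were adopted, the SSE path's call site from the review thread above would presumably end up as (a sketch, assuming the macro exactly as written):

__m128 v = _mm_mul_ps(MAGIC_LOAD(a.vec), MAGIC_LOAD(b.vec)); // [X, Y, Z, W]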

hrydgard added a commit that referenced this pull request Jun 16, 2023