This repository has been archived by the owner on Jan 23, 2023. It is now read-only.

Optimize Buffer.MemoryCopy #9786

Merged: 1 commit merged into dotnet:master on Mar 4, 2017

Conversation

@vkvenkat commented Feb 25, 2017

This PR optimizes Buffer.MemoryCopy for lengths up to a threshold. Performance measurements were made on a Skylake desktop (Intel(R) Core(TM) i7-6700K CPU @ 4.00 GHz, 32 GB RAM) running Windows 10 Enterprise and Ubuntu 14.04, using a micro-benchmark that exercises multiple buckets of pseudo-random copy sizes from 0 to 4 KB (details below).

Performance improvements with the optimization:

  • 2% to 73% for x64 Windows
  • -4% to 73% for x86 Windows
  • 0% to 70% for x64 Ubuntu

The charts below show the performance speedup:

[chart image]

[chart image]

The table below lists absolute clocks, in ticks, for one million copy operations:

[table image]

Here's the pseudo-code for the micro-benchmark:

// 10 buckets of copy sizes within specified ranges: [0,8], [9,16], [17,32], ..., [2049,4096]
for (int i = 0; i < MAX_BUCKETS; i++)
{
    Random rnd = new Random(1);
    for (int j = 0; j < COPYLEN; j++)
    {
        // Randomize copy length within the bucket's bounds
        copyLen[j] = (ushort)rnd.Next(0, 9); // range changes with bucket
    }

    begin = DateTime.Now.Ticks;

    for (int k = 0; k < 1000000; k++)
    {
        for (count = 0; count < COPYLEN; count++)
        {
            Buffer.MemoryCopy(pSource, pTarget, 8, copyLen[count]); // destination max size changes with bucket
        }
    }

    time = DateTime.Now.Ticks - begin; // time reported for this bucket
}

@vkvenkat (Author):

@jkotas PTAL

PInvoke:
#if AMD64 && !PLATFORM_UNIX
FCopy:
_MemoryCopy(dest, src, len);
@jkotas (Member), Feb 25, 2017:

This should go through PInvoke like before so that it does not starve GC for large copies.

(Author):

Sure, will do. Buffer::BlockCopy (JIT_MemCpy(dstPtr, srcPtr, count);) uses a similar FCALL to JIT_MemCpy as well.

@jkotas (Member), Feb 25, 2017:

Yes, we have places like the use of memcpy in Buffer::BlockCopy that can starve the GC and lead to long GC pause times. We are trying not to introduce more of them, and to fix the ones that are there.
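
For readers following the GC point: a long copy done inside an FCALL keeps the thread in cooperative mode, so the GC cannot suspend it until the copy finishes, whereas a P/Invoke transition leaves the thread preemptive for the duration of the native call. A minimal sketch of the shape of that dispatch; the threshold value, NativeMemmove, and the library binding are illustrative assumptions, not the actual corelib entry points:

using System.Runtime.InteropServices;

static unsafe class MemmoveSketch
{
    const uint CopyThreshold = 2048; // assumed value, for the sketch only

    public static void Memmove(byte* dest, byte* src, nuint len)
    {
        if (len > CopyThreshold)
        {
            // P/Invoke: the thread is GC-preemptive while the native copy runs,
            // so a large copy cannot delay a GC suspension.
            NativeMemmove(dest, src, len);
            return;
        }

        // Managed fast path for small copies (the real code uses wide unrolled moves).
        for (nuint i = 0; i < len; i++)
        {
            *dest++ = *src++;
        }
    }

    [DllImport("NativeCopyLib", EntryPoint = "memmove")] // hypothetical binding
    static extern void* NativeMemmove(byte* dest, byte* src, nuint len);
}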

// check can produce false positives for lengths greater than Int32.MaxInt. But, this is fine because we want to FCALL/PInvoke
// for large lengths anyways.

if ((len > CopyThreshold) || ((nuint)dest < (nuint)srcEnd))
(Member):

The (nuint)dest < (nuint)srcEnd check does not look right. It means we will hit the slow path any time the destination is lower than the source.

(Member):

Why is it profitable to check for the CopyThreshold here, and not later? It adds overhead for small copies.

(Author):

Had re-purposed the if ((nuint)dest - (nuint)src < len) goto PInvoke overlap check from the original. Shall think of an alternative for this.

private struct Block16 { }

[StructLayout(LayoutKind.Sequential, Size = 64)]
private struct Block64 { }
(Member):

Does it work well / will it work well on ARM64?

(Author):

Shall special-case the usage of custom structs to AMD64 @jamesqo

(Comment):

x86 CoreCLR switched to RyuJIT, so there shouldn't be any differences between x86 and x64 here; both should use Block64.

I don't know what ARM64 does, but it would make sense for it to use the best available method for moving a 64-byte block; that should be the same as or better than moving 4 bytes at a time.

It may be better to handle this by adding some specific define at the top of the file so it's easy to change:

#if AMD64 || X86
#define HAS_BLOCK64
#endif

(Member):

BTW: Have you done the perf testing with a crossgened corelib (corelib is always crossgened in shipping configs)? I do not think the 64-byte blocks will make a difference for a crossgened corelib because it is limited to SSE2; it should not hurt though.

(Author):

@jkotas Yes, all of the results posted are with crossgened corelib

(Comment):

@vkvenkat Great work optimizing this. I didn't have much time to look over the code, but one thing I noticed-- why are there Block16 / Block64 structs but no Block32?

(Comment):

Lol, I just realized you made that comment a year ago. Still asking if the code is still there, though.

@jkotas (Member) commented Feb 25, 2017

cc @jamesqo

@jamesqo left a comment:

Nice changes! I have my own PR to modify MemoryCopy that I haven't gotten around to finishing yet: #9143. If this is merged, I will gladly close mine in favor of yours, however.

#endif // AMD64

// FCALL/PInvoke to the native version when the copy length exceeds the threshold or when the buffers are overlapping. This
// check can produce false positives for lengths greater than Int32.MaxInt. But, this is fine because we want to FCALL/PInvoke
(Comment):

Nit: int.MaxValue

if (len > 64) goto MCPY05;

MCPY00:
// len > 16 && len <= 64
(Comment):

Nit: Could be a Debug.Assert instead of a comment (same for other comments like this)


MCPY04:
// len < 4
if ((len & len) == 0) return;
@jamesqo, Feb 25, 2017:

len & len is always going to be len.

return;

MCPY02:
if ((len & 24) == 0) goto MCPY03;
(Comment):

Interesting that replacing the switch table with all of these branches results in better perf.

(Comment):

The switch statement uses a table in memory, and memory is slow. Second, it will pollute your data cache.
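
To make the comparison concrete, here is a rough, self-contained sketch (not the PR's exact code) of the two dispatch styles for a small tail copy; the dense switch typically compiles to an indirect jump through a table in memory, while the bit-test version is a few predictable compares plus end-relative writes:

static unsafe class TailCopyDemo
{
    // Jump-table style: one case per remaining length (0-7 bytes).
    public static void TailCopySwitch(byte* dest, byte* src, int len)
    {
        switch (len)
        {
            case 7: dest[6] = src[6]; goto case 6;
            case 6: dest[5] = src[5]; goto case 5;
            case 5: dest[4] = src[4]; goto case 4;
            case 4: *(int*)dest = *(int*)src; break;
            case 3: dest[2] = src[2]; goto case 2;
            case 2: *(short*)dest = *(short*)src; break;
            case 1: dest[0] = src[0]; break;
        }
    }

    // Branch style: test length bits; overlapping end-relative writes remove the
    // need for a distinct case per length. Requires 0 <= len < 8.
    public static void TailCopyBranches(byte* dest, byte* src, int len)
    {
        if ((len & 4) != 0)
        {
            *(int*)dest = *(int*)src;
            *(int*)(dest + len - 4) = *(int*)(src + len - 4);
            return;
        }
        if ((len & 2) != 0)
        {
            *(short*)dest = *(short*)src;
            *(short*)(dest + len - 2) = *(short*)(src + len - 2);
            return;
        }
        if ((len & 1) != 0)
            dest[0] = src[0];
    }
}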

*(int*)(destEnd - 8) = *(int*)(srcEnd - 8);
*(int*)(destEnd - 4) = *(int*)(srcEnd - 4);
#endif
return;
@jamesqo, Feb 25, 2017:

There are now 2/4 writes being incurred here instead of 1/2 for len == 8. I scanned your test code and it seems to indicate that for random numbers in [0, 8] the change is a net win, but have you tried testing consistently with len == 8?

(Author):

Seeing a 10% perf improvement with our version for a million copies with len == 8 on x64 Windows. Looks like eliminating the switch table led to this, despite the extra write.

// len >= 4 && len < 8
*(int*)dest = *(int*)src;
*(int*)(destEnd - 4) = *(int*)(srcEnd - 4);
return;
(Comment):

Same here, we're incurring an extra write for len == 4.

(Author):

Saw an 8% perf drop for len == 4 for a million copies.

#if BIT64
*(long*)(dest + i) = *(long*)(src + i);
*(long*)(dest + i + 8) = *(long*)(src + i + 8);
*(Block64*)dest = *(Block64*)src;
(Comment):

SIMD is not enabled for ARM64 yet, I think. I think all of the Block* usages should be under AMD64 actually; I believe RyuJIT is the default JIT on x86 now, so AMD64 || (BIT32 && !ARM)? /cc @jkotas
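
For context, the block-struct idea under discussion reduces to the following; a minimal sketch assuming the JIT lowers the struct assignment to the widest moves the target supports (SIMD under RyuJIT on x86/x64, ordinary integer moves elsewhere):

using System.Runtime.InteropServices;

// Empty structs whose only meaningful property is their Size: dereferencing a
// Block64* copies 64 bytes in a single struct assignment.
[StructLayout(LayoutKind.Sequential, Size = 16)]
struct Block16 { }

[StructLayout(LayoutKind.Sequential, Size = 64)]
struct Block64 { }

static unsafe class BlockCopyDemo
{
    public static void Copy64(byte* dest, byte* src)
    {
        *(Block64*)dest = *(Block64*)src; // one 64-byte block copy
    }
}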

*(int*)(dest + 48) = *(int*)(src + 48);
*(int*)(dest + 52) = *(int*)(src + 52);
*(int*)(dest + 56) = *(int*)(src + 56);
*(int*)(dest + 60) = *(int*)(src + 60);
(Comment):

Is it beneficial to unroll the loop this much? In my experiments last year I recall 64 being slightly worse than doing chunks of 48.

(Author):

Yes, 64 performs better compared to 48. Seeing a regression with 48 for all 3 copy size buckets involved here.

n--;
if (n != 0) goto MCPY06;

len = len & 0x3f;
(Comment):

Nit: len &= 0x3f

(Comment):

Nit: Also might be better to just write %= 64 to make it more understandable for others.

@jamesqo commented Feb 25, 2017

/cc @benaadams @Anderman

@DrewScoggins (Member):

test Windows_NT_X64 perf
test Ubuntu14.04_x64 perf

@DrewScoggins (Member):

test Ubuntu14.04 perf

@vkvenkat (Author):

@jkotas @jamesqo Addressed all of the CR suggestions and pushed my latest. PTAL.

*(int*)(dest + 4) = *(int*)(src + 4);
*(int*)(dest + 8) = *(int*)(src + 8);
*(int*)(dest + 12) = *(int*)(src + 12);
*(int*)dest = *(int*)src;
(Member):

These should be done as long for BIT64


switch (len)
// PInvoke to the native version when the buffers are overlapping.
if ((((nuint)dest <= (nuint)src) && ((nuint)destEnd > (nuint)src)) ||
(Member):

Why are you not using the original condition if ((nuint)dest - (nuint)src < len) goto PInvoke; ?

(Author):

src + len needs to be re-computed in quite a few places, so I created srcEnd to avoid this. Had optimized the original condition to if ((nuint)dest < (nuint)srcEnd) and used it in the PR submitted last week. Revised it in favor of a more thorough overlap check as suggested in one of your CR comments.

@jkotas (Member), Feb 28, 2017:

Had optimized the original condition to if ((nuint)dest < (nuint)srcEnd)

This optimization was incorrect. The original condition had a subtle dependency on integer overflow; rearranging the left and right sides broke it.

Revised it in favor of a more thorough overlap check

I do not think we need a more thorough overlap check here. We just need the original one that had a single compare and jump (unless there is some non-obvious reason why the one with two compares is faster...).
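
A short worked example of that overflow dependency (illustrative, not corelib code): with unsigned nuint arithmetic, the single compare acts as a range check.

static unsafe class OverlapCheckDemo
{
    // Original form: if ((nuint)dest - (nuint)src < len) goto PInvoke;
    //
    // Case 1: dest ahead of src, inside the source range, e.g.
    //   src = 0x1000, dest = 0x1004, len = 16
    //   dest - src = 4, and 4 < 16        -> slow path (a forward copy would clobber src)
    //
    // Case 2: dest below src, e.g.
    //   src = 0x1004, dest = 0x1000, len = 16
    //   dest - src wraps to 0xFFFF...FFFC -> not < 16, fast path (safe for a forward copy)
    //
    // Rearranging to (nuint)dest < (nuint)srcEnd, i.e. dest < src + len, loses the
    // wraparound, so Case 2 is also sent to the slow path: every destination below
    // the source gets flagged, which is the regression described above.
    public static bool OriginalCheck(byte* dest, byte* src, nuint len)
        => (nuint)dest - (nuint)src < len;
}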

*(int*)(dest + i + 4) = *(int*)(src + i + 4);
*(int*)(dest + i + 8) = *(int*)(src + i + 8);
*(int*)(dest + i + 12) = *(int*)(src + i + 12);
*(int*)dest = *(int*)src;
(Member):

use long for BIT64

@vkvenkat changed the title from "Extending optimized JIT helpers to Buffer.MemoryCopy" to "Optimize Buffer.MemoryCopy" on Mar 1, 2017
@jkotas (Member) commented Mar 2, 2017

@dotnet-bot test Windows_NT x64 Release Priority 1 Build and Test please

@jkotas (Member) commented Mar 2, 2017

@vkvenkat Several tests are consistently failing. Could you please take a look?

CoreMangLib._cti_system_text_stringbuilder_StringBuilderReplace4_StringBuilderReplace4_._cti_system_text_stringbuilder_StringBuilderReplace4_StringBuilderReplace4_cmd
CoreMangLib._cti_system_version_VersionToString2_VersionToString2_._cti_system_version_VersionToString2_VersionToString2_cmd
CoreMangLib._cti_system_version_VersionToString1_VersionToString1_._cti_system_version_VersionToString1_VersionToString1_cmd

#if BIT64
*(long*)dest = *(long*)src;
*(long*)(dest + 8) = *(long*)(src + 8);
if (len <= 48) goto MCPY01;
@briansull, Mar 2, 2017:

-- EDIT: I see that the overrun issue is handled by using 'destEnd' and 'srcEnd'.

-- Original comment:
Shouldn't this be goto MCPY02?

Going to MCPY01 will result in copying 16 bytes unconditionally and returning, which will overrun the end of the dest buffer for values less than 48.

(Author):

As you explained below, MCPY01 copies the last 16 bytes of src to dest, and an overrun doesn't occur here. MCPY01 uses destEnd & srcEnd as opposed to dest & src in other places.

*(long*)dest = *(long*)src;
*(long*)(dest + 8) = *(long*)(src + 8);

MCPY01:
@briansull, Mar 2, 2017:

You should add a comment on each label describing what can be expected when we reach here:

MCPY01: // Unconditionally copy the last 16 bytes using destEnd and srcEnd and return.

(Author):

We have debug asserts after labels like Debug.Assert(len > 16 && len <= 64) after MCPY00. Do you prefer switching these to comments?

*(int*)dest = *(int*)src;
*(int*)(dest + 4) = *(int*)(src + 4);
*(int*)(destEnd - 8) = *(int*)(srcEnd - 8);
*(int*)(destEnd - 4) = *(int*)(srcEnd - 4);
@briansull, Mar 2, 2017:

You are using an interesting trick here to prevent the normal overrun that would occur when performing all 8 byte writes when len is not a multiple of 8.

You should explain how this trick works.

Also, MCPY01 and MCPY02 perform similar operations but have a very subtle difference in their input requirements.

I hope that the processor doesn't introduce a stall when len < 8 and we introduce an overlapping write.
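
For reference, the trick in isolation looks roughly like this (a hedged sketch, not the exact corelib code): the final writes are addressed from the end pointers, so for in-between lengths they overlap bytes that were already written instead of branching on the exact remainder, and they never run past destEnd.

static unsafe class TailWriteDemo
{
    // Copy 8..15 bytes with exactly two 8-byte writes.
    // Requires 8 <= len <= 15, destEnd == dest + len, srcEnd == src + len.
    public static void Copy8To15(byte* dest, byte* src, byte* destEnd, byte* srcEnd)
    {
        *(long*)dest = *(long*)src;                   // first 8 bytes
        *(long*)(destEnd - 8) = *(long*)(srcEnd - 8); // last 8 bytes; overlaps the first
                                                      // write by (16 - len) bytes but
                                                      // never writes past destEnd
    }
}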

(Author):

Sure, will do.

@jkotas (Member), Mar 3, 2017:

Hmm, this trick may need a full upfront overlap check to work ... it may explain the failure on Ubuntu.

(Author):

@jkotas Thanks. Investigating the test failures. Shall push a fix for this.

(Author):

Updated the overlap check to if (((nuint)dest - (nuint)src < len) || ((nuint)src - (nuint)dest < len)) goto PInvoke;. This should resolve test failures.

(Author):

@briansull Added comments after labels to explain copy logic.

@vkvenkat (Author) commented Mar 4, 2017

@jkotas Addressed latest CR feedback, fixed test failures & updated results. PTAL

@jkotas (Member) commented Mar 4, 2017

Looks great - thank you!

@jkotas merged commit c6372c5 into dotnet:master on Mar 4, 2017
kouvel added a commit to kouvel/coreclr that referenced this pull request Mar 7, 2017
Fixes #9161

PR dotnet#9786 fixes perf of span copy of types that don't contain references
kouvel added a commit to kouvel/coreclr that referenced this pull request Mar 8, 2017
Fixes #9161

PR dotnet#9786 fixes perf of span copy of types that don't contain references
kouvel added a commit that referenced this pull request Mar 9, 2017
Improve span copy of pointers and structs containing pointers

Fixes #9161

PR #9786 fixes perf of span copy of types that don't contain references
jorive pushed a commit to guhuro/coreclr that referenced this pull request May 4, 2017

Improve span copy of pointers and structs containing pointers (#9999)

Fixes #9161

PR dotnet#9786 fixes perf of span copy of types that don't contain references
sergiy-k added a commit to sergiy-k/corert that referenced this pull request May 8, 2017
sergiy-k added a commit to sergiy-k/corert that referenced this pull request May 9, 2017
sergiy-k added a commit to dotnet/corert that referenced this pull request May 9, 2017
Port optimization in Buffer.Memmove from CoreCLR

This is just a copy of changes made in dotnet/coreclr#9786

* Code review feedback
@karelz modified the milestone: 2.0.0 on Aug 28, 2017
@glenn-slayden:

I posted a comment for c6372c5 here, but I don't think that was the right place; I apologize for repeating the remarks here...


Just curious why all these improvements to Buffer.Memmove (plus more recent work) aren't deployed in the IL cpblk instruction instead? It appears that this commit c6372c5 removes a comment to the effect that cpblk is where all this effort naturally belongs (that is, in case the proliferation of conditional #if code here wasn't already somewhat suggestive), and I agree.

I know everyone wants there to be one central nexus for obtaining an optimal memmove, with proper alignment, chunking, tuning, and everything else done right; shouldn't cpblk be that place?

I also note that this commit removes all address alignment behavior. Now while it's true that the alignment code in ea9bee5 was a bit meek in only considering the destination address--at the possible expense of source alignment:

if (((int)dest & 3) != 0)
{
    if (((int)dest & 1) != 0)
    {
        *(dest + i) = *(src + i);
        i += 1;
        if (((int)dest & 2) != 0)
            goto IntAligned;
    }
    *(short*)(dest + i) = *(short*)(src + i);
    i += 2;
}

IntAligned:

#if BIT64
// On 64-bit IntPtr.Size == 8, so we want to advance to the next 8-aligned address. If
// (int)dest % 8 is 0, 5, 6, or 7, we will already have advanced by 0, 3, 2, or 1
// bytes to the next aligned address (respectively), so do nothing. On the other hand,
// if it is 1, 2, 3, or 4 we will want to copy-and-advance another 4 bytes until
// we're aligned.
// The thing 1, 2, 3, and 4 have in common that the others don't is that if you
// subtract one from them, their 3rd lsb will not be set. Hence, the below check.
if ((((int)dest - 1) & 4) == 0)
{
    *(int*)(dest + i) = *(int*)(src + i);
    i += 4;
}
#endif // BIT64

...at least this was something, and a reasonably motivated approach, since misaligned store penalties are typically higher than for reads.

But in this commit there's neither. Did the experiments in #9786 empirically determine that physically aligning these operations is always a waste of time? For severe misalignment between source/target, it would be possible to write a split-access loop, albeit complex, that marshals quad-aligned reads to quad-aligned writes. Was this type of thing evaluated before taking the decision to abandon all efforts at alignment?

Finally, note that by insisting on alignment to the destination address exclusively, the alignment code (that was removed in this commit, see above) may have been ruining a naturally ideal source alignment, with the overall result of making things worse. This could have unfairly penalized test results for the existing code, when it would be a trivial matter to only do that adjustment when src/dest are jointly aligned to some appropriate level, e.g., wrap the above code with:

if ((((int)src ^ (int)dest) & 0x7) == 0)        // (or some other alignment value)
{
    // ...
}

As far as I can tell, there has never been code in buffer.cs that considers the source and destination physical alignment jointly, in order to then proceed sensibly on the basis of 3 broadly defined cases (a rough sketch follows the list):

  • Input src/dst physical addresses already jointly aligned (or len less than max align size);
  • Joint alignment can be improved (to some degree) prior to starting the gang-operation (code removed in c6372c5)
  • Perverse src/dst alignment detected; proceed as-is, or, if worthwhile, use split-access loop to fully remediate.
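
A rough sketch of that classification, assuming an 8-byte target alignment (the names and the three-way split are illustrative, not from the PR):

static unsafe class AlignmentDemo
{
    public enum JointAlignment
    {
        AlreadyAligned,     // both pointers already 8-byte aligned
        CanBeJointlyFixed,  // same offset within an 8-byte word: one prologue aligns both
        Perverse            // different offsets: aligning one necessarily misaligns the other
    }

    public static JointAlignment Classify(byte* src, byte* dest)
    {
        int srcMod  = (int)src  & 7;
        int destMod = (int)dest & 7;

        if (srcMod == 0 && destMod == 0)
            return JointAlignment.AlreadyAligned;

        // Equivalent to the (((int)src ^ (int)dest) & 0x7) == 0 test quoted above.
        if (srcMod == destMod)
            return JointAlignment.CanBeJointlyFixed;

        return JointAlignment.Perverse;
    }
}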

@glenn-slayden:

Commit c6372c5 makes a change to the detection of overlap between the source and destination in Buffer.Memmove:

if ((nuint)dest - (nuint)src < len) goto PInvoke;

...becomes...

if (((nuint)dest - (nuint)src < len) || ((nuint)src - (nuint)dest < len)) goto PInvoke;

I must admit I puzzled over the original code since the unsigned subtraction and compare "incorrectly" fails to identify source/destination overlap in the case where the destination (address) precedes the source (address) in memory. Indeed c6372c5 clarifies this and would also seem to fix a bug, but how did the original work anyway?

Then I realized of course that the answer is that the original code is stealthily incorporating a strong assumption about--and dependency upon--the copy direction of the managed Buffer.Memmove implementation which follows it.

The unfortunate upshot is that, while c6372c5 certainly makes the code clearer and more readable, it now excludes all overlap cases and thus no longer allows the performance benefit of fully-managed operation (i.e., non-P/Invoke) in the 50% of cases where the overlap happens to be compatible with the copy direction implemented in Buffer.cs. I blame the original code for not documenting its entirely non-obvious design and/or operation.

Anyway, can we revisit this code to restore the optimization that was lost here? Namely, once again allow the managed code path to prevail (still subject to overall size considerations, etc.) instead of P/Invoke for the half of the detected overlaps which comport with the copy direction that's hard-coded into Buffer.Memmove?
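
To spell out the direction dependence, a minimal sketch (illustrative only; the real fast path copies in larger blocks, but the same ordering argument applies):

static unsafe class ForwardCopyDemo
{
    // When dest < src, dest + i < src + i for every i, so each source byte is read
    // before any write from this loop can reach it: the overlap is benign. When dest
    // lies inside [src, src + len), the same ascending loop would overwrite source
    // bytes before reading them, which is the half of the overlap cases that must go
    // to the native memmove.
    public static void ForwardCopy(byte* dest, byte* src, int len)
    {
        for (int i = 0; i < len; i++)
            dest[i] = src[i];
    }
}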
