Increasing utilization - gdrcopy_copybw #288

Open

GuyZilberman opened this issue Dec 25, 2023 · 3 comments

Comments

@GuyZilberman

Hi,

I am running the gdrcopy_copybw benchmark on an NVIDIA A100 80GB PCIe with a Gen 4 PCIe link.
It appears that utilization doesn't reach its maximum possible value: I get about 20 GB/s out of a possible 32 GB/s for buffer sizes from 32 KB to 8 MB.
Looking into the code, your implementation of memcpy_uncached_store_avx uses the 256-bit intrinsics _mm256_load_pd and _mm256_stream_pd.
What is the reason for that? Is there a reason for not using the 512-bit intrinsics _mm512_load_pd and _mm512_stream_pd instead? Could using them increase the utilization?
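For concreteness, the variant we have in mind would look roughly like this (a minimal sketch, assuming AVX-512F support, 64-byte-aligned buffers, and a size that is a multiple of 64 bytes; the actual memcpy_uncached_store_avx in gdrcopy also handles unaligned heads and tails):

```c
#include <immintrin.h>
#include <stddef.h>

/* Sketch only: assumes AVX-512F, dest/src 64-byte aligned,
 * and n_bytes a multiple of 64. Build with -mavx512f. */
static void memcpy_uncached_store_avx512(void *dest, const void *src, size_t n_bytes)
{
    double       *d = (double *)dest;
    const double *s = (const double *)src;
    size_t        n = n_bytes / sizeof(double);

    /* Move 8 doubles (64 bytes) per iteration instead of 4 (32 bytes). */
    for (size_t i = 0; i < n; i += 8) {
        __m512d v = _mm512_load_pd(s + i);  /* aligned 64-byte load          */
        _mm512_stream_pd(d + i, v);         /* non-temporal (uncached) store */
    }
    _mm_sfence();  /* make the streaming stores globally visible */
}
```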

Thanks!

@Eshcar

Eshcar commented Dec 27, 2023

@pakmarkthub @drossetti can one of you please take a look at this question?
We are thinking of implementing a 512-bit copy ourselves (and potentially contributing it to the repository), but we want to make sure this is not a waste of time.
Have you considered supporting _mm512_load_pd in the past and rejected it for some reason?
Appreciate your help,
Eshcar

@drossetti
Member

> It appears that utilization doesn't reach its maximum possible value: I get about 20 GB/s out of a possible 32 GB/s for buffer sizes from 32 KB to 8 MB.

This question has been asked multiple times over the years, see for example Pak's comments in #286.
The short story is that CPU cores are not designed as perfect DMA controllers and can only generate small PCIe write packets, typically up to 64B, rather than the 256B maximum supported by the GPU.

Regarding _mm512: in the past I tried AVX2 without any improvement. That is architecturally understandable, as the cache-line bandwidth between a CPU core and its L1 is already saturated using finer-grained loads.

That being said, there might be architectural innovations in newer CPUs, so a quick check would not hurt.

@GuyZilberman
Author

Thank you for your response!

We will give it a try and let you know if we see any improvement.

Are any of the previously attempted implementations accessible anywhere for reference?
