It takes more than 100ms to issue a single command to Intel Arc GPU #386
Comments
The GPU-to-GPU measurement here is synchronized, so it includes host-side overhead. If you want pure GPU performance data, please use a profiler tool to exclude the host computation runtime impact. The tool will show you the host latency (kernel submission) and the asynchronous computation latency on the GPU.
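The comment references a profiler run but the example was not preserved. A minimal sketch of what such a run could look like, assuming `torch.profiler` (the code degrades gracefully when PyTorch is not installed, and profiles only CPU activity so it runs on any machine; the XPU activity name may differ across IPEX versions):

```python
# Hedged sketch: use torch.profiler to separate host submission latency
# from device compute time. Only CPU activity is profiled here so the
# example runs without a GPU; on an XPU build you would add the device
# activity as well (exact enum name is version-dependent).
try:
    import torch
    import torch.profiler
    HAVE_TORCH = True
except ImportError:
    HAVE_TORCH = False

def profile_copy():
    """Profile a single-element tensor copy and return the summary table."""
    if not HAVE_TORCH:
        return None
    x = torch.randn(1)
    with torch.profiler.profile(
        activities=[torch.profiler.ProfilerActivity.CPU],
    ) as prof:
        x.clone()  # the operation under measurement
    # key_averages() aggregates per-op stats; the table shows where
    # host time (kernel submission) is actually spent.
    return prof.key_averages().table(sort_by="cpu_time_total")

table = profile_copy()
```

On an actual Arc GPU build, the same table would distinguish the host-side submission cost from the asynchronous kernel execution time, which is the distinction the comment above is making.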
@arthuryuan1987
I guess your build might not be an AOT build, which brings runtime kernel JIT overhead. (The AOT build of NVCC is on by default.) You may warm up the clone kernel first, like:
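The warm-up example the comment alludes to was not preserved. A hedged sketch of the idea, assuming the `"xpu"` device name provided by IPEX (the code falls back to CPU so it runs anywhere; the first call pays any JIT compilation cost, the second measures steady state):

```python
# Hedged sketch: warm up a kernel so one-time JIT compilation cost is
# paid before timing. "xpu" assumes intel_extension_for_pytorch is
# installed; otherwise we fall back to CPU just to keep the sketch runnable.
import time

try:
    import torch
    try:
        import intel_extension_for_pytorch  # noqa: F401  (registers "xpu")
    except ImportError:
        pass
    DEVICE = "xpu" if hasattr(torch, "xpu") and torch.xpu.is_available() else "cpu"
except ImportError:
    torch = None
    DEVICE = None

def timed_clone():
    """Return steady-state clone latency in seconds, or None without torch."""
    if torch is None:
        return None
    x = torch.randn(1, device=DEVICE)
    x.clone()  # warm-up: triggers kernel JIT/compilation if any
    if DEVICE == "xpu":
        torch.xpu.synchronize()  # wait for async work before timing
    t0 = time.perf_counter()
    x.clone()  # steady-state measurement
    if DEVICE == "xpu":
        torch.xpu.synchronize()
    return time.perf_counter() - t0
```

If the 0.142 s figure drops sharply after a warm-up call, JIT compilation of the kernel (rather than the copy itself) was the dominant cost.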
Describe the issue
Printing a float32 takes 1340 us in IPEX. This is fine.
However, transferring a single float32 number takes 0.142 s on an Intel Arc A770 16 GB. Why does this take so long? That works out to a GPU-to-GPU transfer rate of 224.56 bit/s for one float32.
For reference, an RTX 3090 takes 0.000359 s to transfer a single float32 number.
PyTorch takes 0.142 s to issue 1 command on Intel Arc A770 16 GB
PyTorch takes 0.000359 s to issue 1 command on RTX 3090
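Numbers like these are sensitive to how the timing is done: averaging over many iterations after discarding warm-up runs avoids attributing one-time JIT or caching costs to every call. A minimal, stdlib-only timing helper illustrating that methodology (the PyTorch usage in the comment is an assumption; on a device tensor you would also synchronize before and after the timed loop):

```python
# Hedged sketch: average latency over many iterations, discarding warm-up
# runs where one-time costs (JIT compilation, allocator caching) land.
import time

def time_op(fn, warmup=3, iters=10):
    """Return mean wall-clock seconds per call of fn, after warm-up."""
    for _ in range(warmup):
        fn()  # not timed: absorbs one-time setup costs
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - t0) / iters

# Usage on a device tensor would be e.g. time_op(lambda: x.clone()),
# with a device synchronize around the timed region. CPU stand-in:
elapsed = time_op(lambda: sum(range(1000)))
```

A single cold measurement, by contrast, is exactly where a missing AOT build or an unwarmed kernel would show up as a 100 ms+ outlier.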
clinfo