
It takes more than 100ms to issue a single command to Intel Arc GPU #386

Closed
BA8F0D39 opened this issue Jul 4, 2023 · 3 comments
BA8F0D39 commented Jul 4, 2023

Describe the issue

Printing a float32 takes 1340 µs in IPEX, which is fine.
However, transferring a single float32 number takes 0.142 s on an Intel Arc A770 16 GB. Why does this take so long? For one float32 (32 bits), that works out to an effective transfer rate of roughly 225 bit/s.

For reference, an RTX 3090 takes 0.000359 s to transfer a single float32 number.

import time
import torch
import torchvision.models as models

import numpy as np
import intel_extension_for_pytorch as ipex

torch.manual_seed(0)


x = torch.rand(1, 1, dtype=torch.float32, device='xpu')

torch.xpu.synchronize()
start = time.time()
print(x.cpu())
end = time.time()

print("Print Time in Seconds: %.20f " % (end - start))







torch.manual_seed(2)

x = torch.rand(1, 1, dtype=torch.float32, device='xpu')
y = torch.rand(1, 1, dtype=torch.float32, device='xpu')

torch.xpu.synchronize()
start = time.time()
y = x.clone()
print(y.cpu())
end = time.time()

print("Data Transfer Time in Seconds: %.20f " % (end - start))

PyTorch takes 0.142 s to issue one command on the Intel Arc A770 16 GB:

tensor([[0.9179]])
Print Time in Seconds: 0.00134086608886718750 
tensor([[0.9696]])
Data Transfer Time in Seconds: 0.14255475997924804688

PyTorch takes 0.000359 s to issue one command on the RTX 3090:

tensor([[0.3990]])                                                          
Print Time in Seconds: 0.00103116035461425781                               
tensor([[0.4254]])                                                               
Data Transfer Time in Seconds: 0.00035905838012695312  

clinfo

Platform: Intel(R) OpenCL HD Graphics
  Device: Intel(R) Arc(TM) A770 Graphics
    Driver version  : 23.05.25593.18 (Linux x64)
    Compute units   : 512
    Clock frequency : 2400 MHz

    Global memory bandwidth (GBPS)
      float   : 391.40
      float2  : 403.59
      float4  : 406.54
      float8  : 418.51
      float16 : 422.83

    Single-precision compute (GFLOPS)
clCreateBuffer (-61)
      Tests skipped

    Half-precision compute (GFLOPS)
      half   : 19570.87
      half2  : 19509.20
      half4  : 19540.56
      half8  : 19455.51
      half16 : 19330.61

    No double precision support! Skipped

    Integer compute (GIOPS)
clCreateBuffer (-61)
      Tests skipped

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 17.11
      enqueueReadBuffer          : 8.06
      enqueueMapBuffer(for read) : 19.89
        memcpy from mapped ptr   : 22.40
      enqueueUnmap(after write)  : 23.53
        memcpy to mapped ptr     : 22.48

    Kernel launch latency : 6.07 us
@fengyuan14

The GPU-to-GPU measurement here is synchronous. If you want GPU-side performance data, please use the profiler tool to exclude host-side runtime overhead, e.g.:

with torch.autograd.profiler_legacy.profile() as prof:
    B = torch.clone(A)
print(prof.key_averages().table(sort_by="self_xpu_time_total"))

The tool will show you the host latency (kernel submission) and the asynchronous computation latency on the GPU.
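The distinction between first-call and steady-state cost can be illustrated without a GPU. The sketch below is pure Python, with a `time.sleep` standing in for a hypothetical one-time kernel JIT compilation (this is an analogy, not the real IPEX runtime); it shows why a single synchronous timing of the first call conflates setup cost with the operation itself:

```python
import time

# Illustrative stand-in for a GPU op whose first invocation pays a
# one-time JIT-compilation cost (hypothetical, for demonstration only).
_compiled = {}

def op(key):
    if key not in _compiled:       # first call: simulate kernel JIT
        time.sleep(0.05)
        _compiled[key] = True
    return 42                      # the computation itself is cheap

start = time.perf_counter()
op("clone")
first = time.perf_counter() - start    # includes the one-time setup

start = time.perf_counter()
op("clone")
later = time.perf_counter() - start    # steady-state cost only

print(f"first call: {first:.4f} s")
print(f"later call: {later:.6f} s")
```

The first call dominates because of the simulated compile step; only the second number reflects the per-command cost the benchmark intends to measure.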

BA8F0D39 commented Jul 6, 2023

@arthuryuan1987
Why is the Intel Arc A770 ~395x slower than the RTX 3090 for the exact same PyTorch code?

@fengyuan14

I suspect your build might not be an AOT (ahead-of-time) build, which adds runtime kernel JIT overhead; for NVCC, AOT compilation is on by default. You may warm up the clone kernel first, e.g.:

B = torch.clone(A)  # warm up (triggers any JIT compilation)
with torch.autograd.profiler_legacy.profile() as prof:
    B = torch.clone(A)
print(prof.key_averages().table(sort_by="self_xpu_time_total"))
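The warm-up-then-measure pattern generalizes to a small benchmarking helper. The sketch below is plain Python with a trivial stand-in workload (the `bench` helper and the lambda are illustrative, not part of IPEX); on an XPU build you would additionally call `torch.xpu.synchronize()` before starting and before stopping the clock so that queued device work is counted:

```python
import time

def bench(fn, warmup=3, iters=100):
    """Measure fn's steady-state latency: run warm-up iterations first
    to absorb one-time JIT/caching costs, then average many timed runs."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters

# Example with a trivial CPU stand-in for the real workload:
avg = bench(lambda: sum(range(1000)))
print(f"avg latency: {avg * 1e6:.2f} us")
```

Averaging over many iterations also smooths out scheduler jitter, which a single `time.time()` pair around one call cannot do.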
