Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix performance of indexing unified memory. #2340

Merged
merged 1 commit into from
Apr 24, 2024

Conversation

maleadt
Copy link
Member

@maleadt maleadt commented Apr 24, 2024

Part of the goal of #2335 was to improve the performance of deriving a CPU pointer from a unified memory object, i.e., to speed up indexing of CuArrays on the CPU. I did a typo resulting in actually regressing that performance, so this brings it back to the expected improvement.

Before #2335:

julia> a = [1]
1-element Vector{Int64}:
 1

julia> @benchmark $a[]
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min  max):  1.879 ns  6.970 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     1.910 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.911 ns ± 0.100 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

   ▃             ▄             ▇           ▁ █            ▆ ▂
  ▅█▁▁▁▁▁▁▁▁▁▁▁▆▁█▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁█▁█▁▁▁▁▁▁▁▁▁▁▁██ █
  1.88 ns     Histogram: log(frequency) by time     1.92 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> b = cu(a; unified=true)
1-element CuArray{Int64, 1, CUDA.Mem.UnifiedBuffer}:
 1

julia> @benchmark $b[]
BenchmarkTools.Trial: 10000 samples with 999 evaluations.
 Range (min  max):  10.349 ns  16.617 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     10.681 ns              ┊ GC (median):    0.00%
 Time  (mean ± σ):   10.685 ns ±  0.169 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

                                      ▃▇█ █▇▇▅ ▄▂▁
  ▂▂▂▂▁▂▂▂▂▁▂▂▂▂▁▂▂▂▂▁▂▂▃▂▁▂▂▃▄▁▅▅▄▅▁▇███▁████▁███▇▁▆▅▅▄▁▃▃▃▂ ▄
  10.3 ns         Histogram: frequency by time        10.8 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

After #2335 + this PR:

julia> @benchmark $b[]
BenchmarkTools.Trial: 10000 samples with 1000 evaluations.
 Range (min  max):  3.869 ns  9.750 ns  ┊ GC (min  max): 0.00%  0.00%
 Time  (median):     3.950 ns             ┊ GC (median):    0.00%
 Time  (mean ± σ):   3.954 ns ± 0.103 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

  ▃   ▁   ▂   ▁   ▃   ▂   ▆   ▆   ▇   █   ▂   ▂   ▁   ▄   ▃ ▂
  █▁▁▁█▁▁▁█▁▁▁█▁▁▁█▁▁▁█▁▁▁█▁▁▁█▁▁▁█▁▁▁█▁▁▁█▁▁▁█▁▁▁█▁▁▁█▁▁▁█ █
  3.87 ns     Histogram: log(frequency) by time     4.01 ns <

 Memory estimate: 0 bytes, allocs estimate: 0.

This starts to make it possible, I think, to actually use CPU functionality with unified CuArrays.

@maleadt maleadt added cuda array Stuff about CuArray. bugfix This gets something working again. performance How fast can we go? labels Apr 24, 2024
@maleadt maleadt merged commit cc4980c into master Apr 24, 2024
1 check failed
@maleadt maleadt deleted the tb/unified_indexing_performance branch April 24, 2024 12:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bugfix This gets something working again. cuda array Stuff about CuArray. performance How fast can we go?
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

1 participant