Make StridedReinterpretArray's get/setindex pointer based. #44186

Merged
merged 9 commits into JuliaLang:master from the Contigious branch
Nov 8, 2023

Conversation

N5N3
Member

@N5N3 N5N3 commented Feb 15, 2022

This PR makes StridedReinterpretArray's get/setindex purely pointer based if its root parent is an Array.
Thus a "dense" ReinterpretArray should behave more like an Array.
Some examples with better performance:

julia> a = randn(ComplexF64, 100, 100); b = randn(100); c = a * b;
julia> aa = reinterpret(Float64, a); cc = reinterpret(Float64, c);
julia> @btime LinearAlgebra.generic_matvecmul!($cc, 'N', $aa, $b, LinearAlgebra.MulAddMul(true,false));
  2.544 μs (0 allocations: 0 bytes)  # on master 116.900 μs (0 allocations: 0 bytes)
julia> f(x, y) = @inbounds @simd ivdep for i in eachindex(x,y) # ivdep here is useless on master
       x[i] = ntoh(y[i])
       end

julia> a = reinterpret(Float64, rand(UInt8, 16000));

julia> @btime f($a, $a)
  124.388 ns (0 allocations: 0 bytes) # on master 697.260 ns (0 allocations: 0 bytes)

Tests have been extended so that all branches should be covered.
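
For reference, here is a minimal sketch of the pointer-based access path this PR enables. It is illustrative only; `ptr_getindex` is a hypothetical helper, not the PR's actual code, and it assumes the root parent is a dense Array:

# Illustrative sketch, not the PR's implementation: a raw pointer load
# replaces the field-by-field reconstruction of the reinterpreted element.
function ptr_getindex(a::Base.ReinterpretArray{T}, i::Int) where {T}
    p = Ptr{T}(pointer(parent(a)))    # reuse the parent Array's data pointer
    GC.@preserve a unsafe_load(p, i)  # keep the array rooted while loading element i
end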

@N5N3 N5N3 added the domain:arrays and performance (Must go faster) labels Feb 15, 2022
@jishnub
Contributor

jishnub commented Apr 29, 2022

@N5N3 this PR seems to provide a significant performance boost for reinterpreted arrays. If you're happy with this, would you mind recommending some reviewers so that it might move ahead?

Ref: https://discourse.julialang.org/t/reinterpretedarray-performance-even-worse-on-1-8/80102/27

if its root parent isa `Array` and it is dense like.
Also add missing `pointer` for `FasterContiguousSubArray`
Since `getindex`/`setindex!` might be strided-based, we use `WrapperArray{T,N,<:Array}` to make sure the general fallback is tested thoroughly.
@N5N3
Member Author

N5N3 commented Apr 29, 2022

This PR is mainly for the performance of a reinterpreted (contiguous) buffer.
That should be a common usage, so I think a specific optimization is OK.
The main concern might be safety. I hope @vtjnash has time to give some suggestions.

@jishnub
Contributor

jishnub commented Nov 16, 2022

Gentle bump

@ronisbr
Sponsor Member

ronisbr commented Dec 16, 2022

One possible use case that might benefit from this PR:

https://discourse.julialang.org/t/slowdown-with-reinterpret/91749

@j-fu

j-fu commented Dec 16, 2022

In fact, reinterpret helps a lot when different approaches to handling e.g. coordinate arrays for discretization grids need to be reconciled. E.g. mesh generators written in C/C++ may return a 3xn matrix of coordinates, and this can be readily reinterpreted as e.g. a Vector{Point3}, and vice versa. So I think the use case is not very exotic.
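
For concreteness, a tiny sketch of that round trip; the Point3 struct here is just an illustrative stand-in (in practice one might use StaticArrays.SVector{3,Float64}):

# Hypothetical 3-component point type for illustration
struct Point3
    x::Float64
    y::Float64
    z::Float64
end

coords = rand(3, 5)                            # 3xn coordinate matrix, e.g. from a C/C++ mesh generator
points = reinterpret(reshape, Point3, coords)  # view the columns as Point3 values, no copy
back   = reinterpret(reshape, Float64, points) # and back to a 3xn matrix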

@StefanKarpinski StefanKarpinski added the status:triage This should be discussed on a triage call label Dec 21, 2022
@gbaraldi
Member

@nanosoldier runtests()

@nanosoldier
Collaborator

Your package evaluation job has completed - possible new issues were detected. A full report can be found here.

@Moelf
Sponsor Contributor

Moelf commented Oct 10, 2023

bump, this helps greatly with some of the long-standing performance issues w.r.t. StridedReinterpretArray

@vtjnash vtjnash added status:merge me PR is reviewed. Merge when all tests are passing and removed status:triage This should be discussed on a triage call status:forget me not PRs that one wants to make sure aren't forgotten labels Nov 1, 2023
@vtjnash
Sponsor Member

vtjnash commented Nov 1, 2023

I removed the unsafe_convert method added here, since it does not seem correct vis-à-vis 0a0bd00. Let's see how CI is doing.

Review comment on test/reinterpretarray.jl (outdated, resolved)
@N5N3 N5N3 removed the status:merge me PR is reviewed. Merge when all tests are passing label Nov 2, 2023
@N5N3
Member Author

N5N3 commented Nov 2, 2023

Unfortunately, I can confirm that this PR blocks vectorization in some simple cases.

julia> typeof(aa)
Base.ReinterpretArray{Int64, 2, Float64, Matrix{Float64}, false}

julia> @btime fill!($aa, 1);
  1.590 μs (0 allocations: 0 bytes)

julia> Base.check_store(::Array) = false

julia> @btime fill!($aa, 1);
  1.010 μs (0 allocations: 0 bytes)

@N5N3 N5N3 marked this pull request as draft November 2, 2023 17:18
@vtjnash
Sponsor Member

vtjnash commented Nov 2, 2023

Ah okay. Is that because this intercepts at sort of a bad point, so rather than doing all-native load/store/bitcast for simple cases like Int64<->Float64, those now always go through our unsafe load/store primitives, which are only supposed to be for C compatibility, and thus lose out on the efficiency benefits of the native primitives?

@N5N3
Member Author

N5N3 commented Nov 2, 2023

I thought LLVM might prefer unsafe_store!(p, v, li) over unsafe_store!(p + sizeof(T) * (li - 1), v),
but the "regression" is not fixed.
unsafe_store!(p, v, li) is fully intrinsic based, and it does make the LLVM IR simpler, but LLVM still refuses to generate a vectorized block.
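
Both formulations are standard Base pointer APIs; a toy comparison (the scratch buffer is allocated here purely for illustration):

p = Ptr{Float64}(Libc.malloc(4 * sizeof(Float64)))  # scratch buffer for illustration
unsafe_store!(p, 1.0, 3)                            # element-index form, offset handled by the intrinsic
unsafe_store!(p + sizeof(Float64) * (3 - 1), 1.0)   # equivalent explicit byte-offset form
Libc.free(p)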

@vtjnash
Sponsor Member

vtjnash commented Nov 2, 2023

I think there are roughly two cases here:

  • isprimitivetype for both source and dest
  • only isbitstype

For isprimitivetype, it looks like we might only be using native operations currently, and those end up in LLVM as a load/store into registers. That seems to be the case where it should already be capable of emitting optimal code, so this change can only make it worse by using unsafe operations, which require the compiler to destroy useful information (e.g. vectorizability).

For everything else, it looks like we always use a native load from the array too, but in those cases we know it will turn into a memcpy to a stack slot, then a move from that stack slot to a Ref, then an unsafe_load of that Ref (which we know will also turn into a memcpy to yet another stack slot), and finally we return that, which should do one final copy from the stack slot to the sret. After inlining and optimizations, most of those copies should go away (they are the same copies we would expect if this were not a reinterpret but merely a copy via Ref(x)[], so this should not be too unusual IR).
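
A rough sketch of that Ref-based round trip for the non-primitive isbits case; the helper name bitcast_via_ref is made up for illustration:

# Illustrative only: move the source bits into a Ref, then reread them
# through a pointer of the destination type.
function bitcast_via_ref(::Type{T}, x::S) where {T,S}
    @assert isbitstype(T) && isbitstype(S) && sizeof(T) == sizeof(S)
    r = Ref(x)                                  # slot holding the source bits
    GC.@preserve r begin
        p = Base.unsafe_convert(Ptr{S}, r)      # pointer to those bits
        unsafe_load(Ptr{T}(p))                  # reread the same bits as a T
    end
end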

@N5N3
Member Author

N5N3 commented Nov 3, 2023

Tried some bisecting locally.
This PR still works fine without #51319, so this looks like a fresh "regression".

@vtjnash
Sponsor Member

vtjnash commented Nov 3, 2023

Sure, but we know that doing things with unsafe_ operations is expected to perform badly. The advantage of this PR seems to be that, since it was already using the bad unsafe_ operations in many cases, there are sometimes ways to make those more direct (with unsafe_load) instead of waiting for LLVM to transform them into something roughly equivalent.

@N5N3
Member Author

N5N3 commented Nov 7, 2023

Tried playing with LLVM a bit. It looks like LV (LLVM's loop vectorizer) dislikes the GC-preserved MemoryRef.
Fortunately, Memory stores a Ptr directly and is vectorizable after reinterpretation. If vectorization does matter, users can transform a reinterpreted Array into a reinterpreted + viewed/reshaped Memory to get full speed.
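
A hedged sketch of that workaround, assuming the Memory-backed Array layout on current master where the backing Memory is reachable via the internal field a.ref.mem:

a  = rand(Float64, 4, 4)
m  = a.ref.mem                                 # backing Memory{Float64} (internal-layout assumption)
mi = reinterpret(Int64, m)                     # reinterpret the Memory directly
ai = reshape(view(mi, 1:length(a)), size(a))   # viewed/reshaped back into the original shape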

Also turn off ptr-indexing for `Array`-based storage when the element size stays the same.
@N5N3 N5N3 marked this pull request as ready for review November 7, 2023 10:54
@vtjnash
Sponsor Member

vtjnash commented Nov 7, 2023

We should probably now define that parent(::Array) = getfield(array.ref.mem), like all the other types. But a viewed/reshaped Memory typically is an Array. Although maybe you meant you also needed it to be a non-mutable object?

@N5N3
Member Author

N5N3 commented Nov 7, 2023

Although maybe you meant you also needed it to be a non-mutable object?

Yes, I meant Base.SubArray and Base.ReshapedArray.

@N5N3 N5N3 merged commit 1972432 into JuliaLang:master Nov 8, 2023
5 of 7 checks passed
@N5N3 N5N3 deleted the Contigious branch November 8, 2023 15:36
vtjnash added a commit that referenced this pull request Nov 9, 2023