
replace Base.tanh with faster tanh #1272

Closed

Conversation

DhairyaLGandhi
Member

This is a speculative change, based on some recent discussions, to help speed up common tasks in Flux; the tanh from SLEEFPirates was found to make a significant difference. This PR is to discuss the viability of making this the default in Flux and any considerations we would have in doing so.

cc @ChrisRackauckas

PR Checklist

  • Tests are added
  • Entry in NEWS.md
  • Documentation, if applicable
  • Final review from @dhairyagandhi96 (for API changes).

@DhairyaLGandhi
Member Author

SLEEFPirates doesn't seem to do much on its own on the machine I tested with; the performance boost comes from @avx:

julia> d1 = Dense(128, 50, Base.tanh)
Dense(128, 50, tanh)

julia> d2 = Dense(128, 50, SLEEFPirates.tanh)
Dense(128, 50, tanh)

julia> x2 = rand(Float32, 128, 500);

julia> d1(x2);

julia> d2(x2);

julia> @btime $d1($x2);
  1.011 ms (4 allocations: 195.53 KiB)

julia> @btime $d2($x2);
  1.325 ms (4 allocations: 195.53 KiB)

julia> struct D3{F,T,S}
         σ::F
         W::T
         b::S
       end

julia> (c::D3)(x::AbstractArray) = @avx c.σ.(c.W * x .+ c.b)

julia> d3 = D3(Base.tanh, d2.W, d2.b);

julia> d3(x2);

julia> @btime $d3($x2);
  237.393 μs (4 allocations: 195.53 KiB)

julia> d4 = D3(SLEEFPirates.tanh, d2.W, d2.b);

julia> @btime $d4($x2);
  236.328 μs (4 allocations: 195.53 KiB)

Running with GPU arrays, however, would need special handling:

julia> gd3 = gpu(d3);

julia> gx2 = gpu(x2);

julia> gd3(gx2);
ERROR: MethodError: no method matching VectorizationBase.PackedStridedPointer(::CUDAdrv.CuPtr{Float32}, ::Tuple{Int64})
Closest candidates are:
  VectorizationBase.PackedStridedPointer(::Ptr{T}, ::Tuple{Vararg{Int64,N}}) where {T, N} at /home/dhairyagandhi96/.julia/packages/VectorizationBase/jUlXp/src/vectorizable.jl:313
Stacktrace:
 [1] stridedpointer_for_broadcast at /home/dhairyagandhi96/.julia/packages/VectorizationBase/jUlXp/src/vectorizable.jl:794 [inlined]
 [2] macro expansion at /home/dhairyagandhi96/.julia/packages/LoopVectorization/anLHu/src/broadcast.jl:247 [inlined]
 [3] vmaterialize! at /home/dhairyagandhi96/.julia/packages/LoopVectorization/anLHu/src/broadcast.jl:247 [inlined]
 [4] vmaterialize at /home/dhairyagandhi96/.julia/packages/LoopVectorization/anLHu/src/broadcast.jl:317 [inlined]
 [5] (::D3{typeof(tanh),CuArrays.CuArray{Float32,2,Nothing},CuArrays.CuArray{Float32,1,Nothing}})(::CuArrays.CuArray{Float32,2,Nothing}) at ./REPL[169]:1
 [6] top-level scope at REPL[186]:1

@ChrisRackauckas
Member

Yeah, the dispatch with @avx would be CPU-only, so GPUs would need a separate dispatch. I didn't know that @avx would SIMD through that broadcast well, but it looks like it does. It must just clump the broadcast and do the whole computation of sigma 4 at a time? If so, it probably fails if the activation function isn't a plain Julia function, so it might need to be careful and check whether things are on an approved list?

@chriselrod

@DhairyaLGandhi
Member Author

Could LV make the GPU passes no-ops? I would rather not lose the generic nature of the code, especially since it's diminishing returns for the majority of models, where the batch size is usually not that large.

@ChrisRackauckas
Member

You could just do typeof(x) <: CuArray ? sigma.(x) : @avx sigma.(x)
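
The same idea expressed as method dispatch rather than a runtime branch, as a minimal sketch (the helper name is hypothetical, not Flux code):

using LoopVectorization
using CuArrays  # the CUDA array package at the time of this thread

apply_activation(σ, x::AbstractArray) = @avx σ.(x)  # CPU path: vectorized broadcast
apply_activation(σ, x::CuArray) = σ.(x)             # GPU path: plain fused broadcast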

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Jul 10, 2020

If LV did that (even if only as a wrapper macro), it could be used by downstream packages without adding special handling for different accelerators. I tested it with Colors, e.g. an AbstractArray{RGB} (as an example), and it worked, but we would want to keep supporting that. It errored with more common activations like relu (and, I suspect, with element-wise defined functions in general).

I would also be interested in seeing how it would interact with more general use cases like Flux.Zeros, which switches the bias off in a layer.

Just listing things we should have clarity on in order to understand the tradeoffs.

@ChrisRackauckas
Member

I think it would be much harder for LV to do that, since it's a macro: it runs at parse time, not compile time, and doesn't have type information. It could add type checks for everything that shows up in the expression, but that seems difficult to get right.

@AStupidBear
Contributor

@DhairyaLGandhi Isn't this already implemented in FluxML/NNlib.jl#199?

@chriselrod

I think it would be much harder for LV to do that since it's a macro, so it is done at parse time and not compile time and doesn't have type information. It could add type checks for everything that shows up in the expression or something, but that seems difficult to get right.

LoopVectorization.@avx does normally use type information. When applied to loops, it does use type checking with LoopVectorization.check_args to choose between using the original loop with @fastmath @inbounds vs calling the generated function LoopVectorization._avx_! to evaluate the loop:

julia> @macroexpand @avx for i in 1:3
          s += x[i]
       end
quote
    begin
    end
    if LoopVectorization.check_args(x)
        var"##vptr##_x" = LoopVectorization.stridedpointer(x)
        local var"##s_0"
        begin
            $(Expr(:gc_preserve, :(var"##s_0" = begin
          begin
              var"##Tloopeltype##" = eltype(x)
              var"##Wvecwidth##" = LoopVectorization.pick_vector_width_val(var"##Tloopeltype##")
          end
          LoopVectorization._avx_!(Val{(0, 0, 0, LoopVectorization.unwrap(var"##Wvecwidth##"))}(), Tuple{:LoopVectorization, :LOOPCONSTANTINSTRUCTION, LoopVectorization.OperationStruct(0x0000000000000000, 0x0000000000000000, 0x0000000000000001, 0x0000000000000000, LoopVectorization.constant, 0x00, 0x01), :LoopVectorization, :getindex, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000000, 0x0000000000000000, 0x0000000000000000, LoopVectorization.memload, 0x01, 0x02), :LoopVectorization, :vadd, LoopVectorization.OperationStruct(0x0000000000000001, 0x0000000000000001, 0x0000000000000000, 0x0000000000000102, LoopVectorization.compute, 0x00, 0x01)}, Tuple{LoopVectorization.ArrayRefStruct{:x,Symbol("##vptr##_x")}(0x0000000000000001, 0x0000000000000001, 0x0000000000000000)}, Tuple{0, Tuple{3}, Tuple{1}, Tuple{}, Tuple{}, Tuple{}, Tuple{}}, Tuple{:i}, (LoopVectorization.StaticUnitRange{1, 3}(),), var"##vptr##_x", s)
      end), :x))
        end
        s = LoopVectorization.reduced_add(var"##s_0", s)
    else
        $(Expr(:inbounds, true))
        local var"#19#val" = for i = 1:3
                    #= REPL[3]:2 =#
                    s = Base.FastMath.add_fast(s, x[i])
                end
        $(Expr(:inbounds, :pop))
        var"#19#val"
    end
end

The generated function of course has type information, but it also (currently) loses the original representation of the loops, and instead is only given a summary of what the loops did, meaning it wouldn't necessarily currently be able to reconstruct the original loops.
Using check_args seemed (a) simpler to implement at the time, and (b) provides a clear interface that others can overload to turn LoopVectorization support on or off.

Broadcasting doesn't use that right now, but I've been saying for a while that I was planning to switch it for the same approach. Currently:

julia> @macroexpand @avx y = foo.(x)
quote
    var"##469" = Base.broadcasted(foo, x)
    var"##470" = LoopVectorization.vmaterialize(var"##469", Val{:Main}())
    y = var"##470"
end

It basically Meta.lowers the expression, and then swaps out Base.materialize and Base.materialize! for LoopVectorization.vmaterialize and LoopVectorization.vmaterialize!.
Currently, there's this fallback definition:

vmaterialize!(dest, bc, ::Val{mod}) where {mod} = Base.Broadcast.materialize!(dest, bc)

With vmaterialize!(::Union{A,Adjoint{T,A},Transpose{T,A}}, args...) where {T <: Union{Bool,Base.HWReal}, A <: StridedArray{T}} being defined to use LoopVectorization.

Unfortunately, because AbstractDeviceArray{<:Base.HWReal} <: DenseArray{<:Base.HWReal}, I believe LoopVectorization.check_args(x::CuArray) returns true, and that vmaterialize! won't hit the fallback definition either.

All the checks within the library assume DenseArrays are safe, and many more broadly support StridedArrays, the assumption being that these subtypes mean "C/Fortran/BLAS-compatible strided arrays you can get pointers to". It's unfortunate that GPU arrays violate that assumption.
Could ArrayInterface.jl help / features be added and supported across the ecosystem to get more reliable dispatching?

@ChrisRackauckas
Member

Could ArrayInterface.jl help / features be added and supported across the ecosystem to get more reliable dispatching?

Yes, we should really have a trait for whether things opt in or not, and start spreading that around the ecosystem to make this better supported. Guessing can get you pretty far, but at some point libraries need a way to tell you this information.
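
For concreteness, a sketch of what such an opt-in trait could look like (all names here are hypothetical, following the BroadcastStyle pattern from Base):

abstract type MemoryStyle end
struct CPUPointerMemory <: MemoryStyle end  # strided, pointer-backed CPU memory
struct UnknownMemory <: MemoryStyle end     # GPU arrays, lazy wrappers, ...

MemoryStyle(::Type{<:Array}) = CPUPointerMemory()  # Base arrays opt in
MemoryStyle(::Type) = UnknownMemory()              # everything else stays out

# A library like LoopVectorization could then gate its fast paths on the trait:
can_use_avx(x) = MemoryStyle(typeof(x)) isa CPUPointerMemory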

@ToucheSir
Member

(Cribbing my comment from the discourse thread)

@AStupidBear #199 helps substantially for standalone tanh, but it does not appear to trigger when the broadcast is fused with the + in Flux's Dense layer.

Case in point:

julia> xx = rand(Float32, 8, 16);  # example input (not shown in the original comment)

julia> Base.broadcasted(::typeof(tanh), x) = typeof(x)

julia> Dense(8, 64, tanh)(xx)
Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{2},Nothing,typeof(+),Tuple{Array{Float32,2},Array{Float32,1}}}

Is there a generalizable way to hook into this kind of fused broadcast as well?

@ChrisRackauckas
Member

If you do a 5-arg mul, then that is pulled into the matmul kernel and doesn't need to fuse with the other one.
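
For reference, 5-arg mul!(C, A, B, α, β) computes C = A*B*α + C*β, so the bias add can ride along with the matmul and only the activation remains to broadcast. A rough sketch of this in a Dense-like layer (helper name assumed, not Flux's actual implementation):

using LinearAlgebra

function dense_5arg(σ, W, b, x::AbstractMatrix)
    y = repeat(b, 1, size(x, 2))  # seed the output with one bias column per sample
    mul!(y, W, x, true, true)     # y = W*x + y in a single BLAS call
    return σ.(y)                  # the activation is now a standalone broadcast
end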

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Jul 12, 2020

@chriselrod, do I understand this correctly: overloading LoopVectorization.check_args(::CuArray) = false would allow us to dispatch to CUDA as usual? ArrayInterface might be the right place for this, though I suppose having the overload in CUDA.jl is fair game too, since that would always hold: CuArrays do not opt in to CPU semantics.

@chriselrod

chriselrod commented Jul 13, 2020

@DhairyaLGandhi I'll update broadcasting to use check_args.
It'd be nice to add something to ArrayInterface.jl with a well-defined meaning like: "has strides, supports Ptr to local CPU memory, and should be compatible with general C/Fortran/BLAS/LAPACK routines".

For example, I'd like LoopVectorization to support MArrays from StaticArrays.jl, which it currently does not. For good support, two things would be needed:

  1. Get check_args(::MArray{<:Base.HWReal}) to return true.
  2. A bit off topic, but still on the subject of getting different libraries to communicate properly on types, is communicating static size information to LoopVectorization's optimizer. LoopVectorization special cases VectorizationBase.Static[Upper/Lower]UnitRanges to glean static size info. Defining the following four methods would probably be enough for most uses:
VectorizationBase.maybestaticfirst(::StaticArrays.SOneTo) = VectorizationBase.Static{1}()
VectorizationBase.maybestaticlast(::StaticArrays.SOneTo{N}) where {N} = VectorizationBase.Static{N}()
@generated VectorizationBase.maybestaticsize(::MArray{S}, ::Val{I}) where {S,I} = VectorizationBase.Static{(S.parameters[I])::Int}()
VectorizationBase.maybestaticlength(::MArray{S,T,N,L}) where {S,T,N,L} = VectorizationBase.Static{L}()

But there'd have to be a place to define this. StaticArrays.SizedArray should have similar overloads.

A library like StaticRanges.jl, but with an emphasis on fast load times, would be nice.
StaticRanges.jl itself however takes much longer to load on my computers than LoopVectorization.jl, making it a much heavier dependency than I want to take on (and I imagine libraries like StaticArrays.jl would feel similarly).
I could factor all the relevant code out of VectorizationBase.jl into a new library in a few weeks.

EDIT:
I will release this shortly.

julia> using LoopVectorization, Test

julia> x = rand(1000); # should be long enough to make zero differences incredibly unlikely

julia> struct FallbackArrayWrapper{T,N} <: DenseArray{T,N} # Subtypes DenseArray
           data::Array{T,N}
       end

julia> Base.size(A::FallbackArrayWrapper) = size(A.data)

julia> Base.@propagate_inbounds Base.getindex(A::FallbackArrayWrapper, i::Vararg{Int, N}) where {N} = getindex(A.data, i...)

julia> Base.@propagate_inbounds Base.setindex!(A::FallbackArrayWrapper, v, i::Vararg{Int, N}) where {N} = setindex!(A.data, v, i...)

julia> Base.IndexStyle(::Type{<:FallbackArrayWrapper}) = IndexLinear()

julia> Base.pointer(A::FallbackArrayWrapper) = pointer(A.data)

julia> @test exp.(x) == (@avx exp.(FallbackArrayWrapper(x))) # Some elements aren't exactly equal
Test Failed at REPL[11]:1
  Expression: exp.(x) == #= REPL[11]:1 =# @avx(exp.(FallbackArrayWrapper(x)))
   Evaluated: [1.1221417842854389, 2.244121039137226, 2.4589471125063094, 1.4534952264782455, 2.532038566639118, 2.2091284493072854, 1.725068467675581, 1.6073147438178572, 1.5600736225518292, 1.3611858244420798    1.7180695976532183, 1.2980672830526578, 1.5196765354463808, 1.961205825911335, 2.020849417983115, 1.6405069022562084, 2.306307169464506, 1.3112862093793074, 1.2128433365019216, 1.3482334355458843] == [1.1221417842854389, 2.244121039137226, 2.45894711250631, 1.4534952264782453, 2.5320385666391174, 2.2091284493072854, 1.7250684676755808, 1.6073147438178574, 1.560073622551829, 1.3611858244420798    1.718069597653218, 1.2980672830526578, 1.519676535446381, 1.9612058259113347, 2.0208494179831153, 1.6405069022562089, 2.306307169464506, 1.3112862093793074, 1.2128433365019216, 1.3482334355458843]
ERROR: There was an error during testing

julia> LoopVectorization.check_args(::FallbackArrayWrapper) = false # Make `check_args` false

julia> @test exp.(x) == (@avx exp.(FallbackArrayWrapper(x))) # exact equality
Test Passed

@chriselrod

@DhairyaLGandhi

SLEEFPirates doesn't seem to do much on its own on the machine I tested with; the performance boost comes from @avx

A couple of points here. (1) SLEEFPirates by itself relies on the autovectorizer for SIMD, which does not always work. For example:

julia> x = rand(Float32, 512); y = similar(x);

julia> @btime $y .= SLEEFPirates.tanh.($x);
  9.092 μs (0 allocations: 0 bytes)

julia> function vtanh!(y, x)
           @inbounds for i ∈ eachindex(x, y)
               y[i] = SLEEFPirates.tanh(x[i])
           end
       end
vtanh! (generic function with 1 method)

julia> @btime vtanh!($y, $x)
  1.146 μs (0 allocations: 0 bytes)

julia> @btime @avx $y .= SLEEFPirates.tanh.($x);
  253.366 ns (0 allocations: 0 bytes)

julia> @btime @avx $y .= tanh.($x);
  247.944 ns (0 allocations: 0 bytes)

The autovectorizer works with the simple loop, but inspecting the native/LLVM code from the broadcast shows we didn't get SIMD there. On Julia >= 1.5, the @avx version should be much faster than the SLEEFPirates version for tanh.

@ChrisRackauckas

If so, it probably fails if the activation function isn't a Julia function, so it might need to be careful and check if things are in an approved list?

costs.jl defines costs, as well as an IdDict that broadcasting uses to identify functions from types in the broadcasted objects when possible.
Functions that aren't recognized are assumed to be relatively expensive, but are assumed to work.

In the future, once it integrates more with the compiler so we can introspect functions to ensure they're defined for VectorizationBase.SVec inputs (SIMDPirates and SLEEFPirates heavily pirate Base functions by adding ::SVec methods), we could revert to fallback behavior on an unrecognized function.
However, for now, I think it is better for it to work on unrecognized functions. If a relatively simple function doesn't work, you can file a bug and I can add support; otherwise, it may be better not to use @avx in that case. @avx SIMDs across loop iterations; for complicated functions, that probably isn't optimal or is impractical.

Recently, Tim Holy suggested something like this for defining costs of unknown functions:

julia> function cost(f, tt)
           params = Core.Compiler.Params(typemax(UInt))
           mi = Base.method_instances(f, tt)[1]
           ci = code_typed(f, tt)[1][1]
           opt = Core.Compiler.OptimizationState(mi, params)
           cost(stmt::Expr) = Core.Compiler.statement_cost(stmt, -1, ci, opt.sptypes, opt.slottypes, opt.params)
           cost(stmt) = 0
           sum(cost, ci.code)
       end
cost (generic function with 1 method)

julia> isfourthv1(i) = iszero(rem(i, 4))
isfourthv1 (generic function with 1 method)

julia> isfourthv2(i) = iszero(i & 3)
isfourthv2 (generic function with 1 method)

julia> cost(isfourthv1, Tuple{Int})
41

julia> cost(isfourthv2, Tuple{Int})
2

Not perfect -- both isfourthv1 and isfourthv2 produce the exact same llvm -- but pretty good.

If the cost it uses doesn't represent the actual cost well, it may make bad unrolling decisions. The current assumption is that unknowns are expensive, so that unrolling won't be profitable; I figure it is better to play it safe.

For a function f(::T) to be compatible with @avx, f(::SVec{W,T}) must work. This should be true for reasonably simple duck-typed functions without branches (or using SIMDPirates.vifelse if they need select-type statements) that ultimately call relatively primitive functions. It's probably more restrictive than ForwardDiff.jl, but it should cover most functions someone would reasonably want to SIMD.
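
As a small illustration (assuming SIMDPirates' SVec comparison/arithmetic methods and a scalar fallback for vifelse), here is a branchy leaky-relu rewritten so one definition serves both scalars and SIMD lanes:

using SIMDPirates: vifelse

# Branchy form: on an SVec, `x > zero(x)` yields a lane mask, so the ternary errors.
leakyrelu_branchy(x) = x > zero(x) ? x : 0.01f0 * x

# Select form: lane-wise choice, usable with Float32 scalars and SVec{W,Float32} alike.
leakyrelu_simd(x) = vifelse(x > zero(x), x, 0.01f0 * x)
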
A non-trivial example, taken from LoopVectorization's test suite:

using LoopVectorization, BenchmarkTools, Test
function clenshaw(x, coeff)
    len_c = length(coeff)
    tmp = zero(x)
    ret = zero(x)
    for i in len_c:-1:2
        ret     = muladd(x,2tmp,coeff[i]-ret)
        ret,tmp = tmp,ret
    end
    ret = muladd(x,tmp,coeff[1]-ret)
    return ret
end
function clenshaw!(ret,x,coeff)
    for j in 1:length(ret)
        ret[j] = clenshaw(x[j], coeff)
    end
end
function clenshawavx!(ret,x,coeff)
    @avx for j in 1:length(ret)
        ret[j] = clenshaw(x[j], coeff)
    end
end
T = Float32; c = rand(T,100); x = rand(T,10^4); y1 = similar(x); y2 = similar(x);
@btime clenshaw!($y1, $x, $c)
#  1.589 ms (0 allocations: 0 bytes)
@btime clenshawavx!($y2, $x, $c)
# 109.479 μs (0 allocations: 0 bytes)
@test y1 ≈ y2
# Test Passed

About 14.5x faster, roughly what we'd expect from AVX512 + single precision.
It would be good to make sure all the activation functions are supported with efficient implementations.

You can test and benchmark functions on SIMD vectors via:

julia> using VectorizationBase, SLEEFPirates

julia> W = VectorizationBase.pick_vector_width(Float32) # Chosen SIMD vector width
16

julia> sx32 = SVec(ntuple(_ -> Core.VecElement(randn(Float32)), W))
SVec{16,Float32}<0.1209027f0, 0.26148129f0, 0.81624657f0, -1.1177294f0, -1.8269225f0, -0.542883f0, 2.3699f0, 0.7374754f0, 0.56119925f0, 1.3130764f0, -0.75493294f0, 0.22216304f0, -0.2675562f0, -0.3545847f0, -0.54735f0, 1.676646f0>

julia> @btime tanh($(Ref(sx32))[])
  8.462 ns (0 allocations: 0 bytes)
SVec{16,Float32}<0.12031703f0, 0.2556805f0, 0.6730218f0, -0.80677766f0, -0.9495241f0, -0.49516717f0, 0.98267066f0, 0.6276174f0, 0.50886667f0, 0.86505175f0, -0.6380826f0, 0.21857873f0, -0.2613494f0, -0.3404352f0, -0.49853143f0, 0.93242496f0>

julia> @btime tanh($(Ref(sx32[1]))[])
  7.561 ns (0 allocations: 0 bytes)
0.120317034f0

julia> @btime SLEEFPirates.tanh_fast($(Ref(sx32))[])
  7.083 ns (0 allocations: 0 bytes)
SVec{16,Float32}<0.120317034f0, 0.2556805f0, 0.6730218f0, -0.8067777f0, -0.9495241f0, -0.49516723f0, 0.98267066f0, 0.6276174f0, 0.50886667f0, 0.86505175f0, -0.6380826f0, 0.21857871f0, -0.2613494f0, -0.34043518f0, -0.4985314f0, 0.93242496f0>

Calculating 16 tanhs doesn't take much longer than calculating one (with AVX512 on Julia >= 1.5); in fact, one takes longer than 16 computed with SLEEFPirates.tanh_fast (with AVX512 on Linux with a recent GLIBC).

To be clear, in this broadcast:

(c::D3)(x::AbstractArray) = @avx c.σ.(c.W * x .+ c.b)

It'll use SVecs of width VectorizationBase.pick_vector_width(Float32) (Float32, because that will be the element type of all the arrays). That'll be 8 with AVX, 16 with AVX512, or 4 without them.

@DhairyaLGandhi
Member Author

DhairyaLGandhi commented Jul 14, 2020

The design goal of this piece to me is as follows:

  1. Existing user code breakages should be minimised, and existing test suites should pass without regressions in runtime performance.
  2. Generic code blocks should remain generic across devices, CPU and GPU targets included. The ArrayInterface proposal probably covers that.
  3. When @avx is available, try @avx; if a function is only defined elementwise (i.e. no f(::SVec) method exists), it should warn that the method should be implemented (with a simple switch to turn warnings off, but no error) and fall back to the existing default broadcasting behavior (see the sketch after this list).
  4. We should minimise the number of methods users (as well as Flux) have to define for any single activation function for it to also work across GPUs (scalar indexing notwithstanding). CUDA can run f.(::CuArray) directly, which has let us avoid much duplication. LV's goal, afaik, is not to support arbitrary code but to accelerate known common cases. In Flux we don't know ahead of time which f we have to differentiate; in common cases we would need to accelerate it, and in the others we shouldn't error.
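
A minimal sketch of point 3, with hypothetical names (this is not an existing Flux or LV API): try the vectorized broadcast when the array type qualifies, and warn once and fall back to plain broadcasting when the function lacks SVec support.

using LoopVectorization

function maybe_avx(f, x::AbstractArray)
    LoopVectorization.check_args(x) || return f.(x)  # e.g. GPU arrays fall through
    try
        return @avx f.(x)
    catch
        @warn "`$f` has no SIMD (`SVec`) method; using plain broadcast." maxlog=1
        return f.(x)
    end
end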

I'm not sure exactly how it would interact with other ADs in the ecosystem just yet, but it is a fairly common trick to replace Zygote with something else to test for performance/correctness, etc.

I would appreciate some feedback on whether these represent roughly the right expectations, since this directly impacts the API for user-defined functions.

For a function f(::T) to be compatible with @avx, then f(::SVec{W,T}) must work. This should be true for reasonably simple duck-typed functions without branches (or using SIMDPirates.vifelse if they need select-type statements), and ultimately calling relatively primitive functions

Here we usually deal with T <: Number, so perhaps we could add any missing activations to the set LV's autovectorizer already recognises. I would personally not expect users to define simple functions using non-Base, blessed if-else replacements, which seems to add to the complexity of the API. Tooling around that would take a while to mature, so maybe that's a topic for a different discussion.
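
One quick way to probe which activations already have a working SVec path, reusing the SVec construction shown earlier (a sketch; NNlib.relu is just an example of a plain elementwise definition):

using VectorizationBase, SLEEFPirates, NNlib

W = VectorizationBase.pick_vector_width(Float32)
sv = SVec(ntuple(_ -> Core.VecElement(randn(Float32)), W))

works_on_svec(f) = try f(sv); true catch; false end

works_on_svec(tanh)        # true: SLEEFPirates pirates tanh(::SVec)
works_on_svec(NNlib.relu)  # reportedly errored at the time of this discussion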

@DhairyaLGandhi
Member Author

I think the basic goal is to accelerate known functions and to support LV more generally across the framework, while not dropping support for existing language primitives like control flow.
