changed mul_heuristic for non-float #514

Open
wants to merge 1 commit into
base: master

Conversation

ryanelandt

@ryanelandt ryanelandt commented Oct 9, 2018

Motivated by #513, this PR changes the multiplication heuristic for non-float types. This should fix a performance issue that affects Dual.

Although I didn't do it in this PR because it would be too controversial, I think that the same heuristic should be used for floats. Maybe the compiler has come a long way, because it now seems to do the right thing most of the time (mul_loop outperforms the current heuristic on average). Given the current state of the compiler (see item 4), I think that we should reconsider whether the float-specific heuristic is still needed.

Fixes #513
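
For readers unfamiliar with the distinction being discussed, here is a minimal sketch of a "rolled" versus a fully "unrolled" matrix product. It is plain Julia and is not the StaticArrays implementation: the names `rolled_mul` and `unrolled_mul` are hypothetical, and passing the sizes as `Val` parameters is an assumption made only to keep the example self-contained.

```julia
# Hypothetical illustration; these are not StaticArrays internals.

# Rolled version: an ordinary triple loop that the compiler may or may not unroll.
function rolled_mul(A::AbstractMatrix, B::AbstractMatrix)
    m, k = size(A)
    k2, n = size(B)
    k == k2 || throw(DimensionMismatch("inner dimensions must match"))
    T = promote_type(eltype(A), eltype(B))
    C = zeros(T, m, n)
    for j in 1:n, i in 1:m
        acc = zero(T)
        for l in 1:k
            acc += A[i, l] * B[l, j]
        end
        C[i, j] = acc
    end
    return C
end

# Unrolled version: the sizes are type parameters, so the generator emits one
# explicit sum-of-products expression per output element (column-major order).
@generated function unrolled_mul(::Val{M}, ::Val{K}, ::Val{N}, A, B) where {M,K,N}
    exprs = [
        reduce((x, y) -> :($x + $y), [:(A[$i, $l] * B[$l, $j]) for l in 1:K])
        for j in 1:N for i in 1:M
    ]
    return :(tuple($(exprs...)))
end

A = [1.0 2.0; 3.0 4.0]
B = [5.0 6.0; 7.0 8.0]
C_rolled = rolled_mul(A, B)
C_unrolled = reshape(collect(unrolled_mul(Val(2), Val(2), Val(2), A, B)), 2, 2)
println(C_rolled == C_unrolled)
```

Both functions compute the same product. The question in this PR is which form the compiler turns into faster code for a given element type and size.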

@andyferris
Member

I think that the same heuristic should be used for floats

In most cases - this does replace the heuristic for floats.

The only exceptions are for MArray and SizedArray since they can (and should) fall back to BLAS... but the other parts of that specialized function should probably match this one?

@ryanelandt
Author

You're right; I saw SizedMatrix and thought StaticMatrix. My bad.

I will investigate heuristic performance for MArray and SizedArray and report back. I would guess that the answer is yes.

@ryanelandt
Author

ryanelandt commented Oct 24, 2018

The results of the tests on MMatrix, SizedMatrix and SMatrix for Float64 are below, and the code used to generate them is at the bottom. My conclusions: compared to the current heuristic, the rolled loop (mul_loop) provides modest improvements for MMatrix/SizedMatrix and slight improvements for SMatrix. Although always using rolled matrix multiplication is likely not the best possible heuristic, I think it is the one that StaticArrays should use because it is (1) easy to implement, (2) robust to novel data types (see #513), and (3) future-proof. Do you want me to go ahead and update the multiplication heuristic for MMatrix and SizedMatrix?

MMatrix

MINIMUM TIME                              t.loop
  t.loop   t.chuck   t.unroll  t.default t.default/      size
   2.073     4.901     2.074     2.075     0.999    (unroll default)   ( 2,  2,  2)
   7.730    10.557     7.726     7.727     1.000*** (unroll default)   ( 3,  3,  3)
   3.368     7.995     4.652     4.652     0.724    (unroll default)   ( 4,  3,  3)
  10.812    13.692    10.806    10.807     1.000    (unroll default)   ( 3,  3,  4)
  14.407    18.288    14.409    14.408     1.000    (unroll default)   ( 3,  4,  4)
   4.950    10.241     6.779     6.757     0.733    (unroll default)   ( 4,  4,  3)
   6.338    10.832     6.448     6.445     0.983*** (unroll default)   ( 4,  4,  4)
  29.619    23.715    30.400    30.395     0.974    (unroll default)   ( 5,  5,  5)
  45.788    52.059    47.297    47.295     0.968    (unroll default)   ( 6,  6,  6)
  58.872    65.491    53.873    53.679     1.097    (unroll default)   ( 8,  8,  8)
  60.449    57.003    55.448    55.440     1.090    (unroll default)   (14,  3,  9)
  60.964    50.867   121.848   123.269     0.495    (unroll default)   ( 9, 14,  3)
 115.015   121.230   107.557   107.557     1.069    (unroll default)   ( 3,  9, 14)
 117.127   214.249    95.178   215.998     0.542     (chuck default)   (14,  4, 12)
  53.251    63.855    86.645    63.409     0.840     (chuck default)   (12, 14,  4)
  67.459   322.691    77.533   321.113     0.210     (chuck default)   ( 4, 12, 14)

MAXIMUM TIME                              t.loop
  t.loop   t.chuck   t.unroll  t.default t.default/      size
   2.175     5.521     2.205     2.315     0.939    (unroll default)   ( 2,  2,  2)
   9.154    11.515     8.598     8.361     1.095*** (unroll default)   ( 3,  3,  3)
   3.791    10.087     5.679     5.154     0.736    (unroll default)   ( 4,  3,  3)
  11.909    15.495    12.752    11.993     0.993    (unroll default)   ( 3,  3,  4)
  15.451    21.098    15.546    15.617     0.989    (unroll default)   ( 3,  4,  4)
   5.984    11.948     7.474     7.438     0.804    (unroll default)   ( 4,  4,  3)
   6.873    12.161     7.285     7.137     0.963*** (unroll default)   ( 4,  4,  4)
  32.455    26.016    34.875    33.796     0.960    (unroll default)   ( 5,  5,  5)
  51.458    57.137    53.678    52.737     0.976    (unroll default)   ( 6,  6,  6)
  65.131    72.903    61.722    65.883     0.989    (unroll default)   ( 8,  8,  8)
  70.474    65.449    63.732    62.473     1.128    (unroll default)   (14,  3,  9)
  68.526    57.733   140.109   139.151     0.492    (unroll default)   ( 9, 14,  3)
 124.767   141.089   117.295   116.641     1.070    (unroll default)   ( 3,  9, 14)
 131.910   240.933   100.761   229.787     0.574     (chuck default)   (14,  4, 12)
  55.108    66.231    94.707    68.277     0.807     (chuck default)   (12, 14,  4)
  71.788   337.778    83.411   331.628     0.216     (chuck default)   ( 4, 12, 14)

SizedMatrix

MINIMUM TIME                              t.loop
  t.loop   t.chuck   t.unroll  t.default t.default/      size
   2.076     5.332     2.075     2.333     0.890    (unroll default)   ( 2,  2,  2)
   7.730    12.676     7.728     7.727     1.000*** (unroll default)   ( 3,  3,  3)
   5.611    11.683     6.077     6.074     0.924    (unroll default)   ( 4,  3,  3)
  10.813    16.146    10.808    10.808     1.000    (unroll default)   ( 3,  3,  4)
  14.412    20.924    14.408    14.408     1.000    (unroll default)   ( 3,  4,  4)
   4.394    12.685     4.532     4.537     0.968    (unroll default)   ( 4,  4,  3)
   5.372    14.006     5.648     5.505     0.976*** (unroll default)   ( 4,  4,  4)
  28.421    31.509    29.633    29.627     0.959    (unroll default)   ( 5,  5,  5)
  31.403    52.888    47.851    47.851     0.656    (unroll default)   ( 6,  6,  6)
  58.365    61.879    54.455    54.254     1.076    (unroll default)   ( 8,  8,  8)
  63.941    67.916    58.543    58.743     1.088    (unroll default)   (14,  3,  9)
  59.923    73.609   112.948   114.238     0.525    (unroll default)   ( 9, 14,  3)
 111.190   117.087   105.574   105.481     1.054    (unroll default)   ( 3,  9, 14)
 115.633   125.493    98.594   121.604     0.951     (chuck default)   (14,  4, 12)
  45.285    69.400    80.747    69.402     0.653     (chuck default)   (12, 14,  4)
  47.402   125.304    64.636   125.528     0.378     (chuck default)   ( 4, 12, 14)

MAXIMUM TIME                              t.loop
  t.loop   t.chuck   t.unroll  t.default t.default/      size
   2.581     6.403     2.476     2.978     0.867    (unroll default)   ( 2,  2,  2)
   9.001    13.992     8.679     8.724     1.032*** (unroll default)   ( 3,  3,  3)
   6.212    13.189     6.785     7.354     0.845    (unroll default)   ( 4,  3,  3)
  12.004    17.699    12.555    11.615     1.034    (unroll default)   ( 3,  3,  4)
  17.216    23.072    16.049    16.166     1.065    (unroll default)   ( 3,  4,  4)
   5.227    14.766     5.478     5.377     0.972    (unroll default)   ( 4,  4,  3)
   5.982    15.971     6.078     6.564     0.911*** (unroll default)   ( 4,  4,  4)
  30.591    35.521    34.247    32.578     0.939    (unroll default)   ( 5,  5,  5)
  33.845    58.170    55.175    55.989     0.604    (unroll default)   ( 6,  6,  6)
  71.162    69.424    58.067    60.038     1.185    (unroll default)   ( 8,  8,  8)
  68.896    76.411    66.313    69.118     0.997    (unroll default)   (14,  3,  9)
  67.477    82.571   131.625   126.844     0.532    (unroll default)   ( 9, 14,  3)
 119.902   130.886   118.403   117.470     1.021    (unroll default)   ( 3,  9, 14)
 129.110   139.837   114.198   142.633     0.905     (chuck default)   (14,  4, 12)
  51.730    78.507    92.106    77.134     0.671     (chuck default)   (12, 14,  4)
  51.967   137.541    71.263   138.583     0.375     (chuck default)   ( 4, 12, 14)

SMatrix

MINIMUM TIME                              t.loop
  t.loop   t.chuck   t.unroll  t.default t.default/      size
   1.819     5.154     2.068     1.822     0.998    (unroll default)   ( 2,  2,  2)
   7.729    10.531     7.727     7.733     0.999*** (unroll default)   ( 3,  3,  3)
   3.356     7.770     3.111     3.359     0.999    (unroll default)   ( 4,  3,  3)
  10.812    13.235    10.809    10.810     1.000    (unroll default)   ( 3,  3,  4)
  14.408    15.872    14.407    14.407     1.000    (unroll default)   ( 3,  4,  4)
   3.623     9.020     3.881     3.874     0.935    (unroll default)   ( 4,  4,  3)
   4.651    11.093     4.657     4.651     1.000*** (unroll default)   ( 4,  4,  4)
  27.926    25.224    27.258    28.663     0.974    (unroll default)   ( 5,  5,  5)
  29.991    39.040    29.194    29.304     1.023    (unroll default)   ( 6,  6,  6)
  52.218    61.553    66.954    75.179     0.695    (unroll default)   ( 8,  8,  8)
  90.728    78.258    60.175    53.989     1.680    (unroll default)   (14,  3,  9)
  60.112    59.161    85.392    86.895     0.692    (unroll default)   ( 9, 14,  3)
 112.377    99.059   112.281   106.007     1.060    (unroll default)   ( 3,  9, 14)
 113.125   105.840   130.766   111.691     1.013     (chuck default)   (14,  4, 12)
  45.473   101.002    87.289    71.480     0.636     (chuck default)   (12, 14,  4)
  47.969    87.554    61.987    83.615     0.574     (chuck default)   ( 4, 12, 14)

MAXIMUM TIME                              t.loop
  t.loop   t.chuck   t.unroll  t.default t.default/      size
   1.973     5.698     2.340     2.044     0.965    (unroll default)   ( 2,  2,  2)
   8.508    11.575     8.467     8.435     1.009*** (unroll default)   ( 3,  3,  3)
   3.691     8.869     3.451     3.726     0.990    (unroll default)   ( 4,  3,  3)
  11.700    14.741    11.419    11.732     0.997    (unroll default)   ( 3,  3,  4)
  15.426    17.431    15.594    15.864     0.972    (unroll default)   ( 3,  4,  4)
   4.291     9.738     4.093     4.175     1.028    (unroll default)   ( 4,  4,  3)
   5.043    12.557     5.054     5.014     1.006*** (unroll default)   ( 4,  4,  4)
  30.257    28.311    30.703    31.825     0.951    (unroll default)   ( 5,  5,  5)
  31.887    44.124    33.123    31.700     1.006    (unroll default)   ( 6,  6,  6)
  59.612    68.653    73.701    82.356     0.724    (unroll default)   ( 8,  8,  8)
  98.975    88.194    66.272    60.927     1.624    (unroll default)   (14,  3,  9)
  71.787    65.768    94.558    99.952     0.718    (unroll default)   ( 9, 14,  3)
 123.761   120.008   128.358   114.799     1.078    (unroll default)   ( 3,  9, 14)
 123.287   116.476   146.230   124.308     0.992     (chuck default)   (14,  4, 12)
  53.789   111.039    98.186    78.839     0.682     (chuck default)   (12, 14,  4)
  52.806    99.494    71.560    95.252     0.554     (chuck default)   ( 4, 12, 14)
using StaticArrays
using BenchmarkTools
using Statistics
using Printf

test_cases = [
(2,2,2),
(3,3,3),
(4,3,3),
(3,3,4),
(3,4,4),
(4,4,3),
(4,4,4),
(5,5,5),
(6,6,6),
(8,8,8),
(14,3,9),
(9,14,3),
(3,9,14),
(14,4,12),
(12,14,4),
(4,12,14)
]

n_cases = length(test_cases)
data_min = zeros(n_cases, 4)
data_mean = zeros(n_cases, 4)

for k = 1:n_cases
    i1, i2, i3 = test_cases[k]

    if false
        println("SMatrix")
        A = rand(SMatrix{i1,i2,Float64,i1*i2})
        B = rand(SMatrix{i2,i3,Float64,i2*i3})
    elseif false
        println("MMatrix")
        A = rand(MMatrix{i1,i2,Float64,i1*i2})
        B = rand(MMatrix{i2,i3,Float64,i2*i3})
    else
        println("SizedMatrix")
        A = Size(i1,i2)(rand(i1,i2))
        B = Size(i2,i3)(rand(i2,i3))
    end

    # NOTE: At least as of Julia 1.4, the use of Ref is necessary to prevent unwanted compiler optimizations
    # info_loop     = @benchmark StaticArrays.mul_loop($(Size(A)),$(Size(B)),$A,$B)
    # info_chunks   = @benchmark StaticArrays.mul_unrolled_chunks($(Size(A)),$(Size(B)),$A,$B)
    # info_unrolled = @benchmark StaticArrays.mul_unrolled($(Size(A)),$(Size(B)),$A,$B)
    # info_default  = @benchmark $A * $B

    info_loop     = @benchmark StaticArrays.mul_loop($(Size(A)),$(Size(B)),$(Ref(A))[],$(Ref(B))[])
    info_chunks   = @benchmark StaticArrays.mul_unrolled_chunks($(Size(A)),$(Size(B)),$(Ref(A))[],$(Ref(B))[])
    info_unrolled = @benchmark StaticArrays.mul_unrolled($(Size(A)),$(Size(B)),$(Ref(A))[],$(Ref(B))[])
    info_default  = @benchmark $(Ref(A))[] * $(Ref(B))[]

    # Take the minimum explicitly instead of relying on `times` being sorted
    min_loop     = minimum(info_loop.times)
    min_chunks   = minimum(info_chunks.times)
    min_unrolled = minimum(info_unrolled.times)
    min_default  = minimum(info_default.times)
    data_min[k, :] = [min_loop, min_chunks, min_unrolled, min_default]

    mean_loop     = mean(info_loop.times)
    mean_chunks   = mean(info_chunks.times)
    mean_unrolled = mean(info_unrolled.times)
    mean_default  = mean(info_default.times)
    data_mean[k, :] = [mean_loop, mean_chunks, mean_unrolled, mean_default]

    println("$k/$n_cases")
end

function force_pad(x::Float64)
    s = @sprintf "% 1.3f" x
    (x < 100) && (s = " " * s)
    (x < 10) && (s = " " * s)
    return s
end

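# NOTE: the table printed under "MAXIMUM TIME" actually reports the mean times (data_mean);
# the label is kept as-is so that it matches the tables posted in this thread.
for (description, data_compare) = [("MINIMUM TIME", data_min), ("MAXIMUM TIME", data_mean)]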
    println()
    println(description * "                              t.loop")
    println("  t.loop   t.chuck   t.unroll  t.default t.default/      size")
    for k = 1:n_cases
        case_k = test_cases[k]
        time_k = data_compare[k, :]
        time_loop = time_k[1]
        time_default = time_k[4]
        for kk = 1:4
            print(force_pad(time_k[kk]), "  ")
        end
        print(force_pad(time_loop / time_default))
        (all(case_k .== (3, 3, 3)) || all(case_k .== (4, 4, 4))) ? print("*** ") : print("    ")
        if prod(case_k) <= 8^3
            print("(unroll default)")
        elseif maximum(case_k) <= 14
            print(" (chuck default)")
        else
            println("will unroll already no need to test this case")
        end
        string_size = "   (" * lpad(case_k[1], 2) * "," * lpad(case_k[2], 3) * "," * lpad(case_k[3], 3) * ")"
        println(string_size)
    end
end
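
As a side note on the statistics collected above, BenchmarkTools provides summary estimators on a trial, which avoids indexing into the raw `times` vector. The following is a minimal sketch, assuming a trivial workload (`sum` over a random vector) chosen only for illustration; the `samples` and `evals` values are arbitrary.

```julia
using BenchmarkTools
using Statistics

x = rand(100)

# Collect a small trial; parameters are kept low so that this runs quickly.
trial = @benchmark sum($x) samples=200 evals=10

# Each estimator returns a TrialEstimate; `time` extracts nanoseconds.
t_min  = time(minimum(trial))
t_mean = time(mean(trial))
t_max  = time(maximum(trial))

println("min = ", t_min, " ns, mean = ", t_mean, " ns, max = ", t_max, " ns")
```

The minimum is usually the least noisy estimate for microbenchmarks like these, while the mean is pulled upward by occasional slow samples.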

@c42f
Member

c42f commented Aug 1, 2019

This needs a deeper look; possibly it's just ready to merge. Any thoughts @andyferris? From your comment I can't tell whether you're worried about replacing older heuristics or you think it's fine?

@mateuszbaran
Collaborator

I have re-run these benchmarks and, for SMatrix for example, I get very different results:

MINIMUM TIME                              t.loop
  t.loop   t.chuck   t.unroll  t.default t.default/      size
   0.023     0.023     0.023     0.023     1.000    (unroll default)   ( 2,  2,  2)
   7.091     0.023     0.023     0.023   308.308*** (unroll default)   ( 3,  3,  3)
   4.484     0.023     0.023     0.023   194.957    (unroll default)   ( 4,  3,  3)
   8.578     0.023     0.023     0.023   372.938    (unroll default)   ( 3,  3,  4)
  11.549     0.023     0.023     0.023   502.111    (unroll default)   ( 3,  4,  4)
   5.599     0.023     0.023     0.023   243.435    (unroll default)   ( 4,  4,  3)
   6.710     0.023     0.023     0.023   291.739*** (unroll default)   ( 4,  4,  4)
  34.517     0.023     0.023     0.023   1500.722    (unroll default)   ( 5,  5,  5)
  59.338    56.294    34.430    34.331     1.728    (unroll default)   ( 6,  6,  6)
  61.891    77.338   165.161   165.002     0.375    (unroll default)   ( 8,  8,  8)
 112.110    96.582    64.594    64.599     1.735    (unroll default)   (14,  3,  9)
  83.304     0.023     0.023     0.023   3621.911    (unroll default)   ( 9, 14,  3)
  95.472   142.231   123.598   121.716     0.784    (unroll default)   ( 3,  9, 14)
 188.695   155.224   172.656   159.319     1.184     (chuck default)   (14,  4, 12)
  73.340   110.361   162.664    85.995     0.853     (chuck default)   (12, 14,  4)
  64.379   151.133    73.198   147.298     0.437     (chuck default)   ( 4, 12, 14)

MAXIMUM TIME                              t.loop
  t.loop   t.chuck   t.unroll  t.default t.default/      size
   0.025     0.025     0.025     0.025     1.000    (unroll default)   ( 2,  2,  2)
   7.450     0.025     0.025     0.025   295.440*** (unroll default)   ( 3,  3,  3)
   4.640     0.025     0.025     0.025   189.106    (unroll default)   ( 4,  3,  3)
   8.631     0.025     0.026     0.024   352.536    (unroll default)   ( 3,  3,  4)
  11.967     0.025     0.025     0.025   488.335    (unroll default)   ( 3,  4,  4)
   5.684     0.025     0.026     0.025   231.865    (unroll default)   ( 4,  4,  3)
   6.747     0.025     0.025     0.025   275.311*** (unroll default)   ( 4,  4,  4)
  35.203     0.025     0.024     0.025   1435.649    (unroll default)   ( 5,  5,  5)
  59.620    56.455    34.685    34.528     1.727    (unroll default)   ( 6,  6,  6)
  62.092    77.794   168.805   169.522     0.366    (unroll default)   ( 8,  8,  8)
 119.095    99.072    68.834    65.002     1.832    (unroll default)   (14,  3,  9)
  83.516     0.025     0.025     0.025   3407.590    (unroll default)   ( 9, 14,  3)
  96.060   142.596   126.501   128.281     0.749    (unroll default)   ( 3,  9, 14)
 190.710   155.839   176.933   159.815     1.193     (chuck default)   (14,  4, 12)
  73.525   117.175   163.970    86.262     0.852     (chuck default)   (12, 14,  4)
  65.319   153.080    73.805   147.597     0.443     (chuck default)   ( 4, 12, 14)

And here are the results for MMatrix:

MINIMUM TIME                              t.loop
  t.loop   t.chuck   t.unroll  t.default t.default/      size
  12.140    17.970    11.915    12.118     1.002    (unroll default)   ( 2,  2,  2)
  22.248    29.819    23.016    22.999     0.967*** (unroll default)   ( 3,  3,  3)
  27.042    20.683    25.859    25.779     1.049    (unroll default)   ( 4,  3,  3)
  26.715    31.953    25.837    25.719     1.039    (unroll default)   ( 3,  3,  4)
  31.225    37.436    30.691    30.464     1.025    (unroll default)   ( 3,  4,  4)
  31.594    21.136    31.378    30.947     1.021    (unroll default)   ( 4,  4,  3)
  39.068    25.408    37.781    38.016     1.028*** (unroll default)   ( 4,  4,  4)
  69.047    46.422    65.049    64.835     1.065    (unroll default)   ( 5,  5,  5)
 112.260    72.935   108.287   108.387     1.036    (unroll default)   ( 6,  6,  6)
 261.536   115.096   237.487   237.093     1.103    (unroll default)   ( 8,  8,  8)
 263.405   120.496   208.487   200.171     1.316    (unroll default)   (14,  3,  9)
 182.642    90.648   187.398   185.855     0.983    (unroll default)   ( 9, 14,  3)
 183.399   194.597   169.744   169.103     1.085    (unroll default)   ( 3,  9, 14)
 483.821   215.002   360.828   205.885     2.350     (chuck default)   (14,  4, 12)
 319.562   132.327   320.906   131.083     2.438     (chuck default)   (12, 14,  4)
 325.654   139.729   292.704   137.409     2.370     (chuck default)   ( 4, 12, 14)

MAXIMUM TIME                              t.loop
  t.loop   t.chuck   t.unroll  t.default t.default/      size
  16.638    24.282    16.364    16.540     1.006    (unroll default)   ( 2,  2,  2)
  28.054    35.343    28.764    29.000     0.967*** (unroll default)   ( 3,  3,  3)
  32.354    26.493    31.200    31.070     1.041    (unroll default)   ( 4,  3,  3)
  31.951    37.576    31.083    31.159     1.025    (unroll default)   ( 3,  3,  4)
  36.562    42.495    37.776    35.981     1.016    (unroll default)   ( 3,  4,  4)
  36.853    27.061    36.780    36.382     1.013    (unroll default)   ( 4,  4,  3)
  45.337    31.970    43.889    44.207     1.026*** (unroll default)   ( 4,  4,  4)
  77.235    54.762    73.255    72.994     1.058    (unroll default)   ( 5,  5,  5)
 124.461    85.536   120.765   120.650     1.032    (unroll default)   ( 6,  6,  6)
 281.716   132.658   256.729   255.798     1.101    (unroll default)   ( 8,  8,  8)
 323.401   173.968   254.756   254.894     1.269    (unroll default)   (14,  3,  9)
 192.173    98.995   198.007   195.362     0.984    (unroll default)   ( 9, 14,  3)
 197.699   209.413   184.568   183.534     1.077    (unroll default)   ( 3,  9, 14)
 546.123   266.450   420.842   261.034     2.092     (chuck default)   (14,  4, 12)
 343.275   150.832   345.383   150.870     2.275     (chuck default)   (12, 14,  4)
 348.331   159.101   314.240   155.983     2.233     (chuck default)   ( 4, 12, 14)

We do need better heuristics here, but the one proposed in this PR is IMO worse for Float64.

Ref: #814 .

@ryanelandt
Author

@mateuszbaran, the times measured by benchmarking are in nanoseconds. We expect most operations to take at least 1.0 nanoseconds (about 3 clock cycles), so times like 0.025 nanoseconds are the result of unwanted compiler optimizations. In other words, the compiler is simplifying the expression before it can be benchmarked, so what you're measuring is the time it takes to evaluate something much simpler, in many cases a constant.

As mentioned in #597 (comment), one way to circumvent this issue is to use Ref. I've updated the code to do this as this seems to be necessary as of Julia 1.4.

In this case, using Ref has the side effect that the times of the default method no longer match those of any of the measured methods, so you'll have to compare the first three columns and not rely on the stated ratio.
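
To make the Ref workaround concrete, here is a minimal sketch that times the same 3×3 product with and without it. The matrix size and the `samples`/`evals` values are assumptions picked for illustration. Whether the plain version is actually constant-folded depends on the Julia, StaticArrays and BenchmarkTools versions in use.

```julia
using BenchmarkTools
using StaticArrays

A = rand(SMatrix{3,3,Float64,9})
B = rand(SMatrix{3,3,Float64,9})

# Plain interpolation: the compiler may treat A and B as constants and fold
# the whole product away, giving implausibly small timings.
t_plain = @belapsed $A * $B samples=200 evals=100

# Ref interpolation: the values are loaded through a reference at run time,
# which is intended to keep the multiplication in the benchmarked code.
t_ref = @belapsed $(Ref(A))[] * $(Ref(B))[] samples=200 evals=100

# @belapsed returns seconds; convert to nanoseconds for printing.
println("plain: ", t_plain * 1e9, " ns")
println("Ref:   ", t_ref * 1e9, " ns")
```

If the plain timing comes out far below a nanosecond while the Ref timing does not, the plain benchmark was most likely measuring a folded constant rather than the multiplication.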

@mateuszbaran
Collaborator

Yes, that's a good point, but even after adding Refs, looping still isn't consistently better than, or even as good as, unrolled or chunked multiplication. I will try running more benchmarks to figure out better heuristics here.

Labels
performance runtime performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Matrix multiplication logic poor for ForwardDiff.Dual
4 participants