
Benchmarks auto repetitions #791

Merged
MarcelKoch merged 11 commits into develop from benchmarks-auto-repetitions on Jun 28, 2021

Conversation

MarcelKoch
Member

@MarcelKoch MarcelKoch commented Jun 11, 2021

This PR enables automatic deduction of the number of repetitions for the benchmarks.

For small working sets, the benchmark timings may be too sensitive to outliers. With this PR, the number of repetitions for a benchmark run is estimated such that the whole benchmark takes >= 0.5 s. This should result in more stable benchmarks for small problems.

If the repetitions are set to auto, the warm-up step is skipped.

WIP: The PR enables the new behavior only for the blas, conversion, and spmv benchmarks.
Todo:

  • enable for solver and preconditioner benchmarks
  • handle large benchmarks (one repetition takes >0.5s) gracefully
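The runtime-based estimation described above can be sketched as follows. This is a minimal illustration; `estimate_repetitions` and its parameters are hypothetical names, not the PR's actual API:

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical helper: given the measured runtime of a single trial run,
// estimate how many repetitions are needed so that the whole benchmark
// takes at least `min_runtime_s` seconds (0.5 s in the PR description).
inline int estimate_repetitions(double single_run_seconds,
                                double min_runtime_s = 0.5)
{
    if (single_run_seconds <= 0.0) {
        return 1;  // degenerate measurement, fall back to one repetition
    }
    // round up so that reps * single_run_seconds >= min_runtime_s
    return std::max(1, static_cast<int>(
                           std::ceil(min_runtime_s / single_run_seconds)));
}
```

A benchmark taking 0.25 s per run would then be repeated twice, while anything at or above 0.5 s per run gets a single repetition, which is the "large benchmark" case from the to-do list.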

@yhmtsai
Member

yhmtsai commented Jun 11, 2021

Do we need to estimate the timing, or can we just use the warm-up iterations and then repeat until we reach the 0.5 sec?
Does the warm-up (estimate) need to take 0.25 sec to warm up the device?

@Slaedr
Contributor

Slaedr commented Jun 11, 2021

What's the reasoning for taking total time as the criterion? With statistics in mind, perhaps it is the number of repetitions that's more important. I like total time for other considerations, such as making sure enough samples are collected by a profiler, but that does not seem to be applicable here.

@MarcelKoch
Member Author

@yhmtsai The repetitions are estimated before the real benchmark run. This way, the timing calls can be moved outside of the repetition loop to reduce the overhead in that loop.
The estimate runs are on the same device as the real benchmark run.

@MarcelKoch
Member Author

@Slaedr In my experience, the runtime for small problems (within L1/L2 cache) can vary quite significantly if only a low number of repetitions is used. With this PR I want to get a more stable average runtime for such problems. I should add that the number of repetitions is now also exported in the JSON output.

@upsj
Member

upsj commented Jun 11, 2021

How about we start off with warmup + a fixed number of iterations, compute the standard deviation of the runtime in the fixed set, and if that exceeds a certain value relative to the average, we increase the number of iterations by some scheme? I'll check what quickbench, GBench and nonius are doing there.

@yhmtsai
Member

yhmtsai commented Jun 11, 2021

For blas, the prepare step is used for the estimate, but some prepare functions are empty. Is that acceptable?
For spmv, the apply is used without refreshing the memory.
If we would like to refresh the memory for each timing, we cannot move the timing out of the loop.

@MarcelKoch
Member Author

Concerning the pre and post operations, I was also not sure if they are used correctly or are even necessary. I guess I will look a bit more into that.

@Slaedr
Contributor

Slaedr commented Jun 11, 2021

@MarcelKoch Ah, perhaps I see your logic. The total time is a good rule of thumb for estimating a good number of repetitions. If something takes very little time, we are likely to need many repeats, which your rule provides. If something takes a lot of time, perhaps it needs only a few repeats, which is again factored into your scheme. There might be some corner cases, though, where some poorly implemented algorithm takes a lot of time but its runtime is still quite variable. I think what @upsj proposed might be more statistically sound; I remember doing that manually for some of my studies in the past.

@MarcelKoch
Member Author

@Slaedr Yes, exactly, the runtime is just used as a rule of thumb. I picked up 0.5 s as a reasonable minimal runtime somewhere, although I'm not sure where exactly. Whether that runtime is too high or too low is open for discussion.

@upsj Running these kinds of statistical tests seems a bit overkill to me. At least in my experience, the variation was quite low for these larger runtimes (>=0.5s), assuming that the machine does not use frequency scaling. If that is enabled, benchmarks can be quite unreliable, so I just ignored that case.
Also, starting with a fixed number of iterations might not be a good choice if one iteration already takes long. That case is also on my to-do list.

@upsj
Member

upsj commented Jun 11, 2021

So nonius provides really powerful analysis, but also mainly analyzes tiny pieces of code, so the methods they use there (bootstrapping) don't make much sense on such small sample sizes.
So I would suggest using either the standard deviation or quartile distance as a measure for variability.

We have a certain variety of runtimes (going from fast to slow: overhead benchmarks (nanoseconds), BLAS, SpMV, preconditioners, solvers (seconds)), so I think it might make sense to have something robust that works on all of them. Requiring SpMVs to run for 0.5 s looks like a lot of overhead to me, especially since GPUs often have much less variability than CPUs.

@upsj
Member

upsj commented Jun 11, 2021

So remembering my statistics lectures back in the day, I think if we want to reduce the standard deviation by a factor of 2, we need to run 4x as many benchmarks, so I guess my suggestion would be

  1. compute rel_stddev = stddev / average
  2. compute scale = rel_stddev / rel_stddev_limit
  3. if scale > 1: run a factor of scale * scale - 1 additional iterations (limited by some max number of iterations)
  4. if the standard deviation didn't decrease significantly then, stop anyways and report statistics (quantiles, outliers)
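The scheme above could be sketched like this. The function name and the sample-variance details are my assumptions, not code from this PR:

```cpp
#include <cmath>
#include <numeric>
#include <vector>

// Sketch of the proposed adaptive scheme: if the relative standard
// deviation of the collected runtimes exceeds the limit by a factor of
// `scale`, run roughly (scale^2 - 1) * n additional iterations, since
// halving the stddev of the mean requires ~4x as many samples.
inline int additional_iterations(const std::vector<double>& runtimes,
                                 double rel_stddev_limit)
{
    const auto n = runtimes.size();
    if (n < 2) {
        return 0;  // not enough samples for a variance estimate
    }
    const double avg =
        std::accumulate(runtimes.begin(), runtimes.end(), 0.0) / n;
    double var = 0.0;
    for (auto t : runtimes) {
        var += (t - avg) * (t - avg);
    }
    const double rel_stddev = std::sqrt(var / (n - 1)) / avg;
    const double scale = rel_stddev / rel_stddev_limit;
    if (scale <= 1.0) {
        return 0;  // variability already below the limit
    }
    return static_cast<int>(std::ceil((scale * scale - 1.0) * n));
}
```

Step 4 would then be an outer loop around this function that stops once the returned count hits a global maximum or the stddev stops shrinking.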

@tcojean
Member

tcojean commented Jun 11, 2021

I definitely agree with most of what is said here. I remember that at the very beginning, when we were considering using Google Benchmark, the obscure algorithm it used (exactly this scheme of stopping after reaching 0.5 s, without extra control or detailed information, i.e. a vector of timings) seemed very weird. It creates an apples-to-oranges comparison: you compare qualitatively very different results depending on the problem considered, the problem size, and other such factors. In particular, a lot of our solvers and other building blocks run well past that time, so all of this is then not very useful.

IMHO, either you use the same number of runs for matrices of comparable time scale (like what we do now), or you use an algorithm like what Tobias outlined and apply it equally all the time to reach the same timing accuracy everywhere, at the risk of some benchmarks taking a ridiculous amount of time.

@tcojean
Member

tcojean commented Jun 11, 2021

On the cache-effects issue for small problems, another, probably more accurate, approach is to use proper cache warm-up and cache flushing strategies (depending on the context) to stabilize the timings; see this excellent paper on the topic: https://homes.sice.indiana.edu/rcwhaley/papers/timing_SPE08.pdf

Of course, there are still important potential performance issues which can come into play (particularly for large data sets), like process placement, turboboost and other speed scaling effects, ...
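For illustration, a cache-flushing helper in the spirit of the linked paper could look like this. This is a hypothetical sketch: the default buffer size, the 64-byte line assumption, and the function name are my assumptions, not Ginkgo code:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cache flush: stream through a buffer (ideally larger than
// the last-level cache) so that the benchmark's working set is evicted
// before the timed run. Returns the number of cache lines touched,
// assuming 64-byte lines, so the effect is observable.
inline std::size_t flush_cache(std::size_t bytes = 256ull << 20)
{
    std::vector<char> buffer(bytes, 1);
    volatile char sink = 0;
    std::size_t touched = 0;
    for (std::size_t i = 0; i < buffer.size(); i += 64) {
        sink += buffer[i];  // read one byte per line to force the access
        ++touched;
    }
    return touched;
}
```

Calling this between repetitions would give worst-case (cold-cache) timings, the opposite of the best-case numbers the current repeat-in-a-loop approach measures.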

@MarcelKoch
Member Author

I guess I don't fully understand the purpose of these benchmarks. I interpreted them as a quick check to see whether some changes significantly impact the performance either way. (By this I mean that the purpose of the benchmarks is not to detect performance changes in the 1%-5% range.)

For such a broad comparison, I think this approach is reasonable, both the runtime-based and the statistics-based variant.
On the other hand, a more detailed analysis would require much more effort, as @tcojean already mentioned.

@MarcelKoch
Member Author

Concerning @upsj's approach: I'm a bit unsure about its specifics. First, what would be a good threshold for the relative stddev? I would guess 1, but I have nothing to support that guess. This is also connected to the underlying distribution, where I'm also not sure which one to assume. The normal distribution would be the easiest choice, but again I have nothing to support that assumption.

@upsj
Member

upsj commented Jun 12, 2021

I realized I was actually mixing up two things: the stddev of the runtime distribution (which is fixed) and the stddev of the mean estimator (which decays like 1/sqrt(n) independent of the distribution). You can get the stddev of the mean estimator by bootstrapping, but that may be overkill here. I need to think about this some more.

As an example, I ran a small BLAS benchmark 1000 times and collected runtimes with 4 and 40 threads (runtime histograms: runtime4, runtime40).
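Bootstrapping the stddev of the mean estimator, as mentioned above, could be sketched like this. It is illustrative only and not part of this PR; names and defaults are assumptions:

```cpp
#include <cmath>
#include <cstddef>
#include <numeric>
#include <random>
#include <vector>

// Bootstrap estimate of the stddev of the mean: resample the measured
// runtimes with replacement, take each resample's mean, and return the
// stddev of those means.
inline double bootstrap_mean_stddev(const std::vector<double>& samples,
                                    int num_resamples = 1000,
                                    unsigned seed = 42)
{
    if (samples.size() < 2) {
        return 0.0;
    }
    std::mt19937 gen(seed);
    std::uniform_int_distribution<std::size_t> pick(0, samples.size() - 1);
    std::vector<double> means;
    means.reserve(num_resamples);
    for (int r = 0; r < num_resamples; ++r) {
        double sum = 0.0;
        for (std::size_t i = 0; i < samples.size(); ++i) {
            sum += samples[pick(gen)];  // draw with replacement
        }
        means.push_back(sum / samples.size());
    }
    const double avg =
        std::accumulate(means.begin(), means.end(), 0.0) / num_resamples;
    double var = 0.0;
    for (auto m : means) {
        var += (m - avg) * (m - avg);
    }
    return std::sqrt(var / (num_resamples - 1));
}
```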

@upsj
Member

upsj commented Jun 12, 2021

I think we should be able to catch all our current cases (quick benchmarks, slow benchmarks) by providing a maximum repetition count used only for the repetition estimate. If the benchmark is slow, then we will choose the number of iterations based on the runtime. If it is fast, we will choose it so that we have a sufficient number of runs (let's say 100), but still stay well below 0.5 s.
That is especially interesting if we are testing hundreds to thousands of small problems, where 0.5 s per problem becomes significant. Also, I would make both 0.5 s and 100 configurable via command-line parameters.
At the moment, we don't seem to have benchmarks that require more sophisticated statistical treatment (though our current timer setup allows for it), but that may change when moving to MPI. That is probably a topic for another PR.
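Combining the runtime-based estimate with such a cap could be sketched as follows. All names and defaults here are illustrative, not this PR's actual flags:

```cpp
#include <algorithm>
#include <cmath>

// Pick repetitions from the runtime of one trial run, but cap the
// estimate at a maximum count so that many small problems don't each
// insist on the full minimum runtime.
inline int choose_repetitions(double single_run_seconds,
                              double min_runtime_s = 0.5,
                              int max_estimate_reps = 100)
{
    if (single_run_seconds <= 0.0) {
        return max_estimate_reps;  // immeasurably fast: use the cap
    }
    const auto runtime_based = static_cast<int>(
        std::ceil(min_runtime_s / single_run_seconds));
    return std::max(1, std::min(runtime_based, max_estimate_reps));
}
```

A fast kernel hits the 100-repetition cap well below 0.5 s, while a slow one gets the runtime-based (possibly single-repetition) count.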

@upsj
Member

upsj commented Jun 12, 2021

label!

@upsj upsj added the 1:ST:ready-for-review This PR is ready for review label Jun 12, 2021
@ginkgo-bot ginkgo-bot added reg:benchmarking This is related to benchmarking. type:preconditioner This is related to the preconditioners type:solver This is related to the solvers labels Jun 12, 2021
@MarcelKoch
Member Author

I've incorporated a couple of suggestions from this thread.

Now there is more control over the adaptive benchmarking with the additional flags:

  • min_repetitions
  • max_repetitions
  • min_runtime

Also, larger (or rather slower) benchmarks are now handled correctly, i.e. only the minimal requested number of repetitions is executed. Therefore, I've also enabled the adaptive behavior for the solver benchmark.

For the preconditioner, this approach is not possible, since the total runtime is not updated within the repetitions loop. Therefore, I've added a warning if -repetitions auto is used in that case, and the default number of repetitions is used as a fallback.

On the implementation side, the usage is quite similar to Google Benchmark, i.e. the following code is valid:

IterationControl ic(timer);
for (auto status : ic.run()) {
    timer->tic();
    // run benchmark
    timer->toc();
}

Additionally, status may be used to check the number of the current iteration, and whether it is the last iteration.

To clarify, the adaptive benchmarking is only optional and not enabled by default.
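A minimal sketch of how such an iteration-control range could be implemented follows. It is purely illustrative: the real IterationControl in benchmark/utils/general.hpp differs in interface and stopping criteria; here the loop runs until both a minimum repetition count and a minimum total runtime are reached, but never beyond a maximum count.

```cpp
#include <chrono>

// Illustrative IterationControl-style range for use in a range-based for
// loop; the stopping rule combines min/max repetitions and a minimum
// total runtime.
class IterationControlSketch {
    using clock = std::chrono::steady_clock;

public:
    IterationControlSketch(int min_reps, int max_reps, double min_runtime_s)
        : min_reps_(min_reps), max_reps_(max_reps),
          min_runtime_s_(min_runtime_s)
    {}

    struct iterator {
        IterationControlSketch* ctrl;
        int it;
        int operator*() const { return it; }
        iterator& operator++()
        {
            ++it;
            return *this;
        }
        // range-based for only compares begin != end, so only `it` of the
        // left-hand iterator is consulted
        bool operator!=(const iterator&) const { return !ctrl->done(it); }
    };

    iterator begin()
    {
        start_ = clock::now();
        return {this, 0};
    }
    iterator end() { return {this, 0}; }

private:
    bool done(int it) const
    {
        const double elapsed =
            std::chrono::duration<double>(clock::now() - start_).count();
        return it >= max_reps_ ||
               (it >= min_reps_ && elapsed >= min_runtime_s_);
    }

    int min_reps_;
    int max_reps_;
    double min_runtime_s_;
    clock::time_point start_;
};
```

With min_runtime_s = 0 this degenerates to exactly min_reps iterations, while a very large min_runtime_s makes it run up to max_reps.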

@codecov

codecov bot commented Jun 18, 2021

Codecov Report

Merging #791 (6b726f8) into develop (3112263) will decrease coverage by 0.00%.
The diff coverage is n/a.

❗ Current head 6b726f8 differs from pull request most recent head 33ff686. Consider uploading reports for the commit 33ff686 to get more accurate results
Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #791      +/-   ##
===========================================
- Coverage    94.37%   94.36%   -0.01%     
===========================================
  Files          400      400              
  Lines        32096    32097       +1     
===========================================
  Hits         30289    30289              
- Misses        1807     1808       +1     
Impacted Files Coverage Δ
omp/reorder/rcm_kernels.cpp 97.53% <0.00%> (-0.61%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3112263...33ff686. Read the comment docs.

Member

@upsj upsj left a comment

LGTM, great job! I really like the Google Benchmark-like setup.

benchmark/utils/general.hpp (outdated, resolved)
benchmark/preconditioner/preconditioner.cpp (outdated, resolved)
benchmark/utils/general.hpp (resolved)
benchmark/utils/general.hpp (outdated, resolved)
benchmark/utils/timer.hpp (outdated, resolved)
@@ -260,6 +264,7 @@ class CudaTimer : public Timer {
protected:
void tic_impl() override
{
exec_->synchronize();
Member

this and HipTimer should not have the synchronize

Member

For the tic it doesn't matter right? This is not timed because it's before the eventRecord and it could even help in ensuring nothing is running on the GPU when we start running things, like if there was a copy previously (for example, x_clone).

Member

The event should also be placed after the memcpy if it involves the GPU.
For me, ensuring nothing is running on the GPU is the responsibility of the code before calling the timer, but it does not hurt the timing step.

auto x_clone = clone(x);
auto x_clone = clone(x);
for (auto status : ic.run(false)) {
x_clone = clone(x);

exec->synchronize();
generate_timer->tic();
Member

Maybe we could also add tic/toc functions to IterationControl that forward to the Timer's tic/toc,
such that we call tic/toc on the IC and get output from the same place as the others?

Member Author

First thing: you can already use ic.compute_average_timings to get the output, since ic has internally a copy of the shared ptr for the timer (but only for the apply timer). However, that seems a bit awkward.

From my viewpoint, IC should not handle any timings in this instance, as manage_timings is set to false. Adding timings functions to IC would weaken the distinction to the managed case. If the non-managed IC run is requested, the user should take care of the timings.

Member

It makes sense from that point of view.
My point is that the iteration control needs the results from the corresponding timer, so it could be used incorrectly.
But I also realize that may make the tic/toc from TimerManager different?
Maybe we need to add a comment that the corresponding timer needs to be used, such that the iteration control can really check it?

Member Author

Perhaps I could add a get_timer to IC. That should help to use the correct timer. Otherwise, I can't think of a graceful way of adding tic/toc directly to IC.

Member

Yeah, it is good.

Comment on lines +129 to +130
auto x_clone = clone(x);
for (auto _ : ic_tuning.run()) {
Member

this changes the behavior. we refreshed the memory every time before

Member

does that matter? SpMVs should only write to x after all.

Member

No, for the result.
It depends on what memory state we want for the benchmark:
should the memory always be a new location (from the software point of view only), or should the allocation just exist before the operations?

Member

at least from the caching standpoint that would probably not make a difference, as the clone calls memcpy, which might, depending on the implementation, already move the data into cache.

Member Author

I also think that this change should be fine, especially if you consider this as a best-case benchmark, i.e. the data is already in the appropriate caches. Considering the worst-case, i.e. at the beginning of each SpMV the data is not cached, is more difficult in general and would probably require more adjustments, especially wrt Tobias' comment.

Comment on lines +143 to +144
auto x_clone = clone(x);
for (auto _ : ic.run()) {
Member

also here

"runtime is larger than 'min_runtime'");

DEFINE_double(min_runtime, 0.05,
"If 'repetitions = auto' is used, the minimal runtime of"
Member

Suggested change
"If 'repetitions = auto' is used, the minimal runtime of"
"If 'repetitions = auto' is used, the minimal runtime (seconds) of"

* ```
* auto timer = get_timer(...);
* IterationControl ic(timer);
* for(auto status: ic.[warmup_run|run](manage_timings [default is true])){
Member

Is manage_timings also used for the warmup run?

Member Author

I have not added the parameter there, as the warmup run always uses a fixed number of repetitions. I will clarify the documentation.

* Uses the commandline flags to setup the stopping criteria for the
* warmup and timed run.
*
* @param timer the same timer that is to be used for the timings
Member

Suggested change
* @param timer the same timer that is to be used for the timings
* @param timer the same timer that is to be used for the timings

run_control warmup_run()
{
status_warmup_.cur_it = 0;
status_warmup_.timer->clear();
Member

Suggested change
status_warmup_.timer->clear();
status_warmup_.timer->clear();
status_warmup_.timer->manage_timings = false;

Member

I realize there's no change from another function, so it is fine.

Comment on lines 580 to 582
// emulate shared_ptr behavior
const TimerManager *operator->() const { return this; }
TimerManager *operator->() { return this; }
Member

what is it used for?

Member Author

That was used for being lazy 😄 I will remove that, and adjust the rest accordingly.


void tic()
{
if (manage_timings) timer->tic();
Member

Suggested change
if (manage_timings) timer->tic();
if (manage_timings) {
timer->tic();
}

also apply to next one from the gko ref

Member Author

Is that part of the .clang-format specification? If not, perhaps it should be added there.

Member

It is not in the current .clang-format; we only mention it in the contribution guidelines.
Does clang-format support this after version 6?

Member Author

After a bit of digging, it seems that clang-format (up to 13) still does not support this, but there is a PR for it here: https://reviews.llvm.org/D95168
So it seems some future version will support it.

Member Author

There is also a workaround using clang-tidy, but that is overkill (https://stackoverflow.com/a/28437960).

Member

@tcojean tcojean left a comment

LGTM!

cur_info->managed_timer.toc();
stopped = true;
next_timing =
static_cast<IndexType>(std::ceil(next_timing * 1.5));
Member

Maybe make the 1.5 controllable for extreme cases?

Member

@yhmtsai yhmtsai left a comment

LGTM, only two comments about missing documentation.

* - 'warmup' warmup iterations, applies in fixed and adaptive case
* - 'min_repetitions' minimal number of repetitions (adaptive case)
* - 'max_repetitions' maximal number of repetitions (adaptive case)
* - 'min_runtime' minimal total runtime (adaptive case)
Member

This misses repetition_growth_factor.

* - `warmup_run()`: controls run defined by `warmup` flag
* - `run(bool)`: controls run defined by all other flags
* - `get_timer()`: access to underlying timer
* Both methods return an object that is to be used in a range-based for loop:
Member

Not updated.

* - 'min_repetitions' minimal number of repetitions (adaptive case)
* - 'max_repetitions' maximal number of repetitions (adaptive case)
* - 'min_runtime' minimal total runtime (adaptive case)
* - 'repetitions_growth_factor' controls the increase between two successive
Member

nit

Suggested change
* - 'repetitions_growth_factor' controls the increase between two successive
* - 'repetition_growth_factor' controls the increase between two successive

or change the Gflags name

@yhmtsai yhmtsai added 1:ST:ready-to-merge This PR is ready to merge. and removed 1:ST:ready-for-review This PR is ready for review labels Jun 25, 2021
@yhmtsai
Member

yhmtsai commented Jun 25, 2021

Needs a rebase before merging. You should be able to see two pipelines running on GitLab after the next push.

MarcelKoch and others added 11 commits June 28, 2021 09:42
This reworks the previous adaptive benchmarking.
Now, the number of iterations is determined on the fly instead of beforehand.
Also, new command-line flags have been added to allow for greater control over the adaptive benchmarks.
The usage is similar to Google Benchmark.

Currently, it is not possible to use this approach for the preconditioner benchmark, as it does not update the runtime in each iteration.
This changes the preconditioner benchmark to time each preconditioner apply/generate individually, unifying the timing approach across all benchmarks.
- adds more documentation
- minor formatting
- made `status` private s.t. it is not part of the
  public interface of `IterationControl`

Co-authored-by: Yuhsiang Tsai <[email protected]>
Co-authored-by: Tobias Ribizel <[email protected]>
Now the `run_control` object also controls taking the timings. The user does not need to issue the timings by hand anymore. This allows using increasingly larger intervals between two timings, until the benchmark run is finished.

Drawback: everything within the `ic.run()` loop gets timed, parts that should be exempt need to be moved outside of the loop.
Internally this uses a thin wrapper class for the `timer` object, which just skips the `tic/toc` calls, if the `run_control` object does not manage the timings. In that case, the timings have to be issued outside as before.
- clarify documentation
- add accessor to underlying timer
- formatting
- adds flag to choose repetitions growth factor

Co-authored-by: Yuhsiang Tsai <[email protected]>
Co-authored-by: Tobias Ribizel <[email protected]>
Co-authored-by: Terry Cojean <[email protected]>
@sonarcloud

sonarcloud bot commented Jun 28, 2021

Kudos, SonarCloud Quality Gate passed!

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 17 Code Smells

0.0% Coverage
0.0% Duplication

@MarcelKoch MarcelKoch merged commit a37c101 into develop Jun 28, 2021
@MarcelKoch MarcelKoch deleted the benchmarks-auto-repetitions branch June 28, 2021 13:31
tcojean added a commit that referenced this pull request Aug 20, 2021
Ginkgo release 1.4.0

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This
release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem
which enables Intel-GPU and CPU execution. The only Ginkgo features which have
not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels featuring on-the-fly
memory compression, among other features. The accessor can be used as
header-only, see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) as a usage example.
2. All LinOps now transparently support mixed-precision execution. By default,
this is done through a temporary copy which may have a performance impact but
already allows mixed-precision research.

Native mixed-precision ELL kernels are implemented which do not see this cost.
The accessor is also leveraged in a new CB-GMRES solver which allows for
performance improvements by compressing the Krylov basis vectors. Many other
features have been added to Ginkgo, such as reordering support, a new IDR
solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU
for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities
  [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels,
  also make use of a shared kernel implementation for all executors (except
  Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things.
  [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some
  improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++.
  [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES
  (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor).
  [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extends ISAI from triangular to general and spd matrices. [#690](#690)

Other additions:
+ Add the possibility to apply real matrices to complex vectors.
  [#655](#655), [#658](#658)
+ Add functions to compute the absolute of a matrix format. [#636](#636)
+ Add symmetric permutation and improve existing permutations.
  [#684](#684), [#657](#657), [#663](#663)
+ Add a MachineTopology class with HWLOC support [#554](#554), [#697](#697)
+ Add an implicit residual norm criterion. [#702](#702), [#818](#818), [#850](#850)
+ Row-major accessor is generalized to more than 2 dimensions and a new
  "block column-major" accessor has been added. [#707](#707)
+ Add a heat equation example. [#698](#698), [#706](#706)
+ Add ccache support in CMake and CI. [#725](#725), [#739](#739)
+ Allow tuning and benchmarking variables non-intrusively. [#692](#692)
+ Add triangular solver benchmark [#664](#664)
+ Add benchmarks for BLAS operations [#772](#772), [#829](#829)
+ Add support for different precisions and consistent index types in benchmarks.
  [#675](#675), [#828](#828)
+ Add a Github bot system to facilitate development and PR management.
  [#667](#667), [#674](#674), [#689](#689), [#853](#853)
+ Add Intel (DPC++) CI support and enable CI on HPC systems. [#736](#736), [#751](#751), [#781](#781)
+ Add ssh debugging for Github Actions CI. [#749](#749)
+ Add pipeline segmentation for better CI speed. [#737](#737)


Changes:
+ Add a Scalar Jacobi specialization and kernels. [#808](#808), [#834](#834), [#854](#854)
+ Add implicit residual log for solvers and benchmarks. [#714](#714)
+ Change handling of the conjugate in the dense dot product. [#755](#755)
+ Improved Dense stride handling. [#774](#774)
+ Multiple improvements to the OpenMP kernels performance, including COO,
an exclusive prefix sum, and more. [#703](#703), [#765](#765), [#740](#740)
+ Allow specialization of submatrix and other dense creation functions in solvers. [#718](#718)
+ Improved Identity constructor and treatment of rectangular matrices. [#646](#646)
+ Allow CUDA/HIP executors to select allocation mode. [#758](#758)
+ Check if executors share the same memory. [#670](#670)
+ Improve test install and smoke testing support. [#721](#721)
+ Update the JOSS paper citation and add publications in the documentation.
  [#629](#629), [#724](#724)
+ Improve the version output. [#806](#806)
+ Add some utilities for dim and span. [#821](#821)
+ Improved solver and preconditioner benchmarks. [#660](#660)
+ Improve benchmark timing and output. [#669](#669), [#791](#791), [#801](#801), [#812](#812)


Fixes:
+ Sorting fix for the Jacobi preconditioner. [#659](#659)
+ Also log the first residual norm in CGS [#735](#735)
+ Fix BiCG and HIP CSR to work with complex matrices. [#651](#651)
+ Fix Coo SpMV on strided vectors. [#807](#807)
+ Fix segfault of extract_diagonal, add short-and-fat test. [#769](#769)
+ Fix device_reset issue by moving counter/mutex to device. [#810](#810)
+ Fix `EnableLogging` superclass. [#841](#841)
+ Support ROCm 4.1.x and breaking HIP_PLATFORM changes. [#726](#726)
+ Decreased test size for a few device tests. [#742](#742)
+ Fix multiple issues with our CMake HIP and RPATH setup.
  [#712](#712), [#745](#745), [#709](#709)
+ Cleanup our CMake installation step. [#713](#713)
+ Various simplification and fixes to the Windows CMake setup. [#720](#720), [#785](#785)
+ Simplify third-party integration. [#786](#786)
+ Improve Ginkgo device arch flags management. [#696](#696)
+ Other fixes and improvements to the CMake setup.
  [#685](#685), [#792](#792), [#705](#705), [#836](#836)
+ Clarification of dense norm documentation [#784](#784)
+ Various development tools fixes and improvements [#738](#738), [#830](#830), [#840](#840)
+ Make multiple operators/constructors explicit. [#650](#650), [#761](#761)
+ Fix some issues, memory leaks and warnings found by MSVC.
  [#666](#666), [#731](#731)
+ Improved solver memory estimates and consistent iteration counts [#691](#691)
+ Various logger improvements and fixes [#728](#728), [#743](#743), [#754](#754)
+ Fix for ForwardIterator requirements in iterator_factory. [#665](#665)
+ Various benchmark fixes. [#647](#647), [#673](#673), [#722](#722)
+ Various CI fixes and improvements. [#642](#642), [#641](#641), [#795](#795), [#783](#783), [#793](#793), [#852](#852)


Related PR: #857
tcojean added a commit that referenced this pull request Aug 23, 2021
Release 1.4.0 to master

The Ginkgo team is proud to announce the new Ginkgo minor release 1.4.0. This
release brings most of the Ginkgo functionality to the Intel DPC++ ecosystem
which enables Intel-GPU and CPU execution. The only Ginkgo features which have
not been ported yet are some preconditioners.

Ginkgo's mixed-precision support is greatly enhanced thanks to:
1. The new Accessor concept, which allows writing kernels with on-the-fly
memory compression, among other features. The accessor can be used
header-only; see the [accessor BLAS benchmarks repository](https://github.com/ginkgo-project/accessor-BLAS/tree/develop) for a usage example.
2. All LinOps now transparently support mixed-precision execution. By default,
this is done through a temporary copy which may have a performance impact but
already allows mixed-precision research.

Native mixed-precision ELL kernels are implemented which avoid this cost.
The accessor is also leveraged in a new CB-GMRES solver which allows for
performance improvements by compressing the Krylov basis vectors. Many other
features have been added to Ginkgo, such as reordering support, a new IDR
solver, Incomplete Cholesky preconditioner, matrix assembly support (only CPU
for now), machine topology information, and more!

Supported systems and requirements:
+ For all platforms, cmake 3.13+
+ C++14 compliant compiler
+ Linux and MacOS
  + gcc: 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + clang: 3.9+
  + Intel compiler: 2018+
  + Apple LLVM: 8.0+
  + CUDA module: CUDA 9.0+
  + HIP module: ROCm 3.5+
  + DPC++ module: Intel OneAPI 2021.3. Set the CXX compiler to `dpcpp`.
+ Windows
  + MinGW and Cygwin: gcc 5.3+, 6.3+, 7.3+, all versions after 8.1+
  + Microsoft Visual Studio: VS 2019
  + CUDA module: CUDA 9.0+, Microsoft Visual Studio
  + OpenMP module: MinGW or Cygwin.


Algorithm and important feature additions:
+ Add a new DPC++ Executor for SYCL execution and other base utilities
  [#648](#648), [#661](#661), [#757](#757), [#832](#832)
+ Port matrix formats, solvers and related kernels to DPC++. For some kernels,
  also make use of a shared kernel implementation for all executors (except
  Reference). [#710](#710), [#799](#799), [#779](#779), [#733](#733), [#844](#844), [#843](#843), [#789](#789), [#845](#845), [#849](#849), [#855](#855), [#856](#856)
+ Add accessors which allow multi-precision kernels, among other things.
  [#643](#643), [#708](#708)
+ Add support for mixed precision operations through apply in all LinOps. [#677](#677)
+ Add incomplete Cholesky factorizations and preconditioners as well as some
  improvements to ILU. [#672](#672), [#837](#837), [#846](#846)
+ Add an AMGX implementation and kernels on all devices but DPC++.
  [#528](#528), [#695](#695), [#860](#860)
+ Add a new mixed-precision capability solver, Compressed Basis GMRES
  (CB-GMRES). [#693](#693), [#763](#763)
+ Add the IDR(s) solver. [#620](#620)
+ Add a new fixed-size block CSR matrix format (for the Reference executor).
  [#671](#671), [#730](#730)
+ Add native mixed-precision support to the ELL format. [#717](#717), [#780](#780)
+ Add Reverse Cuthill-McKee reordering [#500](#500), [#649](#649)
+ Add matrix assembly support on CPUs. [#644](#644)
+ Extend ISAI from triangular to general and SPD matrices. [#690](#690)

Related PR: #866
Labels: 1:ST:ready-to-merge, reg:benchmarking, type:preconditioner, type:solver