Add a `donotdelete` builtin #44036

Keno · 2022-02-04T01:52:00Z

In #43852 we noticed that the compiler is getting good enough to
completely DCE a number of our benchmarks. We need to add some sort
of mechanism to prevent the compiler from doing so. This adds just
such an intrinsic. The intrinsic itself doesn't do anything, but
it is considered effectful by our optimizer, preventing it from
being DCE'd. At the LLVM level, it turns into a volatile store to
an alloca (or an llvm.sideeffect if the values passed to the
dcebarrier do not have any actual LLVM-level representation).

The docs for the new intrinsic are as follows:

    dcebarrier(args...)

This function prevents dead-code elimination (DCE) of itself and any arguments
passed to it, but is otherwise the lightest barrier possible. In particular,
it is not a GC safepoint, does not model an observable heap effect, does not expand
to any code itself and may be re-ordered with respect to other side effects
(though the total number of executions may not change).

A useful model for this function is that it hashes all memory `reachable` from
args and escapes this information through some observable side-channel that does
not otherwise impact program behavior. Of course that's just a model. The
function does nothing and returns `nothing`.

This is intended for use in benchmarks that want to guarantee that `args` are
actually computed. (Otherwise DCE may see that the result of the benchmark is
unused and delete the entire benchmark code).

**Note**: `dcebarrier` does not affect constant folding. For example, in
          `dcebarrier(1+1)`, no add instruction needs to be executed at runtime and
          the code is semantically equivalent to `dcebarrier(2)`.

# Examples

function loop()
    for i = 1:1000
        # The compiler must guarantee that there are 1000 program points (in the correct
        # order) at which the value of `i` is in a register, but has otherwise
        # total control over the program.
        dcebarrier(i)
    end
end

I believe the volatile store at the LLVM level is actually somewhat
stronger than what we want here. Ideally the dcebarrier would not
end up generating any machine code at all and would also be compatible
with optimizations like SROA and vectorization. However, I think this
is fine for now.
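
For readers who think in C++ terms, the "volatile store to an alloca" lowering is roughly analogous to writing the value into a local `volatile` variable. The following is only a sketch of that analogy, not the Julia codegen; `keep_alive_volatile` and `run_loop` are hypothetical names introduced here for illustration.

```cpp
#include <cstdint>

// Hypothetical helper (not part of Julia or LLVM): keeps `value` alive by
// storing it into a volatile stack slot. Volatile accesses count as
// observable behavior, so the compiler has to compute `value` first.
inline void keep_alive_volatile(std::int64_t value) {
    volatile std::int64_t sink = value;  // volatile store to a local slot
}

// C++ analogue of the `loop()` example above: the volatile store is
// performed once per iteration, 1000 times in total.
inline int run_loop() {
    int executed = 0;
    for (std::int64_t i = 1; i <= 1000; ++i) {
        keep_alive_volatile(i);
        ++executed;
    }
    return executed;
}
```

Calling `run_loop()` returns the iteration count. As noted above, this kind of store does generate machine code, which is the cost the description would ideally avoid.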

tkf · 2022-02-04T02:11:38Z

This seems to be related to @vchuravy's JuliaCI/BenchmarkTools.jl#92, which included `clobber()` and `escape()`. IIUC, a similar ASM-level hack exists in google/benchmark, where it is called `DoNotOptimize(...)` and `ClobberMemory()`.

Can we have a more intuitive name like google/benchmark's DoNotOptimize?

> I believe the volatile store at the LLVM level is actually somewhat
> stronger than what we want here.

Can we use `call void asm sideeffect "", "X,~{memory}"($name %0)`?

https://github.com/JuliaCI/BenchmarkTools.jl/pull/92/files#diff-7a1b40723def106eb9bf5c19254f26077a5b8e09741f69198e39365c306c99bcR57
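
For context, the empty-inline-asm technique mentioned here can be sketched in C++ as follows. This assumes GCC or Clang, since extended `asm` is a compiler extension rather than standard C++. The names `do_not_optimize`, `clobber_memory`, and `sum_to` are illustrative only and are not the google/benchmark API itself.

```cpp
// Sketch of the empty-inline-asm barrier, assuming GCC or Clang.
// The asm body is empty, so no instruction is emitted for it, but the
// compiler must treat `value` as used and memory as possibly modified.
template <class T>
inline void do_not_optimize(T const& value) {
    asm volatile("" : : "r,m"(value) : "memory");
}

// Forces pending writes to be treated as observable at this point.
inline void clobber_memory() {
    asm volatile("" : : : "memory");
}

// Example workload: without the barrier, the compiler could replace the
// whole loop with a closed-form result or drop it if the result is unused.
inline long sum_to(long n) {
    long acc = 0;
    for (long i = 1; i <= n; ++i) {
        acc += i;
        do_not_optimize(acc);  // `acc` must be materialized every iteration
    }
    return acc;
}
```

The `"memory"` clobber makes this a fairly heavy barrier, since the compiler has to assume arbitrary memory may have changed at each call.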

Keno · 2022-02-04T02:14:39Z

> Can we use `call void asm sideeffect "", "X,~{memory}"($name %0)`?

Yes, but that has the same optimizability challenges, and perhaps even more. I thought the volatile store might at least have some chance of not interfering with loop vectorization.

Keno · 2022-02-04T02:15:36Z

> I believe the volatile store at the LLVM level is actually somewhat
> stronger than what we want here. Ideally the dcebarrier would not
> end up generating any machine code at all and would also be compatible
> with optimizations like SROA and vectorization.

@preames any thoughts on this?

Keno · 2022-02-04T02:23:51Z

> Can we have a more intuitive name like google/benchmark's DoNotOptimize?

It's intended to be consistent with Base.inferencebarrier.

JeffBezanson · 2022-02-04T18:00:57Z

base/docs/basedocs.jl

+actually computed. (Otherwise DCE may see that the result of the benchmark is
+unused and delete the entire benchmark code).
+
+**Note**: `dcebarrier` does not affect constant foloding. For example, in

Suggested change

-**Note**: `dcebarrier` does not affect constant foloding. For example, in
+**Note**: `dcebarrier` does not affect constant folding. For example, in

preames · 2022-02-04T21:30:18Z

There's some prior art on this type of thing in Java with JMH's Blackhole.consume. Naming wise, I would find something along those lines better than dcebarrier. As can already be seen in the discussion above, use of the word "barrier" gives the impression that the call has memory effects, whereas that seems not to be the intent per the draft wording.

Implementation wise, I would start by lowering to an external function call marked "inaccessiblememonly nounwind willreturn" at the LLVM level. This would have some cost - the actual call sequence - but should have minimal impact on optimization.

I would be leery of the volatile-store-to-alloca lowering. Volatiles are generally not touched, but there is precedent for removing them if the location being touched is well understood. An alloca seems like an entirely reasonable location for the compiler to assume is not memory-mapped IO.

Once implemented with the external call, we could choose to add an LLVM intrinsic with the same meaning. I think this is a broadly reusable concept, and probably wouldn't be too hard to get upstream.

Keno · 2022-02-04T22:30:23Z

blackhole(args...) is a pretty good name

Keno · 2022-02-07T23:52:16Z

Upon discussion with @JeffBezanson and @vtjnash, they preferred a name that did not require a graduate course on the black hole information paradox in order to build the correct intuition about whether the optimizer is allowed to delete the value. We ultimately settled on `donotdelete(args...)`. `donotoptimize(args...)` was considered bad, because all kinds of optimization are generally allowed, as long as the value is eventually computed.

Keno · 2022-02-08T01:31:40Z

Alright, I guess, we should merge the BenchmarkTools version first, then do a nanosoldier run here to see what the effect is (we expect regressions because it's a change in what's being benchmarked), just so we have a baseline.

KristofferC · 2022-02-08T06:15:55Z

The new BenchmarkTools version also has to get deployed explicitly on Nanosoldier.

Keno · 2022-02-08T20:16:45Z

> The new BenchmarkTools version also has to get deployed explicitly on Nanosoldier.

I've tagged BenchmarkTools 1.3 and according to @vtjnash, Nanosoldier will pick up the latest registered version, so we'll wait for that to go through. I'll rebase this in the meantime, since it's accumulated conflicts.

Keno · 2022-02-08T21:04:21Z

@nanosoldier runbenchmarks(ALL, vs=":master")

nanosoldier · 2022-02-09T04:17:59Z

Something went wrong when running your job:

NanosoldierError: error when preparing/pushing to report repo: failed process: Process(setenv(`git push`; dir="/nanosoldier/workdir/NanosoldierReports"), ProcessExited(1)) [1]

Unfortunately, the logs could not be uploaded.

vtjnash · 2022-02-09T04:50:56Z

Not too bad. None get faster (of course), but only a handful got badly affected: https://github.com/JuliaCI/NanosoldierReports/blob/master/benchmark/by_hash/95a9e7f_vs_60f414e/report.md

Keno · 2022-02-09T06:34:54Z

Yep, pretty much as expected. The benchmarks that got affected are the scalar ones that are essentially trivial, so it's likely that LLVM had been deleting them. Looks like this is working. Excellent.

DilumAluthge · 2022-02-09T12:55:17Z

Is it possible that this PR broke Windows CI?

Keno · 2022-02-09T13:06:59Z

So it did. Looks like the new test failed. Will fix.

vtjnash · 2022-02-09T17:33:52Z

src/codegen.cpp

+ FnAttrs.addAttribute(C, Attribute::InaccessibleMemOnly);
+ FnAttrs.addAttribute(C, Attribute::WillReturn);
+ FnAttrs.addAttribute(C, Attribute::NoUnwind);


This is not an AttrBuilder. These calls do not have any effect and will be deleted.

/Users/jameson/julia1/src/codegen.cpp:479:5: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
    FnAttrs.addAttribute(C, Attribute::InaccessibleMemOnly);
    ^~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
/Users/jameson/julia1/src/codegen.cpp:480:5: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
    FnAttrs.addAttribute(C, Attribute::WillReturn);
    ^~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~~~
/Users/jameson/julia1/src/codegen.cpp:481:5: warning: ignoring return value of function declared with 'warn_unused_result' attribute [-Wunused-result]
    FnAttrs.addAttribute(C, Attribute::NoUnwind);
    ^~~~~~~~~~~~~~~~~~~~ ~~~~~~~~~~~~~~~~~~~~~~

fixed by #44097
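
The pitfall flagged in this review is the usual one with value-type containers whose "add" method returns a new object instead of mutating in place. The sketch below models it with a hypothetical `ImmutableAttrs` class; it is deliberately not LLVM's real attribute API, and the attribute names are used only as plain strings.

```cpp
#include <set>
#include <string>

// Hypothetical stand-in for an immutable attribute container. This is NOT
// LLVM's class; it only models "add returns a new value".
class ImmutableAttrs {
public:
    // Returns a copy with `name` added; `*this` is left unchanged.
    [[nodiscard]] ImmutableAttrs add(const std::string& name) const {
        ImmutableAttrs copy = *this;
        copy.names_.insert(name);
        return copy;
    }

    bool has(const std::string& name) const {
        return names_.count(name) != 0;
    }

private:
    std::set<std::string> names_;
};

// Buggy pattern: the returned object is discarded, so nothing is recorded.
inline ImmutableAttrs build_discarding() {
    ImmutableAttrs attrs;
    (void)attrs.add("nounwind");  // the cast only silences the warning
    return attrs;
}

// Fixed pattern: keep the returned value by reassigning it.
inline ImmutableAttrs build_reassigning() {
    ImmutableAttrs attrs;
    attrs = attrs.add("inaccessiblememonly");
    attrs = attrs.add("willreturn");
    attrs = attrs.add("nounwind");
    return attrs;
}
```

In this model, `build_discarding().has("nounwind")` is false while `build_reassigning().has("nounwind")` is true, which mirrors why the unused-result warnings above matter.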

vtjnash · 2022-02-13T05:10:00Z

backport?

In JuliaLang#43852 we noticed that the compiler is getting good enough to completely DCE a number of our benchmarks. We need to add some sort of mechanism to prevent the compiler from doing so. This adds just such an intrinsic. The intrinsic itself doesn't do anything, but it is considered effectful by our optimizer, preventing it from being DCE'd. At the LLVM level, it turns into a call to an external varargs function.

The docs for the new intrinsic are as follows:

```
    donotdelete(args...)

This function prevents dead-code elimination (DCE) of itself and any arguments
passed to it, but is otherwise the lightest barrier possible. In particular,
it is not a GC safepoint, does not model an observable heap effect, does not expand
to any code itself and may be re-ordered with respect to other side effects
(though the total number of executions may not change).

A useful model for this function is that it hashes all memory `reachable` from
args and escapes this information through some observable side-channel that does
not otherwise impact program behavior. Of course that's just a model. The
function does nothing and returns `nothing`.

This is intended for use in benchmarks that want to guarantee that `args` are
actually computed. (Otherwise DCE may see that the result of the benchmark is
unused and delete the entire benchmark code).

**Note**: `donotdelete` does not affect constant folding. For example, in
          `donotdelete(1+1)`, no add instruction needs to be executed at runtime and
          the code is semantically equivalent to `donotdelete(2)`.

# Examples

function loop()
    for i = 1:1000
        # The compiler must guarantee that there are 1000 program points (in the correct
        # order) at which the value of `i` is in a register, but has otherwise
        # total control over the program.
        donotdelete(i)
    end
end
```

Keno force-pushed the kf/dcebarrier branch from f931fcb to 2810646 Compare February 4, 2022 02:18

Keno added a commit to JuliaCI/BaseBenchmarks.jl that referenced this pull request Feb 4, 2022

Make use of Base.dcebarrier if available

e063e40

Goes with JuliaLang/julia#44036.

Keno mentioned this pull request Feb 4, 2022

Make use of Base.donotdelete if available JuliaCI/BenchmarkTools.jl#275

Merged

JeffBezanson reviewed Feb 4, 2022

View reviewed changes

Keno force-pushed the kf/dcebarrier branch from 2810646 to 2ff8768 Compare February 8, 2022 01:27

Keno added a commit to JuliaCI/BenchmarkTools.jl that referenced this pull request Feb 8, 2022

Make use of Base.donotdelete if available

56df665

Goes with JuliaLang/julia#44036.

oscardssmith changed the title ~~Add a DCE barrier builtin~~ Add a donotdelete builtin Feb 8, 2022

Keno force-pushed the kf/dcebarrier branch from 2ff8768 to 3f2a323 Compare February 8, 2022 21:03

Keno merged commit a947fc7 into master Feb 9, 2022

Keno deleted the kf/dcebarrier branch February 9, 2022 06:36

vtjnash reviewed Feb 9, 2022

View reviewed changes

oscardssmith mentioned this pull request Feb 9, 2022

actually add the atributes #44097

Merged

vchuravy mentioned this pull request Mar 26, 2022

donotdelete declaration differs from function call #44759

Closed
