Flaky constant propagation results in loss of SIMD optimization #121511

Open
KOROBYAKA opened this issue Feb 23, 2024 · 8 comments
Labels
- A-LLVM (Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.)
- C-optimization (Category: An issue highlighting optimization opportunities or PRs implementing such)
- T-compiler (Relevant to the compiler team, which will review and decide on the PR/issue.)

Comments

@KOROBYAKA

We have observed an unexpected loss of vectorization effort by the compiler when compiling the code below. The target was an amd64 (x86_64) machine with the AVX2 instruction set supported.

type T = u32;

#[inline(never)]
fn check_dumb(n: u32, max_val: T) -> Option<(T, T, T, T)> {
    for x in 1..max_val {
        for y in 1..max_val {
            for z in 0..max_val {
                if x.pow(n) + y.pow(n) == z.pow(n) {
                    return Some((x, y, z, n as T));
                }
            }
        }
    }
    None
}

#[inline(never)]
fn check_smart(n: u32, max_val: T) -> Option<(T, T, T, T)> {
    const STRIDE: usize = 24;

    let zarr_base = {
        let mut t = [0; STRIDE];
        for i in 0..STRIDE {
            t[i] = 1 + i as T;
        }
        t
    };

    for x in 1..max_val {
        // let xpow = x.pow(n); // manually "lifting" this operation out of the inner loop "fixes" the optimizer
        for y in 1..max_val {
            // let ypow = y.pow(n); // manually "lifting" this operation out of the inner loop "fixes" the optimizer
            let mut zarr = zarr_base;

            for z_i in 0..max_val / STRIDE as T {
                for i in 0..STRIDE {
                    zarr[i] += STRIDE as T;
                }

                let mut zarr3 = [0; STRIDE];
                for i in 0..STRIDE {
                    zarr3[i] = zarr[i] * zarr[i] * zarr[i];
                }

                let found = zarr3.iter().any(|&i| {
                    x.pow(n) + y.pow(n) == i
                    // xpow + ypow == i // see the comments above for why this might work
                });

                if found {
                    return Some((x, y, z_i * STRIDE as T, n as T));
                }
            }
        }
    }
    None
}

fn main() {
    let n = 3; // this constant value should apply for the entire main() function
    {
        let s = std::time::Instant::now();
        println!("Check_dumb n={} returned {:?}", n, check_dumb(n, 1000));
        println!("Time taken {:?}", s.elapsed());
    }
    {
        // Commenting out the line below results in loss of vectorization in check_smart!
        let n = 3;
        let s = std::time::Instant::now();
        println!("Check with simd n={} returned {:?}", n, check_smart(n, 1000));
        println!("Time taken {:?}", s.elapsed());
    }
}

I expected to see this happen: the function check_smart should leverage vectorization and SIMD instructions, while check_dumb should be unable to do so. I should not have to specify n = 3 twice in main(), as the value of n from the outer scope should be propagated into check_smart. The vectorization in check_smart should happen for any value of n (even when n is not known ahead of time).

Instead, this happened: the compiler's ability to actually perform SIMD vectorization of check_smart seems to depend on whether x.pow(n) can panic. When the value of n is not propagated into the function, the optimizer fails to vectorize, resulting in a massive loss of performance.
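The commented-out lines in check_smart hint at a manual workaround. A minimal sketch of that idea, applied to the simple triple-loop shape for brevity (the function name and bounds here are illustrative, not from the original code): hoisting the loop-invariant pow calls by hand keeps them out of the inner loop body regardless of what the optimizer can prove about n.

```rust
// Sketch of the manual hoisting workaround: compute x.pow(n) once per x and
// y.pow(n) once per (x, y), so the innermost loop contains no pow calls at all.
fn check_hoisted(n: u32, max_val: u32) -> Option<(u32, u32, u32, u32)> {
    for x in 1..max_val {
        let xpow = x.pow(n); // hoisted: computed once per x instead of once per (x, y, z)
        for y in 1..max_val {
            let ypow = y.pow(n); // hoisted: computed once per (x, y)
            for z in 0..max_val {
                if xpow + ypow == z.pow(n) {
                    return Some((x, y, z, n));
                }
            }
        }
    }
    None
}

fn main() {
    // n = 2 finds the Pythagorean triple (3, 4, 5); n = 3 finds nothing.
    println!("{:?}", check_hoisted(2, 10)); // Some((3, 4, 5, 2))
    println!("{:?}", check_hoisted(3, 20)); // None
}
```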

Meta

rustc --version --verbose:

rustc 1.76.0 (07dca489a 2024-02-04)
binary: rustc
commit-hash: 07dca489ac2d933c78d3c5158e3f43beefeb02ce
commit-date: 2024-02-04
host: x86_64-unknown-linux-gnu
release: 1.76.0
LLVM version: 17.0.6

The nightly compiler exhibits identical behavior.

CPU is Intel(R) Core(TM) i5-8350U CPU @ 1.70GHz

@KOROBYAKA KOROBYAKA added the C-bug Category: This is a bug. label Feb 23, 2024
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Feb 23, 2024
@Noratrieb Noratrieb added A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. C-optimization Category: An issue highlighting optimization opportunities or PRs implementing such and removed C-bug Category: This is a bug. needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. labels Feb 23, 2024
@alexpyattaev

alexpyattaev commented Feb 23, 2024

Just tested on a slightly different CPU target, with the line setting n = 3 commented out:

Check_dumb n=3 returned None
Time taken 915.227343ms
Check with simd n=3 returned None
Time taken 4.634040885s 

Wow, 4.6 seconds is slower than JavaScript!
Same run, but with the line present:

Check_dumb n=3 returned None
Time taken 907.446979ms
Check with simd n=3 returned None
Time taken 502.535157ms

CPU used was 13th Gen Intel(R) Core(TM) i5-1335U, optimization was set for native:

[target.x86_64-unknown-linux-gnu]
rustflags = ["-Ctarget-cpu=native"]

Diving a bit into the assembly of check_smart, it seems that the failure to figure out a constant value for n results in the pow function being splattered all over the hot loop, killing any hope of performance. It is very odd that x.pow(n) is not moved out of the loop over z, though; if it were hoisted, it should not matter how pow is implemented.

@erikdesjardins
Contributor

erikdesjardins commented Feb 24, 2024

Strangely, this only seems to happen with target-cpu=native (even on godbolt which is Zen 3), not if you use the baseline or explicitly specify target-cpu=raptorlake (for the i5-1335U) or any other specific CPU I've tried (haswell, core-avx2, znver3, etc.): https://godbolt.org/z/jqrs7bP5W.
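As an aside for anyone reproducing the target-cpu=native discrepancy: rustc can print which target features a given -Ctarget-cpu value actually enables on the current machine, which makes it possible to compare native against an explicit CPU name. This is a sketch using stable rustc options; the exact feature list is machine-dependent.

```shell
# List the target_feature cfgs enabled by -Ctarget-cpu=native on this machine
rustc --print cfg -C target-cpu=native | grep target_feature

# Compare against an explicitly named CPU to spot any differences
rustc --print cfg -C target-cpu=raptorlake | grep target_feature
```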

(I was expecting this to be the same root cause as #112478 (comment), where printing a value prevents LLVM from seeing that it's constant, since a pointer escapes into the printing machinery that could be used to mutate it, but it seems weirder than that.)

@alexpyattaev

That is mighty odd indeed. On my machine the bug persists with the raptorlake setting:

$ ~/fermat (master)> cargo clean && RUSTFLAGS="-Ctarget-cpu=raptorlake" cargo run --release
     Removed 11 files, 844.0KiB total
   Compiling fermat v0.1.0 (/home/headhunter/fermat)
    Finished release [optimized] target(s) in 0.26s
     Running `target/release/fermat`
Check_dumb n=3 returned None
Time taken 932.948488ms
Check with simd n=3 returned None
Time taken 3.440539885s

LLVM complains about "failed to hoist load with loop-invariant address because load is conditionally executed" near the calls to pow(n), but I'm not sure how to extract more info from its optimization remarks...
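For extracting more detail from those remarks: rustc can forward LLVM's optimization remarks directly via the stable -Cremark codegen option (a pass name or "all"), and adding debug info attaches source locations to each remark. A sketch of one way to invoke it on this crate:

```shell
# Forward all LLVM optimization remarks with source locations, then filter
# for the vectorizer's explanations of why it gave up.
RUSTFLAGS="-Cremark=all -Cdebuginfo=1" cargo build --release 2>&1 | grep -i vectorize
```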

I've also tried targeting some other architectures that my machines could run with same result:

$ ~/fermat (master)> cargo clean && RUSTFLAGS="-Ctarget-cpu=cascadelake" cargo run --release
     Removed 11 files, 844.0KiB total
   Compiling fermat v0.1.0 (/home/headhunter/fermat)
    Finished release [optimized] target(s) in 0.25s
     Running `target/release/fermat`
Check_dumb n=3 returned None
Time taken 910.116577ms
Check with simd n=3 returned None
Time taken 4.317862738s
$ ~/fermat (master) [101]> cargo clean && RUSTFLAGS="-Ctarget-cpu=znver2" cargo run --release
     Removed 7 files, 3.3KiB total
   Compiling fermat v0.1.0 (/home/headhunter/fermat)
    Finished release [optimized] target(s) in 0.25s
     Running `target/release/fermat`
Check_dumb n=3 returned None
Time taken 924.990783ms
Check with simd n=3 returned None
Time taken 4.279846815s
$ ~/fermat (master)> cargo clean && RUSTFLAGS="-Ctarget-cpu=znver3" cargo run --release
     Removed 11 files, 844.0KiB total
   Compiling fermat v0.1.0 (/home/headhunter/fermat)
    Finished release [optimized] target(s) in 0.29s
     Running `target/release/fermat`
Check_dumb n=3 returned None
Time taken 954.212518ms
Check with simd n=3 returned None
Time taken 3.432712025s

@alexpyattaev

alexpyattaev commented Feb 24, 2024

Also, I've checked your hypothesis about printing killing the optimization, and it seems to hold up. Modifying the call site of check_smart to print n just before the call results in a failure to optimize.

{
    let n = 3;
    println!("Value of n is {}", n.clone()); // removing clone() here will break optimization, no matter how dumb that is!
    let s = std::time::Instant::now();
    println!("Check with simd n={} returned {:?}", n, check_smart(n, 1000));
    println!("Time taken {:?}", s.elapsed());
}

The deeper question now is: why would it even matter which value n takes? It is entirely irrelevant to the optimizations within the inner loop over z (both x and y are constant there, so x.pow(n) can be safely hoisted out of the loop over z). Even if n could take some entirely bizarre value that would make the call to pow panic, there is no point in keeping it inside the inner loop, or am I missing something here? Does it actually matter where exactly the panic is generated?
Edit: in release builds (without overflow checks), u32::pow wraps on overflow rather than panicking, so it should always be safe to hoist calls to it, no matter the value of n.
Edit 2: with the new code structure, the optimizer remark changed to "failed to hoist load with loop-invariant address because load is conditionally executed", but the resulting asm is pretty much the same garbage. The remark seems to have nothing to do with the actual problem at hand.
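For isolating the constant-propagation effect without involving println! at all, std::hint::black_box (stable since Rust 1.66) makes a value opaque to the optimizer in much the same way the escaping print apparently does here. A minimal sketch; the pow_sum helper is illustrative, not part of the original code:

```rust
use std::hint::black_box;

// black_box returns its argument unchanged but prevents LLVM from assuming
// anything about the value, so pow(n) cannot be specialized for n == 3.
fn pow_sum(n: u32, x: u32, y: u32) -> u32 {
    x.pow(n) + y.pow(n)
}

fn main() {
    let n_const = 3u32;             // the optimizer sees the constant
    let n_opaque = black_box(3u32); // the optimizer must treat n as unknown
    // Both compute the same result; only the generated code should differ.
    println!("{}", pow_sum(n_const, 2, 3));  // 8 + 27 = 35
    println!("{}", pow_sum(n_opaque, 2, 3)); // 8 + 27 = 35
}
```

If the hypothesis is right, substituting `let n = black_box(3);` at the check_smart call site should reproduce the slow path just like the extra println! does.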

@alexpyattaev

alexpyattaev commented Feb 24, 2024

Managed to reproduce on Godbolt. It works reliably with different targets too (not just native). The perf difference on AMD is quite dramatic; going from 400 ms to 9.5 seconds is crazy slow.

@alexpyattaev

Tested again on the latest rustc 1.78.0-nightly (c67326b 2024-03-15); no changes so far in either direction.

@alexpyattaev

Tested again on 1.80, and the issue no longer reproduces. What made it go away is unclear.
rustc 1.80.0-beta.3 (105fc5c 2024-06-14): does not have the issue
rustc 1.79.0 (129f3b9 2024-06-10): still has the issue

I'll keep watching if it reemerges.

@alexpyattaev

The latest nightly, rustc 1.82.0-nightly (92c6c03 2024-07-21), has the problem again with the exact same symptoms as before. There is a clear regression here.
