
Massive slowdown caused by stacked borrows on code which uses raw pointers + atomics #3735

Open
newpavlov opened this issue Jul 6, 2024 · 6 comments

@newpavlov (Author) commented Jul 6, 2024

I was trying to minimize a massive slowdown observed in our code, and I got it down to the following snippet:

#![feature(new_uninit)]
use std::sync::atomic::{AtomicU32, Ordering::*};

const PAGE_SIZE: usize = 1 << 12;
const PAGES: usize = 16;
const ITEMS: usize = PAGE_SIZE * PAGES;

type AtomicPage = [AtomicU32; PAGE_SIZE];

#[test]
fn init_atomic() {
    // Allocate zeroed pages on the heap and turn them into raw pointers.
    let pages: [*mut AtomicPage; PAGES] = [(); PAGES]
        .map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() })
        .map(Box::into_raw);

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        // Reborrow the raw pointer as a shared reference on every iteration.
        let page: &AtomicPage = unsafe { &*(pages[page_idx]) };
        page[inpage_idx].store(u32::MAX, Release);
    }

    // Reconstitute the Boxes so the allocations are freed.
    for page_ptr in pages {
        drop(unsafe { Box::from_raw(page_ptr) });
    }
}

On my laptop I get the following results (more than 50x slowdown):

$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real	0m5,049s
user	0m4,786s
sys	0m0,265s
$ time cargo miri test

real	4m19,256s
user	4m18,831s
sys	0m0,334s

Replacing [*mut AtomicPage; PAGES] with [Box<AtomicPage>; PAGES] reduces slowdown to less than 2x.

newpavlov changed the title from "Massive slowdown caused by stacked borrows on code which uses raw pointers and atomic" to "Massive slowdown caused by stacked borrows on code which uses raw pointers + atomics" on Jul 6, 2024
@saethlin (Member) commented Jul 6, 2024

Replacing [*mut AtomicPage; PAGES] with [Box<AtomicPage>; PAGES] reduces slowdown to less than 2x.

I cannot reproduce this result. Can you post the version that uses Box<AtomicPage>?

@newpavlov (Author) commented Jul 7, 2024

#[test]
fn init_atomic() {
    let pages = [(); PAGES].map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() });

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        pages[page_idx][inpage_idx].store(u32::MAX, Release);
    }
}
$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real	0m5,215s
user	0m4,970s
sys	0m0,245s

$ time cargo miri test

real	0m7,015s
user	0m6,743s
sys	0m0,274s

Interestingly, keeping the page variable results in a ~30x slowdown:

#[test]
fn init_atomic() {
    let pages = [(); PAGES].map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() });

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        let page: &AtomicPage = &pages[page_idx];
        page[inpage_idx].store(u32::MAX, Release);
    }
}
$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real	0m5,174s
user	0m4,862s
sys	0m0,314s

$ time cargo miri test

real	2m26,782s
user	2m26,401s
sys	0m0,361s

@saethlin (Member) commented Jul 7, 2024

Interestingly, keeping the page variable results in a ~30x slowdown:

There is an extremely subtle difference between those two programs. You're specifying the type to be &AtomicPage, but without the annotation the variable would be inferred as &Box<AtomicPage>. Specifying that type changes the borrowing structure of the program: it creates additional SharedReadWrite tags on the bytes of the AtomicPages, which stresses this loop:

while let Some(item) = self.get(idx) {
    if item.perm() == Permission::SharedReadWrite {
        // Go on.
        idx += 1;
    } else {
        // Found first incompatible!
        break;
    }
}
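
For illustration, a minimal sketch of the two bindings (illustrative only; `pages` and `page_idx` are the ones from the test above):

// Inferred type is &Box<AtomicPage>: the reference points at the Box
// itself, so no new tags are pushed for the bytes of the AtomicPage.
let page = &pages[page_idx];

// Annotated as &AtomicPage: this forces a deref of the Box plus a
// reborrow, pushing a fresh SharedReadWrite tag onto the borrow stack
// of every byte of the page.
let page: &AtomicPage = &pages[page_idx];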

@saethlin (Member) commented Jul 7, 2024

Hunh. In fact adding the type annotation at all, even if it's fully redundant, still produces an extra reborrow. That is new to me.
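
In other words (a minimal pair, illustrative only, in the context of the Box-based test), even a fully redundant annotation appears to introduce an extra reborrow:

let page = &pages[page_idx];                   // no extra reborrow
let page: &Box<AtomicPage> = &pages[page_idx]; // redundant annotation, still reborrows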

@saethlin (Member) commented Jul 7, 2024

Ah, of course: the main difficulty in debugging this performance issue is that the GC is doing its absolute best to cover up the problem. Perhaps a better minimization is this:

use std::cell::UnsafeCell;

const ITEMS: usize = 1 << 12;

fn main() {
    let page = [const { UnsafeCell::new(0u8) }; 4096];

    for _ in 0..ITEMS {
        // Each shared reborrow of `page` pushes a new tag onto every byte's stack.
        let _page = &page;
    }
}

When run with -Zmiri-provenance-gc=0 this has clearly super-linear runtime: at 1 << 10 iterations I get 2.4 seconds, at 1 << 11 iterations 7.8 seconds, and at 1 << 12 iterations 25 seconds. (Each doubling of the iteration count roughly triples the runtime.)
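
For reference, that flag is passed the same way as the MIRIFLAGS invocations above; since this minimization has a main function, the command would be (assuming a binary crate):

$ time MIRIFLAGS='-Zmiri-provenance-gc=0' cargo miri run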

This is very similar to our slice-get-unchecked benchmark:

fn main() {
    let x = vec![0u8; 4096];
    let mut i = 0;
    while i < x.len() {
        let _element = unsafe { *x.get_unchecked(i) };
        i += 1;
    }
}

In both cases the program is stressing a code path which, in Stacked Borrows, is technically a linear search of ever-growing length.

But for this search we don't have a stack cache, so the only mitigation against a true runtime explosion is the GC.
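
To see why this is quadratic, here is a toy model of that search (hypothetical types and method names, not Miri's actual data structures): every shared reborrow pushes one more SharedReadWrite item, and granting the next tag first scans past all of the existing ones.

#[derive(PartialEq)]
enum Permission {
    Unique,
    SharedReadWrite,
}

struct Stack {
    items: Vec<Permission>,
}

impl Stack {
    // The linear search: walk upward past every SharedReadWrite item
    // until the first incompatible one. Its length grows by one on
    // every reborrow, so n reborrows do O(n^2) total work.
    fn find_granting_scan(&self) -> usize {
        let mut idx = 1; // skip the Unique base tag
        while let Some(item) = self.items.get(idx) {
            if *item == Permission::SharedReadWrite {
                idx += 1;
            } else {
                break;
            }
        }
        idx
    }

    fn retag_shared(&mut self) {
        let _ = self.find_granting_scan();
        self.items.push(Permission::SharedReadWrite);
    }
}

fn main() {
    // Models `let _page = &page;` executed ITEMS times on one byte.
    let mut stack = Stack { items: vec![Permission::Unique] };
    for _ in 0..(1 << 12) {
        stack.retag_shared();
    }
    println!("final stack height: {}", stack.items.len());
}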

So I have two observations:

First, all the borrow stacks here have a Unique base tag and are SharedReadWrite all the way up. If stacks carried a bool indicating that they have this shape, we could skip the loop that makes the runtime overhead here so dramatic (a sketch of this idea follows at the end of this comment). And I suspect this property might be fairly common: shouldn't most atomics only ever be accessed through SharedReadWrite tags?

Second, I wanted to print the stacks for this program, and I got this:

use std::cell::UnsafeCell;

extern "Rust" {
    fn miri_get_alloc_id(ptr: *const u8) -> u64;
    fn miri_print_borrow_state(alloc_id: u64, show_unnamed: bool);
}

fn get_alloc_id(ptr: *const u8) -> u64 {
    unsafe { miri_get_alloc_id(ptr) }
}

fn print_borrow_stacks(alloc_id: u64) {
    unsafe {
        miri_print_borrow_state(alloc_id, /* ignored: show_unnamed */ false)
    }
}

fn main() {
    let page = [const { UnsafeCell::new(0u8) }; 4096];
    let id = get_alloc_id(page.as_ptr().cast());
    print_borrow_stacks(id);
}
...
4091..4092: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4092..4093: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4093..4094: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4094..4095: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4095..4096: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]

There is a different borrow stack for every byte, but they are all equal. I thought we had a mechanism to prevent this.
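
As for the sketch promised in the first observation above, the fast path could look roughly like this (hypothetical field and method names, not actual Miri code):

#[derive(PartialEq)]
enum Permission {
    Unique,
    SharedReadWrite,
}

struct Stack {
    items: Vec<Permission>,
    // Invariant: true iff items[0] is Unique and everything above it is
    // SharedReadWrite; must be maintained on every push and pop.
    unique_base_then_shared: bool,
}

impl Stack {
    fn grant_shared_read_write(&mut self) {
        if self.unique_base_then_shared {
            // Fast path: nothing above the base can be incompatible, so
            // skip the linear search entirely and just push.
            self.items.push(Permission::SharedReadWrite);
        } else {
            // Slow path: the existing linear search shown earlier.
            todo!("run the current scan, then push")
        }
    }
}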

@RalfJung (Member) commented Jul 7, 2024

There is a different borrow stack for every byte, but they are all equal. I thought we had a mechanism to prevent this.

Yes, but that only kicks in when we iterate the stacks and notice that two adjacent stacks are equal. It also deliberately bounds how much comparing and merging it does, to avoid adding overhead of its own.
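
For concreteness, a toy version of that merging (illustrative only; Miri's real representation differs):

use std::ops::Range;

// Per-byte ranges, each with its own borrow stack (here just tag IDs).
type Stacks = Vec<(Range<usize>, Vec<u32>)>;

// Collapse neighbouring ranges whose stacks compare equal, so 4096
// identical per-byte stacks become a single 0..4096 entry.
fn merge_adjacent(ranges: &mut Stacks) {
    let mut i = 0;
    while i + 1 < ranges.len() {
        if ranges[i].1 == ranges[i + 1].1 {
            let end = ranges[i + 1].0.end;
            ranges[i].0.end = end;
            ranges.remove(i + 1);
        } else {
            i += 1;
        }
    }
}

fn main() {
    let mut ranges: Stacks = (0usize..4).map(|b| (b..b + 1, vec![2201, 2202])).collect();
    merge_adjacent(&mut ranges);
    assert_eq!(ranges.len(), 1);
    assert_eq!(ranges[0].0, 0..4);
}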

But here nothing at all seems to be merged, not even the first bytes:

0..1: [ Unique<2201> SharedReadWrite<2202> SharedReadWrite<2203> SharedReadWrite<2206> SharedReadWrite<2207> ]
1..2: [ Unique<2201> SharedReadWrite<2202> SharedReadWrite<2203> SharedReadWrite<2206> SharedReadWrite<2207> ]
2..3: [ Unique<2201> SharedReadWrite<2202> SharedReadWrite<2203> SharedReadWrite<2206> SharedReadWrite<2207> ]
...

That is indeed strange.
