
Massive slowdown caused by stacked borrows on code which uses raw pointers + atomics #3735

Open
newpavlov opened this issue Jul 6, 2024 · 6 comments

@newpavlov (Author) commented Jul 6, 2024

I was trying to minimize a massive slowdown observed in our code, and I got it down to the following snippet:

#![feature(new_uninit)]
use std::sync::atomic::{AtomicU32, Ordering::*};

const PAGE_SIZE: usize = 1 << 12;
const PAGES: usize = 16;
const ITEMS: usize = PAGE_SIZE * PAGES;

type AtomicPage = [AtomicU32; PAGE_SIZE];

#[test]
fn init_atomic() {
    // Allocate zeroed pages on the heap and turn them into raw pointers.
    let pages: [*mut AtomicPage; PAGES] = [(); PAGES]
        .map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() })
        .map(Box::into_raw);

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        // Reborrow the raw pointer as a shared reference on every iteration.
        let page: &AtomicPage = unsafe { &*(pages[page_idx]) };
        page[inpage_idx].store(u32::MAX, Release);
    }

    // Reconstitute the Boxes so the allocations are freed.
    for page_ptr in pages {
        drop(unsafe { Box::from_raw(page_ptr) });
    }
}

On my laptop I get the following results (more than 50x slowdown):

$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real	0m5,049s
user	0m4,786s
sys	0m0,265s
$ time cargo miri test

real	4m19,256s
user	4m18,831s
sys	0m0,334s

Replacing [*mut AtomicPage; PAGES] with [Box<AtomicPage>; PAGES] reduces slowdown to less than 2x.

newpavlov changed the title from "Massive slowdown caused by stacked borrows on code which uses raw pointers and atomic" to "Massive slowdown caused by stacked borrows on code which uses raw pointers + atomics" on Jul 6, 2024
@saethlin (Member) commented Jul 6, 2024

Replacing [*mut AtomicPage; PAGES] with [Box<AtomicPage>; PAGES] reduces slowdown to less than 2x.

I cannot reproduce this result. Can you post the version that uses Box<AtomicPage>?

@newpavlov (Author) commented Jul 7, 2024

#[test]
fn init_atomic() {
    let pages = [(); PAGES].map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() });

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        pages[page_idx][inpage_idx].store(u32::MAX, Release);
    }
}
$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real	0m5,215s
user	0m4,970s
sys	0m0,245s

$ time cargo miri test

real	0m7,015s
user	0m6,743s
sys	0m0,274s

Interestingly, keeping the page variable results in a ~30x slowdown:

#[test]
fn init_atomic() {
    let pages = [(); PAGES].map(|()| unsafe { Box::<AtomicPage>::new_zeroed().assume_init() });

    for i in 0..ITEMS {
        let page_idx = i / PAGE_SIZE;
        let inpage_idx = i % PAGE_SIZE;
        let page: &AtomicPage = &pages[page_idx];
        page[inpage_idx].store(u32::MAX, Release);
    }
}
$ time MIRIFLAGS='-Zmiri-disable-stacked-borrows' cargo miri test

real	0m5,174s
user	0m4,862s
sys	0m0,314s

$ time cargo miri test

real	2m26,782s
user	2m26,401s
sys	0m0,361s

@saethlin (Member) commented Jul 7, 2024

Interestingly, keeping the page variable results in a ~30x slowdown:

There is an extremely subtle difference between those two programs. You're specifying the type to be &AtomicPage, but without the annotation the variable would be inferred as &Box<AtomicPage>. Specifying that type changes the borrowing structure of the program: it creates additional SharedReadWrite tags on the bytes of the AtomicPages, which stresses this loop:

while let Some(item) = self.get(idx) {
    if item.perm() == Permission::SharedReadWrite {
        // Go on.
        idx += 1;
    } else {
        // Found first incompatible!
        break;
    }
}
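
For illustration, a minimal sketch of the two bindings (illustrative only; `pages` and `page_idx` are the ones from the test above):

// Inferred type is &Box<AtomicPage>: the reference points at the Box
// itself, so no new tags are pushed for the bytes of the AtomicPage.
let page = &pages[page_idx];

// Annotated as &AtomicPage: this forces a deref of the Box plus a
// reborrow, pushing a fresh SharedReadWrite tag onto the borrow stack
// of every byte of the page.
let page: &AtomicPage = &pages[page_idx];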

@saethlin (Member) commented Jul 7, 2024

Hunh. In fact adding the type annotation at all, even if it's fully redundant, still produces an extra reborrow. That is new to me.
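
In other words (a minimal pair, illustrative only, in the context of the Box-based test), even a fully redundant annotation appears to introduce an extra reborrow:

let page = &pages[page_idx];                   // no extra reborrow
let page: &Box<AtomicPage> = &pages[page_idx]; // redundant annotation, still reborrows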

@saethlin (Member) commented Jul 7, 2024

Ah, of course: the main difficulty in debugging this performance issue is that the GC is doing its absolute best to cover up the problem. Perhaps a better minimization is this:

use std::cell::UnsafeCell;

const ITEMS: usize = 1 << 12;

fn main() {
    let page = [const { UnsafeCell::new(0u8) }; 4096];

    for _ in 0..ITEMS {
        // Each shared reborrow of `page` pushes a new tag onto every byte's stack.
        let _page = &page;
    }
}

When run with -Zmiri-provenance-gc=0 this has clearly super-linear runtime: at 1 << 10 iterations I get 2.4 seconds, at 1 << 11 iterations 7.8 seconds, and at 1 << 12 iterations 25 seconds. (Each doubling of the iteration count roughly triples the runtime.)
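
For reference, that flag is passed the same way as the MIRIFLAGS invocations above; since this minimization has a main function, the command would be (assuming a binary crate):

$ time MIRIFLAGS='-Zmiri-provenance-gc=0' cargo miri run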

This is very similar to our slice-get-unchecked benchmark:

fn main() {
    let x = vec![0u8; 4096];
    let mut i = 0;
    while i < x.len() {
        let _element = unsafe { *x.get_unchecked(i) };
        i += 1;
    }
}

In both cases the program is stressing a code path which, in Stacked Borrows, is technically a linear search of ever-growing length.

But for this search we don't have a stack cache, so the only mitigation against a true runtime explosion is the GC.
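
To see why this is quadratic, here is a toy model of that search (hypothetical types and method names, not Miri's actual data structures): every shared reborrow pushes one more SharedReadWrite item, and granting the next tag first scans past all of the existing ones.

#[derive(PartialEq)]
enum Permission {
    Unique,
    SharedReadWrite,
}

struct Stack {
    items: Vec<Permission>,
}

impl Stack {
    // The linear search: walk upward past every SharedReadWrite item
    // until the first incompatible one. Its length grows by one on
    // every reborrow, so n reborrows do O(n^2) total work.
    fn find_granting_scan(&self) -> usize {
        let mut idx = 1; // skip the Unique base tag
        while let Some(item) = self.items.get(idx) {
            if *item == Permission::SharedReadWrite {
                idx += 1;
            } else {
                break;
            }
        }
        idx
    }

    fn retag_shared(&mut self) {
        let _ = self.find_granting_scan();
        self.items.push(Permission::SharedReadWrite);
    }
}

fn main() {
    // Models `let _page = &page;` executed ITEMS times on one byte.
    let mut stack = Stack { items: vec![Permission::Unique] };
    for _ in 0..(1 << 12) {
        stack.retag_shared();
    }
    println!("final stack height: {}", stack.items.len());
}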

So I have two observations:

First, all the borrow stacks here have a Unique base tag and are SharedReadWrite all the way up. If stacks carried a bool indicating that they have this shape, we could skip the loop that makes the runtime overhead here so dramatic (a sketch of this idea follows at the end of this comment). And I suspect this property might be fairly common: shouldn't most atomics only ever be accessed through SharedReadWrite tags?

Second, I wanted to print the stacks for this program, and I got this:

use std::cell::UnsafeCell;

extern "Rust" {
    fn miri_get_alloc_id(ptr: *const u8) -> u64;
    fn miri_print_borrow_state(alloc_id: u64, show_unnamed: bool);
}

fn get_alloc_id(ptr: *const u8) -> u64 {
    unsafe { miri_get_alloc_id(ptr) }
}

fn print_borrow_stacks(alloc_id: u64) {
    unsafe {
        miri_print_borrow_state(alloc_id, /* ignored: show_unnamed */ false)
    }
}

fn main() {
    let page = [const { UnsafeCell::new(0u8) }; 4096];
    let id = get_alloc_id(page.as_ptr().cast());
    print_borrow_stacks(id);
}
...
4091..4092: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4092..4093: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4093..4094: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4094..4095: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]
4095..4096: [ Unique<2134> SharedReadWrite<2135> SharedReadWrite<2136> SharedReadWrite<2139> SharedReadWrite<2140> ]

There is a different borrow stack for every byte, but they are all equal. I thought we had a mechanism to prevent this.
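
As for the sketch promised in the first observation above, the fast path could look roughly like this (hypothetical field and method names, not actual Miri code):

#[derive(PartialEq)]
enum Permission {
    Unique,
    SharedReadWrite,
}

struct Stack {
    items: Vec<Permission>,
    // Invariant: true iff items[0] is Unique and everything above it is
    // SharedReadWrite; must be maintained on every push and pop.
    unique_base_then_shared: bool,
}

impl Stack {
    fn grant_shared_read_write(&mut self) {
        if self.unique_base_then_shared {
            // Fast path: nothing above the base can be incompatible, so
            // skip the linear search entirely and just push.
            self.items.push(Permission::SharedReadWrite);
        } else {
            // Slow path: the existing linear search shown earlier.
            todo!("run the current scan, then push")
        }
    }
}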

@RalfJung (Member) commented Jul 7, 2024

There is a different borrow stack for every byte, but they are all equal. I thought we had a mechanism to prevent this.

Yes, but that only kicks in when we iterate the stacks and notice that two adjacent stacks are equal. It also deliberately bounds how much comparing and merging it does, to avoid adding overhead of its own.
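
For concreteness, a toy version of that merging (illustrative only; Miri's real representation differs):

use std::ops::Range;

// Per-byte ranges, each with its own borrow stack (here just tag IDs).
type Stacks = Vec<(Range<usize>, Vec<u32>)>;

// Collapse neighbouring ranges whose stacks compare equal, so 4096
// identical per-byte stacks become a single 0..4096 entry.
fn merge_adjacent(ranges: &mut Stacks) {
    let mut i = 0;
    while i + 1 < ranges.len() {
        if ranges[i].1 == ranges[i + 1].1 {
            let end = ranges[i + 1].0.end;
            ranges[i].0.end = end;
            ranges.remove(i + 1);
        } else {
            i += 1;
        }
    }
}

fn main() {
    let mut ranges: Stacks = (0usize..4).map(|b| (b..b + 1, vec![2201, 2202])).collect();
    merge_adjacent(&mut ranges);
    assert_eq!(ranges.len(), 1);
    assert_eq!(ranges[0].0, 0..4);
}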

But here nothing at all seems to be merged, not even the first bytes:

0..1: [ Unique<2201> SharedReadWrite<2202> SharedReadWrite<2203> SharedReadWrite<2206> SharedReadWrite<2207> ]
1..2: [ Unique<2201> SharedReadWrite<2202> SharedReadWrite<2203> SharedReadWrite<2206> SharedReadWrite<2207> ]
2..3: [ Unique<2201> SharedReadWrite<2202> SharedReadWrite<2203> SharedReadWrite<2206> SharedReadWrite<2207> ]
...

That is indeed strange.
