Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Put the rr page into a proper ELF library #2753

Merged
merged 1 commit into from
Nov 30, 2020

Conversation

Keno
Copy link
Member

@Keno Keno commented Nov 29, 2020

This fixes #2721, though the strategy I proposed there doesn't
quite work as stated, so this is slightly different.

To recap:
We are seeing some test failures under rr, because our runtime
system assumes that it can asynchronously unwind from anywhere
(basically having things built with -fasynchronous-unwind-tables
and proper CFI for assembly code). However, this ran into problems
when inside a syscall in the rr page, because the rr page is not
part of any image and does not have CFI info, so the unwinder would
complain.

I originally thought I could just create an ET_EXEC file and ask the
dynamic linker to always map the rr page at the correct address
(preceeded by one page for the ELF header and eh_frame table),
but that turns out to be forbidden. Instead what this does is create
an ET_DYN file and then intercept the initial mmap of said file,
where the dynamic linker expects that kernel to choose a (potentially
randomized) memory location for it. Since we're interposing the
kernel, we can just decide to always map this particular mapping
at the address we want it to be at. Of course, this is somewhat
sensitive to the exact order of events, so while it works well
on my current glibc version, I wouldn't be surprised if other libcs
(or future versions of glibc) may do things differently enough to
not trigger these heuristics. Fortunately, we can just simply keep
doing nothing in that case and while we will see the unwind issue again,
the recording should otherwise be unaffected.

Since we now have a probler .so for the rr_page, the generation process
has changed. Rather than generating the raw bytes using python,
the rr page is now a simple assembly file that contains the appropriate
instructions as well as a link script that ensures the layout is
consistent and known. I also combined the record and replay pages
into one file, with rr deciding to either map the second or third page
of librrpage.so as its rr_page. That's not strictly necessary, but
it seemed simplest to me. It also occurred to me that we could also
combine the exec_stub into this file as well to cut down the number
of executable files we need to ship (and have the kernel already
map the rr page for us), but that's just a random side note.

To test this, I'm asking the C++ unwinder to unwind us out of a system
call. Of course, in the real application, the unwinder in question
may not necessarily be the C++ system unwinder (since the unwinder
is explicitly swappable), but it's an easy way to test the issue.
It does assume that whatever libc is in regular use does come
with unwind info, which is the case on most distros I know,
but we can revisit the issue if we get reports that some
distros do not turn on unwind info for their libcs.

cc @staticfloat

This fixes rr-debugger#2721, though the strategy I proposed there doesn't
quite work as stated, so this is slightly different.

To recap:
We are seeing some test failures under rr, because our runtime
system assumes that it can asynchronously unwind from anywhere
(basically having things built with -fasynchronous-unwind-tables
and proper CFI for assembly code). However, this ran into problems
when inside a syscall in the rr page, because the rr page is not
part of any image and does not have CFI info, so the unwinder would
complain.

I originally thought I could just create an ET_EXEC file and ask the
dynamic linker to always map the rr page at the correct address
(preceeded by one page for the ELF header and eh_frame table),
but that turns out to be forbidden. Instead what this does is create
an ET_DYN file and then intercept the initial mmap of said file,
where the dynamic linker expects that kernel to choose a (potentially
randomized) memory location for it. Since we're interposing the
kernel, we can just decide to always map this particular mapping
at the address we want it to be at. Of course, this is somewhat
sensitive to the exact order of events, so while it works well
on my current glibc version, I wouldn't be surprised if other libcs
(or future versions of glibc) may do things differently enough to
not trigger these heuristics. Fortunately, we can just simply keep
doing nothing in that case and while we will see the unwind issue again,
the recording should otherwise be unaffected.

Since we now have a probler .so for the rr_page, the generation process
has changed. Rather than generating the raw bytes using python,
the rr page is now a simple assembly file that contains the appropriate
instructions as well as a link script that ensures the layout is
consistent and known. I also combined the record and replay pages
into one file, with rr deciding to either map the second or third page
of `librrpage.so` as its rr_page. That's not strictly necessary, but
it seemed simplest to me. It also occurred to me that we could also
combine the exec_stub into this file as well to cut down the number
of executable files we need to ship (and have the kernel already
map the rr page for us), but that's just a random side note.

To test this, I'm asking the C++ unwinder to unwind us out of a system
call. Of course, in the real application, the unwinder in question
may not necessarily be the C++ system unwinder (since the unwinder
is explicitly swappable), but it's an easy way to test the issue.
It does assume that whatever libc is in regular use does come
with unwind info, which is the case on most distros I know,
but we can revisit the issue if we get reports that some
distros do not turn on unwind info for their libcs.
@rocallahan
Copy link
Collaborator

Looks great to me! Thanks!

@rocallahan rocallahan merged commit 3aaf792 into rr-debugger:master Nov 30, 2020
@yuyichao
Copy link
Contributor

yuyichao commented Dec 1, 2020

@Keno This breaks make install. It seems that the rr_page_* files are still being installed but are not being generated anymore. What should be the correct new installation rules?

@yuyichao
Copy link
Contributor

yuyichao commented Dec 1, 2020

@Keno
Copy link
Member Author

Keno commented Dec 1, 2020

Sorry, the rr_page files aren't used anymore (replaced by librrpage.so). I may have missed an install rule somewhere in the CMake config. Will take a look shortly.

@yuyichao
Copy link
Contributor

yuyichao commented Dec 1, 2020

I assume the missed installation rule is the one ~L660 right below where the installation rule for rr/src/preload was added...

@Keno
Copy link
Member Author

Keno commented Dec 1, 2020

Yep, I think those should go. Sorry - I thought I had deleted them.

@khuey
Copy link
Collaborator

khuey commented Dec 1, 2020

This doesn't work on Ubuntu 18.04.

@khuey
Copy link
Collaborator

khuey commented Dec 9, 2020

22f34a8

@Keno
Copy link
Member Author

Keno commented Dec 10, 2020

Thanks for catching that. I rarely use chaos mode, so I never saw this ... Sorry!

@khuey
Copy link
Collaborator

khuey commented Dec 10, 2020

No worries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Give rr page proper unwind info
4 participants