Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] ndk-stack failing with UnicodeDecodeError due to NativeTombstoneManager #1993

Closed
cortinico opened this issue Jan 25, 2024 · 11 comments
Closed
Labels

Comments

@cortinico
Copy link

Description

Hey all,
I'm having problem running ndk-stack since upgrading to NDK 25.1 and 26.1

Specifically I'm invoking it with either stdin by piping adb logcat or passing a -dump, but the tool fails to work in both cases. The tool works correctly with NDK 23.1.

It seems like ndk-stack fails to read the input due to non UTF-8 chars in the log.
I'm uploading a dump taken from an Android emulator:
dump.txt

Specifically the non UTF-8 chars seems to be the following:

$ grep -axv '.*' dump.txt
01-08 15:14:19.624   559   638 E NativeTombstoneManager: Tombstone has invalid selinux label (u:r:priv_app:s0:c512,c768��), ignoring
01-15 16:25:12.557  3785  3861 E NativeTombstoneManager: Tombstone has invalid selinux label (u:r:priv_app:s0:c512,c768��), ignoring

It would be great if ndk-stack were to not fail on those characters, specifically also because NativeTombstoneManager is part of the Android SDK (and everything worked fine in NDK 23).

Stacktrace on NDK 26.1

$ adb logcat | $ANDROID_HOME/ndk/26.1.10909125/ndk-stack -sym packages/react-native/ReactAndroid/build/intermediates/cmake/debug/obj/arm64-v8a
Traceback (most recent call last):
  File "/opt/android_sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/darwin-x86_64/python3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/android_sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/darwin-x86_64/python3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/android_sdk/ndk/26.1.10909125/prebuilt/darwin-x86_64/bin/ndkstack.pyz/__main__.py", line 3, in <module>
  File "/opt/android_sdk/ndk/26.1.10909125/prebuilt/darwin-x86_64/bin/ndkstack.pyz/ndkstack.py", line 387, in main
  File "/opt/android_sdk/ndk/26.1.10909125/toolchains/llvm/prebuilt/darwin-x86_64/python3/lib/python3.10/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 7369: invalid start byte

Stacktrace on NDK 25.1

adb logcat | $ANDROID_HOME/ndk/25.1.8937393/ndk-stack -sym packages/react-native/ReactAndroid/build/intermediates/cmake/debug/obj/arm64-v8a
<_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
Traceback (most recent call last):
  File "/opt/android_sdk/ndk/25.1.8937393/prebuilt/darwin-x86_64/bin/ndk-stack.py", line 429, in <module>
    main(sys.argv[1:])
  File "/opt/android_sdk/ndk/25.1.8937393/prebuilt/darwin-x86_64/bin/ndk-stack.py", line 364, in main
    for line in args.input:
  File "/opt/android_sdk/ndk/25.1.8937393/toolchains/llvm/prebuilt/darwin-x86_64/python3/lib/python3.9/codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc0 in position 1177: invalid start byte

Affected versions

r26

Canary version

No response

Host OS

Mac

Host OS version

macOS 14.2.1

Affected ABIs

arm64-v8a

Build system

CMake

Other build system

No response

minSdkVersion

21

Device API level

32

@cortinico cortinico added the bug label Jan 25, 2024
@cortinico cortinico changed the title [BUG] ndk-stack failing with UnicodeDecodeError [BUG] ndk-stack failing with UnicodeDecodeError due to NativeTombstoneManager Jan 25, 2024
@DanAlbert
Copy link
Member

Assuming it also worked in r24, I suspect this is a result of the Python 2 -> 3 migration?

@enh-google started an internal thread with some of the folks that know this system better. I'll nudge them to update here so I don't get something wrong playing telephone. The short answer right now is that they suspect the bug is actually elsewhere, because those characters shouldn't show up in the log at all.

Agreed that ndk-stack shouldn't choke in malformed input either way though 👍

NativeTombstoneManager is part of the Android SDK

I've actually never heard of this, and the Google results are pretty slim. Where should I be looking?

@enh-google
Copy link
Collaborator

yeah, basically from your dump.txt (thanks for including that!) it looks like somewhere we have some C code passing one byte too many (the '\0' terminator) to some Java code that commits Java's utf mistake -- https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8 -- to turn that into the two bytes you're seeing. so someone will want to hunt that down and fix it in the OS.

but, as danalbert says, we should fix the script too, if possible. (python is correct --- this is invalid utf8. but if we can, we should just let it through anyway.)

I've actually never heard of this, and the Google results are pretty slim. Where should I be looking?

i don't think they meant what they said. that's just the part of the OS that deals with moving tombstones to dropbox (from wheret hey go to the Play Console). the public API is https://developer.android.com/reference/android/app/ApplicationExitInfo#getTraceInputStream() but -- as evidenced by how hard it was for me to find that even though i knew what i was looking for -- that's not easy to find via web search. (hopefully that's because folks are just happy with the Play Console, but i suspect there are a lot of bad wheel-reinventions out there too.)

@DanAlbert
Copy link
Member

Gotcha. Are you still trying to figure out who the likely culprit is? Can I leave filing that OS bug to you, and I'll go poke at ndk-stack?

@enh-google
Copy link
Collaborator

i'm 99% sure it's the person i added to the internal thread. i was assuming they'd just fix it rather than file a bug...

but now you've made me look, it's actually jmgao's bug. so i guess i'll be fixing that :-)

@enh-google
Copy link
Collaborator

interestingly, looking at /proc/self/attr/current on my desktop and on a random android device that happens to be plugged in, i think there are just bugs here where folks are echo(1)/write(2)ing into this file and accidentally leaving a '\n' or '0' included. the kernel just treats these as arbitrary byte sequences and keeps the cruft, and then other bits of userspace are surprised to see random control characters handed back later.

i'll see what our selinux supremo says before touching anything (especially because i only see readers in our codebase, so i don't actually know where my hypothetical writes are coming from).

@jmgao
Copy link
Contributor

jmgao commented Jan 26, 2024

but now you've made me look, it's actually jmgao's bug. so i guess i'll be fixing that :-)

Managers hate this one weird trick to avoid having to fix bugs!

interestingly, looking at /proc/self/attr/current on my desktop and on a random android device that happens to be plugged in, i think there are just bugs here where folks are echo(1)/write(2)ing into this file and accidentally leaving a '\n' or '0' included.

I don't think that sounds right, the kernel converts between a u32 internal value and the human readable string when reading and writing the file. My uninformed guess is that there's a null terminator in the cil files or compiled sepolicy?

@cortinico
Copy link
Author

I've actually never heard of this, and the Google results are pretty slim. Where should I be looking?

@DanAlbert I've found it by searching here:
https://cs.android.com/android/platform/superproject/+/master:frameworks/base/services/core/java/com/android/server/os/NativeTombstoneManager.java?q=NativeTombstoneManager

Thank you all for the context and support

@enh-google
Copy link
Collaborator

I don't think that sounds right...

yeah, sorry, i followed up further on the internal thread but forgot about this.

i couldn't find where anything in the platform was writing to this... ah, it's those same selinux macros that mean i can never find what i'm looking for in selinux (and it's always procattr.c i'm looking for, and as usual i regret looking).

so, yes, they do this for some reason:

   ret = write(fd, context2, strlen(context2) + 1);

i still think this is an selinux bug, and i think they usually don't notice because their getter is written in C, uses strdup(), and thus truncates '\0' anyway. (the '\n' on my desktop is another matter entirely!)

inside the kernel i see this in selinux_setprocattr():

        /* Obtain a SID for the context, if one was specified. */
        if (size && str[0] && str[0] != '\n') {
                if (str[size-1] == '\n') {
                        str[size-1] = 0;
                        size--;
                }
                error = security_context_to_sid(value, size,
                                                &sid, GFP_KERNEL);
                if (error == -EINVAL && !strcmp(name, "fscreate")) {
                        if (!has_cap_mac_admin(true)) {
                                struct audit_buffer *ab;
                                size_t audit_size;

                                /* We strip a nul only if it is at the end, otherwise the
                                 * context contains a nul and we should audit that */
                                if (str[size - 1] == '\0')
                                        audit_size = size - 1;
                                else
                                        audit_size = size;

so i'm pretty confused at this point... without docs explaining why any of this makes sense, it feels like bug layered on top of bug? (had they not heard of echo -n? is there a reason the initial "no, you can't have a \0 in a context" error wasn't legit and it makes sense for the kernel to try again with a slightly different context than the one it was given? and is that a good reason, or just that they had an off-by-one in libselinux that they didn't know about?)

i think the least effort fix for us is to switch debuggerd off /proc//attr/current and over to getpidcon(), and continue to ignore the fact that libselinux needs to be rewritten from scratch for another decade... (you do have to wonder how much cruft will have to be in the eventual [presumably rust at this point] rewrite for bug compatibility!)

(and, of course, the only part any users care about is that danalbert changes ndk-stack to just treat logs as a sequence of bytes rather than necessarily valid utf8.)

@DanAlbert
Copy link
Member

(and, of course, the only part any users care about is that danalbert changes ndk-stack to just treat logs as a sequence of bytes rather than necessarily valid utf8.)

Which I'd meant to do yesterday, but stuff happened. It'll be in r27 for sure, but it probably won't be in the canary until next week.

@enh-google
Copy link
Collaborator

actually, it's more of a PITA to use libselinux (and it's awful anyway), so i've just gone with the one-liner "drop a trailing '\0'" code: https://android-review.googlesource.com/c/platform/system/core/+/2930438

@DanAlbert
Copy link
Member

Had to go work on other stuff for a few weeks, but fixed with https://r.android.com/c/2972694.

minaripenguin pushed a commit to minaripenguin/android_system_core that referenced this issue Mar 1, 2024
libselinux has an off-by-one that causes it to pass the trailing '\0' to
the kernel as if it's part of the security context, and the kernel
dutifully hands it back, since it's an uninterpreted byte array as far
as the kernel is concerned. libselinux accidentally hides this bug by
treating it as a C string and calling strdup(), but debuggerd doesn't
because it reads the file into a std::string.

We could switch to libselinux's getcon()/getpidcon(), but (a) libselinux
is awful (see above) and (b) not currently accessible to apexes (and it
doesn't seem like a great idea to make it accessible). So just manually
drop the last byte from the context we read ourselves, if it happens to
be a '\0'.

Bug: android/ndk#1993
Test: treehugger
Change-Id: I8e7605ac5e618007a8da635cb6f45b0778dc167c
travarilo pushed a commit to bananadroid/android_system_core that referenced this issue Mar 4, 2024
libselinux has an off-by-one that causes it to pass the trailing '\0' to
the kernel as if it's part of the security context, and the kernel
dutifully hands it back, since it's an uninterpreted byte array as far
as the kernel is concerned. libselinux accidentally hides this bug by
treating it as a C string and calling strdup(), but debuggerd doesn't
because it reads the file into a std::string.

We could switch to libselinux's getcon()/getpidcon(), but (a) libselinux
is awful (see above) and (b) not currently accessible to apexes (and it
doesn't seem like a great idea to make it accessible). So just manually
drop the last byte from the context we read ourselves, if it happens to
be a '\0'.

Bug: android/ndk#1993
Test: treehugger
Change-Id: I8e7605ac5e618007a8da635cb6f45b0778dc167c
minaripenguin pushed a commit to minaripenguin/android_system_core that referenced this issue Mar 4, 2024
libselinux has an off-by-one that causes it to pass the trailing '\0' to
the kernel as if it's part of the security context, and the kernel
dutifully hands it back, since it's an uninterpreted byte array as far
as the kernel is concerned. libselinux accidentally hides this bug by
treating it as a C string and calling strdup(), but debuggerd doesn't
because it reads the file into a std::string.

We could switch to libselinux's getcon()/getpidcon(), but (a) libselinux
is awful (see above) and (b) not currently accessible to apexes (and it
doesn't seem like a great idea to make it accessible). So just manually
drop the last byte from the context we read ourselves, if it happens to
be a '\0'.

Bug: android/ndk#1993
Test: treehugger
Change-Id: I8e7605ac5e618007a8da635cb6f45b0778dc167c
minaripenguin pushed a commit to minaripenguin/android_system_core that referenced this issue Mar 4, 2024
libselinux has an off-by-one that causes it to pass the trailing '\0' to
the kernel as if it's part of the security context, and the kernel
dutifully hands it back, since it's an uninterpreted byte array as far
as the kernel is concerned. libselinux accidentally hides this bug by
treating it as a C string and calling strdup(), but debuggerd doesn't
because it reads the file into a std::string.

We could switch to libselinux's getcon()/getpidcon(), but (a) libselinux
is awful (see above) and (b) not currently accessible to apexes (and it
doesn't seem like a great idea to make it accessible). So just manually
drop the last byte from the context we read ourselves, if it happens to
be a '\0'.

Bug: android/ndk#1993
Test: treehugger
Change-Id: I8e7605ac5e618007a8da635cb6f45b0778dc167c
Mrick343 pushed a commit to euclidTeam/system_core that referenced this issue May 8, 2024
libselinux has an off-by-one that causes it to pass the trailing '\0' to
the kernel as if it's part of the security context, and the kernel
dutifully hands it back, since it's an uninterpreted byte array as far
as the kernel is concerned. libselinux accidentally hides this bug by
treating it as a C string and calling strdup(), but debuggerd doesn't
because it reads the file into a std::string.

We could switch to libselinux's getcon()/getpidcon(), but (a) libselinux
is awful (see above) and (b) not currently accessible to apexes (and it
doesn't seem like a great idea to make it accessible). So just manually
drop the last byte from the context we read ourselves, if it happens to
be a '\0'.

Bug: android/ndk#1993
Test: treehugger
Change-Id: I8e7605ac5e618007a8da635cb6f45b0778dc167c
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
No open projects
Status: Merged
Development

No branches or pull requests

4 participants