-
Notifications
You must be signed in to change notification settings - Fork 431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Profile Native Extensions #2
Comments
You may(?) want to compare with https://github.com/vmprof/vmprof-python, which also uses libunwind. |
I also eager to have the native backtrace support in py-spy! It seems this is a secret sauce of vmprof. It depends on libbacktrace which uses libunwind internally. |
This is something I've mostly thought about in the context of a native sampling profiler 1 systematically detecting Python code while unwinding, and interleaving python info at that point, but not actually tried. The big problem is reconciling the state of the world from the interpreter, to that of the native stack, as frames are created and destroyed. The naive way is to rescan the interpreter "tree" every time one encounters a PyEval function on the stack, and then literally counting the frame list and the unwind stack to find which frame corresponds to this native function. The better way is to know the PyFrameObject directly. It is basically impossible under x86_64 to retrieve the PyFrameObject from a PyEval_EvalFrameEx call while unwinding because it is not required to be in a non-clobbered register. On x86 (32-bit), you can scan the stack and match again the tree inferred from the interpreter state, but this is slow/error prone. PEP 523 (Py 3.6 onwards) introduced the ability to replace PyEval_EvalFrameDefault with a custom function. This can be done by simply loading a custom python extension in the interpreter. Once you have this, you are free to stash the frame pointer in a non-clobbered register (callee/caller-saved? I'm always confused), before continuing frame execution. At profile collection time, when you are unwinding the stacks, I conjecture that there is enough unwind info to retrieve this register content. If you do try this out, let me know. I'm very curious to know if this works :). vmprof doesn't use it because it is a relatively new thing, and it does require that you ship/use a python extension with your intepreter that injects itself to replace this function. |
vmprof stops at the first PyEvalFrame call and switches entirely to Python |
Thanks for the pointers @nikhilm and @anntzer! I hadn't seen PEP 523 before. Just as a quick update on this, I’ve basically got this working for 64bit Linux and OSX right now. It needs a bunch more cleanup, but I should have a initial version posted here in the next little bit. I think windows should be also relatively straightforward to do with the StackWalk64 API but I haven't started on that yet. I don't want to modify the target program and also want to support profiling older versions of Python - so while using PEP 523 it is a cool idea I don't think it would work here. Like @nikhilm, I also found it tricky to reliably get the PyFrameObject pointer from the call stack with x86_64 calling conventions (it’s passed in to PyEval_EvalFrame* on the RDI register which isn’t unwound =(. So instead I'm getting both the interpreter stack and the native stack - and then merging the two stack's together by replacing the native Like given this python stack:
And this corresponding native stack with native and Cython code:
This code merges them together to produce output that looks like
There are a bunch of details I’m skipping over (matching the OS thread id and the python thread id, demangling Cython etc), and a bunch of other things that still need done - but I don’t see any blocking issues at this stage. I’ll probably write a blog post about this at some point with more details. Also, I’m using libunwind only as a fallback, and instead using gimli to unwind (which seems to be about 10x faster). |
Looks awesome! Exactly what I was thinking. I'm curious about libunwind vs gimli. Are you using the libunwind-sys APIs directly, or the higher level libunwind? I wasn't aware that gimli had a direct unwinding API, i thought it was just a parser. One thing to keep in mind is that macOS often uses compact unwind encoding for unwind information, which afaik is not DWARF. As for using StackWalk64, it may not be the fastest for a profiler, as it was always made for debugging, and tends to want to lookup symbols. I believe this is why Firefox doesn't use it as the first choice in their profiler. On x86_64, there is RtlUnwind on windows, which is supposed to be much faster. On x86, you can do direct rbp/rsp based unwinding, with some considerations for leaf functions. |
For OSX, I wrote my own compact unwind code in Rust - and am using gimli for the dwarf unwinding part. For Linux, gimli doesn't provide a unwind API directly but it's pretty easy to build one on top of it (it's only a couple of hundred lines of code). I'm using libunwind-ptrace for a fallback in case the gimli unwinder fails - I've enabled caching w/ the _Ux86_64_set_caching_policy call, but it is still pretty slow compared to gimli. I'll post the code here soon - I'd be interested in your feedback on it. I hadn't seen RtlUnwind for windows yet- thanks for the heads up! |
I'm glad you have a lot of free time! :D |
Ah! Ok, I've mostly been thinking of in-process profiling, where everything is much faster :) |
I've got a first draft of this here: #85 . For windows, I went with the StackWalk64 api since I don't think that RtlUnwindEx will work with a remote process. It still needs some work, but since it doesn't really impact existing functionality (only is called when you pass edit: you can install the dev release on linux/osx by going |
observations from using
|
Update to latest version of gimli/goblin and others. This will fix an issue we were seeing 'failed to parse file for symbolication' (#2 (comment)) that was caused by the object crate using an older version of goblin.
Update to latest version of gimli/goblin and others. This will fix an issue we were seeing 'failed to parse file for symbolication' (#2 (comment)) that was caused by the object crate using an older version of goblin.
Any progress in this topic? I'm still facing |
So there is one more dev release here: This includes some changes to fix issues on how native/python frames are merged, a small lru cache for symbol lookup to speed things up a bit, and updating the gimli/goblin dependencies to the latest versions which fix errors on failing to parse ELF files (like seen by @belm0) Anyways - I'd appreciate it if people could give it a go, and let me know if they have any problems. I'm probably going to push this to a stable release for linux/windows in a week or so. |
I tried to profile a Pythran extension. pip install pythran
pip install --pre py-spy
git clone https://github.com/fluiddyn/fluidfft
cd fluidfft/bench/pythran_implicit_loop
make create a file from bench import (
myfunc_default as func,
f2d as arr
)
for i in range(1000):
func(arr) cProfile output
py-spy output:
py-spy native mode
With pythran there is also a possibilty to make an intermediate C++ file by running:
And then compiling the extensions using |
I discussed with Pythran's author, and apparently pythran by default strips away symbols. Using an option in
which fixes it. Here is the result ❯❯❯ py-spy --native -- python prof.py (py-spy)
Collecting samples from 'python prof.py' (python v3.7.3)
Total Samples 1100
GIL: 0.00%, Active: 100.00%, Threads: 1
%Own %Total OwnTime TotalTime Function (filename:line)
34.00% 34.00% 2.43s 2.43s std::complex<double>& std::complex<double>::operator+=<double>(std::complex<double> const&) (complex:577)
16.00% 16.00% 0.920s 0.920s square<double> (square.hpp:32)
10.00% 10.00% 1.40s 1.40s square<double> (square.hpp:30)
10.00% 10.00% 1.11s 1.11s get (numpy_iexpr.hpp:475)
6.00% 97.00% 0.960s 9.94s __copy_m<(anonymous namespace)::pythonic::types::const_nditerator<(anonymous namespace)::pythonic::types::numpy_expr<(anonymous namespace)::pythonic::operator_::f
6.00% 6.00% 0.710s 0.710s std::complex<double> std::operator*<double>(std::complex<double> const&, std::complex<double> const&) (complex:799)
4.00% 4.00% 0.610s 0.610s std::complex<double>& std::complex<double>::operator+=<double>(std::complex<double> const&) (complex:576)
3.00% 3.00% 0.500s 0.500s std::complex<double> std::operator/<double>(std::complex<double> const&, double const&) (complex:922)
3.00% 3.00% 0.640s 0.640s __copy_m<(anonymous namespace)::pythonic::types::const_nditerator<(anonymous namespace)::pythonic::types::numpy_expr<(anonymous namespace)::pythonic::operator_::f
2.00% 2.00% 0.580s 0.580s square<double> (square.hpp:33)
2.00% 2.00% 0.110s 0.110s std::complex<double> std::operator*<double>(std::complex<double> const&, std::complex<double> const&) (complex:802)
2.00% 2.00% 0.300s 0.300s square<double> (square.hpp:31)
1.00% 1.00% 0.150s 0.150s std::complex<double>::operator+= (complex:545)
1.00% 1.00% 0.110s 0.110s std::complex<double> std::operator*<double>(std::complex<double> const&, std::complex<double> const&) (complex:801)
0.00% 0.00% 0.000s 0.060s <module> (bench.py:2) |
@ashwinvis Thanks for trying this out! I will make sure that I call out the need for symbols for best results when I release this. If no symbols are present, we try to figure out what the function is from the shared library exports - but in your case none of those functions would have been exported. |
This has been added to v0.2.0 for x86_64 linux and windows - will write a blog post in the next couple of days with some more details |
Blog post is here: https://www.benfrederickson.com/profiling-native-python-extensions-with-py-spy/ - describes the approach and some of the limitations |
@benfred setting |
@ihnorton thanks for the heads up on the emit_linenums option! That's a much better solution, I will update the post to point that as a the recommended way to get the correct line numbers from cython - I think that py-spy will just pull those in automatically if the option is set, but I will verify |
Update to latest version of gimli/goblin and others. This will fix an issue we were seeing 'failed to parse file for symbolication' (benfred/py-spy#2 (comment)) that was caused by the object crate using an older version of goblin.
Right now we are figuring out the call stack of the Python program by inspecting the PyInterpreterState/PyThreadState/PyFrameObject variables found in the target python program. This means we don't get information about time spent in non-python threads and time spent in native extensions to python (like extension code written in Cython or C++)
It could be useful to profile these native extensions in addition to the Python call stacks. It might be possible to use something like libunwind to augment the Python call stack to get this information.
edit: There is a prerelease here
pip install py-spy==0.2.0.dev4
that can profile native extensions of 64-bit Linux/Windows.The text was updated successfully, but these errors were encountered: