This profiler is sponsored by my book on writing fast low-level code in Python, which uses Numba for most of its examples.
Here's what Profila output looks like:
```
$ python -m profila annotate -- scripts_for_tests/simple.py

# Total samples: 328 (54.9% non-Numba samples, 1.8% bad samples)

## File `/home/itamarst/devel/profila/scripts_for_tests/simple.py`

Lines 10 to 15:

  0.3% |     for i in range(len(timeseries)):
       |         # This should be the most expensive line:
 38.7% |         result[i] = (7 + timeseries[i] / 9 + (timeseries[i] ** 2) / 7) / 5
       |     for i in range(len(result)):
       |         # This should be cheaper:
  4.3% |         result[i] -= 1
```
You can also use it with Jupyter!
Beyond this README, you can also read this introductory article with a more detailed example and explanations.
TL;DR limitations: Linux only, and only single-threaded Numba code can be profiled at the moment; parallel functions are not yet supported.
Currently Profila works on Linux only.
- On macOS you can use Docker, Podman, or a Linux VM (a minimal Docker sketch follows this list).
- On Windows you can use Docker, Podman, or probably WSL2.
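If you go the container route, here is a rough sketch of what a Docker-based setup could look like; the image name is just an example, and the `--cap-add=SYS_PTRACE` flag is a guess at what `gdb` may need in order to attach to processes inside the container, depending on your Docker configuration:

```
$ docker run -it --rm --cap-add=SYS_PTRACE -v "$PWD":/src -w /src python:3.12-slim bash
# Inside the container:
apt-get update && apt-get install -y gdb
pip install profila
python -m profila annotate -- yourscript.py
```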
You'll need `gdb` installed.
On Ubuntu or Debian you can do:

```
apt-get install gdb
```

On RedHat-based systems:

```
dnf install gdb
```
Install this library using `pip`:

```
pip install profila
```
First, before you import `numba`, you should:

```
%load_ext profila
```
Then define your functions as usual:

```python
from numba import njit

@njit
def myfunc(arr):
    # ... your code here ...
```
You probably want to call your Numba function at least once, so profiling doesn't measure compilation time:
```python
myfunc(DATA)
```
Then, you can profile a specific cell using the `%%profila` magic, e.g.:

```python
%%profila
# Make sure we run this enough to get good measurements:
for i in range(100):
    myfunc(DATA)
```
If you usually run your script like this:

```
$ python yourscript.py --arg1=200
```

Instead run it like this:

```
$ python -m profila annotate -- yourscript.py --arg1=200
```
Sampling is done every 10 milliseconds, so you need to make sure your Numba code runs for a sufficiently long time. For example, you can run your function in a loop until a number of seconds has passed:
```python
from time import time

@njit
def myfunc():
    ...  # your Numba code here

start = time()
# Run for 3 seconds:
while (time() - start) < 3:
    myfunc()
```
- Parallel Numba code will not be profiled correctly; at the moment only single-threaded profiling is supported.
- GPU (CUDA) code is not profiled.
Beyond that:
Compilers like Numba run optimization passes that transform the code to make it faster. That means the running code doesn't necessarily map one-to-one to the original source; different lines might be combined, for example.
As far as I can tell Numba does give you a reasonable mapping, but you can't assume the source code maps one-to-one to the executed code.
In order to profile, additional info needs to be added during compilation; specifically, the `NUMBA_DEBUGINFO` environment variable is set.
This might change runtime characteristics slightly, because it increases the memory size of the compiled code.
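If you want to check whether the extra debug info affects your timings, you can set the same environment variable yourself and run your script normally, without Profila; the following is just a sketch, assuming a value of `1` is enough to enable the debug info:

```
$ NUMBA_DEBUGINFO=1 python yourscript.py --arg1=200
```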
Instruction-level parallelism, branch mispredictions, SIMD, and the CPU memory caches all have a significant impact on runtime performance, but they don't show up in profiling. If you want to learn more, I'm writing a book about this.
Bug fixes:
- Run Python using `sys.executable`, so it works in more environments. Thanks to Jeremiah England for the bug report.
Added support for Jupyter profiling.
Initial release.