Allow a Python exception to be raised without throwing for improved performance #1853
base: master
Conversation
While probably not a big deal in general code, this might result in significant speedups in critical code paths, such as raising an IndexError in __getitem__() implementations to stop iterating -- for example conversion of small arrays or math vector classes to lists (or np.array, for example). Measuring with Magnum's Python bindings and the timeit module, the numbers were as follows (Release build, GCC 9, Py3.7):

    Vector3(1.0, 2.0, 3.0)            0.54144 µs
    list(Vector3(1.0, 2.0, 3.0))      5.25274 µs
    Vector3()[0]  # doesn't throw     0.97913 µs
    Vector3()[3]  # throws            5.78696 µs
    raise IndexError()  # throws      0.15339 µs

After this patch and a change where Vector3's __getitem__ calls PyErr_SetString() and returns an empty py::object instead of throwing py::index_error, the numbers went down considerably:

    Vector3(1.0, 2.0, 3.0)            0.55200 µs
    list(Vector3(1.0, 2.0, 3.0))      1.60032 µs
    Vector3()[0]  # doesn't throw     0.68562 µs
    Vector3()[3]  # throws            0.84481 µs
    raise IndexError()  # throws      0.17830 µs

To be extra sure about the impact, I also tried PyErr_SetString() together with throwing py::error_already_set, as that's also less code being executed, but that led only to a comparably minor improvement (~4 µs, down from 5.8).

An important thing to note: for conversion to np.array and friends, implementing the buffer protocol is a far more performant solution (under a microsecond, compared to around 6 µs when relying on this patch and almost 20 µs when throwing py::index_error). So this change is mainly aimed at list conversions and general iteration.
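The iteration pattern being benchmarked can be sketched in pure Python (this Vector3 is a hypothetical pure-Python stand-in for the bound Magnum class, not the real binding; absolute timings will of course differ):

```python
import timeit

class Vector3:
    """Hypothetical pure-Python stand-in for the bound Magnum Vector3."""

    def __init__(self, x=0.0, y=0.0, z=0.0):
        self._data = (x, y, z)

    def __getitem__(self, i):
        # list() and general iteration call __getitem__ with 0, 1, 2, ...
        # until IndexError is raised; that raise is the hot path this PR
        # makes cheaper on the C++ side.
        if not 0 <= i < 3:
            raise IndexError("index out of range")
        return self._data[i]

print(list(Vector3(1.0, 2.0, 3.0)))  # [1.0, 2.0, 3.0]
print(timeit.timeit(lambda: list(Vector3(1.0, 2.0, 3.0)), number=10_000))
```

Note that the terminating IndexError is raised on every single conversion to a list, which is why its cost dominates these loops.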
9ee28a0 to 60ebd78
This PR makes two tests fail, both expecting the … I'd argue this is fine (the exception still points directly to the problem) and the expected exceptions in those two tests should be updated. Furthermore this might mean that the …
Prior to this commit throwing error_already_set was expensive due to the eager construction of the error string (which required traversing the Python stack). See pybind#1853 for more context and an alternative take on the issue.
Prior to this commit throwing error_already_set was expensive due to the eager construction of the error string (which required traversing the Python stack). See pybind#1853 for more context and an alternative take on the issue. Note that error_already_set no longer inherits from std::runtime_error because the latter has no default constructor.
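The lazy-construction idea from this commit message can be sketched in a few lines of Python (LazyError is a made-up class mirroring the C++ change, not pybind11 code):

```python
class LazyError(Exception):
    """Capture cheap state at raise time; build the message only on demand."""

    def __init__(self, code):
        super().__init__()
        self.code = code
        self._message = None  # expensive string not built yet

    def __str__(self):
        if self._message is None:
            # In the C++ original this is where the Python stack would be
            # traversed to build the error string -- deferred until needed.
            self._message = f"error code {self.code}"
        return self._message

err = LazyError(42)  # cheap to construct and throw
print(str(err))      # message is built here, on first use
```

The point of the commit is the same: code that throws and catches the exception without ever asking for the message never pays for building it.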
   in critical code paths. In that case we arrive here with
   non-null PyErr_Occurred(), so keep that exception instead of
   overwriting it with another. */
if (!PyErr_Occurred()) {
I need to look at this more carefully and think about it.
First impression:
- we're building a surprising bypass here for the sake of performance.
- it employs a side effect: the wrapped function manipulating the global Python error state.
We're definitely setting ourselves up for quicksand experiences: people think they are on firm ground, but a bug elsewhere makes the Python error state invalid, and it only manifests through this bypass as an off-topic error that could take hours of debugging to understand. (I've been in similar situations many times.)
Is the benefit of this optimization so big that we're ok accepting that danger ("at scale" it is pretty much certain that people will run into it)?
Maybe I should add: if performance really matters, why do you have this code path in your loop?
Of course it's nice if it's fast.
But there is an opportunity cost to optimizations like this PR: things get more complicated and fragile.
Now we're in the realm of philosophy:
- Keep pybind11 lean and straightforward so that it is as good as it can be at wrapping with ease.
- Tell users: if you have a performance critical loop, put your data into a C++ array (that you can easily wrap) and have your performance critical loop in C++, taking Python completely out of that loop, literally.
So there are clear "performance really doesn't matter" and "performance really does matter" regimes where everybody is in agreement and happy, but a gray area in the middle, where it's not quite important enough for users to bite the bullet and code up the C++ array they'd need.
What's better globally over a long period of time for all users in aggregate?
- Having N bypasses like this in pybind11 to get more of that gray area?
- A leaner and less fragile pybind11?
Tagging #2760 and @jblespiau who is maybe more on the N bypasses side of the spectrum than I am.
@rwgk This would also allow better error messages from type casters.
Drive-by response:
This would also allow better error messages from type casters.
That would change my cost-benefit equation (previously I was only seeing "performance" as the benefit).
A PR that generates better error messages, based on this PR, may be convincing.
Tagging #2760 and @jblespiau who is maybe more on the N bypasses side of the spectrum than I am.
The bypass I was suggesting in the past was mostly about not performing a dynamic look-up every time we do a C++-to-Python or Python-to-C++ conversion. I don't really consider this a "bypass", but more of a cache: if we know the C++ type "A" maps to a given Python type, there is no need to do the look-up in the hash map every single time to retrieve the same object again and again. It's even more obvious when you convert an std::vector<A>: you will do as many look-ups as there are items in the vector. I was suggesting to add the keyword "static" on the look-up line, and you get a huge speed-up. The only possible issue is if people unregister and re-register new types on the same symbol, but I don't even think that's possible (you cannot unregister a symbol as far as I know).
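The caching idea described here can be illustrated in Python terms (registry and cached_lookup are made-up stand-ins for pybind11's internal type map, which lives in C++; lru_cache plays the role of the proposed function-local static):

```python
from functools import lru_cache

registry = {"A": "PyTypeA", "B": "PyTypeB"}  # stand-in for the type map
lookup_count = 0

@lru_cache(maxsize=None)  # like C++ `static`: do the look-up once, reuse it
def cached_lookup(name):
    global lookup_count
    lookup_count += 1  # visible cost of each real registry query
    return registry[name]

# Converting a vector of N items of type A triggers N conversions, but
# only the first one pays for the look-up.
for _ in range(1000):
    cached_lookup("A")
print(lookup_count)  # 1
```

The caveat raised above applies equally here: the cache goes stale if the registry is ever mutated for an already-cached key.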
Another difference with the current CL is that I was suggesting to improve the normal path, not an exceptional one.
Specifically for the current PR, I would probably need to think more, but I am thinking that:
- Exceptions are known to be slow in Python. Exceptions should be exceptional, and not be the standard way to return something.
- Most of the time, after an exception, we crash.
- Given they are expected to be exceptional, why do we want to optimize it in the first place?
So I am wondering (a) why try to optimize something that should not happen in normal circumstances, and (b) why, according to the benchmark, it has any impact on the non-raising path: if no exception is raised, it shouldn't change the runtime, right?
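One piece of context for the "exceptions should be exceptional" point: CPython's legacy sequence-iteration protocol terminates by raising IndexError, so a bound __getitem__ raises once per complete traversal. The raise is part of the normal path, not a rare failure. A minimal illustration (Legacy is a made-up name):

```python
class Legacy:
    """Iterable via the legacy sequence protocol: no __iter__, only __getitem__."""

    def __getitem__(self, i):
        if i >= 3:
            # This raise is how iteration *normally* ends under the old
            # protocol -- it happens on every full traversal.
            raise IndexError
        return i * 10

print(list(Legacy()))  # [0, 10, 20]
```

This is why the benchmark's raising path affects ordinary list() conversions rather than only error cases.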
@rwgk I really like this handling though, because it adds an extra escape hatch in cases where the user may forget to throw an error. Also, while PyErr_SetString is exceptional and really shouldn't be used except for advanced usages, throwing an error when null is returned to Python makes sense and could catch errors.
Such things cause more surprises (and a lot of lost human time, to contrast that with machine optimization) than one would think. I don't feel comfortable with this change, but if it gets enough support from others I'll be fine with it.
Could also simplify issues like: #3336
* error_already_set::what() is now constructed lazily

  Prior to this commit throwing error_already_set was expensive due to the eager construction of the error string (which required traversing the Python stack). See #1853 for more context and an alternative take on the issue. Note that error_already_set no longer inherits from std::runtime_error because the latter has no default constructor.

* Do not attempt to normalize if no exception occurred

  This is not supported on PyPy-2.7 5.8.0.

* Extract exception name via tp_name

  This is faster than dynamically looking up __name__ via GetAttrString. Note though that the runtime of the code throwing an error_already_set will be dominated by stack unwinding so the improvement will not be noticeable.

  Before: 396 ns ± 0.913 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
  After: 277 ns ± 0.549 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

  Benchmark:

      const std::string foo() {
          PyErr_SetString(PyExc_KeyError, "");
          const std::string &s = py::detail::error_string();
          PyErr_Clear();
          return s;
      }

      PYBIND11_MODULE(foo, m) { m.def("foo", &::foo); }

* Reverted error_already_set to subclass std::runtime_error

* Revert "Extract exception name via tp_name"

  The implementation of __name__ is slightly more complex than that. It handles the module name prefix, and heap-allocated types. We could port it to pybind11 later on but for now it seems like overkill.

  This reverts commit f1435c7.

* Cosmit following @YannickJadoul's comments

  Note that detail::error_string() no longer calls PyException_SetTraceback as it is unnecessary for pretty-printing the exception.
* Fixed PyPy build
* Moved normalization to error_already_set ctor
* Fix merge bugs
* Fix more merge errors
* Improve formatting
* Improve error message in rare case
* Revert back if statements
* Fix clang-tidy
* Try removing mutable
* Does build_mode release fix it
* Set to Debug to expose segfault
* Fix remove set error string
* Do not run error_string() more than once
* Trying setting the traceback to the value
* guard if m_type is null
* Try to debug PGI
* One last try for PGI
* Does reverting this fix PyPy
* Reviewer suggestions
* Remove unnecessary initialization
* Add noexcept move and explicit fail throw
* Optimize error_string creation
* Fix typo
* Revert noexcept
* Fix merge conflict error
* Abuse assignment operator
* Revert operator abuse
* See if we still need debug
* Remove unnecessary mutable
* Report "FATAL failure building pybind11::error_already_set error_string" and terminate process.
* Try specifying noexcept again
* Try explicit ctor
* default ctor is noexcept too
* Apply reviewer suggestions, simplify code, and make helper method private
* Remove unnecessary include
* Clang-Tidy fix
* detail::obj_class_name(), fprintf with [STDERR], [STDOUT] tags, polish comments
* consistently check m_lazy_what.empty() also in production builds
* Make a comment slightly less ambiguous.
* Bug fix: Remove `what();` from `restore()`.

  It sure would need to be guarded by `if (m_type)`, otherwise `what()` fails and masks that no error was set (see updated unit test). But since `error_already_set` is copyable, there is no point in releasing m_type, m_value, m_trace; therefore we can just as well avoid the runtime overhead of force-building `m_lazy_what`, as it may never be used.

* Replace extremely opaque (unhelpful) error message with a truthful reflection of what we know.
* Fix clang-tidy error [performance-move-constructor-init].
* Make expected error message less specific.
* Various changes.
* bug fix: error_string(PyObject **, ...)
* Putting back the two eager PyErr_NormalizeException() calls.
* Change error_already_set() to call pybind11_fail() if the Python error indicator is not set.

  The net result is that a std::runtime_error is thrown instead of error_already_set, but all tests pass as is.

* Remove mutable (fixes oversight in the previous commit).
* Normalize the exception only locally in error_string().

  Python 3.6 & 3.7 test failures expected. This is meant for benchmarking, to determine if it is worth the trouble looking into the failures.

* clang-tidy: use auto
* Use `gil_scoped_acquire_local` in `error_already_set` destructor. See long comment.
* For Python < 3.8: `PyErr_NormalizeException` before `PyErr_WriteUnraisable`
* Go back to replacing the held Python exception with the normalized exception, if & when needed. Consistently document the side effect.
* Slightly rewording comment. (There were also other failures.)
* Add 1-line comment for obj_class_name()
* Benchmark code, with results in this commit message.

      function                     #calls   test time [s]   µs / call    PR / master
      master
        pure_unwind                729540   1.061           14.539876
        err_set_unwind_err_clear   681476   1.040           15.260282
        err_set_error_already_set  508038   1.049           20.640525
        error_already_set_restore  555578   1.052           18.933288
        pr1895_original_foo        244113   1.050           43.018168
      PR #1895
        pure_unwind                736981   1.054           14.295685    98.32%
        err_set_unwind_err_clear   685820   1.045           15.237399    99.85%
        err_set_error_already_set  661374   1.046           15.811879    76.61%
        error_already_set_restore  669881   1.048           15.645176    82.63%
        pr1895_original_foo        318243   1.059           33.290806    77.39%

  master @ commit ad146b2

      Running tests in directory "/usr/local/google/home/rwgk/forked/pybind11/tests":
      ============================= test session starts ==============================
      platform linux -- Python 3.9.10, pytest-6.2.3, py-1.10.0, pluggy-0.13.1 -- /usr/bin/python3
      cachedir: .pytest_cache
      rootdir: /usr/local/google/home/rwgk/forked/pybind11/tests, configfile: pytest.ini
      collecting ...
      collected 5 items
      test_perf_error_already_set.py::test_perf[pure_unwind] PERF pure_unwind,729540,1.061,14.539876 PASSED
      test_perf_error_already_set.py::test_perf[err_set_unwind_err_clear] PERF err_set_unwind_err_clear,681476,1.040,15.260282 PASSED
      test_perf_error_already_set.py::test_perf[err_set_error_already_set] PERF err_set_error_already_set,508038,1.049,20.640525 PASSED
      test_perf_error_already_set.py::test_perf[error_already_set_restore] PERF error_already_set_restore,555578,1.052,18.933288 PASSED
      test_perf_error_already_set.py::test_perf[pr1895_original_foo] PERF pr1895_original_foo,244113,1.050,43.018168 PASSED
      ============================== 5 passed in 12.38s ==============================

  pr1895 @ commit 8dff51d

      Running tests in directory "/usr/local/google/home/rwgk/forked/pybind11/tests":
      ============================= test session starts ==============================
      platform linux -- Python 3.9.10, pytest-6.2.3, py-1.10.0, pluggy-0.13.1 -- /usr/bin/python3
      cachedir: .pytest_cache
      rootdir: /usr/local/google/home/rwgk/forked/pybind11/tests, configfile: pytest.ini
      collecting ...
      collected 5 items
      test_perf_error_already_set.py::test_perf[pure_unwind] PERF pure_unwind,736981,1.054,14.295685 PASSED
      test_perf_error_already_set.py::test_perf[err_set_unwind_err_clear] PERF err_set_unwind_err_clear,685820,1.045,15.237399 PASSED
      test_perf_error_already_set.py::test_perf[err_set_error_already_set] PERF err_set_error_already_set,661374,1.046,15.811879 PASSED
      test_perf_error_already_set.py::test_perf[error_already_set_restore] PERF error_already_set_restore,669881,1.048,15.645176 PASSED
      test_perf_error_already_set.py::test_perf[pr1895_original_foo] PERF pr1895_original_foo,318243,1.059,33.290806 PASSED
      ============================== 5 passed in 12.40s ==============================

      clang++ -o pybind11/tests/test_perf_error_already_set.os -c -std=c++17 -fPIC -fvisibility=hidden -Os -flto -Wall -Wextra -Wconversion -Wcast-qual -Wdeprecated -Wnon-virtual-dtor -Wunused-result -isystem /usr/include/python3.9 -isystem /usr/include/eigen3 -DPYBIND11_STRICT_ASSERTS_CLASS_HOLDER_VS_TYPE_CASTER_MIX -DPYBIND11_TEST_BOOST -Ipybind11/include -I/usr/local/google/home/rwgk/forked/pybind11/include -I/usr/local/google/home/rwgk/clone/pybind11/include /usr/local/google/home/rwgk/forked/pybind11/tests/test_perf_error_already_set.cpp
      clang++ -o lib/pybind11_tests.so -shared -fPIC -Os -flto -shared ...
      Debian clang version 13.0.1-3+build2
      Target: x86_64-pc-linux-gnu
      Thread model: posix

* Changing call_repetitions_target_elapsed_secs to 0.1 for regular unit testing.
* Adding in `recursion_depth`
* Optimized ctor
* Fix silly bug in recurse_first_then_call()
* Add tests that have equivalent PyErr_Fetch(), PyErr_Restore() but no try-catch.
* Add call_error_string to tests. Sample only recursion_depth 0, 100.
* Show lazy-what speed-up in percent.
* Include real_work in benchmarks.
* Replace all PyErr_SetString() with generate_python_exception_with_traceback()
* Better organization of test loops.
* Add test_error_already_set_copy_move
* Fix bug in newly added test (discovered by clang-tidy): actually use move ctor
* MSVC detects the unreachable return
* change test_perf_error_already_set.py back to quick mode
* Inherit from std::exception (instead of std::runtime_error, which does not make sense anymore with the lazy what)
* Special handling under Windows.
* print with leading newline
* Removing test_perf_error_already_set (copies are under rwgk/rwgk_tbx@7765113).
* Avoid gil and scope overhead if there is nothing to release.
* Restore default move ctor. "member function" instead of "function" (note that "method" is Python terminology).
* Delete error_already_set copy ctor.
* Make restore() non-const again to resolve clang-tidy failure (still experimenting).
* Bring back error_already_set copy ctor, to see if that resolves the 4 MSVC test failures.
* Add noexcept to error_already_set copy & move ctors (as suggested by @Skylion007 IIUC).
* Trying one-by-one noexcept copy ctor for old compilers.
* Add back test covering copy ctor. Add another simple test that exercises the copy ctor.
* Exclude more older compilers from using the noexcept = default ctors. (The tests in the previous commit exposed that those are broken.)
* Factor out & reuse gil_scoped_acquire_local as gil_scoped_acquire_simple
* Guard gil_scoped_acquire_simple by _Py_IsFinalizing() check.
* what() GIL safety
* clang-tidy & Python 3.6 fixes
* Use `gil_scoped_acquire` in dtor, copy ctor, `what()`. Remove `_Py_IsFinalizing()` checks (they are racy: python/cpython#28525).
* Remove error_scope from copy ctor.
* Add `error_scope` to `get_internals()`, to cover the situation that `get_internals()` is called from the `error_already_set` dtor while a new Python error is in flight already. Also backing out `gil_scoped_acquire_simple` change.
* Add `FlakyException` tests with failure triggers in `__init__` and `__str__`

  THIS IS STILL A WORK IN PROGRESS. This commit is only an important resting point.
  This commit is a first attempt at addressing the observation that `PyErr_NormalizeException()` completely replaces the original exception if `__init__` fails. This can be very confusing even in small applications, and extremely confusing in large ones.

* Tweaks to resolve Py 3.6 and PyPy CI failures.
* Normalize Python exception immediately in error_already_set ctor.

  For background see: #1895 (comment)

* Fix oversights based on CI failures (copy & move ctor initialization).
* Move @pytest.mark.xfail("env.PYPY") after @pytest.mark.parametrize(...)
* Use @pytest.mark.skipif (xfail does not work for segfaults, of course).
* Remove unused obj_class_name_or() function (it was added only under this PR).
* Remove already obsolete C++ comments and code that were added only under this PR.
* Slightly better (newly added) comments.
* Factor out detail::error_fetch_and_normalize.

  Preparation for producing identical results from error_already_set::what() and detail::error_string(). Note that this is a very conservative refactoring. It would be much better to first move detail::error_string into detail/error_string.h

* Copy most of error_string() code to new error_fetch_and_normalize::complete_lazy_error_string()
* Remove all error_string() code from detail/type_caster_base.h.

  Note that this commit includes a subtle bug fix: previously error_string() restored the Python error, which will upset pybind11_fail(). This never was a problem in practice because the two PyType_Ready() calls in detail/class.h do not usually fail.

* Return const std::string& instead of const char * and move error_string() to pytypes.h
* Remove gil_scope_acquire from error_fetch_and_normalize, add back to error_already_set
* Better handling of FlakyException __str__ failure.
* Move error_fetch_and_normalize::complete_lazy_error_string() implementation from pybind11.h to pytypes.h
* Add error_fetch_and_normalize::release_py_object_references() and use from error_already_set dtor.
* Use shared_ptr for m_fetched_error => 1. non-racy, copy ctor that does not need the GIL; 2. enables guard against duplicate restore() calls.
* Add comments.
* Trivial renaming of a newly introduced member function.
* Workaround for PyPy
* Bug fix (oversight). Only valgrind got this one.
* Use shared_ptr custom deleter for m_fetched_error in error_already_set. This enables removing the dtor, copy ctor, move ctor completely.
* Further small simplification. With the GIL held, simply deleting the raw_ptr takes care of everything.
* IWYU cleanup

  ```
  iwyu version: include-what-you-use 0.17 based on Debian clang version 13.0.1-3+build2
  ```

  Command used:

  ```
  iwyu -c -std=c++17 -DPYBIND11_TEST_BOOST -Iinclude/pybind11 -I/usr/include/python3.9 -I/usr/include/eigen3 include/pybind11/pytypes.cpp
  ```

  pytypes.cpp is a temporary file: `#include "pytypes.h"`

  The raw output is very long and noisy. I decided to use `#include <cstddef>` instead of `#include <cstdio>` for `std::size_t` (iwyu sticks to the manual choice). I ignored all iwyu suggestions that are indirectly covered by `#include <Python.h>`. I manually verified that all added includes are actually needed.

Co-authored-by: Aaron Gokaslan <[email protected]>
Co-authored-by: Ralf W. Grosse-Kunstleve <[email protected]>
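The ownership semantics of the shared_ptr-with-custom-deleter design can be mimicked in Python with shared references plus a finalizer (FetchedState is a made-up name; the real deleter releases the held Python error references with the GIL held):

```python
import weakref

class FetchedState:
    """Stand-in for the normalized error state held by error_already_set."""

cleanup_runs = []

a = FetchedState()
# The finalizer plays the role of the shared_ptr custom deleter: it runs
# exactly once, when the last owner goes away.
weakref.finalize(a, cleanup_runs.append, "released")

b = a          # copies share the state, like copies of a shared_ptr
del a          # not the last owner: no cleanup yet
print(cleanup_runs)  # []
del b          # last owner gone: cleanup runs exactly once
print(cleanup_runs)  # ['released']
```

This mirrors why the commit above could delete the hand-written dtor, copy ctor, and move ctor: shared ownership makes copies trivial and guarantees single-shot cleanup.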
While probably not a big deal in general code, this patch might result in significant speedups in critical code paths, such as raising an IndexError in __getitem__() implementations to stop iterating -- for example iterating over / conversion of small arrays or math vector classes to lists (or to np.array). Measuring with Magnum's Python bindings and the timeit module, the numbers before this were as follows (Release build, GCC 9, Py3.7, benchmark code for reference):

    Vector3(1.0, 2.0, 3.0)            0.54144 µs
    list(Vector3(1.0, 2.0, 3.0))      5.25274 µs
    Vector3()[0]  # doesn't throw     0.97913 µs
    Vector3()[3]  # throws            5.78696 µs
    raise IndexError()  # throws      0.15339 µs

After this patch and a change where Vector3's __getitem__() calls PyErr_SetString() and returns an empty py::object instead of throwing py::index_error, the numbers went down considerably:

    Vector3(1.0, 2.0, 3.0)            0.55200 µs
    list(Vector3(1.0, 2.0, 3.0))      1.60032 µs
    Vector3()[0]  # doesn't throw     0.68562 µs
    Vector3()[3]  # throws            0.84481 µs
    raise IndexError()  # throws      0.17830 µs

To be extra sure about the impact, I also tried PyErr_SetString() together with throwing py::error_already_set, as that's also less code being executed, but that led only to a comparably minor improvement (~4 µs, down from 5.8).

An important thing to note is that for conversion to np.array and friends, implementing the buffer protocol is a far more performant solution (under a microsecond, compared to around 6 µs when relying on this patch and almost 20 µs when throwing py::index_error, as numpy seems to be hitting the boundary condition three times when converting from iterables). So this change is mainly aimed at list conversions and general iteration where perf matters.

This is my first real PR to this project and while I tried to follow the code style and added a test case, I most probably missed something important (even though I don't think this particular change should break anyone's use case) :) Let me know if you need more information or more testing / benchmarks. Happy to address your feedback. Thank you!
, as numpy seems to be hitting the boundary condition three times when converting from iterables). So this change is mainly aimed at list conversions and general iteration where perf matters.This is my first real PR to this project and while I tried to follow code style and added a test case, I most probably missed something important (even though I don't think this particular change should break anyone's use case) :) Let me know if you need more information or more testing / benchmarks. Happy to address your feedback. Thank you!