
Allow a Python exception to be raised without throwing for improved performance #1853

Open
mosra wants to merge 4 commits into master from raise-exception-without-throwing

Conversation

@mosra (Contributor) commented Jul 21, 2019

While probably not a big deal in general code, this patch might result in significant speedups in critical code paths, such as raising an IndexError in __getitem__() implementations to stop iteration -- for example when iterating over small arrays or math vector classes, or converting them to lists (or to np.array).

Measuring with Magnum's Python bindings and the timeit module, the numbers before this were as follows (Release build, GCC 9, Py3.7, benchmark code for reference):

    Vector3(1.0, 2.0, 3.0)                        0.54144 µs
    list(Vector3(1.0, 2.0, 3.0))                  5.25274 µs
    Vector3()[0] # doesn't throw                  0.97913 µs
    Vector3()[3] # throws                         5.78696 µs
    raise IndexError() # throws                   0.15339 µs

After this patch and a change where Vector3's __getitem__() calls PyErr_SetString() and returns an empty py::object instead of throwing py::index_error, the numbers went down considerably:

    Vector3(1.0, 2.0, 3.0)                        0.55200 µs
    list(Vector3(1.0, 2.0, 3.0))                  1.60032 µs
    Vector3()[0] # doesn't throw                  0.68562 µs
    Vector3()[3] # throws                         0.84481 µs
    raise IndexError() # throws                   0.17830 µs

To be extra sure about the impact, I also tried PyErr_SetString() together with throwing py::error_already_set, as that's also less code being executed, but that led to only a comparatively minor improvement (~4 µs, down from 5.8 µs).

An important thing to note is that for conversion to np.array and friends, implementing the buffer protocol is a far more performant solution (under a microsecond, compared to around 6 µs when relying on this patch and almost 20 µs when throwing py::index_error; numpy seems to hit the boundary condition three times when converting from iterables). So this change is mainly aimed at list conversions and general iteration where perf matters.
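
For illustration, here is a minimal sketch of the pattern described above (a hypothetical Vector3 binding, not the actual Magnum code), assuming this patch is applied: __getitem__ reports an out-of-range index by setting the Python error indicator and returning an empty py::object instead of throwing py::index_error.

```cpp
#include <pybind11/pybind11.h>
#include <cstddef>

namespace py = pybind11;

struct Vector3 { float data[3]{}; };

PYBIND11_MODULE(example, m) {
    py::class_<Vector3>(m, "Vector3")
        .def(py::init<>())
        .def("__len__", [](const Vector3 &) { return std::size_t{3}; })
        .def("__getitem__", [](const Vector3 &v, std::size_t i) -> py::object {
            if (i >= 3) {
                /* Set the Python error indicator directly... */
                PyErr_SetString(PyExc_IndexError, "index out of range");
                /* ...and return an empty object: no C++ throw, no unwinding. */
                return py::object();
            }
            return py::float_(v.data[i]);
        });
}
```

With this, list(Vector3()) stops at the index-3 boundary without paying for a C++ exception on the final iteration step.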


This is my first real PR to this project, and while I tried to follow the code style and added a test case, I most probably missed something important (even though I don't think this particular change should break anyone's use case) :) Let me know if you need more information or more testing / benchmarks. I'm happy to address your feedback. Thank you!

@mosra mosra force-pushed the raise-exception-without-throwing branch from 9ee28a0 to 60ebd78 on July 21, 2019 21:13
@mosra (Contributor, Author) commented Jul 21, 2019

This PR makes two tests fail, both expecting the Unable to convert function return value to a Python type! error, but now getting the original TypeError('Unregistered type : std::vector<float, std::allocator<float> >') or TypeError('Unregistered type : test_submodule_class_(module&)::NotRegistered') set by code executed earlier.

I'd argue this is fine (the exception still points directly to the problem) and the expected exceptions in those two tests should be updated. Furthermore, this might mean that the append_note_if_missing_header_is_suspected lambda could be moved to the place where the original Unregistered type error gets set, so that the Unable to convert function return value code path would not be needed at all? Or is there another case where this could be hit without some pre-existing error being set?

superbobry added a commit to superbobry/pybind11 that referenced this pull request on Aug 28, 2019 (three referencing commits carry the same message):

Prior to this commit throwing error_already_set was expensive due to the
eager construction of the error string (which required traversing the
Python stack). See pybind#1853 for more context and an alternative take on the
issue.

Note that error_already_set no longer inherits from std::runtime_error
because the latter has no default constructor.
A review thread was opened on this hunk of the patch:

```cpp
   in critical code paths. In that case we arrive here with
   non-null PyErr_Occurred(), so keep that exception instead of
   overwriting it with another. */
if (!PyErr_Occurred()) {
```
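
For context, here is a paraphrased, self-contained sketch of the behaviour that hunk introduces (the finalize_call helper is hypothetical, not the verbatim pybind11 dispatcher): when the wrapped function signals failure, the generic conversion error is raised only if no Python error is set already.

```cpp
#include <Python.h>

/* Hypothetical helper for illustration only; "value" stands in for the result
   of casting the wrapped function's return value. If that cast failed and the
   callee already set a Python error (e.g. an IndexError via PyErr_SetString),
   that error is kept instead of being overwritten by the generic TypeError. */
PyObject *finalize_call(PyObject *value) {
    if (value == nullptr && !PyErr_Occurred()) {
        PyErr_SetString(PyExc_TypeError,
                        "Unable to convert function return value to a Python type!");
    }
    return value; /* nullptr propagates whichever error is now set */
}
```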
Collaborator

I need to look at this more carefully and think about it.
First impression:

  • We're building a surprising bypass here for the sake of performance.
  • It employs a side effect: the wrapped function manipulating the global Python error state.
    We're definitely setting ourselves up for quicksand experiences: people think they are on firm ground, but a bug elsewhere makes the Python error state invalid, and it only manifests through this bypass as an off-topic error that could take hours of debugging to understand. (I've been in similar situations many times.)
    Is the benefit of this optimization so big that we're OK accepting that danger ("at scale" it is pretty much certain that people will run into it)?

@rwgk (Collaborator) commented Aug 9, 2021

Maybe I should add: if performance really matters, why do you have this code path in your loop?
Of course it's nice if it's fast.
But there is an opportunity cost to optimizations like this PR: things get more complicated and fragile.
Now we're in the realm of philosophy:

  • Keep pybind11 lean and straightforward so that it is as good as it can be at wrapping with ease.
  • Tell users: if you have a performance critical loop, put your data into a C++ array (that you can easily wrap) and have your performance critical loop in C++, taking Python completely out of that loop, literally.

So there are clear "performance really doesn't matter" and "performance really does matter" regimes where everybody is in agreement and happy, but a gray area in the middle, where it's not quite important enough for users to bite the bullet and code up the C++ array they'd need.
What's better globally over a long period of time for all users in aggregate?

  • Having N bypasses like this in pybind11 to get more of that gray area?
  • A leaner and less fragile pybind11?

@rwgk (Collaborator) commented Aug 9, 2021

Tagging #2760 and @jblespiau who is maybe more on the N bypasses side of the spectrum than I am.

Collaborator

@rwgk This would also allow better error messages from type casters.

Collaborator

Drive-by response:

> This would also allow better error messages from type casters.

That would change my cost-benefit equation (previously I was only seeing "performance" as the benefit).

A PR that generates better error messages, based on this PR, may be convincing.

@jblespiau (Contributor) commented Jan 7, 2022

> Tagging #2760 and @jblespiau who is maybe more on the N bypasses side of the spectrum than I am.

The bypass I was suggesting in the past was mostly about not performing a dynamic look-up every time we do a C++ to Python or Python to C++ conversion. I don't really consider this a "bypass" but rather a cache: if we know the C++ type "A" maps to a given Python type, there is no need to do the hash-map look-up every single time to retrieve the same object again and again. It's even more obvious when you convert a std::vector<A>, where you do as many look-ups as there are items in the vector. I was suggesting adding the keyword "static" on the look-up line, which gives a huge speed-up (see the sketch below). The only possible issue is if people unregister and re-register new types on the same symbol, but I don't think that's even possible (you cannot unregister a symbol as far as I know).
Another difference with the current CL is that I was suggesting to improve the normal path, not an exceptional one.
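
A minimal sketch of that caching idea (illustrative only; A and cached_type_of_A are hypothetical, and this is not pybind11's actual internal registry code):

```cpp
#include <pybind11/pybind11.h>

namespace py = pybind11;

struct A {};  /* assume A is registered elsewhere via py::class_<A>(m, "A") */

/* The registry lookup happens once, on first use; every later C++ -> Python
   conversion of A can reuse the cached Python type object. This is only safe
   if the type is never unregistered and re-registered. */
py::type cached_type_of_A() {
    static const py::type t = py::type::of<A>();
    return t;
}
```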

Specifically for the current PR, I would probably need to think more, but I am thinking that:

  • Exceptions are known to be slow in Python. Exceptions should be exceptional, and not be the standard way to return something.
  • Most of the time, after an exception, we crash.
  • Given they are expected to be exceptional, why do we want to optimize it in the first place?

So I am wondering (a) why we would try to optimize something that should not happen in normal circumstances, and (b) why, according to the benchmark, it has any impact on the non-raising path: if no exception is raised, it shouldn't change the runtime, right?

@Skylion007 (Collaborator)
@rwgk I really like this handling, though, because it adds an extra escape hatch in cases where the user may forget to throw an error. Also, while PyErr_SetString is exceptional and really shouldn't be used except for advanced usages, throwing an error when null is returned to Python makes sense and could catch errors.

@rwgk (Collaborator) commented Oct 9, 2021

> extra escape hatch

Such things cause more surprises (and a lot of lost human time, to contrast that with machine optimization) than one would think.

I don't feel comfortable with this change, but if it gets enough support from others I'll be fine with it.

@Skylion007 (Collaborator)
Could also simplify issues like: #3336

@Skylion007 (Collaborator)
This would also fix #3600. Currently, we have no way of raising anything other than a TypeError in a caster. Even being able to specify the string of that TypeError would help a lot. @rwgk

rwgk pushed a commit that referenced this pull request Jun 2, 2022
* error_already_set::what() is now constructed lazily

Prior to this commit throwing error_already_set was expensive due to the
eager construction of the error string (which required traversing the
Python stack). See #1853 for more context and an alternative take on the
issue.

Note that error_already_set no longer inherits from std::runtime_error
because the latter has no default constructor.

* Do not attempt to normalize if no exception occurred

This is not supported on PyPy-2.7 5.8.0.

* Extract exception name via tp_name

This is faster than dynamically looking up __name__ via GetAttrString.
Note though that the runtime of the code throwing an error_already_set
will be dominated by stack unwinding so the improvement will not be
noticeable.

Before:

396 ns ± 0.913 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

After:

277 ns ± 0.549 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Benchmark:

#include <pybind11/pybind11.h>
#include <string>

namespace py = pybind11;

// Set a Python error, eagerly build the error string, then clear the error.
const std::string foo() {
    PyErr_SetString(PyExc_KeyError, "");
    const std::string &s = py::detail::error_string();
    PyErr_Clear();
    return s;
}

PYBIND11_MODULE(foo, m) {
    m.def("foo", &::foo);
}

* Reverted error_already_set to subclass std::runtime_error

* Revert "Extract exception name via tp_name"

The implementation of __name__ is slightly more complex than that.
It handles the module name prefix, and heap-allocated types. We could
port it to pybind11 later on, but for now it seems like overkill.

This reverts commit f1435c7.

* Cosmetic changes following @YannickJadoul's comments

Note that detail::error_string() no longer calls PyException_SetTraceback
as it is unnecessary for pretty-printing the exception.

* Fixed PyPy build

* Moved normalization to error_already_set ctor

* Fix merge bugs

* Fix more merge errors

* Improve formatting

* Improve error message in rare case

* Revert back if statements

* Fix clang-tidy

* Try removing mutable

* Does build_mode release fix it

* Set to Debug to expose segfault

* Fix remove set error string

* Do not run error_string() more than once

* Try setting the traceback to the value

* guard if m_type is null

* Try to debug PGI

* One last try for PGI

* Does reverting this fix PyPy

* Reviewer suggestions

* Remove unnecessary initialization

* Add noexcept move and explicit fail throw

* Optimize error_string creation

* Fix typo

* Revert noexcept

* Fix merge conflict error

* Abuse assignment operator

* Revert operator abuse

* See if we still need debug

* Remove unnecessary mutable

* Report "FATAL failure building pybind11::error_already_set error_string" and terminate process.

* Try specifying noexcept again

* Try explicit ctor

* default ctor is noexcept too

* Apply reviewer suggestions, simplify code, and make helper method private

* Remove unnecessary include

* Clang-Tidy fix

* detail::obj_class_name(), fprintf with [STDERR], [STDOUT] tags, polish comments

* consistently check m_lazy_what.empty() also in production builds

* Make a comment slightly less ambiguous.

* Bug fix: Remove `what();` from `restore()`.

It would need to be guarded by `if (m_type)`, otherwise `what()` fails and masks that no error was set (see updated unit test). But since `error_already_set` is copyable, there is no point in releasing m_type, m_value, m_trace, so we can just as well avoid the runtime overhead of force-building `m_lazy_what`; it may never be used.

* Replace extremely opaque (unhelpful) error message with a truthful reflection of what we know.

* Fix clang-tidy error [performance-move-constructor-init].

* Make expected error message less specific.

* Various changes.

* bug fix: error_string(PyObject **, ...)

* Putting back the two eager PyErr_NormalizeException() calls.

* Change error_already_set() to call pybind11_fail() if the Python error indicator not set. The net result is that a std::runtime_error is thrown instead of error_already_set, but all tests pass as is.

* Remove mutable (fixes oversight in the previous commit).

* Normalize the exception only locally in error_string(). Python 3.6 & 3.7 test failures expected. This is meant for benchmarking, to determine if it is worth the trouble looking into the failures.

* clang-tidy: use auto

* Use `gil_scoped_acquire_local` in `error_already_set` destructor. See long comment.

* For Python < 3.8: `PyErr_NormalizeException` before `PyErr_WriteUnraisable`

* Go back to replacing the held Python exception with then normalized exception, if & when needed. Consistently document the side-effect.

* Slightly rewording comment. (There were also other failures.)

* Add 1-line comment for obj_class_name()

* Benchmark code, with results in this commit message.

          function                   #calls  test time [s]  μs / call
master    pure_unwind                729540      1.061      14.539876
          err_set_unwind_err_clear   681476      1.040      15.260282
          err_set_error_already_set  508038      1.049      20.640525
          error_already_set_restore  555578      1.052      18.933288
          pr1895_original_foo        244113      1.050      43.018168
                                                                       PR / master
PR #1895  pure_unwind                736981      1.054      14.295685       98.32%
          err_set_unwind_err_clear   685820      1.045      15.237399       99.85%
          err_set_error_already_set  661374      1.046      15.811879       76.61%
          error_already_set_restore  669881      1.048      15.645176       82.63%
          pr1895_original_foo        318243      1.059      33.290806       77.39%

master @ commit ad146b2

Running tests in directory "/usr/local/google/home/rwgk/forked/pybind11/tests":
============================= test session starts ==============================
platform linux -- Python 3.9.10, pytest-6.2.3, py-1.10.0, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /usr/local/google/home/rwgk/forked/pybind11/tests, configfile: pytest.ini
collecting ... collected 5 items

test_perf_error_already_set.py::test_perf[pure_unwind]
PERF pure_unwind,729540,1.061,14.539876
PASSED
test_perf_error_already_set.py::test_perf[err_set_unwind_err_clear]
PERF err_set_unwind_err_clear,681476,1.040,15.260282
PASSED
test_perf_error_already_set.py::test_perf[err_set_error_already_set]
PERF err_set_error_already_set,508038,1.049,20.640525
PASSED
test_perf_error_already_set.py::test_perf[error_already_set_restore]
PERF error_already_set_restore,555578,1.052,18.933288
PASSED
test_perf_error_already_set.py::test_perf[pr1895_original_foo]
PERF pr1895_original_foo,244113,1.050,43.018168
PASSED

============================== 5 passed in 12.38s ==============================

pr1895 @ commit 8dff51d

Running tests in directory "/usr/local/google/home/rwgk/forked/pybind11/tests":
============================= test session starts ==============================
platform linux -- Python 3.9.10, pytest-6.2.3, py-1.10.0, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /usr/local/google/home/rwgk/forked/pybind11/tests, configfile: pytest.ini
collecting ... collected 5 items

test_perf_error_already_set.py::test_perf[pure_unwind]
PERF pure_unwind,736981,1.054,14.295685
PASSED
test_perf_error_already_set.py::test_perf[err_set_unwind_err_clear]
PERF err_set_unwind_err_clear,685820,1.045,15.237399
PASSED
test_perf_error_already_set.py::test_perf[err_set_error_already_set]
PERF err_set_error_already_set,661374,1.046,15.811879
PASSED
test_perf_error_already_set.py::test_perf[error_already_set_restore]
PERF error_already_set_restore,669881,1.048,15.645176
PASSED
test_perf_error_already_set.py::test_perf[pr1895_original_foo]
PERF pr1895_original_foo,318243,1.059,33.290806
PASSED

============================== 5 passed in 12.40s ==============================

clang++ -o pybind11/tests/test_perf_error_already_set.os -c -std=c++17 -fPIC -fvisibility=hidden -Os -flto -Wall -Wextra -Wconversion -Wcast-qual -Wdeprecated -Wnon-virtual-dtor -Wunused-result -isystem /usr/include/python3.9 -isystem /usr/include/eigen3 -DPYBIND11_STRICT_ASSERTS_CLASS_HOLDER_VS_TYPE_CASTER_MIX -DPYBIND11_TEST_BOOST -Ipybind11/include -I/usr/local/google/home/rwgk/forked/pybind11/include -I/usr/local/google/home/rwgk/clone/pybind11/include /usr/local/google/home/rwgk/forked/pybind11/tests/test_perf_error_already_set.cpp

clang++ -o lib/pybind11_tests.so -shared -fPIC -Os -flto -shared ...

Debian clang version 13.0.1-3+build2
Target: x86_64-pc-linux-gnu
Thread model: posix

* Changing call_repetitions_target_elapsed_secs to 0.1 for regular unit testing.

* Adding in `recursion_depth`

* Optimized ctor

* Fix silly bug in recurse_first_then_call()

* Add tests that have equivalent PyErr_Fetch(), PyErr_Restore() but no try-catch.

* Add call_error_string to tests. Sample only recursion_depth 0, 100.

* Show lazy-what speed-up in percent.

* Include real_work in benchmarks.

* Replace all PyErr_SetString() with generate_python_exception_with_traceback()

* Better organization of test loops.

* Add test_error_already_set_copy_move

* Fix bug in newly added test (discovered by clang-tidy): actually use move ctor

* MSVC detects the unreachable return

* change test_perf_error_already_set.py back to quick mode

* Inherit from std::exception (instead of std::runtime_error, which does not make sense anymore with the lazy what)

* Special handling under Windows.

* print with leading newline

* Removing test_perf_error_already_set (copies are under rwgk/rwgk_tbx@7765113).

* Avoid gil and scope overhead if there is nothing to release.

* Restore default move ctor. "member function" instead of "function" (note that "method" is Python terminology).

* Delete error_already_set copy ctor.

* Make restore() non-const again to resolve clang-tidy failure (still experimenting).

* Bring back error_already_set copy ctor, to see if that resolves the 4 MSVC test failures.

* Add noexcept to error_already_set copy & move ctors (as suggested by @Skylion007 IIUC).

* Trying one-by-one noexcept copy ctor for old compilers.

* Add back test covering copy ctor. Add another simple test that exercises the copy ctor.

* Exclude more older compilers from using the noexcept = default ctors. (The tests in the previous commit exposed that those are broken.)

* Factor out & reuse gil_scoped_acquire_local as gil_scoped_acquire_simple

* Guard gil_scoped_acquire_simple by _Py_IsFinalizing() check.

* what() GIL safety

* clang-tidy & Python 3.6 fixes

* Use `gil_scoped_acquire` in dtor, copy ctor, `what()`. Remove `_Py_IsFinalizing()` checks (they are racy: python/cpython#28525).

* Remove error_scope from copy ctor.

* Add `error_scope` to `get_internals()`, to cover the situation that `get_internals()` is called from the `error_already_set` dtor while a new Python error is in flight already. Also backing out `gil_scoped_acquire_simple` change.

* Add `FlakyException` tests with failure triggers in `__init__` and `__str__`

THIS IS STILL A WORK IN PROGRESS. This commit is only an important resting point.

This commit is a first attempt at addressing the observation that `PyErr_NormalizeException()` completely replaces the original exception if `__init__` fails. This can be very confusing even in small applications, and extremely confusing in large ones.

* Tweaks to resolve Py 3.6 and PyPy CI failures.

* Normalize Python exception immediately in error_already_set ctor.

For background see: #1895 (comment)

* Fix oversights based on CI failures (copy & move ctor initialization).

* Move @pytest.mark.xfail("env.PYPY") after @pytest.mark.parametrize(...)

* Use @pytest.mark.skipif (xfail does not work for segfaults, of course).

* Remove unused obj_class_name_or() function (it was added only under this PR).

* Remove already obsolete C++ comments and code that were added only under this PR.

* Slightly better (newly added) comments.

* Factor out detail::error_fetch_and_normalize. Preparation for producing identical results from error_already_set::what() and detail::error_string(). Note that this is a very conservative refactoring. It would be much better to first move detail::error_string into detail/error_string.h

* Copy most of error_string() code to new error_fetch_and_normalize::complete_lazy_error_string()

* Remove all error_string() code from detail/type_caster_base.h. Note that this commit includes a subtle bug fix: previously error_string() restored the Python error, which will upset pybind11_fail(). This never was a problem in practice because the two PyType_Ready() calls in detail/class.h do not usually fail.

* Return const std::string& instead of const char * and move error_string() to pytypes.h

* Remove gil_scope_acquire from error_fetch_and_normalize, add back to error_already_set

* Better handling of FlakyException __str__ failure.

* Move error_fetch_and_normalize::complete_lazy_error_string() implementation from pybind11.h to pytypes.h

* Add error_fetch_and_normalize::release_py_object_references() and use from error_already_set dtor.

* Use shared_ptr for m_fetched_error => 1. non-racy, copy ctor that does not need the GIL; 2. enables guard against duplicate restore() calls.

* Add comments.

* Trivial renaming of a newly introduced member function.

* Workaround for PyPy

* Bug fix (oversight). Only valgrind got this one.

* Use shared_ptr custom deleter for m_fetched_error in error_already_set. This enables removing the dtor, copy ctor, move ctor completely.

* Further small simplification. With the GIL held, simply deleting the raw_ptr takes care of everything.

* IWYU cleanup

```
iwyu version: include-what-you-use 0.17 based on Debian clang version 13.0.1-3+build2
```

Command used:

```
iwyu -c -std=c++17 -DPYBIND11_TEST_BOOST -Iinclude/pybind11 -I/usr/include/python3.9 -I/usr/include/eigen3 include/pybind11/pytypes.cpp
```

pytypes.cpp is a temporary file: `#include "pytypes.h"`

The raw output is very long and noisy.

I decided to use `#include <cstddef>` instead of `#include <cstdio>` for `std::size_t` (iwyu sticks to the manual choice).

I ignored all iwyu suggestions that are indirectly covered by `#include <Python.h>`.

I manually verified that all added includes are actually needed.

Co-authored-by: Aaron Gokaslan <[email protected]>
Co-authored-by: Ralf W. Grosse-Kunstleve <[email protected]>