You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently we proxy numpy.ndarray to ensure that cupy arrays are produced instead where possible. However, this behavior only partially addresses the possible ways to interact with numpy. Similarly to how we handle pandas in cudf.pandas, we may want to install a proxied library for numpy itself in such cases to ensure that we get the desired behavior.
Unfortunately, this is far more challenging with numpy than with pandas due to the fact that numpy exposes a C API. This issue is not unique to numpy, but is also present for other libraries (like torch) that rely on being able to translate Python objects from standard vocabulary types like numpy arrays down to C representations to leverage in their own C code. In general, our approach for proxying is incompatible with code that relies on converting Python objects into their C representations since we cannot mimic the latter, only the former.
We should collect cases where we observe failures like this and see if we can, at minimum, improve the ways in which the code fails. If possible, we can also try to come up with more robust strategies for consuming libraries to use such that they won't accidentally go down such bad code paths with cudf.pandas objects.
In an ideal world, we would come up with a solution that actually enables support for such use cases, but at present I don't see how we could manage doing so without doing something crazy like hooking every function call with sys.setprofile, and even if that worked (I'm not sure that it would) the cure might be worse than the disease because we'd most likely slow down every function call in Python enough to overcome any gains from using cudf.pandas.
The text was updated successfully, but these errors were encountered:
We should collect cases where we observe failures like this and see if we can, at minimum, improve the ways in which the code fails.
This is directly related to #15910, so I should also try to catch cases in our proxying scheme where we fallback to Pandas due to NumPy (or other libraries with C APIs).
If possible, we can also try to come up with more robust strategies for consuming libraries to use such that they won't accidentally go down such bad code paths with cudf.pandas objects.
I think once we run the pandas test suite with cudf.pandas debugging modes turned on we should get a better of what kinds of failures we're running in to.
Currently we proxy numpy.ndarray to ensure that cupy arrays are produced instead where possible. However, this behavior only partially addresses the possible ways to interact with numpy. Similarly to how we handle pandas in cudf.pandas, we may want to install a proxied library for numpy itself in such cases to ensure that we get the desired behavior.
Unfortunately, this is far more challenging with numpy than with pandas due to the fact that numpy exposes a C API. This issue is not unique to numpy, but is also present for other libraries (like torch) that rely on being able to translate Python objects from standard vocabulary types like numpy arrays down to C representations to leverage in their own C code. In general, our approach for proxying is incompatible with code that relies on converting Python objects into their C representations since we cannot mimic the latter, only the former.
We should collect cases where we observe failures like this and see if we can, at minimum, improve the ways in which the code fails. If possible, we can also try to come up with more robust strategies for consuming libraries to use such that they won't accidentally go down such bad code paths with cudf.pandas objects.
In an ideal world, we would come up with a solution that actually enables support for such use cases, but at present I don't see how we could manage doing so without doing something crazy like hooking every function call with
sys.setprofile
, and even if that worked (I'm not sure that it would) the cure might be worse than the disease because we'd most likely slow down every function call in Python enough to overcome any gains from using cudf.pandas.The text was updated successfully, but these errors were encountered: