Fix for Memory Leak in cdist and Metric-Like Functions #61
Conversation
if (!output_array) {
    free(distances);
    goto cleanup;
}

PyObject *wrapper = PyCapsule_New(distances, "wrapper", (PyCapsule_Destructor)&free_capsule);
Thank you for spotting the issue, @smthngslv! Any chance we can avoid the additional memory allocation for the capsule? There must be a cheaper way to forward it to Python.
PS: The cost of every memory allocation can be many times higher than the cost of the actual distance computation.
I am not sure that there is a way to avoid creating a capsule, but we could probably add something like an out argument to the signature, so it would be:
query: numpy.ndarray = ...
embeddings: numpy.ndarray = ...
scores: numpy.ndarray = numpy.empty(
    shape=(query.shape[0], embeddings.shape[0]),
    dtype=numpy.dtype("float32"),
)

simsimd.cdist(embeddings, query, out=scores)
This way, you create the output array only once and then just reuse it across calls.
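For the C side, here is a hypothetical sketch of what honoring such an out argument could look like; the function name cdist_with_out, the argument parsing, and the validation details are all assumptions for illustration, not the actual simsimd binding:

#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <numpy/arrayobject.h>

/* Hypothetical binding that writes into a caller-provided out array. */
static PyObject *cdist_with_out(PyObject *self, PyObject *args, PyObject *kwargs) {
    PyObject *a_obj = NULL, *b_obj = NULL, *out_obj = NULL;
    static char *keywords[] = {"a", "b", "out", NULL};
    if (!PyArg_ParseTupleAndKeywords(args, kwargs, "OO|O", keywords, &a_obj, &b_obj, &out_obj))
        return NULL;

    if (!out_obj || !PyArray_Check(out_obj) ||
        PyArray_TYPE((PyArrayObject *)out_obj) != NPY_FLOAT32 ||
        !PyArray_IS_C_CONTIGUOUS((PyArrayObject *)out_obj)) {
        PyErr_SetString(PyExc_ValueError, "out must be a C-contiguous float32 array");
        return NULL;
    }

    /* No malloc, no capsule, no fresh array object: the distances go straight
       into the caller's buffer, which the caller keeps alive and reuses. */
    float *scores = (float *)PyArray_DATA((PyArrayObject *)out_obj);
    /* ... compute distances into scores ... */
    Py_RETURN_NONE;
}

A real implementation would also validate the shape of out, and could keep the current allocating path as a fallback when out is omitted.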
That's a good idea! Can you implement it? We must also document that behavior, as it goes beyond SciPy's functionality.
Okay, I will take a look at it later this week.
I may have a similar issue in other repos, like USearch, so finding a way to avoid the extra allocation would be great, @smthngslv 🤗
@smthngslv how did you catch the leak? Which tool would you suggest for the Python binding? Maybe something we can add in CI? PS: I'm gonna merge the current solution and am looking forward to the one that adds the out argument.
🎉 This PR is included in version 3.6.6 🎉 The release is available on the GitHub release page. Your semantic-release bot 📦🚀
Just manual testing: comment out some lines and test again. About the build process, I use:

# Find the module development requirements (requires FindPython from 3.17 or scikit-build-core's built-in backport).
find_package(Python REQUIRED COMPONENTS Interpreter Development.Module)
find_package(pybind11 CONFIG REQUIRED)
find_package(OpenMP)
# Add a library using FindPython's tooling (pybind11 also provides a helper like this).
python_add_library(_core MODULE src/main.cpp WITH_SOABI)
target_link_libraries(_core PRIVATE pybind11::headers)
if(OpenMP_CXX_FOUND)
  target_link_libraries(_core PUBLIC OpenMP::OpenMP_CXX)
endif()
Yep, I am still on it and will get to it as soon as I manage to find some free time :)
Fix for Memory Leak in cdist and Metric-Like Functions
Problem Description
I identified a memory leak in the cdist function, as well as in other metric-like functions, within our codebase. The leak becomes evident when these functions are called repeatedly, for example inside a while True loop that keeps calling simsimd.cdist on the same inputs and discarding the result: memory usage grows steadily over time, indicating a memory leak.
Root Cause
The memory leak stems from improper handling of numpy arrays with respect to garbage collection (GC). The issue arises when PyArray_NewFromDescr is called with a non-null data pointer: the resulting array does not own that buffer, so the caller must keep the data alive for the entire lifetime of the array, yet something must still deallocate it once numpy and Python's garbage collector release the array. In our case, nothing did.
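To make the pattern concrete, here is a minimal sketch of the leaky shape of the code, assuming import_array() ran during module initialization; PyArray_SimpleNewFromData is a convenience wrapper around the PyArray_NewFromDescr call described above, and the function name and shapes are illustrative:

#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <numpy/arrayobject.h>
#include <stdlib.h>

/* The leaky pattern, reduced to its essence. */
static PyObject *make_distances_leaky(npy_intp rows, npy_intp cols) {
    float *distances = malloc((size_t)rows * (size_t)cols * sizeof(float));
    if (!distances)
        return PyErr_NoMemory();
    /* ... fill distances with the computed metric values ... */

    npy_intp dims[2] = {rows, cols};
    PyObject *output_array = PyArray_SimpleNewFromData(2, dims, NPY_FLOAT32, distances);
    if (!output_array) {
        free(distances);
        return NULL;
    }
    /* Bug: the array points at distances but does not own it, and nothing else
       owns it either. Once Python collects the array, the buffer is leaked. */
    return output_array;
}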
Solution
According to the numpy C-API documentation (Numpy C-API Array From Scratch), the NPY_ARRAY_OWNDATA flag should never be set manually. The documentation states:
"The data area is owned by this array. Should never be set manually, instead create a PyObject wrapping the data and set the array’s base to that object. For an example, see the test in test_mem_policy."
Following these guidelines, I used a specific implementation in the numpy repository for guidance. The reference code can be found here: Numpy Test Memory Policy. Aligning our implementation with the approach demonstrated in that example resolved the memory leak.
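A sketch of the corrected pattern, mirroring the diff above and numpy's test_mem_policy example; the capsule name "wrapper" and free_capsule follow the diff, while the surrounding function is illustrative:

#define PY_SSIZE_T_CLEAN
#include <Python.h>
#include <numpy/arrayobject.h>
#include <stdlib.h>

/* Destructor invoked once the capsule's reference count drops to zero. */
static void free_capsule(PyObject *capsule) {
    free(PyCapsule_GetPointer(capsule, "wrapper"));
}

static PyObject *make_distances(npy_intp rows, npy_intp cols) {
    float *distances = malloc((size_t)rows * (size_t)cols * sizeof(float));
    if (!distances)
        return PyErr_NoMemory();
    /* ... fill distances with the computed metric values ... */

    npy_intp dims[2] = {rows, cols};
    PyObject *output_array = PyArray_SimpleNewFromData(2, dims, NPY_FLOAT32, distances);
    if (!output_array) {
        free(distances);
        return NULL;
    }

    /* Wrap the raw buffer in a capsule whose destructor frees it ... */
    PyObject *wrapper = PyCapsule_New(distances, "wrapper", free_capsule);
    if (!wrapper) {
        Py_DECREF(output_array); /* the array never owned the buffer ... */
        free(distances);         /* ... so release it by hand */
        return NULL;
    }
    /* ... and make the capsule the array's base object. SetBaseObject steals
       the reference to the capsule, so no Py_DECREF(wrapper) afterwards. */
    if (PyArray_SetBaseObject((PyArrayObject *)output_array, wrapper) < 0) {
        Py_DECREF(output_array); /* the stolen capsule dies too, freeing the buffer */
        return NULL;
    }
    return output_array;
}

The key property is that ownership of the buffer now travels with the array: the capsule's destructor runs exactly once, when the last reference to the array, and therefore to its base object, disappears.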