-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Serve] Fix memory leak issue in serve inference #27815
Conversation
Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great find!!! Could there a way to unit test this? maybe assert the global dictionary doesn't grow after sending 10 requests or?
python/ray/serve/_private/router.py
Outdated
# Make the scanner GCable to avoid memory leak | ||
delete_scanner_instance(id(scanner)) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just call del scanner
should be enough because it invokes the same code path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The motivation here is to hide the logic for the scanner operation, given the delete_scanner_instance
is thin, i can change it for now
python/ray/dag/py_obj_scanner.py
Outdated
@@ -23,6 +23,11 @@ | |||
TransformedType = TypeVar("TransformedType") | |||
|
|||
|
|||
def delete_instance(id: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as mentioned previously, we don't need this, just explicitly invoke del should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as @simon-mo, I'd feel much more comfortable if we had a basic test case that checks that the memory footprint isn't growing as we send requests.
This might be a bit tricky to write in a robust way, but it's possible you could use the Python gc library and check the number of live objects or number of objects gc'd after a call to gc.collect
.
Signed-off-by: Sihan Wang <[email protected]>
python/ray/serve/_private/router.py
Outdated
# Make the scanner GCable to avoid memory leak | ||
del ScannerInstances[id(scanner)] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should just call del scanner
here right?
Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
dag tests failed |
Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
def test_scanner_gc(): | ||
"""Test gc collect after the delete instances[id] called""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link to this PR in the docstring please.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you verify that this test failed prior to the fix?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prior to the fix, we don't remove the scanner in the _instance at all so the test will fail.
Signed-off-by: Sihan Wang <[email protected]>
Co-authored-by: Sihan Wang <[email protected]>
Signed-off-by: Stefan van der Kleij <[email protected]>
Signed-off-by: Artur Niederfahrenhorst <[email protected]>
Signed-off-by: Sihan Wang [email protected]
Why are these changes needed?
As the pr #27411 checked in, we eagerly use the
_PyObjScanner
to replace nodes, yet _instances hold the reference until the gc comes in to delete the object. Currently it is not fast enough, so in this pr, wedel
the object explicitly to mark the object GCable.Related issue number
Closes #27692
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.