[Serve] Fix memory leak issue in serve inference #27815

Merged
merged 12 commits into from
Aug 13, 2022

@sihanwang41 sihanwang41 commented Aug 12, 2022

Signed-off-by: Sihan Wang [email protected]

Why are these changes needed?

Since PR #27411 was merged, we eagerly use _PyObjScanner to replace nodes, but _instances holds a reference to each scanner until the GC deletes the object. That currently does not happen fast enough, so this PR explicitly deletes the object to make it GCable.

Related issue number

Closes #27692

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
@sihanwang41 sihanwang41 changed the title from "[Serve][Debug] Test" to "[Serve] Fix Memory leak issue in serve inference" Aug 12, 2022
Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>

@simon-mo simon-mo left a comment

Great find!!! Could there be a way to unit test this? Maybe assert that the global dictionary doesn't grow after sending 10 requests?

Comment on lines 62 to 64
# Make the scanner GCable to avoid memory leak
delete_scanner_instance(id(scanner))

Just calling del scanner should be enough, because it invokes the same code path.

The motivation here was to hide the logic of the scanner operation. Given that delete_scanner_instance is thin, I can change it for now.
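
Whether del on a local name is sufficient depends on whether anything else still references the object. The sketch below shows only general Python semantics, using a toy `Scanner` class and a `registry` dict that are both hypothetical and not taken from the Ray code: `del name` removes a name binding, and the object is freed only once the last strong reference, here the registry entry, is gone.

```python
import gc
import weakref
from typing import Dict


class Scanner:
    """Toy object; the class name is illustrative only."""


registry: Dict[int, Scanner] = {}  # hypothetical global registry

scanner = Scanner()
key = id(scanner)
registry[key] = scanner
probe = weakref.ref(scanner)  # lets us observe whether the object is alive

del scanner  # removes only the local name binding
alive_after_del_name = probe() is not None  # registry still references it

del registry[key]  # removes the last strong reference
gc.collect()  # not needed on CPython, but harmless on other interpreters
alive_after_del_entry = probe() is not None
```

If the class defines a finalizer that removes the registry entry, that finalizer cannot run while the registry itself keeps the object alive, which is one reason to delete the entry directly.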

@@ -23,6 +23,11 @@
TransformedType = TypeVar("TransformedType")


def delete_instance(id: int):

As mentioned previously, we don't need this; explicitly invoking del should work.

@edoakes edoakes left a comment

Same comment as @simon-mo, I'd feel much more comfortable if we had a basic test case that checks that the memory footprint isn't growing as we send requests.

This might be a bit tricky to write in a robust way, but it's possible you could use the Python gc library and check the number of live objects or number of objects gc'd after a call to gc.collect.

@simon-mo simon-mo mentioned this pull request Aug 12, 2022
7 tasks
@simon-mo simon-mo added the release-blocker P0 Issue that blocks the release label Aug 12, 2022
@simon-mo simon-mo changed the title from "[Serve] Fix Memory leak issue in serve inference" to "[Serve] Fix memory leak issue in serve inference" Aug 12, 2022
Signed-off-by: Sihan Wang <[email protected]>
Comment on lines 62 to 64
# Make the scanner GCable to avoid memory leak
del ScannerInstances[id(scanner)]

You should just call del scanner here, right?

Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
@simon-mo
DAG tests failed.

Signed-off-by: Sihan Wang <[email protected]>
Signed-off-by: Sihan Wang <[email protected]>
Comment on lines 37 to 38
def test_scanner_gc():
    """Test gc collect after the delete instances[id] called"""

Link to this PR in the docstring please.

Did you verify that this test failed prior to the fix?

Prior to the fix, we didn't remove the scanner from _instances at all, so the test would fail.

Signed-off-by: Sihan Wang <[email protected]>
@scv119 scv119 added v2.0.0-pick and removed release-blocker P0 Issue that blocks the release labels Aug 12, 2022
@simon-mo simon-mo merged commit 7e7c93f into ray-project:master Aug 13, 2022
simon-mo pushed a commit to simon-mo/ray that referenced this pull request Aug 13, 2022
scv119 pushed a commit that referenced this pull request Aug 13, 2022
gramhagen pushed a commit to gramhagen/ray that referenced this pull request Aug 15, 2022
Stefan-1313 pushed a commit to Stefan-1313/ray_mod that referenced this pull request Aug 18, 2022
JiahaoYao pushed a commit to JiahaoYao/ray that referenced this pull request Aug 21, 2022
JiahaoYao pushed a commit to JiahaoYao/ray that referenced this pull request Aug 22, 2022
JiahaoYao pushed a commit to JiahaoYao/ray that referenced this pull request Aug 22, 2022
ArturNiederfahrenhorst pushed a commit to ArturNiederfahrenhorst/ray that referenced this pull request Sep 1, 2022
Development

Successfully merging this pull request may close these issues.

[serve] Async Handle Memory Leak
4 participants