
[Core] Disaster recovery for head node results in memory leak in GCS Server when using external Redis cluster #35310

Closed
darthhexx opened this issue May 13, 2023 · 7 comments
Assignees: fishbone
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P1 (Issue that should be fixed within a few weeks), Ray 2.6

Comments

@darthhexx (Contributor)

What happened + What you expected to happen

We're testing Ray with a separate Redis cluster, using the RAY_REDIS_ADDRESS env var, for production disaster recovery.
It is working for the KV storage:

(gcs_server) gcs_server.cc:452: Using external Redis for KV storage: ...

and when Ray is restarted, the existing actors, jobs, etc. are listed and started up again as expected.
However, shortly after a restart, the raylet starts logging these messages continuously:

[2023-05-12 16:48:38,045 W 438386 438386] (raylet) node_manager.cc:1761: Raylet may have missed a resource broadcast. This either means that GCS has restarted, the network is heavily congested and is dropping, reordering, or duplicating packets. Expected seq#: 1683874108609414361, but got: 1683874108609414360.

If it were only discarding the packet, that would be fine, but the real issue is that it results in what appears to be a memory leak in the gcs_server: the process keeps growing in size. Restarting Ray does not fix it either; the only way to get memory usage stable again is to reset the Redis instances and start from scratch.

Versions / Dependencies

I have reproduced the error with the rayproject/ray:2.4.0-py310 Docker image, as well as with Debian 11.7's default Python 3.9.2 environment. Since the issue appears to lie in the GCS server, I'd say it's safe to say it will affect all versions.

Reproduction script

# Run a standalone Redis instance to act as the external cluster
docker run -d --name=redis -p 6000:6379 redis:latest

# Point Ray at the external Redis and start the head node
export RAY_REDIS_ADDRESS='127.0.0.1:6000'
ray start --head --dashboard-host=0.0.0.0 --disable-usage-stats --metrics-export-port=9100 --node-manager-port=9997 --object-manager-port=9998 --dashboard-agent-grpc-port=9999 --dashboard-agent-listen-port=10000 --min-worker-port=10002 --max-worker-port=10500 --ray-client-server-port=10001

# Simulate the head-node disaster-recovery restart
ray stop

ray start --head --dashboard-host=0.0.0.0 --disable-usage-stats --metrics-export-port=9100 --node-manager-port=9997 --object-manager-port=9998 --dashboard-agent-grpc-port=9999 --dashboard-agent-listen-port=10000 --min-worker-port=10002 --max-worker-port=10500 --ray-client-server-port=10001
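
After the second `ray start`, a quick way to confirm that cluster state was recovered from the external Redis is to attach to the restarted head node from Python. This is only a sketch, assuming it runs on the same machine as the commands above; it uses the standard `ray.init` / `ray.nodes` APIs.

```python
# Sketch: confirm the restarted head node recovered state from the external Redis.
import ray

ray.init(address="auto")         # attach to the locally running head node
print(ray.nodes())               # node table as reported by the GCS
print(ray.cluster_resources())   # cluster-wide resource view
```

The raylet warning shown above appears shortly after this restart, and the gcs_server process then starts growing.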

Issue Severity

High: It blocks me from completing my task.

darthhexx added the bug and triage (Needs triage) labels on May 13, 2023
rkooo567 added the core label on May 15, 2023
@rkooo567 (Contributor)

How big is the memory leak?

rkooo567 added the P1 and Ray 2.6 labels and removed the triage label on May 15, 2023
@darthhexx (Contributor, Author)

The leak is quite fast, ~211 KB per second.

> cat /proc/234940/cmdline
/home/ray/anaconda3/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server--log_dir=/tmp/ray/session_2023-05-15_22-25-08_382879_1/logs .... <snip>

> while true; do cat /proc/234940/status | grep RssAnon; sleep 10; done
RssAnon:	  127136 kB
RssAnon:	  129308 kB
RssAnon:	  131456 kB
RssAnon:	  133824 kB
RssAnon:	  135844 kB
RssAnon:	  137692 kB
RssAnon:	  139876 kB
RssAnon:	  142768 kB
RssAnon:	  144752 kB
RssAnon:	  146984 kB
RssAnon:	  148000 kB
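
For reference, a quick back-of-the-envelope check of the rate implied by those samples (taken every 10 seconds, values in kB), as a small sketch:

```python
# Average RssAnon growth rate from the samples above (10 s apart, in kB).
samples_kb = [127136, 129308, 131456, 133824, 135844, 137692,
              139876, 142768, 144752, 146984, 148000]
interval_s = 10
rate = (samples_kb[-1] - samples_kb[0]) / ((len(samples_kb) - 1) * interval_s)
print(f"~{rate:.0f} kB/s")  # ~209 kB/s, consistent with the ~211 KB/s figure above
```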

@fishbone (Contributor)

@darthhexx I think the issue has been fixed on master (it will be included in the upcoming 2.5 release). Could you give it another try?

fishbone self-assigned this on May 23, 2023
fishbone added a commit that referenced this issue May 23, 2023
## Why are these changes needed?

The old resource broadcasting uses a sequence number, and when a message arrives with a stale seq the handler returns immediately; this is where the leak can happen. Not replying to the gRPC call eventually leaks the resources held for that call.

The Ray syncer no longer does this, but in a bad setup a GCS that does not belong to this cluster can talk to this raylet (there are no guards right now) and send node info to it. That is how the leak gets triggered.

This fix does two things to protect the code:

- If the raylet is syncer-based, it simply rejects the request.
- The bug in the old code path is also fixed.

## Related issue number

#35632
#35310
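
To make the leak mechanism concrete, here is a deliberately simplified, hypothetical Python sketch of the pattern the commit message describes. It is not Ray's actual implementation (the real code is C++ around node_manager.cc and the gRPC layer); the names and the dictionary standing in for per-call gRPC state are illustrative only.

```python
# State held for an in-flight RPC is released only once a reply is sent,
# so "drop the stale message and return" keeps that state alive forever.
pending = {}  # call_id -> reply callback; freed only after a reply is sent


def handle_broadcast_buggy(call_id, seq, expected_seq, reply):
    pending[call_id] = reply
    if seq != expected_seq:
        return                      # never replies -> `pending` entry leaks
    reply("applied")
    del pending[call_id]


def handle_broadcast_fixed(call_id, seq, expected_seq, reply):
    pending[call_id] = reply
    if seq != expected_seq:
        reply("stale, discarded")   # still reply so the call can be released
    else:
        reply("applied")
    del pending[call_id]
```

With a broadcast arriving roughly every 100 ms, the buggy path accumulates un-freed per-call state continuously, which matches the steady RSS growth measured above.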
fishbone added a commit to fishbone/ray that referenced this issue May 23, 2023
ArturNiederfahrenhorst pushed a commit that referenced this issue May 24, 2023
@darthhexx (Contributor, Author) commented May 24, 2023

@iycheng after trying to build the docker images yesterday with no success, I tested the nightly images today and the memory leak is gone 👍 Thank you for fixing it.

I see that the PR mentions "in the wrong setup". I assume this refers to node(s) that are still running and trying to connect to the new GCS server instance with old information?

@fishbone (Contributor)

Hi @darthhexx, thanks for verifying that it works!

There are actually two bugs here:

  • GCS doesn't drop messages from nodes it doesn't know about.
    • This is a bug in the old broadcasting, which has been disabled in 2.5.
    • We also fixed this bug.
    • If the message is not dropped, GCS keeps broadcasting it to the other raylets, which is where the second bug comes in.
  • The raylet doesn't reply to the gRPC call when it decides to drop a message.
    • This leads to the raylet's memory increasing over time, since the old broadcasting sends a message every 100 ms.

Neither of these happens with the new broadcasting that is enabled in 2.5. But in a bad setup, where a GCS that does not belong to this cluster talks to the raylet (which can happen in failure cases) and sends it the wrong resource broadcast, the bug is still triggered even though the raylet uses the new broadcasting (see the sketch below).

We fixed the bug in the old resource broadcasting, and we also prevent the raylet from processing messages from the old broadcasting (this shouldn't happen, but if the cluster isn't managed well, or in some corner cases, it still might).
We also have other work ongoing to prevent this from happening at the root.
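
As with the earlier sketch, here is a minimal, hypothetical illustration of the first guard described above (dropping sync messages from unknown nodes so they are never re-broadcast). The real change lives in Ray's C++ GCS code; these names are made up for the example.

```python
# Hypothetical guard: ignore sync messages from node IDs the GCS does not
# know about, so they are never re-broadcast to the rest of the cluster.
known_nodes = {"node-a", "node-b"}   # nodes registered with this GCS


def on_sync_message(sender_node_id, message, broadcast):
    if sender_node_id not in known_nodes:
        return            # drop: unknown sender (e.g. a GCS from another cluster)
    broadcast(message)    # forward to the other raylets as usual
```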

@fishbone (Contributor)

I'll close this ticket. Feel free to reopen it if it somehow happens again.

@darthhexx (Contributor, Author)

Thank you for the explanation and the link to the ongoing work.

scv119 pushed a commit to scv119/ray that referenced this issue Jun 16, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this issue Aug 31, 2023