# [Core] Disaster recovery for head node results in memory leak in GCS Server when using external Redis cluster #35310
How big is the memory leak?
The leak is quite fast, ~211 kB per second.

```
> cat /proc/234940/cmdline
/home/ray/anaconda3/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2023-05-15_22-25-08_382879_1/logs ... <snip>

> while true; do cat /proc/234940/status | grep RssAnon; sleep 10; done
RssAnon:    127136 kB
RssAnon:    129308 kB
RssAnon:    131456 kB
RssAnon:    133824 kB
RssAnon:    135844 kB
RssAnon:    137692 kB
RssAnon:    139876 kB
RssAnon:    142768 kB
RssAnon:    144752 kB
RssAnon:    146984 kB
RssAnon:    148000 kB
```
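As a rough cross-check of that rate, a minimal sketch using the sampled values above (a 10-second interval between samples is assumed from the loop):

```python
# RssAnon samples from the loop above, in kB, taken every 10 seconds.
samples_kb = [127136, 129308, 131456, 133824, 135844, 137692,
              139876, 142768, 144752, 146984, 148000]

elapsed_s = 10 * (len(samples_kb) - 1)
rate = (samples_kb[-1] - samples_kb[0]) / elapsed_s
print(f"~{rate:.0f} kB/s")  # ~209 kB/s, in line with the quoted ~211 kB/s
```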
@darthhexx I think the issue has been fixed on master (included in the upcoming 2.5 release). Could you give it another try?
## Why are these changes needed?

The old resource broadcasting uses a sequence number; when the seq is delayed, the handler returns immediately, and that is where the leak can happen: never replying to a gRPC call eventually leaks its resources. The new ray syncer no longer has this path, but in a wrong setup a GCS that does not belong to this cluster might talk to the raylet (there are no guards against this right now) and send node info to it, triggering the leak. This fix does two things to protect the code:
- If the raylet is syncer based, it simply rejects the request.
- It also fixes the bug in the old code path.

## Related issue number

#35632 #35310
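To make the leak mechanism concrete, here is a schematic sketch (a Python stand-in, not Ray's actual C++ code) of a callback-style RPC handler whose per-call state is only released when it replies, so an early return on a stale seq leaks it:

```python
# Schematic stand-in for the old broadcast handler (not Ray's actual code):
# per-call state is allocated on arrival and only freed once a reply is sent.
pending_calls = {}  # call_id -> per-call state (stands in for the gRPC call object)
last_seq = 0

def handle_broadcast(call_id, seq, reply):
    global last_seq
    pending_calls[call_id] = {"seq": seq}  # allocated when the request arrives
    if seq <= last_seq:
        # Buggy old path: a delayed seq returned immediately *without*
        # replying, so the entry above is never released -- the leak.
        return
    last_seq = seq
    reply(call_id)              # replying is what lets the state be freed
    del pending_calls[call_id]

handle_broadcast(1, seq=5, reply=lambda cid: None)  # replied, state freed
handle_broadcast(2, seq=5, reply=lambda cid: None)  # stale seq: leaks its state
print(len(pending_calls))  # -> 1: the unanswered call's state is still held
```

In a real gRPC server, the unanswered call object plays the role of the leaked dictionary entry here.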
@iycheng after trying to build the Docker images yesterday with no success, I tested the nightly images today and the memory leak is gone 👍 Thank you for fixing it. I see that the PR mentions "in the wrong setup". I assume this refers to node(s) that are still running and trying to connect to the new GCS Server with old information?
Hi @darthhexx, thanks for verifying it's working! There are actually two bugs here.

Neither of them happens with the new broadcasting, which is enabled in 2.5. But in a wrong setup, where a GCS that does not belong to this cluster talks to the raylet (which can happen in failure cases) and sends it the old-style resource broadcast, the bug is still triggered even though the raylet uses the new broadcasting. We fixed the bug in the old resource broadcasting, and we also prevent the raylet from processing messages from the old broadcasting (this shouldn't happen, but if the cluster is not managed well, or in some corner cases, it still might).
I'll close this ticket. Feel free to reopen it if it somehow happens again.
Thank you for the explanation and the link to the ongoing work.
## What happened + What you expected to happen
We're testing Ray with a separate Redis cluster, using the RAY_REDIS_ADDRESS env var, for production disaster recovery.
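A minimal sketch of how such a head node is started (the Redis address is a placeholder, and driving the `ray` CLI from Python via `subprocess` is just one way to do it):

```python
import os
import subprocess

# Placeholder address for the external Redis cluster.
env = dict(os.environ, RAY_REDIS_ADDRESS="redis-0.example.internal:6379")

# With RAY_REDIS_ADDRESS set, the GCS on the head node persists its state to
# the external Redis rather than only in memory, which is what allows the
# head node to be recovered after a failure.
subprocess.run(["ray", "start", "--head"], env=env, check=True)
```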
It is working for the KV storage, and when Ray is restarted the existing actors, jobs, etc. are listed and started up again as expected.
However, shortly after a restart, the raylet starts continuously logging messages about discarded packets. If it were only discarding the packets, I guess it would be fine, but the real issue is that this results in what appears to be a memory leak in the `gcs_server` process, which continues to grow in size. Restarting Ray does not fix it either; the only way to get memory usage stable again is to reset the Redis instances and start from scratch.

## Versions / Dependencies
I have reproduced the error with the rayproject/ray:2.4.0-py310 Docker image, as well as with Debian 11.7's default Python 3.9.2 environment. Since the issue appears to lie in the GCS server, it's safe to say it will affect all versions.
## Reproduction script
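A hedged reconstruction of the steps described above (the Redis address is a placeholder, and the sampling loop mirrors the RssAnon loop quoted earlier):

```python
import os
import re
import subprocess
import time

env = dict(os.environ, RAY_REDIS_ADDRESS="redis-0.example.internal:6379")  # placeholder

# Start the head node against the external Redis, then simulate head-node
# loss and recovery: on the second start, the GCS restores its state from Redis.
subprocess.run(["ray", "start", "--head"], env=env, check=True)
subprocess.run(["ray", "stop"], check=True)
subprocess.run(["ray", "start", "--head"], env=env, check=True)

# Watch the gcs_server process grow, as in the RssAnon loop quoted earlier.
pid = subprocess.check_output(["pgrep", "-f", "gcs_server"], text=True).split()[0]
for _ in range(30):
    with open(f"/proc/{pid}/status") as f:
        print(re.search(r"RssAnon:\s*\d+ kB", f.read()).group(0))
    time.sleep(10)
```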
## Issue Severity
High: It blocks me from completing my task.