
[Core] Disaster recovery for head node results in memory leak in GCS Server when using external Redis cluster #35310

Closed
darthhexx opened this issue May 13, 2023 · 7 comments
Assignees: fishbone
Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P1 (Issue that should be fixed within a few weeks), Ray 2.6

Comments

@darthhexx (Contributor)

What happened + What you expected to happen

We're testing Ray with a separate Redis cluster, using the RAY_REDIS_ADDRESS env var, for production disaster recovery.
It is working for the KV storage:

(gcs_server) gcs_server.cc:452: Using external Redis for KV storage: ...

and when Ray is restarted, the existing actors, jobs, etc. are listed and started up again as expected.
However, shortly after a restart, the raylet starts logging these messages continuously:

[2023-05-12 16:48:38,045 W 438386 438386] (raylet) node_manager.cc:1761: Raylet may have missed a resource broadcast. This either means that GCS has restarted, the network is heavily congested and is dropping, reordering, or duplicating packets. Expected seq#: 1683874108609414361, but got: 1683874108609414360.

If it were only discarding the packet, that would be fine, but the real issue is that it results in what appears to be a memory leak in the gcs_server: the process keeps growing in size. Restarting Ray does not fix it either; the only way to get memory usage stable again is to reset the Redis instances and start from scratch.

Versions / Dependencies

I have reproduced the error with the rayproject/ray:2.4.0-py310 Docker image, as well as with Debian 11.7's default Python 3.9.2 environment. Since the issue appears to lie in the GCS server, I'd say it's safe to say it will affect all versions.

Reproduction script

# Run a standalone Redis instance to act as the external cluster
docker run -d --name=redis -p 6000:6379 redis:latest

# Point Ray at the external Redis and start the head node
export RAY_REDIS_ADDRESS='127.0.0.1:6000'
ray start --head --dashboard-host=0.0.0.0 --disable-usage-stats --metrics-export-port=9100 --node-manager-port=9997 --object-manager-port=9998 --dashboard-agent-grpc-port=9999 --dashboard-agent-listen-port=10000 --min-worker-port=10002 --max-worker-port=10500 --ray-client-server-port=10001

# Simulate the head-node disaster-recovery restart
ray stop

ray start --head --dashboard-host=0.0.0.0 --disable-usage-stats --metrics-export-port=9100 --node-manager-port=9997 --object-manager-port=9998 --dashboard-agent-grpc-port=9999 --dashboard-agent-listen-port=10000 --min-worker-port=10002 --max-worker-port=10500 --ray-client-server-port=10001
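
After the second `ray start`, a quick way to confirm that cluster state was recovered from the external Redis is to attach to the restarted head node from Python. This is only a sketch, assuming it runs on the same machine as the commands above; it uses the standard `ray.init` / `ray.nodes` APIs.

```python
# Sketch: confirm the restarted head node recovered state from the external Redis.
import ray

ray.init(address="auto")         # attach to the locally running head node
print(ray.nodes())               # node table as reported by the GCS
print(ray.cluster_resources())   # cluster-wide resource view
```

The raylet warning shown above appears shortly after this restart, and the gcs_server process then starts growing.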

Issue Severity

High: It blocks me from completing my task.

darthhexx added the bug and triage (Needs triage) labels on May 13, 2023
rkooo567 added the core label on May 15, 2023
@rkooo567 (Contributor)

How big is the memory leak?

rkooo567 added the P1 and Ray 2.6 labels and removed the triage label on May 15, 2023
@darthhexx (Contributor, Author)

The leak is quite fast, ~211 KB per second.

> cat /proc/234940/cmdline
/home/ray/anaconda3/lib/python3.10/site-packages/ray/core/src/ray/gcs/gcs_server--log_dir=/tmp/ray/session_2023-05-15_22-25-08_382879_1/logs .... <snip>

> while true; do cat /proc/234940/status | grep RssAnon; sleep 10; done
RssAnon:	  127136 kB
RssAnon:	  129308 kB
RssAnon:	  131456 kB
RssAnon:	  133824 kB
RssAnon:	  135844 kB
RssAnon:	  137692 kB
RssAnon:	  139876 kB
RssAnon:	  142768 kB
RssAnon:	  144752 kB
RssAnon:	  146984 kB
RssAnon:	  148000 kB
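
For reference, a quick back-of-the-envelope check of the rate implied by those samples (taken every 10 seconds, values in kB), as a small sketch:

```python
# Average RssAnon growth rate from the samples above (10 s apart, in kB).
samples_kb = [127136, 129308, 131456, 133824, 135844, 137692,
              139876, 142768, 144752, 146984, 148000]
interval_s = 10
rate = (samples_kb[-1] - samples_kb[0]) / ((len(samples_kb) - 1) * interval_s)
print(f"~{rate:.0f} kB/s")  # ~209 kB/s, consistent with the ~211 KB/s figure above
```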

@fishbone (Contributor)

@darthhexx I think the issue has been fixed on master (it will be included in the upcoming 2.5 release). Could you give it another try?

fishbone self-assigned this on May 23, 2023
fishbone added a commit that referenced this issue May 23, 2023
## Why are these changes needed?

The old resource broadcasting uses a sequence number, and when a message arrives with a stale seq the handler returns immediately; this is where the leak can happen. Not replying to the gRPC call eventually leaks the resources held for that call.

The Ray syncer no longer does this, but in a bad setup a GCS that does not belong to this cluster can talk to this raylet (there are no guards right now) and send node info to it. That is how the leak gets triggered.

This fix does two things to protect the code:

- If the raylet is syncer-based, it simply rejects the request.
- The bug in the old code path is also fixed.

## Related issue number

#35632
#35310
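
To make the leak mechanism concrete, here is a deliberately simplified, hypothetical Python sketch of the pattern the commit message describes. It is not Ray's actual implementation (the real code is C++ around node_manager.cc and the gRPC layer); the names and the dictionary standing in for per-call gRPC state are illustrative only.

```python
# State held for an in-flight RPC is released only once a reply is sent,
# so "drop the stale message and return" keeps that state alive forever.
pending = {}  # call_id -> reply callback; freed only after a reply is sent


def handle_broadcast_buggy(call_id, seq, expected_seq, reply):
    pending[call_id] = reply
    if seq != expected_seq:
        return                      # never replies -> `pending` entry leaks
    reply("applied")
    del pending[call_id]


def handle_broadcast_fixed(call_id, seq, expected_seq, reply):
    pending[call_id] = reply
    if seq != expected_seq:
        reply("stale, discarded")   # still reply so the call can be released
    else:
        reply("applied")
    del pending[call_id]
```

With a broadcast arriving roughly every 100 ms, the buggy path accumulates un-freed per-call state continuously, which matches the steady RSS growth measured above.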
fishbone added a commit to fishbone/ray that referenced this issue May 23, 2023
ArturNiederfahrenhorst pushed a commit that referenced this issue May 24, 2023
@darthhexx (Contributor, Author) commented May 24, 2023

@iycheng after trying to build the docker images yesterday with no success, I tested the nightly images today and the memory leak is gone 👍 Thank you for fixing it.

I see that the PR mentions "in the wrong setup". I assume this refers to node(s) that are still running and trying to connect to the new GCS server instance with old information?

@fishbone (Contributor)

Hi @darthhexx, thanks for verifying that it works!

There are actually two bugs here:

  • GCS doesn't drop messages from nodes it doesn't know about.
    • This is a bug in the old broadcasting, which has been disabled in 2.5.
    • We also fixed this bug.
    • If the message is not dropped, GCS keeps broadcasting it to the other raylets, which is where the second bug comes in.
  • The raylet doesn't reply to the gRPC call when it decides to drop a message.
    • This leads to the raylet's memory increasing over time, since the old broadcasting sends a message every 100 ms.

Neither of these happens with the new broadcasting that is enabled in 2.5. But in a bad setup, where a GCS that does not belong to this cluster talks to the raylet (which can happen in failure cases) and sends it the wrong resource broadcast, the bug is still triggered even though the raylet uses the new broadcasting (see the sketch below).

We fixed the bug in the old resource broadcasting, and we also prevent the raylet from processing messages from the old broadcasting (this shouldn't happen, but if the cluster isn't managed well, or in some corner cases, it still might).
We also have other work ongoing to prevent this from happening at the root.
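
As with the earlier sketch, here is a minimal, hypothetical illustration of the first guard described above (dropping sync messages from unknown nodes so they are never re-broadcast). The real change lives in Ray's C++ GCS code; these names are made up for the example.

```python
# Hypothetical guard: ignore sync messages from node IDs the GCS does not
# know about, so they are never re-broadcast to the rest of the cluster.
known_nodes = {"node-a", "node-b"}   # nodes registered with this GCS


def on_sync_message(sender_node_id, message, broadcast):
    if sender_node_id not in known_nodes:
        return            # drop: unknown sender (e.g. a GCS from another cluster)
    broadcast(message)    # forward to the other raylets as usual
```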

@fishbone (Contributor)

I'll close this ticket. Feel free to reopen it if it somehow happens again.

@darthhexx (Contributor, Author)

Thank you for the explanation and the link to the ongoing work.

scv119 pushed a commit to scv119/ray that referenced this issue Jun 16, 2023
arvind-chandra pushed a commit to lmco/ray that referenced this issue Aug 31, 2023