[core] Expose Redis getter to Python and use to retrieve session name #39194

vitsai · 2023-09-01T02:58:41Z

Why are these changes needed?

If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.

Related issue number

#38796

Checks

- I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
  - I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Signed-off-by: vitsai <[email protected]>

rkooo567 · 2023-09-01T03:57:42Z

python/ray/tests/test_gcs_fault_tolerance.py

+ cluster.wait_for_nodes()
+
+ head_node = cluster.head_node
+ session_dir = head_node.get_session_dir_path()


Can you also submit a driver and check that its session name is correct?

import ray
ray.init()
# Make sure this is correct
ray._private.worker._global_node.get_session_dir_path()

We need this:

- Start the first head node. Submit a driver to the head node and get a session dir.
- Start the worker node. Submit a driver to the worker node and get a session dir.
- Kill the head node. Restart the head node.
- Submit a driver to the head node and get a session dir.
- Submit a driver to the worker node and get a session dir.

This could also be done in a similar way as test_gcs_ha_e2e.py (it starts head / worker in docker containers, so it is easy to test it)

Maybe add the same suite of tests as in this commit: 748ddf8?

rkooo567 · 2023-09-01T08:36:31Z

python/ray/tests/test_gcs_fault_tolerance.py

+ cluster.wait_for_nodes()
+
+ head_node = cluster.head_node
+ session_dir = head_node.get_session_dir_path()


Maybe add the same suite of tests as in this commit: 748ddf8?

python/ray/_raylet.pyx

rkooo567 · 2023-09-01T08:39:22Z

python/ray/_private/node.py

+ redis_ip_address, redis_port = parts[1].rsplit(":", 1)
+ if parts[0] == "rediss":
+ enable_redis_ssl = True
+ maybe_key = get_key_from_storage(


Can you add the same validation logic as cleanup_redis_storage?

Added it for ports because the other type checks will happen during runtime and yield a very helpful error anyway (at least that was my experience with pytest on a type mismatch here).
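For readers following the thread, here is a rough sketch of the kind of address parsing and port validation being discussed. It is not the code in node.py: the function name `parse_redis_address`, the `split("://", 1)` step, and the exact error messages are assumptions made for illustration; only the `rsplit(":", 1)` split and the `rediss` scheme check come from the diff above.

```python
from typing import Tuple


def parse_redis_address(address: str) -> Tuple[str, int, bool]:
    """Split 'rediss://host:port' or 'host:port' into (host, port, use_ssl).

    Hypothetical helper: it only mirrors the shape of the diff above.
    """
    parts = address.split("://", 1)
    if len(parts) == 1:
        # No scheme given: treat the whole string as host:port.
        parts = ["", parts[0]]

    host_and_port = parts[1].rsplit(":", 1)
    if len(host_and_port) != 2:
        raise ValueError(f"Expected host:port, got {parts[1]!r}")
    host, port_str = host_and_port

    # Validate the port up front so a bad value fails with a clear message.
    if not port_str.isdigit():
        raise ValueError(f"Redis port must be an integer, got {port_str!r}")
    port = int(port_str)
    if not 0 < port < 65536:
        raise ValueError(f"Redis port out of range: {port}")

    # The 'rediss' scheme signals that SSL should be enabled.
    return host, port, parts[0] == "rediss"
```

Validating only the port matches the reply above: the remaining type mismatches are left to surface at runtime.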

rkooo567 · 2023-09-01T08:43:55Z

python/ray/_private/node.py

+ b"session_name",
+ )
+
+ if maybe_key is None:


@iycheng do we need to handle edge cases such as two head nodes started against the same Redis (without a storage namespace)?

I don't think so. That case doesn't work today if the namespace is not set, so it's not a regression IMO.
The application needs to set it up.
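A minimal sketch of the fallback this thread is about, assuming a lookup callable in place of the real Redis getter. `resolve_session_name` and `get_persisted` are hypothetical names; the generated format is the one from the removed lines in node.py quoted in this PR.

```python
import datetime
import os
from typing import Callable, Optional


def resolve_session_name(get_persisted: Callable[[bytes], Optional[bytes]]) -> str:
    """Reuse a persisted session name if one exists, else generate a new one.

    `get_persisted` is a stand-in for the Redis getter exposed by this PR.
    """
    maybe_key = get_persisted(b"session_name")
    if maybe_key is None:
        # Nothing persisted (e.g. first start of the cluster): make a new name.
        date_str = datetime.datetime.today().strftime("%Y-%m-%d_%H-%M-%S_%f")
        return f"session_{date_str}_{os.getpid()}"
    return maybe_key.decode("utf-8")
```

With a plain dict as the store, `resolve_session_name({}.get)` generates a fresh name, while a dict that already holds `b"session_name"` returns the stored value.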

rkooo567 · 2023-09-01T08:44:28Z

python/ray/_private/node.py

- date_str = datetime.datetime.today().strftime("%Y-%m-%d_%H-%M-%S_%f")
- self._session_name = f"session_{date_str}_{os.getpid()}"
+ maybe_key = None
+ if self._ray_params.external_addresses is not None:


Q: What happens if Redis is not started here?

We need Redis to be started by the time the client tries to connect to it, which it does with retries + exponential backoff and a timeout, so there is some leeway, but if it fails to connect then ray start will fail. In these lines specifically, we are only working with the parameters passed in, which are static from the start.

To be clear, if Redis is not started, this will fail in 1 second? What will the error message be?

It will fail saying that it can't connect to Redis. This is unfortunately a fatal log in the C++ Redis context, but the cause should be clear.
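To make "retries + exponential backoff and a timeout" concrete, here is a generic Python sketch of that pattern. It is not the C++ logic in redis_context.cc; the function name, the parameter names, and the default values are all assumptions chosen for illustration.

```python
import time
from typing import Callable


def connect_with_backoff(
    connect: Callable[[], None],
    *,
    timeout_s: float = 5.0,
    initial_delay_s: float = 0.1,
    max_delay_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Call `connect` until it succeeds or the overall deadline passes.

    Returns the number of attempts made. All defaults are illustrative.
    """
    deadline = clock() + timeout_s
    delay = initial_delay_s
    attempts = 0
    while True:
        attempts += 1
        try:
            connect()
            return attempts
        except ConnectionError as exc:
            remaining = deadline - clock()
            if remaining <= 0:
                raise TimeoutError("could not connect before the deadline") from exc
            # Never sleep past the deadline; double the delay up to a cap.
            sleep(min(delay, remaining))
            delay = min(delay * 2, max_delay_s)
```

The overall deadline is what bounds the "leeway" mentioned above: a Redis that comes up slightly late is tolerated, one that never comes up produces a hard failure.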

rkooo567 · 2023-09-01T08:46:04Z

python/ray/includes/global_state_accessor.pxd

+
+ std::string config_list;
+ RAY_CHECK(absl::Base64Unescape(config, &config_list));
+ RayConfig::instance().initialize(config_list);


Hmm is this safe? What's going to happen if we call ray.init with the same config later?

It is not thread-safe, but this part is single-threaded. Currently, each process (raylet, gcs, worker), calls it once, but if it is called multiple times serially, the values will just be reset.

Reset sounds like the right behavior.

Can you add a unit test that sets this config at the Redis level, sets it differently at the ray.init level, and checks that it resolves to the ray.init config correctly?

Can you add a comment here saying this will be reset when a core worker initializes the config again?

Manually tested by setting the value before and after the new Redis call, and running the new test with Redis enabled. It resolves to the correct config. Maybe we can defer the unit test to after this change?

Hmm, I don't follow. Why do we need to set this up? Is it about Redis configs?

rkooo567

It looks pretty good to me. A couple of comments (also, please update the PR description).

python/ray/_private/node.py

rkooo567 · 2023-09-01T11:43:56Z

python/ray/includes/global_state_accessor.pxd

+ *data = result.value();
+ ret_val = true;
+ } else {
+ RAY_LOG(ERROR) << "Failed to get " << key;


Suggested change

- RAY_LOG(ERROR) << "Failed to get " << key;
+ RAY_LOG(ERROR) << "Failed to get a key, " << key << " from Redis storage.";

Btw, doesn't this mean it is printed every time you call ray.init() for the first time? Can we just not log here and log in the node.py layer instead?

Changed it to log(info)

rkooo567 · 2023-09-01T11:45:36Z

python/ray/includes/global_state_accessor.pxd

+ std::make_unique<RedisStoreClient>(std::move(redis_client)));
+
+ bool ret_val = false;
+ cli->Get("session", key, [&](std::optional<std::string> result) {


What happens if cli fails with some other error, e.g., a timeout or a random Redis-related issue? Does result simply contain no value? Would it carry an error message in that case?

It will just not contain a value. Internally, on the C++ side, we do retry in redis_context.cc with exponential backoff, but the whole thing is bounded here by the io_context.run_for(duration). In terms of error propagation, the existing code isn't great about that, unfortunately.

Hmm, does this mean we cannot distinguish a Redis failure from the first time the cluster starts?

If the redis failed, GCS will crash too.

Oh, I meant a GET failure. But we work around this by checking the session name if the put failed (because override=False).
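The workaround described in this thread can be sketched as follows. `FakeStore` is a hypothetical in-memory stand-in for the Redis-backed store, and `persist_session_name` is an illustrative name; the real put and get calls in Ray may have different signatures.

```python
from typing import Dict, Optional


class FakeStore:
    """In-memory stand-in for a key-value store (hypothetical, for illustration)."""

    def __init__(self) -> None:
        self._data: Dict[bytes, bytes] = {}

    def put(self, key: bytes, value: bytes, overwrite: bool) -> bool:
        """Write the value; return False if the key exists and overwrite is off."""
        if key in self._data and not overwrite:
            return False
        self._data[key] = value
        return True

    def get(self, key: bytes) -> Optional[bytes]:
        return self._data.get(key)


def persist_session_name(store: FakeStore, session_name: bytes) -> None:
    """Persist the session name without overwriting; fail on a mismatch."""
    if store.put(b"session_name", session_name, overwrite=False):
        return  # First writer: the name is now persisted.
    # The put was rejected, so a name already exists. If an earlier GET missed
    # it and we generated a different name, this comparison catches that.
    existing = store.get(b"session_name")
    if existing != session_name:
        raise RuntimeError(
            f"Session name mismatch: persisted {existing!r}, local {session_name!r}"
        )
```

Because the put never overwrites, a failed GET followed by a freshly generated name shows up as a mismatch instead of silently replacing the persisted value.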

rkooo567 · 2023-09-01T11:48:16Z

python/ray/tests/test_gcs_fault_tolerance.py

+ if not enable_external_redis():
+ assert session_dir != new_session_dir
+ else:
+ assert session_dir == new_session_dir


Can you add tests inside this commit 748ddf8#diff-8e02c2ad08f47c6f22d3a04682d0c3867bffe7cf5f9f2838e6f0f999e97952d5?

The one inside test_ray_init.py and test_gcs_ha_e2e_2.py. I think test_gcs_ha_e2e_2 should just work.

The tests inside test_ray_init only work if redis is disabled. Right now for the tests, we either disable redis or enable redis for the duration of the entire test. In the latter case, we expect the session to be the same.

Yeah it is to verify session dir is not changed when redis is disabled!

rkooo567 · 2023-09-01T13:36:20Z

Okay, the test result seems pretty promising (looks like test_advanced_9.py is not that tricky to fix). I just started mac test and release test k8s_serve_ha_test. Can you sync with @edoakes to test this change asap with services?

rkooo567 · 2023-09-01T13:39:56Z

Also, this seems very loud?

[2023-09-01 22:38:12,558 I 53255 1167620] redis_context.cc:478: Resolve Redis address to 127.0.0.1
[2023-09-01 22:38:12,558 I 53255 1167620] redis_context.cc:364: Attempting to connect to address 127.0.0.1:49159.
[2023-09-01 22:38:12,558 I 53255 1167620] redis_context.cc:364: Attempting to connect to address 127.0.0.1:49159.
[2023-09-01 22:38:12,559 I 53255 1167620] redis_context.cc:532: Redis cluster leader is 127.0.0.1:49159
[2023-09-01 22:38:12,559 I 53255 1167620] redis_context.cc:478: Resolve Redis address to 127.0.0.1
[2023-09-01 22:38:12,559 I 53255 1167620] redis_context.cc:364: Attempting to connect to address 127.0.0.1:49159.
[2023-09-01 22:38:12,559 I 53255 1167620] redis_context.cc:364: Attempting to connect to address 127.0.0.1:49159.
[2023-09-01 22:38:12,559 I 53255 1167620] redis_context.cc:532: Redis cluster leader is 127.0.0.1:49159
[2023-09-01 22:38:12,560 E 53255 1167620] _raylet.cpp:865: Failed to get session_name

(maybe let's set the log level to warning when we instantiate the config)

rkooo567 · 2023-09-01T16:24:10Z

test_tempfile & test_advanced_9 seem like real failures.

vitsai · 2023-09-02T00:10:47Z

Changed test_advanced_9 because a failure to connect to Redis is fatal, and it now happens before GCS starts.

Regarding the extra logs: they should be once-per-init, and the log levels are pre-existing in RedisContext, so I don't think we need to change them.

rkooo567 · 2023-09-02T01:01:25Z

- Windows build failure
- Remove the config test (it is the same anyway)
- Set RAY_BACKEND_LOG_LEVEL=warning at the start of the Redis client and unset it before it returns.
- Add a session-dir-changed unit test and skip it when Redis is enabled
- Fail if the session name is overwritten
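One way to express "set the variable at the start and unset it before returning" in Python is a small context manager. This is only a sketch of the pattern: `temporary_env` is a hypothetical helper, and the thread does not say at which layer (Python, Cython, or C++) the PR actually applies the setting.

```python
import contextlib
import os
from typing import Iterator


@contextlib.contextmanager
def temporary_env(name: str, value: str) -> Iterator[None]:
    """Set an environment variable for the duration of a block, then restore it."""
    previous = os.environ.get(name)
    os.environ[name] = value
    try:
        yield
    finally:
        # Restore the prior state even if the block raised.
        if previous is None:
            os.environ.pop(name, None)
        else:
            os.environ[name] = previous
```

Usage would look like `with temporary_env("RAY_BACKEND_LOG_LEVEL", "warning"): ...` around the Redis lookup, so the quieter level does not leak into the rest of the process.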

rkooo567 · 2023-09-02T01:43:55Z

lint failure
can you add unit tests for redis output?

After it is addressed, I will approve the PR

vitsai · 2023-09-02T02:17:50Z

Verified that services restarts the head node smoothly and retains the same session dir with this change:

https://console.anyscale-staging.com/o/anyscale-internal/clusters/ses_59t6hd9eevn5jpirwqe5q3faq6?user=usr_x9zm6779nv4qcf5ljs3n11anwq&command-history-section=head_start_up_log

rkooo567 · 2023-09-02T02:55:39Z

Awesome. Some build failures, plus the gcs_ha_e2e_2.py failure is the same as what I told you. Set a shorter time for

RAY_CONFIG(int64_t, raylet_client_num_connect_attempts, 10)
RAY_CONFIG(int64_t, raylet_client_connect_timeout_milliseconds, 1000)

in both worker/head containers inside conftest_docker.py via RAY_raylet_client_num_connect_attempts=10 and RAY_raylet_client_connect_timeout_milliseconds=100
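A back-of-the-envelope sketch of why the shorter timeout matters, assuming the attempts run one after another and each one waits at most the configured timeout. That reading of the two settings is an assumption, and `worst_case_connect_wait_ms` and `RAYLET_CLIENT_ENV` are illustrative names, not part of conftest_docker.py.

```python
def worst_case_connect_wait_ms(num_attempts: int, timeout_ms: int) -> int:
    """Upper bound on total wait if every attempt runs out its timeout."""
    return num_attempts * timeout_ms


# Defaults quoted in the comment above: 10 attempts x 1000 ms each.
default_wait_ms = worst_case_connect_wait_ms(10, 1000)

# Suggested test override: 10 attempts x 100 ms each.
shortened_wait_ms = worst_case_connect_wait_ms(10, 100)

# Environment variables named in the comment above, as they might be
# passed to the head and worker containers.
RAYLET_CLIENT_ENV = {
    "RAY_raylet_client_num_connect_attempts": "10",
    "RAY_raylet_client_connect_timeout_milliseconds": "100",
}
```

Under that assumption the worst-case wait drops from about 10 seconds to about 1 second per connection.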

python/ray/includes/global_state_accessor.pxd

python/ray/tests/test_output.py

python/ray/_private/node.py

rkooo567 · 2023-09-02T05:26:15Z

Reminder: You need to fix #39194 (comment) to pass gcs_ha_e2e_2.py

vitsai · 2023-09-02T05:57:09Z

Looks like it was overridden with the wrong value before.

rkooo567 · 2023-09-02T14:08:31Z

test_advanced_9 & test_placement_group & test_tempfile seem to fail pretty consistently

rkooo567 · 2023-09-02T15:14:19Z

- test_advanced_9: you may need to do ci_repro and debug it. Seems like a weird failure.
- placement_group.py: can you make it a large test?
- test_tempfile: seems like a simple test issue that happens because we are sharing the same session dir now.

rkooo567 · 2023-09-05T01:47:52Z

rkooo567 · 2023-09-05T01:53:17Z

python/ray/tests/test_advanced_9.py

@@ -377,8 +377,6 @@ def test_redis_wrong_password(monkeypatch, external_redis, call_ray_stop_only):
 )

 assert "RedisError: ERR AUTH <password> called" in p.stderr.decode()
- assert "Please check /tmp/ray/session" in p.stderr.decode()


what's the error message now?

Just the authentication error

rkooo567 · 2023-09-05T07:15:34Z

@vitsai NOTE: I added a commit to fix test_advanced_9

vitsai · 2023-09-05T08:10:37Z

@rkooo567 you mean the Windows case for test_redis_not_available?

rkooo567 · 2023-09-05T12:14:40Z

test_advanced_9.py succeeded! I think other weird failures are unrelated, but let me try merging the latest master in case

rkooo567 · 2023-09-05T15:25:17Z

Test result looks good. merging.

vitsai added 2 commits September 1, 2023 02:57

initial change

11bd816

Signed-off-by: vitsai <[email protected]>

test

bf509de

Signed-off-by: vitsai <[email protected]>

rkooo567 reviewed Sep 1, 2023

some fixes

3db4e18

Signed-off-by: vitsai <[email protected]>

vitsai requested a review from a team as a code owner September 1, 2023 08:02

more fixes

f4537f8

Signed-off-by: vitsai <[email protected]>

rkooo567 reviewed Sep 1, 2023

comments

5fc76ab

Signed-off-by: vitsai <[email protected]>

rkooo567 reviewed Sep 1, 2023

vitsai changed the title from "[wip] Hack to retrieve session name" to "[core] Expose Redis getter to Python and use to retrieve session name" on Sep 1, 2023

vitsai added 2 commits September 1, 2023 23:30

comments

037aabd

Signed-off-by: vitsai <[email protected]>

more comments

ba58ad0

Signed-off-by: vitsai <[email protected]>

rkooo567 assigned rkooo567 and fishbone Sep 2, 2023

vitsai added 2 commits September 2, 2023 00:09

test file revert

41f98e5

Signed-off-by: vitsai <[email protected]>

lint

7c4613e

Signed-off-by: vitsai <[email protected]>

vitsai added 2 commits September 2, 2023 01:25

comments

b324a73

Signed-off-by: vitsai <[email protected]>

forgot a couple files

2ab8069

Signed-off-by: vitsai <[email protected]>

clean up logging

b0e76b4

Signed-off-by: vitsai <[email protected]>

rkooo567 approved these changes Sep 2, 2023

fishbone reviewed Sep 2, 2023

comments

3cf34dc

Signed-off-by: vitsai <[email protected]>

change 1000 to 100

1983a15

Signed-off-by: vitsai <[email protected]>

some fixes

c1853c2

Signed-off-by: vitsai <[email protected]>

rkooo567 reviewed Sep 5, 2023

Fix test advanced 9

5c2ea1d

rkooo567 mentioned this pull request Sep 5, 2023

[Core] Fix session_name not reused when GCS restarts + node ip address not set for driver #39211

Closed

Merge branch 'master' into redis-2

f18786c

rkooo567 merged commit 25c0e57 into ray-project:master Sep 5, 2023
112 of 126 checks passed

vitsai mentioned this pull request Sep 5, 2023

[core] Expose Redis getter to Python and use to retrieve session name #39269

Merged
