-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Expose Redis getter to Python and use to retrieve session name #39194
Conversation
Signed-off-by: vitsai <[email protected]>
Signed-off-by: vitsai <[email protected]>
cluster.wait_for_nodes() | ||
|
||
head_node = cluster.head_node | ||
session_dir = head_node.get_session_dir_path() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CAn you also submit a driver and check if their session name is correct?
import ray
ray.init()
# Make sure this is correct
ray._private.worker._global_node.get_session_dir_path()
We need this
- Start the first head node. Submit a driver to a head node and get a session dir
- Start the worker node. Submit a driver to "worker node" and get a session dir.
- Kill head node. Restart head node.
- Submit a driver to a head node and get a session dir
- Submit a driver to a worker node and get a session dir.
This could also be done in a similar way as test_gcs_ha_e2e.py
(it starts head / worker in docker containers, so it is easy to test it)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add the same suite of tests in this commit; 748ddf8?
Signed-off-by: vitsai <[email protected]>
Signed-off-by: vitsai <[email protected]>
cluster.wait_for_nodes() | ||
|
||
head_node = cluster.head_node | ||
session_dir = head_node.get_session_dir_path() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add the same suite of tests in this commit; 748ddf8?
python/ray/_private/node.py
Outdated
redis_ip_address, redis_port = parts[1].rsplit(":", 1) | ||
if parts[0] == "rediss": | ||
enable_redis_ssl = True | ||
maybe_key = get_key_from_storage( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add the same validation logic as cleanup_redis_storage
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added it for ports because the other type checks will happen during runtime and yield a very helpful error anyway (at least that was my experience with pytest on a type mismatch here).
b"session_name", | ||
) | ||
|
||
if maybe_key is None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@iycheng do we need to handle edge cases like there are 2 head nodes started with the same redis (without storage namespace)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so. It's not working right now if the namespace is not set. So not a regression IMO.
Application needs to set it up.
python/ray/_private/node.py
Outdated
date_str = datetime.datetime.today().strftime("%Y-%m-%d_%H-%M-%S_%f") | ||
self._session_name = f"session_{date_str}_{os.getpid()}" | ||
maybe_key = None | ||
if self._ray_params.external_addresses is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Q: What happens if Redis is not started here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need Redis to be started by the time the client tries to connect to it, which it does with retries + exponential backoff and a timeout, so there is some leeway, but if it fails to connect then ray start will fail. In these lines specifically, we are only working with the parameters passed in, which are static from the start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to be clear, if redis is not started, this will fail in 1 second? what's going to be error messageS?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will fail saying that it can't connect to Redis. This is unfortunately a fatal log in the C++ Redis context, but the cause should be clear.
|
||
std::string config_list; | ||
RAY_CHECK(absl::Base64Unescape(config, &config_list)); | ||
RayConfig::instance().initialize(config_list); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm is this safe? What's going to happen if we call ray.init with the same config later?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not thread-safe, but this part is single-threaded. Currently, each process (raylet, gcs, worker), calls it once, but if it is called multiple times serially, the values will just be reset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reset sounds like the right behavior,
can you add an unit test to set this config from redis level and set it differently from ray.init level and see if it resolves to ray.init config correctly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add comment here this will be reset when a core worker initializes the config again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Manually tested by setting the value before and after the new Redis call, and running the new test with Redis enabled. It resolves to the correct config. Maybe we can defer the unit test to after this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I fail to get it. Why do we need setup this? Is it about redis configs?
Signed-off-by: vitsai <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks pretty good to me. A couple comments; (also please update the PR desc)
*data = result.value(); | ||
ret_val = true; | ||
} else { | ||
RAY_LOG(ERROR) << "Failed to get " << key; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
RAY_LOG(ERROR) << "Failed to get " << key; | |
RAY_LOG(ERROR) << "Failed to get a key, " << key << " from Redis storage."; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, doesn't this mean it is printed every time you ray.init() first time? Can we just not log here and log in the node.py layer instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed it to log(info)
std::make_unique<RedisStoreClient>(std::move(redis_client))); | ||
|
||
bool ret_val = false; | ||
cli->Get("session", key, [&](std::optional<std::string> result) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's happening if the CLI fails by other error? E.g., timeout or random Redis related issues? Is it just result contains no value? Would this contain error message in this case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will just not contain a value. Internally, on the C++ side, we do retry in redis_context.cc with exponential backoff, but the whole thing is bounded here by the io_context.run_for(duration)
. In terms of error propagation, the existing code isn't great about that, unfortunately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm this means we cannot distinguish redis failure vs the first time cluster start?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the redis failed, GCS will crash too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I meant GET failure. But we workaround this by checking the session name if put failed (because override=False)
if not enable_external_redis(): | ||
assert session_dir != new_session_dir | ||
else: | ||
assert session_dir == new_session_dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add tests inside this commit 748ddf8#diff-8e02c2ad08f47c6f22d3a04682d0c3867bffe7cf5f9f2838e6f0f999e97952d5?
The one inside test_ray_init.py
and test_gcs_ha_e2e_2.py
. I think test_gcs_ha_e2e_2 should just work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The tests inside test_ray_init only work if redis is disabled. Right now for the tests, we either disable redis or enable redis for the duration of the entire test. In the latter case, we expect the session to be the same.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah it is to verify session dir is not changed when redis is disabled!
Okay, the test result seems pretty promising (looks like test_advanced_9.py is not that tricky to fix). I just started mac test and release test |
Also, this seems very loud>?
(maybe let's set the log level to warning when we instantiate the config) |
test_tempfile & test_advanced_9 seems like a real failure. |
Signed-off-by: vitsai <[email protected]>
Signed-off-by: vitsai <[email protected]>
Signed-off-by: vitsai <[email protected]>
Signed-off-by: vitsai <[email protected]>
Changed test_advanced_9 because a failure to connect to Redis is a fatal, which now happens before GCS starts. Regarding the extra logs: they should be once-per-init, and the log levels are pre-existing in RedisContext, so I don't think we need to change them. |
|
Signed-off-by: vitsai <[email protected]>
Signed-off-by: vitsai <[email protected]>
After it is addressed, I will approve the PR |
Verified that services restarts the head node smoothly and retains the same session dir with this change: |
Awesome. Some build faliures + (gcs_ha_e2e_2.py failure is same as what I told you. Set shorter time for
in both worker/head containers inside |
Signed-off-by: vitsai <[email protected]>
Signed-off-by: vitsai <[email protected]>
Reminder: You need to fix #39194 (comment) to pass gcs_ha_e2e_2.py |
Signed-off-by: vitsai <[email protected]>
Looks like overrode with the wrong value before |
test_advanced_9 & test_placement_group & test_tempfile seems to fail pretty consistenty |
test_advanced_9: you may need to do ci_repro and debug it. Seems like a weird failure |
Signed-off-by: vitsai <[email protected]>
|
@@ -377,8 +377,6 @@ def test_redis_wrong_password(monkeypatch, external_redis, call_ray_stop_only): | |||
) | |||
|
|||
assert "RedisError: ERR AUTH <password> called" in p.stderr.decode() | |||
assert "Please check /tmp/ray/session" in p.stderr.decode() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the error message now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just the authentication error
@vitsai NOTE: I added a commit to fix test_advanced_9 |
@rkooo567 you mean the Windows case for |
test_advanced_9.py succeeded! I think other weird failures are unrelated, but let me try merging the latest master in case |
Test result looks good. merging. |
…ray-project#39194) If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.
…ray-project#39194) If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.
…#39194) (#39269) If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.
…ray-project#39194) If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.
…ray-project#39194) If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface. Signed-off-by: Jim Thompson <[email protected]>
…ray-project#39194) If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface. Signed-off-by: Victor <[email protected]>
If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.
Why are these changes needed?
Related issue number
#38796
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.