Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Expose Redis getter to Python and use to retrieve session name #39194

Merged
merged 17 commits into from
Sep 5, 2023

Conversation

vitsai
Copy link
Contributor

@vitsai vitsai commented Sep 1, 2023

If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.

Why are these changes needed?

Related issue number

#38796

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Signed-off-by: vitsai <[email protected]>
Signed-off-by: vitsai <[email protected]>
python/ray/tests/test_gcs_fault_tolerance.py Show resolved Hide resolved
cluster.wait_for_nodes()

head_node = cluster.head_node
session_dir = head_node.get_session_dir_path()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CAn you also submit a driver and check if their session name is correct?

import ray
ray.init()
# Make sure this is correct
ray._private.worker._global_node.get_session_dir_path()

We need this

  1. Start the first head node. Submit a driver to a head node and get a session dir
  2. Start the worker node. Submit a driver to "worker node" and get a session dir.
  3. Kill head node. Restart head node.
  4. Submit a driver to a head node and get a session dir
  5. Submit a driver to a worker node and get a session dir.

This could also be done in a similar way as test_gcs_ha_e2e.py (it starts head / worker in docker containers, so it is easy to test it)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the same suite of tests in this commit; 748ddf8?

Signed-off-by: vitsai <[email protected]>
@vitsai vitsai requested a review from a team as a code owner September 1, 2023 08:02
Signed-off-by: vitsai <[email protected]>
cluster.wait_for_nodes()

head_node = cluster.head_node
session_dir = head_node.get_session_dir_path()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the same suite of tests in this commit; 748ddf8?

python/ray/_raylet.pyx Outdated Show resolved Hide resolved
python/ray/_private/node.py Outdated Show resolved Hide resolved
redis_ip_address, redis_port = parts[1].rsplit(":", 1)
if parts[0] == "rediss":
enable_redis_ssl = True
maybe_key = get_key_from_storage(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add the same validation logic as cleanup_redis_storage?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added it for ports because the other type checks will happen during runtime and yield a very helpful error anyway (at least that was my experience with pytest on a type mismatch here).

python/ray/_private/node.py Outdated Show resolved Hide resolved
python/ray/_private/node.py Show resolved Hide resolved
b"session_name",
)

if maybe_key is None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iycheng do we need to handle edge cases like there are 2 head nodes started with the same redis (without storage namespace)?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so. It's not working right now if the namespace is not set. So not a regression IMO.
Application needs to set it up.

date_str = datetime.datetime.today().strftime("%Y-%m-%d_%H-%M-%S_%f")
self._session_name = f"session_{date_str}_{os.getpid()}"
maybe_key = None
if self._ray_params.external_addresses is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: What happens if Redis is not started here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need Redis to be started by the time the client tries to connect to it, which it does with retries + exponential backoff and a timeout, so there is some leeway, but if it fails to connect then ray start will fail. In these lines specifically, we are only working with the parameters passed in, which are static from the start.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be clear, if redis is not started, this will fail in 1 second? what's going to be error messageS?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will fail saying that it can't connect to Redis. This is unfortunately a fatal log in the C++ Redis context, but the cause should be clear.


std::string config_list;
RAY_CHECK(absl::Base64Unescape(config, &config_list));
RayConfig::instance().initialize(config_list);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm is this safe? What's going to happen if we call ray.init with the same config later?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not thread-safe, but this part is single-threaded. Currently, each process (raylet, gcs, worker), calls it once, but if it is called multiple times serially, the values will just be reset.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reset sounds like the right behavior,

can you add an unit test to set this config from redis level and set it differently from ray.init level and see if it resolves to ray.init config correctly?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add comment here this will be reset when a core worker initializes the config again

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Manually tested by setting the value before and after the new Redis call, and running the new test with Redis enabled. It resolves to the correct config. Maybe we can defer the unit test to after this change?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I fail to get it. Why do we need setup this? Is it about redis configs?

python/ray/includes/global_state_accessor.pxd Show resolved Hide resolved
Signed-off-by: vitsai <[email protected]>
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks pretty good to me. A couple comments; (also please update the PR desc)

python/ray/_private/node.py Outdated Show resolved Hide resolved
python/ray/_private/node.py Outdated Show resolved Hide resolved
*data = result.value();
ret_val = true;
} else {
RAY_LOG(ERROR) << "Failed to get " << key;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RAY_LOG(ERROR) << "Failed to get " << key;
RAY_LOG(ERROR) << "Failed to get a key, " << key << " from Redis storage.";

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, doesn't this mean it is printed every time you ray.init() first time? Can we just not log here and log in the node.py layer instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed it to log(info)

python/ray/includes/global_state_accessor.pxd Show resolved Hide resolved
std::make_unique<RedisStoreClient>(std::move(redis_client)));

bool ret_val = false;
cli->Get("session", key, [&](std::optional<std::string> result) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's happening if the CLI fails by other error? E.g., timeout or random Redis related issues? Is it just result contains no value? Would this contain error message in this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will just not contain a value. Internally, on the C++ side, we do retry in redis_context.cc with exponential backoff, but the whole thing is bounded here by the io_context.run_for(duration). In terms of error propagation, the existing code isn't great about that, unfortunately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm this means we cannot distinguish redis failure vs the first time cluster start?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the redis failed, GCS will crash too.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I meant GET failure. But we workaround this by checking the session name if put failed (because override=False)

if not enable_external_redis():
assert session_dir != new_session_dir
else:
assert session_dir == new_session_dir
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add tests inside this commit 748ddf8#diff-8e02c2ad08f47c6f22d3a04682d0c3867bffe7cf5f9f2838e6f0f999e97952d5?

The one inside test_ray_init.py and test_gcs_ha_e2e_2.py. I think test_gcs_ha_e2e_2 should just work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The tests inside test_ray_init only work if redis is disabled. Right now for the tests, we either disable redis or enable redis for the duration of the entire test. In the latter case, we expect the session to be the same.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it is to verify session dir is not changed when redis is disabled!

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 1, 2023

Okay, the test result seems pretty promising (looks like test_advanced_9.py is not that tricky to fix). I just started mac test and release test k8s_serve_ha_test. Can you sync with @edoakes to test this change asap with services?

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 1, 2023

Also, this seems very loud>?

[2023-09-01 22:38:12,558 I 53255 1167620] redis_context.cc:478: Resolve Redis address to 127.0.0.1
[2023-09-01 22:38:12,558 I 53255 1167620] redis_context.cc:364: Attempting to connect to address 127.0.0.1:49159.
[2023-09-01 22:38:12,558 I 53255 1167620] redis_context.cc:364: Attempting to connect to address 127.0.0.1:49159.
[2023-09-01 22:38:12,559 I 53255 1167620] redis_context.cc:532: Redis cluster leader is 127.0.0.1:49159
[2023-09-01 22:38:12,559 I 53255 1167620] redis_context.cc:478: Resolve Redis address to 127.0.0.1
[2023-09-01 22:38:12,559 I 53255 1167620] redis_context.cc:364: Attempting to connect to address 127.0.0.1:49159.
[2023-09-01 22:38:12,559 I 53255 1167620] redis_context.cc:364: Attempting to connect to address 127.0.0.1:49159.
[2023-09-01 22:38:12,559 I 53255 1167620] redis_context.cc:532: Redis cluster leader is 127.0.0.1:49159
[2023-09-01 22:38:12,560 E 53255 1167620] _raylet.cpp:865: Failed to get session_name

(maybe let's set the log level to warning when we instantiate the config)

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 1, 2023

test_tempfile & test_advanced_9 seems like a real failure.

@vitsai vitsai changed the title [wip] Hack to retrieve session name [core] Expose Redis getter to Python and use to retrieve session name Sep 1, 2023
Signed-off-by: vitsai <[email protected]>
Signed-off-by: vitsai <[email protected]>
Signed-off-by: vitsai <[email protected]>
Signed-off-by: vitsai <[email protected]>
@vitsai
Copy link
Contributor Author

vitsai commented Sep 2, 2023

Changed test_advanced_9 because a failure to connect to Redis is a fatal, which now happens before GCS starts.

Regarding the extra logs: they should be once-per-init, and the log levels are pre-existing in RedisContext, so I don't think we need to change them.

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 2, 2023

  1. windows build failure
  2. Remove config test (it is the same anyway)
  3. Set RAY_BACKEND_LOG_LEVEL=warning in the beg of Redis client and unset before it returns.
  4. Add session dir changed unit test and skip when redis is enabled
  5. Fail if session name is overwritten

Signed-off-by: vitsai <[email protected]>
@rkooo567
Copy link
Contributor

rkooo567 commented Sep 2, 2023

  1. lint failure
  2. can you add unit tests for redis output?

After it is addressed, I will approve the PR

@vitsai
Copy link
Contributor Author

vitsai commented Sep 2, 2023

Verified that services restarts the head node smoothly and retains the same session dir with this change:

https://console.anyscale-staging.com/o/anyscale-internal/clusters/ses_59t6hd9eevn5jpirwqe5q3faq6?user=usr_x9zm6779nv4qcf5ljs3n11anwq&command-history-section=head_start_up_log

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 2, 2023

Awesome. Some build faliures + (gcs_ha_e2e_2.py failure is same as what I told you. Set shorter time for

RAY_CONFIG(int64_t, raylet_client_num_connect_attempts, 10)
RAY_CONFIG(int64_t, raylet_client_connect_timeout_milliseconds, 1000)

in both worker/head containers inside conftest_docker.py via RAY_raylet_client_num_connect_attempts=10 and RAY_raylet_client_connect_timeout_milliseconds=100

Signed-off-by: vitsai <[email protected]>
python/ray/includes/global_state_accessor.pxd Show resolved Hide resolved
python/ray/tests/test_output.py Outdated Show resolved Hide resolved
Signed-off-by: vitsai <[email protected]>
@rkooo567
Copy link
Contributor

rkooo567 commented Sep 2, 2023

Reminder: You need to fix #39194 (comment) to pass gcs_ha_e2e_2.py

Signed-off-by: vitsai <[email protected]>
@vitsai
Copy link
Contributor Author

vitsai commented Sep 2, 2023

Looks like overrode with the wrong value before

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 2, 2023

test_advanced_9 & test_placement_group & test_tempfile seems to fail pretty consistenty

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 2, 2023

test_advanced_9: you may need to do ci_repro and debug it. Seems like a weird failure
placement_group.py: can you make it large test?
test_tempfile: seems like a simple test issue that happens because we are sharing the same session dir now

Signed-off-by: vitsai <[email protected]>
@rkooo567
Copy link
Contributor

rkooo567 commented Sep 5, 2023

@@ -377,8 +377,6 @@ def test_redis_wrong_password(monkeypatch, external_redis, call_ray_stop_only):
)

assert "RedisError: ERR AUTH <password> called" in p.stderr.decode()
assert "Please check /tmp/ray/session" in p.stderr.decode()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the error message now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just the authentication error

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 5, 2023

@vitsai NOTE: I added a commit to fix test_advanced_9

@vitsai
Copy link
Contributor Author

vitsai commented Sep 5, 2023

@rkooo567 you mean the Windows case for test_redis_not_available?

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 5, 2023

test_advanced_9.py succeeded! I think other weird failures are unrelated, but let me try merging the latest master in case

@rkooo567
Copy link
Contributor

rkooo567 commented Sep 5, 2023

Test result looks good. merging.

@rkooo567 rkooo567 merged commit 25c0e57 into ray-project:master Sep 5, 2023
112 of 126 checks passed
LeonLuttenberger pushed a commit to jaidisido/ray that referenced this pull request Sep 5, 2023
…ray-project#39194)

If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.
vitsai added a commit to vitsai/ray that referenced this pull request Sep 5, 2023
…ray-project#39194)

If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.
GeneDer pushed a commit that referenced this pull request Sep 5, 2023
…#39194) (#39269)

If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.
harborn pushed a commit to harborn/ray that referenced this pull request Sep 8, 2023
…ray-project#39194)

If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.
jimthompson5802 pushed a commit to jimthompson5802/ray that referenced this pull request Sep 12, 2023
…ray-project#39194)

If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.

Signed-off-by: Jim Thompson <[email protected]>
vymao pushed a commit to vymao/ray that referenced this pull request Oct 11, 2023
…ray-project#39194)

If only GCS communicates with Redis, there is no way to initialize the log directories using persisted values. However, these directories are required to exist before starting GCS. Expose a function to the Python layer specifically to retrieve these keys from Redis and set them. Follow-ups will ensure that these keys are only set and retrieved through this interface.

Signed-off-by: Victor <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants