Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Core] Fix session_name not reused when GCS restarts + node ip address not set for driver #39211

Closed
wants to merge 46 commits into from

Conversation

rkooo567
Copy link
Contributor

@rkooo567 rkooo567 commented Sep 1, 2023

Why are these changes needed?

This is the PR that combines #37644 and also allows to reuse the session_name when GCS is restarted.

Every GCS HA users are supposed to use RAY_external_storage_namespace, and we are using this name as a session_name. This also means all the directory will have different style naming when GCS HA is used (instead of session_, it will become session_).

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@@ -34,19 +34,29 @@ Setting up Redis
set the OS environment ``RAY_REDIS_ADDRESS`` to
the Redis address, and supply the ``--redis-password`` flag with the password when calling ``ray start``:

You should also set the OS environment variable ``RAY_external_storage_namespace`` to isolate the data stored in Redis.
This makes sure that there is no data conflicts if multiple Ray clusters share the same Redis instance.
``RAY_external_storage_namespace`` must be an unique ID, and whenever you restart a head node,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we give users an example of unique id or generator code like uuid.uuid4() ?

# date including microsecond
date_str = datetime.datetime.today().strftime("%Y-%m-%d_%H-%M-%S_%f")
self._session_name = f"session_{date_str}_{os.getpid()}"
else:
assert not self._default_worker
session_name = ray._private.utils.internal_kv_get_with_retry(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This not a broker) . the behavior of worker node will be changed if worker node also set RAY_external_storage_namespace.
origin:
worker node's session_name always fetch from kv storage.

Now:
If worker node set RAY_external_storage_namespace. worker node's session_name will use RAY_external_storage_namespace first.

If the RAY_external_storage_namespace is set improperly by the user, it can potentially cause issues with the worker nodes.

@rkooo567
Copy link
Contributor Author

rkooo567 commented Sep 5, 2023

Sorry @larrylian this was the emergency PR in case we cannot get #39194 in. I think #39194 will work out, so let me close the issue

@rkooo567 rkooo567 closed this Sep 5, 2023
@larrylian
Copy link
Contributor

ok

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants