Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[runtime env] support eager_install in runtime env #17949

Merged
merged 15 commits into from
Oct 9, 2021

Conversation

SongGuyang
Copy link
Contributor

@SongGuyang SongGuyang commented Aug 19, 2021

Why are these changes needed?

  • Support prepare runtime env eagerly before the workers are leased.
    runtime_env = {"conda": {"dependencies": ["toolz"]}, "eager_install": True}
    ray.init(runtime_env=runtime_env)

Future work:

  • Support actor level eagerly install:
Actor.options(runtime_env={"conda": {...}, "eager_install": True}).remote()

bool successful, const std::string &serialized_runtime_env_context) {
if (successful) {
start_worker_process_fn(task_spec, state, {}, false,
task_spec.SerializedRuntimeEnv(),
serialized_runtime_env_context, callback);
} else {
RAY_LOG(WARNING) << "Couldn't create a runtime environment for task "
<< task_spec.TaskId() << ". The runtime environment was "
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the task-specific failure message here was replaced by a job-specific error message below. Should we pass the task ID through here? If it will make the code messy I don't feel strongly about it, the most useful thing is the serialized runtime env, which is still being printed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

def test_job_config_conda_env_eagerly(conda_envs, shutdown_only):
runtime_env = {"conda": f"package-{REQUEST_VERSIONS[0]}"}
job_config = ray.job_config.JobConfig(
runtime_env=runtime_env, prepare_runtime_env_eagerly=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we would like to move runtime_env out of the JobConfig and just have it specified in ray.init(). runtime_env is already supported as a keyword argument in ray.init().

Is it possible to make this feature work with ray.init(runtime_env=runtime_env, job_config=ray.job_config.JobConfig(prepare_runtime_env_eagerly=True)? Or even simpler, ray.init(runtime_env=runtime_env, prepare_runtime_env_eagerly=True)? cc @edoakes, what are your thoughts on the API here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I'm also confused that there are two fields to set runtime_env. I agree that we should reserve only one. Let's make it clearly.

Copy link
Contributor

@ericl ericl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edoakes @SongGuyang what's the long-term plan for this feature? Should we have it on by default?

@SongGuyang
Copy link
Contributor Author

@ericl It's a trade-off. Eager mode is much friendlier for large scale jobs or time-delay sensitive jobs. But it we bring downloading storm in cluster. We should determine to set this flag in different scenes. And in AntGroup, most of the jobs will use eager mode because we have divided the nodes into different namespace. The downloads will happen in one namespace inside. But recently, we took some little scale jobs. I think we also need lazy mode to decrease the downloads.

@ericl
Copy link
Contributor

ericl commented Aug 20, 2021

Hmm, I see. If we want to support both eager and on-demand loading, does it make sense to just add a "install runtime_env" command to Ray?

I find setting this as a job config kind of confusing, since a job can have multiple runtime envs within it, and it's not clear if all of them are eagerly loaded or just the top-level one.

In that case an API like the following seems more explicit:

ray.init()
for envs in ["env.yaml", "b.yaml", "c.yaml"]:
   ray.install_runtime_env(env)

But I don't have any objection to this PR as is.

@edoakes
Copy link
Contributor

edoakes commented Aug 20, 2021

@ericl I think it would likely make sense to turn this on by default at the job level in the (near) future. Agree that it's a bit weird for individual tasks/actors, but maybe we could just have a flag at that level too to eagerly install it when defined? If that's what we want to do, maybe this should be a property of the env itself (maybe a bit weird but it'd work):

ray.init(runtime_env={"conda": {...}, "eager_install": True})

Actor.options(runtime_env={"conda": {...}, "eager_install": True}).remote()

@SongGuyang does this handle the case when a new node is added to the cluster (i.e., via autoscaling) while the job is running? Will we eagerly install in that case? Not sure how the "HandleJobStarted" callback works here.

@ericl
Copy link
Contributor

ericl commented Aug 20, 2021 via email

@SongGuyang SongGuyang linked an issue Aug 23, 2021 that may be closed by this pull request
@SongGuyang
Copy link
Contributor Author

@edoakes The way of runtime_env={"conda": {...}, "eager_install": True} is good for me, thanks. But I can't implement task/actor level in current PR because we need to add a new RPC from worker to GCS to publish this install message. I will make an independent PR to do this (issue).
And about the case of new node added, I think we already handle this. GCS will publish all job table data to new node and HandleJobStarted could also been triggered. The related code is here https://github.com/ray-project/ray/blob/master/src/ray/gcs/gcs_client/service_based_accessor.cc#L274.

@architkulkarni architkulkarni added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Aug 31, 2021
@architkulkarni
Copy link
Contributor

@SongGuyang
Copy link
Contributor Author

@architkulkarni @edoakes Sorry for my late update. Please review current PR again.

@SongGuyang SongGuyang removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Sep 28, 2021
Copy link
Contributor

@architkulkarni architkulkarni left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great @SongGuyang! Just one question, it looks like this PR introduces a cache of serialized_runtime_env -> RuntimeEnvContext inside the WorkerPool, but we already have the same kind of cache inside the RuntimeEnvAgent:

self._env_cache: Dict[str, CreatedEnvResult] = dict()
. Is there still some benefit for having this extra cache at the level of WorkerPool? If not maybe we should remove it.

@ericl could we please get codeowner approval from you? (Asking you because the other codeowners might not have as much context)

@SongGuyang
Copy link
Contributor Author

@architkulkarni Yep, the original idea was to reduce RPC from raylet to agent. But If we consider the GC and Cache strategies, maybe it will bring some new issues. Let me try to remove it and add it in future if needed.

@SongGuyang
Copy link
Contributor Author

@ericl Can you help approve and merge this PR?

@SongGuyang SongGuyang changed the title [runtime env] support prepare_runtime_env_eagerly [runtime env] support eager_install in runtime env Oct 3, 2021
Copy link
Contributor

@wuisawesome wuisawesome left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

proto changes look compatible

@kfstorm kfstorm merged commit bae543c into ray-project:master Oct 9, 2021
@kfstorm kfstorm deleted the dev_prepare_runtime_env_eagerly branch October 9, 2021 10:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[runtime env] Eager mode of runtime env creation.
6 participants