[runtime env] support eager_install in runtime env #17949
Conversation
```cpp
bool successful, const std::string &serialized_runtime_env_context) {
  if (successful) {
    start_worker_process_fn(task_spec, state, {}, false,
                            task_spec.SerializedRuntimeEnv(),
                            serialized_runtime_env_context, callback);
  } else {
    RAY_LOG(WARNING) << "Couldn't create a runtime environment for task "
                     << task_spec.TaskId() << ". The runtime environment was "
```
It looks like the task-specific failure message here was replaced by a job-specific error message below. Should we pass the task ID through here? If it would make the code messy, I don't feel strongly about it; the most useful thing is the serialized runtime env, which is still being printed.
Good point. Added a log here: ce76c5c#diff-4c19d6ebd8575189e9439fa5646b567546dc54f64d576a959ce1bb43902d1a03R1019
```python
def test_job_config_conda_env_eagerly(conda_envs, shutdown_only):
    runtime_env = {"conda": f"package-{REQUEST_VERSIONS[0]}"}
    job_config = ray.job_config.JobConfig(
        runtime_env=runtime_env, prepare_runtime_env_eagerly=True)
```
I think we would like to move `runtime_env` out of `JobConfig` and just have it specified in `ray.init()`. `runtime_env` is already supported as a keyword argument in `ray.init()`.
Is it possible to make this feature work with `ray.init(runtime_env=runtime_env, job_config=ray.job_config.JobConfig(prepare_runtime_env_eagerly=True))`? Or even simpler, `ray.init(runtime_env=runtime_env, prepare_runtime_env_eagerly=True)`? cc @edoakes, what are your thoughts on the API here?
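Concretely, the two call shapes under discussion would look something like this (a sketch only; `prepare_runtime_env_eagerly` is the flag name used in this PR and may not be the final API):

```python
import ray

runtime_env = {"conda": "my-env"}  # illustrative env name

# Shape 1: runtime_env passed to ray.init(), eager flag kept on JobConfig.
ray.init(
    runtime_env=runtime_env,
    job_config=ray.job_config.JobConfig(prepare_runtime_env_eagerly=True))

# Shape 2 (simpler): everything as top-level ray.init() keyword arguments.
# ray.init(runtime_env=runtime_env, prepare_runtime_env_eagerly=True)
```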
Yep, I'm also confused that there are two fields for setting `runtime_env`. I agree that we should keep only one. Let's make that clear.
@edoakes @SongGuyang what's the long-term plan for this feature? Should we have it on by default?
@ericl It's a trade-off. Eager mode is much friendlier for large-scale jobs or latency-sensitive jobs, but it can bring a download storm to the cluster, so we should decide how to set this flag for different scenarios. At AntGroup, most jobs will use eager mode because we have divided the nodes into different namespaces, so the downloads only happen within a single namespace. But recently we have taken on some small-scale jobs, and I think we also need lazy mode to reduce the downloads.
Hmm, I see. If we want to support both eager and on-demand loading, does it make sense to just add an "install runtime_env" command to Ray? I find setting this as a job config kind of confusing, since a job can have multiple runtime envs within it, and it's not clear if all of them are eagerly loaded or just the top-level one. In that case an API like the following seems more explicit:
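(The concrete snippet isn't preserved in this excerpt; a hypothetical sketch of such an explicit command, with a made-up function name, could look like the following.)

```python
# Hypothetical sketch only: no such command exists in Ray today.
def install_runtime_env(runtime_env: dict) -> None:
    """Eagerly install a runtime env on the cluster, decoupled from any task or actor."""
    ...


# Usage would make eager vs. on-demand installation an explicit choice:
install_runtime_env({"conda": "my-env"})
```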
But I don't have any objection to this PR as is.
@ericl I think it would likely make sense to turn this on by default at the job level in the (near) future. Agree that it's a bit weird for individual tasks/actors, but maybe we could just have a flag at that level too to eagerly install it when defined? If that's what we want to do, maybe this should be a property of the env itself (maybe a bit weird but it'd work):

```python
ray.init(runtime_env={"conda": {...}, "eager_install": True})
Actor.options(runtime_env={"conda": {...}, "eager_install": True}).remote()
```

@SongGuyang does this handle the case when a new node is added to the cluster (i.e., via autoscaling) while the job is running? Will we eagerly install in that case? Not sure how the `HandleJobStarted` callback works here.
Got it, eager install for job level makes sense.
@edoakes The way of …
There seems to be a relevant test failure: https://buildkite.com/ray-project/ray-builders-pr/builds/11915#1091a7fe-f8bb-42a7-9d02-cf9b5e9878b2/6-5919
@architkulkarni @edoakes Sorry for my late update. Please review the current PR again.
Looks great @SongGuyang! Just one question: it looks like this PR introduces a cache of serialized_runtime_env -> RuntimeEnvContext inside the WorkerPool, but we already have the same kind of cache inside the RuntimeEnvAgent:

```python
self._env_cache: Dict[str, CreatedEnvResult] = dict()
```
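For reference, a minimal sketch of the kind of cache being discussed, with a simplified stand-in for `CreatedEnvResult` (the real agent code differs; everything except the `_env_cache` name is illustrative):

```python
from typing import Dict, NamedTuple


class CreatedEnvResult(NamedTuple):
    # Simplified stand-in for the agent's real result type.
    success: bool
    # Serialized RuntimeEnvContext on success, or an error message on failure.
    result: str


class RuntimeEnvCache:
    """Maps a serialized runtime_env string to its creation result, so
    repeated setup requests for the same env can skip re-installation."""

    def __init__(self) -> None:
        self._env_cache: Dict[str, CreatedEnvResult] = dict()

    def get_or_create(self, serialized_runtime_env: str) -> CreatedEnvResult:
        cached = self._env_cache.get(serialized_runtime_env)
        if cached is not None:
            return cached
        # In the real agent this step would set up conda/pip/working_dir, etc.
        result = CreatedEnvResult(success=True, result=serialized_runtime_env)
        self._env_cache[serialized_runtime_env] = result
        return result
```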
@ericl could we please get codeowner approval from you? (Asking you because the other codeowners might not have as much context)
@architkulkarni Yep, the original idea was to reduce RPCs from the raylet to the agent. But if we consider the GC and cache strategies, it may bring some new issues. Let me try to remove it and add it back in the future if needed.
@ericl Can you help approve and merge this PR?
proto changes look compatible