-
Notifications
You must be signed in to change notification settings - Fork 5.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[core] Allow user to override global default for max_retries #25189
Conversation
python/ray/remote_function.py
Outdated
"RAY_task_max_retries", v.default_value | ||
) | ||
v.default_value = int(v.default_value) | ||
print("option", k, v.default_value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like debug log?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks! :|
allows infinite retries, and 0 disables retries. To override the default number | ||
of retries for all tasks submitted, set the OS environment variable | ||
``RAY_task_max_retries``. e.g., by passing this to your driver script or by | ||
using :ref:`runtime environments<runtime-environments>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By convention, I think we should add an alias so this is RAY_TASK_MAX_RETRIES
, and reserve RAY_lower_case_vars
as internal flags, for user clarity.
I'll leave it to you and @scv119 to decide if this should be a policy though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, I'll change this.
@@ -217,6 +218,14 @@ def _remote(self, args=None, kwargs=None, **task_options): | |||
|
|||
# fill task required options | |||
for k, v in ray_option_utils.task_options.items(): | |||
if k == "max_retries": | |||
# TODO(swang): We need to override max_retries here because the default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah this is tricky... wondering if there is a better way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, but Is this not going to work?
"max_retries": _counting_option(
"max_retries", default_value=os.environ.get("RAY_TASK_MAX_RETRIES", ray_constants.DEFAULT_TASK_MAX_RETRIES)
),
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't work if the OS env is set after the ray import. Maybe it's only a real issue for python tests, but either way checking it at ray.remote time seems more robust.
allows infinite retries, and 0 disables retries. To override the default number | ||
of retries for all tasks submitted, set the OS environment variable | ||
``RAY_TASK_MAX_RETRIES``. e.g., by passing this to your driver script or by | ||
using :ref:`runtime environments<runtime-environments>`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why don't we have a short example here?
RAY_TASK_MAX_RETRIES driver.py
or
# in driver.py
ray.init(address="auto", runtime_env={...})
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this is necessary (runtime envs page provides examples) and it kind of breaks the flow of this docs page.
@@ -217,6 +218,14 @@ def _remote(self, args=None, kwargs=None, **task_options): | |||
|
|||
# fill task required options | |||
for k, v in ray_option_utils.task_options.items(): | |||
if k == "max_retries": | |||
# TODO(swang): We need to override max_retries here because the default |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, but Is this not going to work?
"max_retries": _counting_option(
"max_retries", default_value=os.environ.get("RAY_TASK_MAX_RETRIES", ray_constants.DEFAULT_TASK_MAX_RETRIES)
),
?
cc @matthewdeng |
…ject#25189) This PR allows the user to override the global default for max_retries for non-actor tasks. It adds an OS env called RAY_task_max_retries which can be passed to the driver or set with runtime envs. Any future tasks submitted by that worker will default to this value instead of 3, the hard-coded default. It would be nicer if we could have a standard way of setting these defaults, but I think this is fine as a one-off for now (not a clear need for overriding defaults of other @ray.remote options yet). Related issue number Closes ray-project#24854.
Why are these changes needed?
This PR allows the user to override the global default for max_retries for non-actor tasks. It adds an OS env called
RAY_task_max_retries
which can be passed to the driver or set with runtime envs. Any future tasks submitted by that worker will default to this value instead of 3, the hard-coded default.It would be nicer if we could have a standard way of setting these defaults, but I think this is fine as a one-off for now (not a clear need for overriding defaults of other @ray.remote options yet).
Related issue number
Closes #24854.
Checks
scripts/format.sh
to lint the changes in this PR.