Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[core] Allow user to override global default for max_retries #25189

Merged
merged 6 commits into from
Jun 1, 2022

Conversation

stephanie-wang
Copy link
Contributor

Why are these changes needed?

This PR allows the user to override the global default for max_retries for non-actor tasks. It adds an OS env called RAY_task_max_retries which can be passed to the driver or set with runtime envs. Any future tasks submitted by that worker will default to this value instead of 3, the hard-coded default.

It would be nicer if we could have a standard way of setting these defaults, but I think this is fine as a one-off for now (not a clear need for overriding defaults of other @ray.remote options yet).

Related issue number

Closes #24854.

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

"RAY_task_max_retries", v.default_value
)
v.default_value = int(v.default_value)
print("option", k, v.default_value)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like debug log?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks! :|

allows infinite retries, and 0 disables retries. To override the default number
of retries for all tasks submitted, set the OS environment variable
``RAY_task_max_retries``. e.g., by passing this to your driver script or by
using :ref:`runtime environments<runtime-environments>`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By convention, I think we should add an alias so this is RAY_TASK_MAX_RETRIES, and reserve RAY_lower_case_vars as internal flags, for user clarity.

I'll leave it to you and @scv119 to decide if this should be a policy though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, I'll change this.

@@ -217,6 +218,14 @@ def _remote(self, args=None, kwargs=None, **task_options):

# fill task required options
for k, v in ray_option_utils.task_options.items():
if k == "max_retries":
# TODO(swang): We need to override max_retries here because the default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah this is tricky... wondering if there is a better way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, but Is this not going to work?

    "max_retries": _counting_option(
        "max_retries", default_value=os.environ.get("RAY_TASK_MAX_RETRIES", ray_constants.DEFAULT_TASK_MAX_RETRIES)
    ),

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't work if the OS env is set after the ray import. Maybe it's only a real issue for python tests, but either way checking it at ray.remote time seems more robust.

allows infinite retries, and 0 disables retries. To override the default number
of retries for all tasks submitted, set the OS environment variable
``RAY_TASK_MAX_RETRIES``. e.g., by passing this to your driver script or by
using :ref:`runtime environments<runtime-environments>`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we have a short example here?

RAY_TASK_MAX_RETRIES driver.py

or 

# in driver.py
ray.init(address="auto", runtime_env={...})

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is necessary (runtime envs page provides examples) and it kind of breaks the flow of this docs page.

@@ -217,6 +218,14 @@ def _remote(self, args=None, kwargs=None, **task_options):

# fill task required options
for k, v in ray_option_utils.task_options.items():
if k == "max_retries":
# TODO(swang): We need to override max_retries here because the default
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, but Is this not going to work?

    "max_retries": _counting_option(
        "max_retries", default_value=os.environ.get("RAY_TASK_MAX_RETRIES", ray_constants.DEFAULT_TASK_MAX_RETRIES)
    ),

?

@rkooo567
Copy link
Contributor

cc @matthewdeng

@stephanie-wang stephanie-wang merged commit 961b875 into ray-project:master Jun 1, 2022
@stephanie-wang stephanie-wang deleted the max-retries branch June 1, 2022 21:42
matthewdeng pushed a commit to matthewdeng/ray that referenced this pull request Jun 9, 2022
…ject#25189)

This PR allows the user to override the global default for max_retries for non-actor tasks. It adds an OS env called RAY_task_max_retries which can be passed to the driver or set with runtime envs. Any future tasks submitted by that worker will default to this value instead of 3, the hard-coded default.

It would be nicer if we could have a standard way of setting these defaults, but I think this is fine as a one-off for now (not a clear need for overriding defaults of other @ray.remote options yet).
Related issue number

Closes ray-project#24854.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[core] Allow user to configure default max task retries
5 participants