Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

max_reconstructions Refactor and Rename #8274

Merged
merged 48 commits into from
May 14, 2020

Conversation

mfitton
Copy link
Contributor

@mfitton mfitton commented May 1, 2020

Why are these changes needed?

These changes are meant to improve the code's quality and make the API easier for users. The most noticeable change is a name change from reconstruction to restart
We now represent the concept of infinite retry on actor reconstruction with -1 instead of a very large number. This has necessitated changing several types from unsigned integers to signed integers.
In addition, we now keep track of the max_restarts and the num_restarts instead of the max_reconstructions and remaining_reconstructions.

Related issue number

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/latest/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failure rates at https://ray-travis-tracker.herokuapp.com/.
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested (please justify below)
      I have not added any new tests, but given this is mostly a refactoring change, the code should be sufficiently covered by existing tests.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@rkooo567 rkooo567 requested a review from edoakes May 1, 2020 17:52
python/ray/test-output.xml Outdated Show resolved Hide resolved
test-output.xml Outdated Show resolved Hide resolved
auto remaining_reconstructions =
need_reschedule ? mutable_actor_table_data->remaining_reconstructions() : 0;
int64_t max_restarts = mutable_actor_table_data->max_restarts();
uint64_t num_restarts = mutable_actor_table_data->num_restarts();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, but are we testing this anywhere?

@edoakes edoakes self-assigned this May 1, 2020
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25422/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25423/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25431/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25429/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25440/
Test PASSed.

Copy link
Contributor

@edoakes edoakes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall! Just some small nits/style comments.

doc/source/fault-tolerance.rst Outdated Show resolved Hide resolved
doc/source/fault-tolerance.rst Outdated Show resolved Hide resolved
java/api/src/main/java/io/ray/api/Checkpointable.java Outdated Show resolved Hide resolved
python/ray/actor.py Show resolved Hide resolved
python/ray/actor.py Outdated Show resolved Hide resolved
python/ray/tests/test_multinode_failures_2.py Outdated Show resolved Hide resolved
@@ -117,7 +117,7 @@ RAY_CONFIG(int64_t, max_direct_call_object_size, 100 * 1024)
RAY_CONFIG(int64_t, max_grpc_message_size, 100 * 1024 * 1024)

// The min number of retries for direct actor creation tasks. The actual number
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What on earth is this for? Does this mean that we will actually restarted actors even if the user specifies max_restarts=0? This doesn't seem like a good idea. Can you git blame this and find out why it's here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added by Eric Liang I think because we want to treat the failure of an actor creation task a bit differently than the failure of an actor that has started up successfully.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are not many details in the commit message. I'll ask Eric.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@edoakes from ekl:

We've always resubmitted the creation task if it failed, similar to normal tasks. I agree it probably makes sense to not do this if num reconstructions is configured to zero, but a lot of applications to flake if we don't reconstruct by default.
Note that resubmitting just the construction task if it fails is fairly safe; not much can go wrong.
Reconstructing after the actor has been used, is much dicier.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool, makes sense to me. Can you please update the comment in the code with this info?

src/ray/core_worker/task_manager.cc Outdated Show resolved Hide resolved
src/ray/gcs/gcs_server/gcs_actor_manager.cc Outdated Show resolved Hide resolved
src/ray/raylet/node_manager.cc Outdated Show resolved Hide resolved
@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25778/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25781/
Test PASSed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25792/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25815/
Test FAILed.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25814/
Test PASSed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25816/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25819/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25822/
Test FAILed.

Max Fitton added 2 commits May 13, 2020 18:09
…as not respecting the argument of -1 for infinite reconstruction
@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25858/
Test FAILed.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/25862/
Test FAILed.

@edoakes edoakes merged commit 00325eb into ray-project:master May 14, 2020
@mfitton mfitton deleted the max-reconstructions-refactor branch May 14, 2020 16:23
@mfitton mfitton restored the max-reconstructions-refactor branch May 14, 2020 16:23
@mfitton mfitton deleted the max-reconstructions-refactor branch May 14, 2020 16:23
@mfitton mfitton restored the max-reconstructions-refactor branch May 14, 2020 16:24
mfitton added a commit to mfitton/ray that referenced this pull request May 14, 2020

# Max 64 bit integer value, which is needed to ensure against overflow
# in C++ when passing integer values cross-language.
MAX_INT64_VALUE = 9223372036854775807
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can rely on Cython to check int overflow. This constant is unnecessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants