-
Notifications
You must be signed in to change notification settings - Fork 5.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
max_reconstructions Refactor and Rename #8274
max_reconstructions Refactor and Rename #8274
Conversation
…r API, which will indicate infinite reconstructions of the actor.
…f discussion in the api_changes channel
…N constants, to be replaced by the values 0 and 1
Can one of the admins verify this patch? |
auto remaining_reconstructions = | ||
need_reschedule ? mutable_actor_table_data->remaining_reconstructions() : 0; | ||
int64_t max_restarts = mutable_actor_table_data->max_restarts(); | ||
uint64_t num_restarts = mutable_actor_table_data->num_restarts(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just curious, but are we testing this anywhere?
…me change to Unreconstructable errors.
Test PASSed. |
Test PASSed. |
Test PASSed. |
Test FAILed. |
Test PASSed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall! Just some small nits/style comments.
@@ -117,7 +117,7 @@ RAY_CONFIG(int64_t, max_direct_call_object_size, 100 * 1024) | |||
RAY_CONFIG(int64_t, max_grpc_message_size, 100 * 1024 * 1024) | |||
|
|||
// The min number of retries for direct actor creation tasks. The actual number |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What on earth is this for? Does this mean that we will actually restarted actors even if the user specifies max_restarts=0
? This doesn't seem like a good idea. Can you git blame
this and find out why it's here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was added by Eric Liang I think because we want to treat the failure of an actor creation task a bit differently than the failure of an actor that has started up successfully.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are not many details in the commit message. I'll ask Eric.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@edoakes from ekl:
We've always resubmitted the creation task if it failed, similar to normal tasks. I agree it probably makes sense to not do this if num reconstructions is configured to zero, but a lot of applications to flake if we don't reconstruct by default.
Note that resubmitting just the construction task if it fails is fairly safe; not much can go wrong.
Reconstructing after the actor has been used, is much dicier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, makes sense to me. Can you please update the comment in the code with this info?
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
Co-authored-by: Edward Oakes <[email protected]>
… into max-reconstructions-refactor
Test PASSed. |
Test PASSed. |
Test PASSed. |
Test FAILed. |
Test PASSed. |
Test FAILed. |
… into max-reconstructions-refactor
Test FAILed. |
Test FAILed. |
…as not respecting the argument of -1 for infinite reconstruction
Test FAILed. |
Test FAILed. |
…ay-project#8274) Co-authored-by: Edward Oakes <[email protected]>
|
||
# Max 64 bit integer value, which is needed to ensure against overflow | ||
# in C++ when passing integer values cross-language. | ||
MAX_INT64_VALUE = 9223372036854775807 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can rely on Cython
to check int overflow. This constant is unnecessary.
Why are these changes needed?
These changes are meant to improve the code's quality and make the API easier for users. The most noticeable change is a name change from
reconstruction
torestart
We now represent the concept of infinite retry on actor reconstruction with
-1
instead of a very large number. This has necessitated changing several types from unsigned integers to signed integers.In addition, we now keep track of the
max_restarts
and thenum_restarts
instead of themax_reconstructions
andremaining_reconstructions
.Related issue number
Checks
scripts/format.sh
to lint the changes in this PR.I have not added any new tests, but given this is mostly a refactoring change, the code should be sufficiently covered by existing tests.