Document fault tolerance behavior. #6698
Conversation
@@ -9,7 +9,7 @@
 public class ActorCreationOptions extends BaseTaskOptions {

   public static final int NO_RECONSTRUCTION = 0;
-  public static final int INFINITE_RECONSTRUCTIONS = (int) Math.pow(2, 30);
+  public static final int INFINITE_RECONSTRUCTION = (int) Math.pow(2, 30);
This was inconsistent between Python and Java.
Any particular reason this was changed from plural to singular? (I see that grepping the source code for INFINITE_RECONSTRUCTION turns up both variants...) Also, setting a finite value for an "infinite" setting doesn't quite make sense. I know this is an old parameter, but why isn't it named MAX_RECONSTRUCTIONS?
@deanwampler We just use a very large number to imitate "infinity". This simplifies the implementation. I agree it's not perfect, but users don't have to care about the actual value of this constant, so I think this is fine.
Regarding the name, I think it's clearer when users specify setMaxReconstructions(INFINITE_RECONSTRUCTION). setMaxReconstructions(MAX_RECONSTRUCTION) is a bit unclear.
Also, based on our experience, in real applications our users never want a finite number; their choice is always 0 or infinite. So I'm thinking of adding an enableAutoReconstruction API, which just sets max_reconstructions = infinite.
And we could still keep the setMaxReconstructions API for tests and potential power users.
Do you think it would be clearer this way? @robertnishihara @deanwampler
I don't like enableAutoReconstruction very much because it introduces more configuration flags that interact with each other. I think max_reconstructions=ray.ray_constants.INFINITE_RECONSTRUCTION is fine for now.
Okay, I don't have a strong preference either. Let's keep the current way for now.
Can one of the admins verify this patch?
@@ -1736,7 +1736,12 @@ def method(self):
   number of times that the actor should be reconstructed when it dies
   unexpectedly. The minimum valid value is 0 (default), which indicates
   that the actor doesn't need to be reconstructed. And the maximum valid
-  value is ray.ray_constants.INFINITE_RECONSTRUCTIONS.
+  value is ray.ray_constants.INFINITE_RECONSTRUCTION.
+* **max_retries**: Only for *remote functions*. This specifies the maximum
I don't like the inconsistency between max_reconstructions for actors and max_retries for tasks. Thoughts about this? @pcmoritz
Yeah, I agree. Maybe max_reconstructions is a bit clearer? Don't have a strong preference though.
try:
    counter = ray.get(actor.increment_and_possibly_fail.remote(0.1))
    print(counter)
except ray.exceptions.RayActorError:
I would expect no exceptions to be raised until max_reconstructions has been exceeded. Instead, it looks like methods that are running when the actor dies also raise exceptions. Is that intentional? @edoakes @stephanie-wang @raulchen @ericl
This seems easy to change, but we would need some way of detecting reconstruction if it's not exposed as an error.
It shouldn't raise exceptions in that case. Reconstruction should be transparent to the caller. Can you make this example deterministic (as I mentioned in the above thread) to confirm this behavior?
@raulchen I just made it deterministic. If you try running it, you'll see that it raises exceptions well before max_reconstructions has been used up.
Right now we only do resubmission for non-actor tasks, so if the remote actor dies, tasks that haven't finished executing are marked as failed. Or do we want to support re-submission for actor tasks as well?
Oh, I see. Re-submission isn't supported for direct actors yet. For old non-direct actors, reconstruction is completely transparent.
Once that changes, we should update this documentation.
Test PASSed.
Each **raylet** (the scheduler process) sends heartbeats to a **monitor** process. If the monitor does not receive any heartbeats from a given raylet for some period of time (about ten seconds), then it will mark that process as dead.
When a worker is executing a task, if the worker dies unexpectedly, either
Does it preferentially attempt to rerun on the same machine, unless the machine itself failed? How are failures detected? How does Ray determine if it's just the task vs. the whole machine? (I haven't read all the docs thoroughly enough to know whether or not this information is somewhere else already...)
I don't think it preferentially reruns it in any particular place. It's just scheduled again as if a new task had been submitted.
There are two kinds of failures:
- The worker process fails but the node doesn't. This is detected by the raylet when the worker process disconnects from the raylet.
- The whole node fails. This is detected by the monitor which notices missed heartbeats from the raylet.
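The heartbeat-timeout logic described above can be sketched as a toy simulation in plain Python. This is not Ray's actual implementation; the Monitor class, its method names, and the timeout constant are invented for illustration (the text says the real timeout is about ten seconds).

```python
# Toy illustration of heartbeat-based node failure detection.
# HEARTBEAT_TIMEOUT_S mirrors the ~10-second window mentioned above.
HEARTBEAT_TIMEOUT_S = 10.0

class Monitor:
    def __init__(self, timeout_s=HEARTBEAT_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_heartbeat = {}  # raylet id -> timestamp of last heartbeat

    def heartbeat(self, raylet_id, now):
        self.last_heartbeat[raylet_id] = now

    def dead_raylets(self, now):
        # A raylet (and hence its node) is marked dead once no heartbeat
        # has arrived within the timeout window.
        return [r for r, t in self.last_heartbeat.items()
                if now - t > self.timeout_s]

monitor = Monitor()
monitor.heartbeat("raylet-a", now=0.0)
monitor.heartbeat("raylet-b", now=0.0)
monitor.heartbeat("raylet-a", now=8.0)  # raylet-b stops heartbeating

print(monitor.dead_raylets(now=12.0))  # -> ['raylet-b']
```

The worker-process failure path (detected by the raylet when the worker disconnects) needs no timeout at all, which is why it is detected much faster than whole-node failure.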
number of reconstructions has been used up, then subsequent actor methods will raise exceptions.

When an actor is reconstructed, its state will be recreated by rerunning its
Does this mean whatever state the actor held is lost, i.e., the state is reset?
Yes. But there is a Checkpointable API to let users restore actor states.
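The idea behind such a checkpoint hook can be sketched without Ray at all. The class and method names below (CheckpointableCounter, save_checkpoint, load_checkpoint) are invented for this sketch and are not the actual Checkpointable interface:

```python
# Toy illustration of checkpoint-based actor state recovery. Without a
# checkpoint, reconstruction reruns the creation task and the state resets;
# with one, the saved state can be restored afterwards.
class CheckpointableCounter:
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

    def save_checkpoint(self):
        # In a real system this would be persisted outside the process.
        return {"value": self.value}

    @classmethod
    def load_checkpoint(cls, checkpoint):
        actor = cls()  # reconstruction reruns the creation task...
        actor.value = checkpoint["value"]  # ...then restores saved state
        return actor

original = CheckpointableCounter()
for _ in range(3):
    original.increment()
snapshot = original.save_checkpoint()

# Simulate the actor dying and being reconstructed from the checkpoint.
restored = CheckpointableCounter.load_checkpoint(snapshot)
print(restored.increment())  # -> 4
```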
Once the Checkpointable API is working again, I'll document that here.
def potentially_fail(failure_probability):
    time.sleep(0.2)
    if np.random.random() < failure_probability:
        os._exit(0)
Can we use a counter actor instead of probability, so this example is deterministic?
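A Ray-free sketch of that suggestion: drive the failure from a call counter instead of a random draw, so the crash happens at a fixed, reproducible point. The class name and the fail-on-third-call choice are just for illustration, and the sketch raises an exception instead of calling os._exit(0) so it stays self-contained:

```python
# Deterministic variant of potentially_fail: instead of sampling a random
# number, fail exactly on the Nth call. A real Ray actor would call
# os._exit(0) here to simulate a crash; we raise to keep the sketch testable.
class DeterministicFailer:
    def __init__(self, fail_on_call=3):
        self.fail_on_call = fail_on_call
        self.calls = 0

    def increment_and_possibly_fail(self):
        self.calls += 1
        if self.calls == self.fail_on_call:
            raise RuntimeError(
                "simulated actor death on call %d" % self.calls)
        return self.calls

failer = DeterministicFailer(fail_on_call=3)
print(failer.increment_and_possibly_fail())  # -> 1
print(failer.increment_and_possibly_fail())  # -> 2
try:
    failer.increment_and_possibly_fail()  # deterministically "dies" here
except RuntimeError as err:
    print(err)
```

Because the failure point is fixed, a test can assert exactly how many calls succeed before the actor dies, which a probabilistic version cannot do.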