
Document fault tolerance behavior. #6698

Merged

Conversation

robertnishihara (Collaborator, author):

No description provided.

@@ -9,7 +9,7 @@
 public class ActorCreationOptions extends BaseTaskOptions {
 
   public static final int NO_RECONSTRUCTION = 0;
-  public static final int INFINITE_RECONSTRUCTIONS = (int) Math.pow(2, 30);
+  public static final int INFINITE_RECONSTRUCTION = (int) Math.pow(2, 30);

robertnishihara (Collaborator, author):
This was inconsistent between Python and Java.

Contributor:
Any particular reason this was changed from plural to singular? (I see that grepping the source code for INFINITE_RECONSTRUCTION turns up both variants...) Also, setting a finite value for an "infinite" setting doesn't quite make sense. I know this is an old parameter, but why isn't it named MAX_RECONSTRUCTIONS?

raulchen (Contributor), Jan 6, 2020:
@deanwampler We just use a very large number to imitate "infinity"; this simplifies the implementation. I agree it's not perfect, but users don't have to care about the actual value of this constant, so I think this is fine.
Regarding the name, I think it's clearer when users write setMaxReconstructions(INFINITE_RECONSTRUCTION); setMaxReconstructions(MAX_RECONSTRUCTION) would be a bit unclear.

Contributor:
Also, based on our experience, in real applications users never want a finite number; their choice is always 0 or infinite. So I'm thinking of adding an enableAutoReconstruction API, which just sets max_reconstructions = infinite.
We could still keep the setMaxReconstructions API for tests and potential power users.
Do you think that would be clearer? @robertnishihara @deanwampler

robertnishihara (Collaborator, author):
I don't like enableAutoReconstruction very much because it introduces more configuration flags that interact with each other. I think max_reconstructions=ray.ray_constants.INFINITE_RECONSTRUCTION is fine for now.
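
For reference, a minimal sketch of what this looks like on the Python side, using the max_reconstructions parameter and the constant referenced above (the Counter class itself is illustrative, not code from this PR):

import ray
from ray import ray_constants

ray.init()

# Reconstruct this actor an effectively unlimited number of times if its
# process dies unexpectedly.
@ray.remote(max_reconstructions=ray_constants.INFINITE_RECONSTRUCTION)
class Counter:
    def __init__(self):
        self.value = 0  # state is rebuilt by rerunning the constructor

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()
print(ray.get(counter.increment.remote()))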

Contributor:
Okay, I don't have a strong preference either. Let's keep the current approach for now.

@AmplabJenkins:
Can one of the admins verify this patch?

@@ -1736,7 +1736,12 @@ def method(self):
   number of times that the actor should be reconstructed when it dies
   unexpectedly. The minimum valid value is 0 (default), which indicates
   that the actor doesn't need to be reconstructed. And the maximum valid
-  value is ray.ray_constants.INFINITE_RECONSTRUCTIONS.
+  value is ray.ray_constants.INFINITE_RECONSTRUCTION.
 * **max_retries**: Only for *remote functions*. This specifies the maximum

robertnishihara (Collaborator, author):
I don't like the inconsistency between max_reconstructions for actors and max_retries for tasks. Thoughts about this? @pcmoritz

Contributor:
Yeah I agree. Maybe max_reconstructions is a bit clearer? Don't have a strong preference though.
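
For context, a quick sketch of the two parameters side by side, following the documentation hunk above (both decorator arguments are taken from that description; the function and class bodies are illustrative):

import ray

ray.init()

# Tasks: max_retries bounds how many times Ray re-runs a remote function
# whose worker died unexpectedly.
@ray.remote(max_retries=3)
def fragile_task():
    return "ok"

# Actors: max_reconstructions bounds how many times Ray recreates an actor
# (by rerunning its constructor) after its process dies unexpectedly.
@ray.remote(max_reconstructions=2)
class FragileActor:
    def ping(self):
        return "pong"

print(ray.get(fragile_task.remote()))
print(ray.get(FragileActor.remote().ping.remote()))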

try:
    counter = ray.get(actor.increment_and_possibly_fail.remote(0.1))
    print(counter)
except ray.exceptions.RayActorError:

robertnishihara (Collaborator, author), Jan 5, 2020:
I would expect no exceptions to be raised until max_reconstructions has been exceeded. Instead, it looks like methods that are running when the actor dies also raise exceptions. Is that intentional? @edoakes @stephanie-wang @raulchen @ericl

ericl (Contributor), Jan 5, 2020:
This seems easy to change, but we would need some way of detecting reconstruction if it's not exposed as an error.

Contributor:
It shouldn't raise exceptions in that case. Reconstruction should be transparent to the caller. Can you make this example deterministic (as I mentioned in the above thread) to confirm this behavior?

robertnishihara (Collaborator, author):
@raulchen I just made it deterministic. If you try running it, you'll see that it raises exceptions way before the max_reconstructions have been used up.

Contributor:
Right now we only do resubmission for non-actor tasks, so if the remote actor dies, tasks that haven't finished executing are marked as failed. Or do we want to support resubmission for actor tasks as well?

Contributor:
Oh, I see. Resubmission isn't supported for direct actors yet. For old non-direct actors, reconstruction is completely transparent.

robertnishihara (Collaborator, author):
Once that changes, we should update this documentation.
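
Until actor-task resubmission is supported, callers have to handle this themselves. A minimal sketch of that pattern, built around the try/except excerpt above (the retry loop and attempt count are illustrative, not part of the PR):

import ray

def call_with_retry(actor, num_attempts=3):
    # With direct actors, a call that was in flight when the actor died raises
    # RayActorError even though the actor is being reconstructed, so the
    # caller retries the call against the reconstructed actor.
    for _ in range(num_attempts):
        try:
            return ray.get(actor.increment_and_possibly_fail.remote(0.1))
        except ray.exceptions.RayActorError:
            continue
    raise RuntimeError("actor was still unavailable after all attempts")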

@AmplabJenkins:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20380/

Each **raylet** (the scheduler process) sends heartbeats to a **monitor**
process. If the monitor does not receive any heartbeats from a given raylet for
some period of time (about ten seconds), then it will mark that process as dead.
When a worker is executing a task, if the worker dies unexpectedly, either
Contributor:
Does it preferentially attempt to rerun on the same machine, unless the machine itself failed? How are failures detected? How does Ray determine if it's just the task vs. the whole machine? (I haven't read all the docs thoroughly enough to know whether or not this information is somewhere else already...)

robertnishihara (Collaborator, author):
I don't think it preferentially reruns it in any particular place. It's just scheduled again as if a new task had been submitted.

There are two kinds of failures:

  • The worker process fails but the node doesn't. This is detected by the raylet when the worker process disconnects from the raylet.
  • The whole node fails. This is detected by the monitor, which notices missed heartbeats from the raylet.
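
To make the first case concrete, a small sketch (not from the PR; it assumes a single-node ray.init() so the hypothetical /tmp marker file is visible to the retried task) of a worker dying mid-task and the task simply being scheduled again:

import os
import ray

ray.init()

@ray.remote(max_retries=2)  # bound how many times Ray re-runs this task
def crash_once():
    marker = "/tmp/ray_crash_once_marker"  # hypothetical marker, just to crash a single time
    if not os.path.exists(marker):
        open(marker, "w").close()
        os._exit(0)  # the worker process dies; the raylet sees the disconnect
    return "recovered on retry"

print(ray.get(crash_once.remote()))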

number of reconstructions has been used up, then subsequent actor methods will
raise exceptions.

When an actor is reconstructed, its state will be recreated by rerunning its
Contributor:
Does this mean whatever state the actor held is lost, i.e., the state is reset?

Contributor:
Yes. But there is a Checkpointable API to let users restore actor states.

robertnishihara (Collaborator, author):
Once the Checkpointable API is working again, I'll document that here.
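
In the meantime, a small sketch of the current behavior, showing that reconstruction reruns the constructor and therefore resets in-memory state (the Counter class and its crash method are illustrative only):

import os
import ray

ray.init()

@ray.remote(max_reconstructions=1)
class Counter:
    def __init__(self):
        self.count = 0  # rebuilt from scratch when the actor is reconstructed

    def increment(self):
        self.count += 1
        return self.count

    def crash(self):
        os._exit(0)  # kill the actor process to trigger reconstruction

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1
counter.crash.remote()  # the actor dies and is reconstructed
# The reconstructed actor reran __init__, so its count starts over at 0;
# restoring state would require checkpointing on top of this.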

def potentially_fail(failure_probability):
    time.sleep(0.2)
    if np.random.random() < failure_probability:
        os._exit(0)
Contributor:
Can we use a counter actor instead of probability, so this example is deterministic?
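
A sketch of what a deterministic version could look like, failing on a fixed call number instead of by probability (the class name and failure schedule here are illustrative, not the PR's final wording):

import os
import time
import ray

ray.init()

@ray.remote(max_reconstructions=1)
class FlakyCounter:
    def __init__(self):
        self.count = 0

    def increment_and_possibly_fail(self, fail_on_call=2):
        self.count += 1
        time.sleep(0.2)
        if self.count == fail_on_call:
            os._exit(0)  # always dies on the same call, so the behavior is reproducible
        return self.count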

@AmplabJenkins:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20423/

robertnishihara merged commit 5e43b25 into ray-project:master on Jan 7, 2020.
robertnishihara deleted the documentfaulttolerance branch on January 7, 2020 at 06:34.