
Document fault tolerance behavior. #6698

Merged

Conversation

robertnishihara (Collaborator, author):

No description provided.

@@ -9,7 +9,7 @@
 public class ActorCreationOptions extends BaseTaskOptions {
 
   public static final int NO_RECONSTRUCTION = 0;
-  public static final int INFINITE_RECONSTRUCTIONS = (int) Math.pow(2, 30);
+  public static final int INFINITE_RECONSTRUCTION = (int) Math.pow(2, 30);

robertnishihara (Collaborator, author):
This was inconsistent between Python and Java.

Contributor:
Any particular reason this was changed from plural to singular? (I see that grepping the source code for INFINITE_RECONSTRUCTION turns up both variants...) Also, setting a finite value for an "infinite" setting doesn't quite make sense. I know this is an old parameter, but why isn't it named MAX_RECONSTRUCTIONS?

raulchen (Contributor), Jan 6, 2020:
@deanwampler We just use a very large number to imitate "infinity"; this simplifies the implementation. I agree it's not perfect, but users don't have to care about the actual value of this constant, so I think this is fine.
Regarding the name, I think it's clearer when users write setMaxReconstructions(INFINITE_RECONSTRUCTION); setMaxReconstructions(MAX_RECONSTRUCTION) would be a bit unclear.

Contributor:
Also, based on our experience, in real applications users never want a finite number; their choice is always 0 or infinite. So I'm thinking of adding an enableAutoReconstruction API, which just sets max_reconstructions = infinite.
We could still keep the setMaxReconstructions API for tests and potential power users.
Do you think that would be clearer? @robertnishihara @deanwampler

robertnishihara (Collaborator, author):
I don't like enableAutoReconstruction very much because it introduces more configuration flags that interact with each other. I think max_reconstructions=ray.ray_constants.INFINITE_RECONSTRUCTION is fine for now.
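
For reference, a minimal sketch of what this looks like on the Python side, using the max_reconstructions parameter and the constant referenced above (the Counter class itself is illustrative, not code from this PR):

import ray
from ray import ray_constants

ray.init()

# Reconstruct this actor an effectively unlimited number of times if its
# process dies unexpectedly.
@ray.remote(max_reconstructions=ray_constants.INFINITE_RECONSTRUCTION)
class Counter:
    def __init__(self):
        self.value = 0  # state is rebuilt by rerunning the constructor

    def increment(self):
        self.value += 1
        return self.value

counter = Counter.remote()
print(ray.get(counter.increment.remote()))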

Contributor:
Okay, I don't have a strong preference either. Let's keep the current approach for now.

@AmplabJenkins:
Can one of the admins verify this patch?

@@ -1736,7 +1736,12 @@ def method(self):
   number of times that the actor should be reconstructed when it dies
   unexpectedly. The minimum valid value is 0 (default), which indicates
   that the actor doesn't need to be reconstructed. And the maximum valid
-  value is ray.ray_constants.INFINITE_RECONSTRUCTIONS.
+  value is ray.ray_constants.INFINITE_RECONSTRUCTION.
 * **max_retries**: Only for *remote functions*. This specifies the maximum

robertnishihara (Collaborator, author):
I don't like the inconsistency between max_reconstructions for actors and max_retries for tasks. Thoughts about this? @pcmoritz

Contributor:
Yeah I agree. Maybe max_reconstructions is a bit clearer? Don't have a strong preference though.
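
For context, a quick sketch of the two parameters side by side, following the documentation hunk above (both decorator arguments are taken from that description; the function and class bodies are illustrative):

import ray

ray.init()

# Tasks: max_retries bounds how many times Ray re-runs a remote function
# whose worker died unexpectedly.
@ray.remote(max_retries=3)
def fragile_task():
    return "ok"

# Actors: max_reconstructions bounds how many times Ray recreates an actor
# (by rerunning its constructor) after its process dies unexpectedly.
@ray.remote(max_reconstructions=2)
class FragileActor:
    def ping(self):
        return "pong"

print(ray.get(fragile_task.remote()))
print(ray.get(FragileActor.remote().ping.remote()))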

try:
    counter = ray.get(actor.increment_and_possibly_fail.remote(0.1))
    print(counter)
except ray.exceptions.RayActorError:

robertnishihara (Collaborator, author), Jan 5, 2020:
I would expect no exceptions to be raised until max_reconstructions has been exceeded. Instead, it looks like methods that are running when the actor dies also raise exceptions. Is that intentional? @edoakes @stephanie-wang @raulchen @ericl

ericl (Contributor), Jan 5, 2020:
This seems easy to change, but we would need some way of detecting reconstruction if it's not exposed as an error.

Contributor:
It shouldn't raise exceptions in that case. Reconstruction should be transparent to the caller. Can you make this example deterministic (as I mentioned in the above thread) to confirm this behavior?

robertnishihara (Collaborator, author):
@raulchen I just made it deterministic. If you try running it, you'll see that it raises exceptions way before the max_reconstructions have been used up.

Contributor:
Right now we only do resubmission for non-actor tasks, so if the remote actor dies, tasks that haven't finished executing are marked as failed. Or do we want to support resubmission for actor tasks as well?

Contributor:
Oh, I see. Resubmission isn't supported for direct actors yet. For old non-direct actors, reconstruction is completely transparent.

robertnishihara (Collaborator, author):
Once that changes, we should update this documentation.
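
Until actor-task resubmission is supported, callers have to handle this themselves. A minimal sketch of that pattern, built around the try/except excerpt above (the retry loop and attempt count are illustrative, not part of the PR):

import ray

def call_with_retry(actor, num_attempts=3):
    # With direct actors, a call that was in flight when the actor died raises
    # RayActorError even though the actor is being reconstructed, so the
    # caller retries the call against the reconstructed actor.
    for _ in range(num_attempts):
        try:
            return ray.get(actor.increment_and_possibly_fail.remote(0.1))
        except ray.exceptions.RayActorError:
            continue
    raise RuntimeError("actor was still unavailable after all attempts")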

@AmplabJenkins:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20380/

Each **raylet** (the scheduler process) sends heartbeats to a **monitor**
process. If the monitor does not receive any heartbeats from a given raylet for
some period of time (about ten seconds), then it will mark that process as dead.
When a worker is executing a task, if the worker dies unexpectedly, either
Contributor:
Does it preferentially attempt to rerun on the same machine, unless the machine itself failed? How are failures detected? How does Ray determine if it's just the task vs. the whole machine? (I haven't read all the docs thoroughly enough to know whether or not this information is somewhere else already...)

robertnishihara (Collaborator, author):
I don't think it preferentially reruns it in any particular place. It's just scheduled again as if a new task had been submitted.

There are two kinds of failures:

  • The worker process fails but the node doesn't. This is detected by the raylet when the worker process disconnects from the raylet.
  • The whole node fails. This is detected by the monitor, which notices missed heartbeats from the raylet.
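
To make the first case concrete, a small sketch (not from the PR; it assumes a single-node ray.init() so the hypothetical /tmp marker file is visible to the retried task) of a worker dying mid-task and the task simply being scheduled again:

import os
import ray

ray.init()

@ray.remote(max_retries=2)  # bound how many times Ray re-runs this task
def crash_once():
    marker = "/tmp/ray_crash_once_marker"  # hypothetical marker, just to crash a single time
    if not os.path.exists(marker):
        open(marker, "w").close()
        os._exit(0)  # the worker process dies; the raylet sees the disconnect
    return "recovered on retry"

print(ray.get(crash_once.remote()))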

number of reconstructions has been used up, then subsequent actor methods will
raise exceptions.

When an actor is reconstructed, its state will be recreated by rerunning its
Contributor:
Does this mean whatever state the actor held is lost, i.e., the state is reset?

Contributor:
Yes. But there is a Checkpointable API to let users restore actor states.

robertnishihara (Collaborator, author):
Once the Checkpointable API is working again, I'll document that here.
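
In the meantime, a small sketch of the current behavior, showing that reconstruction reruns the constructor and therefore resets in-memory state (the Counter class and its crash method are illustrative only):

import os
import ray

ray.init()

@ray.remote(max_reconstructions=1)
class Counter:
    def __init__(self):
        self.count = 0  # rebuilt from scratch when the actor is reconstructed

    def increment(self):
        self.count += 1
        return self.count

    def crash(self):
        os._exit(0)  # kill the actor process to trigger reconstruction

counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1
counter.crash.remote()  # the actor dies and is reconstructed
# The reconstructed actor reran __init__, so its count starts over at 0;
# restoring state would require checkpointing on top of this.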

def potentially_fail(failure_probability):
    time.sleep(0.2)
    if np.random.random() < failure_probability:
        os._exit(0)
Contributor:
Can we use a counter actor instead of probability, so this example is deterministic?
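
A sketch of what a deterministic version could look like, failing on a fixed call number instead of by probability (the class name and failure schedule here are illustrative, not the PR's final wording):

import os
import time
import ray

ray.init()

@ray.remote(max_reconstructions=1)
class FlakyCounter:
    def __init__(self):
        self.count = 0

    def increment_and_possibly_fail(self, fail_on_call=2):
        self.count += 1
        time.sleep(0.2)
        if self.count == fail_on_call:
            os._exit(0)  # always dies on the same call, so the behavior is reproducible
        return self.count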

@AmplabJenkins:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/20423/

robertnishihara merged commit 5e43b25 into ray-project:master on Jan 7, 2020.
robertnishihara deleted the documentfaulttolerance branch on January 7, 2020 at 06:34.