Return "crashing" state from _scheduler/docs immediately after the first crash #1276

Closed
nickva opened this issue Apr 9, 2018 · 0 comments
nickva commented Apr 9, 2018

When a replication job that has been running for a while crashes with an error and is stopped, its status in the _scheduler/docs endpoint response should be crashing instead of pending.

The crashing vs. pending status is driven by the "consecutive errors" count. This is the number of crashes that occur in a row, for example a crash soon after a job start or a crash soon after another crash. If a crash happens long enough after a job start or a previous crash, the consecutive errors count is reset to 0 and the job is considered healthy.
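To make the counting rule concrete, here is a rough sketch of it in Python. This is not the replicator's actual code. The event labels, the `consecutive_errors` name, and the history layout are all made up for illustration, and the 120 second threshold is an assumption borrowed from the "2 minutes" figure in the commit message further down.

```python
from collections import namedtuple

# Hypothetical event record: kind is "started" or "crashed",
# timestamp is in seconds. Not the replicator's real data structure.
Event = namedtuple("Event", ["kind", "timestamp"])

# Assumption: a crash this long after the previous event does not count
# as consecutive (taken from the "2 minutes" in the commit message).
HEALTH_THRESHOLD_SEC = 120


def consecutive_errors(history, threshold=HEALTH_THRESHOLD_SEC):
    """Return the consecutive errors count for a history ordered oldest first.

    A crash that follows the previous event (a start or another crash)
    within `threshold` seconds extends the streak. A crash that comes
    later than that resets the count to 0.
    """
    count = 0
    prev = None
    for event in history:
        if event.kind == "crashed":
            if prev is not None and event.timestamp - prev.timestamp < threshold:
                count += 1
            else:
                count = 0
        prev = event
    return count


# A job that runs for 5 minutes and then crashes still has a count of 0.
long_run = [Event("started", 0), Event("crashed", 300)]

# A job that keeps crashing shortly after each start builds up a streak.
flapping = [
    Event("started", 0),
    Event("crashed", 10),
    Event("started", 20),
    Event("crashed", 30),
]
```

With these inputs, `consecutive_errors(long_run)` is 0 even though the last thing that happened to the job was a crash, which is the case this issue is about.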

So this sequence of steps is possible: a user starts a job, lets it run for 5 minutes, then deletes the source. The output of _scheduler/docs will show state = pending since the consecutive errors count is still 0, but the user would rather see the state as crashing in the result.

The fix is to return crashing if a crash is the last event in the job history, even if the errors count is still 0.
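A minimal sketch of the proposed change, again in Python and only as an illustration. The function names and the string labels for events are hypothetical, and only the pending vs. crashing decision for a job that is not currently running is modelled.

```python
def state_before_fix(last_event_kind, error_count):
    """Old behaviour: only the consecutive errors count decides the state."""
    return "crashing" if error_count > 0 else "pending"


def state_after_fix(last_event_kind, error_count):
    """Proposed behaviour: a crash as the last history event also counts.

    `last_event_kind` is a hypothetical label for the most recent entry
    in the job history, for example "started" or "crashed".
    """
    if error_count > 0 or last_event_kind == "crashed":
        return "crashing"
    return "pending"


# The scenario from this issue: the job ran long enough that the
# consecutive errors count is 0, then it crashed and stopped.
old = state_before_fix("crashed", 0)
new = state_after_fix("crashed", 0)
```

Here `old` is `pending` and `new` is `crashing`. Jobs that already had a non-zero errors count are reported the same way by both functions, so only the reporting changes, not the scheduling.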

nickva self-assigned this Apr 9, 2018
nickva added a commit to cloudant/couchdb that referenced this issue Apr 10, 2018
Replication jobs are backed off based on the number of consecutive crashes,
that is, we count the number of crashes in a row and then penalize jobs with an
exponential wait based on that number. After a job runs without crashing for 2
minutes, we consider it healthy and stop going back in its history and looking
for crashes.

Previously a job's state was set to `crashing` only if there were any
consecutive errors. So it could have run for 3 minutes, then the user deletes
the source database, and the job crashes and stops. Until it runs again the
state would have been shown as `pending`. For internal accounting purposes
that's correct, but it is confusing for the user because the last event in its
history is a crash.

This commit makes sure that if the last event in a job's history is a crash,
the user will see the job as `crashing` with the respective crash reason. The
scheduling algorithm didn't change.

Fixes apache#1276
nickva added a commit to cloudant/couchdb that referenced this issue Apr 12, 2018
nickva added a commit that referenced this issue Apr 12, 2018
Fixes #1276