Return "crashing" state from _scheduler/docs
immediately after the first crash
#1276
nickva
added a commit
to cloudant/couchdb
that referenced
this issue
Apr 10, 2018
Replication jobs are backed off based on the number of consecutive crashes, that is, we count the number of crashes in a row and then penalize jobs with an exponential wait based on that number. After a job runs without crashing for 2 minutes, we consider it healthy and stop going back in its history and looking for crashes. Previously a job's state was set to `crashing` only if there were any consecutive errors. So it could have run for 3 minutes, then the user deletes the source database, and the job crashes and stops. Until it runs again the state would have been shown as `pending`. For internal accounting purposes that's correct, but it is confusing for the user because the last event in its history is a crash. This commit makes sure that if the last event in a job's history is a crash, the user will see the job as `crashing` with the respective crash reason. The scheduling algorithm didn't change. Fixes apache#1276
nickva
added a commit
to cloudant/couchdb
that referenced
this issue
Apr 12, 2018
nickva
added a commit
that referenced
this issue
Apr 12, 2018
Replication jobs are backed off based on the number of consecutive crashes, that is, we count the number of crashes in a row and then penalize jobs with an exponential wait based on that number. After a job runs without crashing for 2 minutes, we consider it healthy and stop going back in its history and looking for crashes. Previously a job's state was set to `crashing` only if there were any consecutive errors. So it could have run for 3 minutes, then the user deletes the source database, and the job crashes and stops. Until it runs again the state would have been shown as `pending`. For internal accounting purposes that's correct, but it is confusing for the user because the last event in its history is a crash. This commit makes sure that if the last event in a job's history is a crash, the user will see the job as `crashing` with the respective crash reason. The scheduling algorithm didn't change. Fixes #1276
When a replication job that has been running for a while crashes with an error and is stopped, its status in the `_scheduler/docs` endpoint response should be `crashing` instead of `pending`.

The `crashing` vs `pending` status is driven by the "consecutive errors" count. This is computed as the number of crashes that occur in a row, for example a crash soon after a job start or a crash soon after another crash. If the crash happens long enough after a job start or a previous crash, the consecutive errors count is reset to 0 and the job is considered healthy.

So this sequence of steps is possible: a user starts a job, lets it run for 5 minutes, then deletes the source. The output of `_scheduler/docs` will show state = `pending` since the consecutive errors count is still 0, but the user would rather see the state as `crashing` in the result.

The fix is to return `crashing` if a crash is the last event in the job history, even if the errors count is still 0.
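
To make the difference concrete, here is a minimal Python sketch of the behavior described above. It is not the CouchDB implementation (which is written in Erlang); the `Event` type, the function names, and the history layout are hypothetical and chosen only for illustration. The 2 minute health threshold is taken from the commit message, and the sketch only models a job that is currently not running.

```python
from dataclasses import dataclass
from typing import List

# Health threshold from the commit message: 2 minutes without crashing.
HEALTH_THRESHOLD_SEC = 120


@dataclass
class Event:
    """Hypothetical history entry; not the actual CouchDB data structure."""
    kind: str         # "started" or "crashed"
    timestamp: float  # seconds


def consecutive_crashes(history: List[Event],
                        threshold: float = HEALTH_THRESHOLD_SEC) -> int:
    """Count crashes in a row, walking the history from newest to oldest.

    `history` is ordered oldest first. Counting stops at the first crash
    that happened more than `threshold` seconds after the event before it,
    because the job is considered healthy at that point.
    """
    count = 0
    for i in range(len(history) - 1, 0, -1):
        event, prev = history[i], history[i - 1]
        if event.kind != "crashed":
            continue
        if event.timestamp - prev.timestamp > threshold:
            break  # the job ran long enough to be healthy; stop looking back
        count += 1
    return count


def state_before_fix(history: List[Event]) -> str:
    """Old behavior: `crashing` only when there are consecutive errors."""
    return "crashing" if consecutive_crashes(history) > 0 else "pending"


def state_after_fix(history: List[Event]) -> str:
    """Proposed behavior: also `crashing` when the last event is a crash."""
    if history and history[-1].kind == "crashed":
        return "crashing"
    return state_before_fix(history)


# The scenario from the issue: the job runs for 5 minutes, then crashes.
long_run_then_crash = [Event("started", 0), Event("crashed", 300)]
print(state_before_fix(long_run_then_crash))
print(state_after_fix(long_run_then_crash))
```

In this sketch the consecutive crash count is computed the same way in both functions, which mirrors the point that the scheduling algorithm is unchanged and only the reported state differs.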