Return "crashing" state from `_scheduler/docs` immediately after the first crash #1276

nickva · 2018-04-09T22:16:11Z

When a replication job that has been running for a while crashes with an error and is stopped, its status in the _scheduler/docs endpoint response should be crashing instead of pending.

The crashing vs pending status is driven by the "consecutive errors" count. This is computed as the number of crashes that occur in a row, for example a crash soon after a job start or a crash soon after another crash. If a crash happens long enough after a job start or a previous crash, the consecutive errors count is reset to 0 and the job is considered healthy.
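To illustrate the counting rule, here is a rough sketch in Python. It is not the actual CouchDB code. The event tuples, the function name `consecutive_errors`, and the `healthy_after` parameter are made up for illustration. The 120 second default is taken from the "2 minutes" in the commit message for this issue.

```python
from typing import List, Tuple

# Hypothetical event shape: (timestamp in seconds, "started" or "crashed").
Event = Tuple[float, str]


def consecutive_errors(history: List[Event], healthy_after: float = 120.0) -> int:
    """Count crashes in a row, walking the history from newest to oldest.

    `history` is in chronological order. A crash only counts if it came
    sooner than `healthy_after` seconds after the event before it. The
    first crash that came later than that ends the walk, because the job
    is considered healthy before that point.
    """
    count = 0
    for i in range(len(history) - 1, 0, -1):
        ts, kind = history[i]
        prev_ts, _ = history[i - 1]
        if kind != "crashed":
            continue
        if ts - prev_ts >= healthy_after:
            break  # ran long enough before crashing: stop looking further back
        count += 1
    return count


quick_crashes = [(0, "started"), (10, "crashed"), (20, "started"), (25, "crashed")]
late_crash = [(0, "started"), (300, "crashed")]
```

With these inputs, `consecutive_errors(quick_crashes)` counts both crashes, while `consecutive_errors(late_crash)` stays at 0, which is the case this issue is about.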

So this sequence of steps is possible. A user starts a job, lets it run for 5 minutes then deletes the source. The output of _scheduler/docs will show state = pending since consecutive crashes is still 0, but the user would rather see state as crashing in the result.

The fix is to return crashing if a crash is the last event in the job history, even if the errors count is still 0.
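A rough Python sketch of the proposed rule for a job that is not currently running, next to the old behaviour. The function names and the event tuples are hypothetical, and the real implementation may look quite different.

```python
from typing import List, Tuple

# Hypothetical event shape: (timestamp in seconds, "started" or "crashed").
Event = Tuple[float, str]


def old_docs_state(error_count: int) -> str:
    """Old behaviour: only the consecutive errors count decides the state."""
    return "crashing" if error_count > 0 else "pending"


def new_docs_state(history: List[Event], error_count: int) -> str:
    """Proposed behaviour: also report crashing when the last event is a crash."""
    last_is_crash = bool(history) and history[-1][1] == "crashed"
    if error_count > 0 or last_is_crash:
        return "crashing"
    return "pending"


# Job ran for 5 minutes, then the source was deleted and it crashed.
history = [(0, "started"), (300, "crashed")]
```

For that history the errors count is still 0, so `old_docs_state(0)` gives `pending` while `new_docs_state(history, 0)` gives `crashing`.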


Replication jobs are backed off based on the number of consecutive crashes, that is, we count the number of crashes in a row and then penalize jobs with an exponential wait based on that number. After a job runs without crashing for 2 minutes, we consider it healthy and stop going back in its history and looking for crashes. Previously a job's state was set to `crashing` only if there were any consecutive errors. So it could have run for 3 minutes, then the user deletes the source database, and the job crashes and stops. Until it runs again the state would have been shown as `pending`. For internal accounting purposes that's correct, but it is confusing for the user because the last event in its history is a crash. This commit makes sure that if the last event in a job's history is a crash, the user will see the job as `crashing` with the respective crash reason. The scheduling algorithm didn't change. Fixes apache#1276
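For readers unfamiliar with the backoff mentioned above, a minimal Python sketch of an exponential wait based on the consecutive crash count. The `base` and `cap` values and the function name are assumptions for illustration only, not the values or code the scheduler uses.

```python
def backoff_seconds(error_count: int, base: float = 30.0, cap: float = 8 * 3600.0) -> float:
    """Wait time that doubles with each consecutive crash, up to a cap.

    `base` and `cap` are made-up numbers, chosen only to show the shape
    of the penalty. A job with no consecutive crashes is not penalized.
    """
    if error_count <= 0:
        return 0.0
    return min(base * 2 ** (error_count - 1), cap)
```

Since the commit leaves the scheduling algorithm unchanged, a job with an errors count of 0 can now show as `crashing` while still getting no penalty of this kind.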

nickva added replication bug labels Apr 9, 2018

nickva self-assigned this Apr 9, 2018

nickva mentioned this issue Apr 10, 2018

In _scheduler/docs fix crashing state showing as pending sometimes #1277

Merged

3 tasks

nickva closed this as completed in #1277 Apr 12, 2018
