Release candidate 123256 - 3 #2465

…g jobs When rescheduling jobs, make sure to stop as many existing jobs as needed to make room for the pending jobs.

Release candidate 122026

Previously `_scheduler/docs` returned detailed replication statistics for completed jobs only. To get the same level of detail for running or pending jobs, users had to use `_active_tasks`, which was not optimal and required jumping between monitoring endpoints. The `info` field was originally meant to hold these statistics, but they were never implemented and it just returned `null` as a placeholder. With work for 3.0 finalizing, this might be a good time to add this improvement to avoid disturbing the API afterwards. Just updating `_scheduler/docs` was not quite enough, since replications started from the `_replicate` endpoint would not be visible there and users would still have to access `_active_tasks` to inspect them, so let's add the `info` field to `_scheduler/jobs` as well. After this update, all states and status details from `_active_tasks` and `_replicator` docs should be available under the `_scheduler/jobs` and `_scheduler/docs` endpoints.
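As a rough illustration of what this enables on the client side, the sketch below reads the `info` field for each job out of a `_scheduler/jobs` response body. The sample payload, its values, and the helper name `job_stats` are assumptions made for this example; the exact set of fields in `info` depends on the CouchDB version.

```python
import json

# Hypothetical, trimmed-down `_scheduler/jobs` response body. The values
# are made up for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "total_rows": 2,
    "offset": 0,
    "jobs": [
        {"id": "job-a", "doc_id": "rep1",
         "info": {"docs_read": 10, "docs_written": 9, "doc_write_failures": 1}},
        # Older servers returned `null` here as a placeholder.
        {"id": "job-b", "doc_id": None, "info": None},
    ],
})


def job_stats(body):
    """Map each job id to its `info` dict, using {} when `info` is null."""
    jobs = json.loads(body).get("jobs", [])
    return {job["id"]: (job.get("info") or {}) for job in jobs}


stats = job_stats(SAMPLE_RESPONSE)
```

With a payload like this, a monitoring script could check `doc_write_failures` for running jobs without a second request to `_active_tasks`.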

Previously, if `couch_replicator_doc_processor` crashed, the job was marked as "failed". We now ignore that case. It's safe to do that since the supervisor will restart it anyway, and it will rescan all the docs again. Most of all, we want to prevent the job from becoming permanently failed and needing manual intervention to restart it.

Release candidate 122285

* Detect dreyfus/hastings correctly

Release candidate 122285-2

Also remove the tests that detect that background index building didn't happen, because it does now.

Release candidate 122285-3

Release candidate 122519

Adds message handlers to mango / all_docs / mrview fabric to receive an `execution_stats` message.

Fix missing mango execution stats (part 1)

Design doc writes could fail on the target when replicating with non-admin credentials. Typically the replicator will skip over them and bump the `doc_write_failures` counter. However, that relies on the POST request returning a `200 OK` response. If the authentication scheme is implemented such that the whole request fails when some docs don't have enough permission to be written, then the replication job ends up crashing with an ugly exception and gets stuck retrying forever. In order to accommodate that scenario, write `_design` docs in their own separate requests, just like we write attachments. Fixes: apache#2415
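The idea of isolating design docs from the rest of a batch can be sketched as follows. This is only an illustration in Python, not the replicator's Erlang code; the helper name `split_design_docs` and the document shapes are assumptions.

```python
def split_design_docs(docs):
    """Return (design_docs, other_docs), preserving the original order.

    A doc counts as a design doc when its `_id` starts with "_design/".
    """
    design, other = [], []
    for doc in docs:
        if doc.get("_id", "").startswith("_design/"):
            design.append(doc)
        else:
            other.append(doc)
    return design, other


batch = [
    {"_id": "a", "v": 1},
    {"_id": "_design/views", "views": {}},
    {"_id": "b", "v": 2},
]
design_docs, regular_docs = split_design_docs(batch)
```

The regular docs can then go out in a single bulk request, while each design doc is written in its own request, so a permission failure on one design doc only affects that doc and can be counted as a write failure.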

Previously many HTTP requests failed noisily with `function_clause` errors. Expect some of those failures and handle them better. There are mainly 3 types of improvements:
1) Error messages are shorter. Instead of `function_clause` with cryptic internal fun names, return a simple marker like `bulk_docs_failed`.
2) Include the error body if it was returned. Besides the error code, HTTP failures may contain useful information in the body to help debug the failure.
3) Do not log or include the stack trace in the message. The error names are enough to identify the place where they are generated, so avoid spamming the user and the logs with them. This is done by using `{shutdown, Error}` tuples to bubble up the error to the replication scheduler.
There is also a small but related cleanup: source and target monitors are removed. We would otherwise want to handle those errors better, but they are never triggered since we removed local replication endpoints recently. Fixes: apache#2413

Previously, if a batch of bulk docs had to be bisected in order to fit a lower max request size limit on the target, we only counted stats for the second batch. So it was possible to miss some `doc_write_failures` updates, which can be perceived as data loss by the customer. We now use the `sum_stats/2` function to sum the returned stats from both batches and return that. Issue: apache#2414
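The effect of summing both halves can be shown with a small Python analogue. It is a sketch of the idea only: the real `sum_stats/2` is an Erlang function, and the plain-dict representation of stats used here is an assumption.

```python
def sum_stats(first, second):
    """Add two stats dicts key by key, treating missing counters as 0."""
    keys = set(first) | set(second)
    return {key: first.get(key, 0) + second.get(key, 0) for key in keys}


# Stats returned by the two halves of a bisected batch (made-up numbers).
first_half = {"docs_written": 3, "doc_write_failures": 1}
second_half = {"docs_written": 2}

combined = sum_stats(first_half, second_half)
```

Returning only `second_half` here would silently drop the write failure recorded in `first_half`, which is the bug described above.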

Previously we made sure replication job statistics were preserved when the jobs were started and stopped by the scheduler. However, if a db node restarted or the user re-created the job, replication stats would be reset to 0. Some statistics like `docs_read` and `docs_written` are perhaps not as critical. However, `doc_write_failures` is: it is the indicator that some docs have not replicated to the target. Not preserving that statistic meant users could perceive a data loss during replication: data was replicated successfully according to the replication job with no write failures, the user deletes the source database, then some time later notices some of their data is missing. These statistics were already logged in the checkpoint history, and we just had to initialize a stats object from them when a replication job starts. In that initialization code we pick the highest values from either the running scheduler or the checkpointed log. The reason is that the running stats could be higher if, say, the job was stopped suddenly and failed to checkpoint but the scheduler retained the data. Fixes: apache#2414
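Picking the highest value per counter from the two sources might look like the sketch below. The function name `init_stats` and the dict-based stats are hypothetical; the actual initialization code is Erlang and works on the replicator's own stats object.

```python
def init_stats(scheduler_stats, checkpoint_stats):
    """Take the per-counter maximum of two stats dicts (missing means 0)."""
    keys = set(scheduler_stats) | set(checkpoint_stats)
    return {
        key: max(scheduler_stats.get(key, 0), checkpoint_stats.get(key, 0))
        for key in keys
    }


# The job stopped before it could checkpoint, so the scheduler's copy is
# ahead for some counters (made-up numbers).
from_scheduler = {"docs_read": 120, "docs_written": 118, "doc_write_failures": 2}
from_checkpoint = {"docs_read": 100, "docs_written": 100, "doc_write_failures": 0}

initial = init_stats(from_scheduler, from_checkpoint)
```

Using the maximum rather than either source alone means a restart never makes `doc_write_failures` go backwards.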

Previously any failed node or rexi worker error resulted in requests failing immediately even though there were available workers to keep handling the request. This was because the progress check function didn't account for the fact that partition requests only use a handful of shards which, by design, do not complete the full ring. Here we fix both partition info queries and dreyfus search functionality. We follow the pattern from fabric and pass through a set of "ring options" that let the progress function know it is dealing with partitions instead of a full ring.
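To make the distinction concrete, here is a simplified Python sketch of a progress check that accepts an optional "any of these shards" option. It is not fabric's implementation: the names `covers_ring`, `is_progress_possible`, and `any_of`, the inclusive `(begin, end)` range tuples, and the treatment of overlapping ranges as valid coverage are all assumptions made for the example.

```python
def covers_ring(ranges, ring_end):
    """True if the inclusive (begin, end) ranges cover 0..ring_end with no gap."""
    next_begin = 0
    for begin, end in sorted(set(ranges)):
        if begin > next_begin:
            return False  # gap before this range, the ring cannot be completed
        next_begin = max(next_begin, end + 1)
    return next_begin > ring_end


def is_progress_possible(live_ranges, ring_end, any_of=None):
    """Decide whether the remaining workers can still answer the request.

    With `any_of` set (the partitioned case), one live worker on any of
    the listed shard ranges is enough. Otherwise the live workers must
    cover the full ring.
    """
    if any_of is not None:
        return any(rng in any_of for rng in live_ranges)
    return covers_ring(live_ranges, ring_end)


# One worker left, on the first half of an 8-slot ring.
live = [(0, 3)]
full_ring_ok = is_progress_possible(live, ring_end=7)
partition_ok = is_progress_possible(live, ring_end=7, any_of={(0, 3)})
```

In this sketch the same surviving worker is insufficient for a full-ring request but sufficient for a partition request, which is the case the old check rejected.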


Commits on Oct 24, 2019

Commits on Nov 8, 2019

Commits on Nov 9, 2019

Commits on Nov 10, 2019

Commits on Nov 11, 2019

Commits on Nov 22, 2019

Commits on Jan 9, 2020

Commits on Jan 10, 2020

Commits on Jan 17, 2020