
Release candidate 123256 - 3 #2465

Closed
wants to merge 19 commits

Conversation

jiangphcn (Contributor)

```
jiangphs-MacBook-Pro:couchdb jiangph$ git cherry-pick efb374a
[release-candidate-123256-3 0388994] Improve replicator error reporting
Author: Nick Vatamaniuc [email protected]
Date: Mon Jan 13 12:29:49 2020 -0500
5 files changed, 329 insertions(+), 31 deletions(-)
create mode 100644 src/couch_replicator/test/eunit/couch_replicator_error_reporting_tests.erl
jiangphs-MacBook-Pro:couchdb jiangph$ git cherry-pick 0a20de6
[release-candidate-123256-3 f506ba2] Properly account for replication stats when splitting bulk docs batches
Author: Nick Vatamaniuc [email protected]
Date: Mon Jan 13 18:39:31 2020 -0500
1 file changed, 3 insertions(+), 2 deletions(-)
jiangphs-MacBook-Pro:couchdb jiangph$ git cherry-pick 3573dcc
[release-candidate-123256-3 6db8b57] Preserve replication job stats when jobs are re-created
Author: Nick Vatamaniuc [email protected]
Date: Mon Jan 13 18:21:58 2020 -0500
4 files changed, 185 insertions(+), 82 deletions(-)
jiangphs-MacBook-Pro:couchdb jiangph$ git cherry-pick 75e3acb
[release-candidate-123256-3 881e0e0] Fix fabric worker failures for partition requests
Author: Nick Vatamaniuc [email protected]
Date: Wed Jan 15 12:55:19 2020 -0500
9 files changed, 239 insertions(+), 85 deletions(-)
```

nickva and others added 19 commits October 24, 2019 11:41
…g jobs

When rescheduling jobs, make sure to stop existing jobs as needed to
make room for the pending jobs.
Previously `_scheduler/docs` returned detailed replication statistics for
completed jobs only. To get the same level of detail for running or pending
jobs, users had to use `_active_tasks`, which is not optimal and required jumping
between monitoring endpoints.

The `info` field was originally meant to hold these statistics, but they were
never implemented and it just returned `null` as a placeholder. With work for 3.0
finalizing, this is a good time to add this improvement to avoid
disturbing the API afterwards.

Just updating `_scheduler/docs` was not quite enough, since replications
started from the `_replicate` endpoint would not be visible there and users
would still have to access `_active_tasks` to inspect them, so let's add
the `info` field to `_scheduler/jobs` as well.

After this update, all states and status details from `_active_tasks` and
`_replicator` docs should be available under `_scheduler/jobs` and
`_scheduler/docs` endpoints.
Previously, if couch_replicator_doc_processor crashed, the job was marked as
"failed". We now ignore that case. It's safe to do so since the supervisor will
restart the process anyway, and it will rescan all the docs again. Most of all, we want
to prevent the job from becoming permanently failed and needing manual
intervention to restart it.
* Detect dreyfus/hastings correctly
Also remove the tests that verified background index building didn't
happen, because it does now.
Adds message handlers to mango / all_docs / mrview fabric
to receive an execution_stats message.
Fix missing mango execution stats (part 1)
Design doc writes could fail on the target when replicating with non-admin
credentials. Typically the replicator will skip over them and bump the
`doc_write_failures` counter. However, that relies on the POST request
returning a `200 OK` response. If the authentication scheme is implemented such
that the whole request fails when some docs don't have enough permission to be
written, then the replication job ends up crashing with an ugly exception and
gets stuck retrying forever. To accommodate that scenario, write `_design`
docs in their own separate requests, just like we write attachments.

Fixes: apache#2415
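The fix described above can be sketched as follows. This is an illustrative Python sketch, not the actual Erlang replicator code; `replicate_batch`, `post_bulk`, and `post_single` are hypothetical names standing in for the replicator's bulk-docs and single-doc write paths:

```python
def replicate_batch(docs, post_bulk, post_single):
    """Split design docs out of a bulk batch so that a per-document
    authorization failure cannot fail the whole _bulk_docs request."""
    design = [d for d in docs if d["_id"].startswith("_design/")]
    normal = [d for d in docs if not d["_id"].startswith("_design/")]
    failures = 0
    if normal:
        post_bulk(normal)           # regular docs still go in one bulk request
    for doc in design:              # design docs: one request each
        try:
            post_single(doc)
        except PermissionError:     # target rejected a non-admin ddoc write
            failures += 1           # bump doc_write_failures and carry on
    return failures
```

Because each design doc travels in its own request, a rejection affects only that doc and is counted, instead of crashing the whole batch.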
Previously many HTTP requests failed noisily with `function_clause` errors.
Expect some of those failures and handle them better. There are mainly 3 types
of improvements:

 1) Error messages are shorter. Instead of a `function_clause` error with
 cryptic internal fun names, return a simple marker like `bulk_docs_failed`

 2) Include the error body if it was returned. Besides the error code, HTTP
 failures may contain useful information in the body to help debug the failure.

 3) Do not log or include the stack trace in the message. The error names are
 enough to identify where they are generated, so avoid spamming the
 user and the logs with them. This is done by using `{shutdown, Error}` tuples
 to bubble the error up to the replication scheduler.

There is a small but related cleanup of removing the source and target monitors.
We would otherwise want to handle those errors better, but they are never
triggered since we removed local replication endpoints recently.

Fixes: apache#2413
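The three improvements combine into one shape for the error term. A minimal sketch in Python (the names `short_error` and the `"shutdown"` string tag are illustrative stand-ins for the Erlang `{shutdown, Error}` tuple, not actual replicator code):

```python
def short_error(op, status, body=None):
    """Collapse a noisy low-level failure into a short marker plus the
    HTTP response body (if any), wrapped in a shutdown tuple so the
    scheduler sees a clean failure instead of a stack trace."""
    error = (f"{op}_failed", status, body)   # e.g. ("bulk_docs_failed", 413, ...)
    return ("shutdown", error)               # scheduler unwraps, no stack trace logged
```

For example, a rejected `_bulk_docs` POST would surface as `("shutdown", ("bulk_docs_failed", 413, body))` rather than a `function_clause` crash.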
Previously, if a batch of bulk docs had to be bisected to fit a lower max
request size limit on the target, we only counted stats for the second batch.
So it was possible we missed some `doc_write_failures` updates, which
can be perceived as data loss by the customer.

So we use the handy-dandy `sum_stats/2` function to sum the stats returned
from both batches and return that.

Issue: apache#2414
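The behavior of `sum_stats/2` amounts to a per-counter sum over both halves of the bisected batch. A Python sketch of the idea (the real function is Erlang in couch_replicator; this just illustrates the accounting):

```python
def sum_stats(a, b):
    """Sum per-counter replication stats from the two halves of a
    bisected _bulk_docs batch, so failures from the first half are
    no longer dropped."""
    return {k: a.get(k, 0) + b.get(k, 0) for k in set(a) | set(b)}
```

With this, a `doc_write_failures` bump in the first half survives into the combined result instead of being overwritten by the second half's stats.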
Previously we made sure replication job statistics were preserved when
jobs were started and stopped by the scheduler. However, if a db
node restarted or a user re-created the job, replication stats would be
reset to 0.

Some statistics, like `docs_read` and `docs_written`, are perhaps not as
critical. However, `doc_write_failures` is: it is the indicator that
some replication docs have not replicated to the target. Not
preserving that statistic meant users could perceive data
loss during replication -- data was replicated successfully according
to the replication job with no write failures, the user deletes the source
database, then some time later notices some of their data is missing.

These statistics were already logged in the checkpoint history, so we
just had to initialize a stats object from them when a replication job
starts. In that initialization code we pick the highest value from
either the running scheduler or the checkpointed log. The reason is
that the running stats could be higher if, say, a job was stopped suddenly
and failed to checkpoint but the scheduler retained the data.

Fixes: apache#2414
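The initialization described above is a per-counter max over the two sources. Sketched in Python (the name `init_job_stats` is hypothetical; the actual code is Erlang in the replicator):

```python
def init_job_stats(checkpointed, running):
    """Initialize a restarted job's stats by taking, per counter, the
    higher of the checkpointed value and the scheduler's running value.
    The running copy can be ahead if the job died before it could
    checkpoint; the checkpointed copy survives node restarts."""
    keys = set(checkpointed) | set(running)
    return {k: max(checkpointed.get(k, 0), running.get(k, 0)) for k in keys}
```

Taking the max (rather than preferring one source) covers both failure modes: a node restart that loses the scheduler's copy, and a sudden stop that loses the last checkpoint.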
Previously, any failed node or rexi worker error caused requests to fail
immediately, even though there were still available workers to keep handling the
request. This was because the progress check function didn't account for the
fact that partition requests only use a handful of shards which, by design, do
not complete the full ring.

Here we fix both partition info queries and dreyfus search functionality. We
follow the pattern from fabric and pass through a set of "ring options" that
let the progress function know it is dealing with partitions instead of a full
ring.
jiangphcn closed this Jan 17, 2020

wohali commented Jan 17, 2020

???

jiangphcn deleted the release-candidate-123256-3 branch January 17, 2020 04:32
jiangphcn (Contributor, Author)

hi @wohali, I am in charge of building the Cloudant Release Candidate and need to cherry-pick some commits from apache/couchdb. This PR should have been opened against https://github.com/cloudant/couchdb, so I closed it. Such a PR in cloudant/couchdb will not be merged into apache/couchdb.
