
Release candidate 123256 - 3 #2465

Closed
wants to merge 19 commits

Conversation

jiangphcn (Contributor)

```
jiangphs-MacBook-Pro:couchdb jiangph$ git cherry-pick efb374a
[release-candidate-123256-3 0388994] Improve replicator error reporting
Author: Nick Vatamaniuc [email protected]
Date: Mon Jan 13 12:29:49 2020 -0500
5 files changed, 329 insertions(+), 31 deletions(-)
create mode 100644 src/couch_replicator/test/eunit/couch_replicator_error_reporting_tests.erl
jiangphs-MacBook-Pro:couchdb jiangph$ git cherry-pick 0a20de6
[release-candidate-123256-3 f506ba2] Properly account for replication stats when splitting bulk docs batches
Author: Nick Vatamaniuc [email protected]
Date: Mon Jan 13 18:39:31 2020 -0500
1 file changed, 3 insertions(+), 2 deletions(-)
jiangphs-MacBook-Pro:couchdb jiangph$ git cherry-pick 3573dcc
[release-candidate-123256-3 6db8b57] Preserve replication job stats when jobs are re-created
Author: Nick Vatamaniuc [email protected]
Date: Mon Jan 13 18:21:58 2020 -0500
4 files changed, 185 insertions(+), 82 deletions(-)
jiangphs-MacBook-Pro:couchdb jiangph$ git cherry-pick 75e3acb
[release-candidate-123256-3 881e0e0] Fix fabric worker failures for partition requests
Author: Nick Vatamaniuc [email protected]
Date: Wed Jan 15 12:55:19 2020 -0500
9 files changed, 239 insertions(+), 85 deletions(-)
```

nickva and others added 19 commits October 24, 2019 11:41
…g jobs

When rescheduling jobs, make sure to stop existing jobs as needed to
make room for the pending jobs.
Previously `_scheduler/docs` returned detailed replication statistics for
completed jobs only. To get the same level of detail for running or pending
jobs, users had to use `_active_tasks`, which is not optimal and required jumping
between monitoring endpoints.

The `info` field was originally meant to hold these statistics, but they were
never implemented and it just returned `null` as a placeholder. With work for 3.0
finalizing, this is a good time to add this improvement to avoid
disturbing the API afterwards.

Just updating `_scheduler/docs` was not quite enough, since replications
started from the `_replicate` endpoint would not be visible there and users
would still have to access `_active_tasks` to inspect them, so let's add
the `info` field to `_scheduler/jobs` as well.

After this update, all states and status details from `_active_tasks` and
`_replicator` docs should be available under `_scheduler/jobs` and
`_scheduler/docs` endpoints.
Previously, if couch_replicator_doc_processor crashed, the job was marked as
"failed". We now ignore that case. It's safe to do so since the supervisor will
restart the process anyway, and it will rescan all the docs again. Most of all, we want
to prevent the job from becoming permanently failed and needing manual
intervention to restart it.
* Detect dreyfus/hastings correctly
Also remove the tests that verified background index building didn't
happen, because it does now.
Adds message handlers to mango / all_docs / mrview fabric
to receive an execution_stats message.
Fix missing mango execution stats (part 1)
Design doc writes could fail on the target when replicating with non-admin
credentials. Typically the replicator will skip over them and bump the
`doc_write_failures` counter. However, that relies on the POST request
returning a `200 OK` response. If the authentication scheme is implemented such
that the whole request fails when some docs don't have enough permission to be
written, then the replication job ends up crashing with an ugly exception and
gets stuck retrying forever. To accommodate that scenario, write `_design`
docs in their own separate requests, just like we write attachments.

Fixes: apache#2415
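The fix described above can be sketched as follows. This is an illustrative Python sketch, not the actual Erlang replicator code; `replicate_batch`, `post_bulk`, and `post_single` are hypothetical names standing in for the replicator's bulk-docs and single-doc write paths:

```python
def replicate_batch(docs, post_bulk, post_single):
    """Split design docs out of a bulk batch so that a per-document
    authorization failure cannot fail the whole _bulk_docs request."""
    design = [d for d in docs if d["_id"].startswith("_design/")]
    normal = [d for d in docs if not d["_id"].startswith("_design/")]
    failures = 0
    if normal:
        post_bulk(normal)           # regular docs still go in one bulk request
    for doc in design:              # design docs: one request each
        try:
            post_single(doc)
        except PermissionError:     # target rejected a non-admin ddoc write
            failures += 1           # bump doc_write_failures and carry on
    return failures
```

Because each design doc travels in its own request, a rejection affects only that doc and is counted, instead of crashing the whole batch.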
Previously many HTTP requests failed noisily with `function_clause` errors.
Expect some of those failures and handle them better. There are mainly 3 types
of improvements:

 1) Error messages are shorter. Instead of a `function_clause` error with
 cryptic internal fun names, return a simple marker like `bulk_docs_failed`

 2) Include the error body if it was returned. Besides the error code, HTTP
 failures may contain useful information in the body to help debug the failure.

 3) Do not log or include the stack trace in the message. The error names are
 enough to identify where they are generated, so avoid spamming the
 user and the logs with them. This is done by using `{shutdown, Error}` tuples
 to bubble the error up to the replication scheduler.

There is a small but related cleanup of removing the source and target monitors.
We would otherwise want to handle those errors better, but they are never
triggered since we removed local replication endpoints recently.

Fixes: apache#2413
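The three improvements combine into one shape for the error term. A minimal sketch in Python (the names `short_error` and the `"shutdown"` string tag are illustrative stand-ins for the Erlang `{shutdown, Error}` tuple, not actual replicator code):

```python
def short_error(op, status, body=None):
    """Collapse a noisy low-level failure into a short marker plus the
    HTTP response body (if any), wrapped in a shutdown tuple so the
    scheduler sees a clean failure instead of a stack trace."""
    error = (f"{op}_failed", status, body)   # e.g. ("bulk_docs_failed", 413, ...)
    return ("shutdown", error)               # scheduler unwraps, no stack trace logged
```

For example, a rejected `_bulk_docs` POST would surface as `("shutdown", ("bulk_docs_failed", 413, body))` rather than a `function_clause` crash.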
Previously, if a batch of bulk docs had to be bisected to fit a lower max
request size limit on the target, we only counted stats for the second batch.
So it was possible we missed some `doc_write_failures` updates, which
can be perceived as data loss by the customer.

So we use the handy-dandy `sum_stats/2` function to sum the stats returned
from both batches and return that.

Issue: apache#2414
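The behavior of `sum_stats/2` amounts to a per-counter sum over both halves of the bisected batch. A Python sketch of the idea (the real function is Erlang in couch_replicator; this just illustrates the accounting):

```python
def sum_stats(a, b):
    """Sum per-counter replication stats from the two halves of a
    bisected _bulk_docs batch, so failures from the first half are
    no longer dropped."""
    return {k: a.get(k, 0) + b.get(k, 0) for k in set(a) | set(b)}
```

With this, a `doc_write_failures` bump in the first half survives into the combined result instead of being overwritten by the second half's stats.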
Previously we made sure replication job statistics were preserved when
jobs were started and stopped by the scheduler. However, if a db
node restarted or a user re-created the job, replication stats would be
reset to 0.

Some statistics, like `docs_read` and `docs_written`, are perhaps not as
critical. However, `doc_write_failures` is: it is the indicator that
some replication docs have not replicated to the target. Not
preserving that statistic meant users could perceive data
loss during replication -- data was replicated successfully according
to the replication job with no write failures, the user deletes the source
database, then some time later notices some of their data is missing.

These statistics were already logged in the checkpoint history, so we
just had to initialize a stats object from them when a replication job
starts. In that initialization code we pick the highest value from
either the running scheduler or the checkpointed log. The reason is
that the running stats could be higher if, say, a job was stopped suddenly
and failed to checkpoint but the scheduler retained the data.

Fixes: apache#2414
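The initialization described above is a per-counter max over the two sources. Sketched in Python (the name `init_job_stats` is hypothetical; the actual code is Erlang in the replicator):

```python
def init_job_stats(checkpointed, running):
    """Initialize a restarted job's stats by taking, per counter, the
    higher of the checkpointed value and the scheduler's running value.
    The running copy can be ahead if the job died before it could
    checkpoint; the checkpointed copy survives node restarts."""
    keys = set(checkpointed) | set(running)
    return {k: max(checkpointed.get(k, 0), running.get(k, 0)) for k in keys}
```

Taking the max (rather than preferring one source) covers both failure modes: a node restart that loses the scheduler's copy, and a sudden stop that loses the last checkpoint.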
Previously, any failed node or rexi worker error caused requests to fail
immediately, even though there were still available workers to keep handling the
request. This was because the progress check function didn't account for the
fact that partition requests only use a handful of shards which, by design, do
not complete the full ring.

Here we fix both partition info queries and dreyfus search functionality. We
follow the pattern from fabric and pass through a set of "ring options" that
let the progress function know it is dealing with partitions instead of a full
ring.
jiangphcn closed this Jan 17, 2020

wohali commented Jan 17, 2020

???

jiangphcn deleted the release-candidate-123256-3 branch January 17, 2020 04:32
jiangphcn (Contributor, Author)

hi @wohali, I am in charge of building the Cloudant Release Candidate and need to cherry-pick some commits from apache/couchdb. This PR should have been opened against https://github.com/cloudant/couchdb, so I closed it. Such a PR in cloudant/couchdb will not be merged into apache/couchdb.
