Release candidate 123256 - 3 #2465
Closed
…g jobs When rescheduling jobs, make sure to stop as many existing jobs as needed to make room for the pending jobs.
Release candidate 122026
Previously `_scheduler/docs` returned detailed replication statistics for completed jobs only. To get the same level of detail for running or pending jobs, users had to use `_active_tasks`, which is not optimal and required jumping between monitoring endpoints. The `info` field was originally meant to hold these statistics, but they were never implemented and it just returned `null` as a placeholder. With work for 3.0 finalizing, this might be a good time to add this improvement to avoid disturbing the API afterwards. Just updating `_scheduler/docs` was not quite enough, since replications started from the `_replicate` endpoint would not be visible there and users would still have to access `_active_tasks` to inspect them, so let's add the `info` field to `_scheduler/jobs` as well. After this update, all states and status details from `_active_tasks` and `_replicator` docs should be available under the `_scheduler/jobs` and `_scheduler/docs` endpoints.
Previously if couch_replicator_doc_processor crashed, the job was marked as "failed". We now ignore that case. It's safe to do that since the supervisor will restart it anyway, and it will rescan all the docs again. Most of all, we want to prevent the job from becoming failed permanently and needing manual intervention to restart it.
Release candidate 122285
* Detect dreyfus/hastings correctly
Release candidate 122285-2
Also remove the tests that detect that background index building didn't happen, because it does now.
Release candidate 122285-3
Release candidate 122519
Adds message handlers to mango / all_docs / mrview fabric to receive an execution_stats message.
Fix missing mango execution stats (part 1)
Design doc writes could fail on the target when replicating with non-admin credentials. Typically the replicator will skip over them and bump the `doc_write_failures` counter. However, that relies on the POST request returning a `200 OK` response. If the authentication scheme is implemented such that the whole request fails if some docs don't have enough permission to be written, then the replication job ends up crashing with an ugly exception and gets stuck retrying forever. In order to accommodate that scenario, write _design docs in their own separate requests, just like we write attachments. Fixes: apache#2415
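The batching change described above can be illustrated with a minimal sketch. It is not the replicator's code (which is Erlang); the function names `split_design_docs` and `plan_requests` are hypothetical, and the only assumption taken from CouchDB itself is that design document IDs start with `_design/`.

```python
from typing import Dict, List, Tuple

Doc = Dict[str, object]


def split_design_docs(docs: List[Doc]) -> Tuple[List[Doc], List[Doc]]:
    """Partition a batch into (regular docs, design docs) by the _id prefix."""
    regular: List[Doc] = []
    design: List[Doc] = []
    for doc in docs:
        doc_id = str(doc.get("_id", ""))
        if doc_id.startswith("_design/"):
            design.append(doc)
        else:
            regular.append(doc)
    return regular, design


def plan_requests(docs: List[Doc]) -> List[List[Doc]]:
    """Hypothetical planner: one bulk request for regular docs,
    then one request per design doc, so that a rejected design doc
    cannot fail the write of the regular docs."""
    regular, design = split_design_docs(docs)
    requests: List[List[Doc]] = []
    if regular:
        requests.append(regular)
    requests.extend([ddoc] for ddoc in design)
    return requests
```

With this split, a permission failure on a design doc only affects its own request and can be counted as a single write failure.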
Previously many HTTP requests failed noisily with `function_clause` errors. Expect some of those failures and handle them better. There are mainly 3 types of improvements: 1) Error messages are shorter. Instead of `function_clause` with cryptic internal fun names, return a simple marker like `bulk_docs_failed`. 2) Include the error body if it was returned. Besides the error code, HTTP failures may contain useful information in the body to help debug the failure. 3) Do not log or include the stack trace in the message. The error names are enough to identify the place where they are generated, so avoid spamming the user and the logs with them. This is done by using `{shutdown, Error}` tuples to bubble up the error to the replication scheduler. There is also a small related cleanup: the source and target monitors are removed. We would want to handle those errors better, but they are never triggered since local replication endpoints were removed recently. Fixes: apache#2413
Previously, if a batch of bulk docs had to be bisected in order to fit a lower max request size limit on the target, we only counted stats for the second batch. So it was possible to miss some `doc_write_failures` updates, which can be perceived as data loss by the customer. So we use the handy-dandy `sum_stats/2` function to sum the returned stats from both batches and return that. Issue: apache#2414
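A rough sketch of the fix, in Python rather than the replicator's Erlang. Here `sum_stats` only mirrors the idea of `sum_stats/2` and is not its actual implementation; `write_bisected`, the `write` callback, and the use of a document count (`max_docs`) in place of a request size limit are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List

Stats = Dict[str, int]


def sum_stats(a: Stats, b: Stats) -> Stats:
    """Add two stats maps key by key (illustrative stand-in for sum_stats/2)."""
    total = Counter(a)
    total.update(b)  # Counter.update adds counts instead of replacing them
    return dict(total)


def write_bisected(docs: List[str],
                   write: Callable[[List[str]], Stats],
                   max_docs: int) -> Stats:
    """Write docs, bisecting the batch while it is too large.

    The stats of BOTH halves are summed; returning only the second
    half's stats is the bug described in the commit message.
    """
    if max_docs < 1:
        raise ValueError("max_docs must be at least 1")
    if len(docs) <= max_docs:
        return write(docs)
    mid = len(docs) // 2
    first = write_bisected(docs[:mid], write, max_docs)
    second = write_bisected(docs[mid:], write, max_docs)
    return sum_stats(first, second)
```

Any failure counted in the first half now survives into the returned totals, no matter how many times the batch is split.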
Previously we made sure replication job statistics were preserved when the jobs were started and stopped by the scheduler. However, if a db node restarted or a user re-created the job, replication stats would be reset to 0. Some statistics like `docs_read` and `docs_written` are perhaps not as critical. However, `doc_write_failures` is. That is the indicator that some docs have not replicated to the target. Not preserving that statistic meant users could perceive data loss during replication: data was replicated successfully according to the replication job, with no write failures, the user deletes the source database, then some time later notices some of their data is missing. These statistics were already logged in the checkpoint history and we just had to initialize a stats object from them when a replication job starts. In that initialization code we pick the highest values from either the running scheduler or the checkpointed log. The reason is that the running stats could be higher if, say, the job was stopped suddenly and failed to checkpoint but the scheduler retained the data. Fixes: apache#2414
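The "pick the highest values" rule can be sketched as below. This is an illustration in Python, not the replicator's Erlang code; the function name `init_stats` and the plain-dict representation of the stats are assumptions.

```python
from typing import Dict

Stats = Dict[str, int]


def init_stats(running: Stats, checkpointed: Stats) -> Stats:
    """Build the starting stats for a job by taking, per counter,
    the larger of the scheduler's in-memory value and the value
    recorded in the checkpoint history. Missing counters count as 0."""
    keys = set(running) | set(checkpointed)
    return {
        key: max(running.get(key, 0), checkpointed.get(key, 0))
        for key in sorted(keys)
    }
```

For example, if the scheduler still holds `doc_write_failures = 3` from a job that stopped before checkpointing, while the checkpoint history recorded only 2, the restarted job begins at 3.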
Previously any failed node or rexi worker error resulted in requests failing immediately even though there were available workers to keep handling the request. This was because the progress check function didn't account for the fact that partition requests only use a handful of shards which, by design, do not complete the full ring. Here we fix both partition info queries and dreyfus search functionality. We follow the pattern from fabric and pass through a set of "ring options" that let the progress function know it is dealing with partitions instead of a full ring.
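To make the distinction concrete, here is a simplified sketch of a progress check that behaves differently for full-ring and partition requests. It is not fabric's implementation: the function names, the `partition_ranges` argument standing in for "ring options", and the 32-bit ring size are assumptions for illustration, and the full-ring check is reduced to simple coverage of the range space.

```python
from typing import Iterable, Optional, Set, Tuple

Range = Tuple[int, int]  # inclusive (begin, end) shard range
RING_END = 2 ** 32       # assumed size of the hash ring


def covers_full_ring(ranges: Iterable[Range]) -> bool:
    """True if the ranges together cover 0 .. RING_END - 1 with no gap."""
    pos = 0
    for begin, end in sorted(ranges):
        if begin > pos:
            return False  # gap: no live worker covers position `pos`
        pos = max(pos, end + 1)
    return pos >= RING_END


def is_progress_possible(live_ranges: Iterable[Range],
                         partition_ranges: Optional[Set[Range]] = None) -> bool:
    """Decide whether the request can still complete.

    Without partition_ranges the live workers must cover the full ring.
    With partition_ranges (the hypothetical "ring options"), one live
    worker on any shard range holding the partition is enough.
    """
    live = list(live_ranges)
    if partition_ranges is not None:
        return any(r in partition_ranges for r in live)
    return covers_full_ring(live)
```

In this sketch, a partition request that lost one of its workers keeps going as long as another copy of the same shard range is still alive, whereas the full-ring rule would have reported failure immediately.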
???
hi @wohali I am in charge of building the Cloudant Release Candidate and need to cherry-pick some commits from apache/couchdb. This PR should have been opened against https://github.com/cloudant/couchdb, so I am closing it. Such PRs in cloudant/couchdb will not be merged to apache/couchdb.
jiangphs-MacBook-Pro:couchdb jiangph$ git cherry-pick efb374a
[release-candidate-123256-3 0388994] Improve replicator error reporting
Author: Nick Vatamaniuc [email protected]
Date: Mon Jan 13 12:29:49 2020 -0500
5 files changed, 329 insertions(+), 31 deletions(-)
create mode 100644 src/couch_replicator/test/eunit/couch_replicator_error_reporting_tests.erl
jiangphs-MacBook-Pro:couchdb jiangph$ git cherry-pick 0a20de6
[release-candidate-123256-3 f506ba2] Properly account for replication stats when splitting bulk docs batches
Author: Nick Vatamaniuc [email protected]
Date: Mon Jan 13 18:39:31 2020 -0500
1 file changed, 3 insertions(+), 2 deletions(-)
jiangphs-MacBook-Pro:couchdb jiangph$ git cherry-pick 3573dcc
[release-candidate-123256-3 6db8b57] Preserve replication job stats when jobs are re-created
Author: Nick Vatamaniuc [email protected]
Date: Mon Jan 13 18:21:58 2020 -0500
4 files changed, 185 insertions(+), 82 deletions(-)
jiangphs-MacBook-Pro:couchdb jiangph$ git cherry-pick 75e3acb
[release-candidate-123256-3 881e0e0] Fix fabric worker failures for partition requests
Author: Nick Vatamaniuc [email protected]
Date: Wed Jan 15 12:55:19 2020 -0500
9 files changed, 239 insertions(+), 85 deletions(-)