Release candidate 123256 - 3 #2465

Closed
wants to merge 19 commits into from

Commits on Oct 24, 2019

  1. Avoid churning replication jobs if there is enough room to run pending jobs
    
    When rescheduling jobs, make sure to stop only as many existing jobs as needed
    to make room for the pending jobs.
    nickva authored and jiangphcn committed Oct 24, 2019
    1c2646a
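
The rescheduling rule described in the commit above can be illustrated with a minimal sketch. It assumes a fixed job limit and ignores any other scheduler constraints; `jobs_to_stop`, `running`, `pending`, and `max_jobs` are hypothetical names for illustration, not the scheduler's actual code.

```python
def jobs_to_stop(running: int, pending: int, max_jobs: int) -> int:
    """Return how many running jobs must be stopped to fit the pending ones.

    Hypothetical helper: the names and the single `max_jobs` limit are
    assumptions made for this sketch.
    """
    # Slots that are already free need no churn at all.
    free_slots = max(max_jobs - running, 0)
    # Only pending jobs that do not fit into the free slots require a stop.
    needed = max(pending - free_slots, 0)
    # Never stop more jobs than are actually running.
    return min(needed, running)
```

With 8 running jobs, 5 pending jobs, and a limit of 10, only 3 jobs would be stopped instead of 5, since 2 pending jobs fit into the free slots.
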
  2. Merge pull request #23 from cloudant/release-candidate-122026

    Release candidate 122026
    jiangphcn committed Oct 24, 2019
    ab14f20

Commits on Nov 8, 2019

  1. Return detailed replication stats for running and pending jobs

    Previously `_scheduler/docs` returned detailed replication statistics for
    completed jobs only. To get the same level of detail for running or pending
    jobs, users had to use `_active_tasks`, which is not optimal and required jumping
    between monitoring endpoints.
    
    The `info` field was originally meant to hold these statistics, but they were
    never implemented and it just returned `null` as a placeholder. With work for 3.0
    being finalized, this might be a good time to add this improvement to avoid
    disturbing the API afterwards.
    
    Just updating `_scheduler/docs` was not quite enough, since replications
    started from the `_replicate` endpoint would not be visible there and users
    would still have to access `_active_tasks` to inspect them, so let's add
    the `info` field to `_scheduler/jobs` as well.
    
    After this update, all states and status details from `_active_tasks` and
    `_replicator` docs should be available under `_scheduler/jobs` and
    `_scheduler/docs` endpoints.
    nickva authored and jiangphcn committed Nov 8, 2019
    eaa2447
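
As an illustration of how a client might consume the `info` field described in the commit above, here is a minimal sketch. The response envelope (`jobs`, `id`) and the sample values are assumptions made for this example; only the `info` field and the statistic names come from the commit messages on this page. The sketch tolerates `info` being `null`, as it was before this change.

```python
import json

# Hypothetical sample of a `_scheduler/jobs` response, for illustration only.
SAMPLE = json.loads("""
{"total_rows": 2, "offset": 0, "jobs": [
  {"id": "job-a",
   "info": {"docs_read": 10, "docs_written": 9, "doc_write_failures": 1}},
  {"id": "job-b", "info": null}
]}
""")


def write_failures(response: dict) -> dict:
    """Map each job id to its `doc_write_failures` count (0 if unknown)."""
    result = {}
    for job in response.get("jobs", []):
        # `info` may be null when no statistics are available.
        info = job.get("info") or {}
        result[job["id"]] = info.get("doc_write_failures", 0)
    return result


print(write_failures(SAMPLE))
```
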
  2. close LRU by database path

    jiangphcn committed Nov 8, 2019
    bbabde2
  3. Do not mark replication jobs as failed if doc processor crashes

    Previously if couch_replicator_doc_processor crashed, the job was marked as
    "failed". We now ignore that case. It's safe to do that since supervisor will
    restart it anyway, and it will rescan all the docs again. Most of all, we want
    to prevent the job becoming failed permanently and needing a manual
    intervention to restart it.
    nickva authored and jiangphcn committed Nov 8, 2019
    5e3c208
  4. Merge pull request #24 from cloudant/release-candidate-122285

    Release candidate 122285
    jiangphcn committed Nov 8, 2019
    08f3008

Commits on Nov 9, 2019

  1. update ken to 1.0.6

    * Detect dreyfus/hastings correctly
    rnewson authored and jiangphcn committed Nov 9, 2019
    7a22691

Commits on Nov 10, 2019

  1. Merge pull request #25 from cloudant/release-candidate-122285-2

    Release candidate 122285-2
    jiangphcn committed Nov 10, 2019
    4c87c30

Commits on Nov 11, 2019

  1. export get_servers_from_env/1 for ken

    Also remove the tests to detect that background index building didn't
    happen, because it does now.
    rnewson authored and jiangphcn committed Nov 11, 2019
    e8c2992
  2. Merge pull request #26 from cloudant/release-candidate-122285-3

    Release candidate 122285-3
    jiangphcn committed Nov 11, 2019
    3e63d84

Commits on Nov 22, 2019

  1. 9864868
  2. Merge pull request #27 from cloudant/release-candidate-122519

    Release candidate 122519
    jiangphcn committed Nov 22, 2019
    6dc33db

Commits on Jan 9, 2020

  1. Fix missing mango execution stats (part 1)

    Adds message handlers to mango / all_docs / mrview fabric
    to receive an execution_stats message.
    willholley authored and jiangphcn committed Jan 9, 2020
    2ccfa79
  2. Merge pull request #28 from cloudant/release-candidate-123256

    Fix missing mango execution stats (part 1)
    jiangphcn committed Jan 9, 2020
    491b913

Commits on Jan 10, 2020

  1. Use separate requests to write design when replicating

    Design doc writes could fail on the target when replicating with non-admin
    credentials. Typically the replicator will skip over them and bump the
    `doc_write_failures` counter. However, that relies on the POST request
    returning a `200 OK` response. If the authentication scheme is implemented such
    that the whole request fails if some docs don't have enough permission to be
    written, then the replication job ends up crashing with an ugly exception and
    gets stuck retrying forever. To accommodate that scenario, write _design
    docs in their own separate requests, just like we write attachments.
    
    Fixes: apache#2415
    nickva authored and jiangphcn committed Jan 10, 2020
    c97e88e
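
The idea in the commit above, sending design docs separately so that a rejected design doc cannot fail a whole batch, might be sketched as follows. This is not the replicator's code: `split_design_docs` and `plan_requests` are hypothetical helpers, and the `("bulk", ...)` / `("single", ...)` tuples merely stand in for real HTTP requests.

```python
def split_design_docs(docs):
    """Separate regular docs from design docs (ids starting with `_design/`)."""
    regular, design = [], []
    for doc in docs:
        if doc.get("_id", "").startswith("_design/"):
            design.append(doc)
        else:
            regular.append(doc)
    return regular, design


def plan_requests(docs):
    """Plan one bulk request for regular docs and one request per design doc."""
    regular, design = split_design_docs(docs)
    requests = []
    if regular:
        requests.append(("bulk", regular))
    # Each design doc gets its own request, so a permission failure on one
    # of them cannot take down the whole batch.
    requests.extend(("single", [doc]) for doc in design)
    return requests
```
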

Commits on Jan 17, 2020

  1. Improve replicator error reporting

    Previously many HTTP requests failed noisily with `function_clause` errors.
    Expect some of those failures and handle them better. There are mainly 3 types
    of improvements:
    
     1) Error messages are shorter. Instead of `function_clause` with cryptic
     internal fun names, return a simple marker like `bulk_docs_failed`.
    
     2) Include the error body if it was returned. Besides the error code, HTTP
     failures may contain useful information in the body to help debug the failure.
    
     3) Do not log or include the stack trace in the message. The error names are
     enough to identify the place where they are generated, so avoid spamming the
     user and the logs with them. This is done by using `{shutdown, Error}` tuples
     to bubble up the error to the replication scheduler.
    
    There is a small but related cleanup: the source and target monitors are
    removed. We would want to handle those errors better, but they are never
    triggered since we removed local replication endpoints recently.
    
    Fixes: apache#2413
    nickva authored and jiangphcn committed Jan 17, 2020
    0388994
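
The first two improvements in the commit above (a short marker plus the HTTP error body when present) could look roughly like the sketch below. The output format and the `short_error` helper are hypothetical and chosen only for illustration; the replicator's actual error terms are Erlang tuples and may look different.

```python
def short_error(marker, status=None, body=None):
    """Build a compact error string: a marker, then status and body if known.

    Hypothetical format for illustration; no stack trace is included.
    """
    parts = [marker]
    if status is not None:
        parts.append(str(status))
    if body:
        # The response body often explains why the request was rejected.
        parts.append(body.strip())
    return ": ".join(parts)
```

For example, `short_error("bulk_docs_failed", 403, '{"error":"forbidden"}')` yields a single readable line instead of a `function_clause` trace.
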
  2. Properly account for replication stats when splitting bulk docs batches

    Previously, if a batch of bulk docs had to be bisected in order to fit a lower max
    request size limit on the target, we only counted stats for the second batch.
    So it was possible that we missed some `doc_write_failures` updates, which
    can be perceived as data loss by the customer.
    
    So we use the handy-dandy `sum_stats/2` function to sum the return stats from
    both batches and return that.
    
    Issue: apache#2414
    nickva authored and jiangphcn committed Jan 17, 2020
    f506ba2
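
The fix in the commit above can be sketched as follows. This is an illustrative model, not the replicator's Erlang code: the size limit is simplified to a document count, `send` is a hypothetical callback that writes one batch and returns its stats, and `sum_stats` here is a Python stand-in for the function of the same name mentioned in the commit message.

```python
def sum_stats(a, b):
    """Add two stats dicts key by key, treating missing keys as 0."""
    return {key: a.get(key, 0) + b.get(key, 0) for key in a.keys() | b.keys()}


def write_batch(docs, max_docs, send):
    """Write docs, bisecting batches that exceed `max_docs`.

    The stats of BOTH halves are summed, so failures recorded in the first
    half are not lost.
    """
    if len(docs) > max_docs and len(docs) > 1:
        mid = len(docs) // 2
        first = write_batch(docs[:mid], max_docs, send)
        second = write_batch(docs[mid:], max_docs, send)
        return sum_stats(first, second)
    return send(docs)
```

Returning only `second` from the bisecting branch would reproduce the bug described above: any `doc_write_failures` counted in the first half would disappear from the totals.
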
  3. Preserve replication job stats when jobs are re-created

    Previously we made sure replication job statistics were preserved when
    the jobs were started and stopped by the scheduler. However, if a db
    node restarted or a user re-created the job, replication stats would be
    reset to 0.
    
    Some statistics like `docs_read` and `docs_written` are perhaps not as
    critical. However, `doc_write_failures` is. That is the indicator that
    some docs have not replicated to the target. Not
    preserving that statistic meant users could perceive there was data
    loss during replication -- data was replicated successfully according
    to the replication job with no write failures, the user deletes the source
    database, then some time later notices some of their data is missing.
    
    These statistics were already logged in the checkpoint history and we
    just had to initialize a stats object from them when a replication job
    starts. In that initialization code we pick the highest values from
    either the running scheduler or the checkpointed log. The reason is
    that the running stats could be higher if, say, a job was stopped suddenly
    and failed to checkpoint but the scheduler retained the data.
    
    Fixes: apache#2414
    nickva authored and jiangphcn committed Jan 17, 2020
    6db8b57
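
The initialization rule from the commit above, picking the highest value from either source, can be sketched like this. `init_stats` and the dict representation are assumptions made for illustration; the actual implementation works on the replicator's own stats object.

```python
def init_stats(scheduler_stats, checkpoint_stats):
    """Start from the highest value seen in either source, per statistic."""
    keys = scheduler_stats.keys() | checkpoint_stats.keys()
    return {
        key: max(scheduler_stats.get(key, 0), checkpoint_stats.get(key, 0))
        for key in keys
    }
```

Taking the maximum covers both cases: after a node restart the scheduler has nothing and the checkpoint wins, while after a sudden stop without a checkpoint the scheduler's retained values win.
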
  4. Fix fabric worker failures for partition requests

    Previously any failed node or rexi worker error resulted in requests failing
    immediately even though there were available workers to keep handling the
    request. This was because the progress check function didn't account for the
    fact that partition requests only use a handful of shards which, by design, do
    not complete the full ring.
    
    Here we fix both partition info queries and dreyfus search functionality. We
    follow the pattern from fabric and pass through a set of "ring options" that
    let the progress function know it is dealing with partitions instead of a full
    ring.
    nickva authored and jiangphcn committed Jan 17, 2020
    881e0e0
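
To illustrate why a full-ring progress check fails for partition requests, here is a simplified sketch of the commit above. It assumes shard ranges are inclusive `(begin, end)` tuples over a 32-bit ring and models the "ring options" as an optional list of required ranges; `covers_full_ring`, `can_progress`, and `required_ranges` are hypothetical names, and the real fabric logic is more involved.

```python
RING_END = 2**32 - 1  # assumed upper bound of the hash ring


def covers_full_ring(ranges):
    """True if the ranges chain from 0 to RING_END without gaps."""
    reach = 0  # next ring position that still needs coverage
    for begin, end in sorted(set(ranges)):
        if begin > reach:
            return False  # gap in the ring
        reach = max(reach, end + 1)
    return reach > RING_END


def can_progress(live_ranges, required_ranges=None):
    """Decide whether a request can still complete with the live workers.

    Without `required_ranges` the live workers must cover the full ring.
    With it (the partition case), only those ranges need a live worker.
    """
    if required_ranges is None:
        return covers_full_ring(live_ranges)
    live = set(live_ranges)
    return all(rng in live for rng in required_ranges)
```

A partition request that only touches the first half of the ring can never pass the full-ring check, but it passes once the progress function is told which ranges it actually needs.
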