
doc_write_failures count is not preserved sometimes #2414

Closed
nickva opened this issue Jan 7, 2020 · 0 comments

nickva (Contributor) commented Jan 7, 2020

The `doc_write_failures` statistic is important to preserve, as it is the only indicator that some documents might not have replicated to the target.

Recently there was a fix to make sure it is preserved across scheduled job starts and stops. However, if the replication job is completely removed from the scheduler [+], it doesn't get preserved.

Luckily, that value is saved in the checkpoint history, so we could recover it from there. However, it is still not 100% guaranteed, as the checkpoint history is limited and the statistics are not accumulated. So we might have to update the checkpoint history format to accumulate this stat.

[+] This could happen if the job is changed from normal to continuous or vice versa, the user simply removes and re-creates it, or the cluster is restarted; the statistic value is then not preserved.
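
As a rough sketch of that recovery idea, assuming the checkpoint history is a list of maps carrying a `doc_write_failures` field (the real record format in the replicator differs), the counter could be seeded from the largest value seen in whatever history survived:

```erlang
%% Hypothetical sketch, not actual couch_replicator code: recover the
%% doc_write_failures counter from the saved checkpoint history.
-module(history_recover_sketch).
-export([doc_write_failures/1]).

%% History is assumed to be a list of maps; missing or empty history
%% yields 0.
doc_write_failures(History) when is_list(History) ->
    Values = [maps:get(<<"doc_write_failures">>, Entry, 0) || Entry <- History],
    lists:max([0 | Values]).
```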

nickva self-assigned this Jan 7, 2020
nickva added a commit that referenced this issue Jan 13, 2020
Previously we made sure replication job statistics were preserved when the jobs
were started and stopped by the scheduler. However, if a VM node restarted or
the user re-created the job, replication stats would be reset to 0.

Some statistics like `docs_read` and `docs_written` were perhaps not as
critical. However `doc_write_failures` was. That is the only indicator that
some replication docs have been skipped and not replicated to the target. Not
preserving that statistic meant users could perceive a data loss.

These statistics were already logged in the checkpoint history and we just had
to initialize a stats object from them when a replication job starts. In that
initialization code we pick the highest values from either the running
scheduler or the checkpointed log. The reason is that the running stats could
be higher if, say, the job was stopped suddenly and failed to checkpoint but
the scheduler retained the data.

Fixes: #2414
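
A minimal sketch of that initialization choice, assuming the stats are kept as maps of counters (hypothetical module and function names, not the actual replicator code):

```erlang
%% Hypothetical sketch: when a job starts, seed its stats from the
%% larger of the scheduler-retained counters and the checkpointed ones,
%% field by field. Requires OTP 24+ for maps:merge_with/3.
-module(stats_init_sketch).
-export([max_stats/2]).

max_stats(SchedulerStats, CheckpointStats)
        when is_map(SchedulerStats), is_map(CheckpointStats) ->
    maps:merge_with(fun(_Key, A, B) -> max(A, B) end,
                    SchedulerStats, CheckpointStats).
```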
nickva added a commit that referenced this issue Jan 13, 2020
Previously, if a batch of bulk docs had to be bisected in order to fit a lower
max request size limit on the target, we only counted stats for the second
batch. So it was possible we might have missed some `doc_write_failures`
updates, which could be perceived as data loss by the customer.

So we use the handy-dandy `sum_stats/2` function to sum the returned stats from
both batches and return that.

Issue: #2414
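
In the spirit of that `sum_stats/2` helper, a minimal sketch assuming a simple stats record (the real record in the replicator has more fields):

```erlang
%% Hypothetical sketch: after bisecting a bulk_docs batch, sum the stats
%% from both halves instead of keeping only the second result.
-module(sum_stats_sketch).
-export([sum_stats/2]).

-record(stats, {docs_read = 0, docs_written = 0, doc_write_failures = 0}).

sum_stats(#stats{} = A, #stats{} = B) ->
    #stats{docs_read = A#stats.docs_read + B#stats.docs_read,
           docs_written = A#stats.docs_written + B#stats.docs_written,
           doc_write_failures = A#stats.doc_write_failures +
                                B#stats.doc_write_failures}.
```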
nickva added a commit that referenced this issue Jan 14, 2020
Previously we made sure replication job statistics were preserved when
the jobs were started and stopped by the scheduler. However, if a db
node restarted or the user re-created the job, replication stats would
be reset to 0.

Some statistics like `docs_read` and `docs_written` are perhaps not as
critical. However `doc_write_failures` is. That is the indicator that
some replication docs have not replicated to the target. Not
preserving that statistic meant users could perceive there was a data
loss during replication: data was replicated successfully according to
the replication job with no write failures, the user deletes the source
database, then some time later notices some of their data is missing.

These statistics were already logged in the checkpoint history and we
just had to initialize a stats object from them when a replication job
starts. In that initialization code we pick the highest values from
either the running scheduler or the checkpointed log. The reason is
that the running stats could be higher if, say, the job was stopped
suddenly and failed to checkpoint but the scheduler retained the data.

Fixes: #2414
nickva closed this as completed in 3573dcc Jan 14, 2020
jiangphcn pushed a commit to cloudant/couchdb that referenced this issue Jan 17, 2020