doc_write_failures count is not preserved sometimes #2414
nickva added a commit that referenced this issue on Jan 13, 2020:

Previously we made sure replication job statistics were preserved when jobs were started and stopped by the scheduler. However, if a db node restarted or a user re-created the job, replication stats would be reset to 0. Some statistics, like `docs_read` and `docs_written`, are perhaps not as critical, but `doc_write_failures` is: it is the indicator that some replication docs have not replicated to the target. Not preserving that statistic meant users could perceive a data loss during replication -- data was replicated successfully according to the replication job with no write failures, the user deletes the source database, then some time later notices some of their data is missing. These statistics were already logged in the checkpoint history, so we just had to initialize a stats object from them when a replication job starts. In that initialization code we pick the highest values from either the running scheduler or the checkpointed log, because the running stats could be higher if, say, a job was stopped suddenly and failed to checkpoint while the scheduler retained the data. Fixes: #2414
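As a rough sketch of the "pick the highest values" initialization described above (the proplist shape and the module/function names here are illustrative assumptions; CouchDB keeps its stats in its own record):

```erlang
-module(stats_merge_sketch).
-export([max_stats/2]).

%% Merge two stats proplists, e.g. [{doc_write_failures, 2}, ...],
%% keeping the highest value seen for each counter. This mirrors the
%% idea of preferring whichever of the running-scheduler stats or the
%% checkpointed stats is further along.
max_stats(RunningStats, CheckpointedStats) ->
    Keys = lists:usort(proplists:get_keys(RunningStats) ++
                       proplists:get_keys(CheckpointedStats)),
    [{Key, max(proplists:get_value(Key, RunningStats, 0),
               proplists:get_value(Key, CheckpointedStats, 0))}
     || Key <- Keys].
```

For example, `max_stats([{doc_write_failures, 3}], [{doc_write_failures, 5}, {docs_written, 10}])` yields `[{doc_write_failures, 5}, {docs_written, 10}]`.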
nickva added a commit that referenced this issue on Jan 13, 2020:

Previously, if a batch of bulk docs had to be bisected in order to fit a lower max request size limit on the target, we only counted stats for the second batch. That meant we might have missed some `doc_write_failures` updates, which could be perceived as data loss by the customer. We now use the `sum_stats/2` function to sum the returned stats from both batches and return that. Issue: #2414
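A minimal sketch of the bisect-and-sum behavior described above. Here `flush_docs/2` is a hypothetical stand-in for the actual bulk-docs POST to the target, rigged to reject batches over a toy size limit so the bisect path runs; only the shape of the fix (summing both halves' stats) is the point:

```erlang
-module(bisect_sketch).
-export([flush_and_count/2]).

%% Hypothetical stand-in for the real HTTP POST to the target's
%% _bulk_docs endpoint: rejects "oversized" batches (> 2 docs here) the
%% way a server with a low max request size limit would.
flush_docs(_Target, Docs) when length(Docs) > 2 ->
    {error, request_body_too_large};
flush_docs(_Target, Docs) ->
    {ok, [{docs_written, length(Docs)}, {doc_write_failures, 0}]}.

%% Sum two stats proplists key by key, in the spirit of `sum_stats/2`.
sum_stats(StatsA, StatsB) ->
    [{Key, Val + proplists:get_value(Key, StatsB, 0)} || {Key, Val} <- StatsA].

flush_and_count(Target, Docs) ->
    case flush_docs(Target, Docs) of
        {error, request_body_too_large} ->
            {Batch1, Batch2} = lists:split(length(Docs) div 2, Docs),
            %% The fix: return the sum of BOTH halves' stats instead of
            %% only the second half's.
            sum_stats(flush_and_count(Target, Batch1),
                      flush_and_count(Target, Batch2));
        {ok, Stats} ->
            Stats
    end.
```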
jiangphcn pushed the same commits to cloudant/couchdb on Jan 17, 2020, referencing this issue as apache#2414.
The `doc_write_failures` statistic is important to preserve, as it is the only indicator that some documents might not have replicated to the target. Recently there was a fix to make sure it is preserved across scheduled job starts and stops. However, if the replication job is completely removed from the scheduler [+], it does not get preserved.

Luckily, that value is saved in the checkpoint history, so we could recover it from there, as sketched below. However, this is still not 100% guaranteed, since the checkpoint history is limited and the statistics are not accumulated. So we might have to update the checkpoint history format to accumulate this stat.

[+] This could happen if the job is changed from normal to continuous or vice versa, if the user simply removes and re-creates it, or if the cluster is restarted; the statistic value is then not preserved.
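As a rough illustration of the recovery path mentioned above, here is a minimal sketch that pulls the highest `doc_write_failures` value out of a decoded replication checkpoint (`_local`) document. The `{Props}` EJSON shape and the `history`/`doc_write_failures` field names mirror the checkpoint format, but the module and function are hypothetical, not the actual couch_replicator code:

```erlang
-module(checkpoint_sketch).
-export([recover_write_failures/1]).

%% The argument is assumed to be the decoded _local replication log in
%% the {Props} EJSON shape produced by jiffy-style JSON decoding.
recover_write_failures({Props}) ->
    History = proplists:get_value(<<"history">>, Props, []),
    Failures = [proplists:get_value(<<"doc_write_failures">>, EntryProps, 0)
                || {EntryProps} <- History],
    %% Take the highest value seen across the retained history entries;
    %% [0 | ...] keeps lists:max/1 happy when the history is empty.
    lists:max([0 | Failures]).
```

Because the history is limited, the recovered value is a best effort rather than a guaranteed total, which is why accumulating the stat in the checkpoint format may still be needed.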