
doc_write_failures count is not preserved sometimes #2414

Closed
nickva opened this issue Jan 7, 2020 · 0 comments

nickva (Contributor) commented Jan 7, 2020

The `doc_write_failures` statistic is important to preserve, as it is the only indicator that some documents might not have replicated to the target.

Recently there was a fix to make sure it is preserved across scheduled job starts and stops. However, if the replication job is completely removed from the scheduler [+], it doesn't get preserved.

Luckily, that value is saved in the checkpoint history, so we could recover it from there. However, it is still not 100% guaranteed, as the checkpoint history is limited and the statistics are not accumulated. So we might have to update the checkpoint history format to accumulate this stat.

[+] This could happen if the job is changed from normal to continuous or vice versa, the user simply removes and re-creates it, or the cluster is restarted; the statistic value is then not preserved.
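
As a rough sketch of that recovery idea, assuming the checkpoint history is a list of maps carrying a `doc_write_failures` field (the real record format in the replicator differs), the counter could be seeded from the largest value seen in whatever history survived:

```erlang
%% Hypothetical sketch, not actual couch_replicator code: recover the
%% doc_write_failures counter from the saved checkpoint history.
-module(history_recover_sketch).
-export([doc_write_failures/1]).

%% History is assumed to be a list of maps; missing or empty history
%% yields 0.
doc_write_failures(History) when is_list(History) ->
    Values = [maps:get(<<"doc_write_failures">>, Entry, 0) || Entry <- History],
    lists:max([0 | Values]).
```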

nickva self-assigned this Jan 7, 2020
nickva added a commit that referenced this issue Jan 13, 2020
Previously we made sure replication job statistics were preserved when the jobs
were started and stopped by the scheduler. However, if a VM node restarted or
the user re-created the job, replication stats would be reset to 0.

Some statistics like `docs_read` and `docs_written` were perhaps not as
critical. However `doc_write_failures` was. That is the only indicator that
some replication docs have been skipped and not replicated to the target. Not
preserving that statistic meant users could perceive a data loss.

These statistics were already logged in the checkpoint history and we just had
to initialize a stats object from them when a replication job starts. In that
initialization code we pick the highest values from either the running
scheduler or the checkpointed log. The reason is that the running stats could
be higher if, say, the job was stopped suddenly and failed to checkpoint but
the scheduler retained the data.

Fixes: #2414
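
A minimal sketch of that initialization choice, assuming the stats are kept as maps of counters (hypothetical module and function names, not the actual replicator code):

```erlang
%% Hypothetical sketch: when a job starts, seed its stats from the
%% larger of the scheduler-retained counters and the checkpointed ones,
%% field by field. Requires OTP 24+ for maps:merge_with/3.
-module(stats_init_sketch).
-export([max_stats/2]).

max_stats(SchedulerStats, CheckpointStats)
        when is_map(SchedulerStats), is_map(CheckpointStats) ->
    maps:merge_with(fun(_Key, A, B) -> max(A, B) end,
                    SchedulerStats, CheckpointStats).
```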
nickva added a commit that referenced this issue Jan 13, 2020
Previously, if a batch of bulk docs had to be bisected in order to fit a lower
max request size limit on the target, we only counted stats for the second
batch. So it was possible we might have missed some `doc_write_failures`
updates, which could be perceived as data loss by the customer.

So we use the handy-dandy `sum_stats/2` function to sum the returned stats from
both batches and return that.

Issue: #2414
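
In the spirit of that `sum_stats/2` helper, a minimal sketch assuming a simple stats record (the real record in the replicator has more fields):

```erlang
%% Hypothetical sketch: after bisecting a bulk_docs batch, sum the stats
%% from both halves instead of keeping only the second result.
-module(sum_stats_sketch).
-export([sum_stats/2]).

-record(stats, {docs_read = 0, docs_written = 0, doc_write_failures = 0}).

sum_stats(#stats{} = A, #stats{} = B) ->
    #stats{docs_read = A#stats.docs_read + B#stats.docs_read,
           docs_written = A#stats.docs_written + B#stats.docs_written,
           doc_write_failures = A#stats.doc_write_failures +
                                B#stats.doc_write_failures}.
```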
nickva added a commit that referenced this issue Jan 14, 2020
Previously we made sure replication job statistics were preserved when
the jobs were started and stopped by the scheduler. However, if a db
node restarted or the user re-created the job, replication stats would
be reset to 0.

Some statistics like `docs_read` and `docs_written` are perhaps not as
critical. However `doc_write_failures` is. That is the indicator that
some replication docs have not replicated to the target. Not
preserving that statistic meant users could perceive there was a data
loss during replication: data was replicated successfully according to
the replication job with no write failures, the user deletes the source
database, then some time later notices some of their data is missing.

These statistics were already logged in the checkpoint history and we
just had to initialize a stats object from them when a replication job
starts. In that initialization code we pick the highest values from
either the running scheduler or the checkpointed log. The reason is
that the running stats could be higher if, say, the job was stopped
suddenly and failed to checkpoint but the scheduler retained the data.

Fixes: #2414
nickva closed this as completed in 3573dcc Jan 14, 2020
jiangphcn pushed a commit to cloudant/couchdb that referenced this issue Jan 17, 2020