Replication crashes just on single database from many #4204

arahjan opened this issue Oct 12, 2022 · 8 comments

arahjan commented Oct 12, 2022

CouchDB 2.3.1 on CentOS 7

I'm replicating more than 20 dbs from one server to another. The process works flawlessly apart from a single database.
The database in question consists of more than 100k small documents; its database info looks like this:

"db_name": "test", "purge_seq": "0-g1AAAAFTeJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMeC5BkeACk_gNBViIDHrVJCUAyqZ6gOoiZCyBm7idG7QGI2vsE7FcA2W9P0P5EhiR5wp5xABkWT6RnGiAOnA9UmwUAtixejg", "update_seq": "100785-g1AAAAFreJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L8____s5IYGAyr8KhLUgCSSfYwpdX4lDqAlMZDlRrswqc0AaS0HmaqPx6leSxAkqEBSAFVzwcr_0FQ-QKI8v1gh_wlqPwARPl9sOnsBJU_gCiHeHN7FgAkvmTM", "sizes": { "file": 82658990, "external": 77019483, "active": 82112519 }, "other": { "data_size": 77019483 }, "doc_del_count": 4, "doc_count": 100766, "disk_size": 82658990, "disk_format_version": 7, "data_size": 82112519, "compact_running": false, "cluster": { "q": 8, "n": 1, "w": 1, "r": 1 }, "instance_start_time": "0"

After replicating about 79k docs, the replication crashes with output like below:

[error] 2022-10-12T10:43:26.563124Z [email protected] <0.30702.7> -------- CRASH REPORT Process (<0.30702.7>) with 5 neighbors exited with reason: {worker_died,<0.30700.7>,{process_died,<0.3280.8>,{{nocatch,missing_doc},[{couch_replicator_api_wrap,open_doc_revs,6,[{file,"src/couch_replicator_api_wrap.erl"},{line,302}]},{couch_replicator_worker,'-spawn_doc_reader/3-fun-1-',4,[{file,"src/couch_replicator_worker.erl"},{line,323}]}]}}} at gen_server:terminate/7(line:812) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_replicator_worker,init,['Argument__1']}, ancestors: [<0.30607.7>,couch_replicator_scheduler_sup,couch_replicator_sup,...], messages: [], links: [<0.3401.8>,<0.3503.8>,<0.3700.8>,<0.3413.8>,<0.30703.7>], dictionary: [{last_stats_report,{1665,571404,580133}}], trap_exit: true, status: running, heap_size: 6772, stack_size: 27, reductions: 77732

When I copied this db manually to the second server, the problem went away. I can add documents on the main server and they are copied to the second one.

What could be the culprit of this issue?


nickva commented Oct 13, 2022

missing_doc often means that a document update seen in the changes feed was not found when the replicator went to fetch all of its revisions. That can happen if the document is deleted in the meantime (and compaction runs), or if the document was just created but somehow wasn't propagated to all the nodes in the cluster.

There is one retry which the replicator will do in that case. You can adjust the sleep period with the setting [replicator] missing_doc_retry_msec = 2000. The default is 2 seconds, but you could set it to, say, 10000 (10 seconds). Then check that you have good inter-node (cluster) connectivity.
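
If it helps, here is one way to bump that setting at runtime via the config HTTP API; a minimal sketch, assuming admin credentials and the default port 5984 (on some releases the `_local` alias has to be replaced with the actual node name, e.g. `couchdb@127.0.0.1`):

    # Sketch: raise the missing_doc retry sleep to 10 seconds.
    # Host, credentials and the _local node alias are placeholders/assumptions.
    # The value has to be passed as a JSON string, hence the quoted "10000".
    curl -s -X PUT \
        http://admin:password@127.0.0.1:5984/_node/_local/_config/replicator/missing_doc_retry_msec \
        -d '"10000"'

The same value can of course go straight into local.ini under the [replicator] section instead.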


arahjan commented Oct 14, 2022

I added this parameter, but it didn't help.
I fetched the database from the MAIN server and restored it to SPARE. Unfortunately that didn't help either. Even though both servers have the same number of documents, the replication status is "crashed".
There is info in the crash details pointing to {couch_replicator_api_wrap.erl"},{line,302}]}, which corresponds to the line `NewMaxLen = get_value(max_url_len, Options, ?MAX_URL_LEN) div 2,`. Could the max_url_len parameter somehow be related then? What's the default value and is it possible to increase it?
Connectivity between nodes is fine. As I said, just this particular database is problematic.


nickva commented Oct 14, 2022

If you see max_url_len in the logs, check whether there are any 414 HTTP error responses from the replication endpoints. It could be that a proxy limits the maximum URL length, or that max_document_id_length was set too low for CouchDB.

One case where that could also apply is when there are a lot of conflicted revisions: those end up passed to the document fetch request as an atts_since=... list, and if the response is a 414 the replicator will retry with a shorter list of atts_since=... values.

Here is where I found a reference to it:

    {'DOWN', Ref, process, Pid, request_uri_too_long} ->
        NewMaxLen = get_value(max_url_len, Options, ?MAX_URL_LEN) div 2,
        case NewMaxLen < ?MIN_URL_LEN of
            true ->
                throw(request_uri_too_long);
            false ->
                couch_log:info(
                    "Reducing url length to ~B because of"
                    " 414 response",
                    [NewMaxLen]
                ),
                Options1 = lists:keystore(
                    max_url_len,
                    1,
                    Options,
                    {max_url_len, NewMaxLen}
                ),
                open_doc_revs(HttpDb, Id, Revs, Options1, Fun, Acc)
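
To rule out the max_document_id_length angle mentioned above, you can dump the [couchdb] config section on both servers and look for that key (if it is absent, the default of infinity applies), and grep the logs for 414 responses. A rough sketch with placeholder hosts, credentials and log path:

    # Check whether max_document_id_length was lowered on either side
    # (hosts and credentials below are placeholders).
    curl -s http://admin:password@source-host:5984/_node/_local/_config/couchdb
    curl -s http://admin:password@target-host:5984/_node/_local/_config/couchdb

    # Look for 414 responses / URL-length reductions on the node running
    # the replication jobs (adjust the log path to your installation).
    grep '414' /var/log/couchdb/couchdb.log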


arahjan commented Oct 17, 2022

After some drilling down and increasing the logging level, I managed to find some errors related to missing revisions:

    [error] 2022-10-17T11:54:12.491238Z [email protected] <0.31178.16> -------- Retrying fetch and update of document ABC as it is unexpectedly missing. Missing revisions are: 9-6ab086bc66baa1fffe312b90654d90e5

    [debug] 2022-10-17T11:54:32.173657Z [email protected] <0.217.0> -------- New task status for <0.20065.16>: [{changes_pending,null},{checkpoint_interval,30000},{checkpointed_source_seq,0},{continuous,true},{database,<<"shards/40000000-5fffffff/_replicator.1665052660">>},{doc_id,<<"ngraph">>},{doc_write_failures,0},{docs_read,92279},{docs_written,92279},{missing_revisions_found,92279},{replication_id,<<"58c7cfcc0f22e9d73693b78ec745e04d+continuous">>},{revisions_checked,406477},{source,<<"http:https://admin:[email protected]/cdb2/test/">>},{source_seq,<<"79173-g1AAAAJ7eJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L8____szKYkxgY1HbkAsXYk41NzUwtE7HpwWNSkgKQTLJHGDYJbJhZkmWqqWUyqYY5gAyLRxi2HGyYcbKpUYoByS5LABlWjzCsGmxYooV5oomJKYmG5bEASYYGIAU0bz7UwHawgYaGxoYGluZkGbgAYuB-qIH7wQYaGJlYpqUkkWXgAYiB96EGHgIbmGpqaJqYaEmWgQ8gBsLC8CLEQAMzCwszC2xaswBFLKSv">>},{started_on,1666007643},{target,<<"http:https://admin:[email protected]/cdb2/test/">>},{through_seq,<<"78031-g1AAAAJ7eJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L8____szKYkxgY1HRygWLsycamZqaWidj04DEpSQFIJtkjDGMDG2aWZJlqaplMqmEOIMPiEYYJgQ0zTjY1SjEg2WUJIMPq4YapfgcblmhhnmhiYkqiYXksQJKhAUgBzZsPNfA32EBDQ2NDA0tzsgxcADFwP9S7-mADDYxMLNNSksgy8ADEwPsoBqaaGpomJlqSZeADiIGwCLGGGGhgZmFhZoFNaxYAUe6iNw">>},{type,replication},{updated_on,1666007672},{user,null}]

I checked the document this error refers to, and revision 9-6ab086bc66baa1fffe312b90654d90e5 does exist on that document on the replication source.

Any ideas are welcome...


nickva commented Oct 17, 2022

Try setting the checkpoint_interval to 5000 (5 seconds), down from the 30 second default, so that if the replication job crashes it doesn't have to backtrack too much. Is there anything different about that particular document compared to the other documents? Does it have more conflicts, or is it perhaps updated more often while the replication is happening?
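
For reference, checkpoint_interval is set in milliseconds in the replication document; a minimal sketch with placeholder document name, endpoints and credentials:

    # Sketch: a _replicator document with a 5 second checkpoint interval.
    # The doc name, endpoint URLs and credentials are placeholders.
    curl -s -X PUT http://admin:password@127.0.0.1:5984/_replicator/test-to-spare \
        -H 'Content-Type: application/json' \
        -d '{
              "source": "http://admin:password@source-host:5984/test",
              "target": "http://admin:password@target-host:5984/test",
              "continuous": true,
              "checkpoint_interval": 5000
            }'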

Another thing to try is to check whether this happens on the latest release, 3.2.2, as well. Specifically, check whether it still happens once both the source and the instance running the replication jobs are upgraded.


arahjan commented Oct 17, 2022

> Is there anything different about that particular document compared to the other documents?

Actually, there are more documents being reported in couch.log; I've just added a single one as an example.

> Does it have more conflicts, or is it perhaps updated more often while the replication is happening?

I haven't noticed any. At the moment the source is passive, meaning there are no changes to the documents.


arahjan commented Oct 18, 2022

I haven't tried the upgrade yet.
Anyway, I deleted all the "questionable" docs on the source. Even so, I'm still seeing the replication process trying to fetch a doc which had been deleted:

    [notice] 2022-10-18T15:20:39.950023Z [email protected] <0.24560.251> -------- Retrying GET to http:https://admin:*****@a.b.x.d/cdb2/test/doc1?revs=true&open_revs=%5B%222-b3604d99facabc85abc075995edf2d75%22%5D&latest=true in 16.0 seconds due to error {function_clause,[{couch_replicator_api_wrap,'-open_doc_revs/6-fun-1-',[404,[{[83,101,114,118,101,114],[110,103,105,110,120,47,49,46,49,54,46,49]}

What does the 404 error mean in this context? I guess it's a leftover old revision. Do I have to manually purge old revisions on the source (is that viable)? Or is a manual compaction enough?


nickva commented Oct 18, 2022

@arahjan During replication, deleted document tombstones (markers) are replicated as well. That's needed because if we have the same document on the target, we'd want it to be deleted there too when the source deletes it.

There are a few ways to avoid replicating tombstones, or to remove them after the fact (for example, a filtered replication that skips deleted documents).

Purging could work as well.
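
If you end up purging, _purge takes a map of document id to the list of leaf revisions to remove; a minimal sketch, reusing the doc id and revision from the log line above purely as an example (host and credentials are placeholders):

    # Sketch: purge one tombstone revision of doc1 from the source database.
    # The id/revision are copied from the earlier log snippet as an example;
    # host and credentials are placeholders.
    curl -s -X POST http://admin:password@source-host:5984/test/_purge \
        -H 'Content-Type: application/json' \
        -d '{"doc1": ["2-b3604d99facabc85abc075995edf2d75"]}'

Purged revisions are gone for good, so double-check the ids and revisions before running it, and compact the database afterwards to reclaim the space.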
