Replication crashes just on single database from many #4204

arahjan opened this issue Oct 12, 2022 · 8 comments

arahjan commented Oct 12, 2022

CouchDB 2.3.1 on CentOS 7

I'm replicating more than 20 dbs from one server to another. The process works flawlessly apart from a single database.
The database in question consists of more than 100k small documents; its database info looks like this:

"db_name": "test", "purge_seq": "0-g1AAAAFTeJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMeC5BkeACk_gNBViIDHrVJCUAyqZ6gOoiZCyBm7idG7QGI2vsE7FcA2W9P0P5EhiR5wp5xABkWT6RnGiAOnA9UmwUAtixejg", "update_seq": "100785-g1AAAAFreJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L8____s5IYGAyr8KhLUgCSSfYwpdX4lDqAlMZDlRrswqc0AaS0HmaqPx6leSxAkqEBSAFVzwcr_0FQ-QKI8v1gh_wlqPwARPl9sOnsBJU_gCiHeHN7FgAkvmTM", "sizes": { "file": 82658990, "external": 77019483, "active": 82112519 }, "other": { "data_size": 77019483 }, "doc_del_count": 4, "doc_count": 100766, "disk_size": 82658990, "disk_format_version": 7, "data_size": 82112519, "compact_running": false, "cluster": { "q": 8, "n": 1, "w": 1, "r": 1 }, "instance_start_time": "0"

After replicating about 79k docs, the replication crashes with output like below:

[error] 2022-10-12T10:43:26.563124Z [email protected] <0.30702.7> -------- CRASH REPORT Process (<0.30702.7>) with 5 neighbors exited with reason: {worker_died,<0.30700.7>,{process_died,<0.3280.8>,{{nocatch,missing_doc},[{couch_replicator_api_wrap,open_doc_revs,6,[{file,"src/couch_replicator_api_wrap.erl"},{line,302}]},{couch_replicator_worker,'-spawn_doc_reader/3-fun-1-',4,[{file,"src/couch_replicator_worker.erl"},{line,323}]}]}}} at gen_server:terminate/7(line:812) <= proc_lib:init_p_do_apply/3(line:247); initial_call: {couch_replicator_worker,init,['Argument__1']}, ancestors: [<0.30607.7>,couch_replicator_scheduler_sup,couch_replicator_sup,...], messages: [], links: [<0.3401.8>,<0.3503.8>,<0.3700.8>,<0.3413.8>,<0.30703.7>], dictionary: [{last_stats_report,{1665,571404,580133}}], trap_exit: true, status: running, heap_size: 6772, stack_size: 27, reductions: 77732

When I copied this db manually to the second server, the problem went away. I can add documents on the main server and they are copied to the second one.

What could be the culprit of this issue?


nickva commented Oct 13, 2022

missing_doc often means that a document update seen in the changes feed was not found when the replicator went to fetch all of its revisions. That can happen if the document is deleted in the meantime (and compaction runs), or if the document was just created but somehow wasn't propagated to all the nodes in the cluster.

There is one retry which the replicator will do in that case. You can adjust the sleep period with the setting [replicator] missing_doc_retry_msec = 2000. The default is 2 seconds, but you could set it to, say, 10000 (10 seconds). Then check that you have good inter-node (cluster) connectivity.
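
If it helps, here is one way to bump that setting at runtime via the config HTTP API; a minimal sketch, assuming admin credentials and the default port 5984 (on some releases the `_local` alias has to be replaced with the actual node name, e.g. `couchdb@127.0.0.1`):

    # Sketch: raise the missing_doc retry sleep to 10 seconds.
    # Host, credentials and the _local node alias are placeholders/assumptions.
    # The value has to be passed as a JSON string, hence the quoted "10000".
    curl -s -X PUT \
        http://admin:password@127.0.0.1:5984/_node/_local/_config/replicator/missing_doc_retry_msec \
        -d '"10000"'

The same value can of course go straight into local.ini under the [replicator] section instead.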


arahjan commented Oct 14, 2022

I added this parameter, but it didn't help.
I fetched the database from the MAIN server and restored it to SPARE. Unfortunately that didn't help either. Even though both servers have the same number of documents, the replication status is "crashed".
There is info in the crash details pointing to {couch_replicator_api_wrap.erl"},{line,302}]}, which corresponds to the line `NewMaxLen = get_value(max_url_len, Options, ?MAX_URL_LEN) div 2,`. Could the max_url_len parameter somehow be related then? What's the default value and is it possible to increase it?
Connectivity between nodes is fine. As I said, just this particular database is problematic.


nickva commented Oct 14, 2022

If you see max_url_len in the logs, check whether there are any 414 HTTP error responses from the replication endpoints. It could be that a proxy limits the maximum URL length, or that max_document_id_length was set too low for CouchDB.

One case where that could also apply is when there are a lot of conflicted revisions: those end up passed to the document fetch request as an atts_since=... list, and if the response is a 414 the replicator will retry with a shorter list of atts_since=... values.

Here is where I found a reference to it:

    {'DOWN', Ref, process, Pid, request_uri_too_long} ->
        NewMaxLen = get_value(max_url_len, Options, ?MAX_URL_LEN) div 2,
        case NewMaxLen < ?MIN_URL_LEN of
            true ->
                throw(request_uri_too_long);
            false ->
                couch_log:info(
                    "Reducing url length to ~B because of"
                    " 414 response",
                    [NewMaxLen]
                ),
                Options1 = lists:keystore(
                    max_url_len,
                    1,
                    Options,
                    {max_url_len, NewMaxLen}
                ),
                open_doc_revs(HttpDb, Id, Revs, Options1, Fun, Acc)
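
To rule out the max_document_id_length angle mentioned above, you can dump the [couchdb] config section on both servers and look for that key (if it is absent, the default of infinity applies), and grep the logs for 414 responses. A rough sketch with placeholder hosts, credentials and log path:

    # Check whether max_document_id_length was lowered on either side
    # (hosts and credentials below are placeholders).
    curl -s http://admin:password@source-host:5984/_node/_local/_config/couchdb
    curl -s http://admin:password@target-host:5984/_node/_local/_config/couchdb

    # Look for 414 responses / URL-length reductions on the node running
    # the replication jobs (adjust the log path to your installation).
    grep '414' /var/log/couchdb/couchdb.log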


arahjan commented Oct 17, 2022

After some drilling down and increasing the logging level, I managed to find some errors related to missing revisions:

    [error] 2022-10-17T11:54:12.491238Z [email protected] <0.31178.16> -------- Retrying fetch and update of document ABC as it is unexpectedly missing. Missing revisions are: 9-6ab086bc66baa1fffe312b90654d90e5

    [debug] 2022-10-17T11:54:32.173657Z [email protected] <0.217.0> -------- New task status for <0.20065.16>: [{changes_pending,null},{checkpoint_interval,30000},{checkpointed_source_seq,0},{continuous,true},{database,<<"shards/40000000-5fffffff/_replicator.1665052660">>},{doc_id,<<"ngraph">>},{doc_write_failures,0},{docs_read,92279},{docs_written,92279},{missing_revisions_found,92279},{replication_id,<<"58c7cfcc0f22e9d73693b78ec745e04d+continuous">>},{revisions_checked,406477},{source,<<"http:https://admin:[email protected]/cdb2/test/">>},{source_seq,<<"79173-g1AAAAJ7eJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L8____szKYkxgY1HbkAsXYk41NzUwtE7HpwWNSkgKQTLJHGDYJbJhZkmWqqWUyqYY5gAyLRxi2HGyYcbKpUYoByS5LABlWjzCsGmxYooV5oomJKYmG5bEASYYGIAU0bz7UwHawgYaGxoYGluZkGbgAYuB-qIH7wQYaGJlYpqUkkWXgAYiB96EGHgIbmGpqaJqYaEmWgQ8gBsLC8CLEQAMzCwszC2xaswBFLKSv">>},{started_on,1666007643},{target,<<"http:https://admin:[email protected]/cdb2/test/">>},{through_seq,<<"78031-g1AAAAJ7eJzLYWBg4MhgTmEQTM4vTc5ISXIwNDLXMwBCwxygFFMiQ5L8____szKYkxgY1HRygWLsycamZqaWidj04DEpSQFIJtkjDGMDG2aWZJlqaplMqmEOIMPiEYYJgQ0zTjY1SjEg2WUJIMPq4YapfgcblmhhnmhiYkqiYXksQJKhAUgBzZsPNfA32EBDQ2NDA0tzsgxcADFwP9S7-mADDYxMLNNSksgy8ADEwPsoBqaaGpomJlqSZeADiIGwCLGGGGhgZmFhZoFNaxYAUe6iNw">>},{type,replication},{updated_on,1666007672},{user,null}]

I checked the document this error refers to, and revision 9-6ab086bc66baa1fffe312b90654d90e5 does exist on that document on the replication source.

Any ideas are welcome...


nickva commented Oct 17, 2022

Try setting the checkpoint_interval to 5000 (5 seconds), down from the 30 second default, so that if the replication job crashes it doesn't have to backtrack too much. Is there anything different about that particular document compared to the other documents? Does it have more conflicts, or is it perhaps updated more often while the replication is happening?
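
For reference, checkpoint_interval is set in milliseconds in the replication document; a minimal sketch with placeholder document name, endpoints and credentials:

    # Sketch: a _replicator document with a 5 second checkpoint interval.
    # The doc name, endpoint URLs and credentials are placeholders.
    curl -s -X PUT http://admin:password@127.0.0.1:5984/_replicator/test-to-spare \
        -H 'Content-Type: application/json' \
        -d '{
              "source": "http://admin:password@source-host:5984/test",
              "target": "http://admin:password@target-host:5984/test",
              "continuous": true,
              "checkpoint_interval": 5000
            }'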

Another thing to try is to check whether this happens on the latest release, 3.2.2, as well. Specifically, check whether it still happens once both the source and the instance running the replication jobs are upgraded.


arahjan commented Oct 17, 2022

> Is there anything different about that particular document compared to the other documents?

Actually, there are more documents being reported in couch.log; I've just added a single one as an example.

> Does it have more conflicts, or is it perhaps updated more often while the replication is happening?

I haven't noticed any. At the moment the source is passive, meaning there are no changes to the documents.


arahjan commented Oct 18, 2022

I haven't tried the upgrade yet.
Anyway, I deleted all the "questionable" docs on the source. Even so, I'm still seeing the replication process trying to fetch a doc which had been deleted:

    [notice] 2022-10-18T15:20:39.950023Z [email protected] <0.24560.251> -------- Retrying GET to http:https://admin:*****@a.b.x.d/cdb2/test/doc1?revs=true&open_revs=%5B%222-b3604d99facabc85abc075995edf2d75%22%5D&latest=true in 16.0 seconds due to error {function_clause,[{couch_replicator_api_wrap,'-open_doc_revs/6-fun-1-',[404,[{[83,101,114,118,101,114],[110,103,105,110,120,47,49,46,49,54,46,49]}

What does the 404 error mean in this context? I guess it's a leftover old revision. Do I have to manually purge old revisions on the source (is that viable)? Or is a manual compaction enough?


nickva commented Oct 18, 2022

@arahjan During replication, deleted document tombstones (markers) are replicated as well. That's needed because if we have the same document on the target, we'd want it to be deleted there too when the source deletes it.

There are a few ways to avoid replicating tombstones, or to remove them after the fact (for example, a filtered replication that skips deleted documents).

Purging could work as well.
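
If you end up purging, _purge takes a map of document id to the list of leaf revisions to remove; a minimal sketch, reusing the doc id and revision from the log line above purely as an example (host and credentials are placeholders):

    # Sketch: purge one tombstone revision of doc1 from the source database.
    # The id/revision are copied from the earlier log snippet as an example;
    # host and credentials are placeholders.
    curl -s -X POST http://admin:password@source-host:5984/test/_purge \
        -H 'Content-Type: application/json' \
        -d '{"doc1": ["2-b3604d99facabc85abc075995edf2d75"]}'

Purged revisions are gone for good, so double-check the ids and revisions before running it, and compact the database afterwards to reclaim the space.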
