Replication job crashes #4023

tudordumitriu · 2022-05-13T11:52:12Z

We are trying to migrate some DBs between two CouchDB servers. For some of the DBs things run smoothly, but for some bigger ones (15k docs tops) the replication jobs stop, and in _scheduler/docs we can see only the following errors reported:
info: {error: "{worker_died,<0.1956.3>,{bad_return_value,{invalid_json,{1,invalid_json}}}}"}
error: "{worker_died,<0.1956.3>,{bad_return_value,{invalid_json,{1,invalid_json}}}}"

Description

The DBs are structurally the same, but I'd like to find out what the exact error is or which document is causing it.
We have also checked the server logs, and the error reported there is similar to the one above.

CouchDB version used: 3.2.0
Browser name and version: Chrome
Operating system and version: Ubuntu, Docker, K8S, Azure AKS

nickva · 2022-05-13T19:21:14Z

@tudordumitriu a replication worker is one of the 4 (by default) processes spawned by each replication job. Each worker performs POST /_revs_diff requests on the target to find the missing revisions, then a GET with open_revs on the source to fetch all the missing revisions, and finally a POST /_bulk_docs to the target to insert the docs. So any of those requests could be the one that returned an invalid_json response.
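As a rough illustration of that request sequence, here is a minimal sketch that only builds the three requests for a single document without sending anything. The function name `worker_requests`, the example URLs, and the simplified query string are assumptions made for illustration; the real replicator batches many documents and only fetches the revisions that the first step reports as missing.

```python
import json
from urllib.parse import quote

def worker_requests(source, target, doc_id, revs):
    """Build (method, url, body) tuples for one document (hypothetical helper)."""
    # Step 1: ask the target which of these revisions it is missing.
    # In CouchDB's HTTP API, _revs_diff is a POST with a JSON body.
    revs_diff = ("POST", f"{target}/_revs_diff", json.dumps({doc_id: revs}))

    # Step 2: fetch the revisions from the source with open_revs.
    # For simplicity this sketch assumes every revision was reported missing.
    open_revs = quote(json.dumps(revs))
    fetch = ("GET", f"{source}/{quote(doc_id, safe='')}?revs=true&open_revs={open_revs}", None)

    # Step 3: write the fetched docs to the target. new_edits=false keeps the
    # original revision ids; the docs list is left empty as a placeholder.
    bulk = ("POST", f"{target}/_bulk_docs", json.dumps({"new_edits": False, "docs": []}))

    return [revs_diff, fetch, bulk]
```

Any of the three responses is parsed as JSON by the worker, so a malformed body from either server can surface as the same invalid_json error.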

invalid_json can often mean that the connection was abruptly terminated, for example if a rate limit is reached, the connection times out, or a maximum size is reached and the response is cut off. It's hard to say which of those, or what other error, happened without extra logs or information. Would you be able to get more logs from the servers, or ideally capture the request/response bodies?
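To see why a cut-off connection shows up as a JSON error rather than a network error, here is a small sketch using Python's standard `json` module. The response body and the helper name `parses` are made up for illustration; CouchDB uses its own JSON parser, so this only demonstrates the general effect of truncation.

```python
import json

# A made-up response body, standing in for a _bulk_docs or open_revs reply.
full_body = json.dumps({"docs": [{"_id": "a", "value": 1}, {"_id": "b", "value": 2}]})

# Simulate a connection that is terminated halfway through the response.
truncated = full_body[: len(full_body) // 2]

def parses(body):
    """Return True if body is a complete, valid JSON text."""
    try:
        json.loads(body)
        return True
    except json.JSONDecodeError:
        return False
```

Here `parses(full_body)` succeeds while `parses(truncated)` fails, which is why captured request/response bodies are the most direct way to tell truncation apart from a genuinely malformed document.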

tudordumitriu · 2022-05-23T12:36:50Z

Hi @nickva
Thanks for the answer. The problem turned out not to be the connection: there was actually an invalid JSON document, at least from the point of view of server-to-server replication. Let me explain.
First, we disabled all IP rate limiters, firewalls and so on, but nothing got better.
When we backed up the DB to a local DB, the replication worked perfectly (the only difference was the URL and credentials).
But when we tried to replicate (by pull) from the target server, it just stopped; the document may have been logged, but I never managed to find it.
When I then tried the replication by push, from the source to the target, I did notice the document in the logs.
What is strange is that the document was created on an iOS platform (I suspect the iOS file paths) and had been replicated from PouchDB to CouchDB and back to other PouchDB DBs (various platforms) without a problem, including the local replication mentioned above.
The file is attached, hope it helps.
Crash.zip
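For anyone trying to locate such a document before it shows up in the logs, one possible approach is to check each document's raw bytes with a strict decoder. This is only a sketch under assumptions: `find_invalid_docs` is a hypothetical helper, the raw bodies are assumed to have been fetched separately, and Python's parser may accept input that CouchDB's parser rejects (or the other way around), so a clean result here does not prove the document is safe to replicate.

```python
import json

def find_invalid_docs(raw_docs):
    """Return (doc_id, reason) for every body that fails strict decoding.

    raw_docs is an iterable of (doc_id, raw_bytes) pairs.
    """
    bad = []
    for doc_id, raw in raw_docs:
        try:
            text = raw.decode("utf-8")  # strict: rejects invalid UTF-8 sequences
            json.loads(text)            # rejects malformed JSON
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            bad.append((doc_id, str(exc)))
    return bad
```

Checking raw bytes rather than already-parsed documents matters, because a client library that parses leniently can hide exactly the problem being searched for.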

tudordumitriu added bug needs-triage labels May 13, 2022
