
POST _revs_diff request stuck without timing out #4373

Open
chiraganand opened this issue Jan 12, 2023 · 5 comments

@chiraganand

Hi, I have a setup consisting of CouchDB v3.2.2, and I am using PouchDB v7.3.1 in my mobile app.

I am noticing that _revs_diff requests often get stuck and take several seconds, 15 seconds or more. See screenshots.

Within the same session I can see other _revs_diff requests that respond in less than a second, and from another device they are fine. I am a bit clueless because I am not able to create a reproducible example; this happens occasionally while I keep PUTting to the app DB.

In one case the request got stuck in "Initial connection" for 18 minutes, and until it timed out no syncing happened (maybe this should be filed in the PouchDB repo).

What could be going on?

Thanks!

(Screenshots attached: Screenshot_20230112_182500, Screenshot_20230112_184751)
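For reference, the request in question is a POST to `/{db}/_revs_diff` with a body mapping doc ids to revision lists; the server answers with the revisions it is missing. A minimal sketch, assuming a local CouchDB at `127.0.0.1:5984` (the helper names, db name, and rev ids here are illustrative, not from this thread):

```python
import json
from urllib import request

COUCH_URL = "http://127.0.0.1:5984"  # assumption: local, unauthenticated CouchDB

def revs_diff_payload(revs_by_doc):
    """Build the _revs_diff request body: {doc_id: [rev, ...], ...}."""
    return json.dumps(revs_by_doc).encode("utf-8")

def missing_revs(response_body):
    """Extract the revisions the server reports as missing, per doc id."""
    parsed = json.loads(response_body)
    return {doc_id: info.get("missing", []) for doc_id, info in parsed.items()}

def post_revs_diff(db, revs_by_doc, timeout=15):
    """POST /{db}/_revs_diff with an explicit client-side timeout."""
    req = request.Request(
        f"{COUCH_URL}/{db}/_revs_diff",
        data=revs_diff_payload(revs_by_doc),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return missing_revs(resp.read())
```

Passing an explicit `timeout` at least turns an indefinitely stuck request into a visible error on the client side.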

@nickva (Contributor) commented Jan 12, 2023

Perhaps we're hitting a limit somewhere: disk usage, CPU, or network. What about other requests: is it just _revs_diff, or do _bulk_docs requests also do this sometimes?

See if there is a log entry on the server with that timestamp. Is there a difference between the payloads (bodies) of the succeeding requests and the failing ones?

Also, 3.3.1 has a few _revs_diff optimizations; if you have a chance, try upgrading on a test instance to see if you still see the same issue.

@chiraganand (Author)

> Perhaps we're hitting a limit somewhere - disk usage, CPU, network. What about other requests, just _revs_diffs or do _bulk_docs also do this sometimes?

I checked the _revs_diff endpoint by running JMeter (10 threads × 100 iterations), and it hangs after some 700 requests. The requests then time out after 120 seconds or so, before even making a connection. Yet I could not see any load increase on the server: no CPU wait time, enough physical memory available, and enough bandwidth available.

> See if there is a log on the server with that timestamp. Is there a difference between payloads (bodies) between the succeeding requests and failing ones?

No, I used the same request body every time.

> Also, 3.3.1 has a few _revs_diff optimizations, if you have a chance could try experimenting upgrading on a test instance to see if you still see the same issue.

Yes, I will try this. I just want to gather some more evidence before trying the new Couch version.
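The JMeter setup described above (N threads × M iterations, recording per-request latency and failures) can be sketched in plain Python; `run_load_test` and its parameters are hypothetical names, and `make_request` stands in for whatever callable performs one _revs_diff POST and raises on timeout:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(make_request, threads=10, iterations=100):
    """Fire threads*iterations requests concurrently; return (latencies, error_count)."""
    latencies, errors = [], 0

    def one_request(_):
        # Time a single request; capture the exception instead of aborting the run.
        start = time.monotonic()
        try:
            make_request()
            return time.monotonic() - start, None
        except Exception as exc:
            return time.monotonic() - start, exc

    with ThreadPoolExecutor(max_workers=threads) as pool:
        for elapsed, exc in pool.map(one_request, range(threads * iterations)):
            if exc is None:
                latencies.append(elapsed)
            else:
                errors += 1
    return latencies, errors
```

Plotting or binning the latency list makes the "fine until ~700 requests, then stuck" pattern easy to spot.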

@nickva (Contributor) commented Jan 18, 2023

It's strange that it times out before even making a connection.

You could also try tweaking the number of replication workers, maybe try 1 instead of 4:

https://docs.couchdb.org/en/stable/config/replicator.html

```ini
[replicator]
worker_processes = 1
```

Or try increasing or decreasing the number of connections:

```ini
[replicator]
http_connections = 40
```

If there are proxies in between, they could also be blocking or timing out the connections. It might help to inspect the logs there for connection states.

@chiraganand (Author)

I tried v3.3.1 and was able to reproduce the error, after exactly 800 requests, on an EC2 m4.large VM.

Some more observations:

  1. If I reduced the number of concurrent threads from 10 to 5, no request would time out among a total of 1000 requests.
  2. I could not reproduce the behaviour on a bigger machine (m4.xlarge): all 1000 requests completed before timing out, even with 10 concurrent threads.
  3. Surprisingly, I originally hit this issue with only one active user, not 10 concurrent users; I used JMeter only to pin down the problem.

> If there are proxies in between they could also be blocking or timing out the connections. It might help inspecting logs there for connection states.

On the m4.large VM there are no proxies, but on the m4.xlarge there is an Nginx reverse proxy.

Because of an internal requirement we had to wipe the core databases on both servers, and I am now unable to reproduce this issue: there is not a single error in 100 iterations with 50 concurrent threads! I guess I will populate more data and then test again.
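If the Nginx proxy on the m4.xlarge turns out to be involved, its proxy timeouts (60 seconds by default) would be one thing to check. A hedged sketch of the relevant directives; the upstream address and chosen values are assumptions, not from this thread:

```nginx
# Illustrative proxy block in front of CouchDB; adjust to the real setup.
location / {
    proxy_pass http://127.0.0.1:5984;
    proxy_connect_timeout 10s;  # default 60s; fail fast instead of hanging in "initial connection"
    proxy_read_timeout    90s;  # must exceed the slowest legitimate _revs_diff response
    proxy_send_timeout    90s;
}
```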

@chiraganand (Author)

> Could try also tweaking the number of replication workers, maybe try 1 instead of 4
>
> https://docs.couchdb.org/en/stable/config/replicator.html
>
> ```ini
> [replicator]
> worker_processes = 1
> ```
>
> Or try increasing or decreasing the number of connections:
>
> ```ini
> [replicator]
> http_connections = 40
> ```

Can't try these right now because I am unable to reproduce the original issue.
