
POST _revs_diff request stuck without timing out #4373

Open
chiraganand opened this issue Jan 12, 2023 · 5 comments

@chiraganand

Hi, I have a setup consisting of CouchDB v3.2.2, and I am using PouchDB v7.3.1 in my mobile app.

I am noticing that _revs_diff requests often get stuck and take several seconds, 15 seconds or more. See screenshots.

Within the same session I can see other _revs_diff requests that respond in less than a second, and from another device they are fine. I am a bit clueless because I am not able to create a reproducible example; this happens occasionally while I keep PUTting to the app DB.

In one case the request got stuck in "Initial connection" for 18 minutes, and until it timed out no syncing happened (maybe this should be filed in the PouchDB repo).

What could be going on?

Thanks!

(Screenshots attached: Screenshot_20230112_182500, Screenshot_20230112_184751)
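For reference, the request in question is a POST to `/{db}/_revs_diff` with a body mapping doc ids to revision lists; the server answers with the revisions it is missing. A minimal sketch, assuming a local CouchDB at `127.0.0.1:5984` (the helper names, db name, and rev ids here are illustrative, not from this thread):

```python
import json
from urllib import request

COUCH_URL = "http://127.0.0.1:5984"  # assumption: local, unauthenticated CouchDB

def revs_diff_payload(revs_by_doc):
    """Build the _revs_diff request body: {doc_id: [rev, ...], ...}."""
    return json.dumps(revs_by_doc).encode("utf-8")

def missing_revs(response_body):
    """Extract the revisions the server reports as missing, per doc id."""
    parsed = json.loads(response_body)
    return {doc_id: info.get("missing", []) for doc_id, info in parsed.items()}

def post_revs_diff(db, revs_by_doc, timeout=15):
    """POST /{db}/_revs_diff with an explicit client-side timeout."""
    req = request.Request(
        f"{COUCH_URL}/{db}/_revs_diff",
        data=revs_diff_payload(revs_by_doc),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=timeout) as resp:
        return missing_revs(resp.read())
```

Passing an explicit `timeout` at least turns an indefinitely stuck request into a visible error on the client side.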

@nickva (Contributor) commented Jan 12, 2023

Perhaps we're hitting a limit somewhere: disk usage, CPU, or network. What about other requests: is it just _revs_diff, or do _bulk_docs requests also do this sometimes?

See if there is a log entry on the server with that timestamp. Is there a difference between the payloads (bodies) of the succeeding requests and the failing ones?

Also, 3.3.1 has a few _revs_diff optimizations; if you have a chance, try upgrading on a test instance to see if you still see the same issue.

@chiraganand (Author)

> Perhaps we're hitting a limit somewhere - disk usage, CPU, network. What about other requests, just _revs_diffs or do _bulk_docs also do this sometimes?

I checked the _revs_diff endpoint by running JMeter (10 threads × 100 iterations), and it hangs after some 700 requests. The requests then time out after 120 seconds or so, before even making a connection. Yet I could not see any load increase on the server: no CPU wait time, enough physical memory available, and enough bandwidth available.

> See if there is a log on the server with that timestamp. Is there a difference between payloads (bodies) between the succeeding requests and failing ones?

No, I used the same request body every time.

> Also, 3.3.1 has a few _revs_diff optimizations, if you have a chance could try experimenting upgrading on a test instance to see if you still see the same issue.

Yes, I will try this. I just want to gather some more evidence before trying the new Couch version.
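The JMeter setup described above (N threads × M iterations, recording per-request latency and failures) can be sketched in plain Python; `run_load_test` and its parameters are hypothetical names, and `make_request` stands in for whatever callable performs one _revs_diff POST and raises on timeout:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_load_test(make_request, threads=10, iterations=100):
    """Fire threads*iterations requests concurrently; return (latencies, error_count)."""
    latencies, errors = [], 0

    def one_request(_):
        # Time a single request; capture the exception instead of aborting the run.
        start = time.monotonic()
        try:
            make_request()
            return time.monotonic() - start, None
        except Exception as exc:
            return time.monotonic() - start, exc

    with ThreadPoolExecutor(max_workers=threads) as pool:
        for elapsed, exc in pool.map(one_request, range(threads * iterations)):
            if exc is None:
                latencies.append(elapsed)
            else:
                errors += 1
    return latencies, errors
```

Plotting or binning the latency list makes the "fine until ~700 requests, then stuck" pattern easy to spot.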

@nickva (Contributor) commented Jan 18, 2023

It's strange that it times out before even making a connection.

You could also try tweaking the number of replication workers, maybe try 1 instead of 4:

https://docs.couchdb.org/en/stable/config/replicator.html

```ini
[replicator]
worker_processes = 1
```

Or try increasing or decreasing the number of connections:

```ini
[replicator]
http_connections = 40
```

If there are proxies in between, they could also be blocking or timing out the connections. It might help to inspect the logs there for connection states.

@chiraganand (Author)

I tried v3.3.1 and was able to reproduce the error, after exactly 800 requests, on an EC2 m4.large VM.

Some more observations:

  1. If I reduced the number of concurrent threads from 10 to 5, no request would time out among a total of 1000 requests.
  2. I could not reproduce the behaviour on a bigger machine (m4.xlarge): all 1000 requests completed before timing out, even with 10 concurrent threads.
  3. Surprisingly, I originally hit this issue with only one active user, not 10 concurrent users; I used JMeter only to pin down the problem.

> If there are proxies in between they could also be blocking or timing out the connections. It might help inspecting logs there for connection states.

On the m4.large VM there are no proxies, but on the m4.xlarge there is an Nginx reverse proxy.

Because of an internal requirement we had to wipe the core databases on both servers, and I am now unable to reproduce this issue: there is not a single error in 100 iterations with 50 concurrent threads! I guess I will populate more data and then test again.
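If the Nginx proxy on the m4.xlarge turns out to be involved, its proxy timeouts (60 seconds by default) would be one thing to check. A hedged sketch of the relevant directives; the upstream address and chosen values are assumptions, not from this thread:

```nginx
# Illustrative proxy block in front of CouchDB; adjust to the real setup.
location / {
    proxy_pass http://127.0.0.1:5984;
    proxy_connect_timeout 10s;  # default 60s; fail fast instead of hanging in "initial connection"
    proxy_read_timeout    90s;  # must exceed the slowest legitimate _revs_diff response
    proxy_send_timeout    90s;
}
```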

@chiraganand (Author)

> Could try also tweaking the number of replication workers, maybe try 1 instead of 4
>
> https://docs.couchdb.org/en/stable/config/replicator.html
>
> ```ini
> [replicator]
> worker_processes = 1
> ```
>
> Or try increasing or decreasing the number of connections:
>
> ```ini
> [replicator]
> http_connections = 40
> ```

Can't try these right now because I am unable to reproduce the original issue.
