Description

This is a bit convoluted, so bear with me.

One deployment of CouchDB in a db-per-user scenario has many more filtered continuous replication jobs than max_jobs, and increasing max_jobs to match is impractical (CPU limits). As the scheduler rotates through jobs, we are seeing the number of TIME_WAIT sockets on the nodes increase drastically.
Expected behaviour
When the replication scheduler rotates off (i.e. kills) a continuous replication job to round-robin to another waiting replication job, the socket should be closed on the client side before killing replication, triggering proper socket cleanup.
Current behaviour
When the replication scheduler rotates off (i.e. kills) a continuous replication job to round-robin to another waiting replication job, the socket is not closed correctly, leaving a client socket in TIME_WAIT to expire.
Steps to reproduce
First, set up a test environment. In one window:
dev/run -n 1 --with-admin-party-please
In another window, set up a short script to monitor the number of TIME_WAIT TCP connections:
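The script itself was not captured in this report; a minimal single-shot sketch (port and filtering assumed), which you can run under `watch -n1` for continuous monitoring:

```shell
#!/bin/sh
# Count sockets involving port 15984 that are stuck in TIME_WAIT.
# Run under `watch -n1` (or in a loop) to monitor continuously.
count_time_wait() {
    netstat -an 2>/dev/null | grep 15984 | grep TIME_WAIT | wc -l
}
count_time_wait
```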
Leave off the | wc -l if you want to see a full list.
Now, in a third window, prep 6 test databases:
curl -X PUT localhost:15984/abc
curl -X PUT localhost:15984/one
curl -X PUT localhost:15984/two
curl -X PUT localhost:15984/three
curl -X PUT localhost:15984/four
curl -X PUT localhost:15984/five
Create continuous replication documents to replicate from the shared abc database to each of the "db-per-user" databases:
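The replication documents were not captured here; a sketch of one such document (document ID, URLs, and field values assumed — repeat with targets `two` through `five`; the production scenario would also carry a "filter" field):

```shell
# One replication document for abc -> one (ID and body assumed).
# The -sf/|| guard keeps this runnable without a live dev node.
BODY='{"source":"http://localhost:15984/abc","target":"http://localhost:15984/one","continuous":true}'
curl -sf -X PUT localhost:15984/_replicator/abc_to_one \
     -H 'Content-Type: application/json' -d "$BODY" \
  || echo "skipped: no CouchDB on 15984"
```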
Finally, force the replicator to churn by adjusting the replicator max jobs, interval, and startup jitter to minimal values:
curl -X PUT localhost:15984/_node/_local/_config/replicator/interval -d '"1000"'
curl -X PUT localhost:15984/_node/_local/_config/replicator/max_jobs -d '"1"'
curl -X PUT localhost:15984/_node/_local/_config/replicator/startup_jitter -d '"1"'
Now, the replicator scheduler will only have a single job running at a time, and will rotate through jobs every second, with 1ms of jitter on starting each job.
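To watch the rotation happen, you can poll the `_scheduler/jobs` endpoint (guarded so the snippet also runs without a live node):

```shell
# List the scheduler's current jobs; with max_jobs=1, only one job
# should be in the "running" state at any moment.
JOBS=$(curl -sf localhost:15984/_scheduler/jobs \
       || echo '{"error":"no CouchDB on 15984"}')
echo "$JOBS"
```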
Sitting at idle, with 0 documents in the database, this is currently showing a steady state of ~50-60 TIME_WAIT sockets. Looking at the output of netstat -an | grep 15984, all of these sockets show port 15984 as the destination. Example:
tcp 0 0 127.0.0.1:46031 127.0.0.1:15984 TIME_WAIT
All of the TIME_WAIT sockets are client-side; that is, I never see 127.0.0.1:15984 in the source column of the netstat output unless I kill CouchDB (obviously).
Some interesting things to try at this point:
In a separate window, perturb the abc database by adding a single document:
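The document body was not captured above; any small write will do (the doc ID and body below are arbitrary, and the request is guarded so the snippet runs without a live node):

```shell
# Write one arbitrary document into abc; each continuous replication
# will pick it up on its next scheduled run.
RESP=$(curl -sf -X PUT localhost:15984/abc/perturb-1 \
       -H 'Content-Type: application/json' -d '{"perturb": true}' \
       || echo '{"error":"no CouchDB on 15984"}')
echo "$RESP"
```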
Apply this patch to couch_replicator to force it to treat all replications as one-shot, rather than continuous (thanks @rnewson): https://www.irccloud.com/pastebin/dFGMLgqm/

What I'm seeing locally for a TIME_WAIT socket count is, with either approach:
Unpatched (CouchDB master)
Patched (Forced one-shot replications)