
maintenance_mode on high CPU usage #4633

Open
sergey-safarov opened this issue Jun 4, 2023 · 2 comments
Comments

@sergey-safarov

Summary

I use the CouchDB rpm packages (no Docker, no k8s) on AWS behind an Application Load Balancer (ALB).
The ALB is configured with a health check against the /_up endpoint.
When the CouchDB daemon fails and starts consuming CPU, I want to inform the ALB about the node failure.
Could you add a config parameter like maintenance_mode_on_cpu_load with an integer value (for example 60% per CPU core)? If the load on any CPU core rises above the configured value, then enable maintenance_mode and return a 404 error for the /_up endpoint.

Additional context

This would allow the ALB to detect a CouchDB node failure and reroute traffic to the other nodes in the cluster.
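For reference, maintenance mode can already be toggled by hand through the node config API, and /_up returns 404 while it is set. A minimal sketch (the admin credentials and URL below are placeholders for your own setup):

curl -s -u admin:password -X PUT \
  http://127.0.0.1:5984/_node/_local/_config/couchdb/maintenance_mode \
  -d '"true"'

# /_up now responds with 404, so the ALB health check marks the node unhealthy
curl -i http://127.0.0.1:5984/_up

The requested feature would flip that setting automatically based on CPU load instead of requiring an operator to do it.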

@nickva
Contributor

nickva commented Jun 15, 2023

@sergey-safarov that would work in some cases, but CPU load can be tricky to measure reliably across different environments (Windows, Linux, k8s, VMs).

Often a good proxy for system overload is when internal process message queues start filling up and stay relatively high. That can be detected with the $DB/_node/$nodename/_system endpoint, which shows some of the message_queues and their lengths.

http $DB/_node/_local/_system | jq '.message_queues'
{
  "couch_file": {},
  "couch_db_updater": {},
  "couch_server": 0,
  "index_server": 0,
  ...
}

Specifically, couch_db_updater is the one to watch during document writes, as it can back up. But a backlog there could also indicate a slow disk IO issue rather than a CPU overload issue.
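As a rough illustration (not an official tool), a health-check helper could poll that endpoint and flag the node when the updater queues back up. The URL and threshold below are placeholders, and the .max field assumes the aggregated per-process statistics the endpoint reports when updater processes exist; adjust the jq filter to the shape your version returns:

#!/bin/sh
# Assumed threshold; tune for your workload.
THRESHOLD=100
# Largest couch_db_updater message queue reported by the _system endpoint.
MAX=$(curl -s http://127.0.0.1:5984/_node/_local/_system \
  | jq '.message_queues.couch_db_updater.max // 0')
if [ "$MAX" -gt "$THRESHOLD" ]; then
  echo "couch_db_updater queue max is $MAX (> $THRESHOLD); node looks overloaded"
fi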

The Erlang VM also does some busy waiting, that is, it keeps schedulers spinning a bit longer than necessary, trading CPU usage for latency. That can make it look like the node is running out of CPU capacity, since the OS-visible CPU usage is higher, but it may still be doing fine in that state. You can disable busy waiting with the +sbwt none +sbwtdcpu none +sbwtdio none vm.args settings.
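In vm.args that would look something like the following (a sketch; your file will already contain other settings such as -name and -setcookie):

# Reduce scheduler busy waiting: lower idle CPU at the cost of a bit of latency
+sbwt none
+sbwtdcpu none
+sbwtdio none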

However, in general, it could be dangerous to automatically put nodes in maintenance mode. There is a good chance that whatever is causing the problem on one node will start happening on the other nodes as well, especially once they also have to handle the API requests from the nodes already in maintenance mode. That could lead to a cascading failure until none of the nodes can accept traffic.

@sergey-safarov
Author

Thanks, @nickva, for the clarification.
I will collect monitoring data (CPU and IO load) when the issue happens and will provide the /_node/_local/_system information.
Our installation uses CouchDB 2.3.1. If it helps to understand the issue, I can also provide an error log.
This does not happen often, so it may take some time to reproduce again.
