maintenance_mode on high CPU usage #4633
Comments
@sergey-safarov that would work in some cases, but CPU load might be tricky to figure out across environments (Windows, Linux, k8s, VMs). Often a proxy for system overload is when internal process message queues start filling up and stay relatively high. That can be detected by inspecting the message queue lengths of the Erlang processes on the node.
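As an illustrative sketch (not part of the original comment), here is one way to spot backed-up mailboxes from a `remsh` shell attached to the node; the 1000-message threshold is an arbitrary example:

```erlang
%% List processes whose mailbox holds more than 1000 queued messages.
%% process_info/2 returns 'undefined' for dead processes, which the
%% pattern match in the generator silently skips.
[{Pid, Len} || Pid <- erlang:processes(),
               {message_queue_len, Len} <- [erlang:process_info(Pid, message_queue_len)],
               Len > 1000].
```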
Specifically, the Erlang VM also does some busy waiting; that is, it keeps schedulers spinning a bit longer than necessary to trade CPU usage for latency. That can make it look like the node is running out of CPU capacity, since the OS-visible CPU usage will be higher, but the node may still be doing fine in that state. You can disable busy waiting with the scheduler busy-wait emulator flags (see the sketch below).

However, in general it could be dangerous to automatically put nodes in maintenance mode. There is a good chance that whatever is causing the overload on one node will start happening on other nodes as well, especially if they now also have to process API requests on behalf of the nodes already put in maintenance mode. That could lead to a cascading failure until none of the nodes can accept traffic.
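A minimal sketch, assuming the flags go into CouchDB's `vm.args` file; `+sbwt` is a standard Erlang emulator flag, and the dirty-scheduler variants require a reasonably recent OTP:

```
# vm.args: disable scheduler busy waiting
# (trades a bit of latency for lower idle CPU usage)
+sbwt none
+sbwtdcpu none
+sbwtdio none
```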
Thanks, @nickva, for the clarification.
Summary
I use CouchDB RPM packages (no Docker, no k8s) on AWS with an Application Load Balancer (ALB). The ALB health check is configured against the `/_up` endpoint. When the CouchDB daemon fails and starts consuming CPU, I want to inform the ALB about the node failure.
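For reference, `/_up` behaves roughly as follows; the exact status lines and response bodies may vary by CouchDB version:

```sh
# Healthy node: HTTP 200
$ curl -i http://127.0.0.1:5984/_up
HTTP/1.1 200 OK
{"status":"ok"}

# Node in maintenance mode: HTTP 404, so the ALB marks it unhealthy
$ curl -i http://127.0.0.1:5984/_up
HTTP/1.1 404 Object Not Found
{"status":"maintenance_mode"}
```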
Could you add a config param like `maintenance_mode_on_cpu_load` with an integer value (for example, 60% per CPU core)? If the CPU load rises above the configured value for one of the CPU cores, then enable `maintenance_mode` and return a `404` error for the `/_up` endpoint. A sketch of what such a setting might look like follows.
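To make the idea concrete: `maintenance_mode` is an existing setting in CouchDB's ini configuration, while the option proposed in this issue is hypothetical and shown only as a sketch:

```ini
[couchdb]
; existing setting; when true, /_up returns 404
maintenance_mode = false

; hypothetical option proposed in this issue (does not exist today)
;maintenance_mode_on_cpu_load = 60
```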
Additional context
This would allow the ALB to detect a CouchDB node failure and reroute traffic to the other nodes in the cluster.