beam.smp spikes and eats all available CPU #869

Closed
mtabb13 opened this issue Oct 6, 2017 · 5 comments

Comments


mtabb13 commented Oct 6, 2017

This is a production system and I'm a bit frustrated, so sorry if this is in the wrong spot; I don't know where else to look. I have searched high and low for answers, posted in the IRC channel, and found nothing. In a nutshell, I have a 2-node cluster running with CPU at normal levels. I do some inserts and queries, seriously not a heavy load at all. The databases being used have ~39 million rows, some views, and mostly mango indexes. After ~12-20 inserts/queries, the beam.smp process takes off. The insert request that caused the spike never returns and times out. I have no idea where else I can look for clues: the logs are at debug level and verbose, and everything looks pretty normal.

The 2 nodes are very large 1 TiB machines with 4 CPUs/4 cores each, so resources are not an issue at all. Something is fundamentally wrong here, but I don't know where to look. I have tweaked and turned every possible knob there is in the couch config, with no results. If someone can tell me which additional places to look or log, I just need to understand what rock to look under; I have no problem putting in the work to debug.

Version used: 2.0
Operating System and version (desktop or mobile): Ubuntu 16


nickva commented Oct 6, 2017

Keep an eye on the logs for emfile errors; those can mean the node is running out of file descriptors.
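For example, something along these lines can show whether the node is close to the descriptor limit (the log path is an assumption; adjust for your install, and run it as root or the couchdb user):

    # rough sketch: look for emfile in the log, then compare how many descriptors
    # beam.smp currently holds against its limit
    grep -i emfile /var/log/couchdb/couch.log
    pid=$(pgrep -f beam.smp | head -n1)
    ls /proc/"$pid"/fd | wc -l
    grep 'open files' /proc/"$pid"/limits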

Also try increasing max_dbs_open if you see all_dbs_active in the logs.
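A minimal sketch of that change, assuming a 2.x install with the config file at /opt/couchdb/etc/local.ini (both the path and the value are assumptions; restart the node afterwards so the setting takes effect):

    ; local.ini -- raise the number of database files CouchDB will keep open at once
    [couchdb]
    max_dbs_open = 5000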

In general, see if there is anything in the logs around the time this behavior starts.

Look for anything that looks like a stack trace (file names and line numbers) as well.
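For example (the log path is again an assumption), a quick scan like this will surface error reports, crashes, and the warnings mentioned above around the time of the spike:

    # rough sketch: pull out error reports and the warnings mentioned above
    grep -n -i -E 'emfile|all_dbs_active|error|terminating' /var/log/couchdb/couch.log | tail -n 100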


penkeysuresh commented Dec 18, 2017

@nickva I'm also facing the same issue described above. I'm using CouchDB 1.6.1. In my case I'm doing continuous replication back and forth between two CouchDB instances, for ~20K databases of ~10MB each on average. After a certain time CouchDB crashes and the beam process eats up all the available CPU. Restarting the couch process and deleting the replications didn't help. Could you tell me what information I should be looking at, or provide any pointers to solve this issue?

The output of the couchdb.stderr file:

heart_beat_kill_pid = 15996
heart_beat_timeout = 11
heart: Fri Dec 15 22:01:14 2017: Erlang has closed.
Terminated
sh: echo: I/O error
heart: Fri Dec 15 22:01:15 2017: Executed "/usr/bin/couchdb -k" -> 256. Terminating.

heart_beat_kill_pid = 16103
heart_beat_timeout = 11
heart: Fri Dec 15 22:03:52 2017: Erlang has closed.
Terminated
sh: echo: I/O error
heart: Fri Dec 15 22:03:53 2017: Executed "/usr/bin/couchdb -k" -> 256. Terminating.

heart_beat_kill_pid = 1202
heart_beat_timeout = 11

heart_beat_kill_pid = 21749
heart_beat_timeout = 11
Killed
inet_gethost[1711]: WARNING:Unable to write to child process.
inet_gethost[1711]: WARNING:Unable to select on dying child file descriptor, errno = 9.

heart_beat_kill_pid = 8077
heart_beat_timeout = 11
heart: Mon Dec 18 01:32:56 2017: Erlang has closed.
Terminated
sh: echo: I/O error
heart: Mon Dec 18 01:32:57 2017: Executed "/usr/bin/couchdb -k" -> 256. Terminating.

Last lines of output from the couchdb.stdout file:

=ERROR REPORT==== 18-Dec-2017::01:33:31 ===
** Generic server <0.13137.79> terminating 
** Last message in was {'EXIT',<0.13139.79>,
                           {badarg,
                               [{ets,lookup,
                                    [couch_rep_id_to_rep_state,
                                     {"622580889a5576440ff2e9c08454d3b7",
                                      "+continuous+create_target"}],
                                    []},
                                {couch_replicator_manager,rep_state,1,
                                    [{file,"src/couch_replicator_manager.erl"},
                                     {line,617}]},
                                {couch_replicator_manager,
                                    replication_started,1,
                                    [{file,"src/couch_replicator_manager.erl"},
                                     {line,65}]},
                                {couch_replicator,do_init,1,
                                    [{file,"src/couch_replicator.erl"},
                                     {line,329}]},
                                {couch_replicator,init,1,
                                    [{file,"src/couch_replicator.erl"},
                                     {line,231}]},
                                {gen_server,init_it,6,
                                    [{file,"gen_server.erl"},{line,304}]},
                                {proc_lib,init_p_do_apply,3,
                                    [{file,"proc_lib.erl"},{line,239}]}]}}
** When Server state == {state,"https://<uname>:<pwd>@<domain.name>/lg39e96df4-f71a-42dc-96f1-da90bd46d872/",
                               20,
                               [<0.13136.79>],
                               [],
                               {[],[]}}
** Reason for termination == 
** {badarg,
       [{ets,lookup,
            [couch_rep_id_to_rep_state,
             {"622580889a5576440ff2e9c08454d3b7","+continuous+create_target"}],
            []},
        {couch_replicator_manager,rep_state,1,
            [{file,"src/couch_replicator_manager.erl"},{line,617}]},
        {couch_replicator_manager,replication_started,1,
            [{file,"src/couch_replicator_manager.erl"},{line,65}]},
        {couch_replicator,do_init,1,
            [{file,"src/couch_replicator.erl"},{line,329}]},
        {couch_replicator,init,1,
            [{file,"src/couch_replicator.erl"},{line,231}]},
        {gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
[error] [<0.296.0>] Could not open file /var/lib/couchdb/lg37be7786-fad0-4dd2-ae69-603e2c69fc1d.couch: file already exists
[info] [<0.269.0>] 10.15.0.2 - - PUT /lg37be7786-fad0-4dd2-ae69-603e2c69fc1d/ 412
[info] [<0.270.0>] 10.15.0.2 - - HEAD /lg37be7786-fad0-4dd2-ae69-603e2c69fc1d/ 200

@penkeysuresh

Just in case anyone else is facing the same issue: I managed to bring the CPU utilisation back to normal levels by shutting down the couchdb instance that was running as a service,

      sudo service couchdb stop

and then spawning couchdb as a background process with

     sudo couchdb -b

Somehow, if the couchdb instance is started again as a background service, it eats up all the available CPU. I didn't get enough time to debug this (my guess is the upstart script needs to be debugged).

iugo commented Oct 29, 2018

@marceloavf

@penkeysuresh

When I tried sudo couchdb -b I received sudo: couchdb: command not found, even though I installed it with sudo apt install couchdb.

apache locked this issue as resolved and limited conversation to collaborators on Nov 12, 2019