beam.smp spikes and eats all available CPU #869

Closed
mtabb13 opened this issue Oct 6, 2017 · 5 comments

Comments


mtabb13 commented Oct 6, 2017

This is a production system and I'm a bit frustrated, so sorry if this is in the wrong spot; I don't know where else to look. I have searched high and low for answers, posted in the IRC channel, and found nothing. In a nutshell, I have a 2-node cluster running with CPU at normal levels. I do some inserts and queries, seriously not a heavy load at all. The databases being used have ~39 million rows, some views, and mostly mango indexes. After ~12-20 inserts/queries, the beam.smp process takes off. The insert request that caused the spike never returns and times out. I have no idea where else I can look for clues: the logs are at debug level and verbose, and everything looks pretty normal.

The 2 nodes are very large 1 TiB machines with 4 CPUs/4 cores each, so resources are not an issue at all. Something is fundamentally wrong here, but I don't know where to look. I have tweaked and turned every possible knob there is in the couch config, with no results. If someone can tell me which additional places to look or log, I just need to understand what rock to look under; I have no problem putting in the work to debug.

Version used: 2.0
Operating System and version (desktop or mobile): Ubuntu 16


nickva commented Oct 6, 2017

Keep an eye on the logs for emfile errors; those can mean the node is running out of file descriptors.
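For example, something along these lines can show whether the node is close to the descriptor limit (the log path is an assumption; adjust for your install, and run it as root or the couchdb user):

    # rough sketch: look for emfile in the log, then compare how many descriptors
    # beam.smp currently holds against its limit
    grep -i emfile /var/log/couchdb/couch.log
    pid=$(pgrep -f beam.smp | head -n1)
    ls /proc/"$pid"/fd | wc -l
    grep 'open files' /proc/"$pid"/limits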

Also try increasing max_dbs_open if you see all_dbs_active in the logs.
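A minimal sketch of that change, assuming a 2.x install with the config file at /opt/couchdb/etc/local.ini (both the path and the value are assumptions; restart the node afterwards so the setting takes effect):

    ; local.ini -- raise the number of database files CouchDB will keep open at once
    [couchdb]
    max_dbs_open = 5000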

In general, see if there is anything in the logs around the time this behavior starts.

Look for anything that looks like a stack trace (file names and line numbers) as well.
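For example (the log path is again an assumption), a quick scan like this will surface error reports, crashes, and the warnings mentioned above around the time of the spike:

    # rough sketch: pull out error reports and the warnings mentioned above
    grep -n -i -E 'emfile|all_dbs_active|error|terminating' /var/log/couchdb/couch.log | tail -n 100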


penkeysuresh commented Dec 18, 2017

@nickva I'm also facing the same issue described above. I'm using CouchDB 1.6.1. In my case I'm doing continuous replication back and forth between two CouchDB instances, for ~20K databases of ~10MB each on average. After a certain time CouchDB crashes and the beam process eats up all the available CPU. Restarting the couch process and deleting the replications didn't help. Could you tell me what information I should be looking at, or provide any pointers to solve this issue?

The output of the couchdb.stderr file:

heart_beat_kill_pid = 15996
heart_beat_timeout = 11
heart: Fri Dec 15 22:01:14 2017: Erlang has closed.
Terminated
sh: echo: I/O error
heart: Fri Dec 15 22:01:15 2017: Executed "/usr/bin/couchdb -k" -> 256. Terminating.

heart_beat_kill_pid = 16103
heart_beat_timeout = 11
heart: Fri Dec 15 22:03:52 2017: Erlang has closed.
Terminated
sh: echo: I/O error
heart: Fri Dec 15 22:03:53 2017: Executed "/usr/bin/couchdb -k" -> 256. Terminating.

heart_beat_kill_pid = 1202
heart_beat_timeout = 11

heart_beat_kill_pid = 21749
heart_beat_timeout = 11
Killed
inet_gethost[1711]: WARNING:Unable to write to child process.
inet_gethost[1711]: WARNING:Unable to select on dying child file descriptor, errno = 9.

heart_beat_kill_pid = 8077
heart_beat_timeout = 11
heart: Mon Dec 18 01:32:56 2017: Erlang has closed.
Terminated
sh: echo: I/O error
heart: Mon Dec 18 01:32:57 2017: Executed "/usr/bin/couchdb -k" -> 256. Terminating.

Last lines of output from the couchdb.stdout file:

=ERROR REPORT==== 18-Dec-2017::01:33:31 ===
** Generic server <0.13137.79> terminating 
** Last message in was {'EXIT',<0.13139.79>,
                           {badarg,
                               [{ets,lookup,
                                    [couch_rep_id_to_rep_state,
                                     {"622580889a5576440ff2e9c08454d3b7",
                                      "+continuous+create_target"}],
                                    []},
                                {couch_replicator_manager,rep_state,1,
                                    [{file,"src/couch_replicator_manager.erl"},
                                     {line,617}]},
                                {couch_replicator_manager,
                                    replication_started,1,
                                    [{file,"src/couch_replicator_manager.erl"},
                                     {line,65}]},
                                {couch_replicator,do_init,1,
                                    [{file,"src/couch_replicator.erl"},
                                     {line,329}]},
                                {couch_replicator,init,1,
                                    [{file,"src/couch_replicator.erl"},
                                     {line,231}]},
                                {gen_server,init_it,6,
                                    [{file,"gen_server.erl"},{line,304}]},
                                {proc_lib,init_p_do_apply,3,
                                    [{file,"proc_lib.erl"},{line,239}]}]}}
** When Server state == {state,"https://<uname>:<pwd>@<domain.name>/lg39e96df4-f71a-42dc-96f1-da90bd46d872/",
                               20,
                               [<0.13136.79>],
                               [],
                               {[],[]}}
** Reason for termination == 
** {badarg,
       [{ets,lookup,
            [couch_rep_id_to_rep_state,
             {"622580889a5576440ff2e9c08454d3b7","+continuous+create_target"}],
            []},
        {couch_replicator_manager,rep_state,1,
            [{file,"src/couch_replicator_manager.erl"},{line,617}]},
        {couch_replicator_manager,replication_started,1,
            [{file,"src/couch_replicator_manager.erl"},{line,65}]},
        {couch_replicator,do_init,1,
            [{file,"src/couch_replicator.erl"},{line,329}]},
        {couch_replicator,init,1,
            [{file,"src/couch_replicator.erl"},{line,231}]},
        {gen_server,init_it,6,[{file,"gen_server.erl"},{line,304}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}
[error] [<0.296.0>] Could not open file /var/lib/couchdb/lg37be7786-fad0-4dd2-ae69-603e2c69fc1d.couch: file already exists
[info] [<0.269.0>] 10.15.0.2 - - PUT /lg37be7786-fad0-4dd2-ae69-603e2c69fc1d/ 412
[info] [<0.270.0>] 10.15.0.2 - - HEAD /lg37be7786-fad0-4dd2-ae69-603e2c69fc1d/ 200

@penkeysuresh

Just in case anyone else is facing the same issue: I managed to bring the CPU utilisation back to normal levels by shutting down the couchdb instance that was running as a service,

      sudo service couchdb stop

and then spawning couchdb as a background process with

     sudo couchdb -b

Somehow, if the couchdb instance is started again as a background service, it eats up all the available CPU. I didn't get enough time to debug this (my guess is the upstart script needs to be debugged).

iugo commented Oct 29, 2018

@marceloavf

@penkeysuresh

When I tried sudo couchdb -b I received sudo: couchdb: command not found, even though I installed it with sudo apt install couchdb.

apache locked this issue as resolved and limited conversation to collaborators on Nov 12, 2019