Too many open files #3686

Open · janko opened this issue Dec 21, 2017 · 18 comments

Labels
stage/accepted (Confirmed, and intend to work on. No timeline commitment though.) · stage/needs-verification (Issue needs verifying it still exists) · theme/resource-utilization · type/bug

Comments


janko commented Dec 21, 2017

Nomad version

$ nomad -v
Nomad v0.7.0

Operating system and Environment details

$ uname -a
Linux iot-useast1-prod-nomad-server-1 3.13.0-116-generic #163-Ubuntu SMP Fri Mar 31 14:13:22 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 14.04.5 LTS
Release:        14.04
Codename:       trusty

We have 5 Nomad server nodes and 113 Nomad client nodes on AWS EC2.

Issue

One of our Nomad server nodes ran out of file descriptors, and now the cluster is struggling to elect a leader. This is the third time it has happened. It was previously happening on version 0.5.6, and it is still happening on 0.7.0 after we upgraded.

We can see from the lsof.log below that the vast majority (about 75%) of open file descriptors point to our nomad-client-admin-4 node, which doesn't run more allocations than the other nomad-client-admin-* nodes. I included the log for nomad-client-admin-4 as well; the only thing I can see there is a nomad_exporter job that is being restarted frequently, though I don't know whether that might be the cause.
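
For reference, a rough way to see which remote host a Nomad agent's established TCP connections concentrate on, straight from lsof. This is only a sketch; it assumes lsof and pgrep are installed and that the agent process is named nomad:

$ sudo lsof -nP -iTCP -a -p "$(pgrep -xo nomad)" \
    | awk '$NF == "(ESTABLISHED)" {print $(NF-1)}' \
    | cut -d'>' -f2 | cut -d: -f1 \
    | sort | uniq -c | sort -rn | head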

Reproduction steps

N/A

Nomad Server logs (if appropriate)

There are a lot of "too many open files" log lines now, so I tried to extract something relevant:

Earliest errors we have
    2017/12/16 06:47:59 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:47:59 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:47:59 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:47:59 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:47:59 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:47:59 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:48:09 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:48:09 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:48:09 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:48:10 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:48:10 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:48:10 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:48:19 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:48:19 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:48:19 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:48:20 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:48:20 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:48:20 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:48:29 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:48:29 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:48:29 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:48:30 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:48:30 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:48:30 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:48:39 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:48:39 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:48:39 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:48:40 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:48:40 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:48:40 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:48:50 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:48:50 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:48:50 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:48:51 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:48:51 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:48:51 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:49:00 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:49:00 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:49:00 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:49:02 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:49:02 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:49:02 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:49:12 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:49:12 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:49:12 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:49:13 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:49:13 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:49:13 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:49:24 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:49:24 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:49:24 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:49:26 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:49:26 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:49:26 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:49:39 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:49:39 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:49:39 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:49:41 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:49:41 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:49:41 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:50:00 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:50:00 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:50:00 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:50:01 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:50:01 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:50:02 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:50:20 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:50:20 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:50:20 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:50:22 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:50:22 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:50:22 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:50:40 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:50:40 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:50:40 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:50:42 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:50:42 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:50:42 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:50:55.527388 [ERR] worker: failed to dequeue evaluation: eval broker disabled
    2017/12/16 06:51:00 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:51:00 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:51:00 [ERR] raft: Failed to AppendEntries to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:51:03 [ERR] raft: Failed to heartbeat to 10.0.31.85:4647: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:51:03 [ERR] raft: Failed to heartbeat to 10.0.31.133:4647: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:51:03 [ERR] raft: Failed to heartbeat to 10.0.31.89:4647: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:53:31.974209 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/16 06:53:31.974447 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/16 06:53:34.493762 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/16 06:53:34.518940 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/16 06:53:44 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.31.133:4647 10.0.31.133:4647}: dial tcp 10.0.31.133:4647: i/o timeout
    2017/12/16 06:53:44 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.31.85:4647 10.0.31.85:4647}: dial tcp 10.0.31.85:4647: i/o timeout
    2017/12/16 06:53:44 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.31.89:4647 10.0.31.89:4647}: dial tcp 10.0.31.89:4647: i/o timeout
    2017/12/16 06:55:43 [ERR] raft-net: Failed to flush response: write tcp 10.0.11.172:4647->10.0.21.37:33314: write: broken pipe
    2017/12/16 06:55:44.008619 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/16 06:55:44.802281 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/16 06:55:45.124885 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/16 06:55:45.124895 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/16 06:56:48.406489 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/16 06:56:48.406509 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/16 06:56:52.407520 [ERR] worker: failed to dequeue evaluation: rpc error: rpc error: eval broker disabled

Today's errors: nomad.log

sudo lsof output: lsof.log

Nomad Client logs (if appropriate)

nomad-client-admin-4 log: nomad-admin-client-4.log

Job file (if appropriate)

prometheus_exporters.hcl
job "prometheus_exporters" {
  type = "service"
  datacenters = ["useast1"]

  constraint {
    attribute = "${node.class}"
    value = "admin"
  }

  group "consul_exporter" {
    count = 1

    task "consul_exporter" {
      driver = "docker"
      config {
        image = "registry.service.m2x:5000/attm2x/consul-exporter:673081d"
        args = ["-consul.server=${attr.unique.network.ip-address}:8500"]
        port_map { http = 9107 }
      }
      service {
        name = "prometheus-consul-exporter"
        port = "http"
        check {
          type = "http"
          path = "/"
          timeout = "5s"
          interval = "30s"
        }
      }
      resources {
        memory = 64
        network {
          mbits = 1
          port "http" {}
        }
      }
    }
  }

  group "es_iot_exporter" {
    count = 1

    task "es_iot_exporter" {
      driver = "docker"
      config {
        image = "registry.service.m2x:5000/attm2x/elasticsearch-exporter:2dd77e1"
        args = [
          "-es.uri=http:https://244.es-iot-client.service.m2x:9200",
          "-es.all"
        ]
        port_map { http = 9108 }
      }
      service {
        name = "prometheus-es-iot-exporter"
        port = "http"
        check {
          type = "http"
          path = "/"
          timeout = "5s"
          interval = "30s"
        }
      }
      resources {
        memory = 64
        network {
          mbits = 1
          port "http" {}
        }
      }
    }
  }

  group "nomad_exporter" {
    count = 1

    task "nomad_exporter" {
      driver = "docker"
      config {
        image = "registry.service.m2x:5000/attm2x/nomad-exporter:e65b05d"
        command = "nomad-exporter"
        args = [
          "-nomad.server=http:https://nomad.service.m2x:4646"
        ]
        port_map { http = 9172 }
      }
      service {
        name = "prometheus-nomad-exporter"
        port = "http"
        check {
          type = "http"
          path = "/"
          timeout = "5s"
          interval = "30s"
        }
      }
      resources {
        memory = 64
        network {
          mbits = 1
          port "http" {}
        }
      }
    }
  }
}
schmichael (Member) commented:

Thanks for the thorough bug report and logs @janko-m!

Does restarting the Nomad client agent process that is making all of the connections fix the problem? (By default, restarting the agent does not affect running allocations/tasks.)

There are a few other things that would help us debug this:

lsof on client node

Just to be absolutely sure it's the Nomad client node process making too many connections, could you post the output of lsof from the client node?
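
A minimal sketch of how that could be captured, assuming lsof and pgrep are available and the agent process is named nomad:

$ sudo lsof -nP -p "$(pgrep -xo nomad)" > nomad-client-lsof.log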

goroutine dump

Is it possible to set enable_debug = true on your client nodes? If we're able to get a goroutine dump from the client node that is opening the large number of connections, we should be able to get to the bottom of this quickly.

If you're able to enable that on client nodes and the problem occurs again, please attach the output of http://localhost:4646/debug/pprof/goroutine?debug=2 (where localhost is the client node making so many connections).
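
A minimal sketch for capturing that dump on the affected client, assuming curl is available and the HTTP API is listening on the default port 4646:

$ curl -s "http://localhost:4646/debug/pprof/goroutine?debug=2" > goroutine-dump.txt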

DEBUG log level

Lowest priority

I can't think of any debug log lines that would be particularly useful, so this is the lowest priority for me. However, if you're able to set log_level = "DEBUG" on the problematic client, that would definitely give us more information to work with.

Thanks again and sorry for the particularly nasty issue you've hit!


memelet commented Mar 12, 2018

No details just yet, but we are having a large production outage right now and this is one of the errors we are getting.

jippi (Contributor) commented Mar 12, 2018

If you do not include ulimit -n 65536 (or a similar higher-than-1024 value) your nomad cluster will have a bad time, guaranteed.

Every time I set up a new cluster and forget about this setting, I eventually get random client drops, crashy clusters, and all sorts of craziness.
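
For reference, a rough sketch for checking and raising the limit. It assumes the agent process is named nomad and that pgrep is available; the systemd directive only applies if Nomad is managed by systemd:

# Effective limit and current usage of the running agent
$ sudo grep -i 'open files' /proc/"$(pgrep -xo nomad)"/limits
$ sudo ls /proc/"$(pgrep -xo nomad)"/fd | wc -l

# To raise it, run "ulimit -n 65536" in the init/start script before launching
# the agent, or set LimitNOFILE=65536 under [Service] in a systemd unit/drop-in.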


memelet commented Mar 12, 2018

We set ulimit to max during provisioning.


memelet commented Mar 12, 2018

Well, I take that back. Seems like that got removed from the playbook.

jippi (Contributor) commented Mar 12, 2018

@memelet Putting it back will 99.9% fix your cluster instability :)

janko (Author) commented Mar 12, 2018

@schmichael Unfortunately I can't get any more information about the nodes in that state, as we had to cycle those nodes out of the cluster.

I think what caused this was that one of our jobs was frequently failing and restarting due to an invalid state. I think this caused Nomad to accumulate temporary files/directories and somehow retain all those connections from the Nomad server nodes.

Since we stopped that frequently restarting job, we haven't had this issue on our main cluster. On our staging cluster a similar thing happened: I noticed the Nomad client node accumulating a lot of temporary files/directories, and there we also identified a job that was restarting frequently.

jippi (Contributor) commented Mar 12, 2018

@janko-m What was/is your ulimit value for the Nomad clients/servers?

janko (Author) commented Mar 12, 2018

@jippi Unlimited 🙈


memelet commented Mar 12, 2018

@jippi So far it looks very good.

We still get lots of "dropping update to alloc ..." errors, but I think that is unrelated. Thanks!

jippi (Contributor) commented Mar 12, 2018

@memelet can you gist your nomad server config? :)


memelet commented Mar 12, 2018

@jippi

base.hcl:

bind_addr = "0.0.0.0" # the default

data_dir  = "/data/nomad"

advertise {
  http = "10.0.17.125"
  rpc  = "10.0.17.125"
  serf = "10.0.17.125"
}

consul {
  address = "127.0.0.1:8500"
}

datacenter = "production"

enable_debug = true

http_api_response_headers {
  env = "production"
}

log_level = "INFO"


name = "p-mesos-master-2"  # you can see our heritage ;-)

region = "us-east-1"

agent.hcl:

server {
    enabled = true
    bootstrap_expect = 5
    data_dir = "/var/log/nomad"
    rejoin_after_leave = true
}

jippi (Contributor) commented Mar 13, 2018

@memelet Okay, that config seems fine to me. Do you have GOMAXPROCS set in your environment when running Nomad?


memelet commented Mar 13, 2018

@jippi Yes, in the startup script we have export GOMAXPROCS='nproc'.

kurtwheeler commented:

If you do not include ulimit -n 65536 (or a similar higher-than-1024 value) your nomad cluster will have a bad time, guaranteed.

If this is such a guaranteed issue, could it be included in docs somewhere? Maybe on this page: https://www.nomadproject.io/guides/cluster/requirements.html?

nkissebe commented:

@schmichael I don't know about the OP, but for the second time I've had a server instance go into a logging loop

2018/11/27 13:35:35 [ERR] memberlist: Error accepting TCP connection: accept tcp [::]:4648: accept4: too many open files

and

2018/11/27 13:35:35.820153 [ERR] nomad.rpc: failed to accept RPC conn: accept tcp [::]:4647: accept4: too many open files

And it proceeds to fill up the logging directory VERY rapidly; it's clearly in a very tight loop, logging thousands of times a second. It is essentially out of control. Upping the open file limit seems to resolve (delay?) the issue. The excessive logging is a bug that needs to be fixed; it is unacceptable in its current state.

schmichael (Member) commented:

@nkissebe It looks like this log line is being hit in a tight loop:

r.logger.Error("failed to accept RPC conn", "error", err)

We'll get it fixed for Nomad 0.9.0.

ketzacoatl (Contributor) commented:

I've seen the "too many open files" issue on servers running in production.

In regard to this comment:

If you do not include ulimit -n 65536 (or a similar higher-than-1024 value) your nomad cluster will have a bad time, guaranteed.

What is a reasonable limit? Should this depend on host/node side?

bdossantos added a commit to bdossantos/ansible-nomad that referenced this issue Oct 28, 2019
tgross added the theme/resource-utilization and stage/accepted labels and removed the stage/needs-investigation label on Dec 16, 2020
tgross added this to Needs Roadmapping in Nomad - Community Issues Triage on Feb 12, 2021
tgross added the stage/needs-verification label on Mar 4, 2021