Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too many open files when reading TLS certificates #2135

Closed
zackangelo opened this issue Oct 1, 2018 · 5 comments
Closed

Too many open files when reading TLS certificates #2135

zackangelo opened this issue Oct 1, 2018 · 5 comments
Assignees

Comments

@zackangelo
Copy link
Contributor

Issue Type: Bug

All channel operations begin failing after an undeterminable period of time because they are unable to read TLS certificates for a router. This appears to be due to linkerd hitting the FD limit (4096), which would indicate a descriptor leak somewhere.

Relevant stack trace:

 WARN 1001 14:56:05.150 CDT finagle/netty4/boss-1: An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.
  io.netty.channel.unix.Errors$NativeIoException: accept(..) failed: Too many open files
  at io.netty.channel.unix.Errors.newIOException(Errors.java:117)
  at io.netty.channel.unix.Socket.accept(Socket.java:314)
  at io.netty.channel.epoll.AbstractEpollServerChannel$EpollServerSocketUnsafe.epollInReady(AbstractEpollServerChannel.java:112)
  at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe$1.run(AbstractEpollChannel.java:412)
  at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
  at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:309)
  at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
  at java.lang.Thread.run(Thread.java:813)
  E 1001 14:56:05.152 CDT THREAD29257 TraceId:665942618bc1d03e: service failure: Failure(request cancelled. Remote Info: Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /10.177.82.223:28505, Downstream label: %/io.l5d.localhost/#/io.l5d.consul/.local/lightstep-collector, Trace Id: 5146b6bbb501b9a5.943008562da89e1f<:a3a6aa83008899bf, flags=0x03) with RemoteInfo -> Upstream Address: Not Available, Upstream id: Not Available, Downstream Address: /10.171.25.200:4141, Downstream label: %/io.l5d.port/4141/#/io.l5d.consul_to_linker/.local/lightstep-collector, Trace Id: 665942618bc1d03e.665942618bc1d03e<:665942618bc1d03e
  WARN 1001 14:56:05.324 CDT finagle/netty4-4: Failed to initialize a channel. Closing: [id: 0x518a954d]
  java.lang.IllegalArgumentException: File does not contain valid certificates: /var/tmp/linkerd/certCollection2672780185168733830.tmp
  at io.netty.handler.ssl.SslContextBuilder.trustManager(SslContextBuilder.java:179)
  at com.twitter.finagle.netty4.ssl.Netty4SslConfigurations$.configureTrust(Netty4SslConfigurations.scala:37)
  at com.twitter.finagle.netty4.ssl.client.Netty4ClientSslConfigurations$.createClientContext(Netty4ClientSslConfigurations.scala:80)
  at com.twitter.finagle.netty4.ssl.client.Netty4ClientEngineFactory.apply(Netty4ClientEngineFactory.scala:29)
  at com.twitter.finagle.netty4.ssl.client.Netty4ClientSslChannelInitializer.$anonfun$initChannel$1(Netty4ClientSslChannelInitializer.scala:120)
  at com.twitter.finagle.netty4.ssl.client.Netty4ClientSslChannelInitializer.$anonfun$initChannel$1$adapted(Netty4ClientSslChannelInitializer.scala:117)
  at com.twitter.finagle.netty4.ssl.client.Netty4ClientSslChannelInitializer$$Lambda$904.0000000045BED2F0.apply(Unknown Source)
  at scala.Option.foreach(Option.scala:257)
  at com.twitter.finagle.netty4.ssl.client.Netty4ClientSslChannelInitializer.initChannel(Netty4ClientSslChannelInitializer.scala:117)
  at io.netty.channel.ChannelInitializer.initChannel(ChannelInitializer.java:113)
  at io.netty.channel.ChannelInitializer.handlerAdded(ChannelInitializer.java:105)
  at io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:606)
  at io.netty.channel.DefaultChannelPipeline.addFirst(DefaultChannelPipeline.java:187)
  at io.netty.channel.DefaultChannelPipeline.addFirst(DefaultChannelPipeline.java:152)
  at com.twitter.finagle.netty4.channel.AbstractNetty4ClientChannelInitializer.initChannel(AbstractNetty4ClientChannelInitializer.scala:84)
  at com.twitter.finagle.netty4.channel.RawNetty4ClientChannelInitializer.initChannel(RawNetty4ClientChannelInitializer.scala:18)
  at io.netty.channel.ChannelInitializer.initChannel(ChannelInitializer.java:113)
  at io.netty.channel.ChannelInitializer.handlerAdded(ChannelInitializer.java:105)
  at io.netty.channel.DefaultChannelPipeline.callHandlerAdded0(DefaultChannelPipeline.java:606)
  at io.netty.channel.DefaultChannelPipeline.access$000(DefaultChannelPipeline.java:46)
  at io.netty.channel.DefaultChannelPipeline$PendingHandlerAddedTask.execute(DefaultChannelPipeline.java:1412)
  at io.netty.channel.DefaultChannelPipeline.callHandlerAddedForAllHandlers(DefaultChannelPipeline.java:1131)
  at io.netty.channel.DefaultChannelPipeline.invokeHandlerAddedIfNeeded(DefaultChannelPipeline.java:656)
  at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:510)
  at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
  at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
  at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
  at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
  at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:309)
  at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  at com.twitter.finagle.util.BlockingTimeTrackingThreadFactory$$anon$1.run(BlockingTimeTrackingThreadFactory.scala:23)
  at java.lang.Thread.run(Thread.java:813)
  Caused by: java.security.cert.CertificateException: could not find certificate file: /var/tmp/linkerd/certCollection2672780185168733830.tmp
  at io.netty.handler.ssl.PemReader.readCertificates(PemReader.java:69)
  at io.netty.handler.ssl.SslContext.toX509Certificates(SslContext.java:1070)
  at io.netty.handler.ssl.SslContextBuilder.trustManager(SslContextBuilder.java:177)
  ... 33 more
@zackangelo
Copy link
Contributor Author

zackangelo commented Oct 1, 2018

10 day look at FD usage for this instance:

image

It appears to grow slowly over time. Also, I thought that we raised the limit to 4096 before, but I was mistaken. My initial hunch is this correlates with the connect rate for the h2 routers on this instance but will have to do more digging to verify.

@adleong
Copy link
Member

adleong commented Oct 2, 2018

I notice java.security.cert.CertificateException: could not find certificate file: /var/tmp/linkerd/certCollection2672780185168733830.tmp in your logs. I wonder if we fail to clean up file descriptors when the file doesn't exist, or something like that. Is it expected that your Linkerd is trying to load certificates that don't exist?

@adleong
Copy link
Member

adleong commented Oct 2, 2018

It would also be helpful to know if this problem is specific to epoll. Can you reproduce this if you turn epoll off?

@adleong
Copy link
Member

adleong commented Oct 4, 2018

Does the number of open FDs correlate to the number of open connections? Do you see a growth in the number of FDs while number of connections remains constant?

@adleong
Copy link
Member

adleong commented Oct 29, 2018

@zackangelo any new information here? are you still seeing this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants