Docker engine swarm api service discovery #1766

Closed
F21 opened this issue Jun 26, 2016 · 52 comments · Fixed by #7420

@F21

F21 commented Jun 26, 2016

In Docker 1.12, the Docker engine will ship with swarm mode built in. This means that it is now possible to stand up a swarm cluster using a bunch of nodes with just Docker installed. In addition, swarm mode will come with DNS and health checks built in, negating the need to run Consul or some other service discovery mechanism. More info here: https://docs.docker.com/engine/swarm/

It would be nice if Prometheus could directly use the new services API to discover services running in a swarm cluster: https://docs.docker.com/engine/reference/api/docker_remote_api_v1.24/#3-8-services

Perhaps the config option could be called docker_swarm_sd.
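
For illustration, a minimal sketch of what such a scrape config might look like (docker_swarm_sd is only the proposed name; every field here is an assumption, not a shipped feature):

  - job_name: 'swarm-services'
    docker_swarm_sd_configs:
    - host: 'unix:https:///var/run/docker.sock'  # assumption: local engine socket, or tcp:https://manager:2375
      refresh_interval: 30s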

@brian-brazil
Contributor

In addition, swarm mode will come with dns and health checks built-in,

This will need to be bypassed for Prometheus service discovery.

We may want to wait for a release or two for this to stabilise before adding it, and to ensure there's sufficient interest to justify the maintenance effort of another SD.

@bvis

bvis commented Aug 2, 2016

This feature would be amazing. It would allow us to simplify some dependencies we currently need to manage to maintain a dynamic Prometheus environment.

@michaelharrer

You could use dns_sd_configs.
I'm running a global cAdvisor service and a global prometheus/node-exporter service and can scrape all nodes with the following config, using the tasks.<servicename> DNS feature of swarm mode.

  - job_name: 'cadvisor'
    dns_sd_configs:
    - names:
      - 'tasks.cadvisor'
      type: 'A'
      port: 8080

  - job_name: 'node-exporter'
    dns_sd_configs:
    - names:
      - 'tasks.node-exporter'
      type: 'A'
      port: 9100

It's a workaround, but functional.
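
If you also want to record which DNS name produced each target, dns_sd_configs exposes a __meta_dns_name label you can relabel; a small sketch (note it carries the queried name, e.g. tasks.cadvisor, not the node's hostname):

    relabel_configs:
    - source_labels: [__meta_dns_name]
      target_label: dns_name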

@Cas-pian

Any progress on this? I'm really looking forward to using this feature.
Thanks very much!

@genki

genki commented Dec 30, 2016

@michaelharrer Unfortunately, there is no way to determine which node a node_exporter instance is running on. Only node_exporter itself knows, but there's no option to expose that in its metrics (prometheus/node_exporter#319).

@joonas-fi

I just hacked together a proof-of-concept that syncs tasks from the Swarm manager to Prometheus: https://github.com/function61/prometheus-docker-swarm

The current limitation is that Prometheus has to be running on a Swarm manager node.

@bvis

bvis commented Dec 31, 2016

@genki, @joonas-fi: I've updated the description of the image I created for getting the metrics: https://github.com/bvis/docker-prometheus-swarm. It's not perfect, but it is very useful and the best I've seen so far.
In particular, I used a trick to get the host name into node-exporter.

docker \
  service create --name node-exporter \
  --mode global \
  --network monitoring \
  --label com.docker.stack.namespace=monitoring \
  --container-label com.docker.stack.namespace=monitoring \
  --mount type=bind,source=/proc,target=/host/proc \
  --mount type=bind,source=/sys,target=/host/sys \
  --mount type=bind,source=/,target=/rootfs \
  --mount type=bind,source=/etc/hostname,target=/etc/host_hostname \
  -e HOST_HOSTNAME=/etc/host_hostname \
  basi/node-exporter:v0.1.1 \
  -collector.procfs /host/proc \
  -collector.sysfs /host/sys \
  -collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)" \
  -collector.textfile.directory /etc/node-exporter/ \
  -collectors.enabled="conntrack,diskstats,entropy,filefd,filesystem,loadavg,mdadm,meminfo,netdev,netstat,stat,textfile,time,vmstat,ipvs"

@joonas-fi I'll try your solution when I get some time; it's probably a better alternative. And you don't need to have it running on a swarm manager node if you expose the metrics to the cluster through a proxy. A similar approach to:

docker \
  service create \
  --mode global \
  --name docker-exporter \
  --network monitoring \
  --publish 4999 \
  basi/socat:v0.1.0

Or:

docker \
    service create --name docker-proxy \
    --network my-network \
    --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock,readonly \
    --constraint 'node.role==manager' \
    rancher/socat-docker

That gets the Docker Swarm events and exposes them on the docker-proxy:2375 endpoint in the network. But to make this work, if I'm not wrong, you should move this to a variable: cli, err = client.NewClient("unix:https:///var/run/docker.sock", "", nil, nil)

On the other hand, I've tried it but couldn't get it to work, as it tries to obtain the data from the ingress network instead of the specific network where both services are attached. I think you should allow defining that as well; do you want me to open an issue in your project?

@genki

genki commented Jan 3, 2017

@bvis I have implemented your second suggestion: genki@2f49d37

This injects the meta labels "__domain", "__service", "__task" and "__host" at query execution time using the Docker API.

@bvis

bvis commented Jan 4, 2017

@genki Do you have a Prometheus image ready for use? I've built your image, but I think I still need to do more steps to include it in the project provided by @joonas-fi, or am I wrong?

It works! At least it's a first approach to a system that provides the host! Nice work!


What I've seen is that these values do not appear in the "Console" column; that's why I didn't see them. If you fix that, it would be nice to have a public image with your changes.

Could this be acceptable as a PR to this project?

@genki

genki commented Jan 4, 2017

@bvis Thank you for reporting :)
The injection only takes place when querying time series, so you can't see it on the console, but I thought that was sufficient.
I think this implementation is too specialized for Docker users.
It would be nice if there were a more generalized and sophisticated way to pass metadata about sample sources.

@joonas-fi

@bvis: oh man, thanks for the tip about creating a service that exposes the manager Docker socket (via a constraint) over TCP; I didn't think of that as a way to loosen the requirement of running on a Swarm manager node. :)

I will make the Docker URL given to Docker client configurable, as you pointed out!

I'm not sure what you mean by "as it tries to obtain the data from the ingress network". To my understanding the ingress network is only for published ports and the routing mesh? So if you publish the socat port, it will be public and therefore visible both from the ingress network AND via the container's IP itself. Publishing seems unnecessary, as the port shouldn't be public anyway (security issue), and you can reach the socat service just by its name without the port being public (provided the socat service and monitoring are on the same network), if I understand correctly. :)

I haven't given much thought to (or researched) services running on different networks (business services and monitoring on separate networks). Currently my assumption is that everything's running on the same network. I'll document that caveat. It might be easy to implement, I just don't know yet.

Just to be super clear to everyone, my project and @bvis's achieve different things:

  • Mine provides autodiscovery of your services running on Swarm that have metrics that Prom should scrape, but not container or node metrics
  • @bvis's solution provides container ("Docker container metadata") and node metrics, but not metrics from the actual services, if you have services that expose Prom metrics

@bvis

bvis commented Jan 4, 2017

@genki The problem I see with your solution is that it does not allow filtering queries based on these values, so I cannot use it in my dashboard to get values from one or a few hosts.

@joonas-fi You are right that it's unnecessary to publish the exporters' ports on the routing mesh; I had used that just for debugging purposes. The moment I removed the "--publish" option from cadvisor and node-exporter, your system started to scrape the values correctly. But to use it under different environments and conditions, I suggest you implement the network selection feature.

Another suggestion: it would be better to split your "docker-prometheus-bridge" binary into another image to allow process isolation; with both services running in the same container, problems could arise. Or try to add it to Prometheus itself.

On the other hand, my dashboard shows the container metrics cadvisor provides, and it's easy to extend. It would be good if I could create issues in your project for better follow-up.

And a third option: create a Prometheus fork adding both of your features, @joonas-fi and @genki. It could be very useful until the Prometheus project adds support for Docker Swarm service discovery, or maybe they could accept your changes; that's one of the best things about the open-source model. ;)

@genki

genki commented Jan 4, 2017

@bvis Injected labels are only usable for things like legend labels, because they are not real labels in the scope of a query. Prometheus uses labels as target identifiers, so inserting something into them causes duplication of targets when containers are recreated. My motivation was just to use the injected labels as legend labels in Grafana, like "{{__host}}".

@jmendiara

jmendiara commented Nov 23, 2017

Based on the Swarm discovery from @ContainerSolutions, I've coded a PoC that is working OK in our staging env:
https://github.com/jmendiara/prometheus-swarm-discovery

It takes some of the great ideas from the original solution, but tries to fit better in a deployment where Prometheus is executed on a (dedicated) swarm worker without mounting shared volumes between workers/masters (which is fairly complex with some cloud providers), and it provides more swarm metadata.

It also removes the "autoconnection to swarm networks" feature, leaving that responsibility to the swarm operator who interconnects services (although this feature could easily be brought back).

The original motivation was using the hostname of the worker as the instance label, instead of the task endpoint you get with @michaelharrer's DNS solution
(see https://github.com/jmendiara/prometheus-swarm-discovery/blob/master/prometheus-configs/prometheus.yaml#L8-L9).

The required client/server duality could be simplified by dropping the client completely if Prometheus implemented a generic <remote_sd_config>, very similar to <file_sd_config> but getting the static_config array from a configured endpoint. That <remote_sd_config> would also get rid of the shared volume mounting between the client and Prometheus.
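
A hypothetical sketch of that <remote_sd_config> (not an existing Prometheus option; the name and fields are made up for illustration):

  - job_name: 'swarm'
    remote_sd_configs:
    - url: 'http:https://swarm-discovery:8080/targets'
      refresh_interval: 30s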

Please let me know what you think about this approach.

@cuigh

cuigh commented Jan 8, 2018

After several months of waiting, I have implemented a simple Swarm discovery in my fork repo; maybe you guys need it too:
https://github.com/cuigh/prometheus

Or download the image directly:
https://hub.docker.com/r/cuigh/prometheus/

I'll keep my fork in sync with every stable release until Swarm is officially supported.

Configuration

For Prometheus:

- job_name: swarm
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  swarm_sd_configs:
  - api_server: http:https://docker-proxy:2375
    # api_version: 1.32
    # group: xxx
    # network: xxx
    # refresh_interval: 10s
    # timeout: 15s
  relabel_configs:
    # Add a service label
    - source_labels: [__meta_swarm_service]
      target_label: service
    # Add a node ip label
    - source_labels: [__meta_swarm_node_ip]
      target_label: node_ip
    # Add a node name label
    - source_labels: [__meta_swarm_node_name]
      target_label: node_name

For a Swarm service, you can add several labels to control scraping (see the stack-file sketch after the list):

  • prometheus.enable - Required
  • prometheus.port - Required
  • prometheus.network - Optional, 'host' or any other overlay network to which both Prometheus and the service are attached
  • prometheus.path - Optional, defaults to /metrics
  • prometheus.group - Optional, must match the group option of swarm_sd_configs
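
A hedged sketch of how those labels could look in a stack file (the service name and values are made up; the label names are the ones listed above):

  services:
    my-app:
      image: 'my-app:latest'
      deploy:
        labels:
          prometheus.enable: 'true'
          prometheus.port: '8080'
          prometheus.path: '/metrics'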

@KZachariassen

We really need this as well. Could we get an indication from the Prometheus team whether they want to include the functionality provided by @cuigh?

@simonpasquier
Member

I'm closing the issue as, unfortunately, we are currently not accepting new integrations. ContainerSolutions/prometheus-swarm-discovery is listed in the Prometheus documentation as a way to integrate Docker Swarm via the file service discovery.

We can only provide the stability and performance we want to provide if we can properly maintain the codebase. This includes, amongst other things, testing integrations in an automated and scalable fashion. For this reason, we suggest people integrate with the help of our generic interfaces. We have an integrations page on which integrations using our generic interfaces are listed.
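
For example, hooking an external discovery tool into Prometheus through the file interface is a matter of a file_sd_configs block pointing at the files the tool writes (the paths here are illustrative):

  - job_name: 'swarm'
    file_sd_configs:
    - files: ['/etc/prometheus/swarm-targets/*.json']
      refresh_interval: 30s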

Even if existing integrations cannot be tested in an automated fashion, we will not remove them for reasons of compatibility. This also means that any additions we take on, or any changes to existing integrations we make or accept, will mean maintaining and testing them until at least the next major version, realistically even beyond that.

Feel free to question this answer on our developer mailing list, but be aware it's unlikely that you will get a different answer.

@bborysenko

bborysenko commented Aug 2, 2018

Be aware that ContainerSolutions/prometheus-swarm-discovery is not yet ready for production usage, due to file descriptor leaks (ContainerSolutions/prometheus-swarm-discovery#9).

@joonas-fi

joonas-fi commented Dec 20, 2018

I updated my old proof of concept to use a better strategy: https://github.com/function61/promswarmconnect

Previously it used the file service discovery type to dynamically write the file to disk based on the info in Swarm. Its drawback was that we had to make changes to the Prometheus container, overriding the entrypoint and launching the file synchronizer binary AND Prometheus. This is not robust, because we would have had to write logic to deal with either of the binaries crashing.

My new approach emulates the API of the existing Triton service discovery, so we can run the released Prometheus container from Docker Hub 100% unchanged. All you have to do is write configuration for the Triton SD in the Prometheus config file.
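
A rough sketch of what that Triton SD config pointing at promswarmconnect could look like (the service name, port and values are assumptions; the project README is authoritative):

  - job_name: 'swarm'
    triton_sd_configs:
    - account: 'swarm'
      dns_suffix: 'promswarmconnect'
      endpoint: 'promswarmconnect'
      port: 443
      version: 1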

@SuperQ
Member

SuperQ commented Dec 22, 2018

Docker Swarm Mode is popular enough that we can make an exception to the SD moratorium. We also previously discussed adding support for it according to @brian-brazil.

@brian-brazil
Contributor

I see no reason to make any exceptions, we continue to have issues maintaining what we already have. We also previously decided not to support it, and it sounds like what exists now is not what existed then.

@simonpasquier
Member

It is not "all or nothing"; I see it more like a responsibility split: you can keep the existing integrations in the default bundle of Prometheus, but make it possible to add other integrations as plugins (like I mentioned in Kafka's case: you just add your implementation to the classpath and set the config value-serializer: com.github.YourImplementation)

@pdambrauskas unfortunately there is no practical plugin option in Go, otherwise I guess pluggable SD would have been done a long time ago...

@darkl0rd

Have you guys seen cuigh's post above? His implementation (https://github.com/cuigh/prometheus) is complete, fully integrated, and confirmed working. Considering that he already did all the heavy lifting, why not simply integrate his implementation? By the looks of things, he even seems more than happy to maintain it...

@SuperQ
Member

SuperQ commented Feb 23, 2019

I think it would be great. @cuigh Would you be willing to open a PR to add it?

@hairyhenderson
Contributor

@SuperQ that was already rejected at #3687 😉

@WTFKr0

WTFKr0 commented Mar 5, 2019

Hey, I just tested the @cuigh fork and it fills my needs, but I'd prefer to stay on the main Prom repo.

As I understand it, the right solution is to create a new custom SD mechanism like the example here: https://github.com/prometheus/prometheus/tree/master/documentation/examples/custom-sd

with the code of the @cuigh fork in it.

Has anybody started working on this?

@SuperQ
Member

SuperQ commented Mar 5, 2019

I would propose re-submitting #3687 as a new PR. We can take an official vote on the Prometheus developers list to decide if it's good enough to merge, rather than having one person on prometheus-team object.

@cuigh

cuigh commented Mar 6, 2019

Hey, I just tested the @cuigh fork and it fills my needs, but I'd prefer to stay on the main Prom repo.

As I understand it, the right solution is to create a new custom SD mechanism like the example here: https://github.com/prometheus/prometheus/tree/master/documentation/examples/custom-sd

with the code of the @cuigh fork in it.

Has anybody started working on this?

I still don't think it's a good idea to implement swarm_sd based on file_sd, unless HTTP is supported in file_sd.

@WTFKr0

WTFKr0 commented Mar 6, 2019

Yeah, agreed.
But I think the Prometheus team implemented this plugin mechanism to be a standard for all SDs.
As I understand it, they want to move the existing core SDs out of the Prometheus binary in the future too, so all SDs will use the plugin mode:

By co-locating Prometheus and our new executable we can configure Prometheus to read the file_sd-compatible output of our executable, and therefore scrape targets from that service discovery mechanism. In the future this will enable us to move SD integrations out of the main Prometheus binary, as well as to move stable SD integrations that make use of the adapter into the Prometheus discovery package

See https://prometheus.io/blog/2018/07/05/implementing-custom-sd/

@SuperQ
Member

SuperQ commented Mar 9, 2019

No, we don't want to remove SD from the core. We do want to make it easier to add new methods outside the core.

@WTFKr0

WTFKr0 commented Mar 11, 2019

OK

So who can resubmit the PR for a vote?

@cuigh I would like to improve some of the code in your fork a bit; can you enable issues on your fork so we can exchange on that?

@brian-brazil
Contributor

We discussed this at our monthly meeting today; the moratorium remains. Currently we're awaiting integration testing for a good swathe of our existing SDs, and any new SD would be expected to follow in their footsteps.

@joonas-fi

@brian-brazil could you then please add HTTP support to the file SD (so the SD JSON can be fetched over HTTP), so we'd at least get a clean point of integration for adding SD agents running outside of Prometheus' container?

See use case of https://github.com/function61/promswarmconnect - this would be much cleaner if it could produce JSON compatible with the file SD agent!

@brian-brazil
Contributor

We have a moratorium on new SDs, and we already have a clean generic interface for integrations.

@joonas-fi

We have a moratorium on new SDs, and we already have a clean generic interface for integrations.

That interface just passes complexity management on to the users. With that interface I need to have the SD binary (let's say promswarmconnect) running either:

  1. Inside the Prometheus container. In this case the SD plugin project needs to release a new version each time Prometheus releases a new version (the SD plugin project overlays its binary inside Prometheus' Docker image). This also requires a process supervisor, because now we're running unrelated processes inside a single Docker image. I tried that in my first proof-of-concept of the promswarmconnect project. This approach is not robust and creates an unnecessary burden for SD plugin developers.

  2. In another container running on the same host as Prometheus, sharing a filesystem with it. This is also far from clean. It requires support from your orchestration layer for specifying "this container should always run on the same host as Prometheus", unless you want to schedule it manually and therefore forgo automatic rescheduling if a host goes down.

I ask again: is all this complexity justified just because you don't want to add remote JSON support to the file SD? I can totally understand not wanting to add 4,138 different SD plugins to maintain for the trendiest service platform of the week, but we're asking for an olive branch here, because what you're suggesting is far from elegant, and especially not in the microservice philosophy that Prometheus otherwise fits so elegantly.

TL;DR: a generic HTTP-based SD integration is the only elegant way we'll be able to build SD integrations outside of Prometheus' tree.
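
Concretely, such an endpoint would only need to serve the same structure file_sd already reads, shown here in the YAML form file_sd accepts (JSON is equivalent; targets and labels are illustrative):

  - targets: ['10.0.1.5:9100']
    labels:
      swarm_service: 'node-exporter'
      swarm_node: 'worker-1'
  - targets: ['10.0.1.6:9100']
    labels:
      swarm_service: 'node-exporter'
      swarm_node: 'worker-2'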

@brian-brazil
Contributor

This also requires support from your orchestration layer allowing you to specify ("this container should always run on the same host as Prometheus")

The sidecar model is pretty standard, and not something you can really avoid if you're using Prometheus. We assume a POSIX system, and that includes processes being able to share filesystems, send each other signals etc.

This approach is not robust and creates unnecessary burden for SD plugin developers

I've done it in the past, the bash scripting is a little finicky, but it's quite doable. Especially if you can use a non-ancient version of bash.

generic HTTP based SD integration is the only elegant way we'll be able to build SD integrations outside of Prometheus' tree.

I disagree here, and there are many out there that build fine on what we have. Writing code and deploying it are separate concerns, and I don't think we should be adding features just because one particular deployment system happens to lack a basic feature.

@cuigh

cuigh commented Mar 12, 2019

@cuigh I would like to improve some of the code in your fork a bit; can you enable issues on your fork so we can exchange on that?

The PR was already merged, and I've enabled the issues setting too, thanks.

@webchi

webchi commented Sep 2, 2019

Docker Swarm rocks 🤘

@kz1000fan

Looks like this issue has gone stale... I'm looking for mechanisms to implement metrics discovery for Swarm-hosted containers and came across this thread. Any further progress/thoughts on whether this will be supported in the master branch? Thanks.

@WTFKr0

WTFKr0 commented Feb 20, 2020

@kz1000fan I don't think it's going to happen here.
Give the https://github.com/cuigh/prometheus fork a try.

@darkl0rd

darkl0rd commented May 5, 2020

I am aware that there have been several discussions around this subject - has a decision since been made on whether to natively support swarm service discovery?

@SuperQ
Member

SuperQ commented May 7, 2020

@darkl0rd Yes, we're willing to accept new discovery mechanisms. The new rules are:

  • Find a core maintainer to be a sponsor.
  • Write the code.

I'm happy to be the sponsor for the docker swarm discovery, but someone needs to write the code. :)

@darkl0rd

darkl0rd commented May 7, 2020

@SuperQ there is a complete, working fork / pull request in here from @cuigh.

@brian-brazil
Contributor

There are several variants out there; however, I've yet to see one of a standard where it could live inside Prometheus - for example, no hardcoding of business logic.

@roidelapluie
Member

I am working on this in #7420
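
A minimal sketch of the configuration shape proposed there (field names as in #7420; see the PR for the authoritative documentation):

  - job_name: 'dockerswarm'
    dockerswarm_sd_configs:
    - host: 'unix:https:///var/run/docker.sock'
      role: tasks  # one of: nodes, services, tasks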

@prometheus prometheus locked as resolved and limited conversation to collaborators Nov 22, 2021