Docker engine swarm api service discovery #1766

Closed
F21 opened this issue Jun 26, 2016 · 52 comments · Fixed by #7420

@F21

F21 commented Jun 26, 2016

In Docker 1.12, the Docker engine will ship with swarm mode built in. This means that it is now possible to stand up a swarm cluster using a bunch of nodes with just Docker installed. In addition, swarm mode will come with DNS and health checks built in, negating the need to run Consul or some other service discovery mechanism. More info here: https://docs.docker.com/engine/swarm/

It would be nice if Prometheus could directly use the new services API to discover services running in a swarm cluster: https://docs.docker.com/engine/reference/api/docker_remote_api_v1.24/#3-8-services

Perhaps the config option could be called docker_swarm_sd.
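
For illustration, a minimal sketch of what such a scrape config might look like (docker_swarm_sd is only the proposed name; every field here is an assumption, not a shipped feature):

  - job_name: 'swarm-services'
    docker_swarm_sd_configs:
    - host: 'unix:https:///var/run/docker.sock'  # assumption: local engine socket, or tcp:https://manager:2375
      refresh_interval: 30s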

@brian-brazil
Contributor

In addition, swarm mode will come with dns and health checks built-in,

This will need to be bypassed for Prometheus service discovery.

We may want to wait for a release or two for this to stabilise before adding it, and to ensure there's sufficient interest to justify the maintenance effort of another SD.

@bvis

bvis commented Aug 2, 2016

This feature would be amazing. It would allow us to simplify some dependencies we currently need to manage to maintain a dynamic Prometheus environment.

@michaelharrer

You could use dns_sd_configs.
I'm running a global cAdvisor service and a global prometheus/node-exporter service and can scrape all nodes with the following config, using the tasks.<servicename> DNS feature of swarm mode.

  - job_name: 'cadvisor'
    dns_sd_configs:
    - names:
      - 'tasks.cadvisor'
      type: 'A'
      port: 8080

  - job_name: 'node-exporter'
    dns_sd_configs:
    - names:
      - 'tasks.node-exporter'
      type: 'A'
      port: 9100

It's a workaround, but functional.
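
If you also want to record which DNS name produced each target, dns_sd_configs exposes a __meta_dns_name label you can relabel; a small sketch (note it carries the queried name, e.g. tasks.cadvisor, not the node's hostname):

    relabel_configs:
    - source_labels: [__meta_dns_name]
      target_label: dns_name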

@Cas-pian

Any progress on this? I'm really looking forward to using this feature.
Thanks very much!

@genki

genki commented Dec 30, 2016

@michaelharrer Unfortunately, there is no way to determine which node a node_exporter instance is running on. Only node_exporter itself knows, but there's no option to expose that in its metrics (prometheus/node_exporter#319).

@joonas-fi

I just hacked together a proof-of-concept that syncs tasks from the Swarm manager to Prometheus: https://github.com/function61/prometheus-docker-swarm

The current limitation is that Prometheus has to be running on a Swarm manager node.

@bvis

bvis commented Dec 31, 2016

@genki, @joonas-fi: I've updated the description of the image I created for getting the metrics: https://github.com/bvis/docker-prometheus-swarm. It's not perfect, but it is very useful and the best I've seen so far.
In particular, I used a trick to get the host name into node-exporter.

docker \
  service create --name node-exporter \
  --mode global \
  --network monitoring \
  --label com.docker.stack.namespace=monitoring \
  --container-label com.docker.stack.namespace=monitoring \
  --mount type=bind,source=/proc,target=/host/proc \
  --mount type=bind,source=/sys,target=/host/sys \
  --mount type=bind,source=/,target=/rootfs \
  --mount type=bind,source=/etc/hostname,target=/etc/host_hostname \
  -e HOST_HOSTNAME=/etc/host_hostname \
  basi/node-exporter:v0.1.1 \
  -collector.procfs /host/proc \
  -collector.sysfs /host/sys \
  -collector.filesystem.ignored-mount-points "^/(sys|proc|dev|host|etc)($|/)" \
  -collector.textfile.directory /etc/node-exporter/ \
  -collectors.enabled="conntrack,diskstats,entropy,filefd,filesystem,loadavg,mdadm,meminfo,netdev,netstat,stat,textfile,time,vmstat,ipvs"

@joonas-fi I'll try your solution when I get some time; it's probably a better alternative. And you don't need to have it running on a swarm manager node if you expose the metrics to the cluster through a proxy. A similar approach to:

docker \
  service create \
  --mode global \
  --name docker-exporter \
  --network monitoring \
  --publish 4999 \
  basi/socat:v0.1.0

Or:

docker \
    service create --name docker-proxy \
    --network my-network \
    --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock,readonly \
    --constraint 'node.role==manager' \
    rancher/socat-docker

That gets the Docker Swarm events and exposes them on the docker-proxy:2375 endpoint in the network. But to make this work, if I'm not wrong, you should move this to a variable: cli, err = client.NewClient("unix:https:///var/run/docker.sock", "", nil, nil)

On the other hand, I've tried it but couldn't get it to work, as it tries to obtain the data from the ingress network instead of the specific network where both services are attached. I think you should allow defining that as well; do you want me to open an issue in your project?

@genki

genki commented Jan 3, 2017

@bvis I have implemented your second suggestion: genki@2f49d37

This injects the meta labels "__domain", "__service", "__task" and "__host" at query execution time using the Docker API.

@bvis

bvis commented Jan 4, 2017

@genki Do you have a Prometheus image ready for use? I've built your image, but I think I still need to do more steps to include it in the project provided by @joonas-fi, or am I wrong?

It works! At least it's a first approach to a system that provides the host! Nice work!


What I've seen is that these values do not appear in the "Console" column; that's why I didn't see them. If you fix that, it would be nice to have a public image with your changes.

Could this be acceptable as a PR to this project?

@genki

genki commented Jan 4, 2017

@bvis Thank you for reporting :)
The injection only takes place when querying time series, so you can't see it on the console, but I thought that was sufficient.
I think this implementation is too specialized for Docker users.
It would be nice if there were a more generalized and sophisticated way to pass metadata about sample sources.

@joonas-fi

@bvis: oh man, thanks for the tip about creating a service that exposes the manager Docker socket (via a constraint) over TCP; I didn't think of that as a way to loosen the requirement of running on a Swarm manager node. :)

I will make the Docker URL given to Docker client configurable, as you pointed out!

I'm not sure what you mean by "as it tries to obtain the data from the ingress network". To my understanding the ingress network is only for published ports and the routing mesh? So if you publish the socat port, it will be public and therefore visible both from the ingress network AND via the container's IP itself. Publishing seems unnecessary, as the port shouldn't be public anyway (security issue), and you can reach the socat service just by its name without the port being public (provided the socat service and monitoring are on the same network), if I understand correctly. :)

I haven't given much thought to (or researched) services running on different networks (business services and monitoring on separate networks). Currently my assumption is that everything's running on the same network. I'll document that caveat. It might be easy to implement, I just don't know yet.

Just to be super clear to everyone, my project and @bvis's achieve different things:

  • Mine provides autodiscovery of your services running on Swarm that have metrics that Prom should scrape, but not container or node metrics
  • @bvis's solution provides container ("Docker container metadata") and node metrics, but not metrics from the actual services, if you have services that expose Prom metrics

@bvis

bvis commented Jan 4, 2017

@genki The problem I see with your solution is that it does not allow filtering queries based on these values, so I cannot use it in my dashboard to get values from one or a few hosts.

@joonas-fi You are right that it's unnecessary to publish the exporters' ports on the routing mesh; I had used that just for debugging purposes. The moment I removed the "--publish" option from cadvisor and node-exporter, your system started to scrape the values correctly. But to use it under different environments and conditions, I suggest you implement the network selection feature.

Another suggestion: it would be better to split your "docker-prometheus-bridge" binary into another image to allow process isolation; with both services running in the same container, problems could arise. Or try to add it to Prometheus itself.

On the other hand, my dashboard shows the container metrics cadvisor provides, and it's easy to extend. It would be good if I could create issues in your project for better follow-up.

And a third option: create a Prometheus fork adding both of your features, @joonas-fi and @genki. It could be very useful until the Prometheus project adds support for Docker Swarm service discovery, or maybe they could accept your changes; that's one of the best things about the open-source model. ;)

@genki

genki commented Jan 4, 2017

@bvis Injected labels are only usable for things like legend labels, because they are not real labels in the scope of a query. Prometheus uses labels as target identifiers, so inserting something into them causes duplication of targets when containers are recreated. My motivation was just to use the injected labels as legend labels in Grafana, like "{{__host}}".

@jmendiara

jmendiara commented Nov 23, 2017

Based on the Swarm discovery from @ContainerSolutions, I've coded a PoC that is working OK in our staging env:
https://github.com/jmendiara/prometheus-swarm-discovery

It takes some of the great ideas from the original solution, but tries to fit better in a deployment where Prometheus is executed on a (dedicated) swarm worker without mounting shared volumes between workers/masters (which is fairly complex with some cloud providers), and it provides more swarm metadata.

It also removes the "autoconnection to swarm networks" feature, leaving that responsibility to the swarm operator who interconnects services (although this feature could easily be brought back).

The original motivation was using the hostname of the worker as the instance label, instead of the task endpoint you get with @michaelharrer's DNS solution
(see https://github.com/jmendiara/prometheus-swarm-discovery/blob/master/prometheus-configs/prometheus.yaml#L8-L9).

The required client/server duality could be simplified by dropping the client completely if Prometheus implemented a generic <remote_sd_config>, very similar to <file_sd_config> but getting the static_config array from a configured endpoint. That <remote_sd_config> would also get rid of the shared volume mounting between the client and Prometheus.
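
A hypothetical sketch of that <remote_sd_config> (not an existing Prometheus option; the name and fields are made up for illustration):

  - job_name: 'swarm'
    remote_sd_configs:
    - url: 'http:https://swarm-discovery:8080/targets'
      refresh_interval: 30s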

Please let me know what you think about this approach.

@cuigh

cuigh commented Jan 8, 2018

After several months of waiting, I have implemented a simple Swarm discovery in my fork repo; maybe you guys need it too:
https://github.com/cuigh/prometheus

Or download the image directly:
https://hub.docker.com/r/cuigh/prometheus/

I'll keep my fork in sync with every stable release until Swarm is officially supported.

Configuration

For Prometheus:

- job_name: swarm
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  swarm_sd_configs:
  - api_server: http:https://docker-proxy:2375
    # api_version: 1.32
    # group: xxx
    # network: xxx
    # refresh_interval: 10s
    # timeout: 15s
  relabel_configs:
    # Add a service label
    - source_labels: [__meta_swarm_service]
      target_label: service
    # Add a node ip label
    - source_labels: [__meta_swarm_node_ip]
      target_label: node_ip
    # Add a node name label
    - source_labels: [__meta_swarm_node_name]
      target_label: node_name

For a Swarm service, you can add several labels to control scraping (see the stack-file sketch after the list):

  • prometheus.enable - Required
  • prometheus.port - Required
  • prometheus.network - Optional, 'host' or any other overlay network to which both Prometheus and the service are attached
  • prometheus.path - Optional, defaults to /metrics
  • prometheus.group - Optional, must match the group option of swarm_sd_configs
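
A hedged sketch of how those labels could look in a stack file (the service name and values are made up; the label names are the ones listed above):

  services:
    my-app:
      image: 'my-app:latest'
      deploy:
        labels:
          prometheus.enable: 'true'
          prometheus.port: '8080'
          prometheus.path: '/metrics'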

@KZachariassen

We really need this as well. Could we get an indication from the Prometheus team whether they want to include the functionality provided by @cuigh?

@simonpasquier
Member

I'm closing the issue as, unfortunately, we are currently not accepting new integrations. ContainerSolutions/prometheus-swarm-discovery is listed in the Prometheus documentation as a way to integrate Docker Swarm via the file service discovery.

We can only provide the stability and performance we want to provide if we can properly maintain the codebase. This includes, amongst other things, testing integrations in an automated and scalable fashion. For this reason, we suggest people integrate with the help of our generic interfaces. We have an integrations page on which integrations using our generic interfaces are listed.
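
For example, hooking an external discovery tool into Prometheus through the file interface is a matter of a file_sd_configs block pointing at the files the tool writes (the paths here are illustrative):

  - job_name: 'swarm'
    file_sd_configs:
    - files: ['/etc/prometheus/swarm-targets/*.json']
      refresh_interval: 30s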

Even if existing integrations cannot be tested in an automated fashion, we will not remove them for reasons of compatibility. This also means that any additions we take on, or any changes to existing integrations we make or accept, will mean maintaining and testing them until at least the next major version, realistically even beyond that.

Feel free to question this answer on our developer mailing list, but be aware it's unlikely that you will get a different answer.

@bborysenko

bborysenko commented Aug 2, 2018

Be aware that ContainerSolutions/prometheus-swarm-discovery is not yet ready for production usage, due to file descriptor leaks (ContainerSolutions/prometheus-swarm-discovery#9).

@joonas-fi

joonas-fi commented Dec 20, 2018

I updated my old proof of concept to use a better strategy: https://github.com/function61/promswarmconnect

Previously it used the file service discovery type to dynamically write the file to disk based on the info in Swarm. Its drawback was that we had to make changes to the Prometheus container, overriding the entrypoint and launching the file synchronizer binary AND Prometheus. This is not robust, because we would have had to write logic to deal with either of the binaries crashing.

My new approach emulates the API of the existing Triton service discovery, so we can run the released Prometheus container from Docker Hub 100% unchanged. All you have to do is write configuration for the Triton SD in the Prometheus config file.
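
A rough sketch of what that Triton SD config pointing at promswarmconnect could look like (the service name, port and values are assumptions; the project README is authoritative):

  - job_name: 'swarm'
    triton_sd_configs:
    - account: 'swarm'
      dns_suffix: 'promswarmconnect'
      endpoint: 'promswarmconnect'
      port: 443
      version: 1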

@SuperQ
Member

SuperQ commented Dec 22, 2018

Docker Swarm Mode is popular enough that we can make an exception to the SD moratorium. We also previously discussed adding support for it according to @brian-brazil.

@brian-brazil
Contributor

I see no reason to make any exceptions, we continue to have issues maintaining what we already have. We also previously decided not to support it, and it sounds like what exists now is not what existed then.

@simonpasquier
Member

It is not "all or nothing"; I see it more like a responsibility split: you can keep the existing integrations in the default bundle of Prometheus, but make it possible to add other integrations as plugins (like I mentioned in Kafka's case: you just add your implementation to the classpath and set the config value-serializer: com.github.YourImplementation)

@pdambrauskas unfortunately there is no practical plugin option in Go, otherwise I guess pluggable SD would have been done a long time ago...

@darkl0rd

Have you guys seen cuigh's post above? His implementation (https://github.com/cuigh/prometheus) is complete, fully integrated, and confirmed working. Considering that he already did all the heavy lifting, why not simply integrate his implementation? By the looks of things, he even seems more than happy to maintain it...

@SuperQ
Member

SuperQ commented Feb 23, 2019

I think it would be great. @cuigh Would you be willing to open a PR to add it?

@hairyhenderson
Contributor

@SuperQ that was already rejected at #3687 😉

@WTFKr0

WTFKr0 commented Mar 5, 2019

Hey, I just tested the @cuigh fork and it fills my needs, but I'd prefer to stay on the main Prom repo.

As I understand it, the right solution is to create a new custom SD mechanism like the example here: https://github.com/prometheus/prometheus/tree/master/documentation/examples/custom-sd

with the code of the @cuigh fork in it.

Has anybody started working on this?

@SuperQ
Member

SuperQ commented Mar 5, 2019

I would propose re-submitting #3687 as a new PR. We can take an official vote on the Prometheus developers list to decide if it's good enough to merge, rather than having one person on prometheus-team object.

@cuigh

cuigh commented Mar 6, 2019

Hey, I just tested the @cuigh fork and it fills my needs, but I'd prefer to stay on the main Prom repo.

As I understand it, the right solution is to create a new custom SD mechanism like the example here: https://github.com/prometheus/prometheus/tree/master/documentation/examples/custom-sd

with the code of the @cuigh fork in it.

Has anybody started working on this?

I still don't think it's a good idea to implement swarm_sd based on file_sd, unless HTTP is supported in file_sd.

@WTFKr0

WTFKr0 commented Mar 6, 2019

Yeah, agreed.
But I think the Prometheus team implemented this plugin mechanism to be a standard for all SDs.
As I understand it, they want to move the existing core SDs out of the Prometheus binary in the future too, so all SDs will use the plugin mode:

By co-locating Prometheus and our new executable we can configure Prometheus to read the file_sd-compatible output of our executable, and therefore scrape targets from that service discovery mechanism. In the future this will enable us to move SD integrations out of the main Prometheus binary, as well as to move stable SD integrations that make use of the adapter into the Prometheus discovery package

See https://prometheus.io/blog/2018/07/05/implementing-custom-sd/

@SuperQ
Member

SuperQ commented Mar 9, 2019

No, we don't want to remove SD from the core. We do want to make it easier to add new methods outside the core.

@WTFKr0

WTFKr0 commented Mar 11, 2019

OK

So who can resubmit the PR for a vote?

@cuigh I would like to improve some of the code in your fork a bit; can you enable issues on your fork so we can exchange on that?

@brian-brazil
Contributor

We discussed this at our monthly meeting today; the moratorium remains. Currently we're awaiting integration testing for a good swathe of our existing SDs, and any new SD would be expected to follow in their footsteps.

@joonas-fi

@brian-brazil could you then please add HTTP support to the file SD (so the SD JSON can be fetched over HTTP), so we'd at least get a clean point of integration for adding SD agents running outside of Prometheus' container?

See use case of https://github.com/function61/promswarmconnect - this would be much cleaner if it could produce JSON compatible with the file SD agent!

@brian-brazil
Contributor

We have a moratorium on new SDs, and we already have a clean generic interface for integrations.

@joonas-fi

We have a moratorium on new SDs, and we already have a clean generic interface for integrations.

That interface just passes complexity management on to the users. With that interface I need to have the SD binary (let's say promswarmconnect) running either:

  1. Inside the Prometheus container. In this case the SD plugin project needs to release a new version each time Prometheus releases a new version (the SD plugin project overlays its binary inside Prometheus' Docker image). This also requires a process supervisor, because now we're running unrelated processes inside a single Docker image. I tried that in my first proof-of-concept of the promswarmconnect project. This approach is not robust and creates an unnecessary burden for SD plugin developers.

  2. In another container running on the same host as Prometheus, sharing a filesystem with it. This is also far from clean. It requires support from your orchestration layer for specifying "this container should always run on the same host as Prometheus", unless you want to schedule it manually and therefore forgo automatic rescheduling if a host goes down.

I ask again: is all this complexity justified just because you don't want to add remote JSON support to the file SD? I can totally understand not wanting to add 4,138 different SD plugins to maintain for the trendiest service platform of the week, but we're asking for an olive branch here, because what you're suggesting is far from elegant, and especially not in the microservice philosophy that Prometheus otherwise fits so elegantly.

TL;DR: a generic HTTP-based SD integration is the only elegant way we'll be able to build SD integrations outside of Prometheus' tree.
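
Concretely, such an endpoint would only need to serve the same structure file_sd already reads, shown here in the YAML form file_sd accepts (JSON is equivalent; targets and labels are illustrative):

  - targets: ['10.0.1.5:9100']
    labels:
      swarm_service: 'node-exporter'
      swarm_node: 'worker-1'
  - targets: ['10.0.1.6:9100']
    labels:
      swarm_service: 'node-exporter'
      swarm_node: 'worker-2'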

@brian-brazil
Contributor

This also requires support from your orchestration layer allowing you to specify ("this container should always run on the same host as Prometheus")

The sidecar model is pretty standard, and not something you can really avoid if you're using Prometheus. We assume a POSIX system, and that includes processes being able to share filesystems, send each other signals etc.

This approach is not robust and creates unnecessary burden for SD plugin developers

I've done it in the past, the bash scripting is a little finicky, but it's quite doable. Especially if you can use a non-ancient version of bash.

generic HTTP based SD integration is the only elegant way we'll be able to build SD integrations outside of Prometheus' tree.

I disagree here, and there are many out there that build fine on what we have. Writing code and deploying it are separate concerns, and I don't think we should be adding features just because one particular deployment system happens to lack a basic feature.

@cuigh

cuigh commented Mar 12, 2019

@cuigh I would like to improve some of the code in your fork a bit; can you enable issues on your fork so we can exchange on that?

The PR was already merged, and I've enabled the issues setting too, thanks.

@webchi

webchi commented Sep 2, 2019

Docker Swarm rocks 🤘

@kz1000fan

Looks like this issue has gone stale... I'm looking for mechanisms to implement metrics discovery for Swarm-hosted containers and came across this thread. Any further progress/thoughts on whether this will be supported in the master branch? Thanks.

@WTFKr0

WTFKr0 commented Feb 20, 2020

@kz1000fan I don't think it's going to happen here.
Give the https://github.com/cuigh/prometheus fork a try.

@darkl0rd

darkl0rd commented May 5, 2020

I am aware that there have been several discussions around this subject - has a decision since been made on whether to natively support swarm service discovery?

@SuperQ
Member

SuperQ commented May 7, 2020

@darkl0rd Yes, we're willing to accept new discovery mechanisms. The new rules are:

  • Find a core maintainer to be a sponsor.
  • Write the code.

I'm happy to be the sponsor for the docker swarm discovery, but someone needs to write the code. :)

@darkl0rd

darkl0rd commented May 7, 2020

@SuperQ there is a complete, working fork / pull request in here from @cuigh.

@brian-brazil
Contributor

There are several variants out there; however, I've yet to see one of a standard where it could live inside Prometheus - for example, no hardcoding of business logic.

@roidelapluie
Member

I am working on this in #7420
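
A minimal sketch of the configuration shape proposed there (field names as in #7420; see the PR for the authoritative documentation):

  - job_name: 'dockerswarm'
    dockerswarm_sd_configs:
    - host: 'unix:https:///var/run/docker.sock'
      role: tasks  # one of: nodes, services, tasks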

@prometheus prometheus locked as resolved and limited conversation to collaborators Nov 22, 2021