Add monitoring stack #36
Conversation
This adds a Grafana + Prometheus + Postgres + Loki setup, deployed via docker-compose and Ansible to a remote machine (e.g. an EC2 instance). Two Docker images fill data into Postgres: one gathers data on test cases (only the failed tests are stored in pg), and the other gathers the hierarchy of data for Jenkins builds (jobs (`tvm`) -> builds (`main` or `PR-1234`) -> stages (`build: CPU`) -> steps (`run_a_script.sh`)). Prometheus just scrapes Jenkins.
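The core of a setup like this can be sketched in a single compose file. Service names, image tags, and the shared network below are illustrative, not the exact ones in this PR:

```yaml
version: "3.8"

services:
  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - monitoring

  loki:
    image: grafana/loki:latest
    networks:
      - monitoring

  postgres:
    image: postgres:14
    environment:
      POSTGRES_PASSWORD: example  # placeholder; use a secret in practice
    networks:
      - monitoring

networks:
  monitoring:
```

All four services sit on one user-defined network so they can reach each other by service name (e.g. Grafana pointing at `postgres:5432`).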
server grafana;
}

server {
If we end up using an ALB for SSL termination, we might not need to deploy nginx at all.
Yeah, that's all nginx is doing, so we could definitely get rid of it at some point.
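For context, the nginx role here reduces to roughly the following (the domain, upstream name, and cert paths are placeholders, not the ones in this PR), which is exactly what an ALB listener with an ACM certificate could replace:

```nginx
upstream grafana {
    server grafana:3000;
}

server {
    listen 443 ssl;
    server_name monitoring.example.com;  # placeholder domain

    ssl_certificate     /etc/ssl/certs/monitoring.pem;   # placeholder paths
    ssl_certificate_key /etc/ssl/private/monitoring.key;

    location / {
        # terminate TLS here and forward plain HTTP to Grafana
        proxy_pass http://grafana;
        proxy_set_header Host $host;
    }
}
```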
apt update
apt install -y docker.io
apt install -y unzip
- name: Set up Docker Swarm
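Initializing a single-node swarm from Ansible can be done with the `community.docker` collection; this is a sketch (the advertise address variable is an assumption about the playbook), not necessarily what this task does:

```yaml
- name: Set up Docker Swarm
  community.docker.docker_swarm:
    state: present
    advertise_addr: "{{ ansible_default_ipv4.address }}"  # assumed host fact
```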
Never used Docker Swarm -- what's the advantage of using it over straight compose?
For this use, not much, since it's just a single machine. We could probably simplify down to just compose and be fine, but this is mostly cobbled together from a bunch of copy-pasting, so there are questionable design decisions all over the place.
max-size: "20m"
max-file: "10"
networks:
  - monitoring
Does this network need to be defined?
I think so, in order for services to reference each other by domain name (but I'm not 100% on that and didn't try it without).
Yeah, so the way I do it is to just define the network directly in the same compose file; see https://docs.docker.com/compose/networking/
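Following that doc, the top-level definition is just a few lines (`bridge` is the default driver for a single-host compose file, so it could be omitted):

```yaml
networks:
  monitoring:
    driver: bridge
```

With that in place, each service that lists `monitoring` under its `networks:` key can resolve the others by service name.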
Have you tried running this locally yet? Also, how does Prometheus authenticate with Jenkins?
No auth with Jenkins; we'll have to figure out how that works with the Prometheus plugin (it looks like it is supported).
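For reference, a minimal scrape job for the Jenkins Prometheus plugin might look like the following; the Jenkins host and credentials are placeholders, and the `/prometheus/` metrics path is the plugin's default:

```yaml
scrape_configs:
  - job_name: jenkins
    metrics_path: /prometheus/
    static_configs:
      - targets: ["ci.example.com"]  # placeholder Jenkins host
    basic_auth:  # only once auth is actually enabled on the endpoint
      username: prometheus
      password_file: /etc/prometheus/jenkins_password  # placeholder path
```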
volumes:
  - /etc/tvm/grafana:/var/lib/grafana
  - /etc/tvm/grafana-provisioning/:/etc/grafana/provisioning
  - /etc/tvm/email_template.html:/usr/share/grafana/public/emails/alert_notification.html
Ah, is Grafana configured to send alert notifications? I'm guessing recipients/alarm conditions are all configured manually through the UI?
This is possible but not set up right now (it needs to auth to Gmail or something to actually send mail).
Yeah, we can also use something like SES if we want.
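If we do wire up mail, Grafana's SMTP settings can be overridden via `GF_*` environment variables on the Grafana service; the endpoint and credentials below are placeholders (SES exposes a standard SMTP endpoint per region):

```yaml
environment:
  GF_SMTP_ENABLED: "true"
  GF_SMTP_HOST: "email-smtp.us-west-2.amazonaws.com:587"  # placeholder SES endpoint
  GF_SMTP_USER: "ses-smtp-user"                           # placeholder credential
  GF_SMTP_PASSWORD: "ses-smtp-password"                   # placeholder; use a secret
  GF_SMTP_FROM_ADDRESS: "alerts@example.com"              # placeholder sender
```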
How exactly does Postgres fit in here? Does Grafana consume from it as a data source? Is it linked to Prometheus/Loki in any way?
Postgres maintains a sync of Jenkins; most of the stuff on https://monitoring.tlcpack.ai/ comes from Postgres (basically it copies down the hierarchy of builds/jobs/steps from the Blue Ocean API into a queryable source). Other than build IDs, none of the data sources are linked in any way.
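Since the PR mounts `/etc/tvm/grafana-provisioning/` into Grafana, the Postgres data source could be provisioned declaratively with a file along these lines (the data source name, database, and credentials are illustrative, not what this repo uses):

```yaml
apiVersion: 1
datasources:
  - name: Jenkins Postgres   # placeholder name
    type: postgres
    url: postgres:5432       # compose service name + port
    user: grafana            # placeholder user
    secureJsonData:
      password: example      # placeholder; inject a real secret
    jsonData:
      database: jenkins      # placeholder database name
      sslmode: disable       # plain connection inside the compose network
```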
Where does the instance backing monitoring.tlcpack.ai live now? Is SSL termination being handled by nginx?
Looks like a good first step -- we'll want to follow up with more Ansible pipeline stuff / Terraform code.
There are a bunch of lint failures, but this is mostly just to get the code public; we can fix them in a follow-up if we decide to support this more.
cc @areusch @konturn