Add monitoring stack #36

Merged on Jun 14, 2022 (1 commit)
driazati commented Jun 10, 2022

This adds a Grafana + Prometheus + Postgres + Loki setup, which is deployed via docker-compose and Ansible to a remote machine (e.g. an EC2 instance).

Two Docker images populate Postgres: one gathers data on test cases (only the failed tests are stored in Postgres), and the other gathers the hierarchy of data for Jenkins builds: jobs (`tvm`) -> builds (`main` or `PR-1234`) -> stages (`build: CPU`) -> steps (`run_a_script.sh`). Prometheus just scrapes Jenkins.

There are a bunch of lint failures, but this PR is mostly about getting the code public; we can fix them in a follow-up if we decide to support this further.

cc @areusch @konturn

driazati marked this pull request as ready for review on June 10, 2022 20:26
driazati force-pushed the mon2 branch 2 times, most recently from 9ce5a30 to b7c9a0f, on June 10, 2022 20:32
server grafana;
}

server {
Contributor

If we end up using an ALB for SSL termination, we might not need to deploy nginx at all.

Contributor Author

Yeah, that's all nginx is doing, so we could definitely get rid of it at some point.

apt update
apt install -y docker.io
apt install -y unzip
- name: Set up Docker Swarm
Contributor

I've never used Docker Swarm. What's the advantage of using it over plain Compose?

Contributor Author

For this use case, not much, since it's just a single machine. We could probably simplify down to plain Compose and be fine, but this is mostly cobbled together from copy-pasting, so there are questionable design decisions all over the place.

max-size: "20m"
max-file: "10"
networks:
- monitoring
Contributor

Does this network need to be defined?

Contributor Author

I think so, in order for services to reference each other by domain name (but I'm not 100% sure about that and didn't try it without).

Contributor

Yeah, the way I do it is to define the network directly in the same Compose file; see https://docs.docker.com/compose/networking/
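
For readers following this thread, here is a minimal sketch of what defining the network in the same Compose file might look like. It is not the file from this PR: the service entries and image names are illustrative assumptions, and only the `monitoring` network name comes from the diff under review.

```yaml
# Hypothetical compose file; services and images are placeholders, not the PR's actual config.
services:
  grafana:
    image: grafana/grafana
    networks:
      - monitoring        # attach the service to the network declared below
  prometheus:
    image: prom/prometheus
    networks:
      - monitoring

# Top-level declaration: Compose creates this network if it does not already exist.
networks:
  monitoring: {}
```

Services attached to the same network can reach each other by service name (for example, `prometheus` from the `grafana` container). Compose also creates a default network for a project when none is declared, so an explicit network mostly matters when services need to be isolated from each other or connected across projects.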

konturn commented Jun 13, 2022

Have you tried running this locally yet? Also, how does Prometheus authenticate with Jenkins?

driazati (Contributor Author)

There's no auth with Jenkins right now; the `/metrics` endpoint is open. We'll have to figure out how auth works with the Prometheus plugin (it looks like it is supported). I've run it locally for testing in the past, but there have been a few iterations since then.
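
As an illustration of the unauthenticated setup described above, a Prometheus scrape job for Jenkins could look roughly like the sketch below. The job name and target host are placeholders, not values from this PR, and the metrics path is taken from the comment above; it should match whatever path the Jenkins plugin is actually configured to serve.

```yaml
# Hypothetical prometheus.yml fragment; job name and target are assumptions for illustration.
scrape_configs:
  - job_name: jenkins
    metrics_path: /metrics          # path as mentioned in this thread; adjust to the plugin's setting
    scheme: https
    static_configs:
      - targets:
          - jenkins.example.com     # placeholder host, not the real Jenkins instance
```

If the endpoint is later put behind authentication, Prometheus scrape jobs accept a `basic_auth` block (with `username` and `password` or `password_file`), which would be one way to supply credentials.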

volumes:
- /etc/tvm/grafana:/var/lib/grafana
- /etc/tvm/grafana-provisioning/:/etc/grafana/provisioning
- /etc/tvm/email_template.html:/usr/share/grafana/public/emails/alert_notification.html
Contributor

Ah, is Grafana configured to send alert notifications? I'm guessing recipients and alarm conditions are all configured manually through the UI?

Contributor Author

This is possible but not set up right now (it needs to authenticate with Gmail or a similar provider to actually send mail).

Contributor

Yeah, we could also use something like SES if we want.

konturn commented Jun 13, 2022

How exactly does Postgres fit in here? Does Grafana consume from it as a data source? Is it linked to Prometheus/Loki in any way?

driazati (Contributor Author)

Postgres maintains a synced copy of Jenkins data, and most of what's on https://monitoring.tlcpack.ai/ comes from Postgres (basically, it copies the hierarchy of builds/jobs/steps from the Blue Ocean API into a queryable source). Other than build IDs, none of the data sources are linked in any way.

konturn commented Jun 13, 2022

Where does the instance backing monitoring.tlcpack.ai live now? Is SSL termination being handled by nginx?

konturn left a comment

Looks like a good first step. We'll want to follow up with more Ansible pipeline work and Terraform code.

driazati merged commit 09c5f7d into tlc-pack:main on Jun 14, 2022
mehrdadh pushed a commit to mehrdadh/ci that referenced this pull request Aug 2, 2022