
Openstack exporter hits the default Prometheus scrape timeout #214

Open
peppepetra opened this issue Dec 13, 2021 · 4 comments


@peppepetra

peppepetra commented Dec 13, 2021

In a big environment with:

  • ~40 hosts
  • ~1000 running VMs
  • ~1800 ports

openstack-exporter takes around 30 seconds to collect metrics, hitting the 15-second default Prometheus scrape timeout.

I have tried disabling slow and deprecated metrics with:

--disable-slow-metrics --disable-deprecated-metrics 

but I am only seeing 1-2 seconds of improvement.

It would be nice to have a configurable caching mechanism as described here

The most expensive metrics appear to be:

  • neutron-security_groups
  • nova-total_vms

With those two metrics disabled, the exporter returns in ~15 seconds, which still hits the scrape timeout.
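As a stopgap, the scrape timeout can be raised for this job on the Prometheus side; something along these lines (job name, target, and values are only illustrative, and it merely masks the slow collection rather than fixing it):

```yaml
scrape_configs:
  - job_name: openstack-exporter
    scrape_interval: 60s
    scrape_timeout: 55s               # must stay below scrape_interval
    static_configs:
      - targets: ['openstack-exporter-host:9180']
```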

@Hybrid512
Contributor

Probably related to #110 or #120

@afreiberger

I'd like to echo @peppepetra's suggestion for a caching mechanism that returns a cached report immediately, with tunable metrics-collection cycles happening in the background. In many use cases, I suspect a short, predictable delay between collection and reporting would not affect the decisions made from OpenStack capacity metrics.
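Roughly, I am picturing something like the sketch below (not actual openstack-exporter code; collectOpenStackMetrics, the interval, and the port are placeholders): the /metrics handler always answers from the last collected payload, and the expensive collection runs on its own schedule.

```go
package main

import (
	"log"
	"net/http"
	"sync"
	"time"
)

// metricsCache holds the most recently collected exposition payload.
type metricsCache struct {
	mu   sync.RWMutex
	body []byte
}

// refreshLoop runs the expensive OpenStack collection on a fixed interval.
func (c *metricsCache) refreshLoop(interval time.Duration) {
	for {
		payload := collectOpenStackMetrics() // placeholder for the real, slow collection
		c.mu.Lock()
		c.body = payload
		c.mu.Unlock()
		time.Sleep(interval)
	}
}

// ServeHTTP returns the cached payload immediately, so a Prometheus scrape
// never waits on the OpenStack APIs; the data is at most one interval old.
func (c *metricsCache) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	w.Write(c.body)
}

// collectOpenStackMetrics stands in for the real collectors.
func collectOpenStackMetrics() []byte {
	return []byte("# placeholder exposition output\n")
}

func main() {
	cache := &metricsCache{}
	go cache.refreshLoop(60 * time.Second)
	http.Handle("/metrics", cache)
	log.Fatal(http.ListenAndServe(":9180", nil))
}
```

With this shape, a scrape returns in milliseconds and the staleness is bounded by the refresh interval.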

@Hybrid512
Contributor

Well, I don't know about your specific situation, but in our case we discovered that the exporter was scraping metrics from empty domains created by Heat.
These are leftovers from Heat stacks that were not properly purged, and the problem is that the exporter scrapes each domain one by one even when it is empty, which takes quite a lot of time, especially when, as in our case, you have thousands of them.
This is what I already described in #110.
We're working on a PR for this; I hope we can submit it asap.
It is not a performance fix, just a way to filter out the domains you don't want to scrape, but in our case that is enough to resolve the issue.

@engel75

engel75 commented Mar 1, 2023

We are running an OpenStack environment with 3 AZs, a lot of customers (domains), and thousands of assets. After tuning (disabling a lot of metrics and using the probe endpoint to scrape service by service [compute, network, volume, ...]) we still hit timeouts, and some scrapes take more than 2 minutes.
We will try to work around this by triggering the scrape from a script and writing the result to a static web page, which then becomes our scrape target. So basically we "implement" caching ourselves.
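Roughly, that script could be as small as the sketch below (the exporter URL, timeout, and output path are assumptions about our setup, not fixed values):

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
	"time"
)

func main() {
	// Allow plenty of time: the collection behind /metrics can take minutes.
	client := &http.Client{Timeout: 5 * time.Minute}

	resp, err := client.Get("http://localhost:9180/metrics") // exporter address is an assumption
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Write to a file served by an existing web server; Prometheus scrapes that URL instead.
	f, err := os.Create("/var/www/html/openstack_metrics.prom")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if _, err := io.Copy(f, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```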

It would be awesome if someone extended the exporter to let us define and use caching.

For example:

--enable-cache
--cache-ttl=

So the result (metrics) would be cached for the configured TTL, and the next scrape would still return the outdated data but kick off a fresh metric scan in the background. The outdated data would be served until fresh data is available.

What do you think?
