Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce cardinality of Cinder volume status metrics #143

Merged
merged 1 commit into from
Dec 10, 2020

Conversation

dougszumski
Copy link
Contributor

Cinder volume status metrics contain highly variable labels, such as
volume and tenant ids. For Prometheus Server and other TSDBs such as
InfluxDB, it is best to avoid highly variable labels to limit the total
number of unique timeseries.

This patch removes the old, highly variable metrics and converts
them to simpler metrics tracking the number of volumes in particular
states.

@dougszumski
Copy link
Contributor Author

Hey all. Great project. We've been using this on some fairly large deployments and have found that some of the metrics produce a very large number of unique time series which hurts performance.

I'm interested on feedback on the right approach here - would you rather that the original metrics were kept for backwards compatibility, and perhaps marked for deprecation, or could we target some changes to reduce cardinality at a new major release?

@alexeymyltsev
Copy link
Collaborator

Hi @dougszumski, improvement of performance it is great, but we use those metrics for alerting. It is quite useful see in alert id of volume and project id for decrease time of investigation especialy when volume already deleted.

@dougszumski
Copy link
Contributor Author

Thanks @alexeymyltsev - I will have a think about the best way forward that won't break any existing use cases.

@niedbalski
Copy link
Member

niedbalski commented Nov 25, 2020

@dougszumski

Thank you for this contribution. One option is to split the metrics as volume_status_counter (or alike) and volume_status metrics, then you can use --disable-metric=cinder-volume_status to avoid collecting volume details.

Also, we merged the --disable-slow-metrics (disabled by default), that will avoid collecting metrics that have been identified as potentially slow in large deployments. Therefore, if we think volume_status might be problematic, we can flag it as slow as well.

@niedbalski
Copy link
Member

Hey @dougszumski

I'd like to include this in the next release, based on our chat here last week, are you planning to follow up on this patch?

Thanks!

@dougszumski
Copy link
Contributor Author

Thanks @niedbalski! I will try to follow up today, but otherwise early next week.

@dougszumski
Copy link
Contributor Author

It's looking like early next week now..

@niedbalski niedbalski added this to the 1.3.0 milestone Dec 1, 2020
@dougszumski dougszumski force-pushed the bugfix/volume_status branch 2 times, most recently from 2415bb1 to 3199bde Compare December 9, 2020 13:58
@dougszumski
Copy link
Contributor Author

@niedbalski does that look any better? I just gave it a quick test on a live system. Here's some example data without --disable-slow-metrics:

openstack_cinder_volume_status{bootable="true",id="e9a1c9d4-3022-4c63-873c-89d9c3a5564f",name="",size="20",status="in-use",tenant_id="ae2fd17124ac4962b5162c41733ed536",volume_type="__DEFAULT__"} 5
openstack_cinder_volume_status{bootable="true",id="ff7789f3-cd3d-475f-a9b5-311ddbf135e3",name="",size="20",status="in-use",tenant_id="698519617c0f4c4da742c4cb7706d769",volume_type="__DEFAULT__"} 5
openstack_cinder_volume_status{bootable="true",id="ffaa4e8e-d1f7-4875-9c39-85317c000278",name="",size="20",status="in-use",tenant_id="ae2fd17124ac4962b5162c41733e8536",volume_type="__DEFAULT__"} 5
# HELP openstack_cinder_volume_status_counter volume_status_counter
# TYPE openstack_cinder_volume_status_counter gauge
openstack_cinder_volume_status_counter{status="attaching"} 0
openstack_cinder_volume_status_counter{status="available"} 830
openstack_cinder_volume_status_counter{status="awaiting-transfer"} 0
openstack_cinder_volume_status_counter{status="backing-up"} 0
openstack_cinder_volume_status_counter{status="creating"} 0
openstack_cinder_volume_status_counter{status="deleting"} 0
openstack_cinder_volume_status_counter{status="detaching"} 2
openstack_cinder_volume_status_counter{status="downloading"} 0
openstack_cinder_volume_status_counter{status="error"} 0
openstack_cinder_volume_status_counter{status="error_backing-up"} 0
openstack_cinder_volume_status_counter{status="error_deleting"} 0
openstack_cinder_volume_status_counter{status="error_extending"} 0
openstack_cinder_volume_status_counter{status="error_restoring"} 0
openstack_cinder_volume_status_counter{status="extending"} 0
openstack_cinder_volume_status_counter{status="in-use"} 214
openstack_cinder_volume_status_counter{status="maintenance"} 0
openstack_cinder_volume_status_counter{status="reserved"} 1
openstack_cinder_volume_status_counter{status="restoring-backup"} 0
openstack_cinder_volume_status_counter{status="retyping"} 0
openstack_cinder_volume_status_counter{status="uploading"} 0

@niedbalski
Copy link
Member

niedbalski commented Dec 10, 2020

@dougszumski yes, LGTM, 2 minor nitpicks: 1) Can you rebase your changes with current master? 2) Can you add the 'volume_status' as a slow metric in the corresponding README.md?

Thanks for the work on this!

Cinder volume status metrics contain highly variable labels, such as
volume and tenant ids. For Prometheus Server and other TSDBs such as
InfluxDB, it is best to avoid highly variable labels to limit the total
number of unique timeseries. However, in smaller scale deployments
this isn't always a problem and some users like the additional
labels to enrich notifications.

In this patch we therefore add support for optionally disabling the
volume_status metrics via the slow metrics flag and add a new, simpler
metric for keeping track of cinder volumes statuses which shouldn't
cause issues in larger deployments.
@niedbalski
Copy link
Member

LGTM, merging.

@niedbalski niedbalski merged commit 9a9210d into openstack-exporter:master Dec 10, 2020
mzijdemans pushed a commit to mzijdemans/openstack-exporter that referenced this pull request Dec 27, 2022
Cinder volume status metrics contain highly variable labels, such as
volume and tenant ids. For Prometheus Server and other TSDBs such as
InfluxDB, it is best to avoid highly variable labels to limit the total
number of unique timeseries. However, in smaller scale deployments
this isn't always a problem and some users like the additional
labels to enrich notifications.

In this patch we therefore add support for optionally disabling the
volume_status metrics via the slow metrics flag and add a new, simpler
metric for keeping track of cinder volumes statuses which shouldn't
cause issues in larger deployments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants