Reduce cardinality of Cinder volume status metrics #143

dougszumski · 2020-11-25T09:50:28Z

Cinder volume status metrics contain highly variable labels, such as
volume and tenant ids. For Prometheus Server and other TSDBs such as
InfluxDB, it is best to avoid highly variable labels to limit the total
number of unique timeseries.

This patch removes the old, highly variable metrics and converts
them to simpler metrics tracking the number of volumes in particular
states.
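To make the cardinality argument concrete, here is a minimal sketch that counts distinct label combinations for an invented set of volumes. The volume data and the `series_count` helper are illustrative assumptions and are not taken from the exporter.

```python
def series_count(samples, label_names):
    """Number of distinct time series when only `label_names` are kept as labels."""
    return len({tuple(s[name] for name in label_names) for s in samples})


# Made-up inventory: 1000 volumes spread over 50 tenants and 3 states.
statuses = ["available", "in-use", "error"]
samples = [
    {"id": f"vol-{i}", "tenant_id": f"tenant-{i % 50}", "status": statuses[i % 3]}
    for i in range(1000)
]

# With the volume id as a label, every volume becomes its own series.
per_volume = series_count(samples, ["id", "tenant_id", "status"])
# With only the status label, the series count is bounded by the number of states.
per_status = series_count(samples, ["status"])
print(per_volume, per_status)
```

In this sketch the per-volume series count grows with the number of volumes, while the per-status count stays fixed however many volumes exist.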

dougszumski · 2020-11-25T09:54:46Z

Hey all. Great project. We've been using this on some fairly large deployments and have found that some of the metrics produce a very large number of unique time series, which hurts performance.

I'm interested in feedback on the right approach here - would you rather that the original metrics were kept for backwards compatibility, and perhaps marked for deprecation, or could we target some changes to reduce cardinality at a new major release?

alexeymyltsev · 2020-11-25T10:11:47Z

Hi @dougszumski, improving performance is great, but we use those metrics for alerting. It is quite useful to see the volume id and project id in an alert, because it reduces investigation time, especially when the volume has already been deleted.

dougszumski · 2020-11-25T10:20:21Z

Thanks @alexeymyltsev - I will have a think about the best way forward that won't break any existing use cases.

niedbalski · 2020-11-25T12:36:58Z

@dougszumski

Thank you for this contribution. One option is to split the metrics into volume_status_counter (or similar) and volume_status metrics; you can then use --disable-metric=cinder-volume_status to avoid collecting volume details.

Also, we merged the --disable-slow-metrics flag (disabled by default), which avoids collecting metrics that have been identified as potentially slow in large deployments. Therefore, if we think volume_status might be problematic, we can flag it as slow as well.
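As a rough illustration of how the two flags could interact, the sketch below models the selection logic in plain Python. It is not the exporter's code: the `enabled_metrics` helper and the `cinder-volume_status_counter` name are hypothetical, and only `cinder-volume_status` and the two flag names come from this thread.

```python
def enabled_metrics(all_metrics, slow_metrics, disabled=(), disable_slow=False):
    """Return the metrics that would still be collected.

    `disabled` models repeated --disable-metric values and `disable_slow`
    models --disable-slow-metrics; both are assumptions about the behaviour.
    """
    skip = set(disabled)
    if disable_slow:
        skip |= set(slow_metrics)
    return [m for m in all_metrics if m not in skip]


# Hypothetical metric registry for the Cinder collector.
ALL = ["cinder-volume_status", "cinder-volume_status_counter"]
SLOW = ["cinder-volume_status"]

default_set = enabled_metrics(ALL, SLOW)
without_slow = enabled_metrics(ALL, SLOW, disable_slow=True)
without_named = enabled_metrics(ALL, SLOW, disabled=["cinder-volume_status"])
print(default_set, without_slow, without_named)
```

Under these assumptions, flagging volume_status as slow gives large deployments a single switch to drop it, while smaller deployments keep the detailed labels by default.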

niedbalski · 2020-12-01T12:17:05Z

Hey @dougszumski

I'd like to include this in the next release. Based on our chat here last week, are you planning to follow up on this patch?

Thanks!

dougszumski · 2020-12-01T13:34:05Z

Thanks @niedbalski! I will try to follow up today, but otherwise early next week.

dougszumski · 2020-12-01T17:12:53Z

It's looking like early next week now.

dougszumski · 2020-12-09T15:35:02Z

@niedbalski does that look any better? I just gave it a quick test on a live system. Here's some example data without --disable-slow-metrics:

openstack_cinder_volume_status{bootable="true",id="e9a1c9d4-3022-4c63-873c-89d9c3a5564f",name="",size="20",status="in-use",tenant_id="ae2fd17124ac4962b5162c41733ed536",volume_type="__DEFAULT__"} 5
openstack_cinder_volume_status{bootable="true",id="ff7789f3-cd3d-475f-a9b5-311ddbf135e3",name="",size="20",status="in-use",tenant_id="698519617c0f4c4da742c4cb7706d769",volume_type="__DEFAULT__"} 5
openstack_cinder_volume_status{bootable="true",id="ffaa4e8e-d1f7-4875-9c39-85317c000278",name="",size="20",status="in-use",tenant_id="ae2fd17124ac4962b5162c41733e8536",volume_type="__DEFAULT__"} 5
# HELP openstack_cinder_volume_status_counter volume_status_counter
# TYPE openstack_cinder_volume_status_counter gauge
openstack_cinder_volume_status_counter{status="attaching"} 0
openstack_cinder_volume_status_counter{status="available"} 830
openstack_cinder_volume_status_counter{status="awaiting-transfer"} 0
openstack_cinder_volume_status_counter{status="backing-up"} 0
openstack_cinder_volume_status_counter{status="creating"} 0
openstack_cinder_volume_status_counter{status="deleting"} 0
openstack_cinder_volume_status_counter{status="detaching"} 2
openstack_cinder_volume_status_counter{status="downloading"} 0
openstack_cinder_volume_status_counter{status="error"} 0
openstack_cinder_volume_status_counter{status="error_backing-up"} 0
openstack_cinder_volume_status_counter{status="error_deleting"} 0
openstack_cinder_volume_status_counter{status="error_extending"} 0
openstack_cinder_volume_status_counter{status="error_restoring"} 0
openstack_cinder_volume_status_counter{status="extending"} 0
openstack_cinder_volume_status_counter{status="in-use"} 214
openstack_cinder_volume_status_counter{status="maintenance"} 0
openstack_cinder_volume_status_counter{status="reserved"} 1
openstack_cinder_volume_status_counter{status="restoring-backup"} 0
openstack_cinder_volume_status_counter{status="retyping"} 0
openstack_cinder_volume_status_counter{status="uploading"} 0
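For readers who want to see the aggregation spelled out, here is a minimal sketch of how per-volume records could be collapsed into the counter series shown above. It is written in Python for brevity and is not the exporter's implementation; the `KNOWN_STATUSES` subset, the helper names, and the sample volumes are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical subset of Cinder volume statuses. Declaring them up front
# lets empty states be reported explicitly as 0, as in the output above.
KNOWN_STATUSES = ("available", "error", "in-use")


def count_by_status(volumes, known_statuses=KNOWN_STATUSES):
    """Collapse per-volume records into one count per status."""
    counts = Counter(v["status"] for v in volumes)
    # Start from the known statuses so that empty states report 0.
    result = {status: counts.get(status, 0) for status in known_statuses}
    # Keep any status that was observed but not declared.
    for status, n in counts.items():
        result.setdefault(status, n)
    return result


def render(counts, name="openstack_cinder_volume_status_counter"):
    """Render the counts as Prometheus text exposition lines."""
    return [f'{name}{{status="{s}"}} {n}' for s, n in sorted(counts.items())]


# Made-up volumes; ids and tenant ids are dropped by the aggregation.
volumes = [
    {"id": "vol-1", "tenant_id": "tenant-a", "status": "in-use"},
    {"id": "vol-2", "tenant_id": "tenant-b", "status": "in-use"},
    {"id": "vol-3", "tenant_id": "tenant-a", "status": "reserved"},
]
counts = count_by_status(volumes)
for line in render(counts):
    print(line)
```

Emitting zeros for known states keeps the set of series stable between scrapes, which is one plausible reason the sample output lists every status.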

niedbalski · 2020-12-10T12:10:37Z

@dougszumski yes, LGTM, 2 minor nitpicks: 1) Can you rebase your changes with current master? 2) Can you add the 'volume_status' as a slow metric in the corresponding README.md?

Thanks for the work on this!

Cinder volume status metrics contain highly variable labels, such as volume and tenant ids. For Prometheus Server and other TSDBs such as InfluxDB, it is best to avoid highly variable labels to limit the total number of unique timeseries. However, in smaller-scale deployments this isn't always a problem, and some users like the additional labels to enrich notifications. In this patch we therefore add support for optionally disabling the volume_status metrics via the slow metrics flag, and add a new, simpler metric for keeping track of Cinder volume statuses, which shouldn't cause issues in larger deployments.

niedbalski · 2020-12-10T18:24:38Z

LGTM, merging.

niedbalski added the new-metrics label Dec 1, 2020

niedbalski added this to the 1.3.0 milestone Dec 1, 2020

niedbalski assigned dougszumski Dec 1, 2020

dougszumski force-pushed the bugfix/volume_status branch 2 times, most recently from 2415bb1 to 3199bde, December 9, 2020 13:58

dougszumski force-pushed the bugfix/volume_status branch from 3199bde to fecdf27, December 10, 2020 17:04

niedbalski merged commit 9a9210d into openstack-exporter:master Dec 10, 2020
