Update the metrics correlation RFC based on recent feedback (#124)
mitsuhiko committed Nov 21, 2023
1 parent 7e8cf40 commit 0b3bc14
Showing 1 changed file with 40 additions and 15 deletions: text/0123-metrics-correlation.md
@@ -10,12 +10,19 @@ This RFC addresses the high level metrics to span correlation system.
# Motivation

We believe the value of a good metrics solution lies in its correlation to traces and other signals.
This means that we need to store evidence of metrics on the spans in the form of metric summaries.
These summaries are automatically added when using the basic `metrics` API.

# Terms

* **Connected metric:** a metric that is connected to a trace and not just free-floating
* **Metric summary:** the concept of a summarized metric associated with a span
* **Measurements:** these are "legacy" transaction-level custom metrics. We would like to
  eventually align them, but it's unclear at the moment how.

# Basics

Whenever a metrics API is used it operates in either span-seeking or span-creating mode. Most
of the metrics APIs are span-seeking, which means that they record a measurement in relation
to the closest active span. Some APIs (such as `metrics.timing` when used with a code block)
will instead create a new span and bind the metric to it.
@@ -24,7 +31,7 @@ create a span and bind it.
```python
def process_batch(batch):
    processor = Processor()

    # This creates a span with op `metric.timer`
    with metrics.timing("processor.process_batch"):
        for item in batch:
            success = processor.process_item(item)
@@ -36,13 +43,13 @@

Each metric locally "aggregates" into something that represents a gauge and is persisted with
the closest span. In the above case, assuming a batch size of 5 where 3 items succeed and 2 fail,
the following summaries might be associated with the recorded span:

```json
{
  "span_id": "deadbeef",
  "op": "metric.timer",
  "_metrics_summary": {
    "d:processor.process_batch@millisecond": [
      {
        "min": 421.0,
@@ -96,7 +103,7 @@ The following correlations are useful for metrics to span queries:

## Timers

When a code block is timed with `metrics.timing` (or potentially a span is named with the
`metric` parameter) it emits a distribution as timing. That also creates and binds a span
and attaches that metric directly as a summary. In that case the tags for the metric
might also have to be recorded explicitly with the `metric_tags` parameter.

@@ -112,12 +119,16 @@ are equivalent:

```python
with metrics.timing("foo"):
    pass

with start_span(op="metrics.timing", metric="foo"):
    pass
```

(Note that the `metric` parameter is not something we are going to implement for the
time being.)

To find corresponding spans, the `min` and `max` values on span summaries can be used
for correlation. Tags associated with the timer are also automatically added to the
span.
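
As a rough sketch of how such a lookup could work (the helper and the exact summary
fields beyond `min` and `max` are assumptions, not part of the RFC):

```python
def span_matches_metric(span, metric_key, value):
    """Return True if a single metric value could have originated from this span.

    `span` is assumed to be a dict shaped like the JSON example above.
    """
    for summary in span.get("_metrics_summary", {}).get(metric_key, []):
        # The value must fall into the summarized range; a real implementation
        # would presumably also compare the metric tags recorded on the summary.
        if summary["min"] <= value <= summary["max"]:
            return True
    return False
```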

## Counters

@@ -134,7 +145,21 @@ the spans sorted by the highest `count` or highest total sum / max value.
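
A minimal sketch of that ranking, assuming each summary entry carries a `count` field
(the helper name and exact fields are illustrative only):

```python
def spans_by_counter(spans, metric_key, limit=10):
    """Rank candidate spans for a counter metric by their summarized count."""
    def total_count(span):
        return sum(
            entry.get("count", 0)
            for entry in span.get("_metrics_summary", {}).get(metric_key, [])
        )

    # Spans that incremented the counter the most come first.
    return sorted(spans, key=total_count, reverse=True)[:limit]
```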

# Open Questions

Here are some unresolved questions:

## Sets

For now the suggestion is that sets are only stored as "a value has been added to the set" but
not which value. We do not have a lot of product support for spans today, but at a later point
we might need to extend this.
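
Purely illustrative (the metric name and fields are made up): a set summary under this
proposal would only record that values were added, never the values themselves:

```python
# A hypothetical summary entry for a set metric: only the number of added
# values survives; the values themselves are never stored on the span.
set_summary = {
    "s:processor.seen_items@none": [
        {"count": 3},
    ],
}
```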

## Sampling of metrics

OpenTelemetry uses a rather elaborate system to select "exemplars". There is a chance
that an individual metric measurement is associated with a trace and span via the concept
of an exemplar. In our case we attach summaries to spans, which means that the dynamic
sampling system can evict them together. However, if we were to support OpenTelemetry
exemplars, we would need to figure out how to sample these properly.

If the volume of metric summaries also turns out to be too significant, we might have to
introduce a sample rate for metrics.
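
A sketch of what such a sample rate could look like on the recording side (the option
name and placement are hypothetical and not part of this RFC):

```python
import random

# Hypothetical option: fraction of metric emissions that also record a summary
# on the surrounding span. 1.0 would keep today's behavior.
METRICS_SUMMARY_SAMPLE_RATE = 0.1


def should_record_summary():
    """Decide whether this metric emission also gets summarized on the span."""
    return random.random() < METRICS_SUMMARY_SAMPLE_RATE
```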
