WIP: pipeline monitoring otep #259

Draft
kristinapathak wants to merge 1 commit into main

Conversation

kristinapathak

A continuation of @jmacd's work on #238 and #249

The goal of this OTEP is to define a semantic convention for metrics that describe the flow of data through a pipeline, providing insight both between and within segments of telemetry pipelines.

WIP.

Current main focuses:

  1. Measuring loss in the pipeline: providing examples that show how failures show up in the proposed metric instruments.
  2. Adding example scenarios and descriptions to diagrams.

Comment on lines +20 to +22
- `otelcol_outgoing_items`: Exported, dropped, and discarded items (Collector)
- `otelcol_incoming_items`: Received and inserted data items (Collector)
- `otelsdk_outgoing_items`: Exported, dropped, and discarded items (SDK)
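For illustration, a minimal sketch of how the Collector-side instruments above might be created with the OpenTelemetry Go metrics API; only the instrument names and descriptions come from the excerpt, while the meter name, unit, and helper function are assumptions:

```go
package pipelinemetrics

import (
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/metric"
)

// newCollectorCounters registers the two Collector-side instruments named in
// the excerpt above. The meter name and unit are placeholders.
func newCollectorCounters() (incoming, outgoing metric.Int64Counter, err error) {
	meter := otel.Meter("otelcol") // hypothetical meter name

	incoming, err = meter.Int64Counter("otelcol_incoming_items",
		metric.WithUnit("{item}"),
		metric.WithDescription("Received and inserted data items"))
	if err != nil {
		return nil, nil, err
	}

	outgoing, err = meter.Int64Counter("otelcol_outgoing_items",
		metric.WithUnit("{item}"),
		metric.WithDescription("Exported, dropped, and discarded items"))
	return incoming, outgoing, err
}
```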
Author

I'm pretty agnostic as to whether these should use periods or underscores. I kept underscores for now but please let me know if I should change them.

Author

@TylerHelmuth, you left feedback on Josh's PR about this. Please let me know your preference.


### Retries

*WIP: add details*
Author
@kristinapathak May 29, 2024

(Screenshot attached: "Screenshot 2024-05-29 at 3 28 17 PM".) My draft on this so far: what the outcomes look like when retrying three times and getting resource exhausted from the gateway for every attempt. (Red is a synchronous agent pipeline, green is async.)

Contributor

I think it would be nice if there was some kind of conservation property we could maintain in the presence of retries, where we have N errors and need a single success/failure status. It also seems to me that something similar will be applicable for the case of fanout-consumer logic.

Should we have a separate metric for the extra fanout factor, which is N-1 in both of these cases? I think this number will be needed somewhere to understand pipeline conservation through fanout and retries.
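As a hypothetical illustration of that conservation concern: if one incoming item is exported with three attempts that all fail, there are N = 3 error outcomes for a single incoming item; counting all three in the outgoing metric would break Incoming == Outgoing unless the extra N - 1 = 2 were recorded somewhere, e.g. in a separate retry/fanout metric.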


### Recommended conventional attributes

- `otel.error` (boolean): This is true or false depending on whether the
  outcome is considered a failure or a success. See the chart below.
- `otel.outcome` (string): This describes the outcome in a more specific
  way than `otel.error`.
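For illustration, a minimal sketch of how a component might attach these attributes when recording, assuming the OpenTelemetry Go metrics API; only the attribute keys come from the excerpt, and the helper and outcome value are hypothetical:

```go
package pipelinemetrics

import (
	"context"

	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// recordOutcome adds count items to an outgoing-items counter, tagged with the
// two attributes defined above. Only the attribute keys come from the excerpt;
// the rest is illustrative.
func recordOutcome(ctx context.Context, outgoing metric.Int64Counter, count int64, outcome string, isError bool) {
	outgoing.Add(ctx, count, metric.WithAttributes(
		attribute.Bool("otel.error", isError),
		attribute.String("otel.outcome", outcome),
	))
}
```

A caller could then record, say, `recordOutcome(ctx, outgoing, 12, "retryable", true)` for a batch of 12 items that failed with a retryable error.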
Author

@TylerHelmuth, you recommended adding `pipeline` to the `otel.` prefix here (i.e. `otel.pipeline.error`). Do you mean for all of the attributes below? I'm not sure how these are all attributes of an OTel pipeline.


@kristinapathak As mentioned at the SIG, I am interested in adding a similar attribute to the `otelcol_exporter_send_failed_*` metrics as part of open-telemetry/opentelemetry-collector#10158.

I would be happy to use `outcome` if that is thought best.

My one concern is that this attribute is already used on another metric we use in our org, specifically the Micrometer-generated `http.server.requests` metric, which defines its own enum of outcome values. I see that this is not defined for the HTTP semconv, but I just thought it worth noting here for reference.

Author

I can see how `outcome` will be used by a variety of metrics. My hope is that with the `otel` prefix (i.e. `otel.outcome`) there is no conflict with the attribute name.

Do the outcome values defined here work to solve open-telemetry/opentelemetry-collector#10157, or is more detail needed? If others are also in favor of this attribute, my hope is that you can update your PR to match this. 🙂

Member
@djaglowski left a comment

Overall, this proposal strikes me as more intuitive than previous iterations, but I still have a few questions regarding collector pipelines.

Comment on lines +47 to +48
the OpenTelemetry Collector as Collector pipelines. A Collector can contain
multiple Collector pipelines which can contain multiple segments. Each segment
Member

> A Collector can contain multiple Collector pipelines which can contain multiple segments.

I'm unclear what this is saying. Is it saying that a single Collector pipeline contains multiple segments?

If so, how are those segments defined? For example, in the following pipeline, what are the segments?

receivers: [r1, r2]
processors: [p1, p2]
exporters: [e1, e2]

Author

The segments would be defined as:

  - `r1, r2, p1, p2, e1`
  - `r1, r2, p1, p2, e2`

Member

If I understand correctly, each collector pipeline contains a segment per exporter, where each of these segments contains all the receivers and all the processors of the pipeline, in addition to the exporter.

Can we update the language in this section to state this more clearly? Currently, it reads as "a receiver, zero or more processors, and an exporter", which doesn't appear accurate.

Can we also state "Components can be a part of multiple segments" before describing the relationship between segments and collector pipelines, since it's a prerequisite to understanding?

Comment on lines +150 to +159
exporters. If the pipeline is synchronous, the outcome for the incoming item is
recorded based on the rules in the below order:

1. If there is a permanent error, that is used as the outcome. If there are
multiple permanent errors, choose them in the following order:
`rejected`, `deferred:rejected`, `unknown`, `deferred:unknown`.
2. If there is a transient error, that is used as the outcome. If there are
multiple transient errors, choose them in the following order:
`dropped`, `deferred:dropped`, `timeout`, `deferred:timeout`, `exhausted`,
`deferred:exhausted`, `retryable`, `deferred:retryable`.
Member

Since these rules are described as applying to synchronous pipelines, should they include deferred outcomes?

Author

Oops, that's a good point! Deferred doesn't belong here.
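
For illustration, a minimal sketch of how that precedence could be applied when a synchronous segment reduces several per-exporter outcomes to a single one, with the deferred variants dropped per this discussion; the package, function, and the "success" fallback are assumptions, not part of the OTEP:

```go
package pipelinemetrics

// errorPrecedence orders error outcomes from highest to lowest precedence for
// a synchronous pipeline, following rules 1 and 2 above with the deferred
// variants removed.
var errorPrecedence = []string{
	"rejected", "unknown", // permanent errors (rule 1)
	"dropped", "timeout", "exhausted", "retryable", // transient errors (rule 2)
}

// mergeOutcomes picks the single outcome to record for an incoming item when
// multiple exporters report different outcomes for it.
func mergeOutcomes(outcomes []string) string {
	seen := make(map[string]bool, len(outcomes))
	for _, o := range outcomes {
		seen[o] = true
	}
	for _, o := range errorPrecedence {
		if seen[o] {
			return o
		}
	}
	// No error outcome was reported; "success" is a placeholder for whatever
	// success value the OTEP ultimately defines.
	return "success"
}
```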


Additional examples of these outcomes can be found in the Appendix.

### Collector Pipelines With Multiple Exporter Components
Member

There are some other arrangements of components which I'm trying to fit to the proposed model. Some of them may be worth including in this document as well. Here's how I'm understanding them:

  1. A single collector pipeline with multiple exporters. As already noted, this includes a fanout point. The document describes how to aggregate outcomes from multiple exporters, effectively ensuring that Incoming(Segment) == Outgoing(Segment).
  2. A single collector pipeline with multiple receivers. This is fairly straightforward and common. The incoming items are just summed together. Probably doesn't require a dedicated section but we could include it for symmetry.
  3. A single receiver shared by multiple collector pipelines. This is the other type of fanout point in the collector. From here there are some similar considerations to the first case, but instead of fanning out to multiple exporters, we instead fan out to entire pipelines. What is the relationship between an item arriving at such a receiver and the outcomes of passing it to multiple pipelines? For example, one pipeline may successfully export the item while the other encounters an error which propagates back to the receiver. Does this count as "origin:received" for both pipelines, with each pipeline then showing a different outcome?
  4. A single exporter shared by multiple collector pipelines. This is the other type of merge point in the collector. In this case, I think synchronous outcomes can resolve back to a specific pipeline, but async outcomes may not be related to any specific pipeline. For example, if 2 pipelines each send 10 items to an exporter, which then batches all 20 items together into a single export request, the outcome may be "20 deferred:rejected", but it would be incorrect for either pipeline to incorporate that count directly. Is there any way to handle this? Otherwise, maybe this is just a caveat for interpreting deferred outcomes.

Contributor

I wonder about adding to alternatives considered:

The collector's logic for `fanoutconsumer` uses `multierr`, building on Go's `Unwrap() []error` idiom to enclose all the errors. In my opinion, it would be nice to see `fanoutconsumer` manage the logic of deciding how to transform multiple errors into a single error, so that we could specify that `fanoutconsumer` dictates how the N-to-1 problem is resolved and the observability mechanism just follows whatever it decides.
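
To illustrate the idiom being referenced, a small sketch using the standard library's `errors.Join` in place of `multierr` (an approximation, not the collector's actual fanoutconsumer code):

```go
package pipelinemetrics

import "errors"

// flattenExportErrors shows the Unwrap() []error idiom: a joined error carries
// one wrapped error per failed downstream consumer, so observability code
// could, in principle, read one outcome per branch out of a single error value.
func flattenExportErrors(errs ...error) []error {
	joined := errors.Join(errs...) // nil when every element of errs is nil
	if joined == nil {
		return nil
	}
	if u, ok := joined.(interface{ Unwrap() []error }); ok {
		return u.Unwrap()
	}
	return []error{joined}
}
```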

Contributor
@jmacd left a comment

This looks great, thank you @kristinapathak.

I think I would approve it once all the "WIP" sections are fleshed out, but they're minor, and I think this document stands in sufficient detail for an implementation to be prototyped. Probably the next step is to prototype these metrics in the collector and an SDK.

Comment on lines +347 to +348
*WIP: Figure this out. This is a bit subjective. What does an end user expect
when calculating total items dropped in failure?*
Contributor

To me, the idea of "total" signifies starting from the beginning of the pipeline at the SDKs: taking the sum of all the SDK-inserted items and comparing it against some point later in the pipeline to measure how many of the original items are lost somehow. This could mean looking at a gateway collector's exporter counts and comparing them to the SDK-inserted counts, for example.
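
As an illustrative (made-up) example of that reading: if SDKs across the fleet record 1,000 inserted items while the gateway collector's exporters record 990 successfully exported items, then 10 of the original items were lost somewhere in the pipeline, whichever segment actually dropped them.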

