OTLP over message systems #157

Open · wants to merge 4 commits into base: main
`text/0000-otlp-over-messaging-systems.md` (110 additions, 0 deletions)
# OTLP over messaging systems
**Member:** Please make sure to rename your file to the PR # you've been assigned (0157).

Best practices for transporting OTLP data over a message system.
**Member:** Do we have any performance best practices for message systems, based on the existing Collector Kafka implementation? For example, should you prefer to batch into many small messages or into fewer large messages?

**Author:** I would not make it part of the spec, just like it is not part of the OTLP/gRPC spec.

## Motivation

This proposal brings consistency and guidelines for transporting
OTLP data over messaging systems. A non-exhaustive list of products
in this category:

* Apache Kafka
* Apache Pulsar
* Google Pubsub
* AWS SNS
* RabbitMQ

Using an intermediate messaging system is in some cases preferable to
a direct network connection between the telemetry producer and consumer.
Reasons for using such an intermediary include:

* Protecting against backend failure
* Security Policies
* Network Policies
* Buffering

An additional motivation for a consistent payload definition is that
OTLP data could be transferred from one messaging system to another
simply by reading the payload from one system and writing it to the
other, without any transformation.

## Explanation

Because the OTLP payload that is sent over messaging systems is
well-defined, it’s easy to implement new systems consistently.
An implementation should at least support `otlp_proto`, meaning that the
payload is Protobuf serialized from `ExportTraceServiceRequest` for traces,
`ExportMetricsServiceRequest` for metrics, or `ExportLogsServiceRequest` for
logs.
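As a sketch of what producing such a message could look like (the `Message` type, the attribute names, and the `newOTLPProtoMessage` helper below are hypothetical illustrations, not defined by this OTEP or by any messaging client):

```go
package main

import "fmt"

// Message is a hypothetical stand-in for a messaging-system message
// (a Kafka record, Pub/Sub message, etc.) with key-value attributes.
type Message struct {
	Attributes map[string]string
	Payload    []byte
}

// newOTLPProtoMessage wraps an already Protobuf-serialized
// Export(Trace/Metric/Log)ServiceRequest. The attribute names are
// illustrative only; this OTEP does not define them.
func newOTLPProtoMessage(signal string, payload []byte) Message {
	return Message{
		Attributes: map[string]string{
			"content-type": "application/x-protobuf",
			"otlp-signal":  signal, // "traces", "metrics", or "logs"
		},
		Payload: payload,
	}
}

func main() {
	msg := newOTLPProtoMessage("traces", nil /* serialized ExportTraceServiceRequest bytes */)
	fmt.Println(msg.Attributes["otlp-signal"]) // prints "traces"
}
```

The payload stays opaque bytes to the messaging system; only the attributes hint at how to decode it.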

Optionally, an implementation could support `otlp_json` or other alternative
encodings like Jaeger or Zipkin payloads. Alternative encodings should be
configured by an `encoding` field in the configuration file.
**Member:** What configuration file is this referring to? Does this assume that every implementation of OTLP over a messaging system must have a configuration file? It seems like a stretch to me. Can it be an in-memory config option? Or an API parameter of a library that implements this OTEP?

For the user it’s beneficial that both an exporter and a receiver are
implemented in the OpenTelemetry Collector, but SDK developers could
also implement language-specific exporters.

For log data, implementors of a receiver are encouraged to also support
receiving raw data from a topic. This enables scenarios where
non-OpenTelemetry log producers publish a stream of logs, either
structured or unstructured, to that topic, and the receiver wraps the
log data in a configurable resource.

## Internal details

The default implementation must implement `otlp_proto`, meaning that the
payload is Protobuf serialized from `ExportTraceServiceRequest` for traces,
`ExportMetricsServiceRequest` for metrics, or `ExportLogsServiceRequest` for
Comment on lines +56 to +58

**@tigrannajaryan (Member), May 17, 2021:** One thing that is not clear to me is whether we want to include the signal type in the serialized payload so that the recipient can tell what the signal is. Since this OTEP does not define it, a recipient that works with multiple signals needs another way to figure it out (often ending up using the channel name). I think this is an important capability that adds value and has no obvious downsides.

This is also an issue we want to solve for the OTLP/File format: open-telemetry/opentelemetry-specification#1443 (comment)
It will be good to solve this here as well, and in the same way we will for OTLP/File, so this likely requires thinking through the OTLP/File design as well.

**Member:** This doesn't help with the file format at all, but Kafka and Pubsub (maybe others) have key-value headers/attributes, which are mentioned in "Open questions". The signal type and encoding (JSON or proto) could go in there.

**Author:** I've added a comment in the OTLP/File discussion about CloudEvents. Certainly, if you want to allow multiplexing logs, traces, and metrics over a single topic, CloudEvents is a good fit, and it would align with this OTEP as well, since the spec defines bindings for different messaging systems:

**Member:** @alexvanboxel I read the comment you left in the OTLP/File issue, but I am a bit confused why it talks about CloudEvents or Avro.

My comment here is about how an OTLP payload that is encoded as Protobuf or JSON should be recorded in messaging systems such that the signal type information is included. I think we should keep it as simple as possible; any additional concepts and technology introduced (CloudEvents or Avro) are, IMO, an unnecessary complication.

If we want to record the type of the signal, there are fairly simple ways to achieve it while staying with the existing set of technologies. For example, we can add this message to the OTLP Protobuf definitions:

```proto
// This may be also named simply Request or Payload or something like that
message ExportServiceRequest {
  oneof request {
    ExportTraceServiceRequest traces = 1;
    ExportMetricsServiceRequest metrics = 2;
    ExportLogsServiceRequest logs = 3;
  }
}
```

`ExportServiceRequest` can then be serialized and become the payload in the messaging system. Recipients can deserialize it and use the right signal from the `request` field.

**@alexvanboxel (Author), May 20, 2021:** @tigrannajaryan the consensus is that we want to signal the type of the message, so I'm already changing my application. Let me demonstrate what I mean by leveraging the CloudEvents spec (as it's a spec, not a technology). In my Google Pubsub implementation, it describes what headers to use. So I could simply use:

```go
func (receiver *pubsubReceiver) detectEncoding(attributes map[string]string) (Encoding, error) {
	ceType := attributes["ce-type"]
	ceContentType := attributes["ce-datacontenttype"]
	if strings.HasSuffix(ceContentType, "application/x-protobuf") {
		switch ceType {
		case "org.opentelemetry.otlp.traces.v1":
			return OtlpProtoTrace, nil
		case "org.opentelemetry.otlp.metrics.v1":
			return OtlpProtoMetric, nil
		case "org.opentelemetry.otlp.logs.v1":
			return OtlpProtoLog, nil
		}
	} else if strings.HasSuffix(ceContentType, "application/json") {
		switch ceType {
		case "org.opentelemetry.otlp.traces.v1":
			return OtlpJsonTrace, nil
		case "org.opentelemetry.otlp.metrics.v1":
			return OtlpJsonMetric, nil
		case "org.opentelemetry.otlp.logs.v1":
			return OtlpJsonLog, nil
		}
	}
	return Unknown, nil
}
```

Granted, it's heavier than a `oneof`, but it leaves the flexibility of even signaling OTLP/JSON encoding.

logs. If an implementation supports other encodings, an `encoding` field should
be added to the configuration to make them switchable.

| value | trace | metric | log | description |
|---|---|---|---|---|
| `otlp_proto` (default) | X | X | X | protobuf serialized from the `Export(Trace/Metric/Log)ServiceRequest` |
| `otlp_json` | X | X | X | proto3 json representation of a `Export(Trace/Metric/Log)ServiceRequest` |
| `jaeger_proto` | X | - | - | the payload is deserialized to a single Jaeger proto `Span` |
| `jaeger_json` | X | - | - | the payload is deserialized to a single Jaeger JSON Span using `jsonpb` |
| `zipkin_proto` | X | - | - | the payload is deserialized into a list of Zipkin proto spans |
| `zipkin_json` | X | - | - | the payload is deserialized into a list of Zipkin V2 JSON spans |
| `zipkin_thrift` | X | - | - | the payload is deserialized into a list of Zipkin Thrift spans |
Comment on lines +66 to +70

**Member:** These are unrelated to OTLP. I think they shouldn't be in this OTEP.

**Author:** I will remove it, but I'll leave it here to give other people a chance to comment.

Comment on lines +66 to +70

**@jrcamp, May 17, 2021:** I'm proposing changes to these in the PR linked above. The types would be something like zipkinv1/thrift, zipkinv2/protobuf. Aside from the formatting, it treats zipkinv1 and zipkinv2 as two separate protocols instead of putting them both behind zipkin.

**Author:** I think I'm going to remove the other encodings from the spec, as @tigrannajaryan suggests.

| `raw_string` | - | - | X | see `Log specific collector receiver` |
| `raw_binary` | - | - | X | see `Log specific collector receiver` |

Above you’ll find a non-exhaustive list of possible encodings. Only `otlp_proto`
is mandatory, and it is the default.
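For example, in the Collector's existing Kafka receiver, the encoding is switched with a configuration field along these lines (a sketch; consult the component's README for the exact schema):

```yaml
receivers:
  kafka:
    # otlp_proto is assumed to be the default; other values
    # would follow the table above.
    encoding: otlp_proto
```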

### Log specific collector receiver

As flow control is the hardest part of the implementation, adding a feature
for reading raw log events avoids having an additional parallel
implementation just for that purpose.

The `encoding` field should be used to indicate that the data received is not the
**Member:** What is meant by "field" here?

default `otlp_proto` data, but raw log data. The receiver should construct a valid
OTLP message for each raw message received. Valid encodings are `raw_string`
and `raw_binary`, which control the type the data will have when set
in the `body` of the OTLP message.

As raw log data doesn’t have a resource attached to it, the receiver should add
a generic resource and instrumentation library message around the raw messages. The
instrumentation library name should be set to the name of the receiver and the version
to that of the collector. Defining the exact resource should be done in the pipeline,
for example using the `resourceprocessor`.
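A minimal sketch of that wrapping logic, using hypothetical stand-in structs rather than the generated OTLP Protobuf types:

```go
package main

import "fmt"

// Stand-ins for the OTLP log protos; the real types live in the
// generated OpenTelemetry Protobuf packages.
type LogRecord struct{ Body interface{} }

type InstrumentationLibraryLogs struct {
	Name, Version string // receiver name, collector version
	LogRecords    []LogRecord
}

type ResourceLogs struct {
	Resource map[string]string // generic; refined later, e.g. by resourceprocessor
	Library  InstrumentationLibraryLogs
}

// wrapRaw builds a valid OTLP-shaped envelope around one raw message.
// encoding is "raw_string" or "raw_binary" and controls the body type.
func wrapRaw(encoding string, data []byte, receiver, version string) ResourceLogs {
	var body interface{} = data // raw_binary: bytes body
	if encoding == "raw_string" {
		body = string(data) // raw_string: string body
	}
	return ResourceLogs{
		Resource: map[string]string{}, // intentionally generic
		Library: InstrumentationLibraryLogs{
			Name:       receiver,
			Version:    version,
			LogRecords: []LogRecord{{Body: body}},
		},
	}
}

func main() {
	rl := wrapRaw("raw_string", []byte("hello world"), "kafkareceiver", "0.27.0")
	fmt.Println(rl.Library.LogRecords[0].Body) // prints "hello world"
}
```

Note that the resource is left empty on purpose: the receiver only attributes the data to itself via the instrumentation library, and the pipeline fills in the real resource.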

## Prior art and alternatives

The `kafkareceiver` and `kafkaexporter` already implement this OTEP as
**Commenter:** Note that I'm proposing some changes to these:

open-telemetry/opentelemetry-collector#3044
open-telemetry/opentelemetry-collector#3200

Still a WIP.

described in the OpenTelemetry Collector. The description in this OTEP
makes both of them compliant without modification. Although it
**Member:** This is good. But let's also be aware this is not a must-have requirement. If we believe what the Collector does is not the best we can do, then we should be able to propose something different in this OTEP and fix the Collector (it is still in beta, and breaking changes are allowed).

doesn't implement the raw logging support, this could be added
without conflicting with the implementation.

## Open questions

This proposal doesn't take advantage of attributes that some systems
**Member:** +1. I think it would be good to put the encoding and signal type in there.

**Contributor:** +1. I think having a standard convention for knowing what signal is coming in (either force separate channels for each signal, or a standard header of some sort to disambiguate) would be good. OTLP effectively has this in its gRPC definition via the separate services.

**Author:** Well, gRPC can handle this over the same endpoint (a separate service is just a path in the HTTP/2 header).

I'm leaning toward the CloudEvents proposal, where you can (at least have the possibility to) have multiple types over one topic.

support. Should it? Would it be useful to look at the CloudEvents
spec, to leverage the conventions for attributes?

No guarantee of ordering can be given, because not all systems support ordering.
Is that a problem?
**Member:** This should not be a problem, since OTLP conceptually does not guarantee order of delivery in the network protocol.