chore: RFC for handling discarded events #14708

bruceg · 2022-10-03T23:03:43Z

Ref: #1772 #12217 #13549 #16432 #17962

netlify · 2022-10-03T23:03:48Z

✅ Deploy Preview for vector-project canceled.

Name	Link
🔨 Latest commit	`861bf04`
🔍 Latest deploy log	https://app.netlify.com/sites/vector-project/deploys/63634bbc418f2500081eb788

spencergilbert · 2022-10-04T16:34:08Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+We will also add a new command-line option and environment variable to opt into the stricter
+validation for users that want the additional assurance this provides.
+
+#### Simplify Discarded Output Handling


I find the names used here awkward, but admittedly don't have any suggestions.

An alternative name for disposition could be on_error (or on_discard).

It could be, but only as it applies to the error output. The proposed disposition applies to all named outputs, including those from the datadog_agent source and the remap transform.

spencergilbert · 2022-10-04T16:46:54Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+- Should the option for turning unhandled output enforcement into a hard error be written as a more
+  generic switch to turn _all_ deprecations into errors?


Yes. This feels similar to firewall/WAF products having "advisory" and "blocking" mode. As a learner, it's nice to get things up and running and be warned of dangerous configurations. Once I'm comfortable and depending on the product, I want a "strict" mode that more forcefully calls things out.

spencergilbert · 2022-10-04T16:50:44Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+- What should be done with the existing discarded event metrics? Should they always be emitted, or
+  only when the output isn't consumed by another component? Do we need another disposition marker to
+  indicate discards are not to be counted as errors?


🤔 If we're forcing it to be handled, I would think they could be counted like we do with other event streams - events/bytes sent/received. Practically, that seems like a hard sell given the work we've just put into the discarded metrics and compatibility with the UI.

Yeah, I agree that if we're not dropping them on the floor, then they aren't really discarded... just routed elsewhere.

Although maybe with true discarded event handling, "discarded" metrics would simply be the indication of how many discarded events were sent by that component, and then if the event made it to a component where we just dropped/rejected it, without further daisy chained error outputs... that would be when we emit an actual error metric?

Dunno, I feel like we could argue the definition of this stuff until the cows come home.

Hmm, yeah, the relationship between this and the discarded event metric will require some thought.

I agree with tobz here. In the data flow from source to sink, every component sends an event on to the next component, so the final discard will be raised by the last component.
I see the importance of the discarded metrics: they give the true number of events. Subsequent metrics from the DLQ component help us understand the lag between an event being discarded and being written to the DLQ, as well as any loss among these events. We will need indicators, for example the difference between the number of discarded events and the number of discarded events written to the DLQ component. That difference needs to be close to zero in a given window.

TL;DR: we need discarded events metrics, and some more.
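The indicator described in this comment could be sketched roughly as below. This is only an illustration: `CounterSnapshot`, its field names, and `dlq_gap` are hypothetical and do not correspond to existing Vector metrics. It assumes two monotonically increasing counters sampled at the start and end of a window.

```rust
/// Hypothetical snapshot of two counters taken at one point in time.
/// The field names are illustrative, not real Vector metric names.
#[derive(Clone, Copy, Debug)]
pub struct CounterSnapshot {
    /// Events a component has discarded so far.
    pub discarded_total: u64,
    /// Discarded events the DLQ component has written so far.
    pub dlq_written_total: u64,
}

/// Number of events discarded during the window that were not (yet)
/// written to the DLQ. A value close to zero is the healthy case.
pub fn dlq_gap(start: CounterSnapshot, end: CounterSnapshot) -> u64 {
    // Per-window deltas; saturating_sub guards against counter resets.
    let discarded = end.discarded_total.saturating_sub(start.discarded_total);
    let written = end.dlq_written_total.saturating_sub(start.dlq_written_total);
    discarded.saturating_sub(written)
}
```

A non-zero gap can mean either lag (the writes have not happened yet) or loss, so in practice it would probably need to be watched over several consecutive windows.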

lukesteensen · 2022-10-05T15:36:16Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+- Incompatible event type
+- Unhandled metric value type
+- Unwired component output
+- Failure to send to next component


Is this type of failure possible in cases other than a non-graceful shutdown (i.e. panic)?

Maybe, at the end of a graceful shutdown where components are being forcefully terminated. I think we could still send the data to the discard output in hopes it is still connected, and/or mark them as failed.

If we're already forcefully terminating components then it seems unlikely that we'd be able to successfully send the events anywhere. I ask mostly because this would be a failure case that'd apply to every source and transform, even those that may not have any other way of dropping events, and I'd rather focus this feature around the components where there's a more meaningful type of failure possible.

lukesteensen · 2022-10-05T15:37:39Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+We will introduce a new output to all components that would otherwise discard events, named
+`discards`.


How will this relate to the existing errors output from the remap transform?

I'm not exactly sure. I am of the mind that it should be split between errors and discards (i.e. abort), but that introduces a significant breaking change. If, on the other hand, we call the new output errors, this works for remap but not everything that is discarded is an error.

Ah, never mind, the remap transform already has a dropped output that is what we need, so I think all the rest of this RFC is about errors in particular and not generic "discards".

lukesteensen · 2022-10-05T15:40:22Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+potential of increased overhead and reduced performance if extra components have to be configured
+just to handle the outputs that now need to be connected somewhere. To reduce this overhead, we will
+add a new optional configuration setting to indicate the internal disposition of each output:
+`outputs.NAME.disposition`. This will have two non-default values, `"drop"` to mark all events going


Do you see this config applying to other outputs in the future? It sounds like we're only talking about a single hardcoded discards output, so I'm not sure that we need to nest the configuration in such a way that implies it could be applied to other outputs as well.

What's the benefit of outputs.NAME.disposition vs something like on_discard = {send,drop,reject}?

We have some sources, like datadog_agent IIRC, that already have multiple outputs. If users do not want to wire up one of those outputs and we enforce that all outputs be handled, then now they need a blackhole for that output as well because we special-cased the discard handling. With the generic handling, all cases are covered.

That's a good point, I didn't consider the effect on those sources. I am still a little bit hesitant to have configuration that refers to a named output that won't actually exist (when configured to drop or reject), since that could result in some confusion. I'm not sure I have a great alternative though, aside from introducing separate config on those sources that controls which events types are emitted.

tobz

This is looking good so far, but I had a few clarifying questions to make sure I understand the scope.

tobz · 2022-10-05T14:38:23Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+    /// Default behavior, send the event on to further components that name this as an input.
+    Send,
+    /// Discard the event and mark it as delivered.
+    Drop,
+    /// Discard the event and mark it as failed.
+    Reject,


These make sense individually, but I find it... let's call it "weird", that I could have a component which selects another component's discards output but then that upstream component could basically just disable the sending of those discarded events even though I've named that output specifically?

If that's not the intention, then maybe this is just a docs problem, but that was my initial impression seeing these three distinct options.

That will be something caught by the configuration verification stage, that "handled" outputs are not named as an input. This is an issue I discussed later in the document in "Alternatives" as not having a great solution IMO.
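The verification mentioned in this reply could look roughly like the sketch below. It is not Vector's implementation: `OutputIssue`, `check_outputs`, and the `component.output` string keys are hypothetical, and only the three `Disposition` variants come from the RFC excerpt above. It flags both cases raised in the thread: an output marked drop or reject that is still named as an input, and an output left at the default that nothing consumes.

```rust
use std::collections::{HashMap, HashSet};

/// Mirrors the three dispositions quoted from the RFC.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Disposition {
    /// Send the event on to components that name this output as an input.
    Send,
    /// Discard the event and mark it as delivered.
    Drop,
    /// Discard the event and mark it as failed.
    Reject,
}

/// Hypothetical problems a configuration check could report.
#[derive(Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum OutputIssue {
    /// The output is set to drop or reject, yet a component names it as an input.
    HandledButConsumed(String),
    /// The output is left at `Send`, yet no component names it as an input.
    Unhandled(String),
}

/// Checks every declared output against the set of outputs used as inputs.
pub fn check_outputs(
    dispositions: &HashMap<String, Disposition>,
    consumed: &HashSet<String>,
) -> Vec<OutputIssue> {
    let mut issues: Vec<OutputIssue> = dispositions
        .iter()
        .filter_map(|(output, disposition)| {
            match (*disposition, consumed.contains(output)) {
                (Disposition::Send, false) => Some(OutputIssue::Unhandled(output.clone())),
                (Disposition::Drop, true) | (Disposition::Reject, true) => {
                    Some(OutputIssue::HandledButConsumed(output.clone()))
                }
                _ => None,
            }
        })
        .collect();
    // HashMap iteration order is unspecified, so sort for stable reporting.
    issues.sort();
    issues
}
```

Whether `Unhandled` is reported as a warning or a hard error would depend on the strict-validation option discussed elsewhere in the RFC.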

tobz · 2022-10-05T14:45:08Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+Simplifying the configuration of previously unhandled outputs presents a bit of a Catch-22. Without
+any help from Vector, this would require users to explicitly route the unhandled outputs to a new
+`blackhole` sink. So, we want to add a shorthand to avoid the extra configuration that would
+require, and potentially the extra running component internally.


This section confuses me a little bit.

Why would a sink be needed if operators can control the output behavior at the component itself? Which is to say, if a source can configure its discarded events output to drop or reject... why would we ever set up a downstream component to collect those events unless it was actually going to do something with them?

The "without any help from Vector" is trying to say that we are not helping sinks by allowing a source to configure its output to drop or reject. I will try to reword to clarify.

tobz · 2022-10-05T14:53:55Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+- What should be done with the existing discarded event metrics? Should they always be emitted, or
+  only when the output isn't consumed by another component? Do we need another disposition marker to
+  indicate discards are not to be counted as errors?


Yeah, I agree that if we're not dropping them on the floor, then they aren't really discarded... just routed elsewhere.

Although maybe with true discarded event handling, "discarded" metrics would simply be the indication of how many discarded events were sent by that component, and then if the event made it to a component where we just dropped/rejected it, without further daisy chained error outputs... that would be when we emit an actual error metric?

Dunno, I feel like we could argue the definition of this stuff until the cows come home.

jszwedko

Nice write-up!

jszwedko · 2022-11-02T22:21:43Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+We will also add a new command-line option and environment variable to opt into the stricter
+validation for users that want the additional assurance this provides.
+
+#### Simplify Discarded Output Handling


An alternative name for disposition could be on_error (or on_discard).

jszwedko · 2022-11-02T22:29:09Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+
+### Implementation
+
+#### Configuration


I wonder if we could have global configuration for the disposition as well? That might simplify it for simple use-cases. Like setting disposition = "drop" at the global level and have it apply to all components.

Could also do the same at the component level for those that already have multiple outputs in addition to letting them configure the disposition per-output.
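One way to read this suggestion is as a layered lookup in which the most specific setting wins. The sketch below assumes that precedence (per-output, then component, then global, then the default of sending); the RFC does not specify it, and `resolve_disposition` is a hypothetical helper.

```rust
/// Mirrors the three dispositions quoted from the RFC.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Disposition {
    Send,
    Drop,
    Reject,
}

/// Picks the effective disposition for one output.
/// Assumed precedence: per-output, then component-level, then global,
/// falling back to `Send` when nothing is configured.
pub fn resolve_disposition(
    global: Option<Disposition>,
    component: Option<Disposition>,
    output: Option<Disposition>,
) -> Disposition {
    output.or(component).or(global).unwrap_or(Disposition::Send)
}
```

With this shape, a simple setup could set only the global value, while a component with several outputs could still override a single one.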

jszwedko · 2022-11-02T22:32:15Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+transforms, and simply modify them to write failed events to that output. This would have the
+benefit of higher performance, as we could likely rework most existing sinks to avoid needing the
+clone and would not need the extra finalizer and task to handle forwarding the events. On the other
+hand, this performance loss appears to be relatively minor, likely in the low single digit percent


This seems like another area that could benefit from having a copy-on-write data structure for events to avoid a full clone (I know we do have this optimization now for when the event is completely unchanged, which should help here too).

jszwedko · 2022-11-02T22:33:02Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+
+## Outstanding Questions
+
+- Do we want to automate the process of rewriting configurations that have missing output handlers,


I think my suggestion above of allowing global configuration of the disposition would allow users to slowly take advantage of this new behavior component-by-component.

jszwedko · 2022-11-02T22:35:32Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+- What should be done with the existing discarded event metrics? Should they always be emitted, or
+  only when the output isn't consumed by another component? Do we need another disposition marker to
+  indicate discards are not to be counted as errors?


Hmm, yeah, the relationship between this and the discarded event metric will require some thought.

jszwedko · 2022-11-02T22:36:58Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+buffers would be a good approach, but that path towards that is less clear than solving the problem
+in isolation.
+
+## Outstanding Questions


Are there any cross-cutting concerns between these new, conditional, error outputs and the configuration schema?

jszwedko · 2022-11-02T22:38:09Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+
+### User Experience
+
+#### Add Discarded Event Output


What will be the format of the events flowing to the errors output? Will we follow the precedent set by the remap transform of wrapping the event with some additional metadata about the failure? I think it'd be good to call that out explicitly in here.

Maybe this should take advantage of the new metadata namespacing too? That feature will only exist for logs, but we could extend it to metrics and traces. Also, will we offer any guarantees about the metadata, if we do include it? I could see users starting to route on it based on the type of failure. I'd suggest we push back on that and clearly document that the metadata is opaque and just for human consumption for now.

To validate your intuition, I would be tempted to route based on the metadata from an error. Hopefully a specific example will help:

Kafka sinks can throw errors when the broker sees a message that's too large. I'm finding it difficult to know the exact message size before sending, so it's hard to build a filter upstream. I could see handling this situation by routing the error. I might route to a console log or split the message and try again.

At a minimum, I could also envision using the error data to know whether dropping the message is okay or if I need to trigger alerts.
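To make the use case concrete, routing on failure metadata might look something like the sketch below. Everything in it is hypothetical: the RFC does not define a structured failure type, and the earlier comment suggests keeping the metadata opaque, so `FailureKind`, `Route`, and `route_failure` only illustrate what users would want if the metadata were structured and stable.

```rust
/// Hypothetical structured failure metadata attached to an errored event.
#[derive(Clone, Debug, PartialEq, Eq)]
pub enum FailureKind {
    /// The downstream broker rejected the message as too large.
    MessageTooLarge { size_bytes: usize },
    /// Any other failure, carried as free-form text.
    Other(String),
}

/// Hypothetical destinations for an errored event.
#[derive(Debug, PartialEq, Eq)]
pub enum Route {
    /// Split the message and try sending it again.
    SplitAndRetry,
    /// Dropping is not known to be safe, so raise an alert.
    Alert,
}

/// Chooses a destination from the failure metadata.
pub fn route_failure(kind: &FailureKind) -> Route {
    match kind {
        FailureKind::MessageTooLarge { .. } => Route::SplitAndRetry,
        FailureKind::Other(_) => Route::Alert,
    }
}
```

Supporting this would require the metadata format to carry some stability guarantee, which is exactly the trade-off raised in the comment above.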

spencergilbert · 2022-11-02T22:45:47Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+### Out of scope
+
+- Handling of discarded data in sources before it is translated into events.
+- Handling of discarded sink request data after it is translated from events.


This should be entirely accounted for by end-to-end acknowledgements? Or is there a gap I'm not thinking of?

This is making me realize that end-to-end acks is another cross-cutting concern here.

The situation here plays out like this: an event is marked as discarded after all of its retries have been exhausted. For a discarded event, the following will happen:

1. Push to the DLQ, with retries.
2. If the write to the DLQ succeeded, acknowledge the event to the source.
3. If the write to the DLQ failed, drop the event and acknowledge it to the source.

Is this a fair understanding?
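The flow proposed in this comment can be sketched as follows. It encodes the commenter's reading only, which the thread does not confirm: `AckOutcome`, `handle_discarded`, and the `write_to_dlq` callback are hypothetical names, and events are plain strings for brevity. In both outcomes the source would be acknowledged; the return value records whether the event was stored or lost.

```rust
/// What happened to one discarded event before the source is acknowledged.
#[derive(Debug, PartialEq, Eq)]
pub enum AckOutcome {
    /// The event was written to the dead-letter queue.
    StoredInDlq { attempts: u32 },
    /// Every write attempt failed and the event was dropped.
    Dropped { attempts: u32 },
}

/// Tries to write a discarded event to the DLQ up to `max_attempts` times.
/// `write_to_dlq` is a caller-supplied stand-in for the real DLQ sink.
pub fn handle_discarded<F>(event: &str, max_attempts: u32, mut write_to_dlq: F) -> AckOutcome
where
    F: FnMut(&str) -> Result<(), String>,
{
    for attempt in 1..=max_attempts {
        if write_to_dlq(event).is_ok() {
            // Steps 1 and 2: the write succeeded, so the source can be acknowledged.
            return AckOutcome::StoredInDlq { attempts: attempt };
        }
    }
    // Step 3: retries exhausted; the event is dropped but still acknowledged.
    AckOutcome::Dropped { attempts: max_attempts }
}
```

Acknowledging in the `Dropped` case is the part most worth confirming, since it means the source never learns that the event was lost.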

spencergilbert · 2022-11-02T22:50:29Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+We will introduce a new output to all components that would otherwise discard events, named
+`errors`. Note that some components already have such a named output. This proposal standardizes
+that output naming and provides additional support for handling it.


I wonder if we should lean into existing nomenclature, namely "dead letter queues", to make this feel more familiar to users.

spencergilbert · 2022-11-02T23:05:04Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+pub enum OutputBuffer {
+    Drop,
+    Reject,
+    Store(Vec<EventArray>),


Store here is for events that are going to be passed to a downstream component?

Correct. I'll note that.
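A rough sketch of how the three variants might behave when a component writes to the buffer is shown below. The variant names come from the quoted hunk; the `push` method, the `EventStatus` type, and the use of `Vec<String>` as a stand-in for `EventArray` are assumptions made for illustration, as is the mapping of `Drop` to delivered and `Reject` to failed (borrowed from the disposition comments earlier in the thread).

```rust
/// Stand-in for Vector's `EventArray`; here just a batch of strings.
pub type EventArray = Vec<String>;

/// Hypothetical finalization status for events that are not forwarded.
#[derive(Debug, PartialEq, Eq)]
pub enum EventStatus {
    Delivered,
    Rejected,
}

/// Variant names as quoted from the RFC.
pub enum OutputBuffer {
    /// Discard events and mark them as delivered.
    Drop,
    /// Discard events and mark them as failed.
    Reject,
    /// Hold events that will be passed to a downstream component.
    Store(Vec<EventArray>),
}

impl OutputBuffer {
    /// Returns the status applied immediately, or `None` when the batch was
    /// stored and its status depends on the downstream component.
    pub fn push(&mut self, events: EventArray) -> Option<EventStatus> {
        match self {
            OutputBuffer::Drop => Some(EventStatus::Delivered),
            OutputBuffer::Reject => Some(EventStatus::Rejected),
            OutputBuffer::Store(batches) => {
                batches.push(events);
                None
            }
        }
    }
}
```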

netlify · 2022-11-03T05:03:59Z

❌ Deploy Preview for vrl-playground failed.

Name	Link
🔨 Latest commit	`861bf04`
🔍 Latest deploy log	https://app.netlify.com/sites/vrl-playground/deploys/63634bbc94e212000abde16c

neuronull · 2022-11-03T13:56:51Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+- The proposed changes will impact emission and receipt of end-to-end acknowledgements.
+- The proposed configuration options should fit in the configuration schema, but may have
+  implications for its interpretation, particularly around the topology connections.
+- Handling discarded events in any way will change the interpretation of several exiting internal


Suggested change

-- Handling discarded events in any way will change the interpretation of several exiting internal
+- Handling discarded events in any way will change the interpretation of several existing internal

fuchsnj · 2022-11-03T14:42:40Z

Overall I'm looking at what I assume are the two most common situations users would want.

1. A new user trying things out, who just wants to see Vector work. They don't care about errors and want them to just drop. Forcing them to create sinks to consume all errors is cumbersome and annoying. There should be an easy way to disable errors.
2. A more enterprise user who is concerned about events being dropped due to errors and wants to have a single sink that stores all the errors somewhere (S3, Kafka, etc.).
   - What does the config look like here? If there are 50 components in a config, do they have to individually list out every single error output in the one sink? That seems very tedious (even if Vector enforces that all the error outputs are used).
   - Once Vector is running and starts collecting errors, the first question I have is: what failed? Do we have information on which error occurred and which component it failed in?
   - Will the events be in a format where I can re-ingest them? Let's say there was an outage or config error which caused millions of logs to be dead-lettered. The issue was fixed, and I now want to re-process all the dead-lettered events. (I assume this is out of scope, but it's good to at least think about.)

benzaita · 2022-12-07T13:11:27Z

rfcs/2022-08-25-12217-handling-discarded-events.md

+
+### User Experience
+
+#### Add Discarded Event Output


It's a bit out of scope for this RFC, but very much related -- adding an event output for events that successfully went through a sink would allow customers to provide better observability into events that were successfully delivered. For example, we would like to extract metrics from delivered events so we can show our customers how many events we successfully delivered. As it is now, we can only do that before the sink.

Feel free to file an issue about this, but as worded, it sounds like what you're looking for are the existing component metrics such as component_sent_events_total, which sinks do emit.

I think this might clarify what I meant: #14708 (comment)

jszwedko · 2023-02-10T16:57:36Z

Note we will probably need to resolve #14969 as part of this work as it will make the incorrect type definitions much more common.

inftl · 2023-09-27T13:20:14Z

Ideally this is also supported in sinks. Since a sink can fail inserting, it would be nice to have a backup sink.

bruceg added the type: task, domain: core, and domain: rfc labels Oct 3, 2022

bruceg requested review from tobz, jszwedko and lukesteensen October 3, 2022 23:03

github-actions bot removed the domain: core label Oct 3, 2022

chore: RFC for handling discarded events

c90126e

bruceg force-pushed the bruceg/discarded-events-rfc branch from 6c4cb78 to c90126e Compare October 3, 2022 23:04

spencergilbert reviewed Oct 4, 2022

lukesteensen reviewed Oct 5, 2022

tobz reviewed Oct 5, 2022

bruceg added 2 commits October 5, 2022 10:38

Clarify "sink batch buffers"

46ff788

Clarify output default deprecation

c77e906

lukesteensen mentioned this pull request Oct 5, 2022

Avoiding transform bugs based on mutable state #14743

Open

bruceg added 5 commits October 7, 2022 09:14

Remove discards that are under operator's control

0d65821

Clarify wording on unhandled output configuration alternative

0c23242

Clarify wording of outstanding question

055189a

Clarify that this RFC is about discards due to errors

d5a9e6d

Rework output disposition configuration

5037eb6

bruceg requested review from tobz, lukesteensen and spencergilbert October 7, 2022 22:20

bruceg added 2 commits October 14, 2022 11:21

Removed impossible error case - Failure to send to next component

fde7454

Remove a couple of outstanding questions

ae27d1a

JeanMertz mentioned this pull request Oct 20, 2022

Allow rerouting discarded events #14899

Open

jszwedko reviewed Nov 2, 2022

spencergilbert reviewed Nov 2, 2022

Apply review feedback

861bf04

neuronull reviewed Nov 3, 2022

benzaita reviewed Dec 7, 2022

benzaita reviewed Dec 7, 2022

jszwedko mentioned this pull request Dec 28, 2022

Make it easier to inspect sink output #7356

Open

This was referenced Jan 9, 2023

Route transform generates warning for unmatched events #15865

Closed

Sink Nats event buffer settings not works #15857

Closed

jszwedko mentioned this pull request Feb 1, 2023

Logs changed from version 0.24 to 0.25 for elasticsearch sink #15886

Closed

jszwedko closed this Feb 10, 2023

jszwedko reopened this Feb 10, 2023

jszwedko mentioned this pull request Feb 14, 2023

Route percentage of logs through different pipelines #16432

Open

jszwedko marked this pull request as draft March 24, 2023 13:48

jszwedko mentioned this pull request Jul 24, 2023

Cannot receive error response if data is throttled by Kinesis data stream #17962

Open

hhromic mentioned this pull request Aug 18, 2023

feat(route transform): Add option to enable/disable unmatched output #18309

Merged

jszwedko mentioned this pull request Nov 7, 2023

Loadbalance data between sinks #18164

Open

tanushri-sundar mentioned this pull request Jan 25, 2024

Vector acks messages sent by S3 source even when delivery failed #19711

Closed

DimDroll mentioned this pull request May 9, 2024

Support dead letter queue on sinks #1772

Open
