- Start Date: 2023-03-26
- RFC Type: feature
- RFC PR: #79
- RFC Status: active
This RFC captures the changes required to Sentry to handle exception groups, including changes to SDKs, protocol schema, event processing, and UI.
Several programming languages have a concept of a group of multiple, unrelated exceptions aggregated into a single exception.
- Python `ExceptionGroup` (PEP 654)
- .NET `AggregateException`
- JavaScript `AggregateError`
- Go `Unwrap() []error` (see also)
- Java `Throwable.getSuppressed()`
There may be others. We will use the term "exception group" throughout the design. It applies to the concept, regardless of language.
Sentry needs a way to capture exception groups, and present them in a meaningful way. Simply capturing the exception group by itself is insufficient, and current workarounds (in .NET) are problematic.
See also:
An exception group is an exception unto itself. It may have been caused by the exceptions in the group, but there is no implied causal relationship between those exceptions.
In other words, given the following:
- `Group A`
  - `Exception 1`
  - `Exception 2`
`Group A` is an exception group that may have been caused by either `Exception 1`, `Exception 2`, or both of them.
However, it is generally not true that `Exception 1` was caused by `Exception 2` or vice versa.
Furthermore:
- `Exception 1` and `Exception 2` might be of the same type, or they might be of different types.
- There can be `n` number of exceptions in an exception group (`n >= 1`).
- There can be a stack trace on each of the exceptions within an exception group, as well as on the group itself.
- Just like any other exception, each exception within an exception group can have a chain of inner exceptions.
- An inner exception can also be an exception group.
- Exception groups can be present at any level. There is no requirement that the start of the chain is an exception group.
Thus, other valid examples that could occur in an application include the following:
- `Group B`
  - `Exception 1`
    - `Exception 2`
      - `Exception 3`
- `Group C`
  - `Exception 1`
    - `Exception 1a`
      - `Exception 1b`
  - `Exception 2`
    - `Exception 2a`
      - `Exception 2b`
- `Group D`
  - `Exception 1`
    - `Exception 1a`
      - `Exception 1b`
  - `Group E`
    - `Exception 2`
      - `Exception 2a`
        - `Exception 2b`
    - `Exception 3`
      - `Exception 3a`
        - `Exception 3b`
- `Exception 1`
  - `Exception 1a`
    - `Group E`
      - `Exception 2`
        - `Exception 2a`
          - `Exception 2b`
      - `Exception 3`
        - `Exception 3a`
          - `Exception 3b`
The meaning of a normal exception is fairly straightforward:
- An exception is something that went wrong in the application, and thus represents an issue to resolve.
- If there is another exception that occurred to cause this one, that is assigned to the "inner exception" or "cause".
- The chain of exceptions thus is linear.
This gets more complicated when an exception group is involved:
- The exception group might represent the issue to resolve, or the issue might better be represented by one or more of the inner exceptions within the group.
- Depending on the language, there might be a cause that is separate from any of the exceptions in the group.
- Thus, the chain of exceptions is more tree-like than linear.
SDKs send exceptions to Sentry using the Exception Interface on an event. One or more exceptions can be sent in the `values` array of the interface. Multiple values represent a chain of exceptions in a causal relationship, sorted from oldest to newest. For example:
{
"exception": {
"values": [
{"type": "TypeError", "value": "Invalid Type!"},
{"type": "ValueError", "value": "Invalid Value!"},
{"type": "RuntimeError", "value": "Something went wrong!"}
]
}
}
In the above example, an issue is created in Sentry for the `RuntimeError` exception.
Issue Title: "RuntimeError: Something went wrong!"
+==============================
+ RuntimeError: Something went wrong!
+------------------------------
+ (stack trace)
+==============================
+==============================
+ ValueError: Invalid Value!
+------------------------------
+ (stack trace)
+==============================
+==============================
+ TypeError: Invalid Type!
+------------------------------
+ (stack trace)
+==============================
This design is linear, and thus cannot represent an exception group containing more than one exception. Even when a group contains only a single exception, the resulting issue in Sentry might be titled and grouped in an undesirable manner.
Consider:
{
"exception": {
"values": [
{"type": "TypeError", "value": "Invalid Type!"},
{"type": "ValueError", "value": "Invalid Value!"},
{"type": "RuntimeError", "value": "Something went wrong!"},
{"type": "ExceptionGroup", "value": "Exception Group (1 sub-exception)"}
]
}
}
Sending this to Sentry would result in an issue titled `"ExceptionGroup: Exception Group (1 sub-exception)"`.
In some contexts that may be desired, but in others the expectation is that the issue would be titled `"RuntimeError: Something went wrong!"`.
The exception group type in .NET is `AggregateException`.
- The exceptions of the group are stored in the `InnerExceptions` property.
- Like other exceptions, it also has an `InnerException` property, which is interpreted as the cause of this exception.
  - Its value is always the same as `InnerExceptions[0]`.
The exception group type in Python is `ExceptionGroup`.
- The exceptions of the group are stored in the `exceptions` attribute.
- Like other exceptions, it can have a `__cause__` and/or a `__context__` attribute.
  - `__context__` is an indirect cause, assigned if another exception occurs while handling the exception.
  - `__cause__` is a direct cause, assigned if raised with the exception (using the `from` keyword).
    - Setting it suppresses any `__context__` value, when displayed in the stack trace.
- There is no requirement that `exceptions[0]` be either of these.
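
The difference between the two Python attributes can be seen in a minimal sketch. The helper names (`raise_implicit`, `raise_explicit`, `capture`) are illustrative only and are not part of any SDK.

```python
def raise_implicit():
    # A second exception raised while handling the first one:
    # Python records the first one as __context__.
    try:
        raise KeyError("missing")
    except KeyError:
        raise ValueError("while handling")


def raise_explicit():
    # `raise ... from ...` records the first exception as __cause__
    # and marks the context as suppressed for display purposes.
    try:
        raise KeyError("missing")
    except KeyError as original:
        raise ValueError("direct cause") from original


def capture(fn):
    """Run fn and return the exception it raised, or None."""
    try:
        fn()
    except Exception as exc:
        return exc
    return None


implicit = capture(raise_implicit)
explicit = capture(raise_explicit)
```

In the implicit case only `__context__` is populated. In the explicit case `__cause__` is populated, `__suppress_context__` is set, and `__context__` still refers to the exception that was being handled, so an SDK walking both attributes has to avoid reporting the same object twice.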
The exception group type in JavaScript is `AggregateError`.
- The errors of the group are stored in the `errors` property.
- Like other errors, it has a `cause` (singular) property.
  - There is no requirement that `errors[0]` be the same as `cause`.
NOTE: This has changed significantly from prior versions of the draft RFC.
We will focus on having the SDKs capture all available information, and adjust the issue grouping rules in Sentry to use that new information.
The SDKs will do the following:
- Capture the entire tree of exceptions represented by the exception group.
- Capture any other chained exceptions that are not part of the exception group.
- Use the existing `exception` value of the Sentry event.
- Add a few new fields to each exception's `mechanism` data, as described below.
Sentry will do the following:
- Take the new mechanism fields into account when grouping issues. See Sentry Issue Grouping below for more details.
- Present a user interface that depicts the structure of the exception group on the issue details page. See Sentry UI Changes below for more details.
The new grouping rules MUST be fully implemented in Sentry.io and a published version of self-hosted Sentry before any SDK can release a non-preview version that includes these changes. The SDK's release notes should mention the required version of Sentry.
Although the protocol changes are backwards compatible, the current problems with exception groups will persist or be exacerbated without the new grouping rules in place.
SDKs that currently require the user to opt-in to send chained exceptions can continue to do so without side-effects. SDKs that send chained exceptions by default may see an immediate change to exception grouping after implementing this feature.
The Exception Mechanism Interface will have the following new fields added:
`source`: An optional string value describing the source of the exception.
- The SDK should populate this with the name of the property or attribute of the parent exception that this exception was acquired from. In the case of an array, it should include the zero-based array index as well.
- Python examples: `"__context__"`, `"__cause__"`, `"exceptions[0]"`, `"exceptions[1]"`
- .NET examples: `"InnerException"`, `"InnerExceptions[0]"`, `"InnerExceptions[1]"`
- JavaScript examples: `"cause"`, `"errors[0]"`, `"errors[1]"`
`is_exception_group`: An optional boolean value, set to `true` when the exception is the exception group type specific to the platform or language. The default is `false` when omitted.
- For example, exceptions of type `ExceptionGroup` (Python), `AggregateException` (.NET), and `AggregateError` (JavaScript) should have `"is_exception_group": true`. Other exceptions can omit this field.
`exception_id`: An optional numeric value providing an ID for the exception relative to this specific event.
- The SDK should assign simple incrementing integers to each exception in the tree, starting with `0` for the root of the tree. In other words, when flattened into the list provided in the `exception` values on the event, the last exception in the list should have ID `0`, the previous one should have ID `1`, the next previous should have ID `2`, etc.
`parent_id`: An optional numeric value pointing at the `exception_id` that is the parent of this exception.
- The SDK should assign this to all exceptions except the root exception (the last to be listed in the `exception` values).
The `exception_id` and `parent_id` fields work in conjunction to represent the hierarchical nature of the tree of exceptions.
If not provided, the previous interpretation will be assumed, which is that each exception in the list of exception `values` is a child of the one immediately following it in the list.
The Exception Interface will not change in structure, but it will change in interpretation with regard to multiple exception values.
- The previous interpretation was: "Multiple values represent chained exceptions."
- The new interpretation will be: "Multiple values are related by the optional `mechanism.exception_id` and `mechanism.parent_id` fields. When not present, multiple values represent chained exceptions."
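
The two interpretations can be sketched as a single consumer-side helper. This is only an illustration: the function name `resolve_relationships` is hypothetical, and the choice to fall back to the previous interpretation whenever any value lacks an `exception_id` is an assumption, not something this RFC specifies.

```python
from typing import Any, Dict, List, Optional, Tuple


def resolve_relationships(
    values: List[Dict[str, Any]],
) -> List[Tuple[int, Optional[int]]]:
    """Return (exception_id, parent_id) for each value, in list order."""
    mechanisms = [value.get("mechanism") or {} for value in values]

    # New interpretation: use the IDs supplied by the SDK.
    # Assumption: it applies only when every value carries an exception_id.
    if values and all("exception_id" in m for m in mechanisms):
        return [(m["exception_id"], m.get("parent_id")) for m in mechanisms]

    # Previous interpretation: a linear chain, where the last value is the
    # root and each value is a child of the one immediately following it.
    count = len(values)
    resolved = []
    for index in range(count):
        exception_id = count - 1 - index
        parent_id = exception_id - 1 if exception_id > 0 else None
        resolved.append((exception_id, parent_id))
    return resolved
```

With this shape, events from older SDKs and events from updated SDKs can be handled by the same downstream code, since both end up as explicit parent links.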
When setting the `mechanism.type` field, SDKs should use the following guidelines:
- For the root exception (the last to be in the `exception.values` list), set `mechanism.type` to the name of the integration that produced the exception (as was the case before this proposal). If the exception was captured manually, set the `mechanism.type` to `"generic"`.
- For all other exceptions in the list, set the `mechanism.type` to `"chained"`. This will indicate that the exception is part of the chain of exceptions stemming from the root exception (regardless of whether it is in an exception group or not).
Do not omit the `mechanism.type` field, nor send it empty or null.
When setting the `value` field of the exception for an exception group type, SDKs should only deliver the meaningful part of the exception message, excluding any string that may have been automatically added by their platform.
For example:
- In Python, the `value` field should not contain details such as `" (2 sub-exceptions)"`. Use the `message` attribute to get the raw message from an `ExceptionGroup`. If there is no message, omit the `value` and just send `type`.
- In .NET, the `value` field should not contain details such as `" (Exception 1) (Exception 2)"`. The `Message` property may need to be modified to remove the appended inner messages.
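
For the Python case, a minimal sketch might look like the following. The function name `exception_group_value` is hypothetical, and the regular-expression fallback for objects without a `message` attribute is an assumption added for illustration; the RFC itself only asks for the `message` attribute to be used.

```python
import re
from typing import Optional

# Suffix that Python appends to str(ExceptionGroup),
# for example " (2 sub-exceptions)" or " (1 sub-exception)".
_SUB_EXCEPTION_SUFFIX = re.compile(r" \(\d+ sub-exceptions?\)$")


def exception_group_value(exc: BaseException) -> Optional[str]:
    """Return the text to send as `value`, or None if it should be omitted."""
    # Preferred: the raw message, without any platform-added details.
    message = getattr(exc, "message", None)
    if isinstance(message, str):
        return message or None

    # Fallback (assumption): strip the automatically added suffix.
    text = _SUB_EXCEPTION_SUFFIX.sub("", str(exc))
    return text or None
```

Returning `None` for an empty message lets the caller omit `value` and send only `type`.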
The .NET SDK previously implemented an option called `KeepAggregateException`. This flag should be deprecated in favor of always sending the entire chain of aggregate exceptions as explained in this RFC. This may affect existing issue grouping, and should be noted in the change log when released.
Other SDKs should not implement a similar option.
Given the Python code:
try:
    raise RuntimeError("something")
except:
    raise ExceptionGroup("nested",
        [
            ValueError(654),
            ExceptionGroup("imports",
                [
                    ImportError("no_such_module"),
                    ModuleNotFoundError("another_module"),
                ]
            ),
            TypeError("int"),
        ]
    )
The event would contain:
{
"exception": {
"values": [
{
"type": "TypeError",
"value": "int",
"mechanism": {
"type": "chained",
"source": "exceptions[2]",
"exception_id": 6,
"parent_id": 0
}
},
{
"type": "ModuleNotFoundError",
"value": "another_module",
"mechanism": {
"type": "chained",
"source": "exceptions[1]",
"exception_id": 5,
"parent_id": 3
}
},
{
"type": "ImportError",
"value": "no_such_module",
"mechanism": {
"type": "chained",
"source": "exceptions[0]",
"exception_id": 4,
"parent_id": 3
}
},
{
"type": "ExceptionGroup",
"value": "imports",
"mechanism": {
"type": "chained",
"source": "exceptions[1]",
"is_exception_group": true,
"exception_id": 3,
"parent_id": 0
}
},
{
"type": "ValueError",
"value": "654",
"mechanism": {
"type": "chained",
"source": "exceptions[0]",
"exception_id": 2,
"parent_id": 0
}
},
{
"type": "RuntimeError",
"value": "something",
"mechanism": {
"type": "chained",
"source": "__context__",
"exception_id": 1,
"parent_id": 0
}
},
{
"type": "ExceptionGroup",
"value": "nested",
"mechanism": {
"type": "exceptionhook",
"handled": false,
"is_exception_group": true,
"exception_id": 0
}
}
]
}
}
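
A traversal that yields the ordering and IDs shown in the event above could be sketched as follows. This is not the implementation of any Sentry SDK. The names `flatten_exception_tree` and `is_group` are hypothetical, groups are detected by duck-typing on an `exceptions` sequence (an assumption that stands in for a check against the platform's group type), and fields such as `handled` and stack traces are left out.

```python
from typing import Any, Dict, List, Optional


def is_group(exc: BaseException) -> bool:
    # Assumption: anything exposing an `exceptions` list or tuple is a group.
    return isinstance(getattr(exc, "exceptions", None), (list, tuple))


def flatten_exception_tree(
    root: BaseException, root_mechanism_type: str = "generic"
) -> List[Dict[str, Any]]:
    """Flatten an exception tree into the `exception.values` list."""
    values: List[Dict[str, Any]] = []
    seen = set()  # guards against visiting the same object twice

    def visit(exc: BaseException, source: Optional[str], parent_id: Optional[int]) -> None:
        if id(exc) in seen:
            return
        seen.add(id(exc))

        group = is_group(exc)
        exception_id = len(values)  # IDs increase in depth-first pre-order
        mechanism: Dict[str, Any] = {
            "type": root_mechanism_type if parent_id is None else "chained",
            "exception_id": exception_id,
        }
        if source is not None:
            mechanism["source"] = source
        if parent_id is not None:
            mechanism["parent_id"] = parent_id
        if group:
            mechanism["is_exception_group"] = True

        item: Dict[str, Any] = {"type": type(exc).__name__, "mechanism": mechanism}
        # For groups, prefer the raw message over str(exc).
        text = getattr(exc, "message", None) if group else None
        if not isinstance(text, str):
            text = str(exc)
        if text:
            item["value"] = text
        values.append(item)

        # Chained exceptions first, then the members of the group.
        cause, context = exc.__cause__, exc.__context__
        if cause is not None:
            visit(cause, "__cause__", exception_id)
        if context is not None and context is not cause:
            visit(context, "__context__", exception_id)
        if group:
            for index, child in enumerate(exc.exceptions):
                visit(child, f"exceptions[{index}]", exception_id)

    visit(root, None, None)
    values.reverse()  # the root exception (ID 0) must be last in the list
    return values
```

Assigning IDs in visiting order and reversing the list at the end keeps the two requirements aligned: the root gets ID `0`, and it ends up as the last entry of `values`.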
Reminder: In .NET, `InnerException` is always the same as `InnerExceptions[0]`, thus it does not need to be reported separately.
However, Python's `__cause__` and `__context__`, and JavaScript's `cause`, are independent and thus should be reported separately if they have values.
Issue grouping rules for exception groups are complex, because the nature of exception groups is that they may or may not represent more than one distinct issue. While this may require some further experimentation to get right, the initial plan is as follows:
First, determine the list of "top-level" exceptions. These are the exceptions that represent distinct issues contained in the exception group.
- Start from the exception having `mechanism.exception_id:0`.
- If it has `mechanism.is_exception_group:true`, then recursively search each child.
- When reaching one where `mechanism.is_exception_group:false` (or not present), include it as a "top-level" exception, and do not traverse any of its child exceptions.
Next, determine from the top-level exceptions which of them would have been grouped together, had they been in separate events.
- Apply grouping rules across the top-level exceptions to determine the number of distinct issues represented by the group.
- For each top-level exception, only consider the first path through any child exceptions.
Finally:
- If there is only one distinct group of top-level exceptions, group the event with other events based on that top-level exception only. Ignore any parent exception groups.
- If more than one distinct top-level exception exists, then group the event based on the parent exception group that they have in common. This will often be the root-level exception group.
As an example, consider simplified issue grouping rules that only consider the exception `type`. When applied to an exception group such as:
ExceptionGroup
  ValueError
  TypeError
  TypeError
There are two distinct top-level exceptions, `ValueError` and `TypeError`. They have the `ExceptionGroup` in common.
Thus the three exceptions considered for issue grouping are `ExceptionGroup`, `ValueError`, and the first `TypeError`.
Now consider this example:
ExceptionGroup
  ExceptionGroup
    ValueError
    ValueError
      TypeError
  ValueError
  ValueError
  ValueError
There are 5 top-level exceptions, all of type `ValueError`. Thus, the event should only be grouped based on a single `ValueError`, and the others should be ignored for purposes of issue grouping. That one of them has a chained `TypeError` is not relevant, at least not in this initial plan.
A further modification to the plan might consider all possible branches of chained exceptions, but that is not proposed at this time.
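
The initial plan can be sketched under the same simplification used in the examples, where two exceptions count as the same issue if they share a `type`. The function names are hypothetical, the real grouping rules in Sentry are considerably richer, and picking the nearest common ancestor as "the parent exception group that they have in common" is an assumption.

```python
from typing import Any, Dict, List


def _mechanism(value: Dict[str, Any]) -> Dict[str, Any]:
    return value.get("mechanism") or {}


def top_level_exceptions(values: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collect the first non-group exception on each branch below the root."""
    by_id = {_mechanism(v)["exception_id"]: v for v in values}
    children: Dict[int, List[Dict[str, Any]]] = {}
    for v in values:
        parent_id = _mechanism(v).get("parent_id")
        if parent_id is not None:
            children.setdefault(parent_id, []).append(v)
    for kids in children.values():
        kids.sort(key=lambda v: _mechanism(v)["exception_id"])

    found: List[Dict[str, Any]] = []

    def walk(value: Dict[str, Any]) -> None:
        if _mechanism(value).get("is_exception_group"):
            for kid in children.get(_mechanism(value)["exception_id"], []):
                walk(kid)
        else:
            found.append(value)  # do not traverse below a top-level exception

    walk(by_id[0])
    return found


def grouping_exception(values: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Pick the exception the event is grouped (and titled) by."""
    by_id = {_mechanism(v)["exception_id"]: v for v in values}
    tops = top_level_exceptions(values)
    if not tops:
        return by_id[0]

    # Simplified rule: exceptions are "the same issue" if their type matches.
    if len({v["type"] for v in tops}) == 1:
        return tops[0]

    def ancestors(value: Dict[str, Any]) -> List[int]:
        path = []
        parent_id = _mechanism(value).get("parent_id")
        while parent_id is not None:
            path.append(parent_id)
            parent_id = _mechanism(by_id[parent_id]).get("parent_id")
        return path  # nearest ancestor first

    common = set(ancestors(tops[0]))
    for value in tops[1:]:
        common &= set(ancestors(value))
    for exception_id in ancestors(tops[0]):
        if exception_id in common:
            return by_id[exception_id]
    return by_id[0]
```

Applied to the two examples above, the first event is grouped by the `ExceptionGroup` and the second by a single `ValueError`.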
As mentioned earlier, SDKs will set a meaningful `mechanism.type` on the root exception only (the last item in the `exception.values` list).
Other exceptions will have type `"chained"` to indicate that they are a chained exception stemming from the root exception.
For purposes of issue grouping, `"chained"` mechanism types should be excluded. Additionally, the mechanism type of the root exception must always be considered as part of issue grouping, even if the rest of the exception group is being ignored.
As a side-effect of issue grouping, issues will be titled (and subtitled) based on the top-most exception that is not excluded from the grouping. In other words, if there is more than one distinct top-level exception, the issue will be titled by the exception group itself. In the above examples, the first issue would be titled as `ExceptionGroup`, and the second issue would be titled as `ValueError`.
However, the `mechanism.handled` field (which determines `error.handled` and `error.unhandled` event attributes in Sentry) should always be taken from the root exception, even if the title and subtitle are derived from one of the chained exceptions.
This is because SDKs are only expected to supply the `mechanism.handled` field on the root exception.
The Issue Details page will be updated to improve usability of exception groups. The exact details are at the discretion of the design team; however, the following should be considered:
- A condensed tree-like visualization of the exception group should be added somewhere on the page. Each exception in the tree should have an in-page link to jump to that exception and ensure it is expanded.
- Some exceptions should be collapsed by default, including any where `mechanism.is_exception_group === true`, and perhaps others.
- The `mechanism.source` field, if available, should be displayed on each exception in the exceptions section.
- We may want to include a way to navigate from each exception to its parent exception, or back to the exception group.
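
As a rough illustration of the condensed tree, the structure can be recovered from the `exception_id` and `parent_id` fields and rendered as indented text. This is a sketch only: the function name `render_exception_tree` and the text layout are hypothetical, it assumes every value carries an `exception_id`, and the actual design is left to the design team.

```python
from typing import Any, Dict, List


def render_exception_tree(values: List[Dict[str, Any]]) -> List[str]:
    """Return one indented line per exception, root first."""

    def mechanism(value: Dict[str, Any]) -> Dict[str, Any]:
        return value.get("mechanism") or {}

    root = None
    children: Dict[int, List[Dict[str, Any]]] = {}
    for value in values:
        parent_id = mechanism(value).get("parent_id")
        if parent_id is None:
            root = value
        else:
            children.setdefault(parent_id, []).append(value)

    lines: List[str] = []

    def walk(value: Dict[str, Any], depth: int) -> None:
        label = value["type"]
        source = mechanism(value).get("source")
        if source:
            label += f" ({source})"  # show where the exception came from
        lines.append("  " * depth + label)
        kids = children.get(mechanism(value)["exception_id"], [])
        for kid in sorted(kids, key=lambda v: mechanism(v)["exception_id"]):
            walk(kid, depth + 1)

    if root is not None:
        walk(root, 0)
    return lines
```

Each line maps to exactly one entry in `exception.values`, which is what makes an in-page link from the tree to the expanded exception straightforward.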
- Modifying the way Exceptions are sent to Sentry will affect existing issue grouping rules that customers may have set up. This change could create new alerts when first deployed.
- The design proposed above retains backwards compatibility with older versions of Sentry. However, without the proposed UI changes, previous versions (self-hosted, etc.) of Sentry will treat the exceptions list as if they were all one long chain of direct exceptions. This could be a bit confusing to the user, until such time they upgrade their Sentry instance to a version that includes the UI and issue grouping changes.
We considered the following alternatives. Each had pitfalls that led to the plan described above.
This would mean leaving things the way they currently are.
Pros:
- Nothing to do.
Cons:
- Overall, reduced ability to use Sentry for error monitoring, as the usage of exception groups increases.
- Events created by the .NET SDK have several problems for exception groups (`AggregateException`), such as:
  - There's no structure represented by the chain of events, so every relationship appears as parent/child, even those that should actually be siblings.
  - In some cases, some stack traces are relocated from the exception group to the first child exception, otherwise no code location will be represented. Doing so grossly misrepresents the true nature of the exception caused at the highlighted stack frame.
  - In other cases, there are already stack traces on both the exception group and the first child exception. Thus the location of the exception group is lost completely.
  - The `KeepAggregateException` option is global for the entire application, and can't be adjusted on a case-by-case basis.
- Issues created by other SDKs such as Python and JavaScript are not prepared to deal with exception groups at all.
  - Issues are always titled and grouped by the exception group, even when there's only a single type of exception contained within.
  - Because none of the items in the `exceptions` or `errors` lists are part of the cause, they're currently not passed to Sentry at all. This makes it impossible to identify the actual cause of an exception raised via an exception group.
This approach was seriously considered. It would involve creating a new tree-like data structure that more closely resembles the original tree of exceptions. It would have been placed on either a new `exception_group` interface, or added to the existing `contexts` or `extra` collections.
Pros:
- The event would contain a more direct representation of the exception data.
- Less work for the SDKs.
Cons:
- Much of the server-side processing would have to be reconsidered, including relay, symbolication, and trimming.
- It would not be backwards compatible, without duplicating significant data into the exceptions list anyway.
This approach would involve not capturing the entire exception group, but trying to determine which top-level exception was worth capturing, from within the SDK.
Pros:
- Fully compatible with existing Sentry, without any changes to grouping rules or UI.
- Fully backwards compatible as well.
Cons:
- Potential to lose a lot of useful data.
- Misrepresents the exception that was actually raised.
- Loses track of the actual location in source code where the exception group was raised.
This approach would involve the SDK sending multiple separate events for each top-level exception in an exception group.
Pros:
- All exceptions would come through at once.
- No server-side processing would need to be performed.
Cons:
- Too much duplicate data is sent from the SDK at run time.
- It can quickly exceed the SDK's internal maximum queue length.
- It can trigger rate limits and spike-protection mechanisms.
This approach would involve the SDK sending one event containing the exception group, and relying on Sentry Relay to split out top-level exceptions into separate events.
It would require the creation of a new `exception_group` interface, placed directly on the incoming event.
Pros:
- One event sent from the SDK, so none of the cons involved with splitting in the SDK.
Cons:
- Could be very CPU intensive.
- Too much business logic.
- Could back up overall ingestion throughput.
- Would not be backwards compatible with the existing event schema.
This approach would set `mechanism.synthetic:true` on exception group types, to attempt to keep them from being considered during issue grouping.
Pros:
- If it worked, issue grouping would need less adjustment.
Cons:
- It doesn't work for this use case.
- Title of issue would be incorrect, referring to the first in-app frame of the exception group.
- Issue grouping would be incorrect, including details of the exception group.