0079-exception-groups.md

  • Start Date: 2023-03-26
  • RFC Type: feature
  • RFC PR: #79
  • RFC Status: active

Summary

This RFC captures the changes required to Sentry to handle exception groups, including changes to SDKs, protocol schema, event processing, and UI.

Motivation

Several programming languages have a concept of a group of multiple unrelated exceptions, aggregated into a single exception.

There may be others. We will use the term "exception group" throughout the design. It applies to the concept, regardless of language.

Sentry needs a way to capture exception groups, and present them in a meaningful way. Simply capturing the exception group by itself is insufficient, and current workarounds (in .NET) are problematic.

See also:

Background

About Exception Groups

An exception group is an exception unto itself. It may have been caused by the exceptions in the group, but there is no implied causal relationship between those exceptions.

In other words, given the following:

  • Group A
    • Exception 1
    • Exception 2

Group A is an exception group that may have been caused by either Exception 1, Exception 2, or both of them. However, it is generally not true that Exception 1 was caused by Exception 2 or vice versa.

Furthermore:

  • Exception 1 and Exception 2 might be of the same type, or they might be of different types.
  • An exception group can contain any number n of exceptions (n >= 1).
  • There can be a stack trace on each of the exceptions within an exception group, as well as on the group itself.
  • Just like any other exception, each exception within an exception group can have a chain of inner exceptions.
  • An inner exception can also be an exception group.
  • Exception groups can be present at any level. There is no requirement that the start of the chain is an exception group.

Thus, other valid examples that could occur in an application include the following:

  • Group B
    • Exception 1
      • Exception 2
        • Exception 3
  • Group C
    • Exception 1
      • Exception 1a
        • Exception 1b
    • Exception 2
      • Exception 2a
        • Exception 2b
  • Group D
    • Exception 1
      • Exception 1a
        • Exception 1b
    • Group E
      • Exception 2
        • Exception 2a
          • Exception 2b
      • Exception 3
        • Exception 3a
          • Exception 3b
  • Exception 1
    • Exception 1a
      • Group E
        • Exception 2
          • Exception 2a
            • Exception 2b
        • Exception 3
          • Exception 3a
            • Exception 3b

Interpreting Exception Groups

The meaning of a normal exception is fairly straightforward:

  • An exception is something that went wrong in the application, and thus represents an issue to resolve.
  • If there is another exception that occurred to cause this one, that is assigned to the "inner exception" or "cause".
  • The chain of exceptions thus is linear.

This gets more complicated when an exception group is involved:

  • The exception group might represent the issue to resolve, or the issue might better be represented by one or more of the inner exceptions within the group.
  • Depending on the language, there might be a cause that is separate from any of the exceptions in the group.
  • Thus, the chain of exceptions is more tree-like than linear.

Exception Handling in Sentry

SDKs send exceptions to Sentry using the Exception Interface on an event. One or more exceptions can be sent in the values array of the interface. Multiple values represent a chain of exceptions in a causal relationship, sorted from oldest to newest. For example:

{
  "exception": {
    "values": [
      {"type": "TypeError", "value": "Invalid Type!"},
      {"type": "ValueError", "value": "Invalid Value!"},
      {"type": "RuntimeError", "value": "Something went wrong!"}
    ]
  }
}

In the above example, an issue is created in Sentry for the RuntimeError exception.

Issue Title: "RuntimeError: Something went wrong!"

+==============================
+ RuntimeError: Something went wrong!
+------------------------------
+ (stack trace)
+==============================

+==============================
+ ValueError: Invalid Value!
+------------------------------
+ (stack trace)
+==============================

+==============================
+ TypeError: Invalid Type!
+------------------------------
+ (stack trace)
+==============================

This design is linear, and thus cannot represent an exception group that contains more than one exception. Even when a group contains only a single exception, the issue presented in Sentry might be titled and grouped in an undesirable manner.

Consider:

{
  "exception": {
    "values": [
      {"type": "TypeError", "value": "Invalid Type!"},
      {"type": "ValueError", "value": "Invalid Value!"},
      {"type": "RuntimeError", "value": "Something went wrong!"},
      {"type": "ExceptionGroup", "value": "Exception Group (1 sub-exception)"}
    ]
  }
}

Sending this to Sentry would result in an issue titled "ExceptionGroup: Exception Group (1 sub-exception)". In some contexts that may be desired, but in others the expectation is that the issue would be titled "RuntimeError: Something went wrong!".
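To make the titling problem concrete, here is a minimal sketch of the linear interpretation described above, in which the last item of the values list supplies the title. The function name legacy_issue_title is hypothetical, and the real title logic in Sentry considers more than these two fields.

```python
def legacy_issue_title(event):
    """Derive a title from the newest exception in a linear chain (simplified)."""
    values = event["exception"]["values"]
    newest = values[-1]  # values are sorted from oldest to newest
    if newest.get("value"):
        return f'{newest["type"]}: {newest["value"]}'
    return newest["type"]


chain = {
    "exception": {
        "values": [
            {"type": "TypeError", "value": "Invalid Type!"},
            {"type": "ValueError", "value": "Invalid Value!"},
            {"type": "RuntimeError", "value": "Something went wrong!"},
        ]
    }
}
title_without_group = legacy_issue_title(chain)

# Appending the exception group as the newest item changes the title.
chain["exception"]["values"].append(
    {"type": "ExceptionGroup", "value": "Exception Group (1 sub-exception)"}
)
title_with_group = legacy_issue_title(chain)
```

Under this simplified model, wrapping the same failure in an exception group is enough to change the title, which is the behavior the rest of this RFC aims to correct.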

Language Specifics

.NET

The exception group type in .NET is AggregateException.

  • The exceptions of the group are stored in the InnerExceptions property.
  • Like other exceptions, it also has an InnerException property, which is interpreted as the cause of this exception.
    • Its value is always the same as InnerExceptions[0].

Python

The exception group type in Python is ExceptionGroup.

  • The exceptions of the group are stored in the exceptions attribute.
  • Like other exceptions, it can have a __cause__ and/or a __context__ attribute.
    • __context__ is an indirect cause, assigned if another exception occurs while handling the exception.
    • __cause__ is a direct cause, assigned if raised with the exception (using the from keyword).
      • Setting it suppresses any __context__ value, when displayed in the stack trace.
    • There is no requirement that exceptions[0] be either of these.
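The difference between __context__ and __cause__ can be seen with ordinary exceptions. The following sketch uses only standard Python behavior; the variable names are illustrative.

```python
# Implicit chaining: raising while handling another exception sets __context__.
try:
    try:
        raise KeyError("missing")
    except KeyError:
        raise ValueError("implicit")
except ValueError as err:
    implicit = err

# Explicit chaining: "raise ... from ..." sets __cause__ and marks the
# context as suppressed for display purposes.
try:
    try:
        raise KeyError("missing")
    except KeyError as original:
        raise ValueError("explicit") from original
except ValueError as err:
    explicit = err

print(type(implicit.__context__).__name__, implicit.__cause__)
print(type(explicit.__cause__).__name__, explicit.__suppress_context__)
```

Note that in the explicit case __context__ is still populated; it is only hidden when the traceback is displayed. An SDK therefore has to decide which of the two attributes to follow.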

JavaScript

The exception group type in JavaScript is AggregateError.

  • The errors of the group are stored in the errors property.
  • Like other errors, it has a cause (singular) property.
    • There is no requirement that errors[0] be the same as cause.

Proposed Solution

NOTE: This has changed significantly from prior versions of the draft RFC.

We will focus on having the SDKs capture all available information, and adjust the issue grouping rules in Sentry to use that new information.

The SDKs will do the following:

  • Capture the entire tree of exceptions represented by the exception group.
  • Capture any other chained exceptions that are not part of the exception group.
  • Use the existing exception value of the Sentry event.
  • Add a few new fields to each exception's mechanism data, as described below.

Sentry will do the following:

  • Take the new mechanism fields into account when grouping issues. See Sentry Issue Grouping below for more details.
  • Present a user interface that depicts the structure of the exception group on the issue details page. See Sentry UI Changes below for more details.

Important Notice

The new grouping rules MUST be fully implemented in Sentry.io and a published version of self-hosted Sentry before any SDK can release a non-preview version that includes these changes. The SDK's release notes should mention the required version of Sentry.

Although the protocol changes are backwards compatible, the current problems with exception groups will persist or be exacerbated without the new grouping rules in place.

SDKs that currently require the user to opt-in to send chained exceptions can continue to do so without side-effects. SDKs that send chained exceptions by default may see an immediate change to exception grouping after implementing this feature.

New Mechanism Fields

The Exception Mechanism Interface will have the following new fields added:

source

An optional string value describing the source of the exception.

  • The SDK should populate this with the name of the property or attribute of the parent exception that this exception was acquired from. In the case of an array, it should include the zero-based array index as well.
  • Python Examples: "__context__", "__cause__", "exceptions[0]", "exceptions[1]"
  • .NET Examples: "InnerException", "InnerExceptions[0]", "InnerExceptions[1]"
  • JavaScript Examples: "cause", "errors[0]", "errors[1]"

is_exception_group

An optional boolean value, set true when the exception is the exception group type specific to the platform or language. The default is false when omitted.

  • For example, exceptions of type ExceptionGroup (Python), AggregateException (.NET), and AggregateError (JavaScript) should have "is_exception_group": true. Other exceptions can omit this field.

exception_id

An optional numeric value providing an ID for the exception relative to this specific event.

  • The SDK should assign simple incrementing integers to each exception in the tree, starting with 0 for the root of the tree. In other words, when flattened into the list provided in the exception values on the event, the last exception in the list should have ID 0, the previous one should have ID 1, the next previous should have ID 2, etc.

parent_id

An optional numeric value pointing at the exception_id that is the parent of this exception.

  • The SDK should assign this to all exceptions except the root exception (the last to be listed in the exception values).

Interpretation

The exception_id and parent_id fields work in conjunction to represent the hierarchical nature of the tree of exceptions. If not provided, the previous interpretation will be assumed - which is that each exception in the list of exception values is a child of the one immediately following it in the list.

The Exception Interface will not change in structure, but it will change in interpretation with regard to multiple exception values.

  • The previous interpretation was: "Multiple values represent chained exceptions.".
  • The new interpretation will be: "Multiple values are related by the optional mechanism.exception_id and mechanism.parent_id fields. When not present, multiple values represent chained exceptions."
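As an illustration of the two interpretations, the sketch below maps each exception to its parent, falling back to the legacy chain when the new fields are absent. The helper names parent_links and children_of are hypothetical, and treating a partially annotated list as a legacy chain is an assumption rather than something this RFC specifies.

```python
def parent_links(values):
    """Map each exception_id to its parent_id (None for the root)."""
    mechanisms = [value.get("mechanism") or {} for value in values]
    if values and all("exception_id" in m for m in mechanisms):
        return {m["exception_id"]: m.get("parent_id") for m in mechanisms}

    # Legacy interpretation: each value is a child of the one that
    # immediately follows it, so the last value is the root (ID 0).
    count = len(values)
    links = {}
    for position in range(count):
        synthetic_id = count - 1 - position
        links[synthetic_id] = None if synthetic_id == 0 else synthetic_id - 1
    return links


def children_of(links):
    """Invert parent links into an exception_id -> child IDs mapping."""
    children = {exception_id: [] for exception_id in links}
    for exception_id, parent_id in sorted(links.items()):
        if parent_id is not None:
            children[parent_id].append(exception_id)
    return children
```

For a legacy list of three values, parent_links yields a single chain (2 under 1, 1 under 0), so existing events keep their current meaning.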

Additional SDK Requirements

Mechanism Type

When setting the mechanism.type field, SDKs should use the following guidelines:

  • For the root exception (the last to be in the exception.values list), set mechanism.type to the name of the integration that produced the exception (as was the case before this proposal). If the exception was captured manually, set the mechanism.type to "generic".

  • For all other exceptions in the list, set the mechanism.type to "chained". This will indicate that the exception is part of the chain of exceptions stemming from the root exception (regardless of whether it is in an exception group or not).

Do not omit the mechanism.type field, nor send it empty or null.

Exception Value

When setting the value field of the exception for an exception group type, SDKs should only deliver the meaningful part of the exception message, excluding any string that may have been automatically added by their platform.

For example:

  • In Python, the value field should not contain details such as " (2 sub-exceptions)". Use the message attribute to get the raw message from an ExceptionGroup. If there is no message, omit the value and just send type.

  • In .NET, the value field should not contain details such as " (Exception 1) (Exception 2)". The Message property may need to be modified to remove the appended inner messages.
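For Python, one possible shape of this logic is sketched below. The helper name exception_group_value is hypothetical. It prefers the message attribute and, as an assumed fallback for objects without one, strips a trailing sub-exception count of the form that Python appends when an ExceptionGroup is converted to a string.

```python
import re

# Matches the suffix Python adds to str(ExceptionGroup), for example
# " (1 sub-exception)" or " (2 sub-exceptions)".
_SUB_EXCEPTION_SUFFIX = re.compile(r" \(\d+ sub-exceptions?\)$")


def exception_group_value(exc):
    """Return the meaningful message of an exception group, or None."""
    message = getattr(exc, "message", None)
    if message is None:
        # Fallback (assumption): derive the message from the string form.
        message = _SUB_EXCEPTION_SUFFIX.sub("", str(exc))
    return message or None  # None means: omit "value" and send only "type"
```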

Keep Aggregate Exceptions

The .NET SDK previously implemented an option called KeepAggregateException. This flag should be deprecated in favor of always sending the entire chain of aggregate exceptions, as explained in this RFC. This may affect existing issue grouping, and should be noted in the change log when released.

Other SDKs should not implement a similar option.

Example Event

Given the Python code:

try:
  raise RuntimeError("something")
except:
  raise ExceptionGroup("nested",
    [
      ValueError(654),
      ExceptionGroup("imports",
        [
          ImportError("no_such_module"),
          ModuleNotFoundError("another_module"),
        ]
      ),
      TypeError("int"),
    ]
  )

The event would contain:

{
  "exception": {
    "values": [
      {
        "type": "TypeError",
        "value": "int",
        "mechanism": {
          "type": "chained",
          "source": "exceptions[2]",
          "exception_id": 6,
          "parent_id": 0
        }
      },
      {
        "type": "ModuleNotFoundError",
        "value": "another_module",
        "mechanism": {
          "type": "chained",
          "source": "exceptions[1]",
          "exception_id": 5,
          "parent_id": 3
        }
      },
      {
        "type": "ImportError",
        "value": "no_such_module",
        "mechanism": {
          "type": "chained",
          "source": "exceptions[0]",
          "exception_id": 4,
          "parent_id": 3
        }
      },
      {
        "type": "ExceptionGroup",
        "value": "imports",
        "mechanism": {
          "type": "chained",
          "source": "exceptions[1]",
          "is_exception_group": true,
          "exception_id": 3,
          "parent_id": 0
        }
      },
      {
        "type": "ValueError",
        "value": "654",
        "mechanism": {
          "type": "chained",
          "source": "exceptions[0]",
          "exception_id": 2,
          "parent_id": 0
        }
      },
      {
        "type": "RuntimeError",
        "value": "something",
        "mechanism": {
          "type": "chained",
          "source": "__context__",
          "exception_id": 1,
          "parent_id": 0
        }
      },
      {
        "type": "ExceptionGroup",
        "value": "nested",
        "mechanism": {
          "type": "exceptionhook",
          "handled": false,
          "is_exception_group": true,
          "exception_id": 0
        }
      }
    ]
  }
}

Reminder: In .NET, InnerException is always the same as InnerExceptions[0], thus it does not need to be reported separately. However, Python's __cause__ and __context__, and JavaScript's cause, are independent and thus should be reported separately if they have values.
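The flattening that produces an event like the example above could be sketched as follows. This is not the implementation of any Sentry SDK; the function name flatten_exception_tree is hypothetical. The sketch assumes that a group is any exception exposing a list or tuple in an exceptions attribute, that an explicit __cause__ takes precedence over __context__, and that IDs are assigned in depth-first pre-order, which reproduces the numbering in the example. Fields such as handled and stack traces are left out.

```python
def flatten_exception_tree(root, root_mechanism_type="generic"):
    """Flatten an exception tree into a list shaped like exception.values."""
    entries = []
    seen = set()

    def visit(exc, parent_id, source):
        if id(exc) in seen:  # guard against reference cycles
            return
        seen.add(id(exc))

        exception_id = len(entries)  # 0 for the root, then 1, 2, ...
        children = getattr(exc, "exceptions", None)
        is_group = isinstance(children, (list, tuple))

        is_root = parent_id is None
        mechanism = {"type": root_mechanism_type if is_root else "chained"}
        if source is not None:
            mechanism["source"] = source
        if is_group:
            mechanism["is_exception_group"] = True
        mechanism["exception_id"] = exception_id
        if not is_root:
            mechanism["parent_id"] = parent_id

        entry = {"type": type(exc).__name__, "mechanism": mechanism}
        value = getattr(exc, "message", None) if is_group else str(exc)
        if value:
            entry["value"] = value
        entries.append(entry)

        # Follow the explicit cause if present; otherwise the implicit
        # context, unless it was suppressed.
        if exc.__cause__ is not None:
            visit(exc.__cause__, exception_id, "__cause__")
        elif exc.__context__ is not None and not exc.__suppress_context__:
            visit(exc.__context__, exception_id, "__context__")

        if is_group:
            for index, child in enumerate(children):
                visit(child, exception_id, f"exceptions[{index}]")

    visit(root, None, None)
    # The root (ID 0) must be the last item in the list.
    return list(reversed(entries))
```

Because the list is reversed at the end, the root exception lands in the last position while still holding ID 0, which keeps the event readable by versions of Sentry that only understand the legacy chain.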

Sentry Issue Grouping

Issue grouping rules for exception groups are complex, because the nature of exception groups is that they may or may not represent more than one distinct issue. While this may require some further experimentation to get right, the initial plan is as follows:

First, determine the list of "top-level" exceptions. These are the exceptions that represent distinct issues contained in the exception group.

  1. Start from the exception having mechanism.exception_id:0.
  2. If it has mechanism.is_exception_group:true, then recursively search each child.
  3. When reaching one where mechanism.is_exception_group:false (or not present), include it as a "top-level" exception, and do not traverse any of its child exceptions.

Next, determine from the top-level exceptions which of them would have been grouped together, had they been in separate events.

  • Apply grouping rules between the top-level exceptions to determine the number of distinct issues represented by the group.
  • For each top-level exception, only consider the first path through any child exceptions.

Finally:

  • If there is only one distinct group of top-level exceptions, group the event with other events based on that top-level exception only. Ignore any parent exception groups.

  • If more than one distinct top-level exception exists, then group the event based on the parent exception group that they have in common. This will often be the root-level exception group.

As an example, consider simplified issue grouping rules that only considered the exception type. When applied to an exception group such as:

  • ExceptionGroup
    • ValueError
    • TypeError
    • TypeError

There are two distinct top-level exceptions, ValueError and TypeError. They have the ExceptionGroup in common. Thus the three exceptions considered for issue grouping are ExceptionGroup, ValueError, and the first TypeError.

Now consider this example:

  • ExceptionGroup
    • ExceptionGroup
      • ValueError
      • ValueError
        • TypeError
      • ValueError
    • ValueError
    • ValueError

There are 5 top-level exceptions, all of type ValueError. Thus, the event should only be grouped based on a single ValueError, and the others should be ignored for purposes of issue grouping. That one of them has a chained TypeError is not relevant, at least not in this initial plan.

A further modification to the plan might consider all possible branches of chained exceptions, but that is not proposed at this time.
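The plan and the two examples above can be sketched with a simplified tree model. The Node class and function names are hypothetical, and the default key groups by exception type only, as in the simplified examples; real grouping rules consider much more.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    type: str
    is_group: bool = False
    children: list = field(default_factory=list)


def top_level_paths(node, path=()):
    """Yield the root-to-node path of every top-level exception."""
    path = path + (node,)
    if node.is_group:
        for child in node.children:
            yield from top_level_paths(child, path)
    else:
        yield path  # do not traverse below a non-group exception


def exceptions_for_grouping(root, key=lambda node: node.type):
    """Return the exceptions to consider when grouping the event."""
    distinct = {}
    for path in top_level_paths(root):
        distinct.setdefault(key(path[-1]), path)  # keep the first of each kind
    paths = list(distinct.values())

    if not paths:
        return [root]  # an empty group has nothing else to group on
    if len(paths) == 1:
        return [paths[0][-1]]  # one distinct issue: ignore parent groups

    # Several distinct issues: find the deepest group shared by all paths.
    common = root
    for level in zip(*paths):
        if all(node is level[0] for node in level):
            common = level[0]
        else:
            break
    return [common] + [path[-1] for path in paths]
```

Applied to the first example this returns the ExceptionGroup, the ValueError, and the first TypeError; applied to the second it returns a single ValueError.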

Additional Issue Grouping Requirements

As mentioned earlier, SDKs will set a meaningful mechanism.type on the root exception only (the last item in the exception.values list). Other exceptions will have type "chained" to indicate that they are a chained exception stemming from the root exception.

For purposes of issue grouping, the "chained" mechanism type should be excluded. Additionally, the mechanism type of the root exception must always be considered as part of issue grouping, even if the rest of the exception group is being ignored.

Issue Titles

As a side-effect of issue grouping, issues will be titled (and subtitled) based on the top-most exception that is not ignored from the grouping. In other words, if there is more than one distinct top-level exception, the issue will be titled by the exception group itself. In the above examples, the first issue would be titled as ExceptionGroup, and the second issue would be titled as ValueError.

However, the mechanism.handled field (which determines error.handled and error.unhandled event attributes in Sentry) should always be taken from the root exception, even if the title and subtitle are derived from one of the chained exceptions. This is because SDKs are only expected to supply the mechanism.handled field on the root exception.

Sentry UI Changes

The Issue Details page will be updated to improve usability of exception groups. The exact details are at the discretion of the design team; however, the following should be considered:

  • A condensed tree-like visualization of the exception group should be added somewhere on the page. Each exception in the tree should have an in-page link to jump to that exception and ensure it is expanded.
  • Some exceptions should be collapsed by default, including any where mechanism.is_exception_group === true, and perhaps others.
  • The mechanism.source field, if available, should be displayed on each exception in the exceptions section.
  • We may want to include a way to navigate from each exception to its parent exception, or back to the exception group.
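As a rough illustration of the condensed tree, the sketch below renders an indented text outline from the mechanism fields of an event. The function name render_exception_tree is hypothetical, and the sketch assumes every value carries an exception_id; the actual UI is up to the design team.

```python
def render_exception_tree(values):
    """Render exception values as an indented outline, root first."""
    labels = {}
    children = {}
    for value in values:
        mechanism = value["mechanism"]
        exception_id = mechanism["exception_id"]
        labels[exception_id] = (value["type"], mechanism.get("source"))
        # Exceptions without a parent_id are collected under the key None.
        children.setdefault(mechanism.get("parent_id"), []).append(exception_id)

    lines = []

    def walk(exception_id, depth):
        type_name, source = labels[exception_id]
        suffix = f" ({source})" if source else ""
        lines.append("  " * depth + type_name + suffix)
        for child_id in sorted(children.get(exception_id, [])):
            walk(child_id, depth + 1)

    for root_id in children.get(None, []):
        walk(root_id, 0)
    return "\n".join(lines)
```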

Drawbacks

  • Modifying the way Exceptions are sent to Sentry will affect existing issue grouping rules that customers may have set up. This change could create new alerts when first deployed.

  • The design proposed above retains backwards compatibility with older versions of Sentry. However, without the proposed UI changes, previous versions of Sentry (self-hosted, etc.) will treat the exceptions list as one long chain of direct exceptions. This could confuse users until they upgrade their Sentry instance to a version that includes the UI and issue grouping changes.

Other Options Considered

We considered the following options. Each had pitfalls that led to the plan described above.

Do Nothing

This would mean leaving things the way they currently are.

Pros:

  • Nothing to do.

Cons:

  • Overall, reduced ability to use Sentry for error monitoring, as the usage of exception groups increases.
  • Events created by the .NET SDK have several problems for exception groups (AggregateException), such as:
    • There's no structure represented by the chain of events, so every relationship appears as parent/child, even those that should actually be siblings.
    • In some cases, some stack traces are relocated from the exception group to the first child exception, otherwise no code location will be represented. Doing so grossly misrepresents the true nature of the exception caused at the highlighted stack frame.
    • In other cases, there are already stack traces on both the exception group and the first child exception. Thus the location of the exception group is lost completely.
    • The KeepAggregateException option is global for the entire application, and can't be adjusted on a case-by-case basis.
  • Issues created by other SDKs such as Python and JavaScript are not prepared to deal with exception groups at all.
    • Issues are always titled and grouped by the exception group, even when there's only a single type of exception contained within.
    • Because none of the items in the exceptions or errors lists are part of the cause, they're currently not passed to Sentry at all. This makes it impossible to identify the actual cause of an exception raised via an exception group.

Sending Hierarchical Data

This approach was seriously considered. It would involve creating a new tree-like data structure that more closely resembles the original tree of exceptions. It would have been placed on either a new exception_group interface, or added to the existing contexts or extra collections.

Pros:

  • The event would contain a more direct representation of the exception data.
  • Less work for the SDKs.

Cons:

  • Much of the server-side processing would have to be reconsidered, including relay, symbolication, and trimming.
  • It would not be backwards compatible, without duplicating significant data into the exceptions list anyway.

Sending One Exception Chain Only

This approach would involve not capturing the entire exception group, but trying to determine which top-level exception was worth capturing, from within the SDK.

Pros:

  • Fully compatible with existing Sentry, without any changes to grouping rules or UI.
  • Fully backwards compatible as well.

Cons:

  • Potential to lose a lot of useful data.
  • Misrepresents the exception that was actually raised.
  • Loses track of the actual location in source code where the exception group was raised.

Splitting Events in the SDK

This approach would involve the SDK sending multiple separate events for each top-level exception in an exception group.

Pros:

  • All exceptions would come through at once.
  • No server-side processing would need to be performed.

Cons:

  • Too much duplicate data is sent from the SDK at run time.
  • It can quickly exceed the SDK's internal maximum queue length.
  • It can trigger rate limits and spike-protection mechanisms.

Splitting Events in Relay

This approach would involve the SDK sending one event containing the exception group, and relying on Sentry Relay to split out top-level exceptions into separate events. It would require the creation of a new exception_group interface, placed directly on the incoming event.

Pros:

  • One event sent from the SDK, so none of the cons involved with splitting in the SDK.

Cons:

  • Could be very CPU intensive.
  • Too much business logic.
  • Could back up overall ingestion throughput.
  • Would not be backwards compatible with the existing event schema.

Using Synthetic Exceptions

This approach would set mechanism.synthetic:true on exception group types, to attempt to keep them from being considered during issue grouping.

Pros:

  • If it worked, issue grouping would need less adjustment.

Cons:

  • It doesn't work for this use case.
    • Title of issue would be incorrect, referring to the first in-app frame of the exception group.
    • Issue grouping would be incorrect, including details of the exception group.