-
-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rfc(decision): Mobile - Tracing Without Performance V2 #136
Changes from all commits
313a0a0
27c44c4
a68b563
e3be25a
3a70e7b
9c252cc
fddeeaa
764196a
cd840eb
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,76 @@ | ||
- Start Date: 2024-06-04 | ||
- RFC Type: decision | ||
- RFC PR: https://github.com/getsentry/rfcs/pull/136 | ||
- RFC Status: draft | ||
|
||
# Summary | ||
|
||
This RFC aims to find a strategy to define a better lifetime for traces so they don’t reference hundreds of unrelated | ||
events. | ||
|
||
# Motivation | ||
|
||
On mobile, traces can have hundreds of unrelated events caused by the possibly never-changing | ||
`traceId` required for tracing without performance. This occurs mostly when users don’t have performance | ||
enabled. | ||
|
||
# Background | ||
|
||
In the summer of 2023, all mobile SDKs implemented [Tracing without performance](https://www.notion.so/Tracing-without-performance-efab307eb7f64e71a04f09dc72722530?pvs=21), | ||
see also [team-sdks GH issue](https://github.com/getsentry/team-sdks/issues/5). | ||
The goal of this endeavor was to | ||
|
||
> always have access to a trace and span ID, add a new internal `PropagationContext` property to the | ||
> scope, an object holding a `traceId` and `spanId` | ||
|
||
On mobile, most users interact purely with the static API, which holds a reference to a global | ||
Hub and Scope. Therefore, mobile SDKs create a `PropagationContext` with `traceId` and `spanId` | ||
during initialization, and these usually persist for the entire lifetime of the app. Mobile | ||
SDKs prefer the `traceID` of transactions bound to the scope over the `PropagationContext`. So | ||
when performance is disabled, or no transaction is bound to the scope, mobile SDKs use the same | ||
`traceId` and `spanId` for all captured events. This can lead to traces with hundreds of | ||
unrelated events confusing users. JS addressed this recently by updating the `PropagationContext` | ||
based on routes, see [Ensure browser traceId lifetime works as expected](https://github.com/getsentry/sentry-javascript/issues/11599). | ||
|
||
# Options Considered | ||
|
||
## Option 1: Update `PropagationContext` based on screens/routes <a name="option-1"></a> | ||
|
||
Mobile SDKs base the lifetime of the `traceId` of the `PropagationContext` on screens/routes, | ||
which is similar to a route on JavaScript. Mobile SDKs already report the screen name automatically | ||
via `view_names` with the [app context](https://develop.sentry.dev/sdk/event-payloads/contexts/#app-context) | ||
and use the same information for the name of screen load transactions, which the screen load | ||
starfish module uses. Whenever the screen name changes automatically or with a yet to be defined | ||
[manual API](https://www.notion.so/sentry/Specs-Screens-API-084d773272f24f57aeb622c07619264e), | ||
mobile SDKs must renew the `traceId` of the `PropagationContext`. The screen load transaction | ||
and subsequent events on the same screen must use the same `traceId`. When the app moves to the | ||
foreground after being in the background for longer than 30 seconds, which is the same approach | ||
mobile SDKs use for determining the end of a session, mobile SDKs renew the `traceId` of the | ||
PropagationContext. If the app stays in the background for shorter or equal to 30 seconds, | ||
mobile SDKs must not renew the `traceId` of the PropagationContext when the app moves again to | ||
the foreground. When a span starts, SDKs should use the traceID on the PropagationContext, but | ||
SDKs make an exception to this rule to map all spans related to a screen/route to the same traceID. | ||
For example, when a span leads to a new screen, SDKs should use the traceID of the new screen. | ||
|
||
### Pros <a name="option-1-pros"></a> | ||
|
||
1. Similar to [JavaScript]((https://github.com/getsentry/sentry-javascript/issues/11599)) updating | ||
it based on routes, so it should be easy to implement for React-Native. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. for flutter at the very least we would need the user to use our navigatorObservers: [
SentryNavigatorObserver()
],
MaterialPageRoute(
settings: const RouteSettings(name: 'AutoCloseScreen'),
builder: (context) => const AutoCloseScreen()),
), Would the default behaviour remain as it is right now if a user didn't use the navigator observer on flutter? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What's the current default behavior in Flutter, @buenaflor? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is similar issue with React Native. We don't have the screens/routes information without the performance instrumentation (ReactNavigation or ReactNativeNavigation). We could update the auto instrumentation work without performance (without creating spans). Or get some signal of change from native. Or having a public API to renew the |
||
2. Works for spans first, as all spans get added to one trace per screen. | ||
|
||
### Cons <a name="option-1-cons"></a> | ||
|
||
1. For single-screen applications such as social networks, the lifetime of a trace could still be | ||
long, and multiple unrelated events could be mapped to one trace. | ||
Comment on lines
+63
to
+64
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. IIUC this is in line with the JS implementation, and something we can accept for now. In case we have clear pointers we need to address this (and how) we can iterate There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. A manual API to renew the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fyi, we have TracingUtils.startNewTrace in Java (should probably be exposed through the static API though) |
||
2. For applications running for a long time in the background, such as running apps, the lifetime of | ||
a trace could still be long, and multiple unrelated events could be mapped to one trace. | ||
philipphofmann marked this conversation as resolved.
Show resolved
Hide resolved
|
||
3. It doesn’t work well for declarative UI frameworks as Jetpack Compose and SwiftUI for which the | ||
SDKs can’t reliably automatically detect when apps load a new screen. | ||
Comment on lines
+67
to
+68
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think the easiest way to get started is by providing a wrapper ~ |
||
|
||
# Drawbacks | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We were recently made aware that another implication of long-running traces is that it potentially increases transaction/span quota usage. This is because in JS we inherit the sampling decision for the trace in subsequent transactions. For example:
So either we accept this and move on for now by continuing with this behaviour or we break trace consistency by again rolling the dice for new root spans/transactions. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @Lms24, please clarify why this wasn't a problem before. I don't understand how this proposed change here will cause this. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I might be missing a bit of context around how Tracing without Performance and the PropagationContext is implemented in mobile SDKs. If the proposed change is purely intended for TwP scenarios and does not affect the overall trace lifetime in case a root span/transaction is started ("tracing with performance") I think we're good. That is because for TwP, we defer the sampling decision to the downstream service (i.e. send In JS however, we changed the trace lifetime not just for TwP but in general, leading to scenarios like the one above. To illustrate further, why this is problematic, I'm gonna adjust the example a bit from above
So even without an active transaction, we'd still propagate a forced sampled flag to downstream services. Does this make sense? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Thanks for the detailed explanation. Yes, that makes sense, but I guess in the long run, you should have a roughly equal amount of transactions. It shouldn't matter if you roll the dice once for 10 transactions or every time for each transaction. If you roll the dice often enough, an equal amount of transactions should be captured. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not necessarily unfortunately. This would only hold up if the sample rates on client and server were the same. If users have lower sample rates on the server, they would send significantly less server-side transactions with the previous implementation. I tried verifying this with a small script: https://gist.github.com/Lms24/9a631295aef58cf22fb8f5307953335c There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
We should say explicitly to use only the |
||
|
||
Please add drawbacks here if you can think of any. | ||
philipphofmann marked this conversation as resolved.
Show resolved
Hide resolved
|
||
|
||
# Unresolved questions | ||
|
||
- None. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this break transaction/spans sample rates?
The
sampled
state of thePropagationContext
would overwrite the calculated sampling decision of the span done throughtracesSampleRate
- or even worse throughtracesSampler
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or are we just saying to use the traceId of the PropagationContext, retaining the usual sample decision?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
continuing below