rfc(decision): Mobile - Tracing Without Performance V2 #136

Closed
first version
philipphofmann committed Jun 4, 2024
commit 27c44c47086dc81ab7ac6ff95b643f913fc38596
53 changes: 42 additions & 11 deletions text/0136-mobile-tracing-without-performance-v-2.md
@@ -5,31 +5,62 @@

# Summary

One paragraph explanation of the feature or document purpose.
This RFC aims to find a strategy to update the `PropagationContext` so traces don’t reference hundreds of unrelated
events.

# Motivation

Why are we doing this? What use cases does it support? What is the expected outcome?
On mobile, traces can have hundreds of unrelated events caused by the possibly never-changing
`PropagationContext` required for tracing without performance. This occurs mostly when users don’t have performance
enabled.

# Background

The reason this decision or document is required. This section might not always exist.
In the summer of 2023, all mobile SDKs implemented [Tracing without performance](https://www.notion.so/Tracing-without-performance-efab307eb7f64e71a04f09dc72722530?pvs=21),
see also [team-sdks GH issue](https://github.com/getsentry/team-sdks/issues/5).
The goal of this endeavor was to

# Supporting Data
> always have access to a trace and span ID, add a new internal `PropagationContext` property to the
> scope, an object holding a `traceId` and `spanId`

[Metrics to help support your decision (if applicable).]
On mobile, most users interact purely with the static API, which holds a reference to a global
Hub and Scope. Therefore, mobile SDKs create a `PropagationContext` with `traceId` and `spanId`
during initialization, and these usually persist for the entire lifetime of the app. Mobile
SDKs prefer the `traceId` of transactions bound to the scope over the `PropagationContext`. So
when performance is disabled, or no transaction is bound to the scope, mobile SDKs use the same
`traceId` and `spanId` for all captured events. This can lead to traces with hundreds of
unrelated events, which confuses users. JS addressed this recently by updating the `PropagationContext`
based on routes, see [Ensure browser traceId lifetime works as expected](https://github.com/getsentry/sentry-javascript/issues/11599).
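
The lookup order described above can be sketched as follows. This is a language-neutral illustration written in Python, not code from any Sentry SDK: the class and method names (`Scope`, `Transaction`, `trace_id_for_event`) are hypothetical, and the ID formats (32 and 16 hex characters) are an assumption of the sketch.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


def new_trace_id() -> str:
    # Assumption: a trace ID is 32 hex characters.
    return uuid.uuid4().hex


def new_span_id() -> str:
    # Assumption: a span ID is 16 hex characters.
    return uuid.uuid4().hex[:16]


@dataclass
class PropagationContext:
    trace_id: str = field(default_factory=new_trace_id)
    span_id: str = field(default_factory=new_span_id)


@dataclass
class Transaction:
    trace_id: str = field(default_factory=new_trace_id)


@dataclass
class Scope:
    # Created once during SDK initialization and usually never replaced.
    propagation_context: PropagationContext = field(default_factory=PropagationContext)
    transaction: Optional[Transaction] = None

    def trace_id_for_event(self) -> str:
        # Prefer the transaction bound to the scope, if there is one.
        if self.transaction is not None:
            return self.transaction.trace_id
        # Otherwise fall back to the long-lived PropagationContext.
        return self.propagation_context.trace_id


# Without a bound transaction, every event resolves to the same trace ID.
scope = Scope()
ids = {scope.trace_id_for_event() for _ in range(300)}
print(len(ids))  # 1
```

With no transaction bound, all 300 lookups return the one `traceId` created at initialization, which is how a single trace ends up collecting unrelated events.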

# Options Considered

If an RFC does not know yet what the options are, it can propose multiple options. The
preferred model is to propose one option and to provide alternatives.
## Option 1: Update `PropagationContext` based on screens <a name="option-1"></a>

Mobile SDKs base the lifetime of the `traceId` of the `PropagationContext` on screens,
which are similar to routes in JavaScript. Mobile SDKs already report the screen name automatically
via `view_names` with the [app context](https://develop.sentry.dev/sdk/event-payloads/contexts/#app-context)
and use the same information for the name of screen load transactions, which the screen load
starfish module uses. Whenever the screen name changes, either automatically or through a yet-to-be-defined
[manual API](https://www.notion.so/sentry/Specs-Screens-API-084d773272f24f57aeb622c07619264e),
mobile SDKs must renew the `traceId` of the `PropagationContext`. The screen load transaction
and subsequent events on the same screen must use the same `traceId`. When the app moves to the
background, mobile SDKs also renew the `traceId` of the `PropagationContext`.
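
A minimal sketch of this renewal rule, again in Python for illustration only. `TraceLifetime`, `on_screen_changed`, and `on_background` are hypothetical names, not an existing or proposed SDK API; real SDKs would hook these into their own screen tracking and app lifecycle callbacks.

```python
import uuid
from typing import Optional


class TraceLifetime:
    """Hypothetical helper that owns the traceId of the PropagationContext."""

    def __init__(self) -> None:
        self.trace_id: str = uuid.uuid4().hex
        self.screen: Optional[str] = None

    def _renew(self) -> None:
        self.trace_id = uuid.uuid4().hex

    def on_screen_changed(self, screen_name: str) -> None:
        # Renew only on an actual change, so the screen load transaction and
        # later events on the same screen share one traceId.
        if screen_name != self.screen:
            self.screen = screen_name
            self._renew()

    def on_background(self) -> None:
        # Moving to the background also ends the current trace.
        self._renew()


lifetime = TraceLifetime()
lifetime.on_screen_changed("Login")
login_trace = lifetime.trace_id
lifetime.on_screen_changed("Login")     # same screen: traceId is kept
same_screen_trace = lifetime.trace_id
lifetime.on_screen_changed("Settings")  # new screen: traceId is renewed
settings_trace = lifetime.trace_id
lifetime.on_background()                # backgrounding: traceId is renewed
background_trace = lifetime.trace_id
```

Whether a repeated report of the same screen name should renew the trace is a design choice; the sketch assumes it should not.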

### Pros <a name="option-1-pros"></a>

1. Similar to [JavaScript](https://github.com/getsentry/sentry-javascript/issues/11599), which updates
it based on routes, so it should be easy to implement for React Native.

For Flutter, at the very least, we would need the user to use our SentryNavigatorObserver and then set a route name, e.g.:

navigatorObservers: [
  SentryNavigatorObserver()
],

MaterialPageRoute(
  settings: const RouteSettings(name: 'AutoCloseScreen'),
  builder: (context) => const AutoCloseScreen(),
),

Would the default behaviour remain as it is right now if a user didn't use the navigator observer on Flutter?

Member Author

What's the current default behavior in Flutter, @buenaflor?

Member

There is a similar issue with React Native. We don't have the screens/routes information without the performance instrumentation (ReactNavigation or ReactNativeNavigation).

We could update the auto instrumentation to work without performance (without creating spans), or get some signal of the change from native.

Or we could have a public API to renew the traceId.

2. Works for spans first, as all spans get added to one trace per screen.

### Cons <a name="option-1-cons"></a>

1. It doesn’t work well for declarative UI frameworks such as Jetpack Compose and SwiftUI, for which the
SDKs can’t reliably and automatically detect when apps load a new screen.

# Drawbacks
Member

@Lms24 Lms24 Jun 10, 2024

We were recently made aware that another implication of long-running traces is that it potentially increases transaction/span quota usage. This is because in JS we inherit the sampling decision for the trace in subsequent transactions.

For example:

  1. Pageload transaction is sampled by rolling the dice
  2. PropagationContext stores positive sampling decision
  3. Interaction transaction is started
  4. Interaction transaction is sampled because the propagation context already holds a positive sampling decision
  5. Repeat for every started transaction until next pageload or navigation

So either we accept this and move on for now by continuing with this behaviour or we break trace consistency by again rolling the dice for new root spans/transactions.

Member Author

@Lms24, please clarify why this wasn't a problem before. I don't understand how this proposed change here will cause this.

Member

@Lms24 Lms24 Jun 10, 2024

I might be missing a bit of context around how Tracing without Performance and the PropagationContext are implemented in mobile SDKs.

If the proposed change is purely intended for TwP scenarios and does not affect the overall trace lifetime in case a root span/transaction is started ("tracing with performance") I think we're good. That is because for TwP, we defer the sampling decision to the downstream service (i.e. send sentry-trace headers without a sampled flag).

In JS however, we changed the trace lifetime not just for TwP but in general, leading to scenarios like the one above. To illustrate further, why this is problematic, I'm gonna adjust the example a bit from above

  1. Initial Pageload transaction is sampled by rolling the dice
  2. PropagationContext stores positive sampling decision
  3. application, still on the same page but after the pageload span ended, makes an http request to a downstream service and propagates the sentry-trace header with the positive sampling decision, forcing the downstream service to positively sample their transaction.
  4. repeat 3 a lot of times (e.g. an application auto-refreshing some state every 5s) and you have a lot of sampled transactions because one initial transaction was sampled positively in the FE.

So even without an active transaction, we'd still propagate a forced sampled flag to downstream services.

Does this make sense?

Member Author

Thanks for the detailed explanation. Yes, that makes sense, but I guess in the long run, you should have a roughly equal amount of transactions. It shouldn't matter if you roll the dice once for 10 transactions or every time for each transaction. If you roll the dice often enough, an equal amount of transactions should be captured.

Member

Not necessarily, unfortunately. This would only hold up if the sample rates on client and server were the same. If users have lower sample rates on the server, they would send significantly fewer server-side transactions with the previous implementation.

I tried verifying this with a small script: https://gist.github.com/Lms24/9a631295aef58cf22fb8f5307953335c
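
The argument can also be illustrated with expected values instead of a simulation. The sketch below is not the linked script. It assumes a deliberately simplified model: one page session issues a fixed number of downstream requests, an inherited decision is always honored by the server, and without a propagated decision the server samples independently at its own rate. The function name and parameters are hypothetical.

```python
def expected_server_transactions(
    client_rate: float, server_rate: float, requests: int, inherit: bool
) -> float:
    """Expected number of sampled server-side transactions for one page session."""
    if inherit:
        # The client's decision is propagated and the server follows it.
        return client_rate * requests
    # No sampled flag is propagated, so the server rolls its own dice.
    return server_rate * requests


for client_rate, server_rate in [(0.5, 0.5), (0.5, 0.01)]:
    inherited = expected_server_transactions(client_rate, server_rate, 100, True)
    independent = expected_server_transactions(client_rate, server_rate, 100, False)
    print(client_rate, server_rate, inherited, independent)
```

Under this model the two approaches agree when client and server rates are equal, and diverge by the ratio of the rates when they are not.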

Member

> When a span starts, SDKs should use the traceID on the PropagationContext

We should say explicitly to use only the traceId, but not the sampling decision, of the PropagationContext.
Regardless of client/server-side sampling, I think we would break sampling in general, as it doesn't apply to the PropagationContext. The sampler function is particularly problematic, imho.
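
One way to read this suggestion, sketched in Python under stated assumptions: a new root transaction reuses the `traceId` from the `PropagationContext` but makes its own sampling decision. `start_transaction`, the `sampled` field, and the injectable `rng` parameter are hypothetical and exist only to make the idea concrete; a real SDK would also have to account for a sampler function.

```python
import random
import uuid
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class PropagationContext:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    # An earlier sampling decision may be stored here; it is ignored below.
    sampled: Optional[bool] = None


@dataclass
class Transaction:
    trace_id: str
    sampled: bool


def start_transaction(
    ctx: PropagationContext,
    sample_rate: float,
    rng: Callable[[], float] = random.random,
) -> Transaction:
    # Reuse the traceId so the transaction joins the current trace, but roll
    # the dice again instead of inheriting ctx.sampled.
    return Transaction(trace_id=ctx.trace_id, sampled=rng() < sample_rate)


ctx = PropagationContext(sampled=True)
tx = start_transaction(ctx, sample_rate=0.0)
print(tx.trace_id == ctx.trace_id, tx.sampled)
```

The trade-off raised earlier in this thread still applies: sampling each root transaction independently keeps quota usage predictable, but a trace may then contain only some of its transactions.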


Why should we not do this? What are the drawbacks of this RFC or a particular option if
multiple options are presented.
Please add drawbacks here if you can think of any.

# Unresolved questions

- What parts of the design do you expect to resolve through this RFC?
- What issues are out of scope for this RFC but are known?
- None.