
Massive increase in transactions upgrading from 5.13.0 to 5.17.3 #2318

Open
rnystrom opened this issue Jun 4, 2024 · 10 comments

@rnystrom

rnystrom commented Jun 4, 2024

We did some routine upgrades on 5/24:

[screenshot]

This was late on a Friday. When our weekday traffic picked up the following week, we noticed an alarming number of transactions being logged:

[screenshot]

This resulted in a much larger Sentry bill than we were expecting. We've been trying to trace where this came from, but we can't find anything. All of the increase in transactions is coming from our API service (Rails).

Our frontend is NextJS, and we also upgraded those Sentry dependencies from 7.110.1 to 8.4.0. I mention that because of our tracing setup. If the client were to send many more requests with tracing enabled, I guess that could result in a large increase in transactions.

# frozen_string_literal: true

Sentry.init do |config|
  config.dsn = Rails.application.credentials&.sentry&.fetch(:dsn)
  config.breadcrumbs_logger = [:active_support_logger, :http_logger, :sentry_logger]
  config.enabled_environments = ["production"]
  config.release = ENV["RELEASE_SHA"]

  if Sidekiq.server?
    config.excluded_exceptions.delete("ActiveRecord::RecordNotFound")
  end

  config.traces_sampler = lambda do |sampling_context|
    unless sampling_context[:parent_sampled].nil?
      next sampling_context[:parent_sampled]
    end

    transaction = sampling_context[:transaction_context]
    op = transaction[:op]

    if op.match?("sidekiq")
      0.000000000001
    else
      0.0
    end
  end
end

However, I do see our client sample rate is still at 2.5% when looking at traces. Example:

[screenshot]

We're having a hard time understanding where the increase in transactions is coming from. I've looked at the releases between 5.13 and 5.17 and don't see anything obvious.

I'd love some help to understand what is happening. Either some explanation on what new transactions are being sent, where to look in Sentry to see more details, or anything else.

Thank you!

@sl0thentr0py
Member

sl0thentr0py commented Jun 5, 2024

@rnystrom the only thing that changed on the Ruby side in those releases with respect to transactions is more DelayedJob tracing, but you don't seem to be using that.

Your frontend upgrade to v8 also includes our massive OpenTelemetry refactor, so that could indeed cause more frontend (and thus more linked backend) transactions.
I confirmed with our JS team and ideally this shouldn't be the case.

I need more info to help you further:

  • your Sentry org slug (if you don't want it visible publicly on GitHub, please route your request through [email protected] and link this issue)
  • your frontend Sentry initializer config
  • what kind of transactions have increased in volume compared to before (you can sort by TPM in your Performance section to find high-volume transactions; see the picture below)

[screenshot]

@rnystrom
Author

rnystrom commented Jun 5, 2024

Thanks @sl0thentr0py! I can provide everything here in the hopes that it helps anyone else.

your sentry org slug

campsite-software

your frontend Sentry initializer config

sentry.client.config.js
import * as Sentry from '@sentry/nextjs'

const SENTRY_DSN = process.env.SENTRY_DSN || process.env.NEXT_PUBLIC_SENTRY_DSN

Sentry.init({
  dsn: SENTRY_DSN,
  beforeSend(event, hint) {
    // Check if the error is a TypeError: Failed to fetch
    if (hint.originalException instanceof TypeError && hint.originalException.message === 'Failed to fetch') {
      // Check if the URL contains "_vercel/speed-insights/vitals"
      const url = event.request.url
      if (url && url.includes('_vercel/speed-insights/vitals')) {
        // Ignore the error
        return null
      }
    }
    // If conditions are not met, return the event
    return event
  },
  ignoreErrors: [
    // a bunch of error regexes and strings from extensions and more
  ],
  denyUrls: [
    // facebook and other extension urls...
  ],
  tracePropagationTargets: ['api.campsite.test', 'api.campsite.design', 'api.campsite.co'],
  tracesSampleRate: 0.025,
  profilesSampleRate: 1.0,
  integrations: [Sentry.httpClientIntegration(), Sentry.browserTracingIntegration(), Sentry.browserProfilingIntegration()]
})
next.config.js
const path = require('path')

/**
 * @type {import('next').NextConfig}
 */

const { withSentryConfig } = require('@sentry/nextjs')

// bunch of setup and config...

/** @type {import('next').NextConfig} */
const moduleExports = {
  // next config...
  async headers() {
    return [
      // other stuff...
      {
        // Sentry Profiling
        // @see https://docs.sentry.io/platforms/javascript/profiling/#step-2-add-document-policy-js-profiling-header
        source: '/(.*)',
        headers: [
          {
            key: 'Document-Policy',
            value: 'js-profiling'
          }
        ]
      }
    ]
  },
  sentry: {
    widenClientFileUpload: true,
    hideSourceMaps: true
  },
  // webpack config...
}

const sentryWebpackPluginOptions = {
  silent: true, // Suppresses all logs
  // For all available options, see:
  // https://github.com/getsentry/sentry-webpack-plugin#options.
  authToken: process.env.SENTRY_AUTH_TOKEN
}

// Make sure adding Sentry options is the last code to run before exporting, to
// ensure that your source maps include changes from all other Webpack plugins
module.exports = withBundleAnalyzer(withSentryConfig(moduleExports, sentryWebpackPluginOptions))

what kind of transactions have increased in volume compared to before (you can sort by TPM in your Performance section to find high volume transactions - see picture below)

Here's a before & after of 2 weeks of API transactions. There doesn't appear to be any one product area that increased; instead the increase seems to be fairly uniform across controller actions and Sidekiq jobs. Here's a sampling:

  • Our Pusher job transactions increased 23x while our actual Pusher usage through May is flat
  • OrganizationMembersController#show 2x'd (not as big a jump but still rather large)
  • NotificationsController#index transactions 36x'd while our actual endpoint hits are flat
  • DeliverWebPushMessageJob 273x'd (our actual web pushes are flat between periods)

Side-note: I'd love to view "previous period" in a column as well. It's hard to compare results between time periods grouped by columns. The dotted previous-period line is nice, but it's just a total for the interval.

[screenshots: API transaction lists for May 8-22 and May 22-June 5]

@sl0thentr0py
Member

sl0thentr0py commented Jun 7, 2024

thx @rnystrom, I don't know the exact problem yet, but here are some observations

Trace differences before / after

if I look at the /[org]/inbox/[inboxView] transaction before May 25, I see a much smaller trace.

[screenshot]

compare that to a more recent trace, which has 177 events compared to the 21 above.

[screenshot]

So for some reason, the traces started on your frontend are now getting linked to far more Rails transactions (via the sentry-trace and baggage headers). The frontend seems to be keeping the same trace_id alive for a very long time.

Because you're using parent_sampled to decide your Rails sampling in your traces_sampler (as you should), you will always respect the frontend's sampling decision on the Rails side, and thus far more transactions get sampled via that frontend-sampled trace.
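
To make that concrete, here's a minimal sketch (not the config from this thread) of how the frontend decision reaches the Ruby sampler: the incoming sentry-trace header ends in a sampled flag, and the SDK hands that flag to traces_sampler as parent_sampled, so the lambda effectively just forwards the frontend's choice.

Sentry.init do |config|
  config.traces_sampler = lambda do |sampling_context|
    # The sentry-trace request header has the form
    # "<trace_id>-<parent_span_id>-<sampled flag>"; the Ruby SDK parses the
    # trailing flag into parent_sampled (nil when there is no incoming trace).
    parent_sampled = sampling_context[:parent_sampled]

    # Forwarding it keeps every backend transaction that belongs to a
    # frontend-sampled trace, however long that trace stays alive on the page.
    next parent_sampled unless parent_sampled.nil?

    0.0 # backend-initiated traces: choose your own rate here
  end
end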

Further steps

  • I'm not 100% certain but I don't see any change on the ruby/rails side that would've caused the increase
  • The JS SDK changed a lot in how it does tracing with v8 because it now uses OpenTelemetry under the hood, so I will ask our frontend folks to take a look at this too
    • If you want to verify this, you could downgrade the NextJS Sentry SDK back to 7.112.2

@sl0thentr0py
Member

for cross-reference, here's a running tracker of problems with NextJS in v8:

getsentry/sentry-javascript#11046

@sl0thentr0py
Member

ok @Lms24 helped me figure out what's up.

v8 changed how long traces live here:
getsentry/sentry-javascript#11599

So if your application has a page that lives for a long time without navigation, your traces will stay alive for much longer with v8. This is valuable feedback on the change, so we might make it more visible or add some tweaks/config so traces don't get too big.

For now, if you're not happy with this change, you can either

  • just sample less if you want to use less quota (the traces are accurate, they just are much longer now)
  • you could do the following reset of the active trace in whatever lifecycle hook (not navigation) you think should be your trace boundary (this is somewhat subjective and depends on the type of application):
// assumes uuid4 from '@sentry/utils' (any UUIDv4 generator works)
import { uuid4 } from '@sentry/utils'

// start a fresh trace: a new trace id plus a new 16-char span id
Sentry.getCurrentScope().setPropagationContext({
    traceId: uuid4(),
    spanId: uuid4().substring(16),
})

@rnystrom
Author

rnystrom commented Jun 7, 2024

Ahh ok, sounds good. I'll start by lowering the client sample rate significantly. The change makes sense; it was just quite shocking the impact it had on us (particularly our bill...). I'll report back once we deploy the change.

@rnystrom
Author

rnystrom commented Jun 7, 2024

So if I lower the client sample rate to, say, 0.0025 from 0.025 (10x lower), we will have far fewer API traces, but we will also have far fewer client traces. This is a bummer, as now we'll have significantly less client trace data for perf monitoring + debugging just to keep our API spend lower.

We may need to implement a custom trace boundary like you suggested, though this entire change is just quite a headache as it changes the trace sampling mental model.

@Lms24
Member

Lms24 commented Jun 10, 2024

Hey @rnystrom, I'm really sorry for the trouble! We didn't take this decision lightly, but rather out of necessity to resolve the confusion around the previous behaviour in version 7 of the SDKs. We knew there would be cases where the new, page-based trace lifetime could cause unexpected behaviour, but it's not great that you're affected in this way.

As an alternative to manually resetting the trace yourself (which, I agree, is probably the best you can do for now), you could also downgrade to v7 of the NextJS SDK. It's still supported, although new features will most likely only go into v8.

We're still re-evaluating the page-based trace lifetime internally, and ideally we can come up with a better approach. However, unfortunately, this entails quite a lot of changes across the Sentry product, so I can't make any promises or give ETAs.

@rnystrom
Author

No worries, appreciate the understanding. We definitely want to stay up to date with SDKs so we can tag along with improvements and new features.

If you all have any advice on our sampling setup so that we can receive more client-side sampling without blowing up our quotas with API sampling, that'd be very helpful.

@sl0thentr0py
Member

sl0thentr0py commented Jun 11, 2024

If you all have any advice on our sampling setup so that we can receive more client-side sampling without blowing up our quotas with API sampling, that'd be very helpful.

You could in principle ignore parent_sampled (the client's sampling decision) sometimes and reduce the API sampling rates independently for API routes you don't care about. That just means those API transactions will be missing from the complete frontend-backend trace.
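
For example, here's a rough sketch layered onto the initializer posted earlier. The path pattern and rates are placeholders, and it assumes the Rack env is exposed in sampling_context (which the Rails integration passes along for incoming requests).

Sentry.init do |config|
  config.traces_sampler = lambda do |sampling_context|
    transaction = sampling_context[:transaction_context]
    op = transaction[:op]

    # Placeholder: rate-limit a known high-volume endpoint independently,
    # even when the frontend sampled the trace (the trade-off being that
    # these transactions will be missing from those traces).
    path = sampling_context.dig(:env, "PATH_INFO")
    next 0.001 if path&.match?(%r{\A/v1/notifications})

    # Everything else keeps respecting the frontend's decision so those
    # traces stay complete end to end.
    unless sampling_context[:parent_sampled].nil?
      next sampling_context[:parent_sampled]
    end

    op.to_s.match?("sidekiq") ? 0.000000000001 : 0.0
  end
end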
