Reduce trace costs #16

Open
15 of 17 tasks
pbiggar opened this issue Apr 4, 2023 · 1 comment

Comments

@pbiggar
Member

pbiggar commented Apr 4, 2023

Our OpenTelemetry provider is putting their prices up, so we should reduce how much we use.

We currently use about 1.2B events per month, and the next pricing threshold down is 450M.

The events are currently split as follows:

cloudsql-proxy 0.11%
kubernetes-bwd-nginx 0.15%
kubernetes-bwd-ocaml 57.03% (1.13B)
kubernetes-garbagecollector 38.02% (376M)
kubernetes-metrics 4.69% (45M)

Within kubernetes-bwd-ocaml, the events are split by service:

BwdServer | 608,015,209
QueueWorker | 354,919,048
ApiServer | 66,742,393
CronChecker | 38,742,278
other | 5,528,954

Note that these numbers don't add up, because BwdServer had an unusually big month due to an anomaly.

To address this:

  • use TraceRatio samplers on each service (20% for BwdServer, 20% for QueueWorker, 100% for others)
    • write code
    • merge to dark repo
    • backport to classic-dark repo
    • merge & deploy
    • add flags to LaunchDarkly.
      • add flags
      • BwdServer
      • QueueWorker
      • check it works
    • Reduce plan
  • use Honeycomb sampling for garbagecollector (5% should be fine; I'd be surprised if we ever look at this again)
    • merge change
    • check it worked
  • disable k8s metrics (we get these from Google Cloud anyway)
    • merge change
    • check it worked
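
To illustrate the sampling idea in the list above, here is a minimal sketch of trace-ID ratio sampling in plain Python. It is not the code that was merged: the `SAMPLE_RATIOS` dict is a hypothetical stand-in for the LaunchDarkly flags, and `should_sample` and `kept_fraction` are names invented for this example. The decision depends only on the trace ID, so every span of a trace gets the same keep/drop result.

```python
import random

# Only the low 64 bits of the (128-bit) trace ID are compared against the bound.
TRACE_ID_LIMIT = 1 << 64

# Hypothetical stand-in for the per-service ratios held in LaunchDarkly flags.
SAMPLE_RATIOS = {"BwdServer": 0.20, "QueueWorker": 0.20}
DEFAULT_RATIO = 1.0  # all other services keep every trace


def should_sample(service: str, trace_id: int) -> bool:
    """Keep a trace if the low 64 bits of its ID fall below ratio * 2^64."""
    ratio = SAMPLE_RATIOS.get(service, DEFAULT_RATIO)
    bound = round(ratio * TRACE_ID_LIMIT)
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound


def kept_fraction(service: str, n: int = 100_000, seed: int = 0) -> float:
    """Estimate the fraction of random trace IDs that would be kept."""
    rng = random.Random(seed)
    kept = sum(should_sample(service, rng.getrandbits(128)) for _ in range(n))
    return kept / n
```

With uniformly random trace IDs, `kept_fraction("BwdServer")` should come out close to 0.20, while any service without an entry keeps everything.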

Overall, this should reduce us from 1.8B in March to:
BwdServer: 121M
QueueWorker: 71M
ApiServer: 67M
CronChecker: 39M
kubernetes-bwd-ocaml other: 6M
garbagecollector: 18M

Overall around 350M (the figures above sum to about 322M), which is under the 450M threshold.
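
The projection can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the event counts are copied from this issue (garbagecollector and metrics use the rounded 376M and 45M figures), the keep ratios are the ones proposed in the plan, and cloudsql-proxy and kubernetes-bwd-nginx are left out because no absolute counts are given for them. The names `MARCH_EVENTS`, `KEEP_RATIO`, and `projected_events` are made up for this example.

```python
# Event counts as reported in this issue.
MARCH_EVENTS = {
    "BwdServer": 608_015_209,
    "QueueWorker": 354_919_048,
    "ApiServer": 66_742_393,
    "CronChecker": 38_742_278,
    "kubernetes-bwd-ocaml other": 5_528_954,
    "garbagecollector": 376_000_000,  # rounded figure from the split above
    "kubernetes-metrics": 45_000_000,  # rounded figure from the split above
}

# Fraction of events kept under the proposed plan; unlisted services keep 100%.
KEEP_RATIO = {
    "BwdServer": 0.20,
    "QueueWorker": 0.20,
    "garbagecollector": 0.05,
    "kubernetes-metrics": 0.0,  # disabled entirely
}

THRESHOLD = 450_000_000  # next pricing threshold down


def projected_events() -> dict[str, int]:
    """Apply each service's keep ratio to its reported event count."""
    return {
        name: round(count * KEEP_RATIO.get(name, 1.0))
        for name, count in MARCH_EVENTS.items()
    }


projection = projected_events()
total = sum(projection.values())
for name, count in projection.items():
    print(f"{name}: {count / 1e6:.0f}M")
print(f"total: {total / 1e6:.0f}M (threshold {THRESHOLD / 1e6:.0f}M)")
```

Under these assumptions the total lands at roughly 322M, leaving some headroom below the 450M threshold even if the smaller services grow a little.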

@pbiggar
Member Author

pbiggar commented May 7, 2023

Confirmed this is in production and works. We just need final confirmation that it does in fact lower telemetry usage.
