[proxy] return 503 for /analytics errors, instead of 502 #19970

Merged · 1 commit merged into main on Jun 26, 2024

Conversation

@kylos101 (Contributor) commented Jun 25, 2024

Description

The default ErrorHandler for httputil.NewSingleHostReverseProxy returns 502 status codes. Let us instead return a 503 status code for the /analytics endpoint.
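
For illustration, here is a minimal sketch of that override in Go. It is not the actual plugin code; the newAnalyticsProxy helper and its target are assumptions, but the ErrorHandler field shown is the standard net/http/httputil hook whose default writes a 502.

package proxy

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newAnalyticsProxy is a hypothetical helper that proxies the /analytics
// endpoint and swaps the default ErrorHandler (which writes 502) for one
// that writes 503.
func newAnalyticsProxy(target *url.URL) *httputil.ReverseProxy {
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		log.Printf("analytics proxy error: %v", err)
		// 503 Service Unavailable instead of the default 502 Bad Gateway.
		w.WriteHeader(http.StatusServiceUnavailable)
	}
	return proxy
}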

Why? This will help us trigger the ProxyBadGateway alert less often, so that ideally it fires only when proxy failures are actually breaking the user experience.

Background:
The alert fires more often in Enterprise for /analytics/v1/batch, which routes to telemetry-exporter (and then on to Firehose and S3); the failures most likely originate in telemetry-exporter.

More background:
I checked the logs for telemetry-exporter and could not find a record of a failure. Therefore, I opted for this change, so as to make the ProxyBadGateway alert more valuable by excluding /analytics requests.

Related Issue(s)

Related to ENT-268
Fixes ENT-331

How to test

With a kube context for the preview, do:

kubectl set env deployment/proxy ANALYTICS_PLUGIN_SEGMENT_ENDPOINT=https://idontexistnope.com/foo

Then start a workspace and check the logs and metrics.

In the preview logs: (screenshot omitted)

In metrics: (screenshot omitted)

Documentation

Preview status

Gitpod was successfully deployed to your preview environment.

Build Options

Build
  • /werft with-werft
    Run the build with werft instead of GHA
  • leeway-no-cache
  • /werft no-test
    Run Leeway with --dont-test
Publish
  • /werft publish-to-npm
  • /werft publish-to-jb-marketplace
Installer
  • analytics=segment
  • with-dedicated-emulation
  • workspace-feature-flags
    Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-gce-vm
    If enabled this will create the environment on GCE infra
  • /werft preemptible
    Saves cost. Untick this only if you're really sure you need a non-preemptible machine.
  • with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
  • with-monitoring

/hold

Elevated rates of 502 errors from Proxy can trigger ProxyBadGateway, which can be a warning that end users are having trouble (for example, if supervisor is returning 502s).

Sometimes /analytics returns 502, as with Enterprise, when telemetry-exporter has trouble. In these scenarios, we do not need or want to trigger ProxyBadGateway (which pages on-call).

So, for /analytics errors, we return a 503 response code instead of a 502.
@kylos101 (Contributor, Author)

@akosyakov you probably have the most experience in this space, could I ask for a review, time permitting?

@kylos101 changed the title from "[proxy] rewrite 502 for /analytics" to "[proxy] send 503 for /analytics errors, instead of 502" on Jun 25, 2024
@kylos101 changed the title from "[proxy] send 503 for /analytics errors, instead of 502" to "[proxy] return 503 for /analytics errors, instead of 502" on Jun 25, 2024
@akosyakov (Member)

@kylos101 why does it fire more often on Enterprise? Should we fix the root cause? Don't we miss analytics then?

@geropl (Member) commented Jun 26, 2024

Why? This will help us trigger the ProxyBadGateway alert less often, so that ideally it fires only when proxy failures are actually breaking the user experience.

@kylos101 Did we try to ignore the route in proxy for error reporting? This feels like "too far away" from the impact, and can actually be misleading to others as well. This could help us find time (and prioritize) to fix the underlying root cause as Anton mentioned.

@kylos101 (Contributor, Author)

👋 Thanks for the review, @akosyakov! Let me know if this helps or if you need more info.

@kylos101 why does it fire more often on Enterprise?

There is a separate issue that sometimes causes sending analytics to fail for Enterprise.

Should we fix the root cause? Don't we miss analytics then?

Yes, but that is a separate problem; I will create a Linear issue. In other words, this PR aims to avoid returning 502 errors for /analytics failures.

When Proxy returns a 502, we'll notify via PagerDuty; for a 503, we'll notify via Slack. This way, failures for /analytics don't wake someone up at night.

@kylos101 (Contributor, Author)

👋 @geropl, let me know if this helps, or if you'd like more info.

@kylos101 Did we try to ignore the route in proxy for error reporting?

What status code would you expect Proxy to return on ignore for /analytics failures?

Based on docs I've read and testing I've done, changes to the error handler must be done in code (rather than the Caddyfile). ℹ️ The default error code in Caddy reverse proxy is 502.

Returning a 503 for errors associated with /analytics gives us flexibility at the alerting level. Ideally, 502s would trigger alerts that route to PagerDuty (to wake someone up), while 503s would trigger alerts that route to Slack (to not wake someone up). This change only impacts the analytics handler; the default handler for the others will continue to return 502.

Presently, /analytics/v1/batch is the only endpoint regularly producing 502 errors with Enterprise; since it's not user-facing, I changed it to 503.
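
To make the expected behaviour concrete, here is a hypothetical test sketch (not part of this PR) that points the newAnalyticsProxy helper from the earlier sketch at an unreachable upstream and asserts that the proxy answers with 503 rather than 502:

package proxy

import (
	"net/http"
	"net/http/httptest"
	"net/url"
	"testing"
)

func TestAnalyticsProxyReturns503OnUpstreamFailure(t *testing.T) {
	// Nothing listens on this address, so dialing fails and the ErrorHandler runs.
	target, err := url.Parse("http://127.0.0.1:1")
	if err != nil {
		t.Fatal(err)
	}
	proxy := newAnalyticsProxy(target)

	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/analytics/v1/batch", nil)
	proxy.ServeHTTP(rec, req)

	if rec.Code != http.StatusServiceUnavailable {
		t.Fatalf("expected 503, got %d", rec.Code)
	}
}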

This feels like "too far away" from the impact,

How so? Can you elaborate?

I'll share a separate issue for the analytics failures potentially caused by telemetry-exporter. (ref)

and can actually be misleading to others as well.

What do you mean / misleading how?

The new normal will be to get PagerDuty alerts when 502s exceed their threshold. That will lead us to fix the bug and/or change the underlying handler to a different status code (depending on whether we want to wake someone up for that handler). For example, if the handler serves user-facing traffic, we'd leave the error handler alone (so it continues to send 502). However, if it's not user-facing, we could consider changing it to a 503 (or some other code).

This could help us find time (and prioritize) to fix the underlying root cause as Anton mentioned.

Returning a 503 buys us time by keeping this alert from firing in PagerDuty. We can create a new alert that routes 503s to Slack. Would that work?

As discussed above, I'll make a separate issue for the root cause. What I'm trying to solve here is the PagerDuty alerting for /analytics failures.

@geropl (Member) commented Jun 26, 2024

@kylos101 Sorry for the noise! I was thinking about using a replace_status directive, but I checked, and we call the plugin directly... 🙈

@kylos101 (Contributor, Author)

For posterity, I created https://linear.app/gitpod/issue/ENT-346/, to get to the root cause for the /analytics failures.

@roboquat merged commit aaa928c into main on Jun 26, 2024
95 of 98 checks passed
@roboquat deleted the kylos101/ENT-268 branch on June 26, 2024 at 13:03