[proxy] return 503 for /analytics errors, instead of 502 #19970

Merged · 1 commit merged into main on Jun 26, 2024

Conversation

@kylos101 (Contributor) commented Jun 25, 2024

Description

The default ErrorHandler for httputil.NewSingleHostReverseProxy returns 502 status codes. Let us instead return a 503 status code for the /analytics endpoint.
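
For illustration, here is a minimal sketch of that override in Go. It is not the actual plugin code; the newAnalyticsProxy helper and its target are assumptions, but the ErrorHandler field shown is the standard net/http/httputil hook whose default writes a 502.

package proxy

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newAnalyticsProxy is a hypothetical helper that proxies the /analytics
// endpoint and swaps the default ErrorHandler (which writes 502) for one
// that writes 503.
func newAnalyticsProxy(target *url.URL) *httputil.ReverseProxy {
	proxy := httputil.NewSingleHostReverseProxy(target)
	proxy.ErrorHandler = func(w http.ResponseWriter, r *http.Request, err error) {
		log.Printf("analytics proxy error: %v", err)
		// 503 Service Unavailable instead of the default 502 Bad Gateway.
		w.WriteHeader(http.StatusServiceUnavailable)
	}
	return proxy
}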

Why? This will help us trigger the ProxyBadGateway alert less often, so that ideally it fires only when proxy failures are actually breaking the user experience.

Background:
The alert fires more often in Enterprise for /analytics/v1/batch, which routes to telemetry-exporter (and then on to Firehose and S3); the failures most likely originate in telemetry-exporter.

More background:
I checked the logs for telemetry-exporter and could not find a record of a failure. Therefore, I opted for this change, so as to make the ProxyBadGateway alert more valuable by excluding /analytics requests.

Related Issue(s)

Related to ENT-268
Fixes ENT-331

How to test

With a kube context for the preview, do:

kubectl set env deployment/proxy ANALYTICS_PLUGIN_SEGMENT_ENDPOINT=https://idontexistnope.com/foo

Then start a workspace and check the logs and metrics.

In the preview logs: (screenshot omitted)

In metrics: (screenshot omitted)

Documentation

Preview status

Gitpod was successfully deployed to your preview environment.

Build Options

Build
  • /werft with-werft
    Run the build with werft instead of GHA
  • leeway-no-cache
  • /werft no-test
    Run Leeway with --dont-test
Publish
  • /werft publish-to-npm
  • /werft publish-to-jb-marketplace
Installer
  • analytics=segment
  • with-dedicated-emulation
  • workspace-feature-flags
    Add desired feature flags to the end of the line above, space separated
Preview Environment / Integration Tests
  • /werft with-local-preview
    If enabled this will build install/preview
  • /werft with-preview
  • /werft with-large-vm
  • /werft with-gce-vm
    If enabled this will create the environment on GCE infra
  • /werft preemptible
    Saves cost. Untick this only if you're really sure you need a non-preemptible machine.
  • with-integration-tests=all
    Valid options are all, workspace, webapp, ide, jetbrains, vscode, ssh. If enabled, with-preview and with-large-vm will be enabled.
  • with-monitoring

/hold

Elevated rates of 502 errors from Proxy can trigger ProxyBadGateway, which can be a warning that end users are having trouble (for example, if supervisor is returning 502s).

Sometimes /analytics returns 502, as with Enterprise, when telemetry-exporter has trouble. In these scenarios, we do not need or want to trigger ProxyBadGateway (which pages on-call).

So, for /analytics errors, we return a 503 response code instead of a 502.
@kylos101 (Contributor, Author)

@akosyakov you probably have the most experience in this space, could I ask for a review, time permitting?

@kylos101 changed the title from "[proxy] rewrite 502 for /analytics" to "[proxy] send 503 for /analytics errors, instead of 502" on Jun 25, 2024
@kylos101 changed the title from "[proxy] send 503 for /analytics errors, instead of 502" to "[proxy] return 503 for /analytics errors, instead of 502" on Jun 25, 2024
@akosyakov (Member)

@kylos101 why does it fire more often on Enterprise? Should we fix the root cause? Don't we miss analytics then?

@geropl (Member) commented Jun 26, 2024

Why? This will help us trigger the ProxyBadGateway alert less often, so that ideally it fires only when proxy failures are actually breaking the user experience.

@kylos101 Did we try to ignore the route in proxy for error reporting? This feels like "too far away" from the impact, and can actually be misleading to others as well. This could help us find time (and prioritize) to fix the underlying root cause as Anton mentioned.

@kylos101 (Contributor, Author)

👋 Thanks for the review, @akosyakov! Let me know if this helps or if you need more info.

@kylos101 why does it fire more often on Enterprise?

There is a separate issue that sometimes causes sending analytics to fail for Enterprise.

Should we fix the root cause? Don't we miss analytics then?

Yes, but that is a separate problem; I will create a Linear issue. In other words, this PR aims to avoid returning 502 errors for /analytics failures.

When Proxy returns a 502, we'll notify via PagerDuty; for a 503, we'll notify via Slack. This way, failures for /analytics don't wake someone up at night.

@kylos101 (Contributor, Author)

👋 @geropl, let me know if this helps, or if you'd like more info.

@kylos101 Did we try to ignore the route in proxy for error reporting?

What status code would you expect Proxy to return on ignore for /analytics failures?

Based on docs I've read and testing I've done, changes to the error handler must be done in code (rather than the Caddyfile). ℹ️ The default error code in Caddy reverse proxy is 502.

Returning a 503 for errors associated with /analytics gives us flexibility at the alerting level. Ideally, 502s would trigger alerts that route to PagerDuty (to wake someone up), while 503s would trigger alerts that route to Slack (to not wake someone up). This change only impacts the analytics handler; the default handler for the others will continue to return 502.

Presently, /analytics/v1/batch is the only endpoint regularly producing 502 errors with Enterprise; since it's not user-facing, I changed it to 503.
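
To make the expected behaviour concrete, here is a hypothetical test sketch (not part of this PR) that points the newAnalyticsProxy helper from the earlier sketch at an unreachable upstream and asserts that the proxy answers with 503 rather than 502:

package proxy

import (
	"net/http"
	"net/http/httptest"
	"net/url"
	"testing"
)

func TestAnalyticsProxyReturns503OnUpstreamFailure(t *testing.T) {
	// Nothing listens on this address, so dialing fails and the ErrorHandler runs.
	target, err := url.Parse("http://127.0.0.1:1")
	if err != nil {
		t.Fatal(err)
	}
	proxy := newAnalyticsProxy(target)

	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodPost, "/analytics/v1/batch", nil)
	proxy.ServeHTTP(rec, req)

	if rec.Code != http.StatusServiceUnavailable {
		t.Fatalf("expected 503, got %d", rec.Code)
	}
}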

This feels like "too far away" from the impact,

How so? Can you elaborate?

I'll share a separate issue for the analytics failures potentially caused by telemetry-exporter. (ref)

and can actually be misleading to others as well.

What do you mean / misleading how?

The new normal will be to get PagerDuty alerts when 502s exceed their threshold. That will lead us to fix the bug and/or change the underlying handler to a different status code (depending on whether we want to wake someone up for that handler). For example, if the handler serves user-facing traffic, we'd leave the error handler alone (so it continues to send 502). However, if it's not user-facing, we could consider changing it to a 503 (or some other code).

This could help us find time (and prioritize) to fix the underlying root cause as Anton mentioned.

Returning a 503 buys us time by keeping this alert from firing in PagerDuty. We can create a new alert that routes 503s to Slack. Would that work?

As discussed above, I'll make a separate issue for the root cause. What I'm trying to solve here is the PagerDuty alerting for /analytics failures.

@geropl (Member) commented Jun 26, 2024

@kylos101 Sorry for the noise! I was thinking about using a replace_status directive, but I checked, and we call the plugin directly... 🙈

@kylos101 (Contributor, Author)

For posterity, I created https://linear.app/gitpod/issue/ENT-346/, to get to the root cause for the /analytics failures.

@roboquat merged commit aaa928c into main on Jun 26, 2024
95 of 98 checks passed
@roboquat deleted the kylos101/ENT-268 branch on June 26, 2024 at 13:03