
healthcheckextension "check_collector_pipeline" is always healthy, even when exporter is failing #11780

Open
ItsLastDay opened this issue Jun 28, 2022 · 9 comments
Assignees: jpkrohling
Labels: bug (Something isn't working), extension/healthcheck (Health Check Extension), never stale (Issues marked with this label will be never staled and automatically removed), priority:p2 (Medium)

Comments

@ItsLastDay
Contributor

Describe the bug
I'm trying to use the "check_collector_pipeline" feature of the healthcheckextension. However, the health check always returns status code 200, even when my exporter is failing.
In the "Additional context" section below I break down the technical cause of this behavior.

Steps to reproduce
I built a simple OTel collector:

Then I built a Docker image out of that agent (not showing the commands here, because they are specific to my setup).

Then I ran that image, attached to it with docker attach, and issued curl requests to the health endpoint (shown below).
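For illustration only, a minimal configuration enabling the pipeline check against a deliberately failing exporter might look roughly like this (a sketch, not my actual agent-conf.yaml; the otlp exporter pointed at an unreachable endpoint is an assumption):

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
    check_collector_pipeline:
      enabled: true
      interval: "5m"
      exporter_failure_threshold: 5

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: unreachable-host:4317   # deliberately failing exporter (assumption)
    tls:
      insecure: true

service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp]
```

The health endpoint can then be queried with e.g. curl -v http://localhost:13133/ (assuming the default port).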

What did you expect to see?
I expected to see status code 500 coming from health checks, because the exporter I'm using is constantly failing (on purpose).

What did you see instead?
However, the health check returns status code 500 only for a limited amount of time. Then, it always returns status code 200: https://pastebin.com/uR5X6DAr.

What version did you use?
v0.54.0

What config did you use?
agent-conf.yaml

Environment
OS: Debian GNU/Linux rodete, kernel version 5.17.11-1rodete2-amd64
Compiler(if manually compiled): go 1.18.3

Additional context
I investigated the problem and found the technical cause. However, I'm not sure how to fix it properly.

I added more logging to the healthcheckextension: healthcheckextension.go, exporter.go.
Here are my OpenTelemetry logs coming from the run (beware, those are very verbose): https://pastebin.com/F2ZJQhAf.

Notice that the HealthCheckExtension's internal queue exporterFailureQueue grows at first (log message HC queue size: 0, then 1, 2, ..., 6). Then it becomes empty (size 0) or contains one element (log message exiting `rotate` function, queue size is 1).
The extension decides whether the exporter is healthy based on the size of this queue. Since the queue has <= 1 element all the time, the health check returns 200 OK all the time.

The exporter itself was constantly failing (log messages containing Exporting failed.), but the queue was not growing. Why? Take, for example, the failure at 2022-06-28T14:35:48.345Z error exporterhelper/queued_retry.go:149 Exporting failed.. There was one element added to the exporterFailureQueue, with the following timestamps:

 Start: (time.Time) 2022-06-28 14:33:18.342453932 +0000 UTC m=+0.039182067,
 End: (time.Time) 2022-06-28 14:35:48.306595188 +0000 UTC m=+150.003323321,

Then, the extension compared Start with the current time, and removed the element from the exporterFailureQueue (the current time at that point was 2022-06-28 14:35:48.344996512 +0000).

Overall, notice that the Start of every element of exporterFailureQueue is constant: 14:33:18. It corresponds to the start time of the collector.
However, exporter.go compares Start with the current time as if Start were the time of the exporter failure.
Thus, as long as the collector was started within health_check::check_collector_pipeline::interval of the current time, the health check works fine. Once that interval has passed, the health check always returns 200 OK.
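To make the failure mode concrete, a simplified paraphrase of the rotation logic (not the exact exporter.go code) might look like this:

```go
package healthchecksketch

import (
	"time"

	"go.opencensus.io/stats/view"
)

// rotate is a simplified paraphrase of the extension's queue rotation,
// not the exact exporter.go code.
func rotate(queue []*view.Data, interval time.Duration) []*view.Data {
	currentTime := time.Now()
	kept := queue[:0]
	for _, vd := range queue {
		// vd.Start is the collector start time for cumulative views, so once
		// the collector has been running longer than `interval` this condition
		// is false for every element and the queue is drained on every rotation.
		if vd.Start.Add(interval).After(currentTime) {
			kept = append(kept, vd)
		}
	}
	return kept
}
```

Because vd.Start never moves, the condition stops holding for every queued element once the collector's uptime exceeds the interval, so the queue never accumulates enough failures to trip the threshold.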

Solutions that I can imagine:

  • compare vd.End instead of vd.Start with current time here

@TylerHelmuth
Member

pinging @jpkrohling as code owner

@TylerHelmuth TylerHelmuth added extension/healthcheck Health Check Extension priority:p2 Medium labels Jul 1, 2022
@jpkrohling jpkrohling self-assigned this Jul 5, 2022
@ItsLastDay
Contributor Author

ItsLastDay commented Jul 6, 2022

I dug deeper into this, and here are the issues I've found:

  • an exporter can only be health-checked if it uses obsreport. For example, this can be achieved by using exporterhelper like this. This is because obsreport defines the exporter/send_failed_requests view (later registered in service/telemetry.go), which the health check relies on. This is not documented anywhere;
  • exporter/send_failed_requests counts the number of failed spans: code. The exporter that I want to health-check exports metrics, so exporter/send_failed_requests will be 0 for my exporter, regardless of the number of errors it gets while exporting metrics. By the way, this View also does not use tagKeys, which loses the information about which exporter had an error;
  • according to OpenCensus code, View aggregations are always cumulative: reference. Hence view.Start, which the healthcheckextension checks, will always be the start time of the collector.

I'm implementing health checks in my own code for my exporter, and here is what I plan to try:

  • create another View for send_failed_metric_points, with aggregation=LastValue
  • subscribe my health checker to this view
  • do the same loop as healthcheckextension does, but check vd.End instead of vd.Start when pushing elements out of the queue
  • when adding a View to the queue, check whether the LastValue is non-zero (i.e. there were errors in the last period). Also check the name of the exporter (this lets me do health checks for individual exporters, instead of monitoring the whole pipeline like the current healthcheckextension does)

I'll implement this and then comment on this issue whether it was successful.
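Roughly, that plan could look like the following sketch, assuming OpenCensus stats/view; the measure and view names here are placeholders mirroring obsreport's send_failed_metric_points, not the collector's actual internal symbols:

```go
package healthchecksketch

import (
	"sync"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// Hypothetical measure and tag key mirroring obsreport's
// exporter/send_failed_metric_points; the real ones live inside the collector.
var (
	failedPoints = stats.Int64("exporter/send_failed_metric_points", "number of metric points that failed to be sent", stats.UnitDimensionless)
	exporterKey  = tag.MustNewKey("exporter")

	failedPointsLastValue = &view.View{
		Name:        "exporter/send_failed_metric_points_last",
		Description: "last observed count of failed metric points, per exporter",
		Measure:     failedPoints,
		Aggregation: view.LastValue(),
		TagKeys:     []tag.Key{exporterKey}, // keep the exporter name so checks can be per-exporter
	}
)

// healthChecker implements view.Exporter and keeps recent failures in a queue.
type healthChecker struct {
	mu       sync.Mutex
	interval time.Duration
	queue    []*view.Data
}

// ExportView receives view data; it enqueues data points whose LastValue is
// non-zero and expires entries based on vd.End instead of vd.Start.
func (h *healthChecker) ExportView(vd *view.Data) {
	if vd.View.Name != failedPointsLastValue.Name {
		return
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, row := range vd.Rows {
		if lv, ok := row.Data.(*view.LastValueData); ok && lv.Value > 0 {
			// row.Tags carries the exporter name, so a failure could be
			// attributed to an individual exporter here if needed.
			h.queue = append(h.queue, vd)
			break
		}
	}
	cutoff := time.Now().Add(-h.interval)
	kept := h.queue[:0]
	for _, item := range h.queue {
		if item.End.After(cutoff) { // End, not Start: the time of the last recording
			kept = append(kept, item)
		}
	}
	h.queue = kept
}

// Healthy reports whether the number of recent failures is below the threshold.
func (h *healthChecker) Healthy(threshold int) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return len(h.queue) < threshold
}

// Setup registers the view and subscribes the checker to view data.
func Setup(interval time.Duration) (*healthChecker, error) {
	hc := &healthChecker{interval: interval}
	if err := view.Register(failedPointsLastValue); err != nil {
		return nil, err
	}
	view.RegisterExporter(hc)
	return hc, nil
}
```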

@ItsLastDay
Contributor Author

ItsLastDay commented Jul 7, 2022

Here is the code which works for my use case: exporter.go.

@jpkrohling
Member

Thanks for this investigation and sorry for taking so long to take a look at this.

This is not documented anywhere;

This is likely because we still haven't made our internal observability ready for wider consumption by other components. But I think this specific characteristic should indeed be documented somewhere. Where do you think you would have found it more naturally?

create another View for send_failed_metric_points, with aggregation=LastValue

I think they could/should share the same metric. If it's failing for metrics or traces, I would want to see that as part of the same metric as an admin. Other labels perhaps, but the same metric.

compare vd.End instead of vd.Start with current time here

@skyduo would be the best person to comment on the choice of methods here, but this is used to invalidate old data, right? Wouldn't it be more suitable to calculate the failure rate instead? Perhaps by storing the outcome of the last events and setting the health check based on their success rate. Both parameters (number of last events to store, success rate) could have reasonable defaults and be configurable.
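For the sake of discussion, a hypothetical sketch of that idea (not an existing collector API) could be a ring buffer of the last N export outcomes:

```go
package ratecheck

// Hypothetical sketch of the failure-rate idea above, not an existing
// collector API: remember the outcome of the last N export attempts and
// report unhealthy when the success rate drops below a threshold.
type pipelineHealth struct {
	outcomes []bool  // ring buffer: true = export attempt succeeded
	next     int     // next slot to overwrite
	filled   bool    // true once the buffer has wrapped around
	minRate  float64 // e.g. 0.9; both N and this rate would be configurable
}

func newPipelineHealth(lastN int, minRate float64) *pipelineHealth {
	return &pipelineHealth{outcomes: make([]bool, lastN), minRate: minRate}
}

// record stores the outcome of one export attempt.
func (p *pipelineHealth) record(success bool) {
	p.outcomes[p.next] = success
	p.next = (p.next + 1) % len(p.outcomes)
	if p.next == 0 {
		p.filled = true
	}
}

// healthy computes the success rate over the stored attempts.
func (p *pipelineHealth) healthy() bool {
	n := p.next
	if p.filled {
		n = len(p.outcomes)
	}
	if n == 0 {
		return true // no data yet: report healthy
	}
	ok := 0
	for i := 0; i < n; i++ {
		if p.outcomes[i] {
			ok++
		}
	}
	return float64(ok)/float64(n) >= p.minRate
}
```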

@jpkrohling
Member

In any case, if just switching from Start to End would provide a good solution for the short term, I'm all for it.

@ItsLastDay
Contributor Author

Where do you think you would have found it more naturally?

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/healthcheckextension/README.md would be good. Something like "It only supports monitoring exporter failures, and only for exporters which use exporterhelper".

I think they could/should share the same metric. If it's failing for metrics or traces, I would want to see that as part of the same metric as an admin. Other labels perhaps, but the same metric.

Yeah, I agree that would make more sense. I'll leave it up to you and/or obsreport code owners to decide on that.

but this is used to invalidate old data, right?

That's true. However, vd.Start is a fixed value that does not change during the collector's lifetime. Hence, all the data is invalidated in the current implementation.

@jpkrohling
Member

I'll leave it up to you and/or obsreport code owners to decide on that.

@open-telemetry/collector-approvers , what do you think about this? Should all signals share a single metric to report failure to send data? We can still have one label per signal, but keeping the same metric would make more sense to me.

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Nov 10, 2022
@jpkrohling jpkrohling added never stale Issues marked with this label will be never staled and automatically removed and removed Stale labels Nov 24, 2022
@haoqixu
Member

haoqixu commented Jan 12, 2023

The current implementation of healthcheckextension depends on the internal observability of exporters to check the health status.
Maybe it is better to use the component status reporting mechanism introduced by open-telemetry/opentelemetry-collector#6560. 🤔
