
healthcheckextension "check_collector_pipeline" is always healthy, even when exporter is failing #11780

Open
ItsLastDay opened this issue Jun 28, 2022 · 9 comments
Assignees: jpkrohling
Labels: bug (Something isn't working), extension/healthcheck (Health Check Extension), never stale (Issues marked with this label will be never staled and automatically removed), priority:p2 (Medium)

Comments

@ItsLastDay
Contributor

Describe the bug
I'm trying to use the "check_collector_pipeline" feature of the healthcheckextension. However, the health check always returns status code 200, even when my exporter is failing.
In the "Additional context" section below I break down the technical cause of this behavior.

Steps to reproduce
I built a simple OTel collector:

Then I built a Docker image out of that agent (not showing the commands here, because they are specific to my setup).

Then I ran that image, attached to it with docker attach, and issued curl requests to the health endpoint (shown below).
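For illustration only, a minimal configuration enabling the pipeline check against a deliberately failing exporter might look roughly like this (a sketch, not my actual agent-conf.yaml; the otlp exporter pointed at an unreachable endpoint is an assumption):

```yaml
extensions:
  health_check:
    endpoint: 0.0.0.0:13133
    check_collector_pipeline:
      enabled: true
      interval: "5m"
      exporter_failure_threshold: 5

receivers:
  otlp:
    protocols:
      grpc:

exporters:
  otlp:
    endpoint: unreachable-host:4317   # deliberately failing exporter (assumption)
    tls:
      insecure: true

service:
  extensions: [health_check]
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlp]
```

The health endpoint can then be queried with e.g. curl -v http://localhost:13133/ (assuming the default port).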

What did you expect to see?
I expected to see status code 500 coming from health checks, because the exporter I'm using is constantly failing (on purpose).

What did you see instead?
However, the health check returns status code 500 only for a limited amount of time. Then, it always returns status code 200: https://pastebin.com/uR5X6DAr.

What version did you use?
v0.54.0

What config did you use?
agent-conf.yaml

Environment
OS: Debian GNU/Linux rodete, kernel version 5.17.11-1rodete2-amd64
Compiler(if manually compiled): go 1.18.3

Additional context
I investigated the problem and found the technical cause. However, I'm not sure how to fix it properly.

I added more logging to the healthcheckextension: healthcheckextension.go, exporter.go.
Here are my OpenTelemetry logs coming from the run (beware, those are very verbose): https://pastebin.com/F2ZJQhAf.

Notice that the HealthCheckExtension's internal queue exporterFailureQueue grows at first (log message HC queue size: 0, then 1, 2, ..., 6). Then it becomes empty (size 0) or contains one element (log message exiting `rotate` function, queue size is 1).
The extension decides whether the exporter is healthy based on the size of this queue. Since the queue has <= 1 element all the time, the health check returns 200 OK all the time.

The exporter itself was constantly failing (log messages containing Exporting failed.), but the queue was not growing. Why? Take, for example, the failure at 2022-06-28T14:35:48.345Z error exporterhelper/queued_retry.go:149 Exporting failed.. There was one element added to the exporterFailureQueue, with the following timestamps:

 Start: (time.Time) 2022-06-28 14:33:18.342453932 +0000 UTC m=+0.039182067,
 End: (time.Time) 2022-06-28 14:35:48.306595188 +0000 UTC m=+150.003323321,

Then, the extension compared Start with the current time, and removed the element from the exporterFailureQueue (the current time at that point was 2022-06-28 14:35:48.344996512 +0000).

Overall, notice that the Start of every element of exporterFailureQueue is constant: 14:33:18. It corresponds to the start time of the collector.
However, exporter.go compares Start with the current time as if Start were the time of the exporter failure.
Thus, as long as the collector was started within health_check::check_collector_pipeline::interval of the current time, the health check works fine. Once that interval has passed, the health check always returns 200 OK.
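To make the failure mode concrete, a simplified paraphrase of the rotation logic (not the exact exporter.go code) might look like this:

```go
package healthchecksketch

import (
	"time"

	"go.opencensus.io/stats/view"
)

// rotate is a simplified paraphrase of the extension's queue rotation,
// not the exact exporter.go code.
func rotate(queue []*view.Data, interval time.Duration) []*view.Data {
	currentTime := time.Now()
	kept := queue[:0]
	for _, vd := range queue {
		// vd.Start is the collector start time for cumulative views, so once
		// the collector has been running longer than `interval` this condition
		// is false for every element and the queue is drained on every rotation.
		if vd.Start.Add(interval).After(currentTime) {
			kept = append(kept, vd)
		}
	}
	return kept
}
```

Because vd.Start never moves, the condition stops holding for every queued element once the collector's uptime exceeds the interval, so the queue never accumulates enough failures to trip the threshold.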

Solutions that I can imagine:

  • compare vd.End instead of vd.Start with current time here

@TylerHelmuth
Member

pinging @jpkrohling as code owner

@TylerHelmuth TylerHelmuth added extension/healthcheck Health Check Extension priority:p2 Medium labels Jul 1, 2022
@jpkrohling jpkrohling self-assigned this Jul 5, 2022
@ItsLastDay
Contributor Author

ItsLastDay commented Jul 6, 2022

I dug deeper into this, and here are the issues I've found:

  • an exporter can only be health-checked if it uses obsreport. For example, this can be achieved by using exporterhelper like this. This is because obsreport defines the exporter/send_failed_requests view (later registered in service/telemetry.go), which the health check relies on. This is not documented anywhere;
  • exporter/send_failed_requests counts the number of failed spans: code. The exporter that I want to health-check exports metrics, so exporter/send_failed_requests will be 0 for my exporter, regardless of the number of errors it gets while exporting metrics. By the way, this View also does not use tagKeys, which loses the information about which exporter had an error;
  • according to OpenCensus code, View aggregations are always cumulative: reference. Hence view.Start, which the healthcheckextension checks, will always be the start time of the collector.

I'm implementing health checks in my own code for my exporter, and here is what I plan to try:

  • create another View for send_failed_metric_points, with aggregation=LastValue
  • subscribe my health checker to this view
  • do the same loop as healthcheckextension does, but check vd.End instead of vd.Start when pushing elements out of the queue
  • when adding a View to the queue, check whether the LastValue is non-zero (i.e. there were errors in the last period). Also check the name of the exporter (this lets me do health checks for individual exporters, instead of monitoring the whole pipeline like the current healthcheckextension does)

I'll implement this and then comment on this issue whether it was successful.
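Roughly, that plan could look like the following sketch, assuming OpenCensus stats/view; the measure and view names here are placeholders mirroring obsreport's send_failed_metric_points, not the collector's actual internal symbols:

```go
package healthchecksketch

import (
	"sync"
	"time"

	"go.opencensus.io/stats"
	"go.opencensus.io/stats/view"
	"go.opencensus.io/tag"
)

// Hypothetical measure and tag key mirroring obsreport's
// exporter/send_failed_metric_points; the real ones live inside the collector.
var (
	failedPoints = stats.Int64("exporter/send_failed_metric_points", "number of metric points that failed to be sent", stats.UnitDimensionless)
	exporterKey  = tag.MustNewKey("exporter")

	failedPointsLastValue = &view.View{
		Name:        "exporter/send_failed_metric_points_last",
		Description: "last observed count of failed metric points, per exporter",
		Measure:     failedPoints,
		Aggregation: view.LastValue(),
		TagKeys:     []tag.Key{exporterKey}, // keep the exporter name so checks can be per-exporter
	}
)

// healthChecker implements view.Exporter and keeps recent failures in a queue.
type healthChecker struct {
	mu       sync.Mutex
	interval time.Duration
	queue    []*view.Data
}

// ExportView receives view data; it enqueues data points whose LastValue is
// non-zero and expires entries based on vd.End instead of vd.Start.
func (h *healthChecker) ExportView(vd *view.Data) {
	if vd.View.Name != failedPointsLastValue.Name {
		return
	}
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, row := range vd.Rows {
		if lv, ok := row.Data.(*view.LastValueData); ok && lv.Value > 0 {
			// row.Tags carries the exporter name, so a failure could be
			// attributed to an individual exporter here if needed.
			h.queue = append(h.queue, vd)
			break
		}
	}
	cutoff := time.Now().Add(-h.interval)
	kept := h.queue[:0]
	for _, item := range h.queue {
		if item.End.After(cutoff) { // End, not Start: the time of the last recording
			kept = append(kept, item)
		}
	}
	h.queue = kept
}

// Healthy reports whether the number of recent failures is below the threshold.
func (h *healthChecker) Healthy(threshold int) bool {
	h.mu.Lock()
	defer h.mu.Unlock()
	return len(h.queue) < threshold
}

// Setup registers the view and subscribes the checker to view data.
func Setup(interval time.Duration) (*healthChecker, error) {
	hc := &healthChecker{interval: interval}
	if err := view.Register(failedPointsLastValue); err != nil {
		return nil, err
	}
	view.RegisterExporter(hc)
	return hc, nil
}
```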

@ItsLastDay
Contributor Author

ItsLastDay commented Jul 7, 2022

Here is the code which works for my use case: exporter.go.

@jpkrohling
Member

Thanks for this investigation and sorry for taking so long to take a look at this.

This is not documented anywhere;

This is likely because we still haven't made our internal observability ready for wider consumption by other components. But I think this specific characteristic should indeed be documented somewhere. Where do you think you would have found it more naturally?

create another View for send_failed_metric_points, with aggregation=LastValue

I think they could/should share the same metric. If it's failing for metrics or traces, I would want to see that as part of the same metric as an admin. Other labels perhaps, but the same metric.

compare vd.End instead of vd.Start with current time here

@skyduo would be the best person to comment on the choice of methods here, but this is used to invalidate old data, right? Wouldn't it be more suitable to calculate the failure rate instead? Perhaps by storing the outcome of the last events and setting the health check based on their success rate. Both parameters (number of last events to store, success rate) could have reasonable defaults and be configurable.
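For the sake of discussion, a hypothetical sketch of that idea (not an existing collector API) could be a ring buffer of the last N export outcomes:

```go
package ratecheck

// Hypothetical sketch of the failure-rate idea above, not an existing
// collector API: remember the outcome of the last N export attempts and
// report unhealthy when the success rate drops below a threshold.
type pipelineHealth struct {
	outcomes []bool  // ring buffer: true = export attempt succeeded
	next     int     // next slot to overwrite
	filled   bool    // true once the buffer has wrapped around
	minRate  float64 // e.g. 0.9; both N and this rate would be configurable
}

func newPipelineHealth(lastN int, minRate float64) *pipelineHealth {
	return &pipelineHealth{outcomes: make([]bool, lastN), minRate: minRate}
}

// record stores the outcome of one export attempt.
func (p *pipelineHealth) record(success bool) {
	p.outcomes[p.next] = success
	p.next = (p.next + 1) % len(p.outcomes)
	if p.next == 0 {
		p.filled = true
	}
}

// healthy computes the success rate over the stored attempts.
func (p *pipelineHealth) healthy() bool {
	n := p.next
	if p.filled {
		n = len(p.outcomes)
	}
	if n == 0 {
		return true // no data yet: report healthy
	}
	ok := 0
	for i := 0; i < n; i++ {
		if p.outcomes[i] {
			ok++
		}
	}
	return float64(ok)/float64(n) >= p.minRate
}
```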

@jpkrohling
Member

In any case, if just switching from Start to End would provide a good solution for the short term, I'm all for it.

@ItsLastDay
Contributor Author

Where do you think you would have found it more naturally?

https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/extension/healthcheckextension/README.md would be good. Something like "It only supports monitoring exporter failures, and only for exporters which use exporterhelper".

I think they could/should share the same metric. If it's failing for metrics or traces, I would want to see that as part of the same metric as an admin. Other labels perhaps, but the same metric.

Yeah, I agree that would make more sense. I'll leave it up to you and/or obsreport code owners to decide on that.

but this is used to invalidate old data, right?

That's true. However, vd.Start is a fixed value that does not change during the collector's lifetime. Hence, all the data is invalidated in the current implementation.

@jpkrohling
Member

I'll leave it up to you and/or obsreport code owners to decide on that.

@open-telemetry/collector-approvers , what do you think about this? Should all signals share a single metric to report failure to send data? We can still have one label per signal, but keeping the same metric would make more sense to me.

@github-actions
Contributor

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label Nov 10, 2022
@jpkrohling jpkrohling added never stale Issues marked with this label will be never staled and automatically removed and removed Stale labels Nov 24, 2022
@haoqixu
Member

haoqixu commented Jan 12, 2023

The current implementation of healthcheckextension depends on the internal observability of exporters to check the health status.
Maybe it is better to use the component status reporting mechanism introduced by open-telemetry/opentelemetry-collector#6560. 🤔
