Separate regex validation failures from ssl_probe_success metric #162

Open
dragoangel opened this issue Feb 23, 2024 · 7 comments
Comments

@dragoangel
Contributor

In my view, up == 0 covers failed protocol regex checks better than ssl_probe_success does, because a failed ssl_probe_success should indicate a TLS problem rather than host unavailability or instability.
Ideally there should be a dedicated metric for protocol-specific regex checks that reports such failures in combination with up == 0, so that existing users' alerts still cover these issues as before, while newly configured alerts can separate a generally unavailable service from one that fails protocol-specific checks.

@dragoangel
Contributor Author

dragoangel commented Feb 23, 2024

Another option is to add a label with the reason for the failure, instead of creating other metrics :)

It would be cool if all the data about check failures that is now written to the logs were also available via a metric label, so users could write an alert description that indicates the exact reason for the failure. For example:

read tcp x:x->x:x: i/o timeout
dial tcp x:x: operation was canceled
regex: ^x didn't match: ...
dial tcp x:25: connect: connection refused
...
etc

@ribbybibby
Owner

The up metric records the success/failure of requests from Prometheus -> the exporter. I don't think it would be right for us to return a non-2xx response if the exporter is fine but the issue is with the upstream.

You make a good point about non-TLS related failures though. Perhaps we should ignore errors from the upstream as long as we can successfully establish a TLS connection and extract certificates?
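As a minimal sketch of how that proposal could be read: the probe counts as successful once the TLS handshake completes and at least one certificate is extracted, while a protocol-level check failure is tracked separately. The `ProbeOutcome` type and both function names are hypothetical and not part of the exporter; Python is used here only for brevity.

```python
from dataclasses import dataclass


@dataclass
class ProbeOutcome:
    """Hypothetical record of what a single probe observed."""
    tls_established: bool  # TLS handshake completed
    cert_count: int        # number of peer certificates extracted
    proto_check_ok: bool   # protocol-specific regex checks passed


def probe_success(outcome: ProbeOutcome) -> int:
    """1 if TLS worked and certificates were read, whatever the protocol checks say."""
    return int(outcome.tls_established and outcome.cert_count > 0)


def proto_check_success(outcome: ProbeOutcome) -> int:
    """1 if the protocol-specific checks passed (assumed to be a separate signal)."""
    return int(outcome.proto_check_ok)
```

Under this reading, a host whose certificates were read but whose upstream then misbehaved would still count as a successful probe, and the protocol failure would show up only in the second value.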

@dragoangel
Contributor Author

My point is that I want to see that a host is down or timing out separately from SSL failures. Each case could be its own explicit metric, but that requires a lot of alerts; alternatively, the reason could be recorded as a label, so that a single alert can cover all issues and report the exact reason for the failure.

@ribbybibby
Owner

ribbybibby commented Mar 21, 2024

What are some examples of SSL related errors? TLS verification failing? Is there anything else that can happen?

I suppose bugs in our regex matching for starttls? Or servers that do things in a way we haven't accounted for?

@ribbybibby
Owner

Putting raw error log strings into metrics strikes me as the wrong approach. Metrics are not designed for that kind of information.

We could have some coarser labels like 'starttls' I guess? What would you actually use this delineation for though? How would you treat a host that is timing out vs a host that is failing the starttls handshake differently?
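The coarser-label idea could look roughly like the sketch below: raw error messages are mapped onto a small, fixed set of reasons so that label cardinality stays bounded. The reason names and the `classify` function are assumptions made for illustration, and the patterns are based only on the example messages quoted earlier in this thread. The sketch is in Python for brevity and is not the exporter's code.

```python
import re

# Hypothetical coarse reasons. The set is fixed, so a label built from it
# can only ever take these values plus "other".
_RULES = [
    ("timeout", re.compile(r"i/o timeout")),
    ("canceled", re.compile(r"operation was canceled")),
    ("connection_refused", re.compile(r"connection refused")),
    ("starttls", re.compile(r"^regex: .* didn't match")),
]


def classify(message: str) -> str:
    """Map a raw error message to a coarse reason; unknown messages become 'other'."""
    for reason, pattern in _RULES:
        if pattern.search(message):
            return reason
    return "other"
```

Anything the rules do not recognise falls into a single catch-all bucket, which keeps arbitrary log text out of the label values.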

@dragoangel
Contributor Author

Yes, I would definitely treat a host with failed SSL differently from a host that is down, because a down host could mean the server is completely off. An alert that gives the exact reason for what is going on is always better than an alert that could fire for several different reasons, because then you need to check all of them. And in a historical view you would know what happened, without going and reading the logs of a remote ssl exporter that is set up somewhere far away :)

@dragoangel
Contributor Author

> What are some examples of SSL related errors? TLS verification failing? Is there anything else that can happen?
>
> I suppose bugs in our regex matching for starttls? Or servers that do things in a way we haven't accounted for?

For example, the regex can fail to match in SMTP when the server is totally overloaded and does not return any data; it just opens the connection. I have seen this a couple of times.
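To illustrate the distinction, here is a minimal sketch that separates "the server sent nothing" from "the server sent something that did not match". The `check_greeting` function and its return values are hypothetical, and it is written in Python only for brevity; it does not show how the exporter handles this case.

```python
import re


def check_greeting(data: bytes, pattern: str) -> str:
    """Return 'ok', 'no_data', or 'no_match' for a server greeting."""
    if not data:
        # The connection opened, but nothing was read from the server.
        return "no_data"
    if re.search(pattern.encode(), data):
        return "ok"
    return "no_match"
```

With a split like this, an overloaded server that opens the connection and stays silent would be reported differently from a server that answers with an unexpected greeting.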


2 participants