Add kube_pod_container_status_last_terminated_reason #535

Merged

Conversation

@jutley (Contributor) commented Sep 4, 2018

What this PR does / why we need it:
This PR introduces a new metric: kube_pod_container_status_last_terminated_reason.

Currently, we have the kube_pod_container_status_terminated_reason metric, but it always returns to 0 once a container starts back up. This means we will only have a couple of data points, if any at all, around the reason for a container termination. As a result, we cannot alert when a container crashes for a specific reason (we'd like to alert based on OOMs).

This is brought up in this issue: #344

Which issue(s) this PR fixes:
Fixes #344
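
For reference, a minimal sketch of the kind of alert this metric is meant to enable (the reason value OOMKilled and the standard kube-state-metrics labels are assumed) might look like:

# Fire while a container's most recent termination was an OOM kill.
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1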

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA) and size/M (Denotes a PR that changes 30-99 lines, ignoring generated files) labels Sep 4, 2018
@jutley (Contributor, Author) commented Sep 4, 2018

/assign @andyxning

@andyxning (Member) commented:

/lgtm
/approve

@k8s-ci-robot added the lgtm ("Looks good to me", indicates that a PR is ready to be merged) label Sep 5, 2018
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andyxning, jutley

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved (Indicates a PR has been approved by an approver from all required OWNERS files) label Sep 5, 2018
@k8s-ci-robot merged commit 5119063 into kubernetes:master Sep 5, 2018
@mxinden (Contributor) commented Sep 5, 2018

Thanks for following up with this @jutley! 👍

@olvesh commented Oct 26, 2018

I can see that #535 was merged, but there has been no release since August. Would it be OK to cut a release of kube-state-metrics soon?

@brancz (Member) commented Oct 26, 2018

We are going to start cutting pre-releases of the new release soon.

@andredantasrocha commented:

Hi @brancz, any updates on the new release? This new metric will be very handy for us!

Thanks,
Andre

@mxinden (Contributor) commented Nov 19, 2018 via email

@aiman-alsari commented:

I've used the alpha version of 1.5.0 to try this out. Whilst it is useful for alerting on a single occurrence of a particular reason, e.g. OOMKilled, it makes it very difficult to do anything more complex like "only alert if there have been 2 OOM kills in the space of an hour". Sometimes a single kill is a valid scenario, so it isn't worth alerting on. To be able to do this cleanly, there needs to be a restart counter per reason rather than the simple flag there is now.

It is possible to do more complex queries by multiplying the difference in restart count by the last-reason flag, but it makes the PromQL a lot trickier in general.

Just my 2c
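
As an editor's aside, a rough sketch of the multiplication approach mentioned above (label handling borrowed from the later comments in this thread; the OOMKilled reason and the threshold of 2 are illustrative) could look like:

# Scale the 1h restart increase by the last-terminated-reason flag, so only containers
# whose last termination was an OOM kill remain, then require at least 2 restarts.
(kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 1h)
  * ignoring(reason)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} >= 2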

@jutley (Contributor, Author) commented Dec 12, 2018

@aiman-alsari I agree that having counts for each error type could be useful. Unfortunately, the Kubernetes API does not provide that type of information. All it offers is 1) the current state, 2) the last state, and 3) the restart count. Since this project creates Prometheus metrics that reflect data from the Kubernetes API, I don't think this suggestion fits within the project scope.

If you'd like, you can still file an issue. I don't think posting on this PR will get much more attention since it has already been merged and is nearly released.

As an alternative, you can create an alert that looks like:

kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 1h >= 2
AND
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

This will fire if there have been at least 2 restarts in the last hour, and the last_terminated_reason is OOMKilled. It's not perfect, but will probably get you what you want in most cases. If there are other regular causes for the container to terminate, you can change the latter expression to:

min_over_time(kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}[1h]) == 1

This will cause the alert to only fire if the last terminated reason for the last hour is OOMKilled. It will take a minimum of an hour to fire, so in most cases I would probably prefer the first example.

@gjtempleton (Member) commented:

@jutley Am I right in thinking that, to get the points to match up neatly between the kube_pod_container_status_restarts_total and kube_pod_container_status_last_terminated_reason metrics so that your first example works, the reason label will need to be stripped out, e.g. with the without clause?
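
For reference, a minimal sketch of one way to make the two vectors match, using ignoring(reason) as the later comments in this thread confirm (the reason value and threshold are illustrative), might be:

# Drop the extra reason label during matching so the remaining labels line up.
kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 1h >= 2
and ignoring(reason)
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1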

@EvilCreamsicle commented:

I'm able to get almost what I need with this metric, but does anyone know if there is a metric that exposes the actual container exit code?
I'm trying to filter on a specific segfault error, so I need to know whether the error was exit code 139.

@brancz (Member) commented Apr 1, 2019

@EvilCreamsicle it is unlikely that kube-state-metrics will expose that fine-grained a detail, as it would cause too much churn of time-series data.

@onrylmz commented Sep 7, 2020

(quoting @jutley's alert suggestion above)

Only one remark: the two expressions

  • kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 1h >= 2

  • kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1

return different label sets, so Prometheus cannot match the two vectors produced by those expressions. I solved the issue with the ignoring keyword.

kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 1h >= 1 AND ignoring(reason) kube_pod_container_status_last_terminated_reason{reason='ContainerCannotRun'} > 0

Here is the reference link: https://prometheus.io/docs/prometheus/latest/querying/operators/#one-to-one-vector-matches

@aszubarev commented Feb 24, 2021

(quoting @jutley's alert suggestion and @onrylmz's remark above)

I think this is better:

(kube_pod_container_status_restarts_total - kube_pod_container_status_restarts_total offset 1h >= 1) * ignoring(reason) kube_pod_container_status_last_terminated_reason{reason='ContainerCannotRun'} > 0

I found experimentally that when using ignoring with the and operator, the second vector's reason filter appears to be ignored: the result ends up greater than 0 even when it should not be.

@leoskyrocker commented Jun 2, 2022

@Zubant The second vector is not ignoring the reason filter in my case (using the AND operator as opposed to the multiplication * operator)

Successfully merging this pull request may close these issues: Duration of kube_pod_container_status_terminated_reason metrics