This repository has been archived by the owner on Jun 7, 2023. It is now read-only.

Perform RCA immediately after anomaly detection #165

Closed
saswat0 opened this issue Sep 13, 2021 · 11 comments
Labels
lifecycle/rotten, proposal

Comments

@saswat0

saswat0 commented Sep 13, 2021

Currently, the model monitors application/service-level metrics and flags anomalous behaviour. Issue resolution is still left up to the engineers, and the time spent debugging can increase the outage duration.

Should we add root cause analysis (RCA) of the anomaly on the fly, as soon as it is detected? This would probably help pinpoint the cause and allow faster repairs.
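For illustration, the flow could look roughly like the sketch below; `fetch_metrics`, `detect_anomalies`, and `run_rca` are hypothetical placeholders, not functions from this repository.

```python
# Rough sketch of the proposed flow; the callables passed in are
# hypothetical placeholders, not this project's actual API.
def monitor_and_diagnose(fetch_metrics, detect_anomalies, run_rca):
    """Flag anomalies and immediately kick off RCA instead of leaving it to the on-call engineer."""
    metrics = fetch_metrics()
    for anomaly in detect_anomalies(metrics):
        # Today the pipeline stops after flagging the anomaly;
        # the proposal adds this diagnosis step.
        report = run_rca(anomaly, metrics)
        print(f"Anomaly on {anomaly['metric']}: suspected root cause -> {report}")
```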

@4n4nd Suggestions?

@4n4nd 4n4nd added the proposal label Sep 13, 2021
@4n4nd
Contributor

4n4nd commented Sep 13, 2021

@saswat0, could you please give us more information on how the implementation of this would work?

cc @chauhankaranraj @hemajv

@saswat0
Author

saswat0 commented Sep 13, 2021

An anomaly is usually triggered when the application, one of its downstream services (dependent apps), or the runtime environment crashes.
In the event of an anomaly, we could check for inconsistencies in the metrics of the affected application, all of its downstream services, and the deployed cluster/node, and figure out which of them was responsible for the outage.

Say, for example, we have a microservice A that handles date-time conversion. Because this is an inherent part of how an application functions, it is used by services B and C, which call A's endpoint frequently. Now, if B suddenly has to handle a huge spike of requests and brings down A in the process, then A, B, and C are all affected. In such cases, figuring out the point of origin becomes difficult, and the proposed implementation should aim at mitigating this delay. Once there is a matching pattern between two faulty services, we can figure out which one is affected by which and resolve it.
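As a minimal sketch of that matching step, assuming each service already exposes a per-timestamp anomaly score from the existing detector (the data layout and function below are assumptions for illustration, not part of the repo), we could rank candidate root causes by which service turned anomalous first:

```python
# Minimal sketch: rank candidate root causes by comparing anomaly onset
# times across related services. Assumes each service has a pandas Series
# of anomaly scores indexed by timestamp (an assumption for illustration).
import pandas as pd

def rank_root_cause_candidates(anomaly_scores: dict[str, pd.Series],
                               threshold: float = 0.9) -> list[tuple[str, pd.Timestamp]]:
    """Return services ordered by when their score first crossed the threshold.

    The earliest onset among related services is a plausible (not certain)
    point of origin for the outage.
    """
    onsets = []
    for service, scores in anomaly_scores.items():
        breached = scores[scores >= threshold]
        if not breached.empty:
            onsets.append((service, breached.index[0]))
    # Earliest breach first: the service that turned anomalous first.
    return sorted(onsets, key=lambda item: item[1])

# Synthetic data for the A/B/C scenario above: B's spike hits first,
# A degrades next, C fails last.
if __name__ == "__main__":
    idx = pd.date_range("2021-09-13 10:00", periods=5, freq="1min")
    scores = {
        "B": pd.Series([0.2, 0.95, 0.97, 0.98, 0.99], index=idx),
        "A": pd.Series([0.1, 0.30, 0.92, 0.95, 0.96], index=idx),
        "C": pd.Series([0.1, 0.20, 0.40, 0.91, 0.94], index=idx),
    }
    print(rank_root_cause_candidates(scores))  # B, then A, then C
```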

Hope it was clear 😓

@sesheta

sesheta commented Dec 12, 2021

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale label Dec 12, 2021
@sesheta

sesheta commented Jan 11, 2022

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added the lifecycle/rotten label and removed the lifecycle/stale label Jan 11, 2022
@saswat0
Author

saswat0 commented Jan 19, 2022

/remove-lifecycle rotten

@sesheta sesheta removed the lifecycle/rotten label Jan 19, 2022
@sesheta

sesheta commented Apr 19, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale label Apr 19, 2022
@saswat0
Author

saswat0 commented May 4, 2022

/remove-lifecycle stale

@sesheta sesheta removed the lifecycle/stale label May 4, 2022
@sesheta

sesheta commented Aug 2, 2022

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@sesheta sesheta added the lifecycle/stale label Aug 2, 2022
@sesheta

sesheta commented Sep 1, 2022

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

/lifecycle rotten

@sesheta sesheta added the lifecycle/rotten label and removed the lifecycle/stale label Sep 1, 2022
@sesheta

sesheta commented Oct 1, 2022

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

@sesheta sesheta closed this as completed Oct 1, 2022
@sesheta

sesheta commented Oct 1, 2022

@sesheta: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
