Perform RCA immediately after anomaly detection #165
Comments
@saswat0 could you please give us more information on how the implementation for this would work?
An anomaly is usually triggered when the application, one of its downstream services (dependent apps), or the runtime environment crashes. For example, say we have a microservice A that handles date-time conversion. Because it is essential for the application to function, it is used by services B and C, which call A's endpoint frequently. Now if B suddenly has to handle a huge spike of requests and brings down A in the process, all of A, B and C would be affected. In such cases, figuring out the point of origin becomes difficult, and the proposed implementation should aim to reduce this delay. Once there's an exact match/pattern between two faulty services, we can easily figure out which one is affecting the other and resolve it. Hope it was clear 😓
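To make the A/B/C scenario above concrete, here is a minimal sketch of one way such a match could be computed. It is not the project's implementation: the function name `group_and_rank`, the `onsets` and `calls` inputs, and the heuristic itself (group anomalous services that are linked by a call, then order each group by when its anomaly was first flagged) are all assumptions made for illustration.

```python
from collections import defaultdict


def group_and_rank(onsets, calls):
    """Group related anomalous services and order each group by onset time.

    onsets: dict mapping service name -> time its anomaly was first flagged
            (hypothetical input; any comparable timestamp works).
    calls:  dict mapping caller -> set of services it calls
            (hypothetical dependency map, not something the detector provides).
    Returns a list of groups; the first service in each group is the
    earliest-flagged one, i.e. a candidate point of origin, not a proof.
    """
    # Link two services only if both are anomalous and one calls the other.
    neighbours = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            if caller in onsets and callee in onsets:
                neighbours[caller].add(callee)
                neighbours[callee].add(caller)

    seen = set()
    groups = []
    # Visit services from earliest to latest onset so groups come out ordered.
    for service in sorted(onsets, key=onsets.get):
        if service in seen:
            continue
        seen.add(service)
        stack = [service]
        group = []
        # Depth-first walk over the linked anomalous services.
        while stack:
            current = stack.pop()
            group.append(current)
            for nxt in neighbours[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        groups.append(sorted(group, key=onsets.get))
    return groups


# B and C both call A; B's spike is flagged first, then A, then C.
onsets = {"A": 105.0, "B": 100.0, "C": 110.0}
calls = {"B": {"A"}, "C": {"A"}}
print(group_and_rank(onsets, calls))
```

With these made-up timestamps, B comes first in its group, which matches the story where B's request spike takes A down and C suffers as a side effect. Onset order is only a hint: a real RCA step would need more evidence than timing alone.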
@sesheta: Closing this issue.
Currently, the model monitors application/service-level metrics and flags anomalous behaviour. Issue resolution is still left to the engineers, and the time taken for debugging might increase the outage duration.
Should we run RCA (root cause analysis) on the anomaly on the fly, as soon as it is detected? This would probably help pinpoint the cause and allow faster repairs.
@4n4nd Suggestions?