US20200084086A1 - Management of computing system alerts - Google Patents
Management of computing system alerts
- Publication number
- US20200084086A1 (application US16/574,999; US201916574999A)
- Authority
- US
- United States
- Prior art keywords
- alert
- relationship
- alerts
- type
- determining
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0631—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
- H04L41/065—Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/0604—Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time
- H04L41/0609—Management of faults, events, alarms or notifications using filtering, e.g. reduction of information by using priority, element types, position or time based on severity or priority
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
-
- H04L67/42—
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/06—Management of faults, events, alarms or notifications
- H04L41/069—Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/22—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks comprising specially adapted graphical user interfaces [GUI]
Definitions
- the present disclosure relates to systems and methods for processing alerts indicative of conditions of a computing system. More generally, the present disclosure relates to a data processing system for error or fault handling, namely, by aggregating data used for and generated in response to performing impact calculation on resources of a computing system. Implementations of the present disclosure can be used to enhance the ability of a server programmed for organizing and manipulating data for responding to planned or unplanned conditions identified with respect to hardware or software resources included within the computing system.
- a computing system, such as a cloud computing system, providing software based services to a customer, uses a network of one or more servers configured to execute various programs for delivering services to client computing devices.
- a system operator may receive hundreds of alerts daily, such as notifications that a hardware or software component in the cloud computing system requires a change or modification.
- to address the alerts in a timely and efficient manner, alerts can be triaged for dispositioning. Appropriate assignments need to be identified for dispositioning each alert, with consideration to both technical and security qualifications.
- Monitoring tools allow detection of health and status of network resources, and enable a variety of network maintenance functions.
- One implementation of the disclosure is an apparatus for grouping alerts generated by automated monitoring of an operating condition of a machine in a computing system, the machine represented as a configuration item in a configuration management database, the apparatus comprising a processor configured to execute instructions stored in a memory, the instructions including an avalanche pattern detection module, a conditional probability pattern detection module, an alert grouping module, and a presentation module.
- the avalanche pattern detection module may receive historical alert data, and identify a first event pattern of alert information based on at least one avalanche of alerts identified from the historical alert data.
- the historical alert data includes a time stamp, a configuration item identifier, and an alert metric associated with each alert stored in the memory prior to the alert history time marker.
- the first event pattern of alert information is stored in the memory.
- the conditional probability pattern detection module may receive the historical alert data and at least one conditional probability parameter, and identify a second event pattern of alert information based on co-occurrences of configuration item pairs in the historical alert data and on the at least one conditional probability parameter.
- the second event pattern is stored in the memory.
- the alert grouping module may determine at least one alert group by comparing at least one configuration item associated with a current alert to the plurality of configuration items of the first event pattern and of the second event pattern stored in the memory.
- the presentation module may generate a graphical display region for displaying the at least one alert group.
- Another implementation of the disclosure is an apparatus for grouping alerts generated by automated monitoring of at least an operating condition of a machine in a computing system, the machine represented as a configuration item in a configuration management database, the apparatus comprising a processor configured to execute instructions stored in a memory, the instructions including a pattern detection module, an alert grouping module, and a presentation module.
- the pattern detection module may identify an event pattern from historical alert data associated with a plurality of configuration items. The event pattern is based on an intersection of configuration items identified in an avalanche of alerts with configuration items identified in the historical alerts.
- the pattern detection module may identify the avalanche of alerts based on at least one avalanche parameter, and store the event pattern in the memory.
- the alert grouping module may determine at least one alert group by comparing at least one configuration item associated with a current alert to the plurality of configuration items of the pattern stored in the memory.
- the presentation module may generate a graphical display region for displaying the alert group.
- FIG. 1 is a block diagram of a computing network in which the teachings herein may be implemented.
- FIG. 2 is a block diagram of an example internal configuration of a computing device, such as a computing device of the computing network as shown in FIG. 1 .
- FIG. 3 is a block diagram of an example modular configuration of a computing device, such as the computing device as shown in FIG. 2 , in accordance with the present disclosure.
- FIG. 4 is a block diagram of an example alert table in accordance with the present disclosure.
- FIG. 5 is a diagram of an example time window size determination and avalanche window determination based on an alert history in accordance with the present disclosure.
- FIGS. 6A-6C are diagrams of an example conditional probabilistic graphing sequence in accordance with the present disclosure.
- FIG. 7 is a diagram of an example display region generated for displaying alert groups in accordance with the present disclosure.
- FIG. 8 is a diagram of an example display region generated for enabling user feedback and supervision of alert grouping in accordance with the present disclosure.
- FIG. 9 is a flow chart of an example method of aggregating alerts for management of computer system alerts in accordance with the present disclosure.
- a distributed computing system, such as a cloud computing system, may include multiple computing devices at a customer end and multiple computer servers at a service provider end, which may be interconnected by a cloud network.
- as customer devices request services and use resources provided by the server devices, the flow of information must be controlled and monitored to maintain quality of service.
- at times of higher demand for services and resources, nodes, such as servers, along the interconnected network may encounter overload conditions and traffic may need to be rerouted to other available nodes that are currently not overloaded.
- Alerts may be triggered upon detection of conditions or events that relate to nodes being overloaded. Other examples of alerts that may be triggered may include when a customer or server device is down due to a hardware or software error or failure.
- a cluster or “avalanche” of alerts may be triggered within a short time period. Over time, as avalanches of alerts are detected, patterns of affected nodes may emerge, which can be stored and used as a template during real time monitoring of alerts. As a current alert is detected, it may be matched to learned patterns for aggregating the alert into one or more alert groups to more efficiently manage and dispatch the alert. Conditional probability patterns may also be developed based on stored alert information, which may further refine the learned patterns for the alert grouping. By aggregating alerts, system operators may more efficiently triage alerts for disposition.
- FIG. 1 is a block diagram of a distributed (e.g., client-server, networked, or cloud) computing system 100 .
- Cloud computing system 100 can have any number of customers, including customer 110 .
- Each customer 110 may have clients, such as clients 112 .
- Each of clients 112 can be in the form of a computing system comprising multiple computing devices, or in the form of a single computing device, for example, a mobile phone, a tablet computer, a laptop computer, a notebook computer, a desktop computer, and the like.
- Customer 110 and clients 112 are examples only, and a cloud computing system may have a different number of customers or clients or may have a different configuration of customers or clients. For example, there may be hundreds or thousands of customers and each customer may have any number of clients.
- Cloud computing system 100 can include any number of datacenters, including datacenter 120 .
- Each datacenter 120 may have servers, such as servers 122 .
- Each datacenter 120 may represent a facility in a different geographic location where servers are located.
- Each of servers 122 can be in the form of a computing system including multiple computing devices, or in the form of a single computing device, for example, a desktop computer, a server computer and the like.
- the datacenter 120 and servers 122 are examples only, and a cloud computing system may have a different number of datacenters and servers or may have a different configuration of datacenters and servers. For example, there may be tens of data centers and each data center may have hundreds or any number of servers.
- Clients 112 and servers 122 may be configured to connect to network 130 .
- the clients for a particular customer may connect to network 130 via a common connection point 116 or different connection points, e.g., a wireless connection point 118 and a wired connection point 119 . Any combination of common or different connections points may be present, and any combination of wired and wireless connection points may be present as well.
- Network 130 can be, for example, the Internet.
- Network 130 can also be or include a local area network (LAN), wide area network (WAN), virtual private network (VPN), or any other means of transferring data between any of clients 112 and servers 122 .
- Network 130 , datacenter 120 and/or blocks not shown may include network hardware such as routers, switches, load balancers and/or other network devices.
- in cloud computing system 100 , devices other than the clients and servers shown may be included in system 100 .
- one or more additional servers may operate as a cloud infrastructure control, from which servers and/or clients of the cloud infrastructure are monitored, controlled and/or configured.
- some or all of the techniques described herein may operate on said cloud infrastructure control servers.
- some or all of the techniques described herein may operate on servers such as servers 122 .
- FIG. 2 is a block diagram of an example internal configuration of a computing device 200 , such as a client 112 or server device 122 of the computing system 100 as shown in FIG. 1 , including an infrastructure control server.
- clients 112 or servers 122 may take the form of a computing system including multiple computing units, or in the form of a single computing unit, for example, a mobile phone, a tablet computer, a laptop computer, a notebook computer, a desktop computer, a server computer and the like.
- the computing device 200 can comprise a number of components, as illustrated in FIG. 2 .
- CPU (or processor) 202 can be a central processing unit, such as a microprocessor, and can include single or multiple processors, each having single or multiple processing cores.
- CPU 202 can include another type of device, or multiple devices, capable of manipulating or processing information now-existing or hereafter developed. When multiple processing devices are present, they may be interconnected in any manner, including hardwired or networked, including wirelessly networked. Thus, the operations of CPU 202 can be distributed across multiple machines that can be coupled directly or across a local area or other network.
- the CPU 202 can be a general purpose processor or a special purpose processor.
- Memory 204 can be any suitable non-permanent storage device that is used as memory.
- RAM 204 can include executable instructions and data for immediate access by CPU 202 .
- RAM 204 typically includes one or more DRAM modules such as DDR SDRAM.
- RAM 204 can include another type of device, or multiple devices, capable of storing data for processing by CPU 202 now-existing or hereafter developed.
- CPU 202 can access and manipulate data in RAM 204 via bus 212 .
- the CPU 202 may utilize a cache 220 as a form of localized fast memory for operating on data and instructions.
- Storage 206 can be in the form of read only memory (ROM), a disk drive, a solid state drive, flash memory, Phase-Change Memory (PCM), or any form of non-volatile memory designed to maintain data for some duration of time, and preferably in the event of a power loss.
- Storage 206 can comprise executable instructions 206 A and application files/data 206 B along with other data.
- the executable instructions 206 A can include, for example, an operating system and one or more application programs for loading in whole or part into RAM 204 (with RAM-based executable instructions 204 A and application files/data 204 B) and to be executed by CPU 202 .
- the executable instructions 206 A may be organized into programmable modules or algorithms, functional programs, codes, and code segments designed to perform various functions described herein.
- the operating system can be, for example, Microsoft Windows, Mac OS X®, or Linux®, or other operating system, or it can be an operating system for a small device, such as a smart phone or tablet device, or a large device, such as a mainframe computer.
- the application program can include, for example, a web browser, web server and/or database server.
- Application files 206 B can, for example, include user files, database catalogs and configuration information.
- storage 206 comprises instructions to perform the discovery techniques described herein.
- Storage 206 may comprise one or multiple devices and may utilize one or more types of storage, such as solid state or magnetic.
- the computing device 200 can also include one or more input/output devices, such as a network communication unit 208 and interface 230 that may have a wired communication component or a wireless communications component 290 , which can be coupled to CPU 202 via bus 212 .
- the network communication unit 208 can utilize any of a variety of standardized network protocols, such as Ethernet, TCP/IP, to name a few of many protocols, to effect communications between devices.
- the interface 230 can include one or more transceiver(s) that utilize the Ethernet, power line communication (PLC), WiFi, infrared, GPRS/GSM, CDMA, etc.
- a user interface can be broken down into the hardware user interface portion and the software user interface portion.
- a hardware user interface 210 can include a display, positional input device (such as a mouse, touchpad, touchscreen, or the like), keyboard, or other forms of user input and output devices and hardware.
- the hardware user interface 210 can be coupled to the processor 202 via the bus 212 .
- Other output devices that permit a user to program or otherwise use the client or server can be provided in addition to or as an alternative to display 210 .
- where the output device is or comprises a hardware display, this display can be implemented in various ways, including by a liquid crystal display (LCD), a cathode-ray tube (CRT), or a light emitting diode (LED) display, such as an OLED display.
- the software graphical user interface constitutes programs and data that reflect information ultimately destined for display on a hardware device.
- the data can contain rendering instructions for bounded graphical display regions, such as windows, or pixel information representative of controls, such as buttons and drop-down menus.
- the rendering instructions can, for example, be in the form of HTML, SGML, JavaScript, Jelly, AngularJS, or other text or binary instructions for generating a graphical user interface on a display that can be used to generate pixel information.
- a structured data output of one device can be provided to an input of the hardware display so that the elements provided on the hardware display screen represent the underlying structure of the output data.
- servers may omit display 210 .
- RAM 204 or storage 206 can be distributed across multiple machines such as network-based memory or memory in multiple machines performing the operations of clients or servers.
- bus 212 can be composed of multiple buses, that may be connected to each other through various bridges, controllers, and/or adapters.
- the computing device 200 may also contain a power source 270 , such as a battery, so that the unit can operate in a self-contained manner.
- Computing device 200 may contain any number of sensors and detectors 260 that monitor physical conditions of the device 200 itself or the environment around the device 200 . For example, sensors 260 may trigger alerts that provide indications of the physical conditions.
- Such alerts may indicate conditions that may include temperature of the processor 202 , utilization of the processor 202 or memory 204 , utilization of the storage 206 , and utilization of the power source 270 .
- Such alerts of conditions detected by sensors 260 may safeguard against exceeding operational capacity or operational limits, such as hard drive rpm for storage 206 , maximum temperature of processor 202 or power source 270 , or any other physical health states of the computing device 200 .
- Sensors 260 may include a location identification unit, such as a GPS or other type of location device. These may communicate with the CPU/processor 202 via the bus 212 .
- FIG. 3 is a block diagram of an example modular configuration of a computing device, such as the computing device as shown in FIG. 2 , in accordance with this disclosure.
- a modular configuration 300 may include an event pattern detection module 301 , an alert grouping module 322 , a presentation module 324 , or any combination thereof.
- the event pattern detection module 301 may include an avalanche pattern detection module 304 , a conditional probability pattern detection module 314 , a pattern merging module 320 , or a combination thereof.
- the event pattern detection module 301 may receive inputs including historical alert data 302 and parameters 312 .
- Historical alert data 302 may include information related to alerts prior to a selected time marker. For example, in the cloud computing system 100 , an alert history period may be selected between a first time marker and a second time marker, which may be hours, days, weeks or months apart.
- the alert information for the historical alert data 302 may be stored as a table of alerts in one or more databases or data storage units, at one or more locations such as datacenter 120 .
- Parameters 312 may include control values set by a user or system administrator for setting control limits or adjustments for various modules to execute functions related to pattern detection and alert grouping as described herein.
- FIG. 4 is a block diagram of an example alert table in accordance with the present disclosure.
- the alert information for the historical alert data 302 may be stored as an alert table 404 as shown in FIG. 4 , in a data storage unit 402 .
- the alert table 404 may include an alert ID 412 , a time stamp 414 , a configuration item (CI) ID 416 , and an alert metric 418 .
- each alert may be recorded and stored in the data storage 402 with alert information including the time stamp 414 of when the alert was triggered, a CI ID 416 for the identity of the CI affected by the event, and the metric 418 that triggered the alert, which may include, for example, high memory utilization or high CPU utilization.
- an alert metric 418 for high memory utilization may be an indication that an additional server is needed to handle the current traffic or demand for services by clients 112 in the cloud computing system 100 .
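- For illustration, the alert record described above (alert ID 412, time stamp 414, CI ID 416, and alert metric 418) can be sketched as a simple data structure; the field names, types, and the alert_type helper below are illustrative assumptions rather than the disclosed schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Alert:
    """One row of an alert table such as alert table 404 (field names are assumed)."""
    alert_id: str        # alert ID 412, e.g. "A_001"
    timestamp: datetime  # time stamp 414: when the alert was triggered
    ci_id: str           # configuration item ID 416, e.g. "CI-26"
    metric: str          # alert metric 418, e.g. "CPU Utilization" or "Memory"

def alert_type(alert: Alert) -> str:
    """An alert type as a CI ID, alert metric combination (a broader mapping could
    use the CI ID alone)."""
    return f"{alert.ci_id}#{alert.metric}"
```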
- an avalanche pattern module 304 may receive the historical alert data 302 and parameters 312 for processing to determine avalanche patterns.
- the avalanche pattern module 304 may include a time window generation module 306 and an avalanche window module 308 .
- Parameters 312 used for avalanche detection may include values set by a user or system administrator to control the avalanche detection, including but not limited to, a factor value C1 used to determine a window size for counting alerts and a factor value C2 used to determine an avalanche threshold.
- the time window generation module 306 may determine a fixed time window size based on inter-arrival times of consecutive alerts.
- the avalanche window module 308 may determine which time windows contain an avalanche of alerts based on a number of alerts observed in each time window compared to an avalanche threshold.
- FIG. 5 is a diagram of an example of time window size determination and avalanche window determination in accordance with the present disclosure.
- An alert history 501 may be defined between time markers 503 , 505 where an alert 521 and an alert 525 may correspond with the first alert A_001 and last alert A_689, respectively, of an alert table such as alert table 404 shown in FIG. 4 .
- An alert count 502 is shown for the alert history 501 , including numerous alert clusters 504 .
- An inter-arrival time Ti exists between alerts or alert clusters 504 .
- the time window generation module 306 may divide the alert history into fixed time windows TW of the same size.
- the fixed time window size may be based on an average or a median of the inter-arrival times Ti for the alert history 501 .
- the time window size WS may be calculated according to the following equation.
- the avalanche window module 308 may determine the avalanche windows 512 / 514 / 518 based on a total alert count 502 within each time window TW that meets or exceeds the avalanche threshold 506 .
- the avalanche threshold (AV_th) 506 may be calculated according to the following equation.
- where C2 is a constant value and Acnt/TW is a median of alert counts per time window.
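- A minimal sketch of the window-size and avalanche-window determinations follows. The equations themselves are not reproduced above, so the sketch assumes WS = C1 * (median inter-arrival time Ti) and AV_th = C2 * (median alert count per window), consistent with the surrounding definitions; the function names and exact equation forms are assumptions.

```python
from statistics import median

def window_size(arrival_times: list[float], c1: float) -> float:
    """Assumed form of the window-size equation: WS = C1 * median inter-arrival time Ti.
    `arrival_times` is assumed sorted in ascending order."""
    inter_arrivals = [t2 - t1 for t1, t2 in zip(arrival_times, arrival_times[1:])]
    return c1 * median(inter_arrivals)

def avalanche_windows(arrival_times: list[float], c1: float, c2: float) -> list[int]:
    """Indices of fixed time windows TW whose alert count meets or exceeds the
    avalanche threshold (assumed form: AV_th = C2 * median alert count per window)."""
    ws = window_size(arrival_times, c1)
    start = arrival_times[0]
    n_windows = int((arrival_times[-1] - start) // ws) + 1
    counts = [0] * n_windows
    for t in arrival_times:
        counts[min(int((t - start) // ws), n_windows - 1)] += 1
    av_th = c2 * median(counts)
    return [i for i, count in enumerate(counts) if count >= av_th]
```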
- the avalanche pattern module 304 may merge time windows TW that are adjacent and consecutive to an avalanche window 512 / 514 / 518 so that any alerts in the adjacent windows may also be included for the avalanche pattern detection.
- time windows 513 and 515 are adjacent to avalanche windows 512 and 514
- an expanded avalanche window is defined to include the alerts in the time windows 513 and 515 to form a merged avalanche of alerts, so that a more comprehensive set of alerts are considered for the avalanche pattern detection.
- adjacent time windows 517 and 519 may be merged with avalanche window 518 , and the alerts from the time windows 517 / 518 / 519 form a merged avalanche of alerts.
- the avalanche pattern module 304 may compare each expanded avalanche window to each non-avalanche time window TW in the alert history 501 to locate intersections of alert information using parameters 312 that may include minimum frequency of intersections. Intersections of time windows may be identified by comparing the alert information between two time windows, and finding common alert information. For example, if an avalanche window has alerts of alert types A, B, C, D, E, F, and another window has alerts of alert types A, X, C, Z, E, G, then the intersection would be the set A, C, E.
- an alert type may be defined as a CI ID, alert metric combination. In a broader example, an alert type may be defined by the CI ID alone.
- an intersection between an avalanche window and another window may produce the following set: [CI_30# CPU Utilization, CI_68# Memory, CI_72# CPU Utilization]. While alert type examples including CI ID and/or alert metric have been presented here, other mappings to different alert information types are possible.
- an intersection may be determined by matching a CI ID, such as CI-26 in expanded avalanche window 512 / 513 / 514 / 515 to another alert occurrence in another time window TW related to CI-26.
- the intersections may be stored as a pattern.
- if an intersection for alert types A, C, E exceeds the minimum frequency parameter Fi, where A, C, and E are each defined by a different CI ID and alert metric combination, the intersection may be stored as a pattern.
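- The intersection step can be sketched as follows, assuming each time window is reduced to the set of alert types it contains (e.g., CI_ID#metric strings) and that an intersection is retained only when it recurs at least Fi times across window comparisons; the helper name and the exact frequency test are assumptions.

```python
from collections import Counter

def window_intersections(avalanche_window: set[str],
                         other_windows: list[set[str]],
                         min_frequency: int) -> list[frozenset[str]]:
    """Compare an expanded avalanche window against every other time window and
    keep the common alert-type sets that recur at least `min_frequency` (Fi) times."""
    counts: Counter = Counter()
    for window in other_windows:
        common = frozenset(avalanche_window & window)
        if common:
            counts[common] += 1
    return [intersection for intersection, freq in counts.items() if freq >= min_frequency]

# e.g. avalanche window {A, B, C, D, E, F} vs. window {A, X, C, Z, E, G} -> {A, C, E}
```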
- the avalanche pattern module 304 may assign a score to each of the alert intersections based on the total number of alert intersections identified by the time window comparison for the entire alert history and based on the size of the expanded avalanche window. For example, an intersection score may be determined according to the following equation.
- the intersection score accounts for the number of alert types in an intersection. Intersection scores may be defined by variations to Equation (3) above.
- avalanche patterns may be identified as follows. A list of intersections may be sorted in descending order according to intersection score. This sorted list of intersections may be sequentially processed one intersection at a time using an overlap test. As each intersection is considered, the aggregate set of event types for this avalanche list may be accumulated by taking the union of the event types in the current intersection with the event types in previous intersections that have been added to the pattern list. Given the set of alert types for the current intersection, if the fraction of those alert types that have already been seen in other intersections from this avalanche is sufficiently small compared to a group overlap percentage threshold parameter, which may be one of input parameters 312, then that intersection is considered for addition to the pattern list.
- this intersection may be added to the pattern list as a pattern. If the pattern list already contains a pattern that is a sub-set of the current intersection, then this intersection may replace that pattern in the list.
- the avalanche window may be compared with the other windows, generating the following list of intersections with their scores: (AG, 90), (ABC, 40), (CDG, 20), (DEF, 8). These intersections may be considered in the given order for addition to the pattern list.
- the first intersection AG passes the overlap test by default, leaving [“AG” ] in the pattern list and A, G in the aggregate set of event types.
- for intersection ABC, its overlap with the aggregate set of event types A, G is 1 (i.e., “A”), which yields an aggregate percentage of 33.3% (i.e., A/ABC). Since 33.3% is less than the group overlap percentage threshold of 40%, intersection ABC is added to the pattern list.
- intersection CDG may be compared to the aggregate set of event types A, B, C, G, yielding an aggregate percentage of 66% (i.e., CG/CDG). Since this aggregate percentage is greater than the group overlap percentage threshold of 40%, intersection CDG is not added to the pattern list. Finally, intersection DEF is considered.
- DEF may be added to the pattern list, which then becomes [“AG”, “ABC”, “DEF” ], with an aggregate set of event types A, B, C, D, E, F, G. Note that if ABC had scored higher than AG, then AG would not be included in the pattern list.
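- The greedy overlap test walked through above (intersections AG, ABC, CDG, DEF against a 40% threshold) can be sketched as below; single-letter strings stand in for alert types as in the example, the scores are taken as given, and the function name and the handling of the exact threshold boundary are assumptions.

```python
def build_pattern_list(scored_intersections: list[tuple[str, float]],
                       overlap_threshold: float) -> list[str]:
    """Greedy overlap test: walk intersections in descending score order and add one
    when the share of its alert types already seen in accepted intersections is small."""
    patterns: list[str] = []
    seen: set[str] = set()
    for alert_types, _score in sorted(scored_intersections, key=lambda x: -x[1]):
        types = set(alert_types)   # single-letter alert types, as in the example
        overlap = len(types & seen) / len(types)
        if overlap <= overlap_threshold:
            # an accepted pattern that is a subset of the new intersection is replaced
            patterns = [p for p in patterns if not set(p) <= types]
            patterns.append(alert_types)
            seen |= types
    return patterns

# Reproduces the worked example:
# build_pattern_list([("AG", 90), ("ABC", 40), ("CDG", 20), ("DEF", 8)], 0.40)
# -> ["AG", "ABC", "DEF"]
```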
- the avalanche pattern detection module 304 may validate patterns based on configuration management database (CMDB) information.
- computing devices related to the candidate nodes may be interconnected in a cloud computing system, such as cloud computing system 100 shown in FIG. 1 , and the interconnected dependencies may be tracked and updated in the CMDB.
- the CI's may also have dependency relationships based on instances of software modules, such as service applications for example, which reside on interconnected servers, such as servers 122 shown in FIG. 1 .
- the avalanche pattern detection module 304 may compare the identified avalanche patterns to actual CI dependency information from the CMDB to determine if any patterns are invalid. For example, if one of the identified avalanche patterns consists of CI's that bear no interconnected relationship, then that pattern may be deleted from the avalanche patterns.
- the event pattern detection module 301 may include a conditional probability pattern detection module 314 for performing a pattern detection in parallel with the avalanche pattern module 304 .
- the conditional probability pattern detection module 314 may include a co-occurrence detection module 316 , a probabilistic graph module 318 , and a parametric graph component module 319 .
- the co-occurrence detection module 316 may determine a number of time windows TW of size WS according to Equation (2) above. Alternatively, the co-occurrence detection module 316 may receive the time window information and time window size WS from the avalanche pattern detection module 304 .
- the co-occurrence detection module 316 may detect co-occurrences of CI pairs or groups in the time windows TW, which may or may not be avalanche windows. For example, as shown in FIGS. 4 and 5 , alerts 521 and 522 occur within time window 512 and relate to CI-26 and CI-27. The same pair of CI's, CI-26 and CI-27 appear in alerts 523 and 524 in time window 514 . In some applications, the co-occurrence detection module 316 may detect co-occurrences of CI ID, alert metric combination pairs or groups in the time windows TW.
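- A sketch of the co-occurrence count is shown below, assuming each time window is represented by the set of CIs (or CI ID, alert metric combinations) that alerted within it; the function name is an assumption.

```python
from collections import Counter
from itertools import combinations

def ci_pair_cooccurrences(windows: list[set[str]]) -> Counter:
    """Count, for every pair of CIs, the number of time windows TW in which both alerted."""
    pair_counts: Counter = Counter()
    for window in windows:
        for pair in combinations(sorted(window), 2):
            pair_counts[pair] += 1
    return pair_counts

# CI-26 and CI-27 both alerting in time windows 512 and 514 would give
# pair_counts[("CI-26", "CI-27")] == 2
```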
- FIGS. 6A-6C are diagrams of an example conditional probabilistic graphing sequence in accordance with the present disclosure.
- the probabilistic graph module 318 may generate a conditional probabilistic graph, such as the probabilistic graph components 621 and 622 shown in FIG. 6A , based on parameters 312 .
- candidate nodes 601 - 612 of candidate graph component 621 may be identified based on parameters 312 set by a user or system administrator to include a minimum frequency of CI co-occurrences f_CI in the alert history 501 .
- the conditional probabilistic graph components 621 and 622 may be generated by probabilistic graph module 318 based on the results of the co-occurrence detection module 316 , which may use pairwise probability.
- each node on graph components 621 , 622 may represent a CI or alert metric with a probability that meets or exceeds the f_CI parameter.
- a first count of how many co-occurrences of CI's and/or alert metrics occur in time window comparisons may be determined, and the first count may be compared to a second count of individual alerts to calculate a probability for each node.
- a frequentist probability may be determined to establish the conditional probabilistic graph components 621 , 622 .
- the probability may be calculated on a condition that the first count and the second count are not less than the f_CI parameter. For example, the probability of an alert A, such as at node 610 for a CI-26 alert, may be determined given an alert B, such as at node 601 for a CI-27 alert, according to the following equation.
- where N_A is the number of time windows TW in which alert A appears.
- the probability P(A|B) may not be calculated if the number of A alerts or B alerts is less than the minimum frequency parameter value f_CI, and nodes for such low frequency alerts are omitted from the graph component.
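- Only part of the where-clause for the probability equation survives above; the sketch below assumes the frequentist form P(A|B) = N_AB / N_B over time windows, with the node omitted when either alert appears in fewer than f_CI windows. The exact equation and the function signature are assumptions.

```python
from typing import Optional

def conditional_probability(windows: list[set[str]], a: str, b: str,
                            f_ci: int) -> Optional[float]:
    """Assumed frequentist form P(A|B) = N_AB / N_B over time windows; returns None
    (node omitted from the graph component) when either alert appears in fewer
    than f_CI windows (f_ci is assumed to be at least 1)."""
    n_a = sum(1 for w in windows if a in w)
    n_b = sum(1 for w in windows if b in w)
    if n_a < f_ci or n_b < f_ci:
        return None
    n_ab = sum(1 for w in windows if a in w and b in w)
    return n_ab / n_b
```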
- the probabilistic graph components 621 , 623 may be generated in some applications by the probabilistic graph module 318 based on parameters adjusted by parametric graph module 319 .
- probabilistic graph components 621 , 623 may be generated by defining the nodes according to the alert types A and B as nodes 601 - 615 and by adding edges 631 to the graph component if one of the pairwise conditional probabilities P(A|B) or P(B|A) exceeds an initial threshold CP_in.
- graph component 622 may be generated where alerts represented by nodes 613 - 615 have pairwise probability of 0.25 as shown by the edge value 631 , which exceeds an initial threshold CP_in value of 0.04.
- FIG. 6B shows an example of conditional probability graph components 621 and 622 after applying an adjusted threshold CP_in value in order to identify smaller subgraph components that represent stronger alert correlations.
- an edge count threshold EC may be established from the input parameters 312 .
- FIG. 6C shows an example of subgraph components 625 - 627 formed by eliminating edges of subgraph component 623 after raising the conditional probability threshold to 0.3.
- these graphs may be considered as having a critical size for conditional probability, and as such are potential patterns and may be stored by the conditional probability pattern detection module 314 .
- conditional probability threshold increments of 0.1 may be applied; alternatively, other fixed-size or variable-size increments may be applied to accommodate different edge count EC thresholds and for a more or less rapid approach to reach critical size graph components.
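- The threshold-raising loop can be sketched as follows: edges below the current conditional probability threshold are dropped and connected components are recomputed until every component's edge count is no greater than the EC threshold. The 0.1 increment follows the text; the data layout and function name are assumptions.

```python
from collections import defaultdict

def split_to_critical_size(edges: dict[tuple[str, str], float],
                           cp_in: float, edge_count_threshold: int,
                           step: float = 0.1) -> list[set[str]]:
    """Raise the conditional probability threshold from CP_in in fixed steps until
    every remaining connected component has at most `edge_count_threshold` (EC) edges."""
    threshold = cp_in
    while True:
        kept = {edge: p for edge, p in edges.items() if p >= threshold}
        adjacency = defaultdict(set)
        for u, v in kept:
            adjacency[u].add(v)
            adjacency[v].add(u)
        components: list[set[str]] = []
        visited: set[str] = set()
        for node in adjacency:
            if node in visited:
                continue
            stack, component = [node], set()
            while stack:                      # depth-first traversal of one component
                n = stack.pop()
                if n not in component:
                    component.add(n)
                    stack.extend(adjacency[n] - component)
            visited |= component
            components.append(component)
        oversized = any(
            sum(1 for u, v in kept if u in comp and v in comp) > edge_count_threshold
            for comp in components
        )
        if not oversized:
            return components
        threshold += step
```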
- the conditional probability pattern detection module 314 may validate patterns based on an alert coverage parameter taken from parameters 312 . For example, a percentage of alerts in alert history 501 that appear in a pattern from probabilistic graph components can be determined, and if the percentage is greater than the alert coverage parameter, the pattern is stored as a conditional probability pattern. If one or more patterns fail to meet the alert coverage parameter, the probabilistic graph components can be reformed based on adjustment to the parameters 312 .
- the conditional probability pattern detection module 314 may validate patterns based on CMDB information.
- computing devices related to the candidate nodes may be interconnected in a cloud computing system, such as cloud computing system 100 shown in FIG. 1 , and the interconnected dependencies may be tracked and updated in the CMDB.
- the CI's may also have dependency relationships based on instances of software modules, such as service applications for example, which reside on interconnected servers, such as servers 122 shown in FIG. 1 .
- the probabilistic graph module 318 may compare the graph components, such as graph components 622 , 624 - 627 shown in FIG. 6C , to actual CI dependency information from the CMDB to determine if any patterns are invalid. For example, if one of the graph components consists of CI's that bear no interconnected relationship, then a pattern based on that graph component may be deleted from the candidate patterns.
- a pattern merging module 320 may merge the patterns determined by the avalanche pattern module 304 with the patterns determined by the conditional probability pattern detection module 314 .
- the pattern merging module 320 may combine the avalanche patterns with the conditional probability patterns and store the union of patterns as learned patterns from the event pattern detection module 301 .
- the merged pattern information may be stored in alert tables with pattern IDs, each row consisting of a CI_ID, alert metric combination and pattern ID.
- the merged pattern information may be stored as a hash map with key as pattern ID and value as set of entities representing the pattern.
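- Both storage layouts mentioned above can be illustrated briefly; the pattern IDs and entity strings below are hypothetical examples, not values from the disclosure.

```python
# hash-map form: key is a pattern ID, value is the set of entities representing the pattern
learned_patterns: dict[str, set[str]] = {
    "P_001": {"CI_30#CPU Utilization", "CI_68#Memory", "CI_72#CPU Utilization"},
    "P_002": {"CI-26#Memory", "CI-27#CPU Utilization"},
}

# equivalent table form: one row per (CI_ID, alert metric, pattern ID)
pattern_rows = [
    (entity.split("#")[0], entity.split("#")[1], pattern_id)
    for pattern_id, entities in learned_patterns.items()
    for entity in sorted(entities)
]
```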
- An alert grouping module 322 may perform a matching function to compare an alert stream 332 to the stored patterns and assign an alert to an alert group if there is a match to one or more patterns.
- the alert grouping module 322 may receive a current alert from alert stream 332 and compare the CI and/or the alert metric for the current alert to the learned patterns stored by the pattern merging module 320 , and determine which one or more patterns include the same CI and/or alert metric.
- An alert group may be formed by applying a sliding window.
- a ten minute window may be defined according to parameters 312 .
- the matching process starts with a first alert group AB, which may be kept active for a maximum group lifetime, which may be an input parameter 312 based on a fixed time window, such as, for example, a ten minute window.
- the alert group AB may be compared to all alerts received from the alert stream 332 in the past ten minute window.
- the alert group AB may include a list of associated pattern IDs. If no match is made to pattern IDs associated with the alerts monitored in the alert stream, then the lifetime of alert group AB expires, and no further comparisons are made to that alert group.
- an alert C may be grouped with alert group AB to form alert group ABC. For example, if alert C includes an alert type found within alert group AB, then alert C is added to the group.
- the list of pattern IDs for the group AB may be updated by keeping only those pattern IDs that contain C. The time is maintained for the updated list of pattern IDs whenever any new alert is grouped using that list. If the time window elapses and the pattern IDs in that list have not been used to group alerts, the alert group may be finalized and no further comparisons are made for future alerts.
- the alert group may include a set of alert types, such as a CI_ID, alert metric pairs, a list of patterns that match this set of alert types, a first alert time (i.e. the alert time for the earliest alert in the group) and a latest alert time (i.e., the alert time for the most recent alert added to the group).
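- A sketch of the grouping loop described above follows: a current alert is matched to the learned patterns, a live group keeps only the pattern IDs that contain the new alert, and a group stops accepting alerts once its window (e.g., ten minutes) elapses without a match. The class, field, and function names are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class AlertGroup:
    alert_types: set[str] = field(default_factory=set)   # e.g. CI_ID#metric strings
    pattern_ids: set[str] = field(default_factory=set)   # learned patterns still matching
    first_alert_time: float = 0.0                         # earliest alert in the group
    latest_alert_time: float = 0.0                        # most recent alert added

def group_alert(groups: list[AlertGroup], patterns: dict[str, set[str]],
                alert_type: str, now: float, lifetime: float = 600.0) -> None:
    """Add the current alert to each live group whose remaining patterns contain it;
    otherwise open a new group seeded with every pattern that contains the alert."""
    matching = {pid for pid, entities in patterns.items() if alert_type in entities}
    placed = False
    for group in groups:
        if now - group.latest_alert_time > lifetime:
            continue  # group lifetime expired: finalized, no further comparisons
        keep = group.pattern_ids & matching
        if keep:
            group.alert_types.add(alert_type)
            group.pattern_ids = keep  # keep only pattern IDs that contain the new alert
            group.latest_alert_time = now
            placed = True
    if not placed and matching:
        groups.append(AlertGroup({alert_type}, matching, now, now))
```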
- a presentation module 324 may generate a display region for displaying the alert groups for monitoring by a user or system administrator.
- FIG. 7 is a diagram of an example display region generated for displaying alert groups in accordance with the present disclosure.
- the display region 702 may include a presentation of the alert groups 703 , the severity type 704 of the alerts in the alert group, the severity of related alerts 705 , and a count of impacted services 706 .
- an alert group 703 , such as alert group 5 , may be displayed with related alerts 705 , such as 1 related alert having critical severity, 0 related alerts having major or minor severity, and 1 impacted service 706 .
- Three severity types 704 are shown in FIG. 7 for illustrative purposes; however, there may be more or fewer severity types.
- the alert groups 703 may be sorted by severity 704 as shown in FIG. 7 .
- Other presentations of the alert groups 703 may be presented by sorting according to other information shown in FIG. 7 .
- Alert groups determined by alert grouping module 322 may be evaluated based on a compression rate parameter and an alert coverage parameter from parameters 312 .
- the compression rate may be determined according to the following equation.
- the alert coverage may be determined according to the following equation.
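- The compression rate and alert coverage equations are not reproduced above; the sketch below assumes the common forms of these metrics (the fraction of alerts eliminated by grouping, and the fraction of alerts appearing in at least one group) and should be read as an assumption rather than the disclosed equations.

```python
def compression_rate(total_alerts: int, total_groups: int) -> float:
    """Assumed form: fraction of raw alerts eliminated by folding them into groups."""
    return 1.0 - (total_groups / total_alerts) if total_alerts else 0.0

def alert_coverage(grouped_alerts: int, total_alerts: int) -> float:
    """Assumed form: fraction of all alerts that appear in at least one alert group."""
    return grouped_alerts / total_alerts if total_alerts else 0.0
```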
- FIG. 8 is a diagram of an example display region generated for enabling user feedback and supervision of alert grouping in accordance with the present disclosure.
- the user or system administrator may delete a group using a graphical user interface, such as the delete alert button 809 .
- the display region 802 may include various alert information for an alert group 803 , as shown in FIG. 8 .
- impacted services may be visually accessed by clicking on the impacted services button 804 .
- Feedback 806 from the user may be submitted to a system administrator regarding whether the alert group is representative or related to new alerts.
- the display region 802 may also include the alert ID 810 , severity of each alert 812 , the CI_ID 814 , the metric name 816 , or a combination thereof, for the alerts in the displayed alert group.
- the user or system administrator may set a rule to prevent particular patterns from being developed by the avalanche pattern detection module 304 or the conditional probability pattern detection module 314 .
- the user or system administrator may define patterns and can set a high priority of those patterns if so desired. For instance, a particular alert or entity may be flagged as significant or critical and an associated pattern may be assigned a high priority.
- FIG. 9 is a flowchart of an example method for aggregating alerts for management of computer system alerts in accordance with this disclosure.
- Grouping alerts generated by automated monitoring of at least an operating condition change of a machine in a computer network in response to an event may be implemented in a computing system, such as the cloud computing system 100 shown in FIG. 1 .
- grouping alerts may be implemented on a server, such as one or more of the servers 122 shown in FIG. 1 , a computing device, such as a client 112 shown in FIG. 1 , or by a combination thereof communicating via a network, such as the network 130 shown in FIG. 1 .
- Grouping alerts may include obtaining historical alert data at 902 , identifying event patterns associated with avalanches at 908 , identifying event patterns based on conditional probability at 918 , merging patterns at 920 , matching alerts to patterns at 922 , assigning current alert to alert group(s) at 924 , generating graphical display regions for alert groups at 926 , or a combination thereof.
- event pattern detection module 301 obtains available historical alert data that may be stored in a data storage unit.
- the historical alert data may include information for each alert, including an alert ID, a time stamp, a configuration item ID, an alert metric, or a combination thereof.
- the alert data may be associated with alerts received in recent hours, days, weeks or months for computing system 100 .
- Avalanche patterns may be identified at steps 904 - 908 .
- time windows may be defined for avalanche detection based on a fixed window size determined by a parameter C1 and inter-arrival time of alerts.
- Avalanche windows may be determined at 906 based on an avalanche threshold and a parameter C2.
- avalanche windows may be determined based on an alert count for a time window meeting or exceeding the avalanche threshold.
- event patterns associated with avalanches may be identified based on intersections of avalanche alerts with alerts in each other time window.
- Avalanche patterns may be based on intersections that have an intersection score that meets or exceeds an avalanche pattern threshold. Intersection scores may be determined based on number of intersections and number of avalanche window alerts.
- Conditional probability patterns may be identified at steps 914 - 918 .
- probabilistic graph candidates may be determined based on co-occurrences of alert information in the time windows meeting or exceeding a parametric threshold.
- a probabilistic graph may be generated using the probabilistic graph candidates having a probability that satisfies a threshold.
- Event patterns may be identified based on conditional probability determined from the probabilistic graph, at 918 , where conditional probability is based on co-occurrences of CI's for alerts in two or more time windows.
- the conditional probability may also be supervised by rule based parameters set by a user.
- event patterns identified by avalanche detection may be merged with event patterns identified by conditional probability and stored for alert grouping.
- an alert stream is monitored and each alert is compared to the stored patterns.
- a current alert may be assigned to one or more groups for each match found to a stored pattern at 924 .
- a graphical display region may be generated for displaying alert groups at 926 based on the alert groups identified at 924 .
- the graphical display region can include, for example, information relating to the alert groups for monitoring by a user or system administrator, for management of computer system alerts and to enable user feedback and supervision.
- a graphical display region may be generated in response to a user intervention, such as interface with a graphical user interface. However, in certain circumstances and implementations, the graphical display region may be triggered automatically.
- the alert groups displayed in step 926 can include information about associated alerts, including severity type, related alerts and impacted services. Steps 922 through 926 can be repeated multiple times over numerous client instances, server instances, or a combination of both as alerts from an alert stream are received.
- the alert groups can be ordered in priority order based on severity type.
- Steps 902 - 926 may be performed periodically. For example, a task can be scheduled on an hourly, daily, or weekly basis during which the steps are performed. The steps can be performed on the same or different periodic schedules for each of the database server instances in the cloud infrastructure, such as by physical server or datacenter. Certain database server instances or physical servers may not be included based on user configuration. Upon each iteration, the graphical display regions generated by step 926 can be updated and/or regenerated.
- the pattern detection module and alert grouping module can take the form of one or more Java classes with executable or human-readable code for performing some or all of the steps 902 - 924 described above.
- the pattern detection module and alert grouping module can, for example, be located on one or more servers used to manage other servers (management servers) in the cloud computing system, including but not limited to servers 122 .
- the management servers can, for example, include the same or similar platform application and be included on some of servers 122 .
- the one or more Java classes can be plugged into or connected to an instance or installation of the platform application to extend the platform functionality to include the functionality of the pattern detection module and alert grouping module.
- functionality of the pattern detection module and alert grouping module may be accessed via the platform, for example, by using script calls stored in an associated database that are configured to invoke the desired functionality.
- the platform can be configured to periodically execute techniques similar to steps 902 - 924 included in the pattern detection module and alert grouping module without user intervention.
- the graphical display regions generated by the presentation module 324 can include one or more links or buttons that when clicked cause the platform to execute other platform functionality for invoking a move operation for an associated database server instance.
- Input parameters 312 used in any of the above embodiments may be based on various types of information, including but not limited to value-based information, event-based information, environment-based information, or a combination thereof.
- value-based information may include business models, service catalog information, customer impact feeds information, or the like.
- event-based information may include change management information, alerts, incidents, or the like.
- environment-based information may include configuration management database (CMDB) information, business rules, workflows, or the like.
- the implementations of computing devices as described herein can be realized in hardware, software, or any combination thereof.
- the hardware can include, for example, computers, intellectual property (IP) cores, application-specific integrated circuits (ASICs), programmable logic arrays, optical processors, programmable logic controllers, microcode, microcontrollers, servers, microprocessors, digital signal processors or any other suitable circuit.
- processors should be understood as encompassing any of the foregoing hardware, either singly or in combination.
- one or more computing devices can include an ASIC or programmable logic array such as a field-programmable gate array (FPGA) configured as a special-purpose processor to perform one or more of the operations described or claimed herein.
- An example FPGA can include a collection of logic blocks and random access memory (RAM) blocks that can be individually configured and/or configurably interconnected in order to cause the FPGA to perform certain functions. Certain FPGA's may contain other general or special purpose blocks as well.
- An example FPGA can be programmed based on a hardware definition language (HDL) design, such as VHSIC Hardware Description Language or Verilog.
- the embodiments herein may be described in terms of functional block components and various processing steps. Such functional blocks may be realized by any number of hardware and/or software components that perform the specified functions.
- the described embodiments may employ various integrated circuit components, e.g., memory elements, processing elements, logic elements, look-up tables, and the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices.
- where the elements of the described embodiments are implemented using software programming or software elements, the invention may be implemented with any programming or scripting language, such as C, C++, Java, assembler, or the like, with the various algorithms being implemented with any combination of data structures, objects, processes, routines, or other programming elements.
- Implementations or portions of implementations of the above disclosure can take the form of a computer program product accessible from, for example, a computer-usable or computer-readable medium.
- a computer-usable or computer-readable medium can be any device that can, for example, tangibly contain, store, communicate, or transport a program or data structure for use by or in connection with any processor.
- the medium can be, for example, an electronic, magnetic, optical, electromagnetic, or a semiconductor device. Other suitable mediums are also available.
- Such computer-usable or computer-readable media can be referred to as non-transitory memory or media, and may include RAM or other volatile memory or storage devices that may change over time.
- a memory of an apparatus described herein, unless otherwise specified, does not have to be physically contained by the apparatus, but is one that can be accessed remotely by the apparatus, and does not have to be contiguous with other memory that might be physically contained by the apparatus.
- the word “example” is used herein to mean serving as an example, instance, or illustration. Any implementation or design described herein as “example” is not necessarily to be construed as preferred or advantageous over other implementations or designs. Rather, use of the word “example” is intended to present concepts in a concrete fashion.
- the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or clear from context, “X includes A or B” is intended to mean any of the natural inclusive permutations. In other words, if X includes A; X includes B; or X includes both A and B, then “X includes A or B” is satisfied under any of the foregoing instances.
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Human Computer Interaction (AREA)
- Debugging And Monitoring (AREA)
Abstract
Description
- The present disclosure is a continuation of and claims priority to U.S. patent application Ser. No. 15/141,395, filed on Apr. 28, 2016, the entire contents of which are herein incorporated by reference.
- The description herein makes reference to the accompanying drawings wherein like reference numerals refer to like parts throughout the several views.
-
FIG. 1 is a block diagram of a computing network in which the teachings herein may be implemented. -
FIG. 2 is a block diagram of an example internal configuration of a computing device, such as a computing device of the computing network as shown inFIG. 1 . -
FIG. 3 is a block diagram of an example modular configuration of a computing device, such as the computing device as shown inFIG. 2 , in accordance with the present disclosure. -
FIG. 4 is a block diagram of an example alert table in accordance with the present disclosure. -
FIG. 5 is a diagram of an example time window size determination and avalanche window determination based on an alert history in accordance with the present disclosure. -
FIGS. 6A-6C are diagrams of an example conditional probabilistic graphing sequence in accordance with the present disclosure. -
FIG. 7 is a diagram of an example display region generated for displaying alert groups in accordance with the present disclosure. -
FIG. 8 is a diagram of an example display region generated for enabling user feedback and supervision of alert grouping in accordance with the present disclosure. -
FIG. 9 is a flow chart of an example method of aggregating alerts for management of computer system alerts in accordance with the present disclosure. - A distributed computing system, such as a cloud computing system, may include multiple computing devices at a customer end and multiple computer servers at a service provider end, which may be interconnected by a cloud network. As customer devices request services and use resources provided by the server devices, the flow of information must be controlled and monitored to maintain quality of service. At times of higher demand for services and resources, nodes, such as servers, along the interconnected network may encounter overload conditions and traffic may need to be rerouted to other available nodes that are currently not overloaded. Alerts may be triggered upon detection of conditions or events that relate to nodes being overloaded. Other examples of alerts that may be triggered may include when a customer or server device is down due to a hardware or software error or failure.
- When a significant condition or event occurs that affects multiple devices in the cloud computing system, a cluster or “avalanche” of alerts may be triggered within a short time period. Over time, as avalanches of alerts are detected, patterns of affected nodes may emerge, which can be stored and used as a template during real time monitoring of alerts. As a current alert is detected, it may be matched to learned patterns for aggregating the alert into one or more alert groups to more efficiently manage and dispatch the alert. Conditional probability patterns may also be developed based on stored alert information, which may further refine the learned patterns for the alert grouping. By aggregating alerts, system operators may more efficiently triage alerts for disposition.
-
FIG. 1 is a block diagram of a distributed (e.g., client-server, networked, or cloud)computing system 100. Use of the phrase “cloud computing system” herein is a proxy for any form of a distributed computing system, and this phrase is used simply for ease of reference.Cloud computing system 100 can have any number of customers, includingcustomer 110. Eachcustomer 110 may have clients, such asclients 112. Each ofclients 112 can be in the form of a computing system comprising multiple computing devices, or in the form of a single computing device, for example, a mobile phone, a tablet computer, a laptop computer, a notebook computer, a desktop computer, and the like.Customer 110 andclients 112 are examples only, and a cloud computing system may have a different number of customers or clients or may have a different configuration of customers or clients. For example, there may be hundreds or thousands of customers and each customer may have any number of clients. -
Cloud computing system 100 can include any number of datacenters, includingdatacenter 120. Eachdatacenter 120 may have servers, such asservers 122. Eachdatacenter 120 may represent a facility in a different geographic location where servers are located. Each ofservers 122 can be in the form of a computing system including multiple computing devices, or in the form of a single computing device, for example, a desktop computer, a server computer and the like. Thedatacenter 120 andservers 122 are examples only, and a cloud computing system may have a different number of datacenters and servers or may have a different configuration of datacenters and servers. For example, there may be tens of data centers and each data center may have hundreds or any number of servers. -
Clients 112 andservers 122 may be configured to connect tonetwork 130. The clients for a particular customer may connect to network 130 via acommon connection point 116 or different connection points, e.g., awireless connection point 118 and awired connection point 119. Any combination of common or different connections points may be present, and any combination of wired and wireless connection points may be present as well.Network 130 can be, for example, the Internet.Network 130 can also be or include a local area network (LAN), wide area network (WAN), virtual private network (VPN), or any other means of transferring data between any ofclients 112 andservers 122.Network 130,datacenter 120 and/or blocks not shown may include network hardware such as routers, switches, load balancers and/or other network devices. - Other implementations of the
cloud computing system 100 are also possible. For example, devices other than the clients and servers shown may be included insystem 100. In an implementation, one or more additional servers may operate as a cloud infrastructure control, from which servers and/or clients of the cloud infrastructure are monitored, controlled and/or configured. For example, some or all of the techniques described herein may operate on said cloud infrastructure control servers. Alternatively, or in addition, some or all of the techniques described herein may operate on servers such asservers 122. -
FIG. 2 is a block diagram of an example internal configuration of acomputing device 200, such as aclient 112 orserver device 122 of thecomputing system 100 as shown inFIG. 1 , including an infrastructure control server, of a computing system. As previously described,clients 112 orservers 122 may take the form of a computing system including multiple computing units, or in the form of a single computing unit, for example, a mobile phone, a tablet computer, a laptop computer, a notebook computer, a desktop computer, a server computer and the like. - The
computing device 200 can comprise a number of components, as illustrated inFIG. 2 . CPU (or processor) 202 can be a central processing unit, such as a microprocessor, and can include single or multiple processors, each having single or multiple processing cores. Alternatively,CPU 202 can include another type of device, or multiple devices, capable of manipulating or processing information now-existing or hereafter developed. When multiple processing devices are present, they may be interconnected in any manner, including hardwired or networked, including wirelessly networked. Thus, the operations ofCPU 202 can be distributed across multiple machines that can be coupled directly or across a local area or other network. TheCPU 202 can be a general purpose processor or a special purpose processor. -
Memory 204, such as Random Access Memory (RAM), can be any suitable non-permanent storage device that is used as memory.RAM 204 can include executable instructions and data for immediate access byCPU 202.RAM 204 typically includes one or more DRAM modules such as DDR SDRAM. Alternatively,RAM 204 can include another type of device, or multiple devices, capable of storing data for processing byCPU 202 now-existing or hereafter developed.CPU 202 can access and manipulate data inRAM 204 viabus 212. TheCPU 202 may utilize acache 220 as a form of localized fast memory for operating on data and instructions. -
Storage 206 can be in the form of read only memory (ROM), a disk drive, a solid state drive, flash memory, Phase-Change Memory (PCM), or any form of non-volatile memory designed to maintain data for some duration of time, and preferably in the event of a power loss.Storage 206 can compriseexecutable instructions 206A and application files/data 206B along with other data. Theexecutable instructions 206A can include, for example, an operating system and one or more application programs for loading in whole or part into RAM 204 (with RAM-basedexecutable instructions 204A and application files/data 204B) and to be executed byCPU 202. Theexecutable instructions 206A may be organized into programmable modules or algorithms, functional programs, codes, and code segments designed to perform various functions described herein. The operating system can be, for example, Microsoft Windows, Mac OS X®, or Linux®, or other operating system, or it can be an operating system for a small device, such as a smart phone or tablet device, or a large device, such as a mainframe computer. The application program can include, for example, a web browser, web server and/or database server. Application files 206B can, for example, include user files, database catalogs and configuration information. In an implementation,storage 206 comprises instructions to perform the discovery techniques described herein.Storage 206 may comprise one or multiple devices and may utilize one or more types of storage, such as solid state or magnetic. - The
computing device 200 can also include one or more input/output devices, such as a network communication unit 208 and interface 230 that may have a wired communication component or a wireless communications component 290, which can be coupled to CPU 202 via bus 212. The network communication unit 208 can utilize any of a variety of standardized network protocols, such as Ethernet or TCP/IP, to name a few of many protocols, to effect communications between devices. The interface 230 can include one or more transceivers that utilize Ethernet, power line communication (PLC), WiFi, infrared, GPRS/GSM, CDMA, etc. - A user interface can be broken down into the hardware user interface portion and the software user interface portion. A
hardware user interface 210 can include a display, positional input device (such as a mouse, touchpad, touchscreen, or the like), keyboard, or other forms of user input and output devices and hardware. Thehardware user interface 210 can be coupled to theprocessor 202 via thebus 212. Other output devices that permit a user to program or otherwise use the client or server can be provided in addition to or as an alternative to display 210. When the output device is or comprises a hardware display, this display can be implemented in various ways, including by a liquid crystal display (LCD) or a cathode-ray tube (CRT) or light emitting diode (LED) display, such as an OLED display. - The software graphical user interface constitutes programs and data that reflect information ultimately destined for display on a hardware device. For example, the data can contain rendering instructions for bounded graphical display regions, such as windows, or pixel information representative of controls, such as buttons and drop-down menus. The rendering instructions can, for example, be in the form of HTML, SGML, JavaScript, Jelly, AngularJS, or other text or binary instructions for generating a graphical user interface on a display that can be used to generate pixel information. A structured data output of one device can be provided to an input of the hardware display so that the elements provided on the hardware display screen represent the underlying structure of the output data.
- Other implementations of the internal configuration or architecture of clients and
servers 200 are also possible. For example, servers may omitdisplay 210.RAM 204 orstorage 206 can be distributed across multiple machines such as network-based memory or memory in multiple machines performing the operations of clients or servers. Although depicted here as a single bus,bus 212 can be composed of multiple buses, that may be connected to each other through various bridges, controllers, and/or adapters. Thecomputing device 200 may also contain apower source 270, such as a battery, so that the unit can operate in a self-contained manner.Computing device 200 may contain any number of sensors anddetectors 260 that monitor physical conditions of thedevice 200 itself or the environment around thedevice 200. For example,sensors 260 may trigger alerts that provide indications of the physical conditions. Such alerts may indicate conditions that may include temperature of theprocessor 202, utilization of theprocessor 202 ormemory 204, utilization of thestorage 206, and utilization of thepower source 270. Such alerts of conditions detected bysensors 260 may safeguard against exceeding operational capacity or operational limits, such as hard drive rpm forstorage 206, maximum temperature ofprocessor 202 orpower source 270, or any other physical health states of thecomputing device 200.Sensors 260 may include a location identification unit, such as a GPS or other type of location device. These may communicate with the CPU/processor 202 via thebus 212. -
FIG. 3 is a block diagram of an example modular configuration of a computing device, such as the computing device as shown inFIG. 2 , in accordance with this disclosure. Amodular configuration 300 may include an eventpattern detection module 301, analert grouping module 322, apresentation module 324, or any combination thereof. The eventpattern detection module 301 may include an avalanchepattern detection module 304, a conditional probabilitypattern detection module 314, apattern merging module 320, or a combination thereof. - The event
pattern detection module 301 may receive inputs includinghistorical alert data 302 andparameters 312.Historical alert data 302 may include information related to alerts prior to a selected time marker. For example, in thecloud computing system 100, an alert history period may be selected between a first time marker and a second time marker, which may be hours, days, weeks or months apart. The alert information for thehistorical alert data 302 may be stored as a table of alerts in one or more databases or data storage units, at one or more locations such asdatacenter 120.Parameters 312 may include control values set by a user or system administrator for setting control limits or adjustments for various modules to execute functions related to pattern detection and alert grouping as described herein. -
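For illustration, the inputs to the event pattern detection module 301 can be pictured as simple alert records plus a parameter map. The following is a minimal Python sketch; the field names, parameter keys, and default values are assumptions made for illustration rather than values taken from the disclosure (compare the alert table of FIG. 4 and Equations (1)-(2) described below).

```python
from dataclasses import dataclass

@dataclass
class AlertRecord:
    """One row of historical alert data; field names are illustrative only."""
    alert_id: str     # unique identifier of the alert
    timestamp: float  # epoch seconds at which the alert was triggered
    ci_id: str        # configuration item affected by the event
    metric: str       # alert metric, e.g. "CPU Utilization" or "Memory"

# Hypothetical control values standing in for parameters 312; the actual
# parameter names and defaults are not specified here.
PARAMETERS = {
    "C1": 2.0,                    # window-size factor (Equation (1) below)
    "C2": 3.0,                    # avalanche-threshold factor (Equation (2) below)
    "Fi": 3,                      # minimum frequency of intersections
    "f_CI": 3,                    # minimum frequency of CI co-occurrences
    "group_overlap_pct": 40.0,    # group overlap percentage threshold
    "max_group_lifetime_s": 600,  # maximum group lifetime (e.g. a ten minute window)
}
```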
FIG. 4 is a block diagram of an example alert table in accordance with the present disclosure. The alert information for the historical alert data 302 may be stored as an alert table 404 as shown in FIG. 4, in a data storage unit 402. For each alert, the alert table 404 may include an alert ID 412, a time stamp 414, a configuration item (CI) ID 416, and an alert metric 418. For example, when an event triggers an alert, each alert may be recorded and stored in the data storage 402 with alert information including the time stamp 414 of when the alert was triggered, a CI ID 416 for the identity of the CI affected by the event, and the metric 418 that triggered the alert, which may include, for example, high memory utilization or high CPU utilization. In some applications, an alert metric 418 for high memory utilization may be an indication that an additional server is needed to handle the current traffic or demand for services by clients 112 in the cloud computing system 100. - Returning to
FIG. 3 , anavalanche pattern module 304 may receive thehistorical alert data 302 andparameters 312 for processing to determine avalanche patterns. Theavalanche pattern module 304 may include a timewindow generation module 306 and anavalanche window module 308.Parameters 312 used for avalanche detection may include values set by a user or system administrator to control the avalanche detection, including but not limited to, a factor value C1 used to determine a window size for counting alerts and a factor value C2 used to determine an avalanche threshold. In some applications, the timewindow generation module 306 may determine a fixed time window size based on inter-arrival times of consecutive alerts. Theavalanche window module 308 may determine which time windows contain an avalanche of alerts based on a number of alerts observed in each time window compared to an avalanche threshold. -
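Under the assumption that the alert history is available as a list of alert timestamps, the window sizing and avalanche-window detection just described (and formalized in Equations (1) and (2) below) might be sketched as follows; the function names are illustrative, and the merging of adjacent windows is omitted for brevity.

```python
import statistics

def window_size(timestamps, c1):
    """WS = C1 * median inter-arrival time (Equation (1) below); assumes >= 2 alerts."""
    ts = sorted(timestamps)
    inter_arrival = [b - a for a, b in zip(ts, ts[1:])]
    return c1 * statistics.median(inter_arrival)

def avalanche_windows(timestamps, c1, c2):
    """Return (indices of avalanche windows, per-window alert counts)."""
    ts = sorted(timestamps)
    ws = window_size(ts, c1) or 1.0          # guard against a zero-width window
    n_windows = int((ts[-1] - ts[0]) // ws) + 1
    counts = [0] * n_windows
    for t in ts:
        counts[int((t - ts[0]) // ws)] += 1
    av_th = c2 * statistics.median(counts)   # AV_th (Equation (2) below)
    return [i for i, c in enumerate(counts) if c >= av_th], counts
```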
FIG. 5 is a diagram of an example of time window size determination and avalanche window determination in accordance with the present disclosure. Analert history 501 may be defined betweentime markers FIG. 4 . Analert count 502 is shown for thealert history 501, including numerousalert clusters 504. An inter-arrival time Ti exists between alerts oralert clusters 504. The timewindow generation module 306 may divide the alert history into fixed time windows TW of the same size. The fixed time window size may be based on an average or a median of the inter-arrival times Ti for thealert history 501. For example, the time window size WS may be calculated according to the following equation. -
WS=C1*IAT Equation (1) - where
C1 is a constant value
IAT is a median of inter-arrival times Ti - The
avalanche window module 308 may determine theavalanche windows 512/514/518 based on atotal alert count 502 within each time window TW that meets or exceeds theavalanche threshold 506. For example, the avalanche threshold (AV_th) 506 may be calculated according to the following equation. -
AV_th=C2*Acnt/TW Equation (2) - where
C2 is a constant value
Acnt/TW is a median of alert counts per time window - The
avalanche pattern module 304 may merge time windows TW that are adjacent and consecutive to anavalanche window 512/514/518 so that any alerts in the adjacent windows may also be included for the avalanche pattern detection. For example,time windows avalanche windows time windows adjacent time windows avalanche window 518, and the alerts from thetime windows 517/518/519 form a merged avalanche of alerts. - The
avalanche pattern module 304 may compare each expanded avalanche window to each non-avalanche time window TW in thealert history 501 to locate intersections of alertinformation using parameters 312 that may include minimum frequency of intersections. Intersections of time windows may be identified by comparing the alert information between two time windows, and finding common alert information. For example, if an avalanche window has alerts of alert types A, B, C, D, E, F, and another window has alerts of alert types A, X, C, Z, E, G, then the intersection would be the set A, C, E. As an example, an alert type may be defined as a CI ID, alert metric combination. In a broader example, an alert type may be defined by the CI ID alone. In some applications, an intersection between an avalanche window and another window may produce the following set: [CI_30# CPU Utilization, CI_68# Memory, CI_72# CPU Utilization]. While alert type examples including CI ID and/or alert metric have been presented here, other mappings to different alert information types are possible. - In some applications, an intersection may be determined by matching a CI ID, such as CI-26 in expanded
avalanche window 512/513/514/515 to another alert occurrence in another time window TW related to CI-26.Parameters 312 may include, for example, a parameter Fi for minimum frequency of intersections, which may be set to a value, such as Fi=3, by a user or system administrator. On a condition that three or more unique alert intersections, such as alerts for CI-26, are identified by the comparison at theavalanche pattern module 304, the intersections may be stored as a pattern. In some applications, where an intersection for alert types A, C, E exceed the minimum frequency parameter Fi, where A, C and E are each defined by a different CI ID and an alert metric combination, the intersection may be stored as a pattern. - The
avalanche pattern module 304 may assign a score to each of the alert intersections based on the total number of alert intersections identified by the time window comparison for the entire alert history and based on the size of the expanded avalanche window. For example, an intersection score may be determined according to the following equation. -
Int_Score=Ni*(1+Int_size) Equation (3) - where Int_Score is the score for the i'th intersection
Ni is frequency of i'th intersection
Int_size is number of CI, alert type in an intersection
Intersection scores may be defined by variations to the Equation (3) above. - Based on the intersection scores, avalanche patterns may be identified as follows. A list of intersections may be sorted in descending order according to intersection score. This sorted list of intersections may be sequentially processed one intersection at a time using an overlap test. As each intersection is considered, the aggregate set of event types for this avalanche list may be accumulated by taking the union of the event types in the current intersection with the event types in previous intersections that have been added to the pattern list. Given the set of alert types for the current intersection, if the alert types that have already been seen in other intersections from this avalanche make up a sufficiently small share of the current intersection relative to a group overlap percentage threshold parameter, which may be one of
input parameters 312, then that intersection is considered for addition to the pattern list. If the current intersection does not already exist in the pattern list, or is not contained in a pattern already in the pattern list, then this intersection may be added to the pattern list as a pattern. If the pattern list already contains a pattern that is a sub-set of the current intersection, then this intersection may replace that pattern in the list. - The following example illustrates identifying an avalanche pattern according to the above description. For an avalanche window containing event types A-G, the avalanche window may be compared with the other windows, generating the following list of intersections with their scores: (AG, 90), (ABC, 40), (CDG, 20), (DEF, 8). These intersections may be considered in the given order for addition to the pattern list. The first intersection AG passes the overlap test by default, leaving [“AG” ] in the pattern list and A, G in the aggregate set of event types. Next, considering intersection ABC, its overlap with the aggregate set of event types A, G is 1 (i.e., “A”), which yields an aggregate percentage of 33.3% (i.e., A/ABC). If the group overlap percentage threshold is 40%, then the aggregate percentage is below the threshold, and the intersection may be added to the pattern list, which now contains [“AG”, “ABC” ] and the aggregate set of event types becomes A, B, C, G. Next, intersection CDG may be compared to the aggregate set of event types A, B, C, G, yielding an aggregate percentage of 66% (i.e., CD/CDG). Since this aggregate percentage is greater than the group overlap percentage threshold of 40%, intersection CDG is not added to the pattern list. Finally, intersection DEF is considered. Since DEF yields an overlap percentage of 0 compared to the aggregate set of event types, DEF may be added to the pattern list, which then becomes [“AG”, “ABC”, “DEF” ], with an aggregate set of event types A, B, C, D, E, F, G. Note that if ABC had scored higher than AG, then AG would not be included in the pattern list. By developing avalanche patterns according to descending intersection scores for identified intersections of CI's and/or alert metrics, the avalanche pattern can be a useful tool for predicting specific health and status changes to the network resources, as will be described below in greater detail.
- The avalanche
pattern detection module 304 may validate patterns based on configuration management database (CMDB) information. For example, computing devices related to the candidate nodes may be interconnected in a cloud computing system, such ascloud computing system 100 shown inFIG. 1 , and the interconnected dependencies may be tracked and updated in the CMDB. The CI's may also have dependency relationships based on instances of software modules, such as service applications for example, which reside on interconnected servers, such asservers 122 shown inFIG. 1 . Using the CI dependency information from the CMDB, the avalanchepattern detection module 304 may compare the identified avalanche patterns to actual CI dependency information from the CMDB to determine if any patterns are invalid. For example, if one of the identified avalanche patterns consists of CI's that bear no interconnected relationship, then that pattern may be deleted from the avalanche patterns. - Returning to
FIG. 3 , the event pattern detection module 301 may include a conditional probability pattern detection module 314 for performing a pattern detection in parallel with the avalanche pattern module 304. The conditional probability pattern detection module 314 may include a co-occurrence detection module 316, a probabilistic graph module 318, and a parametric graph component module 319. The co-occurrence detection module 316 may determine a number of time windows TW of size WS according to Equation (1) above. Alternatively, the co-occurrence detection module 316 may receive the time window information and time window size WS from the avalanche pattern detection module 304. The co-occurrence detection module 316 may detect co-occurrences of CI pairs or groups in the time windows TW, which may or may not be avalanche windows. For example, as shown in FIGS. 4 and 5 , a pair of alerts in time window 512 relate to CI-26 and CI-27. The same pair of CI's, CI-26 and CI-27, appear in alerts in time window 514. In some applications, the co-occurrence detection module 316 may detect co-occurrences of CI ID, alert metric combination pairs or groups in the time windows TW. -
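Before turning to the probabilistic graphing of FIGS. 6A-6C, the avalanche pattern selection described above — scored intersections processed in descending order through the overlap test — can be sketched as follows. This is a simplified reading of the procedure rather than the disclosure's implementation, but it does reproduce the A-G worked example.

```python
def build_pattern_list(scored_intersections, overlap_threshold_pct):
    """Greedy overlap test over intersections sorted by descending score.

    scored_intersections: list of (alert_type_set, score) tuples.
    overlap_threshold_pct: group overlap percentage threshold (a parameter 312).
    """
    ordered = sorted(scored_intersections, key=lambda x: x[1], reverse=True)
    patterns = []      # accepted patterns (sets of alert types)
    seen = set()       # aggregate set of event types accepted so far
    for alert_types, _score in ordered:
        overlap_pct = 100.0 * len(alert_types & seen) / len(alert_types)
        if overlap_pct >= overlap_threshold_pct:
            continue                      # too much overlap with earlier patterns
        # Drop any existing pattern that is a proper subset of the new intersection.
        patterns = [p for p in patterns if not p < alert_types]
        if not any(alert_types <= p for p in patterns):
            patterns.append(alert_types)
        seen |= alert_types
    return patterns

# Reproduces the worked example above: AG, ABC, DEF are kept; CDG is rejected.
example = [(set("AG"), 90), (set("ABC"), 40), (set("CDG"), 20), (set("DEF"), 8)]
assert build_pattern_list(example, 40.0) == [set("AG"), set("ABC"), set("DEF")]
```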
FIGS. 6A-6C are diagrams of an example conditional probabilistic graphing sequence in accordance with the present disclosure. Theprobabilistic graph module 318 may generate a conditional probabilistic graph, such as theprobabilistic graph components FIG. 6A , based onparameters 312. For example, candidate nodes 601-612 ofcandidate graph component 621 may be identified based onparameters 312 set by a user or system administrator to include a minimum frequency of CI co-occurrences f_CI in thealert history 501. - The conditional
probabilistic graph components probabilistic graph module 318 based on the results of theco-occurrence detection module 316, which may use pairwise probability. For example, each node ongraph components probabilistic graph components node 610 for a CI-26 alert may be determined given an alert B, such as atnode 601 for a CI-27 alert, according to the following equation. -
P(A|B)=N AB /N B Equation (4a) - where
P(A|B) is probability of A given B
A is alert A
B is alert B
NAB is number of time windows TW in which both alert A and B appear
NB is number of time windows TW in which alert B appears
Similarly, the probability of an alert B, given an alert A, may be determined according to the following equation: -
P(B|A)=N AB /N A Equation (4b) - where
- NA is number of time windows TW in which alert A appears
In some applications, the probability P(A|B) may not be calculated if the number of A alerts or B alerts is less than the minimum frequency parameter value f_CI, and nodes for such low frequency alerts are omitted from the graph component. - The
probabilistic graph components probabilistic graph module 318 based on parameters adjusted byparametric graph module 319. For example,probabilistic graph components edges 631 to the graph component if one of the pairwise conditional probabilities P(A|B) or P(B|A) exceeds an initial threshold CP_in, which may be one ofparameters 312. For example,graph component 622 may be generated where alerts represented by nodes 613-615 have pairwise probability of 0.25 as shown by theedge value 631, which exceeds an initial threshold CP_in value of 0.04. -
FIG. 6B shows an example of conditionalprobability graph components input parameters 312. In this example, the edge count parameter EC is set to EC=3 based on a determination that graph components having three edges provide optimum alert correlation. Sincegraph component 622 has three edges,only graph component 621 may be considered for reduction into subgraph components according to the following iterative process. A conditional probability threshold CP_in may be set to CP_in=0.2. Any edges having a value less than 0.2 are removed, which results in a twosubgraph components original graph component 621 shown inFIG. 6A . Sincesubgraph component 623 has an edge count exceeding edge count parameter EC=3, further reduction of the graph component is achieved as shown inFIG. 6C . -
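One way to realize the iterative reduction illustrated in FIGS. 6B and 6C — raise the conditional probability threshold in fixed increments and split the surviving edges into connected components until every component satisfies the edge count parameter EC — is sketched below; the graph representation and function name are assumptions, and the example of FIG. 6C continues in the next paragraph.

```python
def split_until_critical(edges, ec, cp_in=0.2, cp_step=0.1):
    """Raise the conditional probability threshold stepwise and split graph
    components until each one satisfies the edge count parameter EC.

    edges: dict mapping (node_a, node_b) -> edge probability.
    """
    def components(edge_dict):
        adj = {}
        for a, b in edge_dict:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
        remaining, comps = set(adj), []
        while remaining:
            stack, nodes = [next(iter(remaining))], set()
            while stack:
                n = stack.pop()
                if n not in nodes:
                    nodes.add(n)
                    stack.extend(adj[n] - nodes)
            remaining -= nodes
            comps.append({e: p for e, p in edge_dict.items() if e[0] in nodes})
        return comps

    finished, pending = [], [(edges, cp_in)]
    while pending:
        comp, cp = pending.pop()
        if len(comp) <= ec:
            finished.append(comp)          # critical size: a candidate pattern
            continue
        kept = {e: p for e, p in comp.items() if p >= cp}
        # Split what survives and retry each piece with a higher threshold.
        pending.extend((c, cp + cp_step) for c in components(kept))
    return finished
```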
FIG. 6C shows an example of subgraph components 625-627 formed by eliminating edges of subgraph component 623 after raising the conditional probability threshold to 0.3. With the remaining graph components 622, 624-627 satisfying the edge count parameter EC=3, these graphs may be considered as having a critical size for conditional probability, and as such are potential patterns and may be stored by the conditional probability pattern detection module 314. While the above example applied iterations of conditional probability threshold increments of 0.1, other fixed-size or variable-size increments may be applied to accommodate different edge count EC thresholds and to approach the critical-size graph components more or less rapidly. - The conditional probability
pattern detection module 314 may validate patterns based on an alert coverage parameter taken fromparameters 312. For example, a percentage of alerts inalert history 501 that appear in a pattern from probabilistic graph components can be determined, and if the percentage is greater than the alert coverage parameter, the pattern is stored as a conditional probability pattern. If one or more patterns fail to meet the alert coverage parameter, the probabilistic graph components can be reformed based on adjustment to theparameters 312. - The conditional probability
pattern detection module 314 may validate patterns based on CMDB information. For example, computing devices related to the candidate nodes may be interconnected in a cloud computing system, such ascloud computing system 100 shown inFIG. 1 , and the interconnected dependencies may be tracked and updated in the CMDB. The CI's may also have dependency relationships based on instances of software modules, such as service applications for example, which reside on interconnected servers, such asservers 122 shown inFIG. 1 . Using the CI dependency information from the CMDB, theprobabilistic graph module 318 may compare the graph components, such asgraph components 622, 624-627 shown inFIG. 6C , to actual CI dependency information from the CMDB to determine if any patterns are invalid. For example, if one of the graph components consists of CI's that bear no interconnected relationship, then a pattern based on that graph component may be deleted from the candidate patterns. - Returning to
FIG. 3 , apattern merging module 320 may merge the patterns determined by theavalanche pattern module 304 with the patterns determined by the conditional probabilitypattern detection module 314. In some applications, thepattern merging module 320 may combine the avalanche patterns with the conditional probability patterns and store the union of patterns as learned patterns from the eventpattern detection module 301. For example, the merged pattern information may be stored in alert tables with pattern IDs, each row consisting of a CI_ID, alert metric combination and pattern ID. In some applications, the merged pattern information may be stored as a hash map with key as pattern ID and value as set of entities representing the pattern. - An
alert grouping module 322 may perform a matching function to compare analert stream 332 to the stored patterns and assign an alert to an alert group if there is a match to one or more patterns. In some applications, thealert grouping module 322 may receive a current alert fromalert stream 332 and compare the CI and/or the alert metric for the current alert to the learned patterns stored by thepattern merging module 320, and determine which one or more patterns include the same CI and/or alert metric. - An alert group may be formed by applying a sliding window. For example, a ten minute window may be defined according to
parameters 312. The matching process starts with a first alert group AB, which may be kept active for a maximum group lifetime, which may be an input parameter 312 based on a fixed time window, such as, for example, a ten minute window. For example, the alert group AB may be compared to all alerts received from the alert stream 332 in the past ten minute window. The alert group AB may include a list of associated pattern IDs. If no match is made to pattern IDs associated with the alerts monitored in the alert stream, then the lifetime of alert group AB expires, and no further comparisons are made to that alert group. If a pattern ID match is made for a current alert, such as an alert C, then alert C is grouped with alert group AB to form alert group ABC. For example, if alert C includes an alert type found within alert group AB, then alert C is added to the group. The list of pattern IDs for the group AB may be updated by keeping only those pattern IDs that contain C. The time is maintained for the updated list of pattern IDs whenever any new alert is grouped using that list. If the time window elapses and the pattern IDs in that list have not been used to group alerts, the alert group may be finalized and no further comparison for future alerts is made.
- A
presentation module 324 may generate a display region for displaying the alert groups for monitoring by a user or system administrator. -
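A minimal sketch of the sliding-window grouping described above is given below, assuming that learned patterns are stored as sets of (CI ID, alert metric) alert types keyed by pattern ID and that each group tracks its candidate pattern IDs and timestamps; the exact bookkeeping in the disclosure may differ.

```python
import time

def group_alert(alert, active_groups, patterns, window_s=600, max_lifetime_s=600):
    """Assign one incoming alert to an active group whose candidate patterns contain it.

    alert:         (ci_id, metric) tuple for the current alert.
    active_groups: list of dicts with keys "alert_types", "pattern_ids",
                   "first_time" and "latest_time".
    patterns:      dict mapping pattern_id -> set of (ci_id, metric) alert types.
    """
    now = time.time()
    # Expire groups that have been idle too long or exceeded their lifetime.
    active_groups[:] = [g for g in active_groups
                        if now - g["latest_time"] <= window_s
                        and now - g["first_time"] <= max_lifetime_s]
    matching_ids = {pid for pid, types in patterns.items() if alert in types}
    for group in active_groups:
        shared = group["pattern_ids"] & matching_ids
        if shared:
            group["alert_types"].add(alert)
            group["pattern_ids"] = shared      # keep only patterns that contain the alert
            group["latest_time"] = now
            return group
    if matching_ids:                           # start a new group for this alert
        group = {"alert_types": {alert}, "pattern_ids": matching_ids,
                 "first_time": now, "latest_time": now}
        active_groups.append(group)
        return group
    return None                                # ungrouped alert
```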
FIG. 7 is a diagram of an example display region generated for displaying alert groups in accordance with the present disclosure. Thedisplay region 702 may include a presentation of thealert groups 703, theseverity type 704 of the alerts in the alert group, the severity ofrelated alerts 705, and a count of impactedservices 706. For example, analert group 703, such asalert group 5, may have acritical severity 704, andrelated alerts 705, such as 1 related alert having critical severity, 0 related alerts having major or minor severity, and 1 impactedservice 706. Threeseverity types 704 are shown inFIG. 7 for illustrative purposes, however there may be more or less severity types. Thealert groups 703 may be sorted byseverity 704 as shown inFIG. 7 . Other presentations of thealert groups 703 may be presented by sorting according to other information shown inFIG. 7 . - Alert groups determined by
alert grouping module 322 may be evaluated based on a compression rate parameter and an alert coverage parameter fromparameters 312. For example, the compression rate may be determined according to the following equation. -
Comp=1−(Ngrp+Rem)/N TOT Equation (5) - where
-
- Ngrp is the number of alert groups
- Rem is the number of remaining ungrouped alerts
- NTOT is the number of total alerts
For example, a compression rate of 70% may be determined, meaning that the alert aggregation may reduce the number of items presented to 30% of the raw alert count. Accordingly, the alert pattern detection and alert grouping is a useful tool to enable a user or system administrator to more efficiently manage alert dispositions by reducing the number of alerts, which may have been inflated due to redundancy of alerts. The compression rate may be compared to a compression parameter to determine if the number of alert groups is satisfactory.
- The alert coverage may be determined according to the following equation.
-
Acov=N_Agrpd/N TOT Equation (6) - where
-
- N_Agrpd is the number of alerts assigned to a group
- NTOT is the number of total alerts
For example, an alert coverage of 50% may be determined and compared to an alert coverage parameter to determine if the alert groups are satisfactory. If the alert coverage is insufficient, the alert groups may be modified, such as by adding new alert groups to capture more alerts that were omitted from the alert aggregation.
-
FIG. 8 is a diagram of an example display region generated for enabling user feedback and supervision of alert grouping in accordance with the present disclosure. In some applications, if the compression rate Comp or alert coverage Acov values do not meet the parameter thresholds, the user or system administrator may delete a group using a graphical user interface, such as thedelete alert button 809. Thedisplay region 802 may include various alert information for an alert group 803, as shown inFIG. 8 . For example, impacted services may be visually accessed by clicking on the impactedservices button 804.Feedback 806 from the user may be submitted to a system administrator regarding whether the alert group is representative or related to new alerts. For example, due to irregularities in the pattern detection, some alert groups may be determined to be invalid, in which case the user has the ability to delete the alert group usinginterface button 809. Thedisplay region 802 may also include thealert ID 810, severity of each alert 812, theCI_ID 814, themetric name 816, or a combination thereof, for the alerts in the displayed alert group. In some applications, if an erroneous group repeatedly appears, the user or system administrator may set a rule to prevent particular patterns from being developed by the avalanchepattern detection module 304 or the conditional probabilitypattern detection module 314. In some applications, the user or system administrator may define patterns and can set a high priority of those patterns if so desired. For instance, a particular alert or entity may be flagged as significant or critical and an associated pattern may be assigned a high priority. -
FIG. 9 is a flowchart of an example method for aggregating alerts for management of computer system alerts in accordance with this disclosure. Grouping alerts generated by automated monitoring of at least an operating condition change of a machine in a computer network in response to an event, may be implemented in a computing system, such as thecloud computing system 100 shown inFIG. 1 . For example, grouping alerts may be implemented on a server, such as one or more of theservers 122 shown inFIG. 1 , a computing device, such as aclient 112 shown inFIG. 1 , or by a combination thereof communicating via a network, such as thenetwork 130 shown inFIG. 1 . - Grouping alerts may include obtaining historical alert data at 902, identifying event patterns associated with avalanches at 908, identifying event patterns based on conditional probability at 918, merging patterns at 920, matching alerts to patterns at 922, assigning current alert to alert group(s) at 924, generating graphical display regions for alert groups at 926, or a combination thereof.
- In an implementation, event
pattern detection module 301 obtains available historical alert data that may be stored in a data storage unit. The historical alert data may include information for each alert, including an alert ID, a time stamp, a configuration item ID, an alert metric, or a combination thereof. The alert data may be associated with alerts received in recent hours, days, weeks or months forcomputing system 100. - Avalanche patterns may identified at steps 904-908. At 904, time windows may be defined for avalanche detection based on a fixed window size determined by a parameter C1 and inter-arrival time of alerts. Avalanche windows may be determined at 906 based on an avalanche threshold and a parameter C2. At 906, avalanche windows may be determined based on an alert count for a time window meeting or exceeding the avalanche threshold. At 908, event patterns associated with avalanches may be identified based on intersections of avalanche alerts with alerts in each other time window. Avalanche patterns may be based on intersections that have an intersection score that meets or exceeds an avalanche pattern threshold. Intersection scores may be determined based on number of intersections and number of avalanche window alerts.
- Conditional probability patterns may be identified at steps 914-918. At 914, probabilistic graph candidates may be determined based on co-occurrences of alert information in the time windows meeting or exceeding a parametric threshold. At 916, a probabilistic graph maybe generated using the probabilistic graph candidates having a probability that satisfies a threshold. Event patterns may be identified based on conditional probability determined from the probabilistic graph, at 918, where conditional probability is based on co-occurrences of CI's for alerts in two or more time windows. The conditional probability may also be supervised by rule based parameters set by a user.
- At 920, event patterns identified by avalanche detection may be merged with event patterns identified by conditional probability and stored for alert grouping. At 922, an alert stream is monitored and each alert is compared to the stored patterns. A current alert may be assigned to one or more groups for each match found to a stored pattern at 924.
- A graphical display region may be generated for displaying of alert groups at 926 based on the alert groups identified at 924. The graphical display region can include, for example, information relating to the alert groups for monitoring by a user or system administrator, for management of computer system alerts and to enable user feedback and supervision. A graphical display region may be generated in response to a user intervention, such as interface with a graphical user interface. However, in certain circumstances and implementations, the graphical display region may be triggered automatically.
- The alert groups displayed in
step 926 can include information about associated alerts, including severity type, related alerts and impacted services.Steps 922 through 926 can be repeated multiple times over numerous client instances, server instances, or a combination of both as alerts from an alert stream are received. The alert groups can be ordered in priority order based on severity type. - Steps 902-926 may be performed periodically. For example, a task can be scheduled on an hourly, daily, or weekly basis during which the steps are performed. The steps can be performed on the same or different periodic schedules for each of the database server instances in the cloud infrastructure, such as by physical server or datacenter. Certain database server instances or physical servers may not be included based on user configuration. Upon each iteration, the graphical display regions generated by
step 926 can be updated and/or regenerated. - Some or all of the steps of
FIG. 9 can be implemented in a pattern detection module and alert grouping module. In one implementation, the pattern detection module and alert grouping module can take the form of one or more Java classes with executable or human-readable code for performing some or all of the steps 902-924 described above. The pattern detection module and alert grouping module can, for example, be located on one or more servers used to manage other servers (management servers) in the cloud computing system, including but not limited toservers 122. The management servers can, for example, include the same or similar platform application and included on some ofservers 122. In one implementation, the one or more Java classes can be plugged into or connected to an instance or installation of the platform application to extend the platform functionality to include the functionality of the pattern detection module and alert grouping module. In an implementation, functionality of the pattern detection module and alert grouping module may be accessed via the platform, for example, by using script calls stored in an associated database that are configured to invoke the desired functionality. In one example, the platform can be configured to periodically execute techniques similar to steps 902-924 included in the pattern detection module and alert grouping module without user intervention. In another example, the graphical display regions generated by thepresentation module 324 can include one or more links or buttons that when clicked cause the platform to execute other platform functionality for invoking a move operation for an associated database server instance. -
Input parameters 312 used in any of the above embodiments may be based on various types of information, included but not limited to value-based information, event-based information, environment-based information, or a combination thereof. For example, value-based information may include business models, service catalog information, customer impact feeds information, or the like. As another example, events-based information may include change management information, alerts, incidents, or the like. As another example, environment-based information may include configuration management database (CMDB) information, business rules, workflows, or the like. All or a portion of implementations of the invention described herein can be implemented using a general purpose computer/processor with a computer program that, when executed, carries out any of the respective techniques, algorithms and/or instructions described herein. In addition, or alternatively, for example, a special purpose computer/processor can be utilized which can contain specialized hardware for carrying out any of the techniques, algorithms, or instructions described herein. - The implementations of computing devices as described herein (and the algorithms, methods, instructions, etc., stored thereon and/or executed thereby) can be realized in hardware, software, or any combination thereof. The hardware can include, for example, computers, intellectual property (IP) cores, application-specific integrated circuits (ASICs), programmable logic arrays, optical processors, programmable logic controllers, microcode, microcontrollers, servers, microprocessors, digital signal processors or any other suitable circuit. In the claims, the term “processor” should be understood as encompassing any of the foregoing hardware, either singly or in combination.
- For example, one or more computing devices can include an ASIC or programmable logic array such as a field-programmable gate array (FPGA) configured as a special-purpose processor to perform one or more of the operations or operations described or claimed herein. An example FPGA can include a collection of logic blocks and random access memory (RAM) blocks that can be individually configured and/or configurably interconnected in order to cause the FPGA to perform certain functions. Certain FPGA's may contain other general or special purpose blocks as well. An example FPGA can be programmed based on a hardware definition language (HDL) design, such as VHSIC Hardware Description Language or Verilog.
- The embodiments herein may be described in terms of functional block components and various processing steps. Such functional blocks may be realized by any number of hardware and/or software components that perform the specified functions. For example, the described embodiments may employ various integrated circuit components, e.g., memory elements, processing elements, logic elements, look-up tables, and the like, which may carry out a variety of functions under the control of one or more microprocessors or other control devices. Similarly, where the elements of the described embodiments are implemented using software programming or software elements the invention may be implemented with any programming or scripting language such as C, C++, Java, assembler, or the like, with the various algorithms being implemented with any combination of data structures, objects, processes, routines or other programming elements. Functional implementations may be implemented in algorithms that execute on one or more processors. Furthermore, the embodiments of the invention could employ any number of conventional techniques for electronics configuration, signal processing and/or control, data processing and the like. The words “mechanism” and “element” are used broadly and are not limited to mechanical or physical embodiments, but can include software routines in conjunction with processors, etc.
- Implementations or portions of implementations of the above disclosure can take the form of a computer program product accessible from, for example, a computer-usable or computer-readable medium. A computer-usable or computer-readable medium can be any device that can, for example, tangibly contain, store, communicate, or transport a program or data structure for use by or in connection with any processor. The medium can be, for example, an electronic, magnetic, optical, electromagnetic, or a semiconductor device. Other suitable mediums are also available. Such computer-usable or computer-readable media can be referred to as non-transitory memory or media, and may include RAM or other volatile memory or storage devices that may change over time. A memory of an apparatus described herein, unless otherwise specified, does not have to be physically contained by the apparatus, but is one that can be accessed remotely by the apparatus, and does not have to be contiguous with other memory that might be physically contained by the apparatus.
- The word “example” is used herein to mean serving as an example, instance, or illustration. Any implementation or design described herein as “example” is not necessarily to be construed as preferred or advantageous over other implementations or designs. Rather, use of the word “example” is intended to present concepts in a concrete fashion. As used in this application, the term “or” is intended to mean an inclusive “or” rather than an exclusive “or”. That is, unless specified otherwise, or clear from context, “X includes A or B” is intended to mean any of the natural inclusive permutations. In other words, if X includes A; X includes B; or X includes both A and B, then “X includes A or B” is satisfied under any of the foregoing instances. In addition, the articles “a” and “an” as used in this application and the appended claims should generally be construed to mean “one or more” unless specified otherwise or clear from context to be directed to a singular form. Moreover, use of the term “an implementation” or “one implementation” throughout is not intended to mean the same embodiment or implementation unless described as such.
- The particular implementations shown and described herein are illustrative examples of the invention and are not intended to otherwise limit the scope of the invention in any way. For the sake of brevity, conventional electronics, control systems, software development and other functional implementations of the systems (and components of the individual operating components of the systems) may not be described in detail. Furthermore, the connecting lines, or connectors shown in the various figures presented are intended to represent exemplary functional relationships and/or physical or logical couplings between the various elements. Many alternative or additional functional relationships, physical connections or logical connections may be present in a practical device. Moreover, no item or component is essential to the practice of the invention unless the element is specifically described as “essential” or “critical”.
- The use of “including” or “having” and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as additional items. Unless specified or limited otherwise, the terms “mounted,” “connected,” ‘supported,” and “coupled” and variations thereof are used broadly and encompass both direct and indirect mountings, connections, supports, and couplings. Further, “connected” and “coupled” are not restricted to physical or mechanical connections or couplings.
- The use of the terms “a” and “an” and “the” and similar referents in the context of describing the invention (especially in the context of the following claims) should be construed to cover both the singular and the plural. Furthermore, recitation of ranges of values herein are merely intended to serve as a shorthand method of referring individually to each separate value falling within the range, unless otherwise indicated herein, and each separate value is incorporated into the specification as if it were individually recited herein. Finally, the steps of all methods described herein are performable in any suitable order unless otherwise indicated herein or otherwise clearly contradicted by context. The use of any and all examples, or exemplary language (e.g., “such as”) provided herein, is intended merely to better illuminate the invention and does not pose a limitation on the scope of the invention unless otherwise claimed.
- All references, including publications, patent applications, and patents, cited herein are hereby incorporated by reference to the same extent as if each reference were individually and specifically indicated as incorporated by reference and were set forth in its entirety herein.
- The above-described embodiments have been described in order to allow easy understanding of the present invention and do not limit the present invention. To the contrary, the invention is intended to cover various modifications and equivalent arrangements included within the scope of the appended claims, which scope is to be accorded the broadest interpretation so as to encompass all such modifications and equivalent structure as is permitted under the law.
Claims (21)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/574,999 US20200084086A1 (en) | 2016-04-28 | 2019-09-18 | Management of computing system alerts |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US15/141,395 US10469309B1 (en) | 2016-04-28 | 2016-04-28 | Management of computing system alerts |
US16/574,999 US20200084086A1 (en) | 2016-04-28 | 2019-09-18 | Management of computing system alerts |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/141,395 Continuation US10469309B1 (en) | 2016-04-28 | 2016-04-28 | Management of computing system alerts |
Publications (1)
Publication Number | Publication Date |
---|---|
US20200084086A1 true US20200084086A1 (en) | 2020-03-12 |
Family
ID=68391982
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/141,395 Active 2036-08-30 US10469309B1 (en) | 2016-04-28 | 2016-04-28 | Management of computing system alerts |
US16/574,999 Abandoned US20200084086A1 (en) | 2016-04-28 | 2019-09-18 | Management of computing system alerts |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US15/141,395 Active 2036-08-30 US10469309B1 (en) | 2016-04-28 | 2016-04-28 | Management of computing system alerts |
Country Status (1)
Country | Link |
---|---|
US (2) | US10469309B1 (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11153144B2 (en) * | 2018-12-06 | 2021-10-19 | Infosys Limited | System and method of automated fault correction in a network environment |
US11269706B2 (en) * | 2020-07-15 | 2022-03-08 | Beijing Wodong Tianjun Information Technology Co., Ltd. | System and method for alarm correlation and aggregation in IT monitoring |
US11347755B2 (en) * | 2018-10-11 | 2022-05-31 | International Business Machines Corporation | Determining causes of events in data |
US11477077B1 (en) * | 2019-10-30 | 2022-10-18 | United Services Automobile Association (Usaa) | Change management system with monitoring, alerting, and trending for information technology environment |
WO2023154854A1 (en) * | 2022-02-14 | 2023-08-17 | Cribl, Inc. | Edge-based data collection system for an observability pipeline system |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2015142765A1 (en) | 2014-03-17 | 2015-09-24 | Coinbase, Inc | Bitcoin host computer system |
US9735958B2 (en) | 2015-05-19 | 2017-08-15 | Coinbase, Inc. | Key ceremony of a security system forming part of a host computer for cryptographic transactions |
WO2018097242A1 (en) * | 2016-11-25 | 2018-05-31 | 国立大学法人筑波大学 | Networking system |
US11328574B2 (en) * | 2017-04-03 | 2022-05-10 | Honeywell International Inc. | Alarm and notification generation devices, methods, and systems |
EP3782105A4 (en) * | 2018-04-17 | 2021-12-22 | Coinbase Inc. | Offline storage system and method of use |
US11394543B2 (en) | 2018-12-13 | 2022-07-19 | Coinbase, Inc. | System and method for secure sensitive data storage and recovery |
US11500874B2 (en) | 2019-01-23 | 2022-11-15 | Servicenow, Inc. | Systems and methods for linking metric data to resources |
EP3761561B1 (en) * | 2019-07-03 | 2022-09-14 | Hewlett Packard Enterprise Development LP | Self-learning correlation of network patterns for agile network operations |
US10903991B1 (en) | 2019-08-01 | 2021-01-26 | Coinbase, Inc. | Systems and methods for generating signatures |
WO2021076868A1 (en) * | 2019-10-16 | 2021-04-22 | Coinbase, Inc. | Systems and methods for re-using cold storage keys |
US12057996B2 (en) * | 2020-09-14 | 2024-08-06 | Nippon Telegraph And Telephone Corporation | Combination rules creation device, method and program |
US11296926B1 (en) * | 2021-01-07 | 2022-04-05 | Servicenow, Inc. | Systems and methods for ranked visualization of events |
US11700192B2 (en) * | 2021-06-30 | 2023-07-11 | Atlassian Pty Ltd | Apparatuses, methods, and computer program products for improved structured event-based data observability |
US20230289035A1 (en) * | 2022-03-09 | 2023-09-14 | Dennis Garlick | System and Method for Visual Data Reporting |
US12068907B1 (en) * | 2023-01-31 | 2024-08-20 | PagerDuty, Inc. | Service dependencies based on relationship network graph |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7079157B2 (en) * | 2000-03-17 | 2006-07-18 | Sun Microsystems, Inc. | Matching the edges of multiple overlapping screen images |
US7076298B2 (en) * | 2002-06-14 | 2006-07-11 | Medtronic, Inc. | Method and apparatus for prevention of arrhythmia clusters using overdrive pacing |
US7301448B1 (en) | 2004-04-30 | 2007-11-27 | Sprint Communications Company L.P. | Method and system for deduplicating status indications in a communications network |
CA2666509C (en) * | 2006-10-16 | 2017-05-09 | Hospira, Inc. | System and method for comparing and utilizing activity information and configuration information from multiple medical device management systems |
EP2296703A4 (en) * | 2008-05-09 | 2012-09-05 | Dyax Corp | Igf-ii/gf-iie binding proteins |
CN103562863A (en) * | 2011-04-04 | 2014-02-05 | 惠普发展公司,有限责任合伙企业 | Creating a correlation rule defining a relationship between event types |
US9213590B2 (en) * | 2012-06-27 | 2015-12-15 | Brocade Communications Systems, Inc. | Network monitoring and diagnostics |
US9569076B2 (en) * | 2014-01-15 | 2017-02-14 | Accenture Global Services Limited | Systems and methods for configuring tiles in a user interface |
US9790834B2 (en) * | 2014-03-20 | 2017-10-17 | General Electric Company | Method of monitoring for combustion anomalies in a gas turbomachine and a gas turbomachine including a combustion anomaly detection system |
US10180867B2 (en) * | 2014-06-11 | 2019-01-15 | Leviathan Security Group, Inc. | System and method for bruteforce intrusion detection |
- 2016-04-28: US application 15/141,395 filed; granted as US10469309B1 (status: Active)
- 2019-09-18: US application 16/574,999 filed; published as US20200084086A1 (status: Abandoned)
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11347755B2 (en) * | 2018-10-11 | 2022-05-31 | International Business Machines Corporation | Determining causes of events in data |
US11354320B2 (en) * | 2018-10-11 | 2022-06-07 | International Business Machines Corporation | Determining causes of events in data |
US11153144B2 (en) * | 2018-12-06 | 2021-10-19 | Infosys Limited | System and method of automated fault correction in a network environment |
US11477077B1 (en) * | 2019-10-30 | 2022-10-18 | United Services Automobile Association (Usaa) | Change management system with monitoring, alerting, and trending for information technology environment |
US11777801B1 (en) | 2019-10-30 | 2023-10-03 | United Services Automobile Association (Usaa) | Change management system with monitoring, alerting, and trending for information technology environment |
US11269706B2 (en) * | 2020-07-15 | 2022-03-08 | Beijing Wodong Tianjun Information Technology Co., Ltd. | System and method for alarm correlation and aggregation in IT monitoring |
WO2023154854A1 (en) * | 2022-02-14 | 2023-08-17 | Cribl, Inc. | Edge-based data collection system for an observability pipeline system |
US11921602B2 (en) | 2022-02-14 | 2024-03-05 | Cribl, Inc. | Edge-based data collection system for an observability pipeline system |
Also Published As
Publication number | Publication date |
---|---|
US10469309B1 (en) | 2019-11-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20200084086A1 (en) | Management of computing system alerts | |
US11082285B2 (en) | Network event grouping | |
US11954568B2 (en) | Root cause discovery engine | |
US11847130B2 (en) | Extract, transform, load monitoring platform | |
CN107766568B (en) | Efficient query processing using histograms in columnar databases | |
US10318366B2 (en) | System and method for relationship based root cause recommendation | |
US10860406B2 (en) | Information processing device and monitoring method | |
US11163747B2 (en) | Time series data forecasting | |
CN108322320B (en) | Service survivability analysis method and device | |
US11017331B2 (en) | Method and system for predicting demand for vehicles | |
US11991154B2 (en) | System and method for fingerprint-based network mapping of cyber-physical assets | |
US20210014102A1 (en) | Reinforced machine learning tool for anomaly detection | |
US20210366268A1 (en) | Automatic tuning of incident noise | |
JP6217644B2 (en) | Rule distribution server, event processing system, method and program | |
US11934972B2 (en) | Configuration assessment based on inventory | |
Hemmat et al. | SLA violation prediction in cloud computing: A machine learning perspective | |
Qi et al. | A cloud-based triage log analysis and recovery framework | |
US10819604B2 (en) | Change to availability mapping | |
US20240291718A1 (en) | Predictive Analytics For Network Topology Subsets | |
CN116821120A (en) | Device management method, device management apparatus, device and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE AFTER FINAL ACTION FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | Free format text: ADVISORY ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: DOCKETED NEW CASE - READY FOR EXAMINATION |
| STPP | Information on status: patent application and granting procedure in general | Free format text: NON FINAL ACTION MAILED |
| STPP | Information on status: patent application and granting procedure in general | Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
| STPP | Information on status: patent application and granting procedure in general | Free format text: FINAL REJECTION MAILED |
| STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |