CN113722134A - Cluster fault processing method, device and equipment and readable storage medium - Google Patents

Cluster fault processing method, device and equipment and readable storage medium

Info

Publication number
CN113722134A
Authority
CN
China
Prior art keywords
fault
cluster
target
program
parameter value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202110866089.XA
Other languages
Chinese (zh)
Inventor
武鹏
颜秉珩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Electronic Information Industry Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd filed Critical Inspur Electronic Information Industry Co Ltd
Priority to CN202110866089.XA priority Critical patent/CN113722134A/en
Publication of CN113722134A publication Critical patent/CN113722134A/en
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/0709Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a distributed system consisting of a plurality of standalone computer nodes, e.g. clusters, client-server systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00Error detection; Error correction; Monitoring
    • G06F11/07Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a cluster fault processing method, a cluster fault processing device, electronic equipment and a computer-readable storage medium. The method comprises: performing state detection on a target cluster according to a check item list to obtain state parameter values; performing fault identification on the state parameter values; if a fault parameter value is identified, selecting a corresponding target repair program according to the fault parameter type of the fault parameter value; and performing fault processing using the target repair program. Fault repair programs for different types of faults are preset, and once a fault parameter value is detected the corresponding target repair program is selected according to the fault parameter type, so that cluster faults can be repaired automatically and fault repair efficiency is improved.

Description

Cluster fault processing method, device and equipment and readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a cluster fault processing method, a cluster fault processing apparatus, an electronic device, and a computer-readable storage medium.
Background
Ambari is a Web-based configuration management platform for distributed clusters of Hadoop (a distributed system infrastructure developed by the Apache foundation). It supports automatic installation, management, operation and maintenance of Apache Hadoop big data components, and users can install and use these components through the platform's graphical interface. Ambari provides its own cluster health functions, including alarms that reflect whether an operational indicator of a machine or service exceeds an alarm threshold. However, detected alarms or faults still require manual troubleshooting, which is inefficient.
Therefore, how to solve the problem of low failure handling efficiency in the related art is a technical problem to be solved by those skilled in the art.
Disclosure of Invention
In view of this, an object of the present application is to provide a cluster fault processing method, a cluster fault processing apparatus, an electronic device, and a computer readable storage medium, which can automatically complete the repair of a cluster fault and improve the fault repair efficiency.
In order to solve the above technical problem, the present application provides a cluster fault processing method, including:
according to the check item list, carrying out state detection on the target cluster to obtain a state parameter value;
carrying out fault identification on the state parameter values;
if a fault parameter value is identified, selecting a corresponding target repair program according to the fault parameter type of the fault parameter value;
and utilizing the target repair program to carry out fault treatment.
Optionally, the selecting a corresponding target repair program according to the fault parameter type of the fault parameter value includes:
acquiring a corresponding relation between the parameter type and the fault repairing program;
and determining the target repairing program corresponding to the fault parameter type from a plurality of fault repairing programs based on the corresponding relation.
Optionally, the method further comprises:
counting the occurrence frequency of each fault parameter type;
acquiring operation parameters acquired when the high-frequency parameter type appears; the occurrence frequency of the high-frequency parameter type is greater than the occurrence frequency of the non-high-frequency parameter type;
inputting each operation parameter into a cause identification model to obtain a plurality of fault causes corresponding to a plurality of high-frequency parameter types respectively;
and updating each target relationship corresponding to each high-frequency parameter type in the corresponding relationship based on the fault cause.
Optionally, the updating, based on the fault cause, a target relationship corresponding to the high-frequency parameter type in the correspondence relationship includes:
replacing the fault repairing program in the target relationship with the fault repairing program that matches the target fault cause.
Optionally, the selecting a corresponding target repair program according to the fault parameter type of the fault parameter value includes:
acquiring target operation parameters acquired when the fault parameter type occurs;
inputting the target operation parameters into a cause identification model to obtain candidate fault causes;
and determining the fault repairing program corresponding to the candidate fault cause as the target repairing program.
Optionally, the performing fault identification on the status parameter value includes:
comparing each state parameter value with the abnormal interval of the corresponding detection item;
and if the state parameter value lies within the abnormal interval, determining the state parameter value as a fault parameter value.
Optionally, the method further comprises:
and generating a report based on the fault parameter value and/or the fault parameter type, and outputting the report in a preset mode.
The present application further provides a cluster fault processing apparatus, including:
the state detection module is used for carrying out state detection on the target cluster according to the check item list to obtain a state parameter value;
the fault identification module is used for carrying out fault identification on the state parameter values;
the program selection module is used for selecting a corresponding target repair program according to the fault parameter type of the fault parameter value if the fault parameter value is identified;
and the fault processing module is used for processing faults by using the target repairing program.
The present application further provides an electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the cluster fault handling method.
The present application further provides a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the above-mentioned cluster failure handling method.
According to the cluster fault processing method provided by the present application, state detection is performed on a target cluster according to a check item list to obtain a state parameter value; fault identification is performed on the state parameter values; if a fault parameter value is identified, a corresponding target repair program is selected according to the fault parameter type of the fault parameter value; and fault processing is performed using the target repair program.
Therefore, after the state parameter value is obtained by state detection, the method carries out fault identification on the state parameter value and judges whether the state parameter value is normal or not. If the fault parameter value is identified, the cluster is indicated to have a fault, and the fault is expressed by the state parameter value. Each state parameter value has a corresponding parameter type, different parameter types represent cluster states from different angles, and different types of fault parameter values indicate different types of faults of the clusters. In order to perform automatic fault repair, a plurality of fault repair programs are preset, and the fault repair programs respectively correspond to different types of faults, namely different parameter types. After the fault parameter value occurs, the cluster is indicated to have a fault related to the fault parameter type, so that a target repair program corresponding to the fault parameter value is selected and further used for fault processing. By presetting fault repairing programs for repairing different types of faults and selecting a corresponding target repairing program for fault treatment according to the types of the fault parameters after the fault parameter values are detected, the repairing of cluster faults can be automatically completed, the fault repairing efficiency is improved, and the problem of low fault treatment efficiency in the related technology is solved.
In addition, the application also provides a cluster fault processing device, electronic equipment and a computer readable storage medium, and the cluster fault processing device, the electronic equipment and the computer readable storage medium also have the beneficial effects.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or related technologies of the present application, the drawings needed to be used in the description of the embodiments or related technologies are briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a cluster fault processing method provided in an embodiment of the present application;
Fig. 2 is a flowchart of a specific cluster fault handling method provided in an embodiment of the present application;
Fig. 3 is a structural diagram of a specific health inspection system according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a cluster fault processing apparatus according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, fig. 1 is a flowchart of a cluster fault processing method according to an embodiment of the present disclosure.
The method comprises the following steps:
S101: Perform state detection on the target cluster according to the check item list to obtain a state parameter value.
The check item list specifies how state detection is performed and records a plurality of check items. A check item is information that specifies the object and content of a detection, such as host memory usage, cluster disk usage, or network latency. A cluster is composed of a plurality of nodes (also called hosts) and provides services externally as a whole, so the check items may cover the cluster, the hosts, the services, the network, and the like. In a preferred embodiment, all check items that can be performed on the target cluster are recorded in the check item list, so that the state of the whole cluster is detected as comprehensively as possible. The content of the check item list may be fixed, that is, the number and content of the check items are the same at every detection. Alternatively, the content of the check item list may be variable and adjusted according to actual needs; specifically, check items may be added, deleted, or modified.
Different check items correspond to different detection modes. This embodiment does not limit the specific detection mode, which may be any available mode, such as reading or triggering. After the check item list is determined, the corresponding state detection is performed on the target cluster according to the list to obtain state parameter values. A state parameter value is the detection result obtained from state detection; for different check items, the parameter type of the state parameter value differs, and so does the actual meaning of its content. Typically one check item corresponds to one state parameter value, although some check items may correspond to more than one.
As to the timing of state detection, in one embodiment it may be executed periodically; the length of the period is not limited and may be, for example, once every 24 hours. In another embodiment, state detection may be executed at scheduled times, for example at 17:00 and 19:00 each day. In yet another embodiment, state detection may be executed when a detection instruction is detected. The detection instruction triggers state detection and may be generated in various ways, for example in response to the user clicking a designated button or in response to data sent by another electronic device.
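As a rough illustration of step S101, the sketch below models a check item list and runs each item once to collect state parameter values. The class `CheckItem`, the function `detect_state`, the parameter names, and the fixed-value probes are all hypothetical and chosen only to keep the example runnable; the application does not prescribe any of them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CheckItem:
    name: str                   # parameter type, e.g. "host_memory_usage" (hypothetical)
    category: str               # "cluster", "host", "service" or "network"
    probe: Callable[[], float]  # hypothetical function that reads the current value


def detect_state(check_items: List[CheckItem]) -> Dict[str, float]:
    """Run every check item once and collect the state parameter values."""
    return {item.name: item.probe() for item in check_items}


# Hypothetical probes that return fixed numbers so the sketch can run anywhere.
check_item_list = [
    CheckItem("host_memory_usage", "host", lambda: 0.91),
    CheckItem("cluster_disk_usage", "cluster", lambda: 0.42),
    CheckItem("network_latency_ms", "network", lambda: 3.5),
]

state_values = detect_state(check_item_list)
print(state_values)
```

Because the list is ordinary data, items can be added, deleted, or modified between runs, which corresponds to the variable check item list described above.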
S102: Perform fault identification on the state parameter values.
After the state parameter values are obtained, fault identification is performed on them to judge whether each state parameter value indicates that the cluster has a fault. In one embodiment, it may be judged whether a combination of the state parameter values satisfies a preset rule. In another embodiment, abnormal intervals may be preset in one-to-one correspondence with the state parameter values; each state parameter value is compared with its corresponding abnormal interval, and if it lies within that interval, it is determined to be a fault parameter value and a cluster fault is deemed detected. In yet another embodiment, the two embodiments above may be combined.
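The abnormal-interval embodiment of step S102 can be sketched as follows. The interval table, its bounds, and the parameter names are assumptions made for illustration; the application only states that each state parameter value is compared with its corresponding abnormal interval.

```python
from typing import Dict, Tuple

# Hypothetical abnormal intervals, one per parameter type: (low, high), inclusive.
ABNORMAL_INTERVALS: Dict[str, Tuple[float, float]] = {
    "host_memory_usage": (0.80, 1.00),  # usage of 80% or more is treated as abnormal
    "disk_free_gb": (0.0, 50.0),        # 50 GB or less remaining is treated as abnormal
}


def identify_faults(
    state_values: Dict[str, float],
    intervals: Dict[str, Tuple[float, float]] = ABNORMAL_INTERVALS,
) -> Dict[str, float]:
    """Return the state parameter values that lie inside their abnormal interval."""
    faults = {}
    for param_type, value in state_values.items():
        bounds = intervals.get(param_type)
        if bounds is None:
            continue  # no interval configured for this parameter type
        low, high = bounds
        if low <= value <= high:
            faults[param_type] = value  # this is a fault parameter value
    return faults


faults = identify_faults({"host_memory_usage": 0.91, "disk_free_gb": 320.0})
print(faults)
```

The keys of the returned dictionary play the role of fault parameter types, which the next step uses to choose a repair program.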
S103: If a fault parameter value is identified, select a corresponding target repair program according to the fault parameter type of the fault parameter value.
It will be appreciated that if a fault parameter value is identified, the cluster is abnormal in some respect, which makes the related parameter value abnormal. The meaning of a fault parameter value depends on its fault parameter type, and the fault parameter type represents the angle from which the state was detected, so the fault parameter type can represent the type of the fault and can further indicate how the fault should be resolved. For example, when the fault parameter type is the remaining space of the hard disk and the corresponding state parameter value is smaller than the remaining-space threshold, the value lies in the corresponding abnormal interval and is determined to be a fault parameter value. The remaining hard disk space is then too small, and to keep the cluster running normally the fault needs to be repaired by expanding the available space.
In the present application, a plurality of fault repair programs are preset, and each parameter type corresponds to at least one fault repair program. A fault repair program repairs the fault expressed by the fault parameter type so that the cluster returns to normal. The target repair program is the fault repair program selected to repair the fault corresponding to the fault parameter type, and may be any one of the fault repair programs corresponding to that fault parameter type.
It should be noted that several different fault causes may lead to the same result, that is, different fault causes may present the same symptom. For example, when little storage space remains, the cause may be that a large amount of data has been stored, or that the connection to an empty data disk has been lost. Both situations appear as insufficient remaining storage space, but their causes differ, and so do the repair programs needed. To select the target repair program accurately and complete the repair efficiently, the process of selecting a corresponding target repair program according to the fault parameter type of the fault parameter value may include the following steps:
Step 11: Acquire the target operation parameters collected when the fault parameter type occurs.
Step 12: Input the target operation parameters into a cause identification model to obtain a candidate fault cause.
Step 13: Determine the fault repair program corresponding to the candidate fault cause as the target repair program.
The target operation parameters are the operation parameters collected when the fault occurs, and may be software (or service) operation parameters or hardware parameters. The format and category of the target operation parameters may be fixed; alternatively, to reduce interference from invalid data and improve the accuracy and speed of determining the target repair program, they may differ according to the fault parameter type. After the target operation parameters are obtained, they are input into the cause identification model. The cause identification model is a trained deep learning model that extracts operation features from the target operation parameters and determines candidate fault causes from those features.
Specifically, the cause identification model is obtained by training with training data, which consists of labeled target operation parameters (also called data combinations). For example, the format may be: abnormal detection item name (i.e. fault parameter type), operation indexes of the service and the host (i.e. target operation parameters), and target cause (i.e. fault cause). In each training round, the current model is used to infer a fault cause for the training data, the difference between the inferred cause and the target cause is calculated, and the model is updated backward according to that difference; this is iterated until an algorithm model that meets expectations, i.e. the cause identification model, is output. Furthermore, the cause identification model may be updated at a preset period, with new training data continually used for additional training so that its identification accuracy keeps improving.
After the target operation parameters are input into the cause identification model, a candidate fault cause is obtained. Because the correspondence between each fault cause and its fault repair program is preset, once the candidate fault cause is determined, the corresponding fault repair program can be determined as the target repair program.
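Steps 11 to 13 can be sketched as below. The application describes the cause identification model as a trained deep learning model; here a hand-written rule, `stand_in_cause_model`, is substituted purely so the example runs, and the cause names, parameter names, and repair functions are likewise hypothetical.

```python
from typing import Callable, Dict


def stand_in_cause_model(params: Dict[str, float]) -> str:
    """Stand-in for the trained cause identification model (assumption, not the real model)."""
    if params.get("mounted_data_disks", 0) < params.get("expected_data_disks", 0):
        return "data_disk_disconnected"
    return "too_much_data_stored"


def clean_up_old_data() -> str:
    return "cleaned up old data"   # hypothetical repair program


def remount_data_disk() -> str:
    return "remounted data disk"   # hypothetical repair program


# Preset correspondence between fault causes and fault repair programs.
REPAIR_BY_CAUSE: Dict[str, Callable[[], str]] = {
    "too_much_data_stored": clean_up_old_data,
    "data_disk_disconnected": remount_data_disk,
}


def select_target_repair(
    target_params: Dict[str, float],
    model: Callable[[Dict[str, float]], str] = stand_in_cause_model,
) -> Callable[[], str]:
    cause = model(target_params)   # step 12: candidate fault cause
    return REPAIR_BY_CAUSE[cause]  # step 13: matching repair program


# Step 11: target operation parameters collected when the fault occurred (made-up values).
repair = select_target_repair({"mounted_data_disks": 3, "expected_data_disks": 4})
print(repair())
```

Both causes in this sketch present the same symptom, low remaining storage space, yet lead to different repair programs, which is the situation the steps above are meant to handle.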
In another embodiment, a given type of fault is typically caused by one fixed cause, and other causes are unlikely to produce it. In this case, to determine the target repair program faster, a correspondence between parameter types and fault repair programs may be set in advance, directly specifying for each parameter type the fault repair program that repairs the corresponding fault. Specifically, the process of selecting the corresponding target repair program according to the fault parameter type of the fault parameter value may include the following steps:
Step 21: Acquire the correspondence between parameter types and fault repair programs.
Step 22: Determine, based on the correspondence, the target repair program corresponding to the fault parameter type from the plurality of fault repair programs.
The correspondence may be stored locally on the device in advance and read directly when the target repair program needs to be determined; in another embodiment, it may be obtained from another electronic device or storage device. After the fault parameter type is determined, the correspondence is filtered by the fault parameter type to obtain the corresponding target repair program.
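Steps 21 and 22 amount to a table lookup. In the sketch below the correspondence is held as a JSON string to stand for a locally stored or remotely fetched configuration; the JSON layout, the program names, and the functions `load_correspondence` and `select_by_type` are assumptions for illustration.

```python
import json
from typing import Callable, Dict, Optional

# Hypothetical stored correspondence: parameter type -> name of a fault repair program.
CORRESPONDENCE_JSON = """
{
  "disk_free_space": "expand_storage",
  "host_memory_usage": "restart_service"
}
"""

# Hypothetical fault repair programs, keyed by name.
REPAIR_PROGRAMS: Dict[str, Callable[[], str]] = {
    "expand_storage": lambda: "storage expanded",
    "restart_service": lambda: "service restarted",
}


def load_correspondence(raw: str) -> Dict[str, str]:
    """Step 21: acquire the correspondence (here, parsed from a JSON string)."""
    return json.loads(raw)


def select_by_type(
    fault_type: str,
    correspondence: Dict[str, str],
    programs: Dict[str, Callable[[], str]],
) -> Optional[Callable[[], str]]:
    """Step 22: look up the target repair program for the fault parameter type."""
    program_name = correspondence.get(fault_type)
    if program_name is None:
        return None  # no repair program configured for this type
    return programs[program_name]


correspondence = load_correspondence(CORRESPONDENCE_JSON)
target = select_by_type("disk_free_space", correspondence, REPAIR_PROGRAMS)
print(target())
```

Compared with calling a model for every fault, the lookup is faster but only as accurate as the preset table, which is why the application goes on to describe updating it.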
Further, as the cluster runs, its operating conditions may change, and the same fault may come to be caused by a different cause. In that case, the target repair program determined from the original correspondence can no longer repair the fault effectively. Therefore, this embodiment may further include the following steps:
Step 31: Count the occurrence frequency of each fault parameter type.
Step 32: Acquire the operation parameters collected when each high-frequency parameter type occurs; a high-frequency parameter type occurs more frequently than a non-high-frequency parameter type.
Step 33: Input each operation parameter into the cause identification model to obtain a plurality of fault causes corresponding to each of the high-frequency parameter types.
Step 34: Update the target relationship corresponding to each high-frequency parameter type in the correspondence based on the most numerous fault cause.
Specifically, steps 31 to 34 may be performed periodically. After a fault parameter value is identified, its fault parameter type is determined and the occurrence frequency of that type is counted; the occurrence frequency may be the frequency within a period of time before the current time. Counting the occurrence frequency of each fault parameter type shows which faults have occurred frequently over the past period. Since a fault should not recur for some time after it has been repaired, a fault that occurs frequently is regarded as one that has not been repaired correctly.
After the occurrence frequencies are determined, the fault parameter types are sorted by occurrence frequency, and the top-ranked ones are determined to be high-frequency parameter types, so that every high-frequency parameter type occurs more frequently than every non-high-frequency parameter type. To address the frequent recurrence of faults corresponding to the high-frequency parameter types, the operation parameters collected at each occurrence are acquired and input into the cause identification model to obtain the corresponding fault causes. There are a plurality of high-frequency parameter types; each corresponds to a plurality of operation parameters, and each operation parameter corresponds to one fault cause, so each high-frequency parameter type corresponds to a plurality of fault causes. For each high-frequency parameter type, the fault cause that occurs most often, i.e. the most numerous fault cause, is selected and used to update the corresponding target relationship in the correspondence, which completes the update. The updated correspondence better matches the current operating conditions of the cluster, and an accurate target repair program can be selected with it.
Specifically, in one embodiment, when the target relationship is updated, the fault repair program that matches the most numerous fault cause may directly replace the fault repair program in the target relationship. In another embodiment, it may first be determined whether the target relationship already specifies the fault repair program that matches the most numerous fault cause; if not, the fault repair program in the target relationship is replaced, and otherwise the update is deemed complete.
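Steps 31 to 34 can be sketched with simple counters. The fault log format, the `top_n` cut-off for "high-frequency", the rule-based `stand_in_cause_model`, and every name below are assumptions; the application leaves these details open and uses a trained model for cause identification.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

# (fault parameter type, operation parameters collected at that occurrence)
FaultRecord = Tuple[str, Dict[str, float]]


def update_correspondence(
    fault_log: List[FaultRecord],
    correspondence: Dict[str, str],
    cause_model: Callable[[Dict[str, float]], str],
    repair_by_cause: Dict[str, str],
    top_n: int = 1,
) -> Dict[str, str]:
    # Step 31: count how often each fault parameter type occurred.
    frequency = Counter(fault_type for fault_type, _ in fault_log)
    # Step 32: the top_n most frequent types are treated as high-frequency.
    high_frequency_types = [t for t, _ in frequency.most_common(top_n)]
    updated = dict(correspondence)  # leave the original mapping untouched
    for fault_type in high_frequency_types:
        # Step 33: infer one cause per occurrence of this type.
        causes = Counter(cause_model(p) for t, p in fault_log if t == fault_type)
        # Step 34: rebind the type to the program matching the most numerous cause.
        top_cause = causes.most_common(1)[0][0]
        updated[fault_type] = repair_by_cause[top_cause]
    return updated


def stand_in_cause_model(params: Dict[str, float]) -> str:
    """Hypothetical rule standing in for the trained cause identification model."""
    if params["mounted_disks"] < params["expected_disks"]:
        return "data_disk_disconnected"
    return "too_much_data_stored"


fault_log = [
    ("disk_free_space", {"mounted_disks": 3, "expected_disks": 4}),
    ("disk_free_space", {"mounted_disks": 3, "expected_disks": 4}),
    ("disk_free_space", {"mounted_disks": 4, "expected_disks": 4}),
    ("host_memory_usage", {"mounted_disks": 4, "expected_disks": 4}),
]
correspondence = {"disk_free_space": "cleanup_old_data", "host_memory_usage": "restart_service"}
repair_by_cause = {
    "too_much_data_stored": "cleanup_old_data",
    "data_disk_disconnected": "remount_data_disk",
}

new_correspondence = update_correspondence(
    fault_log, correspondence, stand_in_cause_model, repair_by_cause
)
print(new_correspondence)
```

In this made-up log, the disk fault recurs three times and two of those occurrences trace to a disconnected disk, so its entry is rebound from the clean-up program to the remount program while the memory entry is left unchanged.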
S104: Perform fault processing using the target repair program.
After the target repair program is determined, it is run to complete the fault processing. After the fault processing is finished, a report may be generated based on the fault parameter value and/or the fault parameter type and output in a preset manner, so that the user learns of the cluster state and the fault processing result in time.
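A minimal sketch of step S104 and the optional report is given below. The report layout, the status strings, and the choice to record a failed repair rather than raise are all assumptions added for illustration; the application only states that the target repair program is run and that a report may be generated from the fault parameter value and/or type.

```python
from typing import Callable, Dict, List


def process_fault(
    fault_type: str, fault_value: float, repair_program: Callable[[], str]
) -> Dict[str, object]:
    """Run the target repair program and build one report entry."""
    try:
        detail = repair_program()
        status = "repaired"
    except Exception as exc:  # assumption: a failed repair is reported, not hidden
        detail = str(exc)
        status = "repair_failed"
    return {
        "fault_parameter_type": fault_type,
        "fault_parameter_value": fault_value,
        "status": status,
        "detail": detail,
    }


def format_report(entries: List[Dict[str, object]]) -> str:
    """Render the entries as plain text; any other preset output mode would also do."""
    lines = ["Cluster fault report"]
    for e in entries:
        lines.append(
            f"- {e['fault_parameter_type']}={e['fault_parameter_value']}: "
            f"{e['status']} ({e['detail']})"
        )
    return "\n".join(lines)


def expand_storage() -> str:
    return "storage expanded"  # hypothetical repair program


def broken_repair() -> str:
    raise RuntimeError("service not reachable")  # hypothetical failing program


entries = [
    process_fault("disk_free_gb", 12.0, expand_storage),
    process_fault("host_memory_usage", 0.91, broken_repair),
]
report = format_report(entries)
print(report)
```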
By applying the cluster fault processing method provided by the embodiment of the application, after the state parameter value is obtained by state detection, fault recognition is carried out on the state parameter value, and whether the state parameter value is normal or not is judged. If the fault parameter value is identified, the cluster is indicated to have a fault, and the fault is expressed by the state parameter value. Each state parameter value has a corresponding parameter type, different parameter types represent cluster states from different angles, and different types of fault parameter values indicate different types of faults of the clusters. In order to perform automatic fault repair, a plurality of fault repair programs are preset, and the fault repair programs respectively correspond to different types of faults, namely different parameter types. After the fault parameter value occurs, the cluster is indicated to have a fault related to the fault parameter type, so that a target repair program corresponding to the fault parameter value is selected and further used for fault processing. By presetting fault repairing programs for repairing different types of faults and selecting a corresponding target repairing program for fault treatment according to the types of the fault parameters after the fault parameter values are detected, the repairing of cluster faults can be automatically completed, the fault repairing efficiency is improved, and the problem of low fault treatment efficiency in the related technology is solved.
Referring to fig. 2, fig. 2 is a flowchart of a specific cluster fault handling method according to an embodiment of the present disclosure. The user can create a health check task that runs periodically or once, and the check items of the health check are specified through a health check definition. After the check is finished, a check report is generated. The user can click one-key repair to repair the abnormal items in the report, whereupon the computer selects a target repair program and uses it for fault processing; that is, one-key repair selects the target repair program according to a repair list and executes it to complete the repair. In another embodiment, statistical analysis of fault types is performed on the generated reports, and the output analysis result guides the update of the repair program definitions, i.e. the correspondence in the above embodiment is updated, so that the target repair program is then selected using the new correspondence.
Specifically, the statistical analysis process includes:
1) High-frequency fault statistics: the abnormal detection items in the detection reports over a period of time are counted and ranked, and the detection items with the highest abnormality frequency (for example, the top 10 items by occurrence count) are found and recorded.
2) Index feature extraction: the operation indexes of the relevant services and hosts are extracted according to the abnormal detection item. For example, when memory usage exceeds the limit, the memory usage of each running service and of each running program on the host is extracted as the operation indexes.
3) Intelligent reasoning model: this model is built into the intelligent statistical analyzer and is its key component. It calculates and infers the causes of abnormal items from the operation index features of the services and hosts. It is capable of autonomous learning, so different training data can be input to train and update it repeatedly, improving inference accuracy. The training data are labeled tuples in the format (abnormal detection item name, operation indexes of the service and host, target cause), and can be added, deleted, or updated. In each iteration, a cause is inferred with the current model, the difference between the inferred cause and the target cause is calculated, and the model is updated backward according to that difference; the iteration repeats until an algorithm model that meets expectations is output.
4) Fault cause inference: the cause of the abnormal detection item is calculated and inferred, which accurately guides the choice of repair program for one-key repair.
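The iterative training described in item 3) can be illustrated with a very small model. The source does not specify the model type, so this sketch substitutes a multiclass perceptron as a stand-in: it infers a cause with the current weights, compares it with the target cause, and corrects the weights when they differ. The cause names, feature layout (service memory share, other-process memory share, constant bias), and sample values are assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Labeled sample: (abnormal detection item name, operation indexes, target cause).
Sample = Tuple[str, List[float], str]


def predict(weights: Dict[str, List[float]], features: List[float]) -> str:
    """Infer the cause whose weight vector gives the highest score."""
    return max(
        weights,
        key=lambda cause: sum(w * x for w, x in zip(weights[cause], features)),
    )


def train(samples: List[Sample], epochs: int = 20) -> Dict[str, List[float]]:
    """Repeat infer -> compare with target -> update until no sample is misjudged."""
    causes = sorted({target for _, _, target in samples})
    dim = len(samples[0][1])
    weights = {cause: [0.0] * dim for cause in causes}
    for _ in range(epochs):
        mistakes = 0
        for _item, features, target in samples:
            inferred = predict(weights, features)
            if inferred != target:
                mistakes += 1
                # Move the target cause toward the sample and the wrong cause away.
                for i, x in enumerate(features):
                    weights[target][i] += x
                    weights[inferred][i] -= x
        if mistakes == 0:
            break
    return weights


# Hypothetical training data for the "memory usage exceeds the limit" item.
samples: List[Sample] = [
    ("host_memory_usage", [0.9, 0.1, 1.0], "service_memory_leak"),
    ("host_memory_usage", [0.8, 0.2, 1.0], "service_memory_leak"),
    ("host_memory_usage", [0.2, 0.8, 1.0], "host_process_overuse"),
    ("host_memory_usage", [0.1, 0.9, 1.0], "host_process_overuse"),
]
model = train(samples)
cause = predict(model, [0.85, 0.15, 1.0])
```

The item name is carried in each sample to match the data format described above, but it is not used as a feature here. A production model would likely use richer features and a different learning algorithm.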
Referring to fig. 3, fig. 3 is a structural diagram of a specific health check system according to an embodiment of the present application, where the system is configured to execute each step of the cluster fault handling method according to the present application. The system comprises 5 functional modules: health check definition, health check task execution, health report display, one-key repair, and an intelligent statistical analyzer. Together they provide a big data platform with a health assurance scheme that runs from health check through one-key repair to statistical analysis. Wherein:
1) Health check definition: a health check is a task whose definition comprises a plurality of check items, classified into 4 categories: cluster, host, service, and network. Each check item comprises 2 parts, the check content and the judgment result. The check content is a particular health index of the cluster, such as the memory usage of a host or the dfs (Distributed File System) disk usage of HDFS (Hadoop Distributed File System). The judgment result is a state judgment on the check content against a threshold; for example, if the memory usage of the host exceeds 80 percent the item is judged abnormal, and otherwise it is judged normal. The check items need to be defined and configured one by one before a health check can be performed.
2) Health check task execution: tasks are divided into 2 types, timed and single execution. A timed task is executed regularly, once a day at a fixed time; a single-execution task is executed irregularly and must be triggered manually. The results of timed tasks cannot reflect the real-time health status of the cluster, whereas the results of a single execution can.
3) Health report display: the final health check result is presented as a report that lists, by category, the check result and the check value of each check item. The check result is either normal or abnormal and reflects whether the check item exceeds its threshold; the check value is the health index collected for that check item.
4) One-key repair: this function automatically repairs, with one key, the abnormal items detected in the check report. It is disabled by default, and after browsing each check report the user can decide whether to repair the abnormal items with one key. The repair list defines the repair program for each check item; each check item has its own specific repair program, which may, for example, attempt to restart a service or modify some configuration of the service. When one-key repair is executed, the corresponding repair programs are looked up in the repair list according to the abnormal items in the check report and executed one by one to complete the repair. Repair programs need to be defined and configured in advance; while the system is in use, they can be updated, adjusted, removed, or added according to the output of the intelligent statistical analyzer.
5) Intelligent statistical analysis: each detection item in the detection report is statistically analyzed in real time, covering high-frequency fault statistics as well as fault cause analysis and inference.
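The check definition, report, and one-key repair modules above can be sketched with a few small data types. This is an illustrative sketch only: the class and field names, the 90 percent disk threshold, and the repair program names are assumptions; only the four categories and the 80 percent memory threshold come from the description.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Category(Enum):
    CLUSTER = "cluster"
    HOST = "host"
    SERVICE = "service"
    NETWORK = "network"


@dataclass(frozen=True)
class CheckItem:
    name: str            # check content, i.e. which health index is collected
    category: Category
    threshold: float     # judged abnormal when the collected value exceeds this


@dataclass(frozen=True)
class ReportEntry:
    item: str
    category: str
    value: float         # check value collected for the item
    result: str          # "normal" or "abnormal"


def run_health_check(items: List[CheckItem], collected: Dict[str, float]) -> List[ReportEntry]:
    """Judge every check item against its threshold and list entries by category."""
    entries = [
        ReportEntry(
            item=i.name,
            category=i.category.value,
            value=collected[i.name],
            result="abnormal" if collected[i.name] > i.threshold else "normal",
        )
        for i in items
    ]
    return sorted(entries, key=lambda e: e.category)


def one_key_repair(report: List[ReportEntry], repair_list: Dict[str, str]) -> List[str]:
    """Look up the repair program of each abnormal item; returns the names to execute."""
    return [repair_list[e.item] for e in report if e.result == "abnormal" and e.item in repair_list]


items = [
    CheckItem("hdfs_dfs_disk_usage", Category.SERVICE, 0.90),  # hypothetical threshold
    CheckItem("host_memory_usage", Category.HOST, 0.80),       # 80% as in the description
]
report = run_health_check(items, {"hdfs_dfs_disk_usage": 0.50, "host_memory_usage": 0.85})
to_run = one_key_repair(report, {"host_memory_usage": "restart_service",
                                 "hdfs_dfs_disk_usage": "clean_disk"})
```

Here `one_key_repair` only returns the names of the repair programs to run; actually executing them (restarting a service, changing a configuration) is left out of the sketch.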
The cluster fault handling apparatus provided in the embodiments of the present application is introduced below; the apparatus described below and the cluster fault handling method described above may be read with reference to each other.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a cluster fault handling device according to an embodiment of the present application, including:
the state detection module 110 is configured to perform state detection on the target cluster according to the check item list to obtain a state parameter value;
a fault identification module 120, configured to perform fault identification on the status parameter value;
a program selecting module 130, configured to select a corresponding target repair program according to a fault parameter type of a fault parameter value if the fault parameter value is identified;
and a failure processing module 140, configured to perform failure processing by using the target repair program.
Optionally, the program selecting module 130 includes:
a correspondence obtaining unit configured to obtain a correspondence between a parameter type and a fault repairing program;
and the program selecting unit is used for determining a target repairing program corresponding to the fault parameter type from the plurality of fault repairing programs based on the corresponding relation.
Optionally, the apparatus further comprises:
the statistical module is used for counting the occurrence frequency of each fault parameter type;
the operation parameter acquisition module is used for acquiring operation parameters acquired when the high-frequency parameter type appears; the occurrence frequency of the high-frequency parameter type is greater than the occurrence frequency of the non-high-frequency parameter type;
the identification module is used for inputting each operation parameter into the reason identification model to obtain a plurality of fault reasons corresponding to a plurality of high-frequency parameter types respectively;
and the updating module is used for updating each target relation corresponding to each high-frequency parameter type in the corresponding relation based on the fault reasons with the largest quantity.
Optionally, the update module includes:
and the replacing unit is used for replacing the fault repairing programs in the target relation by the fault repairing programs matched with the fault reasons with the largest number.
Optionally, the program selecting module 130 includes:
the parameter acquisition module is used for acquiring target operation parameters acquired when the fault parameter type occurs;
the input unit is used for inputting the target operation parameters into the reason identification model to obtain candidate fault reasons;
and the determining unit is used for determining the fault repairing program corresponding to the candidate fault reason as the target repairing program.
Optionally, the fault identification module 120 includes:
the comparison unit is used for comparing each state parameter value with the corresponding abnormal interval;
and the fault determining unit is used for determining the state parameter value as a fault parameter value if the state parameter value falls within the abnormal interval.
Optionally, the apparatus further comprises:
and the report generating module is used for generating a report based on the fault parameter value and/or the fault parameter type and outputting the report according to a preset mode.
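The cooperation of the statistical, identification, and updating modules listed above can be sketched as follows. This is a hedged illustration, not the apparatus itself: the function name `update_correspondence`, the `infer_cause` callback standing in for the cause identification model, the `top_n` cutoff for "high-frequency", and all program and cause names are assumptions.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

FaultRecord = Tuple[str, List[float]]  # (fault parameter type, operation parameters)


def update_correspondence(
    correspondence: Dict[str, str],
    fault_log: List[FaultRecord],
    infer_cause: Callable[[List[float]], str],
    cause_to_program: Dict[str, str],
    top_n: int = 1,
) -> Dict[str, str]:
    """Re-point high-frequency parameter types at the program matching their top cause."""
    # Statistical module: occurrence frequency of each fault parameter type.
    frequency = Counter(param_type for param_type, _ in fault_log)
    high_frequency_types = [t for t, _ in frequency.most_common(top_n)]

    updated = dict(correspondence)
    for param_type in high_frequency_types:
        # Identification module: one inferred cause per recorded occurrence.
        causes = Counter(
            infer_cause(params) for t, params in fault_log if t == param_type
        )
        top_cause, _count = causes.most_common(1)[0]
        # Updating module: replace the repair program with the one matching the top cause.
        if top_cause in cause_to_program:
            updated[param_type] = cause_to_program[top_cause]
    return updated


def infer_cause(params: List[float]) -> str:
    """Hypothetical stand-in for the cause identification model."""
    return "service_memory_leak" if params[0] > params[1] else "host_process_overuse"


fault_log: List[FaultRecord] = [
    ("host_memory_usage", [0.9, 0.1]),
    ("host_memory_usage", [0.8, 0.2]),
    ("host_memory_usage", [0.2, 0.8]),
    ("dfs_disk_usage", [0.5, 0.5]),
]
new_correspondence = update_correspondence(
    {"host_memory_usage": "restart_service", "dfs_disk_usage": "clean_disk"},
    fault_log,
    infer_cause,
    {"service_memory_leak": "raise_service_memory_limit",
     "host_process_overuse": "stop_rogue_process"},
)
```

The function returns a new mapping rather than mutating the old one, so the previous correspondence remains available if an update needs to be rolled back.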
The electronic device provided in the embodiments of the present application is introduced below; the electronic device described below and the cluster fault handling method described above may be read with reference to each other.
Referring to fig. 5, fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure. Wherein the electronic device 100 may include a processor 101 and a memory 102, and may further include one or more of a multimedia component 103, an information input/information output (I/O) interface 104, and a communication component 105.
The processor 101 is configured to control overall operations of the electronic device 100 to complete all or part of the steps in the above-described cluster fault handling method; the memory 102 is used to store various types of data to support operation at the electronic device 100, such data may include, for example, instructions for any application or method operating on the electronic device 100, as well as application-related data. The Memory 102 may be implemented by any type or combination of volatile and non-volatile Memory devices, such as one or more of Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic Memory, flash Memory, magnetic or optical disk.
The multimedia component 103 may include a screen and an audio component. The screen may be, for example, a touch screen, and the audio component is used for outputting and/or inputting audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signal may further be stored in the memory 102 or transmitted through the communication component 105. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 104 provides an interface between the processor 101 and other interface modules, such as a keyboard, a mouse, or buttons. These buttons may be virtual buttons or physical buttons. The communication component 105 is used for wired or wireless communication between the electronic device 100 and other devices. Wireless communication may be, for example, Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination of one or more of them; accordingly, the communication component 105 may include a Wi-Fi part, a Bluetooth part, and an NFC part.
The electronic Device 100 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, and is configured to perform the cluster failure Processing method according to the above embodiments.
The computer-readable storage medium provided in the embodiments of the present application is introduced below; the computer-readable storage medium described below and the cluster fault handling method described above may be read with reference to each other.
The present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the above-mentioned cluster fault handling method.
The computer-readable storage medium may include various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to each other. The device disclosed in the embodiments corresponds to the method disclosed in the embodiments, so it is described only briefly; for the relevant details, refer to the description of the method.
Those of skill would further appreciate that the various illustrative components and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it should also be noted that, herein, relational terms such as first and second are intended only to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprise", "include", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that includes a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The principle and the implementation of the present application are explained herein by applying specific examples, and the above description of the embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A cluster fault processing method is characterized by comprising the following steps:
according to the inspection item list, carrying out state detection on the target cluster to obtain a state parameter value;
carrying out fault identification on the state parameter values;
if a fault parameter value is identified, selecting a corresponding target repair program according to the fault parameter type of the fault parameter value;
and utilizing the target repair program to carry out fault treatment.
2. The cluster fault handling method according to claim 1, wherein the selecting a corresponding target repair program according to the fault parameter type of the fault parameter value includes:
acquiring a corresponding relation between the parameter type and the fault repairing program;
and determining the target repairing program corresponding to the fault parameter type from a plurality of fault repairing programs based on the corresponding relation.
3. The cluster failure processing method of claim 2, further comprising:
counting the occurrence frequency of each fault parameter type;
acquiring operation parameters acquired when the high-frequency parameter type appears; the occurrence frequency of the high-frequency parameter type is greater than the occurrence frequency of the non-high-frequency parameter type;
inputting each operation parameter into a reason identification model to obtain a plurality of fault reasons corresponding to a plurality of high-frequency parameter types respectively;
and updating each target relationship corresponding to each high-frequency parameter type in the corresponding relationship based on the largest number of the fault causes.
4. The cluster fault handling method according to claim 3, wherein the updating, based on the largest number of the fault causes, each target relationship corresponding to each high-frequency parameter type in the correspondence relationship, includes:
replacing the fault repairing program in the target relationship with the fault repairing program that matches the largest number of the fault causes.
5. The cluster fault handling method according to claim 1, wherein the selecting a corresponding target repair program according to the fault parameter type of the fault parameter value includes:
acquiring target operation parameters acquired when the fault parameter type occurs;
inputting the target operation parameters into a reason identification model to obtain candidate fault reasons;
and determining the fault repairing program corresponding to the candidate fault reason as the target repairing program.
6. The cluster fault handling method according to claim 1, wherein the performing fault identification on the status parameter value includes:
comparing each state parameter value with the corresponding abnormal interval;
and if the state parameter value falls within the abnormal interval, determining the state parameter value as a fault parameter value.
7. The cluster failure processing method of claim 1, further comprising:
and generating a report based on the fault parameter value and/or the fault parameter type, and outputting the report in a preset mode.
8. A cluster failure handling apparatus, comprising:
the state detection module is used for carrying out state detection on the target cluster according to the check item list to obtain a state parameter value;
the fault identification module is used for carrying out fault identification on the state parameter values;
the program selection module is used for selecting a corresponding target repair program according to the fault parameter type of the fault parameter value if the fault parameter value is identified;
and the fault processing module is used for processing faults by using the target repairing program.
9. An electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the cluster fault handling method according to any one of claims 1 to 7.
10. A computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the cluster failure handling method of any of claims 1 to 7.
CN202110866089.XA 2021-07-29 2021-07-29 Cluster fault processing method, device and equipment and readable storage medium Withdrawn CN113722134A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110866089.XA CN113722134A (en) 2021-07-29 2021-07-29 Cluster fault processing method, device and equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110866089.XA CN113722134A (en) 2021-07-29 2021-07-29 Cluster fault processing method, device and equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN113722134A true CN113722134A (en) 2021-11-30

Family

ID=78674279

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110866089.XA Withdrawn CN113722134A (en) 2021-07-29 2021-07-29 Cluster fault processing method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN113722134A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115225460A (en) * 2022-07-15 2022-10-21 北京天融信网络安全技术有限公司 Failure determination method, electronic device, and storage medium
CN115225460B (en) * 2022-07-15 2023-11-28 北京天融信网络安全技术有限公司 Fault determination method, electronic device, and storage medium
CN116643906A (en) * 2023-06-01 2023-08-25 北京首都在线科技股份有限公司 Cloud platform fault processing method and device, electronic equipment and storage medium
CN116643906B (en) * 2023-06-01 2024-08-02 北京首都在线科技股份有限公司 Cloud platform fault processing method and device, electronic equipment and storage medium
CN117057676A (en) * 2023-10-11 2023-11-14 深圳润世华软件和信息技术服务有限公司 Multi-data fusion fault analysis method, equipment and storage medium
CN117057676B (en) * 2023-10-11 2024-02-23 深圳润世华软件和信息技术服务有限公司 Multi-data fusion fault analysis method, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110321371B (en) Log data anomaly detection method, device, terminal and medium
CN107528722B (en) Method and device for detecting abnormal point in time sequence
CN111459700B (en) Equipment fault diagnosis method, diagnosis device, diagnosis equipment and storage medium
US11294754B2 (en) System and method for contextual event sequence analysis
CN113722134A (en) Cluster fault processing method, device and equipment and readable storage medium
JP6875179B2 (en) System analyzer and system analysis method
EP3131234A1 (en) Core network analytics system
US20160116378A1 (en) Population-based learning with deep belief networks
JP2018045403A (en) Abnormality detection system and abnormality detection method
JP6708219B2 (en) Log analysis system, method and program
CN111108481B (en) Fault analysis method and related equipment
CN106104496A (en) The abnormality detection not being subjected to supervision for arbitrary sequence
KR20190021560A (en) Failure prediction system using big data and failure prediction method
CN113282461A (en) Alarm identification method and device for transmission network
CN111262750B (en) Method and system for evaluating baseline model
US11675643B2 (en) Method and device for determining a technical incident risk value in a computing infrastructure from performance indicator values
US9860109B2 (en) Automatic alert generation
CN112182056A (en) Data detection method, device, equipment and storage medium
CN107430590B (en) System and method for data comparison
CN115118580B (en) Alarm analysis method and device
CN112416800A (en) Intelligent contract testing method, device, equipment and storage medium
CN111091863A (en) Storage equipment fault detection method and related device
CN109711450A (en) A kind of power grid forecast failure collection prediction technique, device, electronic equipment and storage medium
EP3367241B1 (en) Method, computer program and system for providing a control signal for a software development environment
CN111651753A (en) User behavior analysis system and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication
Application publication date: 20211130