WO2024121952A1

WO2024121952A1 - Failure restoration expediting system, failure restoration expediting method, and failure restoration expediting program

Info

Publication number: WO2024121952A1
Application number: PCT/JP2022/044983
Authority: WO
Inventors: 雅人西口; 俊之金澤; 寛規井上; 達也出水
Original assignee: 日本電信電話株式会社
Priority date: 2022-12-06
Filing date: 2022-12-06
Publication date: 2024-06-13

Abstract

In the present invention, a host instructs a server that controls clusters to execute a power supply operation on a reserve node. The host acquires the power supply state of the reserve node in response to receiving a notification that the power supply operation has been completed. The host assesses whether the power supply state of the reserve node is powered on or powered off. In response to the assessment that the power supply state of the reserve node is powered off, the host instructs the server to turn on the power supply to the reserve node.

Description

Faster failure recovery system, faster failure recovery method, and faster failure recovery program

This disclosure relates to a system for accelerating failure recovery, a method for accelerating failure recovery, and a program for accelerating failure recovery.

Cluster systems provide availability to services. A cluster is multiple computers working together to perform a task. These computers are called nodes.

Cluster systems provide a variety of high availability services. These high availability services include service failover: if the node providing a service fails, another node takes over the service.

However, the above prior art techniques can take a long time to restore a cluster where a node has failed.

The present disclosure provides a system, method, and program for accelerating failure recovery that can shorten the recovery time of a cluster in which a node has failed.

In one aspect of the present disclosure, the fast failure recovery system includes a first instruction unit that instructs a server that controls a cluster to execute a power operation on a backup node, an acquisition unit that acquires the power state of the backup node in response to receiving a notification of the completion of the power operation, a determination unit that determines whether the power state of the backup node is powered on or off, and a second instruction unit that instructs the server to power on the backup node in response to determining that the power state of the backup node is powered off.

The high-speed failure recovery system can shorten the recovery time of a cluster in which a node has failed.

FIG. 1 shows an example of a cluster system in a virtual environment.
FIG. 2 shows an example of a countermeasure against split-brain.
FIG. 3 shows an example of the expected behavior when the fencing action is "reboot".
FIG. 4 shows an example of a problem case when the fencing action is "reboot".
FIG. 5 shows an example of the expected behavior when the fencing action is "stop".
FIG. 6 shows an example of a problem case when the fencing action is "stop".
FIG. 7 is a block diagram of an example environment for cluster recovery.
FIG. 8 shows an overview of one cluster recovery process according to the present disclosure.
FIG. 9 is a block diagram of an example of a host configuration according to the present disclosure.
FIG. 10 is a sequence diagram showing an example of a process for recovering a cluster in which a node has failed.
FIG. 11 shows an example of the hardware configuration of a computer.

Several embodiments of the present disclosure are described in the accompanying drawings and in the following description. However, the present invention is not limited to these embodiments. The various features of these embodiments may be combined in various ways, provided that the features are not mutually inconsistent. Like reference numerals refer to like elements.

[Table of contents]
The following explanation is divided into nine sections:
1. Introduction
2. Environment for cluster recovery
3. Overview of cluster recovery process
4. Host configuration
5. Sequence diagram of cluster recovery process
6. Effects
7. Hardware configuration
8. Summary of embodiment
9. Addendum

1. Introduction
The technology proposed in this specification relates to cluster recovery. In particular, this technology relates to shortening the recovery time of a cluster system in STONITH (Shoot The Other Node In The Head). In a cluster system in a virtual environment, a host sometimes fails. This technology realizes STONITH with a short recovery time.

FIG. 1 shows a cluster system 10, which is an example of a cluster system in a virtual environment. Cluster system 10 includes host 11a, host 11b, and a virtualization infrastructure control server 12. Host 11a includes guest #1. Host 11b includes guest #2.

When a cluster configuration is created, there are cases where clustering is performed on both the guest and the host to increase availability. Clustering software clusters the guests (13). An example of clustering software is Pacemaker (registered trademark). Virtualization infrastructure software clusters the hosts (14a, 14b). An example of virtualization infrastructure software is vSphere. vSphere on the host side is vSphere ESXi. vSphere on the control server side is vSphere vCenter.

The clustering software responds to failures of processes and services running within the guest. If monitored resources fail, the clustering software performs system switching.

The virtualization infrastructure software responds to failures on the host side (e.g., hardware (H/W) failures). For example, if a failure occurs on a monitored host, the virtualization infrastructure software restarts the VM (Virtual Machine) on another host.

Figure 2 shows split-brain countermeasure 20, which is an example of a split-brain countermeasure. Split-brain is a fatal problem that can occur in a cluster system.

First, communication between nodes is cut off for some reason. As a result, the state of the corresponding node becomes unknown. After that, the standby system node switches to the active system. Therefore, the cluster system ends up in a state where there are multiple active systems. This state is called split brain. Split brain occurs in the following situations: network disconnection between systems (21), host failure (e.g. hardware failure), and failure to stop resources.

In a split-brain state, the service may not function correctly, and data written from multiple nodes may lead to data corruption or data inconsistencies.

STONITH is an effective measure against split-brain. STONITH is a fencing function that forcibly terminates a node whose status is unknown, and then causes the node to leave the cluster. If STONITH is successful, the standby system will perform system switchover.

In a virtual environment, STONITH performs power operations on the opposing node via the virtualization infrastructure control server 12. The host 11a issues a power-off instruction to the virtualization infrastructure control server 12 (22a). The virtualization infrastructure control server 12 then cuts off the power (22b).

As a result of STONITH, the standby node is fenced, thereby preventing a split-brain from occurring. To prevent double fencing, STONITH on the standby node is usually configured to delay the execution of STONITH.
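The effect of the delay can be illustrated with a toy model. This is a sketch under simplifying assumptions, not the behavior of any particular clustering software: each node is assumed to issue its fencing request after a fixed delay, a node that has already been fenced is assumed to issue no request, and the function name and node labels are hypothetical.

```python
def simulate_mutual_fencing(delay_active_s, delay_standby_s):
    """Toy model: both nodes lose contact at t=0 and try to fence each other.

    Returns the set of nodes that end up fenced. A node that has already
    been fenced never gets to issue its own fencing request.
    """
    if delay_active_s == delay_standby_s:
        # Both requests fire at the same instant: double fencing.
        return {"active", "standby"}
    if delay_active_s < delay_standby_s:
        # The active node fires first, so only the standby node is fenced.
        return {"standby"}
    # The standby node fires first, so only the active node is fenced.
    return {"active"}


# Without a delay, both nodes fence each other (assumed values).
no_delay = simulate_mutual_fencing(0.0, 0.0)
# With a delay on the standby node, only the standby node is fenced.
with_delay = simulate_mutual_fencing(0.0, 5.0)
```

In this model, delaying STONITH on one side is what breaks the tie and leaves exactly one node running.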

In a system where clustering is performed on both the guest and the host, the problem lies in failure cases (e.g., hardware failure) in which both STONITH and the host-side clustering function are activated.

Fencing actions using STONITH include "reboot" (restart) and "stop."

If the fencing action is "reboot", the success of STONITH is determined based on the completion of the reboot of the opposing node. Therefore, if the reboot does not complete for some reason, STONITH will fail, and as a result, system switchover will not occur. In cases such as a failure of the host on the ACT side, both systems will go down if system switchover does not occur.
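The dependency between reboot completion and system switchover can be expressed as a small decision function. This is a minimal sketch: the timeout parameter, the function name, and the outcome labels are assumptions introduced for illustration and do not appear in the disclosure.

```python
def reboot_fencing_outcome(boot_time_s, timeout_s, active_host_failed):
    """Classify the result of STONITH when the fencing action is "reboot".

    boot_time_s is the time the opposing VM took to finish rebooting, or
    None if it never finished. timeout_s is an assumed upper bound on how
    long STONITH waits for the reboot to complete.
    """
    rebooted = boot_time_s is not None and boot_time_s <= timeout_s
    if rebooted:
        return "switchover"          # STONITH succeeded
    if active_host_failed:
        return "both_systems_down"   # no switchover and no working active system
    return "no_switchover"           # STONITH failed, active side still running


ok = reboot_fencing_outcome(boot_time_s=40, timeout_s=60, active_host_failed=True)
stuck = reboot_fencing_outcome(boot_time_s=None, timeout_s=60, active_host_failed=True)
```

The second call models the problem case: the reboot never completes, so STONITH fails even though the opposing node poses no further risk.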

If the fencing action is "stop", the state of the opposing node after STONITH is completed differs depending on the timing of the power operation performed on the opposing node by the host's clustering function.

The reliability of STONITH and recovery time (the time from completion of STONITH to resumption of service) can be summarized as follows:

(STONITH reliability: fencing action "reboot")
The VM may take a long time to start up, or may fail to start up at all. In either case, STONITH fails, and as a result, system switchover does not occur. This has a significant impact on operations.

(Recovery time: fencing action "reboot")
After STONITH is completed, the VM on the opposing node is powered on, so the node and its opposing node are in the same power state. In this case, the time required for recovery is short.

(STONITH reliability: fencing action "stop")
STONITH is determined to be successful once the VM on the opposing node is powered off. Therefore, in this case, STONITH can be executed more reliably.

(Recovery time: fencing action "stop")
After STONITH is completed, the opposing node is normally powered off. However, the opposing node may instead be powered on, depending on the timing of the power operation on the host side. This case therefore places a heavy burden on operations and requires a long time for recovery.

FIG. 3 shows expected action 30, which is an example of expected action when the fencing action is a reboot. Expected action 30 includes six stages.

(First stage)
A hardware failure occurs in system #0. A network interruption occurs along with the hardware failure.

(Second stage)
STONITH fails due to network outage.

(Third stage)
STONITH continues to fail, and system switchover doesn't work either.

(Fourth stage)
The virtualization platform control server restarts the VM on another host in a standby system. The VM is restarted independently of the #1 system. For example, the virtualization platform control server detects a network outage using ping. When a network outage is detected, the virtualization platform control server restarts the VM in the standby system.

(Fifth stage)
The #1 host performs a STONITH on the new VM, which is then rebooted.

(Sixth stage)
Due to the success of STONITH, the SBY system is promoted to the ACT system.
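The ping-based outage detection in the fourth stage can be sketched as follows. The disclosure only states that ping is used; the rule of requiring several consecutive failed probes, the threshold value, and the function name are assumptions made for this illustration.

```python
def should_restart_vm(probe_results, threshold=3):
    """Return True once `threshold` consecutive probes have failed.

    probe_results is an iterable of booleans, one per ping attempt
    (True = reply received). The threshold is an assumed parameter.
    """
    consecutive_failures = 0
    for reply_received in probe_results:
        if reply_received:
            consecutive_failures = 0      # any reply resets the counter
        else:
            consecutive_failures += 1
        if consecutive_failures >= threshold:
            return True                   # treat the host as unreachable
    return False


outage = should_restart_vm([True, False, False, False])
transient = should_restart_vm([False, True, False, False])
```

Counting consecutive failures rather than total failures is one way to avoid restarting the VM on a single dropped packet.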

Figure 4 shows problem case 40, which is an example of a problem case when the fencing action is a reboot. Problem case 40 shows the actual action that occurred. Problem case 40 includes seven stages.

(Second stage)
STONITH fails due to network outage.

(Fourth stage)
The virtualization infrastructure control server restarts the VM on another host as a backup system.

(Sixth stage)
Rebooting the VM is not successful for some reason. STONITH is not successful.

(Seventh stage)
System switching does not work and both systems go down.

FIG. 5 shows expected behavior 50, which is an example of expected behavior when the fencing action is "stop." Expected behavior 50 includes six stages.

(Second stage)
STONITH fails due to network outage.

(Fifth stage)
The #1 host performs STONITH on the new VM, which is then stopped.

(Sixth stage)
The new VM is in a powered off state due to STONITH. The SBY system is promoted to the ACT system due to the success of STONITH.

FIG. 6 shows problem case 60, which is an example of a problem case when the fencing action is "stop." Problem case 60 shows the actual action that occurred. Problem case 60 includes seven stages.

(Second stage)
STONITH fails due to network outage.

(Fifth stage)
The #1 host performs STONITH on the new VM, and the new VM is stopped. Meanwhile, the restart of the VM continues.

(Sixth stage)
The new VM is powered off due to STONITH. The standby system is promoted to the active system due to the success of STONITH. However, the VM is still being resumed.

(Seventh stage)
The VM resumes and then boots.
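The timing dependence in problem case 60 can be reproduced with a small event-ordering model. This is an illustrative sketch only: the event names, timestamps, and the function are hypothetical, and the model simply assumes that the last power operation to take effect determines the final state.

```python
def final_power_state(events, initial="on"):
    """Apply timestamped power operations in time order.

    events is an iterable of (time_s, action) pairs, where action is
    "power_off" (e.g., STONITH with fencing action "stop") or "power_on"
    (e.g., the host-side clustering function finishing a VM restart).
    """
    state = initial
    for _, action in sorted(events):
        if action == "power_off":
            state = "off"
        elif action == "power_on":
            state = "on"
        else:
            raise ValueError(f"unknown action: {action}")
    return state


# Expected behavior: the restart completes before STONITH stops the VM.
expected = final_power_state([(3, "power_on"), (5, "power_off")])
# Problem case: the restart completes after STONITH has reported success.
problem = final_power_state([(5, "power_off"), (8, "power_on")])
```

In the second ordering, the VM ends up powered on even though STONITH has already been judged successful, which is the inconsistency described for the sixth and seventh stages.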

As described above with reference to Figures 4 and 6, the problem exists not only in the case where the fencing action is a restart, but also in the case where the fencing action is a stop. To solve the above problem, the host according to the present disclosure performs one or more cluster recovery processes described below.

2. Cluster recovery environment
First, the environment for cluster recovery will be described with reference to FIG. 7.

FIG. 7 is a block diagram of environment 1, which is an example of an environment for cluster recovery. As shown in FIG. 7, environment 1 includes host 100, network 200, and control server 300. Host 100 is an example of a high-speed failure recovery system.

Host 100 is a system that performs processing to recover a cluster in which a node has failed. In this specification, such processing is called cluster recovery processing. An overview of cluster recovery processing is explained in Section 3. The details of cluster recovery processing are explained in Section 5 using a sequence diagram.

Host 100 includes one or more computers, such as one or more servers. An example configuration of host 100 is described in Section 4.

The network 200 is a network such as a LAN (Local Area Network), a WAN (Wide Area Network), or the Internet. The network 200 connects the host 100 and the control server 300.

The control server 300 is a server that controls the cluster. The control server 300 is, for example, a virtualization infrastructure control server.

3. Overview of cluster recovery process
Next, an overview of a cluster recovery process is described with reference to Figure 8. Note that this overview is not intended to limit the invention or the embodiments described in the following sections.

FIG. 8 shows overview 70, which is an overview of one cluster recovery process according to the present disclosure. Overview 70 compares the present technology with existing techniques (fencing: "restart" or "stop").

As explained above with reference to Figures 4 and 6, the existing technology includes problem cases. For "reboot", if the reboot is not completed, STONITH fails. For "stop", the VM state is different depending on the timing of the host-side clustering function.

The technology proposed in this specification is based on the premise that the fencing operation is "stop." In this technology, host 100a (node #1) powers off host 100b (node #0) via control server 300. After that, host 100a checks the power status of host 100b. If the power status is powered off, host 100a powers on host 100b via control server 300. This power operation solves the problem (fifth, sixth, and seventh stages) described above with reference to FIG. 6. As a result, this technology can shorten the recovery time of a cluster system in STONITH.
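The power operation described above can be sketched in a few lines. This is a minimal illustration, not the disclosed implementation: `FakeControlServer` and its `power_off`, `power_on`, and `power_state` methods are hypothetical stand-ins for whatever interface control server 300 actually exposes.

```python
from enum import Enum


class PowerState(Enum):
    ON = "powered_on"
    OFF = "powered_off"


class FakeControlServer:
    """Hypothetical in-memory stand-in for control server 300."""

    def __init__(self, states):
        self._states = dict(states)

    def power_off(self, node):
        self._states[node] = PowerState.OFF

    def power_on(self, node):
        self._states[node] = PowerState.ON

    def power_state(self, node):
        return self._states[node]


def fence_then_restore(server, node):
    """Fence `node` with a "stop" action, then power it on if it is off."""
    server.power_off(node)                        # STONITH, fencing action "stop"
    if server.power_state(node) is PowerState.OFF:
        server.power_on(node)                     # power-on instruction
    return server.power_state(node)


server = FakeControlServer({"node0": PowerState.ON})
final_state = fence_then_restore(server, "node0")
```

Because the power state is checked before the power-on instruction is issued, the opposing node ends up powered on regardless of whether another mechanism had already started it.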

4. Host Configuration
Next, an example of the configuration of the host 100 will be described with reference to FIG. 9.

FIG. 9 is a block diagram of an example of the configuration of a host 100 according to the present disclosure. As shown in FIG. 9, the host 100 includes a communication unit 110, a control unit 120, and a storage unit 130. The host 100 may include an input unit (e.g., a keyboard, a mouse) that accepts input from an administrator of the host 100. The host 100 may also include an output unit (e.g., a liquid crystal display, an organic EL (Electro Luminescence) display) that displays information to the administrator.

(Communication unit 110)
The communication unit 110 is implemented by a network device such as a network interface card (NIC). The communication unit 110 is connected to the network 200 by wire or wirelessly. The communication unit 110 can transmit and receive data to and from the control server 300 via the network 200.

(Control unit 120)
The control unit 120 is implemented by a data processing device and various programs stored in a storage device. The data processing device is, for example, a processor such as a central processing unit (CPU), a micro processing unit (MPU), or a general purpose graphic processing unit (GPGPU). The control unit 120 can be implemented as a controller for controlling multiple operations of the host 100. For example, when one or more processors execute a program (multiple instructions) by using a random access memory (RAM) as a working area, the one or more processors perform multiple operations.

The control unit 120 can receive input data for the cluster recovery process from an external device. The control unit 120 can store data such as the input data, data used in the cluster recovery process, and output data of the cluster recovery process in the storage unit 130. The control unit 120 can obtain such data from the storage unit 130 as necessary.

(Storage unit 130)
The storage unit 130 is implemented by a RAM, a semiconductor memory such as a flash memory, a magnetic disk such as a hard disk, or an optical disk. The storage unit 130 can store various programs and various data.

As shown in FIG. 9, the control unit 120 includes a fencing unit 121, an identification unit 122, a request unit 123, a confirmation unit 124, a notification unit 125, and a recovery unit 126. The fencing unit 121 is an example of a first instruction unit. The identification unit 122 is an example of an acquisition unit. The request unit 123 is an example of a determination unit and a second instruction unit. The recovery unit 126 is an example of an incorporating unit. The data processing performed by each unit is described below.

(Fencing unit 121)
The fencing unit 121 performs fencing operations. The fencing unit 121 can perform STONITH on the opposing node.

(Identification unit 122)
The identification unit 122 identifies the state of the opposing node. For example, the state of the opposing node is a power state.

(Request unit 123)
The request unit 123 requests the control server 300 to power on the opposing node.

(Confirmation unit 124)
The confirmation unit 124 confirms the normality of the opposing node.

(Notification unit 125)
The notification unit 125 transmits a notification to the maintenance person.

(Recovery unit 126)
The recovery unit 126 recovers the cluster.

5. Details of cluster recovery process
An overview of one cluster recovery process is given above with reference to Figure 8. The details of the cluster recovery process are given in this section with the aid of sequence diagrams.

A sequence diagram of an example of cluster recovery processing will be described with reference to FIG. 10. The example of cluster recovery processing includes processing for recovering a cluster in which a node has failed. The processing for recovering a cluster in which a node has failed is performed, for example, by the host 100 in FIG. 7.

FIG. 10 is a sequence diagram showing process P100, which is an example of a process for recovering a cluster in which a node has failed. Process P100 is performed by the STONITH device and a new module of the host 100a. The STONITH device corresponds to the fencing unit 121. The new module corresponds to the identification unit 122, the request unit 123, the confirmation unit 124, the notification unit 125, and the recovery unit 126.

Process P100 is based on the assumption that the fencing action is "stop."

Host 100a (node #1) corresponds to, for example, the #1 system in Figures 3, 4, 5, and 6. In this case, host 100b (node #0) corresponds to the standby system in Figures 3, 4, 5, and 6. Control server 300 corresponds to the virtualization infrastructure control server in Figures 3, 4, 5, and 6.

Host 100a (e.g., fencing unit 121) performs STONITH on host 100b via control server 300 (step S101).

The new module receives notification of STONITH completion from the STONITH device (step S102). A "trap" command may be used for the notification. The new module begins operation upon receiving the notification.

The host 100a (e.g., the identification unit 122) sends an instruction to the control server 300 to check the status of the VM (step S103).

The control server 300 sends a request for the VM status to the host 100b (step S104).

The host 100b sends the status of the VM to the control server 300 (step S105).

The control server 300 returns the status of the VM to the host 100a (step S106). The host 100a then checks the status of the opposing node (VM).

If the VM (opposing node) is powered off, the host 100a (e.g., the request unit 123) issues an instruction to the control server 300 to power on the opposing node (step S107).

The control server 300 powers on the opposing node by sending a power-on request to the opposing node (step S108).

The control server 300 sends a request for the VM status to the host 100b (step S109).

Host 100b sends the VM status to control server 300 (step S110).

The control server 300 returns the VM state to the host 100a (step S111).
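The message flow of steps S101 to S111 can be traced with a small simulation. This is a sketch under stated assumptions: the classes `OpposingHost` and `ControlServer`, their methods, and the use of a plain list as a step log are all hypothetical and only mirror the order of steps described above.

```python
class OpposingHost:
    """Hypothetical stand-in for host 100b, which runs the opposing VM."""

    def __init__(self):
        self.vm_powered_on = True

    def report_vm_state(self):
        return "on" if self.vm_powered_on else "off"


class ControlServer:
    """Hypothetical stand-in for control server 300."""

    def __init__(self, host_b, log):
        self.host_b = host_b
        self.log = log

    def stonith_stop(self):
        self.host_b.vm_powered_on = False

    def query_vm_state(self, request_step, reply_step):
        self.log.append(request_step)          # request sent to host 100b
        state = self.host_b.report_vm_state()
        self.log.append(reply_step)            # state sent back by host 100b
        return state

    def power_on_vm(self):
        self.host_b.vm_powered_on = True


def run_power_sequence(server, log):
    log.append("S101")                         # STONITH via the control server
    server.stonith_stop()
    log.append("S102")                         # completion notification received
    log.append("S103")                         # instruct status check
    state = server.query_vm_state("S104", "S105")
    log.append("S106")                         # state returned to host 100a
    if state == "off":
        log.append("S107")                     # instruct power-on
        server.power_on_vm()
        log.append("S108")                     # control server powers on the VM
        state = server.query_vm_state("S109", "S110")
        log.append("S111")                     # new state returned to host 100a
    return state


steps = []
result = run_power_sequence(ControlServer(OpposingHost(), steps), steps)
```

If the VM is already powered on at step S106, the sketch skips steps S107 to S111, which matches the conditional wording of step S107.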

In addition to the steps above, host 100a performs the following steps to further shorten recovery time. The following steps are to check the normality of services and processes on the remote node. Specifically, host 100a checks whether there are any events that require maintenance personnel to respond.

The host 100a (e.g., the confirmation unit 124) sends an instruction to the control server 300 to confirm the normality (step S112). The host 100a then checks the normality of the services and processes.

If a maintenance person needs to take action, the host 100a (e.g., the notification unit 125) notifies the maintenance person. For example, the host 100a sends a notification regarding the maintenance request to the maintenance person (step S113). In this case, the host 100a does not recover the cluster.

If no maintenance personnel is required to take action, host 100a (e.g., recovery unit 126) incorporates the opposing node into the cluster and then recovers the cluster (step S114).

It is safe for the cluster system to incorporate the opposing node into the cluster after confirming its normality. Therefore, process P100 is based on the assumption that the opposing node is configured not to be automatically incorporated into the cluster when the VM starts up.
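The branch between steps S113 and S114 can be sketched as a single function. This is an illustration only: the shape of the health data, the callback parameters, and the return labels are assumptions and are not specified in the disclosure.

```python
def finish_recovery(service_health, notify, incorporate):
    """Decide between notifying maintenance (S113) and recovering (S114).

    service_health maps a service or process name to True if it is normal.
    notify and incorporate are caller-supplied callbacks (hypothetical).
    """
    failed = sorted(name for name, ok in service_health.items() if not ok)
    if failed:
        # Maintenance is required, so the cluster is not recovered.
        notify("maintenance required: " + ", ".join(failed))
        return "maintenance_requested"
    # Everything is normal, so the opposing node rejoins the cluster.
    incorporate()
    return "cluster_recovered"


messages = []
joined = []
healthy = finish_recovery({"db": True, "web": True}, messages.append,
                          lambda: joined.append("node0"))
unhealthy = finish_recovery({"db": False, "web": True}, messages.append,
                            lambda: joined.append("node0"))
```

Keeping the incorporation step behind the health check reflects the assumption stated above that the opposing node does not rejoin the cluster automatically when the VM starts.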

6. Effects
This technology is based on the premise that the action of fencing is "stop." Therefore, the reliability of STONITH is guaranteed.

After STONITH is completed, the state of the VMs on the opposing node will be powered on. The resulting power state of the opposing node will be the same as the node that performed STONITH. In addition, if no maintenance intervention is required, the host 100 will automatically recover the cluster and then the equipment. This can reduce operational burden and shorten recovery time (the time from STONITH completion to resumption of service).

7. Hardware Configuration
FIG. 11 is a diagram showing an example of a hardware configuration of a computer 1000. The systems and methods described in this specification are implemented by the computer 1000, for example.

Computer 1000 is an example of a computer that implements host 100 by executing a program. Computer 1000 has memory 1010 and CPU 1020. Computer 1000 also has hard disk drive interface 1030, disk drive interface 1040, serial port interface 1050, video adapter 1060, and network interface 1070. These components are connected by bus 1080.

The memory 1010 includes a ROM (Read Only Memory) 1011 and a RAM 1012. The ROM 1011 stores a boot program such as a BIOS (Basic Input Output System). The hard disk drive interface 1030 is connected to a hard disk drive 1090. The disk drive interface 1040 is connected to a disk drive 1100. A removable storage medium (e.g., a magnetic disk or an optical disk) can be inserted into the disk drive 1100. The serial port interface 1050 is connected to, for example, a mouse 1110 and a keyboard 1120. The video adapter 1060 is connected to, for example, a display 1130.

The hard disk drive 1090 stores an OS 1091, an application program 1092, a program module 1093, and program data 1094. The program executed by the computer 1000 defines multiple operations of the host 100. This program may be implemented as a program module 1093 written in code executable by the computer 1000. The program module 1093 is stored, for example, in the hard disk drive 1090. For example, the hard disk drive 1090 stores a program module 1093 for executing processing similar to the functions of the components of the host 100. The hard disk drive 1090 may be replaced with an SSD (Solid State Drive).

Hard disk drive 1090 can store a failure recovery acceleration program for cluster recovery processing. Hard disk drive 1090 may store a computer program product including a failure recovery acceleration program (a plurality of instructions). When executed, the failure recovery acceleration program performs one or more methods, such as those described above.

The setting data used in the various processes described above can be implemented as program data 1094. The setting data is stored, for example, in the memory 1010 or the hard disk drive 1090. The CPU 1020 loads the program module 1093 or the program data 1094 stored in the memory 1010 or the hard disk drive 1090 into the RAM 1012 as necessary. The CPU 1020 then performs the various processes described above.

The program module 1093 and program data 1094 may be stored in a removable storage medium instead of the hard disk drive 1090. The CPU 1020 may load the program module 1093 and program data 1094 via the disk drive 1100 or the like. Alternatively, the program module 1093 and program data 1094 may be stored in another computer connected to the computer 1000 via a network (LAN, WAN, etc.). In this case, the CPU 1020 may load the program module 1093 and program data 1094 via the network interface 1070.

8. Summary of the embodiment
As described above, host 100 includes fencing unit 121, identification unit 122, and request unit 123. In at least one embodiment, fencing unit 121 instructs a server controlling the cluster to execute a power operation on the standby node. In at least one embodiment, identification unit 122 acquires the power state of the standby node in response to receiving a notification of the completion of the power operation. In at least one embodiment, request unit 123 determines whether the power state of the standby node is powered on or powered off. In response to determining that the power state of the standby node is powered off, request unit 123 instructs the server to power on the standby node.

In some embodiments, the power operation is STONITH.

In some embodiments, the fencing action of STONITH is to stop.

As described above, the host 100 includes a confirmation unit 124. In at least one embodiment, the confirmation unit 124 confirms the normality of the services or processes of the standby node.

As described above, the host 100 includes a recovery unit 126. The recovery unit 126 incorporates the standby node into the cluster when the service or process is normal.

As described above, the host 100 includes a notification unit 125. The notification unit 125 notifies the maintenance personnel of a maintenance request for the standby node when a service or process is not normal.

9. Addendum
Finally, the above description is supplemented with other embodiments. Various embodiments have been described above with reference to the drawings. These embodiments are exemplary, and the above description is not intended to limit the present disclosure to these embodiments. The features described in this specification can be realized in various ways, including modifications and improvements based on the knowledge of those skilled in the art.

(Various variations)
In this specification, some processes have been described as being performed automatically. Some of these processes may be performed manually. Some other processes have been described as being performed manually. All or part of these other processes may be performed automatically using known methods.

Various implementations of the host 100 are described herein or shown in the drawings. Some implementations relate to information that includes various data, data processing procedures, specific names, or parameters. Such implementations may be modified in any way unless otherwise specified. For example, the various data are not limited to the data shown in the drawings.

The components of the system are shown in the drawings. The illustrated components conceptually represent the functionality of the system. The components are not necessarily physically configured as shown in the drawings. The components may be integrated or distributed, and the specific form of the system is not limited to that shown in the drawings. All or part of the system may be functionally or physically integrated or distributed depending on various loads and usage conditions.

(Terms expressing components)
The terms module, section, -er suffix or -or suffix may be read as unit, means, circuit, etc. For example, a communication module, a control module, and a storage module may be read as a communication unit, a control unit, and a storage unit, respectively.

(Configuration of the control unit)
The configuration of the control unit 120 shown in Fig. 9 is exemplary, and the data processing described with respect to a particular unit does not necessarily have to be performed by that particular unit. For example, the identification unit 122 may perform the data processing described with respect to the request unit 123. Furthermore, the control unit 120 may include other units not shown in Fig. 9. The other units may perform the data processing described with respect to the control unit 120.

(Data Processing Device)
The data processing device described with respect to the control unit 120 is not limited to the specific hardware described above. The data processing device may be, for example, various computers or integrated circuits such as an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array), or a GPGPU (General Purpose Graphic Processing Unit).

REFERENCE SIGNS LIST
1 Environment
100 Host
110 Communication unit
120 Control unit
121 Fencing unit
122 Identification unit
123 Request unit
124 Confirmation unit
125 Notification unit
126 Recovery unit
130 Storage unit
200 Network
300 Control server

Claims

1. A failure recovery acceleration system comprising:
a first instruction unit that instructs a server that controls a cluster to execute a power operation on a standby node;
an acquisition unit that acquires a power state of the standby node in response to receiving a notification of the completion of the power operation;
a determination unit that determines whether the power state of the standby node is powered on or powered off; and
a second instruction unit that instructs the server to power on the standby node in response to determining that the power state of the standby node is powered off.
2. The failure recovery acceleration system according to claim 1, wherein the power operation is STONITH (Shoot The Other Node In The Head).
3. The failure recovery acceleration system according to claim 2, wherein the fencing action of the STONITH is a stop.
4. The failure recovery acceleration system according to claim 1, further comprising a confirmation unit that confirms normality of a service or process of the standby node.
5. The failure recovery acceleration system according to claim 4, further comprising an incorporating unit that incorporates the standby node into the cluster when the service or the process is normal.
6. The failure recovery acceleration system according to claim 4, further comprising a notification unit that notifies a maintenance person of a maintenance request for the standby node when the service or the process is not normal.
7. A computer-implemented failure recovery acceleration method comprising:
a first instruction step of instructing a server that controls a cluster to execute a power operation on a standby node;
an acquisition step of acquiring a power state of the standby node in response to receiving a notification of completion of the power operation;
a determination step of determining whether the power state of the standby node is powered on or powered off; and
a second instruction step of instructing the server to power on the standby node in response to determining that the power state of the standby node is powered off.
8. A failure recovery acceleration program that causes a computer to function as the failure recovery acceleration system according to any one of claims 1 to 6.