CN110912780B - High-availability cluster detection method, system and controlled terminal - Google Patents

High-availability cluster detection method, system and controlled terminal

Info

Publication number
CN110912780B
Authority
CN
China
Prior art keywords
node
master node
channel
slave
switch
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911281240.2A
Other languages
Chinese (zh)
Other versions
CN110912780A (en)
Inventor
过育红
朱正东
仇大玉
张银滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huayun Data Holding Group Co Ltd
Original Assignee
Huayun Data Holding Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huayun Data Holding Group Co Ltd filed Critical Huayun Data Holding Group Co Ltd
Priority to CN201911281240.2A priority Critical patent/CN110912780B/en
Publication of CN110912780A publication Critical patent/CN110912780A/en
Application granted granted Critical
Publication of CN110912780B publication Critical patent/CN110912780B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/10Active monitoring, e.g. heartbeat, ping or trace-route
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L45/00Routing or path finding of packets in data switching networks
    • H04L45/70Routing based on monitoring results
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/25Routing or path finding in a switch fabric
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00Packet switching elements
    • H04L49/35Switches specially adapted for specific applications
    • H04L49/354Switches specially adapted for specific applications for supporting virtual local area networks [VLAN]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/104Peer-to-peer [P2P] networks
    • H04L67/1044Group management mechanisms 
    • H04L67/1046Joining mechanisms
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Hardware Redundancy (AREA)

Abstract

The invention provides a high-availability cluster detection method, a high-availability cluster detection system based on the method, and a controlled terminal. The method performs heartbeat detection, based on the VRRP protocol, on a master node and a slave node configured with Keepalived; establishes a first channel for health checks between the master node and the slave node and a second switch; establishes a second channel for health checks between the master node and the slave node and a third switch; and elects the slave node as the new master node only when the first channel and the second channel simultaneously trigger the master-node reselection policy. The method and the system significantly improve the conventional Keepalived heartbeat detection mechanism between master and slave nodes, avoid master-slave switching caused by insubstantial downtime of the master node (such as busy services or detection timeouts), and ensure the reliability of the high-availability cluster and the high availability of its services.

Description

High-availability cluster detection method, system and controlled terminal
Technical Field
The invention relates to the technical field of computers, in particular to a high-availability cluster detection method, a high-availability cluster detection system and a controlled terminal.
Background
With the rapid development of the internet, user service volumes keep growing, and the requirements on service reliability and performance keep rising. To meet these requirements, real-world deployments often use an HA (high availability) cluster to process services. In a high-availability cluster, the nodes must cooperate and stay consistent for the cluster to process services effectively. A problem on any one node affects the performance of the whole cluster, so the cluster must be able to deal with a faulty node quickly in order to preserve its reliability and the effectiveness of service processing.
A high-availability cluster usually contains a master node (Master) and a plurality of slave nodes (Backup), which typically rely on a combination of Keepalived and HAProxy to keep the cluster highly available. Keepalived is implemented on top of the VRRP protocol (Virtual Router Redundancy Protocol). The master node and each slave node maintain their state through a heartbeat mechanism. When a slave node can no longer receive the VRRP control messages sent by the master node, the master node is considered to be down. In that case, one of the slave nodes is selected as the new master node according to its VRRP priority. The new master node starts a resource management module to take over the resources, services or processes that were running on the original master node.
In existing heartbeat detection between a master node and a slave node, however, a missing heartbeat does not necessarily mean that the master node is down; the master node may simply be busy, or the detection may have timed out. If the master node is switched blindly as soon as its heartbeat is not detected, a split-brain condition can result. Split-brain refers to the situation in a High Availability (HA) system in which two connected nodes lose contact and the originally unified system splits into two independent nodes that then contend for shared resources, causing system confusion and data corruption.
Meanwhile, Chinese patent publication No. CN109286525A discloses a dual-machine backup method based on MQTT communication and a heartbeat between the main and standby machines. However, this prior art for heartbeat detection based on the MQTT protocol has the following disadvantages: (1) the MQTT protocol has no complete SDK, and different heterogeneous terminals each need a corresponding software SDK package to communicate with the MQTT server; (2) the MQTT protocol does not support load balancing and cannot effectively withstand high concurrency or malicious attacks; (3) it does not support a user management interface, point-to-point communication, group communication and group management, or offline messages; (4) because an MQTT server must be configured, the topological complexity of the cluster increases, as do the cost of building the cluster and the difficulty of maintaining it later.
In view of the above, there is a need to improve the detection method for high-availability clusters in the prior art to solve the above problems.
Disclosure of Invention
The invention aims to disclose a high-availability cluster detection method, a high-availability cluster detection system and a controlled terminal that overcome the defects of the prior art. In particular, the invention addresses the master-slave switching that the conventional Keepalived heartbeat detection mechanism performs when the master node suffers insubstantial downtime (for example, because of busy services or a detection timeout), solves the resulting split-brain problem across the cluster, and ensures the reliability of the high-availability cluster and the high availability of its services.
To achieve the above object, the present application first provides a high-availability cluster detection method, comprising:
performing heartbeat detection, based on the VRRP protocol, on the master node and the slave node configured with Keepalived;
establishing a first channel for health checks between the master node and the slave node and the second switch, and establishing a second channel for health checks between the master node and the slave node and the third switch;
and electing the slave node as the new master node only when the first channel and the second channel simultaneously trigger the master-node reselection policy.
As a further improvement of the present invention, the method comprises:
performing heartbeat detection, based on the VRRP protocol, on the master node and the slave node configured with Keepalived;
establishing a first channel for health checks between the master node and the slave node and the second switch, and establishing a second channel for health checks between the master node and the slave node and the third switch;
and establishing a BFD session between the master node and the slave node, and electing the slave node as the new master node only when the first channel and the second channel simultaneously trigger the master-node reselection policy.
As a further improvement of the present invention, the master node and the slave node both establish sessions with cluster nodes in the high-availability cluster through the second switch and the third switch;
the first channel is a session channel established, through the second switch, between the master node and the slave node and at least one of the control node, the computing node, the network node or the storage node;
the second channel is a session channel established, through the third switch, between the master node and the slave node and at least one of the control node, the computing node, the network node or the storage node.
As a further improvement of the invention, the method also comprises the following steps:
multiplexing the authentication data fields contained in the VRRP messages transmitted between the master node and the slave nodes that have already established sessions, in order to determine whether the first channel and the second channel simultaneously trigger the master-node reselection policy; and, when they do, establishing the BFD session between the master node and the new master node determined from among the plurality of slave nodes.
As a further improvement of the present invention, the master-node reselection policy is described jointly by a priority and a weight value, and is used to determine a new master node from among a plurality of slave nodes.
As a further improvement of the invention, the health check includes a TCP check, an HTTP check, a script-based check, a timeout check, or a load check.
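As a hedged illustration of how a priority and a weight value might jointly rank candidates, the sketch below assumes that the effective priority is the base priority plus a signed weight, similar in spirit to how Keepalived adjusts priority with tracked-script weights. The `Candidate` type, the clamping range and the tie-break are hypothetical and are not taken from the patent.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Candidate:
    """A slave node that may be promoted (hypothetical structure)."""
    name: str
    priority: int  # base VRRP priority
    weight: int    # signed adjustment, e.g. negative after a failed check


def effective_priority(c: Candidate) -> int:
    # Assumption: priority and weight are simply added, then clamped to
    # the 1..254 range that VRRP allows for backup routers.
    return max(1, min(254, c.priority + c.weight))


def elect(candidates: Sequence[Candidate]) -> Candidate:
    """Return the candidate with the highest effective priority.

    On a tie, the first candidate in the sequence wins.
    """
    if not candidates:
        raise ValueError("no slave node available for election")
    return max(candidates, key=effective_priority)
```

With this rule, a slave node with a high base priority can still lose the election if its health checks have lowered its weight.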
As a further improvement of the invention, the method also comprises the following steps:
after a new master node is determined in a plurality of slave nodes, synchronizing the state information of the new master node to the slave nodes, and drifting the virtual IP to the new master node;
synchronously configuring the state information of the new master node to the cluster node mounted to the third switch;
wherein,
the cluster nodes comprise control nodes, computing nodes, network nodes and storage nodes.
Based on the same inventive concept, the present application further provides a high-availability cluster detection system, comprising:
a heartbeat detection unit, used for performing heartbeat detection, based on the VRRP protocol, on the master node and the slave node configured with Keepalived;
a first health check unit, used for performing health checks on a first channel established between the master node and the slave node and the second switch;
a second health check unit, used for performing health checks on a second channel established between the master node and the slave node and a third switch;
and a decision unit, used for electing the slave node as the new master node only when the first channel and the second channel simultaneously trigger the master-node reselection policy.
As a further improvement of the present invention, the high availability cluster detection system operates in a Zookeeper cluster.
Finally, the present application also provides a controlled terminal, comprising: a processor, a storage device, and a communication bus establishing a communication connection between the processor and the storage device;
the processor is configured to execute one or more programs stored in the storage device to implement the high-availability cluster detection method disclosed in any of the above aspects.
Compared with the prior art, the invention has the following beneficial effects:
the high-availability cluster detection method and system significantly improve the conventional Keepalived heartbeat detection mechanism between master and slave nodes, avoid master-slave switching caused by insubstantial downtime of the master node (such as busy services or detection timeouts), effectively prevent split-brain in the high-availability cluster, and ensure the reliability of the cluster and the high availability of its services.
Drawings
FIG. 1 is a general flow chart of a method for detecting a high availability cluster according to the present invention;
FIG. 2 is a schematic diagram, in a first example, of a high-availability cluster that applies the high-availability cluster detection method of the present invention and performs health checks through a first channel and a second channel to determine whether master-slave switching occurs;
FIG. 3 is a schematic diagram, in a second example, of a high-availability cluster that applies the high-availability cluster detection method of the present invention and performs health checks through a first channel and a second channel to determine whether master-slave switching occurs;
FIG. 4 is a schematic diagram, in a third example, of a high-availability cluster that applies the high-availability cluster detection method of the present invention and performs health checks through a first channel and a second channel to determine whether master-slave switching occurs;
FIG. 5 is a topology diagram of a high availability cluster detection system of the present invention;
FIG. 6 is a topology diagram of a controlled terminal applying the high availability cluster detection method.
Detailed Description
The present invention is described in detail with reference to the embodiments shown in the drawings, but it should be understood that these embodiments are not intended to limit the present invention, and those skilled in the art should understand that functional, methodological, or structural equivalents or substitutions made by these embodiments are within the scope of the present invention.
Before the embodiments are described in detail, the technical terms used in the various embodiments of the present application need to be explained and defined. In the present application, the term "cluster" means a computer system formed by a set of loosely integrated computer software or hardware, connected by a local area network, a wide area network or other communication means, whose members cooperate closely to perform computing tasks. The term "cluster" has a technical meaning equivalent to the term "computer cluster" or the term "data center" in the present application. The term "master-slave switching" means switching between the role of master node and the role of slave node. The term "master server" has a technical meaning equivalent to the term "master node", and the term "slave server" has a technical meaning equivalent to the term "slave node". The term "substantial downtime" is the opposite of the term "insubstantial downtime", where "insubstantial downtime" refers to a state in which the master node is identified as "Fail" because VRRP control messages cannot be exchanged between the master node and one or more slave nodes owing to busy traffic or a detection timeout. Finally, unless otherwise specified, the term "packet" in the present application refers specifically to a VRRP control packet formed according to the VRRP protocol. The present application discloses a method, a system and a controlled terminal for detecting a high availability cluster.
The first embodiment is as follows:
Referring to FIG. 1 to FIG. 4, the present embodiment discloses a high-availability cluster detection method (hereinafter referred to as the "method"). The method is applicable to computer clusters, data centers (IDCs) and cloud computing platforms, and is particularly suitable for high-availability cluster scenarios. In this embodiment, one of the goals of high availability is to eliminate single points of failure in the infrastructure. A single point of failure is a component of the technology stack whose unavailability (Fail) causes a service disruption. High-availability clusters rely on systems that allow flexible IP address remapping, such as floating IPs. On-demand IP address remapping avoids the propagation and caching problems inherent in DNS changes by providing static IP addresses that can easily be remapped when needed. The domain name can remain associated with the same IP address, while the IP address itself can be moved between servers.
Referring to FIG. 1, the method for detecting a high availability cluster includes the following steps:
First, step S1 is executed to perform heartbeat detection, based on the VRRP protocol, on the master node and the slave node configured with Keepalived. The method disclosed in the present embodiment is applied to the high availability cluster shown in FIG. 2. The high availability cluster comprises a master node A, a slave node B, a control node 21, a computing node 22, a network node 23 and a storage node 24 (hereinafter, the control node 21, the computing node 22, the network node 23 and the storage node 24 are collectively referred to as "functional nodes"). It should be noted that the functional nodes shown are only an example; in an actual configuration there may be multiple computing nodes 22 and storage nodes 24, forming a distributed computing architecture. In particular, the storage nodes 24 may be configured as a Ceph distributed storage architecture, a DAS storage architecture, Network Attached Storage (NAS), or a Storage Area Network (SAN) architecture. Furthermore, several high-availability clusters can be joined over a local area network to form a high-availability cluster with stronger fault tolerance. Therefore, the number of each of the control node 21, the computing node 22, the network node 23 and the storage node 24 shown in FIG. 2 may be one or more.
The control node 21, the computing node 22, the network node 23 and the storage node 24 are all coupled to the first switch 10 and communicate through it. The third switch 30 is connected to the router 40 and accesses the internet 50 through the router 40. The router 40 performs floating IP translation through its built-in floating IP mechanism to respond to user-initiated access requests.
High Availability (HA) refers to the ability to keep services accessible when a single component of the local system fails, whether the failure lies in a business process, a physical facility, or IT software or hardware. The high-availability cluster uses a combination of Keepalived and HAProxy and relies on the Keepalived heartbeat detection mechanism to decide whether operations such as failover, automatic capacity expansion or master-slave switching need to be performed between master node A and one or more slave nodes B, so that users receive uninterrupted, high-quality service. The VRRP protocol used by the Keepalived heartbeat detection mechanism lets a shared VIP (virtual IP) drift back and forth between two (or more) HAProxy nodes, so that only one IP is exposed externally.
Keepalived is based on VRRP (Virtual Router Redundancy Protocol). VRRP is a protocol for achieving high availability: N devices that provide the same function form a high-availability cluster. The high-availability cluster contains a master node A and one or more slave nodes B, which form a slave node cluster 40 (see FIG. 3). Master node A holds the VIP that serves the user (User). Master node A and slave node B maintain their state through a heartbeat mechanism. When slave node B cannot receive the VRRP packet, master node A is considered to be down, and a slave node must then be selected from the slave node cluster 40 in FIG. 3, according to the priority (Priority) carried in the VRRP control packet, to serve as the new master node A. The new master node A starts a resource takeover module (a Pacemaker cluster management service) to take over the resources or services that were running on the original master node A, which has suffered substantial downtime. The method disclosed in this embodiment can distinguish substantial downtime from insubstantial downtime of master node A, and thereby prevents master-slave switching between master node A and slave node B caused by insubstantial downtime such as busy traffic or a detection timeout. It should be noted that, in the various embodiments of the present application, the term "master-slave switching" refers to redefining a server and/or database whose role was originally slave node (Slave) as master node (Master), so as to improve the stability and robustness of the high availability cluster.
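For reference, VRRPv2 (RFC 3768) derives the time a backup waits before declaring the master down from the advertisement interval and the backup's own priority. The minimal sketch below computes that interval; the function names are illustrative, and Keepalived's actual timers may differ depending on its configuration and VRRP version.

```python
def skew_time(priority: int) -> float:
    """Skew_Time in seconds as defined for VRRPv2: (256 - Priority) / 256."""
    if not 1 <= priority <= 254:
        raise ValueError("backup priority must be in 1..254")
    return (256 - priority) / 256


def master_down_interval(priority: int, advert_interval_s: int = 1) -> float:
    """Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time."""
    return 3 * advert_interval_s + skew_time(priority)
```

Because the skew shrinks as the priority grows, the highest-priority slave node times out first. A master that is merely busy for longer than this interval looks the same as a dead master to a plain heartbeat, which is the case the two health-check channels are meant to filter out.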
Then, step S2 is executed to establish a first channel for health checks between the master node and the slave node and the second switch, and to establish a second channel for health checks between the master node and the slave node and the third switch. It should be noted that, in this embodiment, "first channel" and "second channel" each refer generically to a class of channels.
Both master node A and slave node B establish sessions with cluster nodes in the high availability cluster through the second switch 20 and the third switch 30. The number of first channels and second channels is determined by the sessions so established. The first channel is a session channel established, through the second switch 20, between master node A and slave node B and at least one of the control node 21, the computing node 22, the network node 23 or the storage node 24. The second channel is a session channel established, through the third switch 30, between master node A and slave node B and at least one of the functional nodes, namely the control node 21, the computing node 22, the network node 23 or the storage node 24.
Specifically, the first channel may be a transmission channel for VRRP control packets, formed through the second switch 20 and based on the VRRP protocol, among master node A, the control node 21 and slave node B in FIG. 2; or among master node A, the control node 21, the computing node 22 and slave node B in FIG. 2; or among master node A, the control node 21, the computing node 22, the network node 23, the storage node 24 and slave node B in FIG. 2; or among master node A and slave node B in FIG. 2 and any one or several of the functional nodes (i.e., the control node 21, the computing node 22, the network node 23 and the storage node 24).
Similarly, the second channel may be a transmission channel for VRRP control packets, formed through the third switch 30 and based on the VRRP protocol, among master node A, the control node 21 and slave node B in FIG. 2; or among master node A, the control node 21, the computing node 22 and slave node B in FIG. 2; or among master node A, the control node 21, the computing node 22, the network node 23, the storage node 24 and slave node B in FIG. 2; or among master node A and slave node B in FIG. 2 and any one or several of the functional nodes (i.e., the control node 21, the computing node 22, the network node 23 and the storage node 24 shown in FIG. 2).
It can be seen that, in this embodiment, the first channel and the second channel may each consist of a single VRRP control packet transmission channel or of multiple such channels.
Preferably, in this embodiment, step S2 of the method further includes: multiplexing the authentication data fields contained in the VRRP messages transmitted between master node A and slave node B, which have already established a session, in order to determine whether the first channel and the second channel simultaneously trigger the master-node reselection policy; and, when they do, establishing the BFD session between the master node and the new master node determined from among the plurality of slave nodes. Then, after the BFD session is established, and only when the first channel and the second channel simultaneously trigger the master-node reselection policy, the slave node is elected as the new master node. The new master node is selected, according to the master-node reselection policy, from the plurality of slave nodes B included in the slave node cluster 40 of FIG. 3.
In particular, the method disclosed in this embodiment enables the second channel to perform health checks on the high-availability cluster. This prevents the master-slave switching that would occur if health checks ran only over the first channel and master node A suffered insubstantial downtime, such as busy traffic or a detection timeout. It also effectively avoids split-brain in the high-availability cluster, thereby ensuring the reliability of the cluster and the high availability of its services. The format of the VRRP control packet used by the high availability cluster detection method disclosed in this embodiment is as follows.
[Figure: format of the VRRP control packet, including the Authentication Data (1) and Authentication Data (2) fields]
When master node A and slave node B exchange VRRP control messages, the VIPs of the first channel and the second channel are distributed to each node (see the plurality of slave nodes B included in the slave node cluster 40 in FIG. 3). To this end, the existing Authentication Data fields in the VRRP control message, namely Authentication Data (1) and Authentication Data (2), are multiplexed. These fields are retained for backward compatibility with RFC 2338 and are otherwise no longer used: they are set to 0 when a VRRP message is sent and ignored when a VRRP control message is received. The Authentication Data fields can therefore be multiplexed to advertise each node's health-check IP, and each node records the IP address after receiving a VRRP control message that carries a health-check IP. In this embodiment, the term "authentication data field" refers to these Authentication Data fields.
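The reuse of the two Authentication Data fields can be sketched as follows. The layout follows the VRRPv2 advertisement (an 8-byte header, then the virtual IP addresses, then two 4-byte Authentication Data fields). Putting one IPv4 health-check address in each field is an assumption made for illustration, the function names are hypothetical, and the checksum is left at zero rather than computed.

```python
import socket
import struct


def build_vrrp_v2(vrid, priority, vips, first_check_ip, second_check_ip,
                  advert_interval_s=1):
    """Build a VRRPv2-style advertisement carrying two health-check IPs.

    Assumption: Authentication Data (1) carries the first-channel check IP
    and Authentication Data (2) carries the second-channel check IP.
    """
    version_type = (2 << 4) | 1  # version 2, type 1 (ADVERTISEMENT)
    header = struct.pack(
        "!BBBBBBH",
        version_type, vrid, priority, len(vips),
        0,                  # Auth Type 0: no authentication
        advert_interval_s,
        0,                  # checksum not computed in this sketch
    )
    addresses = b"".join(socket.inet_aton(ip) for ip in vips)
    auth1 = socket.inet_aton(first_check_ip)
    auth2 = socket.inet_aton(second_check_ip)
    return header + addresses + auth1 + auth2


def read_health_check_ips(packet):
    """Return the two IPs stored in the Authentication Data fields."""
    count = packet[3]        # Count IP Addrs
    offset = 8 + 4 * count   # skip the header and the VIP list
    if len(packet) < offset + 8:
        raise ValueError("packet too short for Authentication Data fields")
    auth1 = packet[offset:offset + 4]
    auth2 = packet[offset + 4:offset + 8]
    return socket.inet_ntoa(auth1), socket.inet_ntoa(auth2)
```

A receiver that follows the standard simply ignores these eight bytes, so unmodified VRRP peers are not affected by the extra payload.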
The BFD (Bidirectional Forwarding Detection) protocol is a simple "Hello" protocol that provides a lightweight, fast method for detecting the connectivity of the forwarding path between two adjacent routers or switches. BFD checks link connectivity in both directions through a three-way handshake mechanism. A pair of systems periodically sends VRRP control messages over the session channel established between them; if one system receives no VRRP control message from its peer within a sufficient time, some part of the bidirectional channel between that system and its neighbor is considered to have failed. Communication faults on the forwarding path can thus be detected quickly, which speeds up activation of a backup forwarding path and improves the performance of the existing network. BFD can be used to detect any form of path, including directly connected physical links, virtual circuits, tunnels, MPLS LSPs, and even multi-hop routed paths. Even unidirectional links (such as MPLS TE tunnels) can be checked as long as a return path exists.
The detection mechanism provided by the BFD protocol is independent of the interface media type, the encapsulation format, and the associated upper-layer protocols such as OSPF, BGP and RIP. The BFD protocol establishes a session between two routers (i.e., the second switch 20 and the third switch 30 in FIG. 2 to FIG. 4) and, by quickly sending a failure-detection message to the running routing protocol, triggers that protocol to recalculate the routing table, which greatly reduces the convergence time of the entire network. The BFD protocol itself cannot discover neighbors; an upper-layer protocol must tell it which neighbor to establish a session with.
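The timeout rule described above can be approximated with a small sketch. It assumes, in the spirit of BFD, that a session is declared down when no packet has arrived for a detect multiplier times the agreed receive interval. The class and field names are hypothetical, timestamps are passed in explicitly so the logic stays deterministic, and the three-way handshake and interval negotiation of real BFD are not modeled.

```python
from dataclasses import dataclass


@dataclass
class BfdLikeSession:
    """Simplified liveness tracker for one peer (not a full BFD state machine)."""
    rx_interval_s: float    # agreed interval between received packets
    detect_mult: int = 3    # number of missed intervals tolerated
    last_rx: float = 0.0    # timestamp of the most recent packet

    def on_packet(self, now: float) -> None:
        """Record that a control packet arrived at time `now`."""
        self.last_rx = now

    def detection_time(self) -> float:
        """Silence longer than this is treated as a path failure."""
        return self.rx_interval_s * self.detect_mult

    def is_down(self, now: float) -> bool:
        return (now - self.last_rx) > self.detection_time()
```

For example, with a 0.5 s interval and a multiplier of 3, the peer is reported down once more than 1.5 s pass without a packet.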
Meanwhile, in the present embodiment, the master node reselection policy is described jointly by a priority and a weight value, so as to determine a new master node from a plurality of slave nodes. Specifically, the health check includes TCP detection, HTTP detection, check script detection, timeout detection, or load detection.
The high-availability cluster detection method disclosed in this embodiment further comprises the following steps: after a new master node is determined from the plurality of slave nodes, the state information of the new master node is synchronized to the slave nodes, and the virtual IP is drifted (migrated) to the new master node. The state information of the new master node is also synchronously configured to the cluster nodes mounted to the third switch 30; the cluster nodes include a control node 21, a computing node 22, a network node 23, and a storage node 24.
Finally, step S3 is executed: the slave node is elected as the new master node only when the first channel and the second channel trigger the master node reselection policy at the same time.
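Of the health check types listed above, TCP detection is the simplest to illustrate. The sketch below is an assumption about how such a probe might look, using only the Python standard library; the function name `tcp_check` and the default timeout are hypothetical.

```python
import socket


def tcp_check(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened
    within timeout_s seconds, otherwise False."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts and timeouts.
        return False
```

A probe of this kind reports only reachability of the port; the HTTP, script, timeout and load checks named in the embodiment would need their own logic.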
When a high availability cluster is in operation, one master node A and at least one slave node B must already have been created at any given moment. Based on the first channel for health check established between the master node A and the slave node B and the second switch 20, and the second channel for health check established between the master node A and the slave node B and the third switch 30, it can be determined whether a new master node needs to be elected when the opposite end fails to receive the VRRP control message. Based on the BFD session established between the master node A and the slave node B, when the first channel and the second channel simultaneously trigger the election policy for selecting a new master node, the master node A in its current state is determined to be down, and services and/or data are therefore migrated to the new master node determined according to the election policy. In this process, one slave node in the slave node cluster 40 (the slave node B1 or the slave node B2), which is in the Slave role, may be switched from the Backup state to the Master state to complete the master-slave switching operation.
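The dual-channel condition above amounts to a logical AND over the two channels. The following sketch illustrates it; the names `Verdict`, `classify` and `should_failover` are hypothetical, and mapping a single-channel trigger to "insubstantial downtime" is an interpretation of the embodiment rather than wording taken from it.

```python
from enum import Enum


class Verdict(Enum):
    HEALTHY = "healthy"
    INSUBSTANTIAL = "insubstantial downtime"  # only one channel triggered
    SUBSTANTIAL = "substantial downtime"      # both channels triggered


def classify(first_channel_triggered: bool,
             second_channel_triggered: bool) -> Verdict:
    """Classify the master node state from the two health check channels."""
    if first_channel_triggered and second_channel_triggered:
        return Verdict.SUBSTANTIAL
    if first_channel_triggered or second_channel_triggered:
        return Verdict.INSUBSTANTIAL
    return Verdict.HEALTHY


def should_failover(first_channel_triggered: bool,
                    second_channel_triggered: bool) -> bool:
    """A new master node is elected only when both channels trigger."""
    return classify(first_channel_triggered,
                    second_channel_triggered) is Verdict.SUBSTANTIAL
```

Under this sketch a trigger on one channel alone, for example caused by a busy service or a detection timeout, never leads to a master-slave switch.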
In this embodiment, the election policy of the master node is specifically described as follows.
The election policy of the master node is described by a priority (Priority) and a weight value (Weight). Referring to figs. 4 and 5 in combination, in this embodiment the master node A and each of the plurality of slave nodes (i.e., the slave node B1 and the slave node B2) in the slave node cluster 40 are configured with an initial priority, which is determined by the priority configuration item in the configuration file. In the initial state, the initial priority of the master node A is higher than that of any slave node. Keepalived adjusts priorities according to the weight value (Weight) of vrrp_script. When a master-slave switch is needed and a slave node is to be selected from the slave node cluster 40 and defined as the master node, or when the initial priorities of the master node A and the plurality of slave nodes are to be increased or decreased, the following rules (1) to (3) are applied in sequence.
Rule (1): when the weight value (Weight) is greater than 0, the effective priority is Priority + Weight if vrrp_script returns 0 (success), and Priority otherwise. A master-slave switch is performed when a slave node acting as a backup node (e.g., the slave node B1) finds that its own priority is greater than the priority advertised by the master node A.
Rule (2): when the weight value (Weight) is less than 0, the effective priority is Priority + Weight if vrrp_script returns a value other than 0 (failure), and Priority otherwise. A master-slave switch is performed when a slave node acting as a backup node (e.g., the slave node B1) finds that its own priority is greater than the priority advertised by the master node A.
Rule (3): when two slave nodes (e.g., the slave node B1 and the slave node B2 included in the slave node cluster 40) have the same priority, the IP addresses of the slave nodes sending the VRRP advertisements are compared, and the node with the larger IP address is elected as the new master node. The VRRP priority ranges from 0 to 255, with larger values indicating higher priority.
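Rules (1) to (3) can be sketched in Python as follows. This is an illustration only: the `Node` type, the function names, the example addresses, and the clamping of the result to the 0 to 255 range are assumptions and do not reproduce keepalived's internal implementation.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    ip: str
    priority: int  # initial priority from the configuration file
    weight: int    # weight value of the check script


def effective_priority(node: Node, script_exit_code: int) -> int:
    """Apply rule (1) or rule (2) to obtain the effective priority."""
    succeeded = script_exit_code == 0
    value = node.priority
    if node.weight > 0 and succeeded:
        value += node.weight   # rule (1): raise priority on success
    elif node.weight < 0 and not succeeded:
        value += node.weight   # rule (2): lower priority on failure
    # Assumption: keep the result inside the stated 0..255 range.
    return max(0, min(255, value))


def elect(nodes, exit_codes):
    """Pick the node with the highest effective priority; on a tie the
    node with the numerically larger IP address wins (rule (3))."""
    return max(
        nodes,
        key=lambda n: (effective_priority(n, exit_codes[n.name]),
                       int(ipaddress.IPv4Address(n.ip))),
    )
```

For example, a slave node with priority 90 and weight 20 whose script succeeds reaches 110, while a slave node with priority 100 and weight -30 whose script fails drops to 70, so the first is elected.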
The storage node 24 is connected and mapped to the control node 21 and the computing node 22 through the FC protocol or the iSCSI protocol, and the control node 21 and the computing node 22 are coupled through the first switch 10, so that data and/or messages can be forwarded and exchanged between these functional nodes. The control node 21 creates a Volume Group (VG) on the storage node 24 by using the pvcreate command together with the vgcreate command. The control node 21 and the computing node 22 each run a Pacemaker cluster management service, and the plurality of Pacemaker cluster management services jointly form a Pacemaker cluster. The Logical Volume (LV) is mounted to a Virtual Machine (VM) by the compute service on a computing node 22. The Pacemaker cluster management service synchronously updates the metadata of the Logical Volume (LV) to every other control node 21 and/or computing node 22. This synchronous update mitigates the performance bottleneck that arises at the control node 21 after the existing master node A is deprived of its role, realizes clustered management of the logical volume, and synchronously updates the resource state information across the whole high-availability cluster, thereby ensuring the availability and stability of the cluster as a whole.
With the high-availability cluster detection method disclosed in this embodiment, the keepalived heartbeat detection mechanism between the master node and the slave nodes in an existing high-availability cluster is significantly improved. Master-slave switching caused by insubstantial downtime of the master node A, such as a busy service or a detection timeout, is effectively avoided, and split-brain in the high-availability cluster is effectively avoided, so that the reliability of the high-availability cluster and the availability and stability of its services are ensured.
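For illustration, the two LVM commands named above could be assembled as follows. The device path `/dev/sdb` and the volume group name `vg_storage` are hypothetical placeholders, and the sketch only builds and prints the command lines rather than executing them.

```python
import shlex


def volume_group_commands(device: str, vg_name: str) -> list[list[str]]:
    """Return the argument lists that initialise a physical volume on the
    mapped storage device and then create a volume group on top of it."""
    return [
        ["pvcreate", device],           # mark the device as a physical volume
        ["vgcreate", vg_name, device],  # create the volume group from it
    ]


if __name__ == "__main__":
    # Hypothetical device and name; print the commands instead of running them.
    for argv in volume_group_commands("/dev/sdb", "vg_storage"):
        print(shlex.join(argv))
```

In practice such commands require root privileges on the control node and a device that is actually mapped from the storage node over FC or iSCSI.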
Example two:
Referring to fig. 5, this embodiment further discloses a high availability cluster detection system 100 (hereinafter referred to as the "system") based on the technical solution of the high availability cluster detection method disclosed in the first embodiment.
The system 100 comprises: a heartbeat detection unit 31, configured to perform heartbeat detection on the master node A and the slave node B configured with keepalived based on the VRRP protocol; a first health check unit 32, configured to perform a health check on the first channel established between the master node A and the slave node B and the second switch 20; a second health check unit 33, configured to perform a health check on the second channel established between the master node A and the slave node B and the third switch 30; and a decision unit 34, configured to elect the slave node that simultaneously triggers the master node reselection policy as the new master node only when the first channel and the second channel trigger that policy at the same time. That is, for the slave node B1 and the slave node B2 included in the slave node cluster 40 in fig. 5, the service of the master node A is migrated to the new master node (e.g., the slave node B1 or the slave node B2) determined according to the election policy, following the technical solution of the high availability cluster detection method disclosed in the first embodiment. In this process, one slave node in the slave node cluster 40 (e.g., the slave node B1) may be switched from the Backup state to the Master state, defining its role as the master node and completing the master-slave switching operation.
Meanwhile, the high availability cluster detection system 100 in this embodiment operates in a Zookeeper cluster or another equivalent type of distributed cluster system. The master node reselection policy is described jointly by a priority and a weight value to determine a new master node from a plurality of slave nodes.
For the parts of the system 100 disclosed in this embodiment that share technical solutions with the first embodiment, refer to the description in the first embodiment; they are not repeated here.
Example three:
Referring to fig. 6, the present embodiment discloses a controlled terminal 200, where the controlled terminal 200 includes: a processor 51, a storage device 52, and a communication bus 53 establishing a communication connection between the processor 51 and the storage device 52. The processor 51 is configured to execute one or more programs stored in the storage device 52 to implement the high availability cluster detection method according to the first embodiment. The storage device 52 includes a plurality of storage units, i.e., a storage unit 521 to a storage unit 52i, where the parameter i is a positive integer greater than or equal to 2. The controlled terminal 200 may be a computer, a data center, a bare metal server, a portable electronic device, or the like. For the parts of the controlled terminal 200 disclosed in this embodiment that share technical solutions with the first embodiment and/or the second embodiment, refer to the description in the first embodiment and/or the second embodiment; they are not repeated here.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The detailed description above is only a specific description of feasible embodiments of the present invention and is not intended to limit the scope of the present invention; equivalent embodiments or modifications made without departing from the technical spirit of the present invention shall be included in the scope of the present invention.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although this specification is organized by embodiments, not every embodiment contains only a single independent technical solution. The specification is written this way for clarity only; those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that those skilled in the art can understand.

Claims (10)

1. A high availability cluster detection method for distinguishing substantial downtime from insubstantial downtime of a master node, comprising:
performing heartbeat detection on the master node and the slave node configured with keepalived based on the VRRP protocol,
a first channel for health check is established between the master node and the slave node and the second switch, a second channel for health check is established between the master node and the slave node and the third switch,
selecting the slave node as a new master node only when the first channel and the second channel trigger the master node re-selection strategy at the same time;
a first channel for health check is established between the master node and the slave node and the second switch, a second channel for health check is established between the master node and the slave node and the third switch,
and establishing a BFD session between the master node and the slave node, and electing the slave node as a new master node only when the first channel and the second channel simultaneously trigger the master node reselection policy.
2. The high availability cluster detection method of claim 1,
and carrying out heartbeat detection on the master node and the slave node which are configured with keepalived based on the VRRP protocol.
3. The method according to claim 1 or 2, wherein the master node and the slave node establish sessions with cluster nodes in the high availability cluster through a second switch and a third switch;
the first channel is a session channel established by the master node and the slave node and at least one of the control node, the computing node, the network node or the storage node through the second switch,
the second channel is a session channel established between the master node and the slave node and at least one of the control node, the computing node, the network node or the storage node through a third switch.
4. The method of claim 3, further comprising:
and multiplexing authentication data fields contained in messages based on the VRRP protocol and transmitted between the master node and the slave nodes which have already established the session to determine whether the first channel and the second channel simultaneously trigger the reselecting master node strategy, and establishing the BFD session between a determined new master node and the master node from a plurality of slave nodes when the first channel and the second channel simultaneously trigger the reselecting master node strategy.
5. The method of claim 4, wherein the reselecting master node policy is described by a priority and a weight value together to determine a new master node from a plurality of slave nodes.
6. The method of claim 3, wherein the health check comprises a TCP check, an HTTP check, a check script check, a timeout check, or a load check.
7. The method of claim 3, further comprising:
after a new master node is determined in a plurality of slave nodes, synchronizing the state information of the new master node to the slave nodes, and drifting the virtual IP to the new master node;
synchronously configuring the state information of the new master node to the cluster node mounted to the third switch;
wherein,
the cluster nodes comprise control nodes, computing nodes, network nodes and storage nodes.
8. A high availability cluster detection system for distinguishing between substantial and insubstantial crashes occurring with a master node, comprising:
the heartbeat detection unit is used for carrying out heartbeat detection on the master node and the slave node which are configured with keepalived based on a VRRP protocol;
the first health check unit is used for carrying out health check on a first channel established between the master node and the second switch as well as between the slave node and the second switch;
the second health check unit is used for carrying out health check on a second channel established between the master node and the slave node and a third switch;
the decision unit is used for electing the slave node which simultaneously triggers the reselected master node strategy as a new master node only when the first channel and the second channel simultaneously trigger the reselected master node strategy;
a first channel for health check is established between the master node and the slave node and the second switch, a second channel for health check is established between the master node and the slave node and the third switch,
and establishing a BFD session between the master node and the slave node, and electing the slave node as a new master node only when the first channel and the second channel simultaneously trigger the master node reselection policy.
9. The high availability cluster detection system of claim 8, wherein the high availability cluster detection system operates in a Zookeeper cluster.
10. A controlled terminal, comprising: a processor, a storage device, and a communication bus establishing a communication connection between the processor and the storage device;
the processor is configured to execute one or more programs stored in the storage device to implement the high availability cluster detection method of any one of claims 1 to 8.
CN201911281240.2A 2019-12-13 2019-12-13 High-availability cluster detection method, system and controlled terminal Active CN110912780B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911281240.2A CN110912780B (en) 2019-12-13 2019-12-13 High-availability cluster detection method, system and controlled terminal

Publications (2)

Publication Number Publication Date
CN110912780A CN110912780A (en) 2020-03-24
CN110912780B true CN110912780B (en) 2021-08-27

Family

ID=69825420

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911281240.2A Active CN110912780B (en) 2019-12-13 2019-12-13 High-availability cluster detection method, system and controlled terminal

Country Status (1)

Country Link
CN (1) CN110912780B (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: No. 6 Science and Education Software Park, Binhu District, Wuxi City, Jiangsu Province

Applicant after: Huayun data holding group Co., Ltd

Address before: No. 6 Science and Education Software Park, Binhu District, Wuxi City, Jiangsu Province

Applicant before: WUXI CHINAC DATA TECHNICAL SERVICE Co.,Ltd.

GR01 Patent grant