TWI685736B - Method for remotely clearing abnormal status of racks applied in data center

Info

Publication number: TWI685736B
Application number: TW107147660A
Authority: TW
Inventors: 林韋成; 辛柏陞; 林政翰
Original assignee: 營邦企業股份有限公司
Priority date: 2018-12-28
Filing date: 2018-12-28
Publication date: 2020-02-21
Also published as: TW202026878A

Abstract

A method for remotely clearing abnormal states of racks is disclosed. The method includes the following steps: a management system regularly obtains information from a rack management controller (RMC) and multiple baseboard management controllers (BMCs) of a rack; the management system records each operation that a manager performs through it; the management system analyzes the information and the operations to determine whether the RMC or any BMC is in one of multiple preset attention states; and, when the RMC or a BMC is determined to remain connected to the management system but to be about to enter an abnormal state, the management system automatically performs a remote service restart procedure on that RMC or BMC to prevent the abnormal state from occurring.

Description

Remote elimination method for abnormal cabinet states in a data center (2)

The invention relates to data centers, and in particular to a method for analyzing and eliminating abnormal states of cabinets in a data center.

Generally, a data center uses the Intelligent Platform Management Interface (IPMI) to remotely manage the rack management controllers (RMCs) and baseboard management controllers (BMCs) of the cabinets, endpoint servers, and other equipment in the data center.

Whichever method is used for remote management, whenever the RMC or BMC of any cabinet or endpoint server becomes abnormal, the manager receives many warning emails. However, it is generally difficult for the manager to identify the real problem from these emails right away. Often the manager can confirm that the RMC or BMC has become abnormal only after time has passed, hundreds of warning emails have arrived, and the connection with the device has been lost.

Moreover, even if some management platforms collect error messages from different monitoring channels, aggregate them, and submit a failure assessment report to the manager, such monitoring still requires the manager to make the final judgment and decide how to handle the problem. As long as human factors are involved, the possibility of misjudgment cannot be completely avoided.

In view of this, the field needs a novel system and method that can automatically apply remote repair mechanisms to RMCs and BMCs in an abnormal state, thereby strengthening the monitoring capability of the data center, making cabinet management highly automated, reducing the time indirectly lost to human judgment, and avoiding human misjudgment.

The main purpose of the present invention is to provide a remote elimination method for abnormal cabinet states in a data center. When a rack management controller or baseboard management controller is determined to be connected normally but an abnormal state may be imminent, the method prevents that controller from entering the abnormal state directly from the remote end.

To achieve the above purpose, in the method of the present invention a cabinet server management system periodically and remotely obtains various information from a rack management controller and a plurality of baseboard management controllers in a cabinet, and records the operations that a manager performs through the cabinet server management system. The cabinet server management system analyzes this information and these operations to determine whether the rack management controller or any of the baseboard management controllers in the cabinet is in one of a plurality of preset attention states.

If the connection between any rack management controller or baseboard management controller and the cabinet server management system is determined to be normal, but an abnormal state is determined to be possibly imminent, the cabinet server management system automatically applies a remote service restart mechanism to prevent that rack management controller or baseboard management controller from entering the abnormal state.

Compared with the related art, in the method of the present invention the cabinet server management system connected to the cabinet performs the analysis and automatically applies the remote service restart mechanism. There is no need to wait for a manager to judge the abnormal state manually, which greatly reduces management cost. Cabinet monitoring requires no human intervention and is unaffected by distance or time.

1‧‧‧Data center

2‧‧‧Cabinet

21‧‧‧Rack management controller

211, 221‧‧‧Network interface controller

22‧‧‧Baseboard management controller

23‧‧‧Internal network switch

24‧‧‧Internal hardware line

3‧‧‧Cabinet server management system

31‧‧‧Database

4‧‧‧Public network switch

S11~S15, S21~S28‧‧‧Collection steps

S31~S39‧‧‧Analysis and elimination steps

S41~S47, S51~S58, S61~S66, S71~S80‧‧‧Elimination steps

FIG. 1 is a schematic diagram of the data center of the present invention.

FIG. 2 is a block diagram of the cabinet according to a first embodiment of the present invention.

FIG. 3A is a data collection flowchart according to a first embodiment of the present invention.

FIG. 3B is a data collection flowchart according to a second embodiment of the present invention.

FIG. 4 is an analysis and elimination flowchart according to a first embodiment of the present invention.

FIG. 5 is a first-type attention state elimination flowchart according to a first embodiment of the present invention.

FIG. 6 is a first-type attention state elimination flowchart according to a second embodiment of the present invention.

FIG. 7 is a second-type attention state elimination flowchart according to a first embodiment of the present invention.

FIG. 8 is a third-type attention state elimination flowchart according to a first embodiment of the present invention.

A preferred embodiment of the present invention is described in detail below with reference to the drawings.

The present invention discloses a remote elimination method for abnormal cabinet states (hereinafter, the elimination method). The elimination method is mainly used in a data center to help managers automatically monitor, analyze, and eliminate abnormal states in the data center.

FIG. 1 is a schematic diagram of the data center of the present invention. As shown in FIG. 1, the data center 1 mainly includes a plurality of cabinets 2 and a cabinet server management system 3 (hereinafter, the management system 3) that connects to the cabinets 2 remotely. The management system 3 may be located inside or outside the data center 1. It connects over a network to a public network switch 4, and through the public network switch 4 to the cabinets 2 in the data center 1.

The management system 3 can monitor the cabinets 2 in the data center 1 in real time, obtain various information from them, and analyze that information. When any cabinet 2 is found to be in an abnormal state, or about to enter one, the management system 3 can automatically apply a corresponding handling mechanism to resolve the situation. The invention can thus eliminate abnormal states that have already occurred in a cabinet 2, or prevent abnormal states that may soon occur, without any human intervention, while greatly reducing human misjudgment and speeding up handling.

In one embodiment, the management system 3 may be a personal computer or a cloud server containing one or more central processing units (not shown). Once started, the management system 3 can connect through the public network switch 4 to the cabinets 2 in the data center 1 and use the central processing units to run specific applications and algorithms that monitor the cabinets 2, analyze their data, and eliminate abnormal states.

The management system 3 also has a database 31 for temporarily or permanently storing the information obtained from the cabinets 2 in the data center 1. In the embodiment of FIG. 1, the database 31 is built into the management system 3. In other embodiments, the management system 3 may instead connect to one or more external databases 31; this is not limited.

FIG. 2 is a block diagram of the cabinet according to a first embodiment of the present invention. The embodiment of FIG. 2 is described using a single cabinet 2 in the data center 1 connected to the management system 3 as an example. The data center 1 may, however, contain as many cabinets 2 as needed and is not limited to what FIG. 2 shows.

As shown in FIG. 2, the cabinet 2 mainly includes at least one rack management controller (RMC) 21 and a plurality of endpoint servers 220 connected to the RMC 21, each endpoint server 220 having at least one baseboard management controller (BMC) 22.

The RMC 21 is an embedded system installed in the cabinet 2. Through various hardware lines it handles all external communication for the internal hardware devices of the cabinet 2 (cooling fans, sensors, power supplies, and so on) and communicates with the BMCs 22 of all endpoint servers 220 in the cabinet 2. The BMC 22 is likewise an embedded system. It is installed in an endpoint server 220 and handles all external communication for that server's internal hardware devices (sensors and so on).

In this embodiment, the RMC 21 connects to the BMCs 22 of all endpoint servers 220 in the cabinet 2 through internal hardware lines 24, and communicates with each BMC 22 to control each endpoint server 220 and obtain the required information. The endpoint servers may be, for example, tower servers or blade servers, but are not limited to these.

As shown in FIG. 2, each endpoint server 220 in the cabinet 2 has a fixed position number (#1, #2, #n, and so on in FIG. 2). When the external network function of an endpoint server 220 or its BMC 22 fails, the RMC 21 can reach the designated position in the cabinet 2 (such as #1, #2, or #n) through the internal hardware lines 24 and communicate with the endpoint server 220 and BMC 22 at that position. In this way, even if an endpoint server 220 or BMC 22 loses its network connection, the cabinet 2 can still use the RMC 21 to monitor and manage each BMC 22 and eliminate its abnormal conditions.
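The fallback described above can be sketched as a simple path selection. This is only an illustration under stated assumptions, not the patent's implementation: the `Bmc` class, the `network_ok` flag, and the returned path strings are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Bmc:
    slot: int          # fixed position number in the cabinet (#1, #2, ... #n)
    network_ok: bool   # hypothetical flag: is the BMC's own NIC reachable?


def choose_path(bmc: Bmc) -> str:
    """Pick how to reach a BMC: its own network link, or the RMC's hardware line."""
    if bmc.network_ok:
        # Normal case: reach the BMC through its NIC and the network switches.
        return f"network:slot{bmc.slot}"
    # Fallback: the RMC addresses the fixed slot over the internal hardware line.
    return f"rmc-hardware-line:slot{bmc.slot}"
```

The point of the fixed slot number is that the fallback path does not depend on any network identity of the BMC, so it remains usable when the BMC's network function has failed.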

In addition, the RMC 21 contains a network interface controller (NIC) 211, and each BMC 22 contains its own NIC 221. The RMC 21 connects through the NIC 211 to the internal network switch 23 inside the cabinet 2, and each BMC 22 connects to the internal network switch 23 through its own NIC 221. The cabinet 2 connects through the internal network switch 23 to the public network switch 4, and through the public network switch 4 establishes a network connection with the management system 3. The management system 3 can therefore access the cabinet 2 in the data center 1 remotely over the network, query and obtain the information of every RMC 21 and BMC 22 in the cabinet 2, and store it in the database 31.

The main technical feature of the present invention is that the management system 3 periodically accesses the cabinet 2 over the network and obtains the information of every RMC 21 and BMC 22 in it (for example status data, event logs, system resource utilization, and readings from the internal sensors of the endpoint servers 220), and uses this information to actively analyze whether an RMC 21 or BMC 22 is in an abnormal state or about to enter one. When the analysis shows it is necessary, the management system 3 applies the corresponding mechanism from the remote end, either directly eliminating the abnormal state of the RMC 21 and/or BMC 22 or preventing the RMC 21 and/or BMC 22 from entering it.

The technical solution of the present invention handles abnormal states without any human intervention, greatly reduces the possibility of human misjudgment, and makes the monitoring of the cabinet 2 highly automated.

Next, FIG. 3A is a data collection flowchart according to a first embodiment of the present invention.

As shown in FIG. 3A, a manager who wants to monitor the cabinets 2 in the data center 1 can directly start the remote management system 3 (step S11). Once started, the management system 3 actively accesses, from the remote end, the RMC 21 and all BMCs 22 in a cabinet 2 of the data center 1 (the single cabinet 2 of FIG. 2 is used as the example) (step S12). Through this remote access the management system 3 obtains the information of the RMC 21 and all BMCs 22 in the cabinet 2 (step S13), and then stores the obtained information in its local database 31 (step S14).

Specifically, in this embodiment the management system 3 actively accesses the cabinet 2 at regular intervals after startup; that is, the access, information acquisition, and storage actions of steps S12, S13, and S14 form a routine that runs after startup. While the routine runs, the management system 3 keeps checking whether it has been shut down (step S15), and repeats steps S12 to S14 until it is shut down, so that the RMC 21 and BMCs 22 in the cabinet 2 are monitored continuously.
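The routine of steps S12 to S15 can be sketched as a polling loop. This is a minimal sketch, not the patented implementation: `fetch`, `store`, and `is_shut_down` are hypothetical callables standing in for the remote query, the write to database 31, and the shutdown check, and the 60-second default interval is an assumption, since the patent only says the access is periodic.

```python
import time
from typing import Callable, Dict, List


def collection_routine(
    fetch: Callable[[str], Dict],          # hypothetical: query one RMC/BMC (S12, S13)
    store: Callable[[str, Dict], None],    # hypothetical: write to database 31 (S14)
    is_shut_down: Callable[[], bool],      # hypothetical: shutdown check (S15)
    controllers: List[str],
    interval_s: float = 60.0,              # assumed polling interval
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run S12-S14 repeatedly until the system is shut down; return rounds completed."""
    rounds = 0
    while True:
        for controller_id in controllers:
            info = fetch(controller_id)    # S12 + S13: access and obtain information
            store(controller_id, info)     # S14: persist it
        rounds += 1
        if is_shut_down():                 # S15: stop once the system is closed
            return rounds
        sleep(interval_s)
```

Passing `sleep` and the callables as parameters keeps the loop testable without a real network or clock; a real deployment would supply IPMI-based implementations.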

FIG. 3B is a data collection flowchart according to a second embodiment of the present invention.

In this embodiment, after the manager starts the management system 3 (step S21), the management system 3 provides an operation interface (step S22). Through this interface the manager can log in to the management system 3 and use it to remotely monitor and control each cabinet 2 in the data center 1. The operation interface may be a physical interface or a web interface; this is not limited.

After providing the operation interface, the management system 3 keeps checking whether it has received an operation from the manager through the interface (step S23). If it has, the management system 3 applies the corresponding remote management to the cabinet 2 and to the RMC 21 and BMCs 22 in it, according to the manager's operation (step S24). The management system 3 then records the manager's operation (step S25), and also obtains and records feedback information, such as feedback, system parameters, and execution data, produced by the management system 3, the cabinet 2, the endpoint servers 220, the RMC 21, and the BMCs 22 as a result of the remote management (step S26). Finally, the management system 3 stores the operation and the feedback information in the database 31 (step S27) for use in the subsequent analysis of abnormal states.

Likewise, the management system 3 of this embodiment treats steps S22 to S27 as a routine that runs after startup. While the routine runs, the management system 3 keeps checking whether it has been shut down (step S28), and repeats steps S22 to S27 until it is shut down, so as to continuously monitor and analyze how the manager's operations affect the RMC 21 and BMCs 22 in the cabinet 2.
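Steps S24 to S27 amount to executing an operation and logging it together with its feedback. The sketch below illustrates one way to do that; the `ActionRecord` fields, the `execute` callable, and the use of a plain list as the database are all assumptions made for the example, not details given in the patent.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any, Callable, Dict, List


@dataclass
class ActionRecord:
    operator: str                  # who performed the operation
    target: str                    # cabinet, endpoint server, RMC, or BMC
    action: str                    # e.g. "query", "update", "reset"
    feedback: Dict[str, Any]       # feedback, parameters, execution data (S26)
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def handle_operation(
    operator: str,
    target: str,
    action: str,
    execute: Callable[[str, str], Dict[str, Any]],  # hypothetical remote call
    database: List[ActionRecord],                   # stand-in for database 31
) -> ActionRecord:
    feedback = execute(target, action)                          # S24
    record = ActionRecord(operator, target, action, feedback)   # S25, S26
    database.append(record)                                     # S27
    return record
```

Keeping the operation and its feedback in one record makes it straightforward for the later analysis to correlate a manager's action with the controller state that followed it.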

Next, FIG. 4 is an analysis and elimination flowchart according to a first embodiment of the present invention.

As shown in FIG. 4, in this embodiment the management system 3 periodically accesses the database 31 (step S31), retrieves from it the information of the RMC 21 and BMCs 22 in the cabinet 2, the manager's operations, and the feedback information (step S32), and analyzes them. From this data the management system 3 determines whether the RMC 21 or any BMC 22 in the cabinet 2 is in one of a plurality of preset attention states (step S33).

In one embodiment, the management system 3 obtains the information of the RMC 21 and BMCs 22 in the cabinet 2 in real time, obtains the manager's operations from the operation interface in real time, and analyzes them as they arrive. In another embodiment, the management system 3 periodically stores this data in the database 31 through step S14 of FIG. 3A and step S27 of FIG. 3B, and periodically reads it back from the database 31 for analysis. Neither approach is limiting.

In one embodiment, the information of the RMC 21 and BMCs 22 may be, for example, status data (such as whether the controller is currently in working mode or update mode, its IP address, MAC address, subnet mask, and gateway IP address, and the number of IPMI sessions) and event logs. The operations may be, for example, data query, update, or reset operations that the manager performs on a specific cabinet 2, endpoint server 220, RMC 21, or BMC 22. Neither list is limiting. With this data, the management system 3 can run the corresponding algorithm to determine whether the cabinet 2 currently contains an RMC 21 or BMC 22 that needs immediate rescue.

In the embodiment of FIG. 4, the management system 3 presets at least three types of attention state: a first type, a second type, and a third type. The three types correspond to different abnormal conditions of the RMC 21 or BMC 22, and each requires the management system 3 to apply a different mechanism from the remote end to eliminate or prevent it.

As shown in FIG. 4, if the management system 3 analyzes the data (mainly the status data, the event logs, and the manager's operations) and finds that an RMC 21 or BMC 22 is already in an abnormal state but has not yet lost its connection to the management system 3, it determines that this RMC 21 or BMC 22 is in the first type of attention state (step S34). The management system 3 then automatically applies a remote recovery mechanism to that RMC 21 or BMC 22 to clear its abnormal state remotely (step S37).

If the management system 3 analyzes the data (mainly the status data of the RMC 21 and BMCs 22) and finds that an RMC 21 or BMC 22 is connected to the management system 3 normally but may be about to enter an abnormal state, it determines that this RMC 21 or BMC 22 is in the second type of attention state (step S35). The management system 3 then automatically applies a remote service restart mechanism to that RMC 21 or BMC 22 to prevent it, remotely, from entering the possible abnormal state (step S38).

If the management system 3 analyzes the data (mainly the status data, the manager's operations, and the feedback information) and finds that a BMC 22 has lost its network connection (that is, the management system 3 cannot access this BMC 22 directly from the remote end), it determines that this BMC 22 is in the third type of attention state (step S36). The management system 3 then automatically applies a remote rescue mechanism to that BMC 22 to clear the lost-connection state remotely and restore the BMC 22's network connection to normal (step S39).
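The three-way decision of steps S34 to S39 can be sketched as a small classifier. This is an illustrative sketch only: the boolean fields of `Snapshot` are hypothetical summaries of the analysis results, and the order in which the conditions are checked is an assumption, since the patent does not specify one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Attention(Enum):
    FIRST = "remote recovery"          # abnormal but still connected (S34, S37)
    SECOND = "remote service restart"  # connected, abnormality imminent (S35, S38)
    THIRD = "remote rescue"            # BMC lost its network connection (S36, S39)


@dataclass
class Snapshot:
    connected: bool           # can the management system reach the controller?
    abnormal: bool            # is it already in an abnormal state?
    abnormal_imminent: bool   # is an abnormal state judged likely soon?
    is_bmc: bool = True       # the third type is described only for BMCs


def classify(s: Snapshot) -> Optional[Attention]:
    """Map analysis results to an attention state, or None if no action is needed."""
    if not s.connected:
        # The text defines the lost-connection case for BMCs only.
        return Attention.THIRD if s.is_bmc else None
    if s.abnormal:
        return Attention.FIRST
    if s.abnormal_imminent:
        return Attention.SECOND
    return None
```

Each enum value names the mechanism the text associates with that state, so a caller could dispatch on the result directly.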

The following paragraphs discuss the first type of attention state.

Because some RMCs 21 and BMCs 22 have no Basic Input/Output System (BIOS), they must set their time through the Network Time Protocol (NTP) service provided by an external server, or through the real-time clock (RTC) service provided by a hardware clock chip, in order to synchronize time with other devices.

As described above, if a system event occurs before the time synchronization of an RMC 21 or BMC 22 has completed, the event is still recorded in the controller's event log, but its time field cannot record the correct occurrence time and instead contains a placeholder such as "Pre-init". Without the correct occurrence time, the manager cannot use the event log as a reference for that system event, which leads to misjudgment. In addition, if the RMC 21 or BMC 22 needs to be reset, the occurrence time of such system events may also be recorded incorrectly or abnormally.

FIG. 5 is a first-type attention state elimination flowchart according to a first embodiment of the present invention. In this embodiment, the management system 3 periodically accesses the database 31 (step S41) to retrieve the status data and event logs of the RMC 21 and BMCs 22 in the cabinet 2, and determines how the state of the RMC 21 and BMCs 22 has changed (step S42).

In this embodiment, the management system 3 mainly checks whether any system event in the retrieved event log has an unknown or incorrect occurrence time (step S43). If every system event in the event log has a correct occurrence time, the management system 3 takes no action.

If the analysis shows that an RMC 21 or BMC 22 has a system event with an unknown or incorrect time, the management system 3 treats that RMC 21 or BMC 22 as being in the first type of attention state (step S44); that is, it considers the controller to be in an abnormal state without having lost its network connection.

In one embodiment, the management system 3 determines that the occurrence time of a system event is unknown or incorrect when that time is recorded in the event log as "Pre-init" or similar wording (that is, wording that cannot correctly state when the event occurred). In another embodiment, the management system 3 makes this determination when it finds in the event log that an RMC 21 or BMC 22 has a system event with an unknown occurrence time, and also finds in the status data that this RMC 21 or BMC 22 has not completed time synchronization or needs to be reset.
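The two embodiments of the step S43 check can be sketched as follows. The event log layout is not given in this section, so the dictionary keys (`id`, `time`), the `TIME_FORMAT` string, and the `needs_sync_or_reset` flag are all assumptions made for illustration.

```python
from datetime import datetime
from typing import Dict, List

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format of a valid time field


def has_unknown_time(event: Dict[str, str]) -> bool:
    """True if the time field is a placeholder such as "Pre-init" or is unparseable."""
    raw = event.get("time", "").strip()
    if raw.lower() == "pre-init":
        return True
    try:
        datetime.strptime(raw, TIME_FORMAT)
    except ValueError:
        return True
    return False


def in_first_attention_state(
    log: List[Dict[str, str]],
    needs_sync_or_reset: bool,     # hypothetical flag derived from status data
    require_status: bool = False,  # False: first embodiment; True: second embodiment
) -> bool:
    """Decide whether a controller's log puts it in the first attention state."""
    flagged = any(has_unknown_time(e) for e in log)
    if require_status:
        # Second embodiment: the status data must corroborate the log.
        return flagged and needs_sync_or_reset
    return flagged
```

Requiring corroboration from the status data, as in the second embodiment, trades some sensitivity for fewer false positives from log entries that are malformed for unrelated reasons.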

After determining in step S44 that an RMC 21 or BMC 22 is in the first type of attention state, the management system 3 first obtains the timestamp of the current access to the event log (step S45), uses this timestamp as backup time identification information for the system event, and stores it in the database 31 (step S46). In one embodiment, the timestamp is the time at which the management system 3 accessed the database 31 to read the event log. In another embodiment, the timestamp is the time at which the management system 3 remotely accessed the cabinet 2 and obtained the event log from the RMC 21 or BMC 22. Neither choice is limiting.

For example, the original content of the event log may be as shown in the following table:

If the management system 3 accesses the event log at 11:32:23 PM on December 22, 2018 and finds that the occurrence time of event 2 is incorrect, it can generate the backup time identification information for event 2 on its own initiative, and either modify the content of the event log or generate a new event log. The new event log may be as shown in the following table:

When the administrator logs in to the management system 3 through the operation interface and queries the event log, the management system 3 can display the standby time identification information as the occurrence time of event 2, as shown in the table above. In this way, even if a system event occurs on an RMC21 or BMC22 before its time synchronization has completed, the management system 3 can still assign the system event an identifiable standby time, which helps the management system 3 and the administrator interpret the system event and thereby strengthens the effect of remote recovery.
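Steps S45 and S46 and the display behavior described above might look like the following sketch. The dictionary keys (`id`, `time`, `standby_time`) and both function names are hypothetical. The only value taken from the description is the access time in the usage note; the other event time is invented for illustration.

```python
from datetime import datetime
from typing import Optional


def add_standby_times(events: list, access_time: datetime) -> list:
    """Return a new event log in which every event with an unknown time
    carries the log access time as its standby time (steps S45 and S46).

    The original log is left untouched, matching the option of
    generating a new event log instead of modifying the old one.
    """
    new_log = []
    for event in events:
        entry = dict(event)  # shallow copy so the source log is unchanged
        if entry.get("time") is None:
            entry["standby_time"] = access_time
        new_log.append(entry)
    return new_log


def display_time(entry: dict) -> Optional[datetime]:
    """Time shown to the administrator: the real time when known,
    otherwise the standby time."""
    if entry.get("time") is not None:
        return entry["time"]
    return entry.get("standby_time")
```

With an access time of `datetime(2018, 12, 22, 23, 32, 23)`, an event whose `time` is `None` is displayed with that access time, while an event with a known time (say, an illustrative `datetime(2018, 12, 22, 23, 0, 0)`) is displayed unchanged.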

After step S46, the management system 3 may further send a control command (for example, a first control command) over the network to the RMC21 or BMC22 in the first type of attention state, so as to perform a time correction procedure on the RMC21 or BMC22 whose abnormal state is a time error (step S47). In one embodiment, the time correction procedure controls the RMC21 or BMC22 to correct its time through the NTP service. In another embodiment, the time correction procedure forces the RMC21 or BMC22 to perform a reset operation, but the invention is not limited thereto.

The following paragraphs describe another first type of attention state that may occur.

Because the data center 1 contains a large number of cabinets 2, it is impractical to update them manually one by one when the administrator needs to perform an update. Therefore, when the administrator wants to perform an update operation (for example, a firmware update) on the RMC21 and BMC22 in the cabinets 2, the administrator can operate the management system 3, whose program code sends the update command and the latest firmware version, thereby remotely updating the RMC21 and BMC22 of multiple cabinets 2 in the data center 1 at the same time.

If network congestion or an unstable network signal interrupts the network connection during the update, some RMC21 or BMC22 may be unable to complete the update operation through the normal update flow, and the update operation may fail. However, for some RMC21 or BMC22, a failed update operation only prevents the system from operating normally without losing the network connection (for example, the controller enters the update mode and cannot return to the working mode). In this case, the management system 3 needs to intervene remotely to clear the abnormal condition.

Refer to FIG. 6, which is a flowchart of a second embodiment of clearing the first type of attention state according to the present invention. In this embodiment, the management system 3 likewise accesses the database 31 at regular intervals (step S51) to obtain from the database 31 the status data and event logs of the RMC21 and BMC22 in the cabinet 2, together with the operation behaviors performed by the administrator through the operation interface, and determines the state changes of the RMC21 and BMC22 (step S52).

In this embodiment, the management system 3 first analyzes the status data and event logs of the RMC21 and BMC22 to determine whether the update operation of any RMC21 or BMC22 has timed out or encountered an error (step S54), and whether the network connection of that RMC21 or BMC22 is normal (step S55). If the analysis shows that the update operation of any RMC21 or BMC22 has timed out or encountered an error while its network connection is still normal, the management system 3 regards this RMC21 or BMC22 as being in the first type of attention state (step S56), that is, in an abnormal state but still connected.

More specifically, after step S52, the management system 3 may first determine from the operation behaviors whether the administrator has performed an update operation on the RMC21 and/or BMC22 in the cabinet 2 (step S53). After confirming that the administrator has performed an update operation, the management system 3 proceeds to steps S54 and S55 to determine whether the update operation of these RMC21 or BMC22 has timed out or encountered an error, and whether the network connection is normal.

After accepting an update operation performed by the administrator, the RMC21 or BMC22 automatically enters the update mode and sets a flag in its status data indicating that it has entered the update mode. When a peripheral device communicates with the RMC21 or BMC22 and reads the update-mode flag, it automatically stops interacting with that RMC21 or BMC22. Therefore, as long as the update operation has failed and the RMC21 or BMC22 cannot leave the update mode, it cannot operate normally. When the management system 3 finds that any RMC21 or BMC22 has accepted an update operation, that the update operation has timed out or encountered an error, and that the network connection has not been lost, it determines that this RMC21 or BMC22 is in the first type of attention state.
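The checks of steps S53 to S56 can be sketched as one predicate. The `ControllerStatus` fields and the `UPDATE_TIMEOUT` value are assumptions made for illustration; the description specifies neither a data layout nor a timeout length.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class ControllerStatus:
    """Hypothetical snapshot of one RMC/BMC assembled from the database."""
    name: str
    update_started_at: Optional[datetime]  # from recorded operation behaviors
    in_update_mode: bool                   # update-mode flag in the status data
    update_error: bool                     # error reported in the event log
    reachable: bool                        # network connection still normal


UPDATE_TIMEOUT = timedelta(minutes=30)  # assumed value, not from the patent


def is_stuck_after_update(status: ControllerStatus, now: datetime) -> bool:
    """True when the controller is in the first type of attention state
    because of a failed update while it is still connected."""
    if status.update_started_at is None:       # S53: no update was performed
        return False
    timed_out = (status.in_update_mode
                 and now - status.update_started_at > UPDATE_TIMEOUT)
    failed = timed_out or status.update_error  # S54: timeout or error
    return failed and status.reachable         # S55, then S56
```

A controller that failed its update and also lost its connection returns `False` here, because a lost connection belongs to a different attention state.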

After step S56, the management system 3 may further send a control command (for example, a second control command) over the network to the RMC21 or BMC22 in the first type of attention state, to force the RMC21 or BMC22 whose update operation failed to leave the update mode (step S57).

As described above, when the update operation fails in the sense of this embodiment (that is, the controller cannot leave the update mode), the RMC21 or BMC22 can still receive and process commands; it is only the peripheral devices that automatically stop interacting with it once they read the update-mode flag. In this embodiment, the management system 3 has already determined that the RMC21 or BMC22 is in an abnormal state, so it ignores the flag and forces the RMC21 or BMC22 to leave the update mode by sending the control command.

After step S57, the management system 3 may further send another control command (for example, a third control command) over the network to the RMC21 or BMC22 that has left the update mode, to force it to perform a reset operation or to perform the update operation again (step S58). In this way, the management system 3 can ensure that the RMC21 or BMC22 has resumed normal operation and that its firmware or software is at the latest, fully updated version.
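The ordering of steps S57 and S58 can be illustrated with a small sketch. The `Command` enumeration, `recovery_plan`, and `run_plan` are hypothetical names, and `send` stands in for whatever transport the management system actually uses. The stop-on-failure policy is also an assumption rather than something the description states.

```python
from enum import Enum


class Command(Enum):
    """Hypothetical labels for the control commands named in the text."""
    EXIT_UPDATE_MODE = "second control command"
    RESET = "third control command (reset)"
    RETRY_UPDATE = "third control command (update again)"


def recovery_plan(retry_update: bool) -> list:
    """S57 always comes first; S58 is either a reset or a repeated update."""
    follow_up = Command.RETRY_UPDATE if retry_update else Command.RESET
    return [Command.EXIT_UPDATE_MODE, follow_up]


def run_plan(plan: list, send) -> list:
    """Send each command in order and stop at the first failure.

    `send` is a caller-supplied function that returns True on success.
    The list of commands that succeeded is returned.
    """
    done = []
    for command in plan:
        if not send(command):
            break
        done.append(command)
    return done
```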

The following paragraphs discuss the second type of attention state.

The RMC21 and BMC22 of the present invention are embedded systems. Therefore, even if an endpoint server 220 in the cabinet 2 is not powered on, the management system 3 can still communicate with the RMC21 and BMC22 to provide remote management functions such as remote power-on, remote power-off, and device status checks.

Generally, when performing a remote management procedure, the administrator can use an Intelligent Platform Management Interface (IPMI) tool program on the management system 3 to send IPMI commands over the network and thereby communicate with the RMC21 and BMC22 in the cabinet 2. When the IPMI tool program is used, sending each command requires establishing an IPMI session with the destination RMC21 or BMC22 before communication is possible. Specifically, only after the IPMI session has been established can the management system 3 communicate over the network with the RMC21, the BMC22, and the underlying hardware of the cabinet 2 and the endpoint server 220, and obtain the execution result of the command (for example, the firmware version or the readings of all sensors in the endpoint server 220).

However, the computing resources of an embedded system are quite limited. Beyond the basic resources needed for operation, communicating with the RMC21, communicating with the BMC22, and responding to the various monitoring systems in the data center 1 all consume additional computing resources of the embedded system.

Furthermore, when the administrator performs remote management procedures on the RMC21 and BMC22 through the management system 3, the computing resources of the RMC21 and BMC22 are also consumed. The most obvious effect is a sharp increase in the number of IPMI sessions on the RMC21 and BMC22, which causes slow responses or request timeouts. At this point, although the RMC21 or BMC22 is not yet in an abnormal state, the management system 3 may need to intervene remotely to keep the RMC21 or BMC22 from entering an abnormal state later and affecting the operation of the cabinet 2.

Refer to FIG. 7, which is a flowchart of a first embodiment of clearing the second type of attention state according to the present invention. In this embodiment, the management system 3 likewise accesses the database 31 at regular intervals (step S61) to obtain from the database 31 the status data of the RMC21 and BMC22 in the cabinet 2, and determines the state changes of the RMC21 and BMC22 (step S62). In one embodiment, in step S62 the management system 3 mainly obtains the current total number of IPMI sessions of the RMC21 and of each BMC22. In another embodiment, in step S62 the management system 3 also obtains the current system resource usage rate of the RMC21 and of each BMC22.

After step S62, the management system 3 determines whether the total number of IPMI sessions of any RMC21 or BMC22 is higher than a first threshold (step S63) and, if so, determines that this RMC21 or BMC22 is in the second type of attention state (step S65), that is, the connection of the RMC21 or BMC22 is normal but an abnormal state is judged to be imminent.

It is worth noting that if the management system 3 also obtained the system resource usage rate of the RMC21 and each BMC22 in step S62, it can additionally determine whether the system resource usage rate of any RMC21 or BMC22 is higher than a second threshold (step S64). In this case, the management system 3 determines that an RMC21 or BMC22 is in the second type of attention state when its current total number of IPMI sessions is higher than the first threshold and its system resource usage rate is higher than the second threshold.
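The threshold tests of steps S63 to S65 reduce to a short predicate, sketched below. The two default threshold values are placeholders chosen for illustration, since the description gives no numbers, and the function name is hypothetical.

```python
from typing import Optional

SESSION_THRESHOLD = 20   # hypothetical first threshold (IPMI session count)
USAGE_THRESHOLD = 0.85   # hypothetical second threshold (usage as a fraction)


def in_second_attention_state(session_count: int,
                              usage: Optional[float] = None,
                              session_threshold: int = SESSION_THRESHOLD,
                              usage_threshold: float = USAGE_THRESHOLD) -> bool:
    """S63: the session count must exceed the first threshold.

    S64 applies only when a usage rate was collected in S62; in that
    case both thresholds must be exceeded.
    """
    over_sessions = session_count > session_threshold
    if usage is None:
        return over_sessions
    return over_sessions and usage > usage_threshold
```

Requiring both conditions when the usage rate is available makes the check stricter, so a brief burst of sessions on an otherwise idle controller does not trigger a service restart.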

In one embodiment, the system resource usage rate is the usage rate of the central processing unit or memory of the RMC21 or BMC22. In another embodiment, the system resource usage rate is the usage rate of the system resources that the RMC21 or BMC22 uses internally to provide its services (such as the HyperText Transfer Protocol (HTTP) service or the IPMI service), but the invention is not limited thereto.

After determining that an RMC21 or BMC22 is in the second type of attention state, the management system 3 may further send a control command (for example, a fourth control command) over the network to that RMC21 or BMC22 to make it restart its IPMI service (step S66). The RMC21 or BMC22 thereby clears its accumulated IPMI sessions and avoids entering an abnormal state.

In one embodiment, the fourth control command is a reset command: the management system 3 sends the reset command over the network to the RMC21 or BMC22 in the second type of attention state to force it to perform a reset operation, after which the RMC21 or BMC22 restarts its IPMI service directly. However, this is only one embodiment of the present invention, and the invention is not limited thereto.
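As one possible illustration of the reset embodiment, the sketch below builds the argument list for the open-source `ipmitool` utility, whose `mc reset cold` and `mc reset warm` subcommands reset a management controller. The description does not say which IPMI tool program is used, so the choice of `ipmitool` and the function name are assumptions. The function only constructs the command and does not run it.

```python
def build_bmc_reset_command(host: str, user: str, cold: bool = True) -> list:
    """Build an ipmitool command line that resets a remote controller.

    The -E option makes ipmitool read the password from the
    IPMI_PASSWORD environment variable, which keeps it off the
    command line.
    """
    kind = "cold" if cold else "warm"
    return ["ipmitool", "-I", "lanplus", "-H", host, "-U", user, "-E",
            "mc", "reset", kind]
```

The returned list could be passed to `subprocess.run(..., check=True)` by a caller that has set `IPMI_PASSWORD`. Restarting only the IPMI service without a full reset is vendor-specific and is not covered by this sketch.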

With the above technical solution, the management system 3 can detect through analysis, ahead of time, that an RMC21 or BMC22 may be about to enter an abnormal state, and can proactively apply a remote service restart mechanism to keep the RMC21 or BMC22 from actually entering the abnormal state and affecting the operation of the cabinet 2.

The following paragraphs discuss the third type of attention state.

As described above, the management system 3 of the present invention communicates over the network with the RMC21 and BMC22 in the cabinets 2 of the data center 1, and the administrator likewise performs remote management procedures on these RMC21 and BMC22 over the network. Therefore, when a BMC22 in the cabinet 2 loses its network connection, the management system 3 cannot communicate with it and the administrator cannot manage it. In this embodiment, the abnormal condition in which the BMC22 loses its network connection may be caused by an incorrect IP address setting.

Generally, a BMC22 in the cabinet 2 may be configured to use a dynamic IP address (that is, its network mode is set to dynamic IP mode) or a static IP address (that is, its network mode is set to static IP mode). If the network mode of the BMC22 is dynamic IP mode, a Dynamic Host Configuration Protocol (DHCP) server (not shown) in the data center 1 assigns a dynamic IP address to the BMC22. If the network mode of the BMC22 is static IP mode, the administrator can set a static IP address for the BMC22 through the operation interface of the management system 3.

To perform a network setting operation that gives the BMC22 a usable static IP address, the administrator must issue at least four commands to the BMC22 through the management system 3 (that is, four IPMI sessions must be established): (1) set the network mode of the BMC22 to static IP mode; (2) set the static IP address; (3) set the subnet mask (netmask); and (4) set the gateway IP address.
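The four commands listed above correspond closely to the `lan set` subcommands of the open-source `ipmitool` utility, as sketched below. Using `ipmitool` is an assumption, since the description only refers to an IPMI tool program, and `static_ip_commands` is a hypothetical helper that builds the argument lists without running them. For remote use, connection options such as `-I lanplus -H <host>` would be placed before `lan`.

```python
import ipaddress


def static_ip_commands(channel: int, ip: str, netmask: str, gateway: str) -> list:
    """Return the four ipmitool invocations, in the order given in the text."""
    for value in (ip, netmask, gateway):
        ipaddress.IPv4Address(value)  # raises ValueError on malformed input
    base = ["ipmitool", "lan", "set", str(channel)]
    return [
        base + ["ipsrc", "static"],             # (1) static IP mode
        base + ["ipaddr", ip],                  # (2) static IP address
        base + ["netmask", netmask],            # (3) subnet mask
        base + ["defgw", "ipaddr", gateway],    # (4) gateway IP address
    ]
```

Validating the three addresses before building any command avoids sending a partial configuration, which is one way a BMC could end up unreachable.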

As described above, if the static IP address set by the administrator is wrong (for example, it duplicates one of the dynamic IP addresses assigned by the DHCP server), or the gateway IP address is set incorrectly, then in an environment where multiple subnets coexist, or where communication must pass through a gateway, the BMC22 cannot connect to the management system 3. From the point of view of the management system 3, the endpoint server 220 to which the BMC22 belongs still exists, but because the connection to the BMC22 is lost, the BMC22 (and its endpoint server 220) can no longer be managed. In this case, the management system 3 may need to intervene remotely to restore the network connection of the BMC22.

Refer to FIG. 8, which is a flowchart of a first embodiment of clearing the third type of attention state according to the present invention. In this embodiment, the management system 3 accesses the database 31 at regular intervals (step S71) to obtain from the database 31 the status data of each BMC22 in the cabinet 2, the operation behaviors performed by the administrator through the management system 3, and the feedback information obtained by the management system 3 for those operation behaviors, and determines the state change of each BMC22 (step S72).

In one embodiment, the status data obtained by the management system 3 in step S72 includes at least the network mode of each BMC22 (static IP mode or dynamic IP mode) and its currently used static IP address, subnet mask, and gateway IP address, without limitation. The feedback information obtained in step S72 mainly includes the feedback, system parameters, and execution data generated by the management system 3, the cabinet 2, and each endpoint server 220 (and each BMC22) when the operation behavior was performed, but is not limited thereto.

After step S72, the management system 3 first determines from the status data and feedback information whether any BMC22 in the cabinet 2 has lost its connection to the management system 3 (step S73), and determines from the operation behaviors whether the administrator has just performed a network setting operation on any BMC22 in the cabinet 2 (step S74). If the analysis shows that the administrator has just performed a network setting operation on a BMC22 and that this BMC22 lost its connection after the network setting operation, the management system 3 regards this BMC22 as being in the third type of attention state (step S75), that is, the BMC22 has lost its connection.

It is worth noting that in step S73 the management system 3 may determine that a BMC22 has lost its network connection (has already lost it, or may lose it) when the network mode of the BMC22 is set to static IP mode and its static IP address duplicates one of the dynamic IP addresses assigned by the DHCP server.

In addition, in step S73 the management system 3 may also determine that a BMC22 has lost its network connection (has already lost it, or may lose it) when the network mode of the BMC22 is set to static IP mode and its gateway IP address is set incorrectly. However, these are only some embodiments of the present invention, and the invention is not limited thereto.
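The two configuration checks mentioned for step S73 can be sketched with Python's standard `ipaddress` module. The function name and the list-of-strings representation of the DHCP addresses are assumptions. Treating a gateway outside the BMC's own subnet as "set incorrectly" is only one plausible interpretation, since the description does not define the criterion.

```python
import ipaddress


def find_network_misconfig(ip: str, netmask: str, gateway: str,
                           dhcp_addresses: list) -> list:
    """Return human-readable problems for a BMC in static IP mode."""
    problems = []
    address = ipaddress.IPv4Address(ip)
    # strict=False lets a host address plus netmask define its subnet
    network = ipaddress.IPv4Network(f"{ip}/{netmask}", strict=False)

    # Check 1: the static IP duplicates an address handed out by DHCP
    if any(address == ipaddress.IPv4Address(a) for a in dhcp_addresses):
        problems.append("static IP duplicates a DHCP-assigned address")

    # Check 2 (assumed criterion): the gateway must be inside the subnet
    if ipaddress.IPv4Address(gateway) not in network:
        problems.append("gateway is outside the BMC subnet")

    return problems
```

An empty result means neither check found a problem. It does not prove that the BMC is reachable.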

After step S75, in which the management system 3 has determined that a BMC22 is in the third type of attention state, the management system 3 identifies which RMC21 in the data center 1 is responsible for this BMC22 (step S76), and controls this RMC21 to check, through the internal hardware line 24 of the cabinet 2, the endpoint server 220 to which the BMC22 belongs (step S77), in order to confirm whether this endpoint server 220 is present (step S78).

As shown in FIG. 2, the RMC21 in a cabinet 2 is physically connected through the internal hardware line 24 to the BMC22 of every endpoint server 220 in the cabinet 2. Therefore, even if a BMC22 loses its network connection, the RMC21 in the same cabinet 2 can still communicate with it through the internal hardware line 24.

If it is determined in step S78 that the endpoint server 220 is not present (for example, it has been removed from the cabinet 2 or is damaged), the management system 3 issues a warning signal (step S79). In one embodiment, the management system 3 issues the warning signal (for example, text, light, or sound) through the operation interface to alert the administrator. In another embodiment, the management system 3 sends the warning signal to the administrator over the network (for example, by text message, e-mail, or messaging software).

If it is determined in step S78 that the endpoint server 220 is still present, the management system 3 controls the RMC21 to send a set of IPMI commands to the BMC22 through the internal hardware line 24 to restore the network connection of the BMC22 (step S80). In one embodiment, the management system 3 sends the IPMI commands to the BMC22 via the RMC21 to reset the static IP address or the gateway IP address of the BMC22, thereby restoring the connection between the BMC22 and the management system 3.
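The decision flow of steps S73 to S80 can be summarized in a short sketch. The function name and the returned action labels are hypothetical. `check_presence` stands in for the RMC's check over the internal hardware line and is supplied by the caller.

```python
def handle_disconnected_bmc(lost_connection: bool,
                            just_configured: bool,
                            check_presence) -> str:
    """Return the action the management system would take for one BMC.

    check_presence is a zero-argument callable that asks the responsible
    RMC whether the endpoint server is still in the cabinet. It is
    called only after the BMC is found to be in the third type of
    attention state.
    """
    # S73 and S74: both conditions are needed to reach S75
    if not (lost_connection and just_configured):
        return "no_action"
    # S76 to S78: ask the RMC over the internal hardware line
    if not check_presence():
        return "warn_administrator"   # S79
    return "restore_via_rmc"          # S80
```

Passing the presence check as a callable keeps the RMC query out of the path for BMCs that are not in the attention state, which matches the order of the steps in the flow.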

With the above technical solution, the management system 3 can proactively apply a remote rescue mechanism to a BMC22 after it loses its connection, so that the BMC22 regains its network connection.

With the method of the present invention, the management system 3 automatically collects the required information and analyzes the status of all RMC21 and BMC22, and automatically applies the corresponding mechanism to clear the abnormal state whenever any RMC21 or BMC22 is in one of the attention states. The technical solution of the present invention can therefore greatly reduce management cost, and allows the data center 1 to be monitored without human intervention and regardless of distance or time.

The above are only preferred embodiments of the present invention and do not limit the patent scope of the present invention. All equivalent changes made using the content of the present invention are likewise included within the scope of the present invention.

Claims

A method for remotely clearing an abnormal state of a cabinet, applied to a data center having a cabinet and a cabinet server management system remotely connected to the cabinet, wherein the cabinet has a rack management controller (RMC) and a plurality of endpoint servers, each endpoint server having a baseboard management controller (BMC), the method comprising: a) the cabinet server management system regularly accessing a database to obtain status data of the RMC and each BMC, and determining state changes of the RMC and each BMC; b) determining, according to the status data and the state changes, whether any of the RMC and the BMCs is in one of a plurality of preset attention states; and c) when any RMC or BMC is determined to be in a second type of attention state among the plurality of attention states, the cabinet server management system automatically applying a remote service restart mechanism to the RMC or the BMC in the second type of attention state to prevent the RMC or the BMC from entering an abnormal state, wherein the second type of attention state means that the connection between the RMC or the BMC and the cabinet server management system is normal but the abnormal state is judged to be imminent; wherein step a) obtains the current total number of Intelligent Platform Management Interface (IPMI) sessions of the RMC and of each BMC, step b) determines that an RMC or BMC is in the second type of attention state when its total number of IPMI sessions is higher than a first threshold, and step c) is the cabinet server management system sending a control command to the RMC or the BMC in the second type of attention state to make the RMC or the BMC restart its IPMI service.

The method for remotely clearing an abnormal state of a cabinet according to claim 1, further comprising the following steps: a01) starting the cabinet server management system; a02) after step a01), the cabinet server management system regularly and proactively accessing the RMC and each BMC in the cabinet remotely; a03) obtaining the status data of the RMC and each BMC; a04) storing the status data in the database; and a05) repeating steps a02) to a04) until the cabinet server management system is shut down.

The method for remotely clearing an abnormal state of a cabinet according to claim 1, further comprising the following steps: a11) starting the cabinet server management system; a12) after step a11), the cabinet server management system providing an operation interface; a13) upon accepting an operation behavior of an administrator through the operation interface, performing a remote management procedure on the RMC and each BMC according to the content of the operation behavior; a14) obtaining feedback information corresponding to the remote management procedure; a15) storing the operation behavior and the feedback information in the database; and a16) repeating steps a12) to a15) until the cabinet server management system is shut down.

The method for remotely clearing an abnormal state of a cabinet according to claim 1, wherein step a) also obtains a system resource usage rate of the RMC and of each BMC, and step b) determines that an RMC or BMC is in the second type of attention state when its total number of IPMI sessions is higher than the first threshold and its system resource usage rate is higher than a second threshold.

The method for remotely clearing an abnormal state of a cabinet according to claim 4, wherein the system resource usage rate is the usage rate of the central processing unit or memory of the RMC or the BMC.

The method for remotely clearing an abnormal state of a cabinet according to claim 4, wherein the system resource usage rate is the usage rate of the system resources that the RMC or the BMC mainly uses to provide an HTTP service or an IPMI service.

The method for remotely clearing an abnormal state of a cabinet according to claim 1, wherein step c) is the cabinet server management system sending a reset command to the RMC or the BMC in the second type of attention state to force the RMC or the BMC to perform a reset operation and thereby restart the IPMI service.