JP5298393B2

JP5298393B2 - Parallel Reed-Solomon RAID (RS-RAID) architecture, device, and method

Info

Publication number: JP5298393B2
Application number: JP2010534958A
Authority: JP
Inventors: Arvind Pruthi
Original assignee: Marvell World Trade Ltd.
Priority date: 2007-11-21
Filing date: 2008-11-20
Publication date: 2013-09-25
Anticipated expiration: 2028-11-20
Also published as: WO2009070235A3; EP2605140A1; US8219887B2; EP2212796B1; KR101543369B1; JP2011504269A; US8359524B2; EP2212796A2; WO2009070235A2; US20090132851A1; KR20100095525A; US8645798B2; US20130138881A1; US20120266049A1; EP2605140B1

Abstract

A parallel RS-RAID data storage architecture can aggregate the data and checksums within each cluster into intermediate or partial sums that are transferred or distributed to other clusters. The use of intermediate data symbols, intermediate checksum symbols, cluster configuration information on the assignment of data storage devices to clusters, the operational status of the data storage devices, and the like can reduce the computational burden and latency of the error correction calculations while increasing the scalability and throughput of the parallel RS-RAID distributed data storage architecture.

Description

This application claims the benefit of U.S. Provisional Application No. 60/989,670, "Parallel RAID Implementation for RAID 6 and Reed-Solomon Code," filed on November 21, 2007, which is incorporated herein by reference in its entirety, including all cited references.

A redundant array of inexpensive disks (RAID) architecture uses a group of data storage units, such as hard disks, to provide fault-tolerant data storage. The RAID architecture uses a forward error correction (FEC) code and extra data storage units to protect information from errors and disk failures. An information symbol may be a bit, a byte, or a word. Information symbols may be encoded to form code symbols that include data symbols and checksum or parity symbols. For a systematic forward error correction code, the information symbols are represented explicitly in the data symbol portion of a code symbol.

Reed-Solomon codes can be used in a RAID architecture (RS-RAID) to tolerate a number of storage unit failures equal to the number of checksum symbols. For example, a quadruple-error-correcting RS-RAID architecture that allocates 20 storage units for data and 4 storage units for checksums can tolerate failures in up to and including four storage devices.

An RS-RAID architecture typically uses a single RAID controller to protect the data symbols written to the data storage units. When a single RAID controller performs the checksum, encoding, and decoding calculations, the throughput of the RAID architecture, that is, its data storage and retrieval rate, may be reduced relative to a non-RAID data storage architecture that is not fault-tolerant.

Thus, a distributed data storage architecture with high throughput and fault tolerance may be desirable.

In a high-performance storage architecture, multiple RAID controllers may communicate with each other through a common set of communication paths called a communication fabric. The communication fabric may have high latency compared to the communication path between a RAID controller and the storage devices assigned to that RAID controller. A high-latency communication fabric can reduce the throughput of the RAID data storage architecture unless the traffic between RAID controllers, such as data, messages, and configuration, is matched to the task of fault-tolerant distributed data storage. Each RAID controller, which may be interposed between the communication fabric and an assigned set of data storage devices, may be called a node of the data storage architecture. A RAID controller and its assigned data storage devices may be called a data storage cluster.

A Reed-Solomon RAID (RS-RAID) architecture can protect information symbols that are written to and read from storage devices, such as hard disks, by including redundant data storage devices. An RS-RAID architecture that uses $m$ checksum devices can tolerate as many as $m$ simultaneous failures of the data storage devices. The $m$ checksum symbols may be denoted $c_1, c_2, \ldots, c_m$. The RS-RAID architecture may also include $n$ data storage devices for information-bearing or data symbols, denoted $d_1, d_2, \ldots, d_n$.

The checksum and data storage devices may store data and checksum symbols as bits, bytes, words, and the like. Note that certain types of forward error correction (FEC) codes, such as Reed-Solomon (RS) codes, typically operate on bytes. For example, an RS code may operate on blocks of bytes, such as a block that encodes 223 information bytes into 223 data bytes and 32 checksum bytes in a 255-byte block.

The RS-RAID architecture can use the data symbols $d_1, d_2, \ldots, d_n$ held by the corresponding data storage devices $D_1, D_2, \ldots, D_n$ to calculate the checksum symbol $c_i$ stored in the $i$-th checksum device $C_i$. The RS-RAID architecture can determine each $c_i$ ($1 \le i \le m$) such that, if any $m$ or fewer of the storage devices $D_1, D_2, \ldots, D_n, C_1, C_2, \ldots, C_m$ fail, the contents of any of the failed devices can be reconstructed from the intact or non-failed devices. The RS-RAID architecture can provide fault-tolerant operation because of the properties of the Vandermonde matrix, which is used to calculate and maintain the checksum symbols and to recover information from the data and checksum symbols read from the storage devices. Even when storage devices fail, an RS-RAID controller can recover the data and/or checksum symbols by computing the inverse of an $(n \times n)$ portion of the Vandermonde matrix augmented with an identity matrix.

To generate the checksum symbols, the RS-RAID architecture can weight the data symbols by the elements of the Vandermonde matrix and sum the weighted data symbols using a linear function $F_i$ according to Equation 1:

$$c_i = F_i(d_1, d_2, \ldots, d_n) = \sum_{j=1}^{n} f_{i,j}\, d_j \qquad \text{(Equation 1)}$$

The function $F_i$ can be obtained from the $i$-th row of the elements of the Vandermonde matrix, so that $F_i = [f_{i,1};\ f_{i,2};\ \ldots;\ f_{i,n}]^T$.

In other words, if the data and checksum symbols are represented by the $(n \times 1)$ and $(m \times 1)$ dimensional vectors $D = [d_1, d_2, \ldots, d_n]^T$ and $C = [c_1, c_2, \ldots, c_m]^T$, respectively, and the functions $F_i$ are represented as the rows of a matrix $F$, then the RS-RAID architecture can encode the checksum symbols according to Equation 2a:

$$C = FD \qquad \text{(Equation 2a)}$$

Equation 2a is equivalent to

$$\begin{bmatrix} c_1 \\ \vdots \\ c_m \end{bmatrix} = \begin{bmatrix} f_{1,1} & \cdots & f_{1,n} \\ \vdots & \ddots & \vdots \\ f_{m,1} & \cdots & f_{m,n} \end{bmatrix} \begin{bmatrix} d_1 \\ \vdots \\ d_n \end{bmatrix}$$
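The matrix form of Equation 2a can be sketched in code as follows. This is a minimal illustration rather than the patented implementation: it assumes byte-sized symbols in GF(2^8) with the reduction polynomial 0x11D, which the text does not specify, and the helper names `gf_mul` and `encode_checksums` are hypothetical. Addition in this field is bitwise XOR.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements (assumed field; shift-and-XOR method)."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # add (XOR) the current multiple of a
        a <<= 1
        if a & 0x100:
            a ^= poly            # reduce modulo the assumed field polynomial
        b >>= 1
    return result


def encode_checksums(F, data):
    """Compute C = F D: one checksum symbol per row of F."""
    checksums = []
    for row in F:
        acc = 0
        for f_ij, d_j in zip(row, data):
            acc ^= gf_mul(f_ij, d_j)   # weighted sum; field addition is XOR
        checksums.append(acc)
    return checksums


# Two checksum rows for four data symbols, with f[i][j] = j**(i-1) for i = 1, 2.
F = [[1, 1, 1, 1],
     [1, 2, 3, 4]]
D = [1, 2, 3, 4]
C = encode_checksums(F, D)   # [c_1, c_2]
```

With these assumptions the first checksum is the plain XOR of the data symbols, and the second weights each symbol by its column index before summing.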

For an advantageously designed RS-RAID FEC code, the $F$ matrix can be an $(m \times n)$ Vandermonde matrix with elements $f_{i,j} = j^{\,i-1}$, where the indices $i = 1, 2, \ldots, m$ and $j = 1, 2, \ldots, n$ correspond to the rows and columns of the Vandermonde matrix, respectively, and the algebraic operations are performed using the properties of a Galois field. For example, a $(3 \times 4)$ Vandermonde matrix can be written as

$$F = \begin{bmatrix} 1^0 & 2^0 & 3^0 & 4^0 \\ 1^1 & 2^1 & 3^1 & 4^1 \\ 1^2 & 2^2 & 3^2 & 4^2 \end{bmatrix}$$
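Because the powers are taken in a Galois field, the entries of such a matrix are not ordinary integer powers. The sketch below builds the (3 × 4) example under the assumption of GF(2^8) with reduction polynomial 0x11D; the field size is not stated in the text, and `gf_mul`, `gf_pow`, and `vandermonde` are hypothetical helper names.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements (assumed field)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_pow(base, exponent):
    """Raise a field element to a non-negative integer power."""
    result = 1
    for _ in range(exponent):
        result = gf_mul(result, base)
    return result


def vandermonde(m, n):
    """Return the (m x n) matrix with elements f[i][j] = j**(i-1) in the field."""
    return [[gf_pow(j, i - 1) for j in range(1, n + 1)]
            for i in range(1, m + 1)]


F = vandermonde(3, 4)
```

Under these assumptions the third row is 1, 4, 5, 16 rather than 1, 4, 9, 16, because 3 · 3 = 5 in this field.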

To recover the $(n \times 1)$ information vector $\tilde{D}$ from codewords or code symbols that may contain errors, the parallel RS-RAID architecture can invert an augmented or partitioned matrix $A$, which includes the Vandermonde matrix and an appended $(n \times n)$ identity matrix denoted $I$, and post-multiply the inverse of $A$ by the sets of data and checksum symbols, $D$ and $C$ respectively, read from the operational storage units. Symbolically, the recovered information vector $\tilde{D}$ can be obtained from

$$\tilde{D} = \mathrm{Inv}(A) \cdot E$$

where the augmented matrix is

$$A = \begin{bmatrix} I \\ F \end{bmatrix}$$

and

$$E = \begin{bmatrix} D \\ C \end{bmatrix}$$

is the augmented data and checksum symbol vector. The notation $\mathrm{Inv}(A)$ may be understood as a function that yields an inverse based on $A$, such as the inverse of a subset of the rows of $A$ that forms a nonsingular $(n \times n)$ square matrix and that is conformal with the corresponding selected, or winnowed, set of $n$ rows of the column matrix $E$, denoted $E'$, as described below. The process of inverting the $A$ matrix may be regarded as inverting a selected set of rows of $A$, where the selection is determined by the list of operational data storage devices and by the conformality requirement of the matrix-vector multiplication. Note that every subset of $n$ rows of the $((n+m) \times n)$ augmented matrix $A$ is invertible because $F$ is a Vandermonde matrix.

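In expanded form, the equation for the recovered information vector, $\tilde{D} = \mathrm{Inv}(A) \cdot E$, can be expressed as

$$\begin{bmatrix} \tilde{d}_1 \\ \vdots \\ \tilde{d}_n \end{bmatrix} = \mathrm{Inv}\left( \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \\ f_{1,1} & f_{1,2} & \cdots & f_{1,n} \\ \vdots & \vdots & & \vdots \\ f_{m,1} & f_{m,2} & \cdots & f_{m,n} \end{bmatrix} \right) \begin{bmatrix} d_1 \\ \vdots \\ d_n \\ c_1 \\ \vdots \\ c_m \end{bmatrix}$$

where conformality is enforced by selecting corresponding rows of $E$ and $A$ before inverting the selected portion of the matrix $A$.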

In other words, each storage device of the RS-RAID architecture can be represented by a row of the augmented matrix $A$ and the corresponding element of the column vector $E = [d_1, d_2, \ldots, d_n, c_1, c_2, \ldots, c_m]^T$. If none of the $m$ redundant storage devices fail, the recovered information symbols can be determined by selecting any subset of $n$ rows of $A$ and the $n$ corresponding elements of $E$ to form a square matrix $A'$, which may be described as a data recovery matrix, and a vector $E'$ of the data read from the corresponding data storage units. In other words, $\mathrm{Inv}(A) = (A')^{-1}$ and

$$\tilde{D} = (A')^{-1} E'$$

For example, for a 4+2 RS-RAID architecture, the recovered or decoded data $\tilde{D}$ can be the vector of recovered data symbols obtained from the first four rows of the augmented Vandermonde matrix and the first four entries of the data and checksums read from the storage device array:

$$\tilde{D} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}^{-1} \begin{bmatrix} d_1 \\ d_2 \\ d_3 \\ d_4 \end{bmatrix}$$

For example, if the third, the fifth, or both the third and the fifth data storage devices fail, the recovered data vector $\tilde{D}$ can be recovered from $E'$ by selecting four rows that correspond to operational devices, as follows:

[Equation: rows of $A$ and elements of $E$ for the 4+2 example, with struck-through rows marking failed and deselected devices]

where a double strikethrough indicates a failed storage device and a single strikethrough indicates a storage device that is deselected for forming the inverse matrix and performing the subsequent calculations. The inverse matrix may be calculated by Gaussian elimination or by another method.

Once the value of $\tilde{D}$ is obtained, any recovered or estimated checksum vector $\tilde{C}$ may be calculated from the data vector $\tilde{D}$ using

$$\tilde{C} = F \tilde{D}$$
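The row-selection and inversion steps above can be sketched for the 4+2 case as follows. This is an illustrative sketch only: it assumes byte symbols in GF(2^8) with reduction polynomial 0x11D (not specified in the text), solves the selected system by Gaussian elimination rather than forming an explicit inverse, and uses hypothetical names (`gf_solve`, `recover_data`) and zero-based device indices.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two GF(2^8) elements (assumed field)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_inv(a):
    """Invert a nonzero element as a**254, since a**255 == 1 in GF(2^8)."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def gf_solve(matrix, rhs):
    """Solve matrix * x = rhs over the field by Gauss-Jordan elimination."""
    n = len(rhs)
    aug = [row[:] + [rhs[k]] for k, row in enumerate(matrix)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = gf_inv(aug[col][col])
        aug[col] = [gf_mul(inv, v) for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [v ^ gf_mul(factor, w) for v, w in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def recover_data(A, E, failed):
    """Select n rows of A and E for operational devices, then solve for D."""
    n = len(A[0])
    alive = [k for k in range(len(A)) if k not in failed][:n]
    return gf_solve([A[k] for k in alive], [E[k] for k in alive])


identity = [[1 if r == c else 0 for c in range(4)] for r in range(4)]
F = [[1, 1, 1, 1], [1, 2, 3, 4]]      # f[i][j] = j**(i-1) for i = 1, 2
A = identity + F                      # augmented matrix [I; F]
D = [0x11, 0x22, 0x33, 0x44]
C = [0, 0]
for i, row in enumerate(F):
    for f_ij, d_j in zip(row, D):
        C[i] ^= gf_mul(f_ij, d_j)
E = D + C                             # [d_1..d_4, c_1, c_2]

recovered = recover_data(A, E, failed={2, 4})   # third and fifth devices fail
```

When only one device fails, five rows remain operational and the sketch simply drops the last one, which corresponds to deselecting a device before inversion.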

A parallel RS-RAID data storage architecture can aggregate the data and checksums within each cluster into intermediate or partial sums that are transferred or distributed to all clusters. The use of intermediate data symbols, intermediate checksum symbols, cluster configuration information on the assignment of data storage devices to clusters, the operational status of the data storage devices, and the like can reduce the computational burden and latency of the error correction calculations while increasing the scalability and throughput of the parallel RS-RAID distributed data storage architecture.

The present disclosure refers to the accompanying drawings, in which like numerals represent like elements.

FIG. 1 shows an example of a parallel Reed-Solomon redundant array of inexpensive disks (RS-RAID) architecture.

FIG. 2 shows an example of a configuration matrix.

FIG. 3 shows an example of a RAID controller.

FIG. 4 is a flowchart of an example checksum program.

FIG. 5 is a flowchart of an example checksum update program.

FIG. 6 is a flowchart of an example data program.

FIG. 1 is an example of a parallel RS-RAID architecture 100 for data storage. The parallel RS-RAID architecture 100 can include a communication fabric 1200, RAID controllers 1111-1113, and storage devices 1001-1012. The storage devices 1001-1004, 1005-1008, and 1009-1012 can be coupled to the RAID controllers 1111, 1112, and 1113, respectively. In other words, each subset or cluster of the storage devices 1001-1012 can be coupled to a corresponding one of the RAID controllers 1111-1113. The number of storage devices coupled to each RAID controller 1111-1113 may or may not be equal, and the configuration or mapping of storage devices to RAID controllers may change dynamically, for example to improve fault tolerance or throughput. For example, the assignment of the storage devices 1001-1012 to the RAID controllers 1111-1113 may be determined by a configuration matrix or a similar data structure.

FIG. 2 shows an example of a configuration matrix 200 that can include a variable "t", which is an index or counter over the RAID controllers. For example, row 206 of the configuration matrix 200 shows a mapping function Q(t) between the RAID controller index "t" and the respective RAID controller, such as the RAID controllers 1111-1113. Row 202 shows a RAID storage device start index QS(t), and row 204 shows a RAID storage device end index QE(t). For example, QS(2) = 1005 and QE(2) = 1008. Note that an offset in the device number may be supplied by a function J(), so that J(QS(2)) = 5, which can indicate, for example, that the second storage device cluster starts at the fifth storage device. The configuration matrix 200 can map storage devices to their corresponding RAID controllers. In other words, the configuration matrix 200 can determine or control which subset or cluster of storage devices is assigned to a given RAID controller. For computation, the configuration matrix 200 can determine the start and end of the weighted partial sums that are used to encode or decode codewords, to update or maintain checksums and data, and the like, as described below.
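One possible in-memory form of such a configuration matrix is sketched below. The layout is an assumption for illustration: the table name `CONFIG`, the helper names, and the offset of 1000 (chosen so that J(1005) = 5) are hypothetical, and the pairing of index t with controllers 1111-1113 follows the order in which FIG. 1 lists them.

```python
# t -> (Q(t), QS(t), QE(t)): controller, first device, last device.
CONFIG = {
    1: (1111, 1001, 1004),
    2: (1112, 1005, 1008),
    3: (1113, 1009, 1012),
}

DEVICE_BASE = 1000  # assumed offset removed by J()


def J(device_number):
    """Convert a device reference number into a 1-based symbol index."""
    return device_number - DEVICE_BASE


def summation_limits(t):
    """Return (J(QS(t)), J(QE(t))), the index range owned by controller t."""
    _, qs, qe = CONFIG[t]
    return J(qs), J(qe)


def controller_for(device_number):
    """Look up which RAID controller a storage device is assigned to."""
    for controller, qs, qe in CONFIG.values():
        if qs <= device_number <= qe:
            return controller
    raise KeyError(device_number)
```

For instance, `summation_limits(2)` gives the index range 5 through 8, and changing a row of `CONFIG` is enough to remap devices between clusters.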

The communication fabric 1200 can couple input and output (I/O) digital signals among the RAID controllers 1111-1113 and between the parallel RS-RAID architecture 100 and external devices. For example, the communication fabric 1200 can couple digital signals such as data symbols, checksum symbols, and intermediate data and checksum symbols among the RAID controllers 1111-1113. The communication fabric 1200 may use a parallel bus structure, serial data links, an optical backplane, or the like. The communication fabric 1200 may use one type of bus, link, or backplane structure for external communication and another type for communication among the RAID controllers 1111-1113.

The RAID controllers 1111-1113 can calculate data checksum symbols for each storage device in their assigned subset or cluster of storage devices, as given by a configuration matrix, such as the configuration matrix 200, or another data structure. The RAID controllers 1111-1113 can aggregate or accumulate partial sums of the error correction code calculations and report the aggregated data and parity results over the communication fabric 1200 to the other RAID controllers in the parallel RS-RAID architecture 100. Although the details of the partial sum calculations for data and checksum symbols may be described with reference to a particular RAID controller, such as the RAID controller 1111, the corresponding calculations may be performed by any RAID controller in the parallel RS-RAID architecture 100.

FIG. 3 shows an example of the RAID controller 1111, which can include a communication fabric interface 1111a, a RAID control unit 1111b, an intermediate sum device 1111c, a storage device interface 1111g, and a storage device failure sense unit 1111h. The communication fabric interface 1111a can couple signals to and from a communication fabric, such as the communication fabric 1200, to the intermediate sum device 1111c and the RAID control unit 1111b. The RAID control unit 1111b can be coupled to the intermediate sum device 1111c, the storage device interface 1111g, and the storage device failure sense unit 1111h. The storage device interface 1111g can be coupled to the RAID control unit 1111b, the intermediate sum device 1111c, and the storage device failure sense unit 1111h. As described above, the RAID controller 1111 can be coupled to and from the communication fabric 1200 and, through the storage device interface 1111g, to and from storage devices such as the storage devices 1001-1004.

The intermediate sum device 1111c can include an intermediate sum calculator 1111d, a recalculator 1111e, and a calculation control 1111f. The intermediate sum calculator 1111d can be coupled to the communication fabric interface 1111a, the storage device interface 1111g, the recalculator 1111e, and the calculation control 1111f. The recalculator 1111e can be coupled to the communication fabric interface 1111a, the intermediate sum calculator 1111d, the calculation control 1111f, and the storage device interface 1111g. The calculation control 1111f can be coupled to the intermediate sum calculator 1111d, the recalculator 1111e, and the storage device interface 1111g.

The communication fabric interface 1111a can transfer information symbols between the parallel RS-RAID architecture 100 and external devices, and can couple information symbols, portions of information symbols, data symbols, checksum symbols such as intermediate checksum symbols, control signals, clock signals, and the like between the communication fabric 1200 and the elements of the RAID controller 1111. The communication fabric interface 1111a can reformat information symbols from bits into bytes, words, or other symbols; multiplex and demultiplex signals; synchronize data transfers; buffer signals with line drivers and receivers; and the like. In other words, the communication fabric interface 1111a can condition digital signals for transmission over a communication fabric such as a digital bus, buffer data transfers, and the like.

The RAID control unit 1111b can receive signals from the communication fabric interface 1111a and the storage devices, select data symbols from a subset of the information symbols, stripe data and checksum symbols across the storage devices, control the operation of the intermediate sum device 1111c according to a forward error correction (FEC) code, and the like. For example, the subset of information symbols can be those information symbols that are represented by data symbols and stored in the operational data storage devices controlled by the RAID controller 1111. The intermediate sum device 1111c may receive status information regarding the number of operational storage devices from the RAID control unit 1111b, which can obtain the status information from the storage device failure sense unit 1111h.

The storage device failure sense unit 1111h can determine the operational status of any storage device coupled to the RAID controller 1111 and can determine a list of operational storage devices. In other words, the storage device failure sense unit 1111h can determine whether a given storage device has become unsuitable for reliable storage of data and checksums. The storage device failure sense unit 1111h can test storage devices for reliable operation, determine whether a given storage device is online, declare a unit offline if a response from the storage device is not received within a predetermined time-out interval, determine whether a signal quality metric for data read from a storage device is below a threshold quality, list the operational storage devices, and the like. The storage device failure sense unit 1111h can record the results of such tests and distribute the list of operational storage devices to elements of the RAID controller 1111, such as the RAID control unit 1111b.
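The time-out and signal-quality checks described above could be combined along the following lines. This is a hedged sketch: the `ProbeResult` record, the `operational_devices` function, and the threshold values are all hypothetical, since the text does not define how probes are represented or which thresholds apply.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    device: int                        # storage device reference number
    response_time_s: Optional[float]   # None means no response was received
    signal_quality: float              # hypothetical metric in [0, 1]


def operational_devices(probes, timeout_s=1.0, min_quality=0.9):
    """Return the devices that answered in time with acceptable quality."""
    operational = []
    for probe in probes:
        timed_out = (probe.response_time_s is None
                     or probe.response_time_s > timeout_s)
        if not timed_out and probe.signal_quality >= min_quality:
            operational.append(probe.device)
    return operational


probes = [
    ProbeResult(1001, 0.02, 0.99),
    ProbeResult(1002, None, 0.0),    # no response: declared offline
    ProbeResult(1003, 0.03, 0.50),   # responds, but quality below threshold
    ProbeResult(1004, 0.05, 0.97),
]
alive = operational_devices(probes)
```

The resulting list is the kind of information that would be passed on to the RAID control unit to decide which rows of the recovery matrix to select.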

The intermediate sum calculator 1111d can calculate intermediate, local, partial sums into which the error correction code calculations for checksums and data can be decomposed, as described with respect to Equation 8 and Equation 13, respectively. The intermediate or partial sums may be weighted sums of symbols read from the operational storage devices in the cluster that reports to the RAID controller 1111. For example, the cluster of storage devices and the corresponding summation limits of such partial sums may be determined from a configuration matrix, such as the configuration matrix 200, or another data structure. The intermediate sum calculator 1111d can calculate the data and checksum symbols after receiving the corresponding partial sums from other RAID controllers, such as the RAID controller 1112 or the RAID controller 1113.

The recalculator 1111e can recalculate intermediate, local checksums based on data from the storage devices coupled directly to the RAID controller 1111 and on non-local intermediate checksums from other RAID controllers that are transferred through the communication fabric interface 1111a. In other words, when a data or checksum symbol changes, either in a local storage device coupled directly to the RAID controller 1111 or through an intermediate checksum sent to the RAID controller 1111 over the communication fabric 1200, the recalculator 1111e can modify the results from the intermediate sum calculator 1111d accordingly.

The calculation control 1111f can control both the intermediate sum calculator 1111d and the recalculator 1111e to determine whether an intermediate checksum result or a recalculated checksum should be used for FEC. The RAID control unit 1111b can signal the calculation control 1111f, directly or through the communication fabric interface 1111a, to determine which of the intermediate sum calculator 1111d result and the recalculator 1111e result is calculated. The RAID control unit 1111b can obtain status information, such as the operational status of the data storage devices, from the storage device failure sense unit 1111h.

The parallel RAID controllers 1111-1113 can calculate and store the checksums according to

$$c_i = \sum_{t=1}^{r} c_{i,t} = \sum_{t=1}^{r} \; \sum_{j=J(QS(t))}^{J(QE(t))} f_{i,j}\, d_j$$

where the index $t$ can range from 1 to the number of RAID controllers $r$, and $c_{i,t}$ is the $i$-th intermediate checksum for the $t$-th index. For example, $r$ is equal to 3 for the parallel RS-RAID architecture 100. As described with respect to the configuration matrix 200, QS(t) and QE(t) map the starting and ending storage devices to the RAID controllers and can determine the summation limits of the partial sums that generate the respective intermediate checksums $c_{i,t}$. The function J(·) can subtract an offset so that, for example, J(1002) = 2.

The t-th RS-RAID controller, such as the RAID controller 1111, can calculate an intermediate checksum c_{i,t} according to

[equation not reproduced in this text]
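The following Python sketch illustrates how an intermediate checksum could be computed per controller and how the partial results add up to the full checksum. It is only an illustration under stated assumptions: the field GF(2^8) with reduction polynomial 0x11D, the weight f_{i,j} = j^(i-1), the sample data, the device ranges, and all function names are hypothetical and are not taken from the patent.

```python
from functools import reduce


def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8); poly is an assumed reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def intermediate_checksum(i, data, qs, qe):
    """c_{i,t}: weighted sum of d_j for j = qs..qe (1-based, inclusive)."""
    acc = 0
    for j in range(qs, qe + 1):
        weight = gf_pow(j, i - 1)            # assumed Vandermonde element f_{i,j}
        acc ^= gf_mul(weight, data[j - 1])   # addition in GF(2^8) is XOR
    return acc


data = [0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0]   # d_1 .. d_8
ranges = [(1, 3), (4, 6), (7, 8)]   # hypothetical (QS(t), QE(t)) for r = 3

# Each controller computes its own partial sum for checksum row i = 2.
partials = [intermediate_checksum(2, data, qs, qe) for qs, qe in ranges]

# Summing the three partial sums gives the full checksum c_2.
full = reduce(lambda x, y: x ^ y, partials)
```

Because multiplication distributes over addition in the field, `full` equals the checksum computed in a single pass over all eight data symbols.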

The use of intermediate checksums c_{i,t} can reduce data traffic on the communication fabric 1200 and can increase the throughput of the parallel RS-RAID architecture 100. For example, in an 8+4 RS-RAID architecture in which a single master RAID controller controls all of the storage devices and calculates the checksums, eight data symbols may be transferred through the communication fabric. In contrast, when the intermediate checksum calculator results of an 8+4 parallel RS-RAID architecture are used, only two intermediate checksum symbols may need to be transferred through the communication fabric.

In addition to calculating intermediate and full checksums, the parallel RS-RAID architecture 100 can modify or maintain the checksum symbols when a data symbol changes. For example, when a data symbol changes from d_j to d'_j, the checksum can be recalculated according to

c'_i = c_i + f_{i,j}(d'_j − d_j)    (Equation 9)

When performing the calculation of Equation 9, the RAID controller 1111 can calculate the data difference (d'_j − d_j) and weight the data difference by the Vandermonde element f_{i,j}, that is, according to

[equation not reproduced in this text]

Each of the parallel RAID controllers 1111 to 1113 can send the temporary component c'_{i,t} to the other RAID controllers 1111 to 1113.

The RS-RAID controllers 1111 to 1113 can update their respective assigned storage devices according to

[equation not reproduced in this text]
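The update path around Equation 9 can be sketched in Python under stated assumptions. The equation images for the temporary component and for the final update are missing from the text, so the forms used here, c'_{i,t} = f_{i,j}(d'_j − d_j) and c'_i = c_i + c'_{i,t}, are inferred from Equation 9 and the surrounding prose. The field GF(2^8) with polynomial 0x11D, the weight f_{i,j} = j^(i-1), the sample values, and all function names are hypothetical.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8); poly is an assumed reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def checksum(i, data):
    """Full checksum c_i over all data symbols (used here only as a reference)."""
    acc = 0
    for j, d in enumerate(data, start=1):
        acc ^= gf_mul(gf_pow(j, i - 1), d)
    return acc


def temporary_component(i, j, old, new):
    """Assumed form f_{i,j} * (d'_j - d_j); subtraction in GF(2^8) is XOR."""
    return gf_mul(gf_pow(j, i - 1), new ^ old)


data = [0x12, 0x34, 0x56, 0x78]
old_checksums = [checksum(i, data) for i in (1, 2)]

# Data symbol d_3 changes; the controller that owns it computes one
# temporary component per checksum row and sends it to the checksum owners.
j, new_value = 3, 0xAB
deltas = [temporary_component(i, j, data[j - 1], new_value) for i in (1, 2)]

# Each checksum owner adds the received component to its stored checksum.
updated = [c ^ delta for c, delta in zip(old_checksums, deltas)]
data[j - 1] = new_value
```

Only one symbol per checksum row crosses the fabric, and no controller has to re-read the unchanged data symbols.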

When a storage device fails, for example when the storage device failure sense unit 1111h detects a hard disk crash, the inverse Inv(A) of the augmented matrix can be modified by the parallel RAID controllers 1111 to 1113 to form an inverse matrix Inv(A') that corresponds to the remaining, operational data storage devices. The matrix Inv(A') can be a static data structure as long as no further storage device fails. When another storage device fails, Inv(A') can be calculated once and then broadcast to all operational RAID controllers, such as the RAID controllers 1111 to 1113. If more storage devices fail later, a new inverse matrix Inv(A'') can be recalculated and broadcast to all RAID controllers.
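As a rough illustration of forming Inv(A') after a failure, the Python sketch below selects the rows of an augmented identity and Vandermonde matrix that belong to surviving devices and inverts them by Gauss-Jordan elimination. The field GF(2^8) with polynomial 0x11D, the weight f_{i,j} = j^(i-1), the 4+2 layout, the choice of failed devices, and all function names are assumptions made for this example. Not every row selection of such a matrix is guaranteed to be invertible, so the sketch raises an error in that case.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8); poly is an assumed reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def gf_inv(a):
    """Multiplicative inverse in GF(2^8): a^254, since a^255 = 1 for a != 0."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    return gf_pow(a, 254)


def augmented_matrix(n, m):
    """Stack the n x n identity on top of an m x n Vandermonde block."""
    identity = [[int(r == c) for c in range(n)] for r in range(n)]
    vander = [[gf_pow(j, i - 1) for j in range(1, n + 1)] for i in range(1, m + 1)]
    return identity + vander


def mat_inv(matrix):
    """Invert a square matrix over GF(2^8) by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [row[:] + [int(r == c) for c in range(n)] for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col]), None)
        if pivot is None:
            raise ValueError("selected rows are not invertible")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = gf_inv(aug[col][col])
        aug[col] = [gf_mul(scale, x) for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [x ^ gf_mul(factor, y) for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mat_mul(a, b):
    """Matrix product over GF(2^8)."""
    return [[reduce_xor(gf_mul(row[k], b[k][c]) for k in range(len(b)))
             for c in range(len(b[0]))] for row in a]


def reduce_xor(values):
    """XOR all values together (field addition)."""
    acc = 0
    for v in values:
        acc ^= v
    return acc


n, m = 4, 2
A = augmented_matrix(n, m)
operational = [0, 3, 4, 5]   # surviving rows: d_1, d_4, c_1, c_2 (d_2, d_3 failed)
A_prime = [A[r] for r in operational]
inv_A_prime = mat_inv(A_prime)
```

The resulting `inv_A_prime` only changes when the list of operational devices changes, which matches the description of Inv(A') as a static structure between failures.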

The parallel RS-RAID architecture 100 can recover data symbols even when a storage device fails, using intermediate or partial sums that are calculated locally at each RAID controller. The recovered data

[symbol not reproduced in this text]

can be recovered from

[equation not reproduced in this text]

where the elements of Inv(A') can be denoted a_{i,j} (1 ≤ i ≤ n and 1 ≤ j ≤ n). The elements of the corresponding data and checksum symbols E' = [e_1, e_2, …, e_n]^T can be read from the operational, selected data storage devices. The parallel RS-RAID architecture 100 can select rows of E and the corresponding subset of the augmented identity and Vandermonde matrix to form E' and Inv(A'), respectively. In other words, the parallel RS-RAID architecture 100 can decompose the data recovery calculation into a set of partial sums, or intermediate data symbols, according to

[equation not reproduced in this text]

where e_j is understood to be the set of all data or checksum symbols under the control of the t-th RS-RAID controller.
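The equation images for the recovery step are missing from the text. Based on the surrounding definitions, the recovery calculation and its decomposition presumably take the form below. This is a reconstruction, not the published equations: the names \tilde{D}, \tilde{d}_i, and \tilde{d}_{i,t} for the recovered data and the intermediate data symbols, and the index set S_t for the symbols held by the t-th controller, are hypothetical.

```latex
\tilde{D} = \mathrm{Inv}(A')\, E',
\qquad
\tilde{d}_i = \sum_{j=1}^{n} a_{i,j}\, e_j = \sum_{t=1}^{r} \tilde{d}_{i,t},
\qquad
\tilde{d}_{i,t} = \sum_{j \in S_t} a_{i,j}\, e_j
```

Under this reading, each controller weights only the symbols e_j it can read locally, and the recovered symbol is the field sum of the r intermediate data symbols.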

Upon receiving a message, such as intermediate data symbols, from the other parallel RS-RAID controllers, each RAID controller can first calculate the intermediate data symbols

[equation not reproduced in this text]

and then calculate the recovered data

[symbol not reproduced in this text]

according to

[equation not reproduced in this text]
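The two-stage recovery described above, which the flowchart discussion refers to as Equations 13 and 14, can be illustrated with a deliberately small Python sketch. It assumes a 2+1 layout with a single parity-style checksum c_1 = d_1 + d_2 over GF(2^8), a failed device holding d_1, and a hand-derived inverse matrix for the surviving rows. The layout, the ownership map, and the function names are hypothetical.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8); poly is an assumed reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def intermediate_data_symbols(inv_rows, symbols, owned):
    """Partial sums of a_{i,j} * e_j over the columns j owned by one controller."""
    out = []
    for row in inv_rows:
        acc = 0
        for j in owned:
            acc ^= gf_mul(row[j], symbols[j])
        out.append(acc)
    return out


def recover(partials_by_controller):
    """Add the intermediate data symbols from all controllers (XOR in GF(2^8))."""
    recovered = [0] * len(partials_by_controller[0])
    for partial in partials_by_controller:
        recovered = [x ^ y for x, y in zip(recovered, partial)]
    return recovered


d1, d2 = 0x5A, 0xC3
c1 = d1 ^ d2                      # checksum row with all weights equal to 1

# Surviving rows of the augmented matrix are [0, 1] (d_2) and [1, 1] (c_1);
# their inverse over GF(2^8), derived by hand, is:
inv_a_prime = [[1, 1],
               [1, 0]]
e_prime = [d2, c1]                # symbols read from operational devices

# Hypothetical ownership: controller 1 holds e_1, controller 2 holds e_2.
ownership = [[0], [1]]
partials = [intermediate_data_symbols(inv_a_prime, e_prime, owned)
            for owned in ownership]
recovered = recover(partials)     # [d_1, d_2]
```

Each controller contributes one short vector of intermediate data symbols, so the raw symbols e_j never have to leave the controller that reads them.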

FIG. 4A shows an example of a checksum program flowchart 400A for a parallel RS-RAID architecture for data storage. The program flowchart 400A can begin at program step S410 and proceed to program step S420, where a configuration matrix of the parallel RS-RAID architecture can be read. For example, the configuration matrix can specify the starting and ending device numbers of the storage devices associated with a given RAID controller, as in the configuration matrix described with respect to FIG. 2. It may be understood that each RAID controller can store a local copy of the configuration matrix, reconcile the configuration matrix with the other RAID controllers, receive the configuration matrix from a higher-level RAID device or network controller, and the like.
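A configuration matrix of this kind can be pictured as a small lookup table. The Python sketch below is a guess at one possible representation, not the structure used in the patent: the device numbers 1001 to 1012, the offset of 1000 (chosen so that J(1002) = 2, as in the description), and the function names are all assumptions.

```python
# Hypothetical configuration matrix: controller index t -> (QS(t), QE(t)),
# the first and last device numbers assigned to that controller.
CONFIG = {
    1: (1001, 1004),
    2: (1005, 1008),
    3: (1009, 1012),
}


def J(device_number, offset=1000):
    """Subtract an assumed offset so that device numbers become symbol indices."""
    return device_number - offset


def controller_for(device_number):
    """Return the controller index t whose range contains the device."""
    for t, (qs, qe) in CONFIG.items():
        if qs <= device_number <= qe:
            return t
    raise KeyError(f"device {device_number} is not assigned to any controller")


def index_range(t):
    """Summation limits for controller t, expressed as symbol indices."""
    qs, qe = CONFIG[t]
    return J(qs), J(qe)
```

For example, `controller_for(1006)` returns 2 and `index_range(3)` returns `(9, 12)` with the table above.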

From program step S420, the program flow can proceed to program step S425, where the program can read, from an external device, the information-bearing data symbols to be stored. For example, the program can receive a set of 2K-bit data blocks from a flash drive over the communication fabric.

From program step S425, the program can proceed to program step S430, where intermediate checksums can be calculated. For example, the intermediate checksums, or data and parity calculations, can be computed from a linear combination of data words using Equation 8 and the properties of a Galois field. Program step S430 can calculate intermediate checksums and 1) update or maintain the intermediate checksums using stored codeword symbols from the individual operational storage units that supply a predetermined portion of the codeword symbols to a given RAID controller, and 2) aggregate intermediate checksums from other RAID controllers that communicate with the given RAID controller through the communication fabric. In other words, low-latency data and parity bits, bytes, or words from a local subset of the data storage units can be combined with high-latency, accumulated or partially summed data and parity, in the form of c_{i,t}, from other data storage units. The program can then proceed to program step S460.

At step S460, the program can distribute the intermediate checksums to the different RS-RAID controllers. For example, if Q(t) = t, program step S460 can distribute the first intermediate checksum c_{1,1} from the first RAID controller to the second and third RAID controllers.

From program step S460, the program flow can proceed to program step S470, where the program can receive intermediate checksums from the other RAID controllers. From program step S470, the program can proceed to program step S480. The set of intermediate checksums can enable each RAID controller to calculate a complete checksum c_i according to Equation 8 and to store c_i for subsequent error correction and detection calculations. For example, the program can receive the second and third intermediate checksums c_{i,2} and c_{i,3}, which, together with the locally calculated first intermediate checksum c_{i,1}, can form a sufficient set of checksums to calculate c_1.

From program step S480, the program flow can proceed to program step S490, where the program can store the data and the complete checksum symbols that are assigned to the RAID controller executing the program. For example, the program can stripe the data and checksum symbols across an array of disks. From program step S490, the program flow can proceed to program step S495, where program execution can stop.
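The exchange in steps S430 to S480 can be simulated in a few lines of Python. This is a simplified sketch and not the patent's implementation: the communication fabric is modeled as a plain dictionary, and the field GF(2^8) with polynomial 0x11D, the weight f_{i,j} = j^(i-1), the 8+4 sample data, the device ranges, and the function names are assumptions.

```python
def gf_mul(a, b, poly=0x11D):
    """Multiply two bytes in GF(2^8); poly is an assumed reduction polynomial."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= poly
        b >>= 1
    return result


def gf_pow(a, n):
    """Raise a to the n-th power in GF(2^8) by repeated multiplication."""
    result = 1
    for _ in range(n):
        result = gf_mul(result, a)
    return result


def local_partials(data, qs, qe, m):
    """Step S430: partial sums c_{i,t} for i = 1..m over devices qs..qe."""
    sums = []
    for i in range(1, m + 1):
        acc = 0
        for j in range(qs, qe + 1):
            acc ^= gf_mul(gf_pow(j, i - 1), data[j - 1])
        sums.append(acc)
    return sums


def checksum_flow(data, config, m):
    """Simulate S430 to S480 for every controller listed in config."""
    # S430: every controller computes partial sums from its own devices only.
    partials = {t: local_partials(data, qs, qe, m)
                for t, (qs, qe) in config.items()}
    full = {}
    for t in config:
        # S460/S470: controller t receives the partial sums of the others.
        received = [partials[u] for u in config if u != t]
        # S480: add received and local partial sums (XOR in GF(2^8)).
        total = list(partials[t])
        for other in received:
            total = [x ^ y for x, y in zip(total, other)]
        full[t] = total   # S490 would store these on the assigned devices
    return full


data = [0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0]   # eight data symbols
config = {1: (1, 3), 2: (4, 6), 3: (7, 8)}   # hypothetical configuration matrix
result = checksum_flow(data, config, m=4)    # four checksums per controller
```

After the exchange, every controller holds the same four complete checksums, even though none of them read data symbols outside its own cluster.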

FIG. 4B shows an example of a checksum update program flowchart 400B for a parallel RS-RAID architecture for data storage. The program flowchart 400B can begin at step S440 and proceed to step S442.

At step S442, the parallel RS-RAID architecture can receive a data change. For example, a storage device can receive a new data symbol that replaces an old data symbol. The program flow can then proceed to step S444.

At step S444, the RAID controller coupled to the storage device can calculate a temporary component according to Equation 10. The RAID controller can obtain the data difference between the new data symbol and the old data symbol and weight the data difference by a Vandermonde matrix element. The program flow can then proceed to step S446.

At step S446, the temporary component can be communicated to the other RAID controllers. In an embodiment, a communication fabric can couple the various RAID controllers. The communication fabric can communicate the temporary component that corresponds to the data change to the RAID controllers that control the storage devices storing the checksums. The program flow can then proceed to step S448.

At step S448, a RAID controller that controls a storage device storing a checksum can update the checksum based on the received temporary component, for example according to Equation 11. The program flow can then proceed to step S450 and stop.

FIG. 5 shows an example of a data program flowchart 500 for a parallel RS-RAID architecture for data storage. The program flowchart 500 can begin at step S510 and proceed to step S520, where the configuration matrix of the parallel RS-RAID architecture can be read as described with respect to FIG. 4A. From program step S520, the program flow can proceed to program step S525, where data and checksum symbols can be read from the storage devices. For example, eight data symbols and four checksum symbols can be read from twelve storage devices. In this example, at least eight data or checksum symbols can be read from operational storage devices.

From program step S525, the program flow can proceed to program step S530, where the program can calculate intermediate data symbols. For example, the program can calculate the intermediate data symbols according to Equation 13. It may be understood that the weighting coefficients a_{i,j} used in Equation 13 can be calculated in advance and distributed to the RAID controllers, or can be recalculated as needed, for example after the configuration matrix is read in program step S520. From program step S530, the program flow can proceed to program step S540, where the program can distribute the intermediate data symbols to the parallel RAID controllers.

From program step S540, the program flow can proceed to program step S550, where the program can receive intermediate data symbols from the parallel RAID controllers. From program step S550, the program flow can proceed to program step S560, where the program can calculate the recovered data symbols from the intermediate data symbols that come from both the local RAID controller and the parallel RAID controllers. In other words, the program can sum the intermediate data symbols according to Equation 14. From program step S560, the program flow can proceed to program step S570, where program execution can stop.

Although the invention has been described in conjunction with specific exemplary embodiments thereof, it is evident that many alternatives, modifications, and variations will be apparent to those skilled in the art. Accordingly, the embodiments of the invention set forth herein are intended to be illustrative, not limiting. Changes may be made without departing from the spirit and scope of the invention.

Claims

1. A distributed data storage device comprising:
a plurality of parallel data storage clusters, each including:
a plurality of data storage devices that store a plurality of data symbols; and
a RAID controller coupled to each of the plurality of data storage devices and configured to calculate a local intermediate checksum from the plurality of data symbols and to calculate a checksum of the plurality of data symbols based on a plurality of intermediate checksums including the local intermediate checksum; and
a communication fabric coupled to each of the plurality of parallel data storage clusters, the communication fabric delivering the plurality of intermediate checksums to the plurality of parallel data storage clusters.

2. The distributed data storage device of claim 1, wherein the RAID controller calculates the local intermediate checksum based on a weighted sum of the plurality of data symbols.

3. The distributed data storage device of claim 1 or 2, wherein at least one of the RAID controllers of the plurality of parallel data storage clusters receives the intermediate checksum from another parallel data storage cluster and generates, based on the received intermediate checksum and the local intermediate checksum, a checksum to be stored in one of the plurality of data storage devices of the plurality of parallel data storage clusters.

4. The distributed data storage device of any one of claims 1 to 3, wherein the RAID controller is further configured to store the checksum in the data storage device.

5. The distributed data storage device of any one of claims 1 to 4, wherein the communication fabric is further configured to deliver a plurality of information symbols, including the plurality of data symbols, to the plurality of parallel data storage clusters.

6. The distributed data storage device of claim 5, wherein the RAID controller further comprises a RAID control unit that selects the plurality of data symbols from a subset of the plurality of information symbols.

7. The distributed data storage device of claim 6, wherein the RAID control unit further comprises a storage device failure sense unit that establishes a list of operational data storage devices from the plurality of data storage devices.

8. The distributed data storage device of claim 7, wherein the RAID control unit is further configured to calculate a data recovery matrix based on the list of operational data storage devices.

9. The distributed data storage device of claim 8, wherein the communication fabric is further configured to deliver the data recovery matrix to the plurality of parallel data storage clusters when the list of operational data storage devices changes.

10. The distributed data storage device of claim 6, wherein the RAID controller further comprises an intermediate sum device configured to calculate the intermediate checksum based on a list of operational data storage devices.

11. The distributed data storage device of claim 10, wherein the intermediate sum device further comprises a recalculator configured to update the local intermediate checksum when a data symbol changes.

12. The distributed data storage device of claim 8, wherein the RAID control unit is further configured to determine an inverse matrix of an augmented identity and Vandermonde matrix to calculate the data recovery matrix.

13. The distributed data storage device of claim 12, wherein an intermediate sum device is further configured to calculate intermediate data symbols based on the data recovery matrix and a vector of at least one of read data symbols and read checksum symbols that are read from the storage devices.

14. The distributed data storage device of claim 12, wherein the RAID control unit is further configured to select a set of rows of the augmented identity and Vandermonde matrix based on the list of operational data storage devices and to determine an inverse matrix of the set to form the data recovery matrix.

15. The distributed data storage device of claim 14, wherein the RAID controller is further configured to send, to the plurality of parallel data storage clusters, a message that causes an intermediate sum device to calculate intermediate data symbols.

16. The distributed data storage device of claim 15, wherein the communication fabric is further configured to distribute the intermediate data symbols to each of the plurality of parallel data storage clusters.

17. The distributed data storage device of claim 16, wherein the RAID controller is further configured to calculate recovered data symbols from a plurality of intermediate data symbols.

18. The distributed data storage device of any one of claims 1 to 17, wherein the local intermediate checksum is a partial sum of a subset of the plurality of data symbols corresponding to the plurality of data storage devices assigned to the RAID controller.

19. The distributed data storage device of claim 13, wherein the intermediate data symbols are a partial sum of a subset of the read data symbols and the read checksum symbols corresponding to the data storage devices assigned to the RAID controller.

20. A data storage method comprising:
assigning a plurality of data storage devices to a plurality of parallel data storage clusters;
storing a plurality of data symbols in each of the plurality of parallel data storage clusters;
calculating, by each of a plurality of RAID controllers included in the respective parallel data storage clusters, an intermediate checksum from a weighted sum of the plurality of data symbols;
receiving, by at least one of the plurality of RAID controllers, the intermediate checksum from each of the other RAID controllers of the plurality of RAID controllers;
calculating, by the at least one of the plurality of RAID controllers, a checksum of the plurality of data symbols based on the calculated intermediate checksum and the intermediate checksums received from each of the other RAID controllers; and
storing the checksum in at least one of the plurality of data storage devices.

21. The data storage method of claim 20, further comprising:
selecting the plurality of data symbols from a subset of information symbols; and
distributing the selected data symbols to the plurality of parallel data storage clusters using a communication fabric.

22. The data storage method of claim 21, further comprising detecting a set of operational data storage devices in each of the plurality of parallel data storage clusters.

23. The data storage method of claim 22, further comprising distributing a data recovery matrix based on the set of operational data storage devices to each of the plurality of parallel data storage clusters.

24. The data storage method of claim 23, further comprising calculating intermediate data symbols based on the data recovery matrix and a vector of at least one of read data symbols and read checksum symbols that are read from the storage devices.

25. The data storage method of claim 24, further comprising:
delivering the intermediate data symbols to each of the plurality of parallel data storage clusters; and
calculating a recovered data symbol from the sum of the intermediate data symbols.

26. A method for error correction in a distributed data architecture, comprising:
reading a configuration matrix that assigns a plurality of data storage devices to a plurality of parallel data storage clusters;
storing a plurality of data symbols in the plurality of parallel data storage clusters;
calculating at least one intermediate checksum for each parallel data storage cluster, wherein each of a plurality of RAID controllers included in the respective parallel data storage clusters calculates an intermediate checksum from the plurality of data symbols stored in its assigned data storage devices;
receiving, by at least one of the plurality of RAID controllers, the intermediate checksum from each of the other RAID controllers of the plurality of RAID controllers;
summing, by the at least one of the plurality of RAID controllers, the calculated intermediate checksum and the intermediate checksums received from each of the other RAID controllers to form a checksum; and
storing the checksum in at least one data storage device.