GRAID_GPT

GPT coding GRAID

Reed-Solomon Coding for RAID:

The primary operation in Reed-Solomon coding is multiplying F, the lower m rows of an information dispersal matrix A = [I; F] (the identity matrix I stacked above F), by a vector of data elements d [87]. The result is a vector of redundant elements, the coding vector c. The redundancy comes from the over-qualification of the system: any k elements of e (the concatenation of d and c) can be used to recover d, even if some (or all) elements of d are unavailable.

The arithmetic is performed on members of a finite field [62], using exclusive-or for addition and a linear feedback shift register (LFSR) step for multiplication by two [5]. Multiplying two arbitrary elements this way means decomposing the product into a sum of products with powers of two, which can require many operations. However, in a finite field of size 2^w (where w is the number of bits per symbol), the identity x × y = exp(log(x) + log(y)) holds, where the addition is ordinary integer addition modulo 2^w − 1, and exp() and log() are exponentiation and logarithm in the finite field with a common base [87].

For RAID systems with w = 8, pre-calculated tables for the exp and log operations can be used, each 256 bytes in size. Multiplication is then implemented with three table look-ups and one addition modulo 2^w − 1 instead of many more logical operations. Decoding (recovering missing data elements from any k remaining data and/or coding elements) follows the same principles.
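As a concrete illustration of this table-based arithmetic, here is a minimal sketch in CUDA-compatible C. It assumes the common primitive polynomial 0x11D for GF(2^8); the table and function names are ours, not this repository's API.

```cuda
#include <stdint.h>

static uint8_t gf_exp[256];   // gf_exp[i] = 2^i in GF(2^8)
static uint8_t gf_log[256];   // inverse of gf_exp for nonzero bytes

// Build both 256-byte tables by repeatedly multiplying by two, one LFSR
// step per iteration, reducing by the primitive polynomial on overflow.
void gf_init_tables(void) {
    uint16_t x = 1;
    for (int i = 0; i < 255; i++) {
        gf_exp[i] = (uint8_t)x;
        gf_log[x] = (uint8_t)i;
        x <<= 1;                       // multiplication by two (LFSR step)
        if (x & 0x100) x ^= 0x11D;     // reduce modulo the primitive polynomial
    }
}

// x * y = exp(log(x) + log(y)): three table look-ups plus one integer
// addition modulo 2^w - 1, instead of a chain of shifts and XORs.
uint8_t gf_mult(uint8_t a, uint8_t b) {
    if (a == 0 || b == 0) return 0;    // log(0) is undefined; the product is 0
    return gf_exp[(gf_log[a] + gf_log[b]) % 255];
}
```

The % 255 is the modulo-2^w − 1 addition from the identity above; a common variant doubles the exp table to 512 entries so the modulo can be dropped.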

Mapping Reed-Solomon Coding to GPUs:

GPUs differ significantly from CPUs in their architecture: to deliver interactive graphics, they are built to perform enormous numbers of small, independent, memory-intensive computations per second. These same qualities apply directly to Reed-Solomon coding.

In this context, a buffer is a slice of data of size s bytes. For a coding scheme with k data buffers and m coding buffers, the buffers are usually stored together in contiguous memory known as the buffer space. Positionally, the i-th buffer occupies bytes i·s through (i+1)·s − 1 of the buffer space; without explicit qualification, however, "the i-th buffer" refers to its contents rather than its position.

Within the buffer space, the first k buffers hold the data: buffer 0 holds the first s bytes, buffer 1 the next s bytes, and so on. The remaining m buffers hold the coding data.
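A one-line helper makes this layout concrete; the function name is hypothetical, purely for illustration.

```cuda
#include <stddef.h>
#include <stdint.h>

// Byte b of buffer i within a contiguous buffer space of (k + m) buffers of
// s bytes each: data buffers occupy indices 0..k-1, coding buffers k..k+m-1.
static inline size_t buffer_offset(size_t i, size_t b, size_t s) {
    return i * s + b;
}
```

For example, buffer_offset(k, 0, s) is the first byte of the first coding buffer.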

GPU Architecture and Reed-Solomon Coding:

A significant feature of GPUs is their numerous multi-threaded processing cores. For instance, the NVIDIA GeForce 8800 GTX has 128 cores, and the GeForce GTX 285 contains 240 [76]. These cores are optimized for running many threads of just a few instructions each. For parity generation, each set of bytes in the data stream (byte b of each buffer) can be treated as an independent computation. The threading implementation hides memory access latency by maximizing the bytes moved per memory request: each thread handles four bytes per buffer, so the wide 384-bit memory bus can service many threads transferring 32 bits each in parallel. Consequently, even with thousands of threads, each thread remains small, and the GPU's thread scheduling generates parity for large amounts of data at once.
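The following kernel sketches this per-thread pattern under stated assumptions: it is not the repository's actual kernel, all names are ours, d_log/d_exp are device copies of the 256-byte tables, frow holds the k coefficients of one row of F, and s is a multiple of four.

```cuda
#include <stdint.h>
#include <stddef.h>

// Each thread produces four bytes of one coding buffer from the matching
// four bytes of every data buffer, one 32-bit transfer per buffer.
__global__ void rs_encode_row(const uint8_t *bufs,    // k data buffers, packed
                              uint8_t *coding,        // one s-byte coding buffer
                              const uint8_t *frow,    // one row of F (k entries)
                              const uint8_t *d_log,
                              const uint8_t *d_exp,
                              int k, size_t s) {
    size_t base = ((size_t)blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (base + 4 > s) return;                         // thread past buffer end
    uint8_t out[4] = {0, 0, 0, 0};
    for (int i = 0; i < k; i++) {
        // One 32-bit load covers this thread's four bytes of data buffer i.
        uint32_t w = *(const uint32_t *)(bufs + i * s + base);
        for (int b = 0; b < 4; b++) {
            uint8_t d = (uint8_t)(w >> (8 * b));
            if (d && frow[i])                         // gf_mult via the tables
                out[b] ^= d_exp[(d_log[d] + d_log[frow[i]]) % 255];
        }
    }
    *(uint32_t *)(coding + base) =
        (uint32_t)out[0] | ((uint32_t)out[1] << 8) |
        ((uint32_t)out[2] << 16) | ((uint32_t)out[3] << 24);
}
```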

The cores are organized into multiprocessors, each containing eight cores and a shared memory space. This shared memory is banked, allowing up to 16 accesses in parallel at a speed close to that of the cores' registers, although bank conflicts degrade performance. With 30 multiprocessors in the GeForce GTX 285, each core in a warp can access a separate memory bank, allowing up to 240 simultaneous table look-ups. Conflicts are hard to avoid entirely (a birthday-paradox effect, since look-up indices are effectively random), but simulations show that about 4.4 accesses per bank, on average, satisfy the table look-ups of 32 simultaneous threads. On the GTX 285 each access takes four clock cycles, so 17.6 clock cycles fulfill 32 requests: 1.82 requests per cycle per shared memory unit, or 55 requests per cycle across the chip.
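Here is a sketch of how those tables might be staged into banked shared memory, with a deliberately simple body (squaring each byte in GF(2^8)) so the example stays self-contained; kernel and variable names are ours.

```cuda
#include <stdint.h>
#include <stddef.h>

__global__ void gf_square_smem(const uint8_t *g_log, const uint8_t *g_exp,
                               const uint8_t *in, uint8_t *out, size_t n) {
    __shared__ uint8_t s_log[256], s_exp[256];
    // Cooperative copy: the block's threads stride across the 256 entries.
    for (int i = threadIdx.x; i < 256; i += blockDim.x) {
        s_log[i] = g_log[i];
        s_exp[i] = g_exp[i];
    }
    __syncthreads();                  // tables resident before any look-up

    size_t idx = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    uint8_t x = in[idx];
    // Two shared-memory look-ups and an add; requests that collide on a
    // bank serialize, which is where the ~4.4 accesses per bank arise.
    out[idx] = x ? s_exp[(2 * s_log[x]) % 255] : 0;
}
```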

Another hardware feature that benefits Reed-Solomon coding is support for constant memory. Because graphics rendering leans heavily on read-only textures and constants, GPUs must serve such accesses quickly to sustain rendering performance [2]. Unlike other memory types in the CUDA architecture, constant memory is immutable during kernel execution, so each multiprocessor can cache its constant accesses; once data is loaded into this cache, reading it is as fast as reading a register. This makes constant memory an ideal home for the information dispersal matrix: the matrices are small, and each element is accessed in lock-step by all cores within a multiprocessor, so the cache can satisfy every request for a matrix element as if the value were in a register.
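A sketch of that placement, with an assumed 32×32 upper bound on the matrix dimensions; the symbol and function names are ours.

```cuda
#include <stdint.h>
#include <cuda_runtime.h>

// The lower m rows of the dispersal matrix, cached per multiprocessor.
__constant__ uint8_t c_F[32][32];

// Host-side upload of the m x k coefficients before launching encode kernels.
void upload_dispersal_matrix(const uint8_t F[32][32]) {
    cudaMemcpyToSymbol(c_F, F, sizeof(c_F));
}
```

Inside a kernel, a warp reading c_F[row][i] in lock-step is served from the constant cache at register-like speed after the first access.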
