

### Screaming Fast Galois Field Arithmetic Using Intel SIMD Instructions

Ethan L. Miller Center for Research in Storage Systems University of California, Santa Cruz (and Pure Storage)

### **Authors**









Jim Plank Univ. of Tennessee Kevin Greenan EMC/Data Domain (now at Box.com) Ethan Miller UC Santa Cruz (and Pure Storage)

These slides are derived from the presentation Jim gave at FAST 2013 (used with permission).

### **Erasure codes are everywhere**





### **Conventional wisdom (FAST 2009)**





### **Conventional wisdom**



Inconvenient: Reed-Solomon codes are powerful, general and flexible

Led to a proliferation of XOR-based codes





□ However, in recent years....

- Eerily smug reports of doing Reed-Solomon coding at "cache line speeds"
- No need for messy XOR codes!
- But what's the secret handshake?
- □ In this talk, we reveal the secret handshake

No prior experience with Galois field arithmetic necessary!



Using Intel's SSE3 SIMD instructions gets you

- Galois field arithmetic fast enough that performance is limited by L2/L3 cache
- Factor of <u>2.7x</u> to <u>12x</u> faster than previous implementations
- All on a single general-purpose CPU core!
- Open source library: GF-Complete
  - Gives you the secret handshake in a neat package
  - Flexible BSD license



- Galois field is also known as a finite field
- Contains a finite set of elements
  - **\Box** Field with *k* elements is called GF(*k*)
  - □ Often, k is a power of 2: GF(2<sup>*w*</sup>)
- Supports two operations: add & multiply
  - All results must be elements in the field
  - Additive inverse and multiplicative inverse
  - Usual rules apply (associative, distributive, etc.)
  - Add is done by XOR
  - Multiplication is ... more difficult

# How do storage systems use Galois field arithmetic?



Erasure codes are structured as linear combinations of <u>w-bit data words</u> in a Galois Field GF(2<sup>w</sup>)



© Ethan L. Miller & James S. Plank All Rights Reserved.





**w**: the number of bits in each element

Small w limits the width of each stripe



### What operations do we need?



- Required operations are
  - XOR two regions of memory together (addition)
  - **–** Multiply a region of memory by a constant in  $GF(2^w)$



© Ethern L. Miller & Jorgen C. Dierk, All Die

© Ethan L. Miller & James S. Plank All Rights Reserved.

# Using multiplication and XOR to generate a code symbol



Requires *n*-1 XORs and *n*-1 multiplications
 Need to multiply each data symbol by a (usually different) constant



© Ethan L. Miller & James S. Plank All Rights Reserved.

### **Performing fast multiplication**





Common (non-trivial) operation: multiply a 1K (large) vector of words ( $b_i$ ) in GF(2<sup>8</sup>) by a constant *a* 



Result should look like 1024 individual multiplications...

$$ab_0$$
  $ab_1$   $ab_2$   $ab_3$   $ab_4$   $ab_5$   $ab_6$   $ab_7$   $\cdots$   $ab_{1022}$   $ab_{1023}$ 

... but doing 1024 individual multiplications can be slow!



Modern Intel processors support vector operations: Intel "Streaming SIMD" instructions

- 128 bits per vector
  - □ 256 for some instructions in newest CPUs
- Operations done on all elements in parallel
  - □ Some instructions operate bitwise (*e.g.*, XOR)
  - □ Others operate on *k*-bit words (k=8, 16, 32, 64)
- Other architectures support similar instructions
   ARM
   Power



# Bitwise operations XOR: v = \_mm\_xor\_sillows AND: v = \_mm\_and\_sillows (a, b)

Other bitwise operations also supported



© Ethan L. Miller & James S. Plank All Rights Reserved.



Shift left (operates on 64-bit words):
v = \_mm\_slli\_epi64 (a, x)



- "Load one" (put same value into all 8-bit elements):
  v = \_mm\_set1\_epi8 (b)
- Not a single instruction—compiler expands it

2013 Storage Developer Conference.

© Ethan L. Miller & James S. Plank All Rights Reserved.

### **Killer instruction: shuffle**



- □ Shuffle instruction: v = \_mm\_shuffle\_epi8 (a, x)
- Performs 16 simultaneous table lookups using
  - a: 16 element table
  - b: 16 indices, each 4 bits long



### **Buffer-constant multiply in GF(2<sup>4</sup>)**



We can use a single lookup to multiply in GF(2<sup>4</sup>)
 Example: multiply 16 bytes A by 7 in GF(2<sup>4</sup>)





- □ Not that slow...
- Similar to calculating multiplication tables for base-10 arithmetic
  - Done by repeated multiply-by-two and reduction
- Details aren't important for now: just treat the table like a lookup table

YEAR



### □ Example: multiply 16 bytes A by 7 in GF(2<sup>4</sup>)



Create indices from the low-order bits of each byte in the vector

Perform the table lookup using a shuffle



### **□** Example: multiply 16 bytes A by 7 in $GF(2^4)$



#### Create indices from the high-order bits

Perform the table lookup using a shuffle

2013 Storage Developer Conference.

© Ethan L. Miller & James S. Plank All Rights Reserved.



### **\square** Example: multiply 16 bytes *A* by 7 in GF(2<sup>4</sup>)



2013 Storage Developer Conference.

© Ethan L. Miller & James S. Plank All Rights Reserved.



Split each symbol into two 4-bit pieces
 Use the distributive law of multiplication
 All operands and results are 8 bits

$$a = (a_{high} << 4) \oplus a_{low} \rightarrow ab = (a_{high} << 4)b \oplus a_{low}b$$
  
Two different tables!

Example:  $0xe4 = 0xe0 \oplus 0x04 \longrightarrow 0x85 \times 0xe4 = 0x85 \times 0xe0 \oplus 0x85 \times 0xe4$ 

### Need to calculate h\_tbl in a similar way to how we calculated l\_tbl

□ Otherwise, code is identical to GF(2<sup>4</sup>)!

### **Does this work for GF(2<sup>16</sup>)?**



We can still use the distributive law, but...
 Operands and results are 16 bits
 Tables can only handle 8 bits at a time!



4 pieces in each 16-bit word, and 2 tables per piece = 8 total tables

### Mapping of words to memory matters

Standard mapping of 16-bit words a-h to 128 bit vector (each box is 4 bits)



Requires 8 table lookups for 8 products



2013 Storage Developer Conference.

© Ethan L. Miller & James S. Plank All Rights Reserved.

## Mapping of words to memory matters

Alternate mapping: split each 16-bit word over two 128-bit vectors



Still requires 8 table lookups for 8 products, but now we get <u>256</u> bits for our effort





□ This is called the *alternate mapping* 

- Has all the properties needed for Reed-Solomon coding
- May be confusing: it's harder to "read" memory

### Conversions are simple and fast

- □ Standard → alternate: 7 SIMD instructions
- □ Alternate → standard: 2 SIMD instructions

# But you don't need to do this if you don't want to!

### Does this work for GF(2<sup>32</sup>)?



Again, we can use the distributive law, but...
 Operands and results are now 32 bits

8 sub-products × 4 tables each ⇒ 32 tables!



The same alternate mapping trick can be used here, too.

### **Performance: overview**





□ Lots of comparisons...

### **Performance: experiments**





#### **3.4 GHz Intel Core i7-3770**

#### 256 KB L2 cache, 8 MB L3 cache

Running buffer-constant multiply on various buffer sizes

#### Lots of comparisons...

### **Performance: experiments**





□ 256 KB L2 cache, 8 MB L3 cache

Running buffer-constant multiply on various buffer sizes

**Lots of comparisons...** 



- Memcpy & XOR are as you'd think
- Anvin\*2 is a technique for multiplying 128 bits by 2 in any Galois field in a few SIMD instructions (from code in the Linux kernel RAID6 driver)



# 3.4 GHz Intel Core i7-3770 256 KB L2 cache, 8 MB L3 cache

- Running buffer-constant multiply on various buffer sizes
- □ Lots of comparisons...

### **Performance: cache saturation**





### **Performance: traditional**





Traditional techniques don't come close to cache line speeds

### Rizzo, Jerasure, Onion Networks

NOTE: Both axes use log-scaling

### **Performance: non-traditional**





Non-traditional techniques do better
 Require amortization for w=8 and w=16
 Not effective for w=32
 Still below cache-line speeds

### **Performance: Intel SIMD**





- Our techniques perform identically to Anvin\*2 for w=4,8,16: cache-limited
- Alternate mapping makes a significant difference
   w=16 and w=32 show some amortization effects

### Performance improvement





### **GF-Complete library now available**



### □ Big open-source GF arithmetic library in C



Please use it, and let us know when you do



- 2009: H. Peter Anvin publishes a code sequence for doing fast Galois field arithmetic for RAID6\*
- 2010: I implement fast Galois field arithmetic for Pure Storage
- 2012: Jim Plank and I discuss writing a paper at a conference in Asilomar
- 2012–13: We work with Kevin Greenan and some undergrads to write the library

Add some new optimizations (you'll learn about)

\*http://web.archive.org/web/20090807060018/http://www.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf





- Incorporating new optimizations for the latest Intel instruction sets
  - 256-bit vectors
  - Carry-free multiply for large fields
  - Other optimizations
- Adding erasure code implementations
  - Investigating optimizations for doing the codes themselves
  - Example: calculate codes across or down?
- □ We're open to suggestions!



When Galois field arithmetic runs at XOR speed, it frees up code design

- Rotated Reed-Solomon array codes
- Pyramid/LRC codes (Microsoft)
- PMDS codes (IBM)
- □ SD codes
- Regenerating codes
- Erasure code designers are no longer handcuffed to XORs





**GF-Complete is** 

- Cool
- Fast
- Open-source
- Ready to use!





#### http://bitbucket.org/ethanmiller/gf-complete

Thanks to my collaborators:

Jim Plank, Kevin Greenan, and an army of undergrads at UTK