## **Tensaurus:** A Versatile Accelerator for Mixed Sparse-Dense Tensor Computations

Nitish Srivastava, Hanchen Jin, Shaden Smith<sup>2</sup>, Hongbo Rong<sup>3</sup>, David Albonesi, and Zhiru Zhang

> Cornell University <sup>2</sup>Microsoft AI & Research <sup>3</sup>Intel Parallel Computing Lab

## What is a Tensor?

- Tensors are a generalization of matrices to n dimensions
  - A scalar is a tensor with 0 dimensions
  - A vector is a tensor with 1 dimension
  - A matrix is a tensor with 2 dimensions, and so on







- High-dimensional data
- Density: 10<sup>-7</sup> %
- Requires low-dimensional representation for ease of analysis



#### **Tensor Decompositions for Low-Dimensional Representation**



**Tucker Decomposition** 











### **Challenges with Sparse-Dense Tensor Acceleration**

#### Dense Tensor Acceleration

- Systolic arrays provide high utilization of both memory and compute



#### Mixed Sparse-Dense Tensor Acceleration

- Memory bound
- Hard to achieve high compute and bandwidth utilization
- Goal: Leverage a dense accelerator to efficiently perform sparse-dense compute
- Key approach: Co-design of accelerator <u>architecture</u> and <u>sparse format</u>
  - Low overhead of supporting sparse compute
  - High compute and bandwidth utilization







SpMV (<u>Sp</u>arse <u>M</u>atrix dense <u>V</u>ector multiply)










Compressed Sparse Row Format (CSR)



**Problems with CSR** 

- Non-streaming & non-vectorized accesses
- Indirect memory access
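The access pattern behind these two problems shows up in a plain CSR SpMV loop. The following is a minimal Python sketch, not code from the talk; `row_ptr`, `col_idx`, and `values` are the conventional CSR array names and are assumed here for illustration.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Row i owns the slice [row_ptr[i], row_ptr[i + 1]). Its length
        # differs from row to row, so the inner loop has no fixed trip
        # count and is hard to stream or vectorize.
        for p in range(row_ptr[i], row_ptr[i + 1]):
            # x is indexed through col_idx: an indirect, data-dependent read.
            y[i] += values[p] * x[col_idx[p]]
    return y


# A = [[1, 0, 2],
#      [0, 0, 0],
#      [0, 3, 4]]
row_ptr = [0, 2, 2, 4]
col_idx = [0, 2, 1, 2]
values = [1.0, 2.0, 3.0, 4.0]
x = [1.0, 10.0, 100.0]
y = spmv_csr(row_ptr, col_idx, values, x)
```

Note the empty second row: it contributes no work at all, while the other rows each contribute two multiply-accumulates.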



















#### **Computation Pattern for Tensor Kernels**



SF<sup>3</sup> compute pattern can express all the common dense and mixed sparse-dense tensor kernels
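As one illustration of a mixed sparse-dense kernel, MTTKRP (named later in the talk) computes D(i, r) = Σ A(i, j, k) · B(j, r) · C(k, r) over the nonzeros of a sparse tensor A. The Python sketch below is not from the talk: it assumes the tensor is given as a list of coordinate tuples, and the split into a scalar-times-row step followed by an element-wise row product is an assumption about how such a kernel could map onto the pattern, since the slide does not spell the pattern out. All names are hypothetical.

```python
def mttkrp_coo(nonzeros, B, C, num_rows):
    """MTTKRP for a 3-D sparse tensor given as (i, j, k, value) tuples.

    B and C are dense matrices stored as lists of rows of equal length.
    """
    rank = len(B[0])
    D = [[0.0] * rank for _ in range(num_rows)]
    for i, j, k, a in nonzeros:
        # Step 1: scalar times a dense row of C.
        scaled = [a * c for c in C[k]]
        # Step 2: element-wise product with a dense row of B,
        # accumulated into output row i.
        for r in range(rank):
            D[i][r] += scaled[r] * B[j][r]
    return D


nonzeros = [(0, 0, 0, 2.0), (0, 1, 1, 3.0), (1, 1, 0, 1.0)]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]
D = mttkrp_coo(nonzeros, B, C, num_rows=2)
```

Only the tensor is sparse here; every operand touched inside the loop body is a dense row, which is what makes a dense datapath a plausible fit.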

#### **PE for SF<sup>3</sup> Compute Pattern**



#### **Vertical Scaling Using Coarse-Grained Parallelism**



#### **Horizontal Scaling Using SIMD-Vector Parallelism**



#### **Tensaurus Architecture**



**MLU: Matrix Load Unit; TLU: Tensor Load Unit; MSU: Matrix Store Unit**





#### Accelerator for both dense and sparse-dense!!



## **Evaluation Methodology**

#### Cycle-level simulation in gem5

- 8 x 8 PE array, VLEN = 8
- Eight 16 KB RAMs per SPM (scratchpad memory)
- HBM: eight 128-bit physical channels (128 GB/s peak bandwidth)

#### RTL Modeling of a PE using PyMTL

- 28 nm (Synopsys & Cadence Tools)

#### Baselines

- CPU: Intel(R) Xeon(R) CPU E7-8867
  - SparseBLAS and SPLATT
- GPU: Titan XP
  - CuSparse, PaRTI
- Sparse NN Accelerator:
  - Cambricon-X [1]

#### Datasets

- FROSTT Tensors, Florida Sparse Matrices, AlexNet, VGG-16

#### Area and Power Breakdown

| Component | Area (mm<sup>2</sup>) | Area (%) | Power (mW) | Power (%) |
|-----------|------|---------|--------|---------|
| PE        | 0.625 | 27.2 % | 402.30 | 40.9 % |
| Xbar      | 0.066 | 2.8 %  | 24.27  | 2.5 %  |
| SPM       | 0.832 | 36.2 % | 296.05 | 30.1 % |
| MSU       | 0.759 | 33.0 % | 247.03 | 25.2 % |
| TLU       | 0.009 | 0.4 %  | 6.28   | 0.6 %  |
| MLU       | 0.009 | 0.4 %  | 6.28   | 0.6 %  |
| **Total** | 2.3   | 100 %  | 982.21 | 100 %  |

[1] Zhang, Shijin, et al. "Cambricon-X: An accelerator for sparse neural networks." Int'l Symp. on Microarchitecture (MICRO), 2016.

## **Cambricon-X Baseline**

- Cambricon-X uses a CSR-variant
  - Fills empty entries with padding (shown as x)
  - Uses vector bit-masks to indicate non-zero positions
  - Specialized for CNNs with low sparsity





- CSR results in a load-imbalanced schedule
  - Cambricon-X has synchronization boundaries across rows





Load imbalance due to synchronization at row boundaries
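A toy cycle count makes the effect concrete. The model below is an assumption made for illustration, not Cambricon-X's actual scheduler: each PE takes one row, processes one nonzero per cycle, and all PEs wait at a barrier before starting the next group of rows. The function names are hypothetical.

```python
import math


def cycles_row_synchronized(nnz_per_row, num_pes):
    """Cycles when PEs synchronize after every group of num_pes rows."""
    total = 0
    for start in range(0, len(nnz_per_row), num_pes):
        group = nnz_per_row[start:start + num_pes]
        # Every PE in the group waits for the longest row to finish.
        total += max(group)
    return total


def cycles_balanced(nnz_per_row, num_pes):
    """Lower bound if nonzeros could be spread evenly over the PEs."""
    return math.ceil(sum(nnz_per_row) / num_pes)


nnz_per_row = [8, 1, 1, 2, 1, 7, 1, 1]
sync_cycles = cycles_row_synchronized(nnz_per_row, num_pes=4)
ideal_cycles = cycles_balanced(nnz_per_row, num_pes=4)
```

With these row lengths the synchronized schedule takes 15 cycles against an ideal of 6, because a single long row in each group stalls the other three PEs.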

#### **Results on Sparse Neural Nets**



Overall, Tensaurus is 1.9x faster and 1.7x more energy-efficient than Cambricon-X, even on sparse neural nets

#### **Results on Sparse Tensor Decomposition**



For MTTKRP, Tensaurus is 22.9x and 3.1x faster, and 220x and 290x more energy-efficient, than the CPU and GPU baselines, respectively

## **Concluding Remarks**

#### Tensaurus: A versatile accelerator for sparse-dense tensor acceleration

- First accelerator for sparse tensor decompositions (MTTKRP, TTMc)
- Versatile: NOT limited to tensor decompositions. Also efficient for <u>sparse-dense matrix</u> <u>computations</u>
- Adaptable:
  - Also <u>accelerates dense kernels</u>
  - Easily adapts to different levels of sparsity found in various domains
- Key Approach: Co-design <u>sparse format</u> and <u>architecture</u>
- Key Results:
  - High bandwidth utilization (> 70% of peak bandwidth)
  - High speedup and energy efficiency compared to CPU, GPU and Cambricon-X

# **Thank you! Questions?**
