Skip to content
forked from mattdean1/cuda

A PyTorch wrapper of parallel exclusive scan in CUDA

Notifications You must be signed in to change notification settings

lxxue/prefix_sum

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Parallel Prefix Sum (Scan) with CUDA

Pytorch Usage Note

Installation

python setup.py install

Usage

from prefix_sum import prefix_sum_cpu, prefix_sum_cuda
# assuming input is a torch.cuda.IntTensor, num_elements is an integer
# allocate output_array on cuda
# e.g. output = torch.zeros((num_elements,), dtype=torch.int, device=torch.device('cuda'))
prefix_sum_cuda(input, num_elements, output)

# similarly for the CPU version
# except that both input and output are torch.IntTensor now
prefix_sum_cpu(input, num_elements, output)

Original README

My implementation of parallel exclusive scan in CUDA, following this NVIDIA paper.

Parallel prefix sum, also known as parallel Scan, is a useful building block for many parallel algorithms including sorting and building data structures. In this document we introduce Scan and describe step-by-step how it can be implemented efficiently in NVIDIA CUDA. We start with a basic naïve algorithm and proceed through more advanced techniques to obtain best performance. We then explain how to scan arrays of arbitrary size that cannot be processed with a single block of threads.

This implementation can handle very large arbitrary length vectors thanks to the recursively defined scan function.

Performance is increased with a memory-bank conflict avoidance optimization (BCAO).


See the timings for a performance comparison between:

  1. Sequential scan run on the CPU
  2. Parallel scan run on the GPU
  3. Parallel scan with BCAO

For a vector of 10 million entries:

  CPU      : 20749 ms
  GPU      : 7.860768 ms
  GPU BCAO : 4.304064 ms

Intel Core i5-4670k @ 3.4GHz, NVIDIA GeForce GTX 760

About

A PyTorch wrapper of parallel exclusive scan in CUDA

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Cuda 88.8%
  • C++ 6.0%
  • Python 3.7%
  • C 1.5%