This repository contains the source code for the final project of COMP_SCI 468: Programming Massively Parallel Processors with CUDA at Northwestern University.
In this project, we implemented the best-performing model from Robust Mini-Batch Gradient Descent (the ε-then-z-score-trimmed Huber loss without preprocessing) in both C++ (as a baseline) and CUDA.
Note: the CUDA code is optimized for the RTX 2060 Super graphics cards in the Wilkinson lab.
For more details on the model itself, please visit the Robust Mini-Batch Gradient Descent repo.
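For reference, the standard Huber loss on a residual $r = y - \mathbf{x}^\top \mathbf{w}$ with threshold $\delta$ is

$$
L_\delta(r) =
\begin{cases}
\tfrac{1}{2} r^2 & \text{if } |r| \le \delta, \\
\delta \left( |r| - \tfrac{1}{2} \delta \right) & \text{otherwise.}
\end{cases}
$$

As the model name suggests, the ε- and z-score-trimming steps discard outlying residuals before this loss is aggregated; the exact trimming procedure is documented in the linked repo.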
```sh
make dependencies   # only before the first run
make                # build both the C++ and the CUDA implementations
make run            # run the same test on both implementations
# make clean        # clean up the binaries and generated input/output files
# make clobber      # also remove the graphs and the Python virtual environment
```
Two graphs, `testing.png` and `training.png`, will be generated, showing the regression lines over the training and testing sets. These graphs serve as a correctness check for the implementations. For more test cases, please visit the Robust Mini-Batch Gradient Descent repo.
| Implementation | # of models trained | Training Time (ms) |
| --- | --- | --- |
| C++ | 1 | 54.701 |
| C++ | 10 | 542.758 |
| C++ | 100 | 5,336.640 |
| C++ | 1000 | 54,236.484 |
| CUDA | 1 | 119.817 |
| CUDA | 10 | 121.361 |
| CUDA | 100 | 132.146 |
| CUDA | 1000 | 848.172 |
- Find out ways to reduce shared memory usage
- Add padding to adapt to non-power-of-2 batch sizes
- Implemented the baseline implementation in C++
- Implemented a naive implementation in CUDA
- Implemented the Python test script
- Parallelized residual calculation (a coalesced residual-kernel sketch follows this list)
- Parallelized merge sort (later replaced by the enumeration sort)
- Parallelized ε-trimming
- Parallelized z-score trimming
- Parallelized loss and gradient calculation (a thread-coarsened reduction sketch follows this list)
- Implemented a local copy of the weight vector in shared memory, writing to global memory only once the model is fully trained (a sketch follows this list)
- Changed `X` from row-major to column-major and applied other techniques for coalesced global memory access (illustrated in the residual-kernel sketch below)
- Changed the batch sample selection mechanism to avoid the cost of the random number generator
- Applied thread coarsening (illustrated in the loss-reduction sketch below)
- Resolved many thread divergence issues
- Implemented the enumeration sort (a sketch follows this list)
- Implemented the odd-even sort (later replaced by the enumeration sort)
- Unrolled the loops of the parallel merge sort (later replaced by the odd-even sort)
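The sketches below illustrate a few of these techniques. They are minimal, hypothetical reconstructions rather than the repo's actual kernels; all names, signatures, and launch parameters are our own assumptions.

First, parallel residual calculation with `X` stored column-major, so that threads in a warp read consecutive addresses (coalesced) when scanning a feature column:

```cuda
// Hypothetical sketch: one thread per sample. X is column-major (n * d),
// so X[j * n + i] puts column j contiguously over samples, and the warp's
// reads of X[j * n + i], X[j * n + i + 1], ... coalesce.
__global__ void residuals_colmajor(const float *X, const float *y,
                                   const float *w, float *r, int n, int d) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float pred = 0.0f;
    for (int j = 0; j < d; ++j)
        pred += X[j * n + i] * w[j];   // coalesced across the warp
    r[i] = y[i] - pred;                // residual for sample i
}

// Launch example:
// residuals_colmajor<<<(n + 255) / 256, 256>>>(dX, dy, dw, dr, n, d);
```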
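Next, the shared-memory weight cache. A plausible sketch of the pattern, assuming one block trains one model (the gradient computation itself is omitted):

```cuda
// Hypothetical sketch: block b trains model b. The weight vector lives in
// shared memory for every gradient-descent step and is written to global
// memory exactly once, after training finishes.
__global__ void train_one_model(const float *X, const float *y,
                                float *w_out, int n, int d,
                                int iters, float lr) {
    extern __shared__ float w_s[];   // d floats of dynamic shared memory

    // stage the initial weights (zeros here) into shared memory
    for (int j = threadIdx.x; j < d; j += blockDim.x)
        w_s[j] = 0.0f;
    __syncthreads();

    for (int it = 0; it < iters; ++it) {
        // real gradient computation over the mini-batch omitted for brevity
        for (int j = threadIdx.x; j < d; j += blockDim.x) {
            float g_j = 0.0f;        // placeholder for the actual gradient
            w_s[j] -= lr * g_j;      // update touches fast shared memory only
        }
        __syncthreads();
    }

    // single write-back to global memory
    for (int j = threadIdx.x; j < d; j += blockDim.x)
        w_out[blockIdx.x * d + j] = w_s[j];
}

// Launch example:
// train_one_model<<<numModels, 128, d * sizeof(float)>>>(dX, dy, dW, n, d, iters, lr);
```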
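The enumeration (rank) sort: each thread counts how many elements precede its own and writes its element directly to that final position. It does O(n²) work but is branch-friendly and trivially parallel for small batches (again a hypothetical sketch):

```cuda
// Hypothetical sketch of an enumeration sort: thread i computes the rank of
// in[i] by counting smaller elements, breaking ties by index so that every
// output slot is written exactly once.
__global__ void enumeration_sort(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float v = in[i];
    int rank = 0;
    for (int k = 0; k < n; ++k) {
        float u = in[k];
        rank += (u < v) || (u == v && k < i);   // stable tie-breaking
    }
    out[rank] = v;
}
```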
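Finally, a thread-coarsened loss reduction: each thread folds several residuals into a private accumulator before the block-level tree reduction, trading a little parallelism for fewer synchronizations. A hypothetical sketch; `delta` and `coarsen` are illustrative parameters:

```cuda
// Hypothetical sketch: each thread accumulates the Huber loss of `coarsen`
// residuals in a register (thread coarsening), then the block reduces its
// partial sums in shared memory. blockDim.x must be a power of 2, and *loss
// must be zeroed before launch.
__global__ void huber_loss_coarsened(const float *r, float *loss,
                                     int n, float delta, int coarsen) {
    extern __shared__ float partial[];   // blockDim.x floats
    int tid = threadIdx.x;
    int start = (blockIdx.x * blockDim.x + tid) * coarsen;

    float acc = 0.0f;
    for (int k = 0; k < coarsen; ++k) {  // coarsened serial work per thread
        int i = start + k;
        if (i < n) {
            float a = fabsf(r[i]);
            acc += (a <= delta) ? 0.5f * a * a
                                : delta * (a - 0.5f * delta);
        }
    }
    partial[tid] = acc;
    __syncthreads();

    // tree reduction over the block's partial sums
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(loss, partial[0]);
}
```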
We thank Prof. Nikos Hardavellas for the wonderful lectures throughout the quarter. Hanming especially thanks Nikos for the time spent during and beyond office hours, answering his numerous questions (some admittedly naive) about CUDA.