GitHub - bayesian/boosting: Fast implementation of Gradient Boosting Machine (GBM) training algorithm.

Fast & Simple implementation of GBM

GBM was generally regarded as one of the best-performing supervised learning algorithms before the recent deep learning revolution. It is robust but not scalable.

Goal:

Fast (Handle 40M rows * 500 features within 10 hours)
Simple (the fewer lines of code, the better; <= 3000)
Modular/Extensible for further improvements

Algorithms:

pre-bucketing (data compression)
bucket sort to build histogram, then linear scan to find best split
hints for intelligently choosing the number of buckets
stochastic gradient boosting machine

Features:

correctness (model + feature importances, "fimps")
deterministic randomness
easily extensible to a wide variety of similar algorithms: random forest, bagging, GBM, for both classification and regression; regression takes priority
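
One way to get deterministic randomness for example and feature sampling is to derive every random decision from an explicit seed. The sketch below is an assumption about how this could be done, not the repository's implementation; `sampleIndices` and its `stream` parameter (for example a tree index) are hypothetical names. It uses only the raw output of `std::mt19937_64`, whose sequence is fixed by the C++ standard, and avoids the standard distribution classes, whose output may differ between library implementations.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Keeps each index in [0, n) with probability `rate`, reproducibly for a given
// (seed, stream) pair. `stream` could be a tree index (hypothetical usage).
std::vector<std::size_t> sampleIndices(std::size_t n, double rate,
                                       std::uint64_t seed, std::uint64_t stream) {
    // Mix the stream id into the seed so each tree gets its own sequence.
    std::mt19937_64 rng(seed ^ (stream * 0x9E3779B97F4A7C15ULL));
    std::vector<std::size_t> picked;
    for (std::size_t i = 0; i < n; ++i) {
        // Top 53 bits scaled by 2^-53 give a double in [0, 1).
        const double u = static_cast<double>(rng() >> 11) * (1.0 / 9007199254740992.0);
        if (u < rate) picked.push_back(i);
    }
    return picked;
}
```

Calling the function twice with the same arguments returns the same indices, so a training run can be replayed exactly when the seed is recorded with the model.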

New features:

byte/short: two layers of storage (saves both memory and CPU)
taking hints from previous feature importances (top 1/3 of features use short, the rest use byte)
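
The byte/short idea can be illustrated with a column that stores bucket ids in one byte when at most 256 buckets are used and in two bytes otherwise. This is a sketch under assumptions, not the repository's `DataSet` code: `BucketColumn` and `chooseWideFeatures` are hypothetical names, and "top 1/3" is interpreted here as the `n / 3` features with the highest importance from a previous run.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <stdexcept>
#include <vector>

// Hypothetical column of bucket ids with two storage widths.
class BucketColumn {
public:
    BucketColumn(const std::vector<std::uint16_t>& ids, std::size_t numBuckets)
        : wide_(numBuckets > 256) {
        if (numBuckets == 0 || numBuckets > 65536) {
            throw std::invalid_argument("numBuckets must be in [1, 65536]");
        }
        for (std::uint16_t id : ids) {
            if (id >= numBuckets) throw std::out_of_range("bucket id exceeds numBuckets");
            if (wide_) {
                shorts_.push_back(id);
            } else {
                bytes_.push_back(static_cast<std::uint8_t>(id));
            }
        }
    }
    bool isWide() const { return wide_; }
    std::size_t size() const { return wide_ ? shorts_.size() : bytes_.size(); }
    std::uint16_t at(std::size_t i) const {
        return wide_ ? shorts_[i] : static_cast<std::uint16_t>(bytes_[i]);
    }
    std::size_t bytesUsed() const {
        return wide_ ? shorts_.size() * sizeof(std::uint16_t) : bytes_.size();
    }

private:
    bool wide_;
    std::vector<std::uint8_t> bytes_;    // used when numBuckets <= 256
    std::vector<std::uint16_t> shorts_;  // used otherwise
};

// Marks the top third of features (by previous importance) for short storage.
// Ties are broken by feature index so the result is deterministic.
std::vector<bool> chooseWideFeatures(const std::vector<double>& importances) {
    std::vector<std::size_t> order(importances.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        if (importances[a] != importances[b]) return importances[a] > importances[b];
        return a < b;
    });
    std::vector<bool> wide(importances.size(), false);
    for (std::size_t rank = 0; rank < importances.size() / 3; ++rank) {
        wide[order[rank]] = true;
    }
    return wide;
}
```

With this layout the per-feature storage is between one and two bytes per row, which matches the `[f * d, f * d * 2)` range given for memory.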

Parameters:

m: number of trees
n: number of leaves per tree
r: example sampling rate
s: feature sampling rate

d: number of data points
f: number of features

k: number of buckets
ml: minimum number of data points per leaf

Complexity:

Memory: max(f * d1 * 8, [f * d, f * d * 2))

Algorithmic:

1: Bucketization: O(f * d1 * log(d1))
2: Continue reading: O(f * d2 * log(k))
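
The two bounds above are consistent with the following reading, which is an assumption rather than something the text states: bucket boundaries are computed by sorting the first d1 raw values of each feature, and each of the remaining d2 values is then mapped to a bucket by binary search over at most k boundaries. A minimal sketch with hypothetical helper names `computeBoundaries` and `toBucket`:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Picks up to k - 1 increasing boundaries from a sample of raw feature values.
// Sorting dominates the cost: O(d1 * log(d1)) for a sample of d1 values.
std::vector<double> computeBoundaries(std::vector<double> sample, std::size_t k) {
    std::vector<double> bounds;
    if (sample.empty() || k < 2) return bounds;
    std::sort(sample.begin(), sample.end());
    for (std::size_t b = 1; b < k; ++b) {
        const double v = sample[b * sample.size() / k];
        // Skip repeated values so boundaries stay strictly increasing.
        if (bounds.empty() || v > bounds.back()) bounds.push_back(v);
    }
    return bounds;
}

// Maps a raw value to a bucket id: the number of boundaries <= value.
// Binary search makes this O(log(k)) per value.
std::uint16_t toBucket(const std::vector<double>& bounds, double value) {
    return static_cast<std::uint16_t>(
        std::upper_bound(bounds.begin(), bounds.end(), value) - bounds.begin());
}
```

For the sample 1..8 with k = 4 the boundaries are 3, 5, 7, so values 1 and 2 land in bucket 0, 3 and 4 in bucket 1, and so on. Values outside the sampled range fall into the first or last bucket.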

3: Single Best Split: O(f' * d' + f' * k)
4a: depth-k balanced tree: k * S
4b: single n-leaves tree: #splits: (2n - 3), O(S * n * log(n)) (roughly)
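
One plausible reading of the (2n - 3) count in 4b is best-first growth: the root's best split is searched once, each applied split creates two children whose best splits must be searched, and the children of the final split never need a search. That gives 1 + 2(n - 2) = 2n - 3 searches for n >= 2 leaves. The sketch below only counts searches under that assumption; the split search is a stand-in (gain equals row count, rows are halved), and all names are hypothetical.

```cpp
#include <cstddef>
#include <queue>
#include <vector>

// A node waiting on the frontier with the gain of its best split.
struct Candidate {
    double gain;
    std::size_t rows;
};

struct ByGain {
    bool operator()(const Candidate& a, const Candidate& b) const {
        return a.gain < b.gain;  // max-heap on gain
    }
};

// Stand-in for a real best-split search; increments the search counter.
Candidate evaluate(std::size_t rows, std::size_t& searches) {
    ++searches;
    return Candidate{static_cast<double>(rows), rows};
}

// Grows a tree best-first up to numLeaves leaves and returns how many
// best-split searches were performed.
std::size_t countSplitSearches(std::size_t rows, std::size_t numLeaves) {
    std::size_t searches = 0;
    if (numLeaves < 2) return searches;  // a single leaf needs no split
    std::priority_queue<Candidate, std::vector<Candidate>, ByGain> frontier;
    frontier.push(evaluate(rows, searches));
    std::size_t leaves = 1;
    while (leaves < numLeaves && !frontier.empty()) {
        const Candidate best = frontier.top();
        frontier.pop();
        if (best.rows < 2) break;  // nothing left to split
        const std::size_t left = best.rows / 2;
        const std::size_t right = best.rows - left;
        ++leaves;
        // Children of the last split are final leaves, so skip their search.
        if (leaves < numLeaves) {
            frontier.push(evaluate(left, searches));
            frontier.push(evaluate(right, searches));
        }
    }
    return searches;
}
```

With enough rows for every split to be feasible, `countSplitSearches(1024, 8)` performs 13 searches, which equals 2 * 8 - 3.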

D: 20M, example sampling: 4M, feature sampling rate:

Components:

Config: (specify data format and training parameters)
DataSet: (column-wise storage, with self-compression)
Tree: (works both in compressed/raw)
TreeRegressor: (k-leaf regression tree)
GbmFun: (function to extend to different types of loss)
Gbm: (gradient boosting machine)
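
To illustrate how a loss abstraction such as GbmFun can let one boosting loop serve both regression and classification, here is a minimal sketch. It does not reproduce the actual interface in `GbmFun.h` or `LogisticFun.h`; the class names `LossFunction`, `SquaredLoss`, `LogisticLoss` and the helper `pseudoResiduals` are hypothetical. Each round of gradient boosting fits the next tree to the negative gradient of the loss at the current predictions.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical loss interface: y is the label, f is the current model score.
class LossFunction {
public:
    virtual ~LossFunction() = default;
    virtual double loss(double y, double f) const = 0;
    // Negative gradient of the loss with respect to f (the pseudo-residual).
    virtual double negativeGradient(double y, double f) const = 0;
};

// Regression: L = 0.5 * (y - f)^2, so -dL/df = y - f.
class SquaredLoss : public LossFunction {
public:
    double loss(double y, double f) const override { return 0.5 * (y - f) * (y - f); }
    double negativeGradient(double y, double f) const override { return y - f; }
};

// Binary classification with y in {0, 1} and f as log-odds:
// L = log(1 + exp(f)) - y * f, so -dL/df = y - sigmoid(f).
class LogisticLoss : public LossFunction {
public:
    double loss(double y, double f) const override {
        return std::log1p(std::exp(f)) - y * f;
    }
    double negativeGradient(double y, double f) const override {
        return y - 1.0 / (1.0 + std::exp(-f));
    }
};

// Targets for the next tree: one pseudo-residual per training row.
std::vector<double> pseudoResiduals(const LossFunction& fn,
                                    const std::vector<double>& labels,
                                    const std::vector<double>& scores) {
    std::vector<double> out(labels.size());
    for (std::size_t i = 0; i < labels.size(); ++i) {
        out[i] = fn.negativeGradient(labels[i], scores[i]);
    }
    return out;
}
```

Under this design, adding a new loss means adding one small class, while the tree-fitting code only ever sees a vector of targets.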

Files:

.gitignore
CMakeLists.txt
Concurrency.cpp
Concurrency.h
Config.cpp
Config.h
DataSet.cpp
DataSet.h
Gbm.cpp
Gbm.h
GbmFun.h
LICENSE
LogisticFun.h
README.md
Train.cpp
Tree.h
TreeRegressor.cpp
TreeRegressor.h
