Python implementation of SC_SI (Subspace Clustering with Scalable and Iterative Algorithm). SC_SI performs clustering by solving multiple Robust PCA (R-PCA) problems.
The algorithm was proposed in my master's thesis,
Scalable Iterative Algorithm for Robust Subspace Clustering: Convergence and Initialization.
The full thesis is available at the following links:
- https://sanghyukchun.github.io/home/media/papers/chun2016scsi.pdf
- https://library.kaist.ac.kr/mobile/book/view.do?bibCtrlNo=649637
The API is almost the same as scikit-learn's [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html):
```python
import numpy as np

from sc_si.clustering.sc_si import SC_SI, MiniBatchSC_SI

# Toy dataset: 10,000 points in 500 dimensions.
n_data = 10000
dim_data = 500
X = np.random.random((n_data, dim_data))

n_clusters = 10    # number of subspaces (clusters)
n_components = 20  # dimensionality of each subspace

# If you have a large dataset, use `MiniBatchSC_SI` instead (see the sketch below).
model = SC_SI(n_clusters=n_clusters, n_components=n_components, alpha=1.0,
              init='sc_si', n_init=3, max_iter=100, verbose=True)
labels = model.fit_predict(X)

# The fitted model can also assign cluster labels to unseen data.
n_data2 = 1000
X_unseen = np.random.random((n_data2, dim_data))
labels_unseen = model.predict(X_unseen)
```
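For larger datasets, `MiniBatchSC_SI` is meant as a drop-in replacement. A minimal sketch, assuming it shares the `SC_SI` constructor arguments and adds a `batch_size` parameter (an assumption by analogy with scikit-learn's MiniBatchKMeans; check the actual constructor signature):

```python
import numpy as np

from sc_si.clustering.sc_si import MiniBatchSC_SI

X = np.random.random((100000, 500))  # a dataset too large for plain SC_SI

# NOTE: `batch_size` is assumed here by analogy with scikit-learn's
# MiniBatchKMeans; it may differ in the actual API.
mb_model = MiniBatchSC_SI(n_clusters=10, n_components=20, alpha=1.0,
                          init='random', n_init=3, max_iter=100,
                          batch_size=1000, verbose=True)
mb_labels = mb_model.fit_predict(X)
```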
- `alpha` controls the robustness of the objective function: the smaller `alpha` is, the more robust the objective becomes.
- Use `alpha=1.0`. Theoretically, `alpha` can be any value in (0, 2], but in practice I recommend `alpha=1.0` (see the first sketch after this list).
- Use the default initialization, SC_IN, if the dataset is not too large. Otherwise, use 'random' initialization with a large `n_init`, or run the initialization on a sampled subset of the dataset (see the sampling sketch after this list).
- Use the default `svd_algorithm` (subspace iteration). It is much faster and uses less memory than exact SVD.
- For datasets with few outliers, use a large `beta` (e.g. 10); otherwise, set `beta = alpha`.
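As a concrete illustration of the robustness trade-off, a sketch comparing the two regimes (the values are illustrative, not tuned, and passing `beta` as a constructor argument is an assumption; check the `SC_SI` signature):

```python
from sc_si.clustering.sc_si import SC_SI

# Heavily contaminated data: a small alpha gives a more robust
# objective, with beta tied to alpha as recommended above.
# NOTE: `beta` as a constructor argument is an assumption.
robust_model = SC_SI(n_clusters=10, n_components=20,
                     alpha=0.5, beta=0.5, init='sc_si', n_init=3)

# Mostly clean data: the recommended alpha=1.0 with a large beta.
clean_model = SC_SI(n_clusters=10, n_components=20,
                    alpha=1.0, beta=10.0, init='sc_si', n_init=3)
```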
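One way to follow the sampling advice for very large datasets is to fit on a random subset and then label the full dataset with `predict`. A minimal sketch, assuming a scikit-learn-style `fit` method consistent with the KMeans-like API above (this is a workaround, not a dedicated API):

```python
import numpy as np

from sc_si.clustering.sc_si import SC_SI

X = np.random.random((200000, 100))

# Fit on a random ~5% sample of the data ...
rng = np.random.default_rng(0)
sample_idx = rng.choice(len(X), size=len(X) // 20, replace=False)
model = SC_SI(n_clusters=10, n_components=20, alpha=1.0,
              init='random', n_init=10, max_iter=100)
model.fit(X[sample_idx])

# ... then assign cluster labels to the full dataset.
labels = model.predict(X)
```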