For StepMixR, please refer to this repository.
A Python package following the scikit-learn API for generalized mixture modeling. The package supports categorical data (Latent Class Analysis) and continuous data (Gaussian Mixtures/Latent Profile Analysis). StepMix can be used for both clustering and supervised learning.
Additional features include:
- Support for missing values through Full Information Maximum Likelihood (FIML);
- Multiple stepwise Expectation-Maximization (EM) estimation methods based on pseudolikelihood theory;
- Covariates and distal outcomes;
- Parametric and non-parametric bootstrapping.
If you find StepMix useful, please leave a ⭐ and consider citing our arXiv preprint:
@article{morin2023stepmix,
title={StepMix: A Python Package for Pseudo-Likelihood Estimation of Generalized Mixture Models with External Variables},
author={Morin, Sacha and Legault, Robin and Lalibert{\'e}, F{\'e}lix and Bakk, Zsuzsa and Gigu{\`e}re, Charles-{\'E}douard and de la Sablonni{\`e}re, Roxane and Lacourse, {\'E}ric},
journal={arXiv preprint arXiv:2304.03853},
year={2023}
}
You can install StepMix with pip, preferably in a virtual environment:
pip install stepmix
The following example fits a StepMix mixture to categorical variables in a preloaded data matrix. StepMix accepts either a numpy.array or a pandas.DataFrame. Categories should be integer-encoded and 0-indexed.
from stepmix.stepmix import StepMix
# Categorical StepMix Model with 3 latent classes
model = StepMix(n_components=3, measurement="categorical")
model.fit(data)
# Allow missing values
model_nan = StepMix(n_components=3, measurement="categorical_nan")
model_nan.fit(data_nan)
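The snippet above assumes `data` is already integer-encoded. As a minimal sketch of one way to get there, the following maps string labels to 0-indexed integer codes column by column using only NumPy. The `raw` array and the `encode_columns` helper are hypothetical names introduced for illustration; they are not part of StepMix.

```python
import numpy as np

# Hypothetical raw answers: one row per respondent, one column per item.
raw = np.array([
    ["low", "yes"],
    ["high", "no"],
    ["medium", "yes"],
    ["low", "no"],
])


def encode_columns(raw):
    """Map each column's labels to integer codes 0..K-1 (sorted label order)."""
    codes = np.empty(raw.shape, dtype=int)
    categories = []
    for j in range(raw.shape[1]):
        # np.unique returns the sorted labels and, for each entry,
        # the index of its label in that sorted array.
        labels, inverse = np.unique(raw[:, j], return_inverse=True)
        codes[:, j] = inverse
        categories.append(labels.tolist())
    return codes, categories


data, categories = encode_columns(raw)
print(data)        # integer-encoded, 0-indexed matrix
print(categories)  # label order per column, for decoding later
```

Keeping `categories` around makes it possible to translate the integer codes back into the original labels when interpreting the fitted classes.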
For binary data, you can also use measurement="binary" or measurement="binary_nan". For continuous data, you can fit a Gaussian Mixture with diagonal covariances using measurement="continuous" or measurement="continuous_nan".
Set verbose=1 for detailed output.
Please refer to the StepMix tutorials to learn how to combine continuous and categorical data in the same model.
Detailed tutorials are available in notebooks:
- Generalized Mixture Models with StepMix: an in-depth look at how mixture models can be defined with StepMix. The tutorial uses the Iris Dataset as an example and covers:
- Gaussian Mixtures (Latent Profile Analysis);
- Binary Mixtures (LCA);
- Categorical Mixtures (LCA);
- Mixed Categorical and Continuous Mixtures;
- Missing Values through Full-Information Maximum Likelihood.
- Stepwise Estimation with StepMix: a tutorial demonstrating how to define measurement and structural models. The tutorial discusses:
- LCA models with distal outcomes;
- LCA models with covariates;
- 1-step, 2-step and 3-step estimation;
- Corrections (BCH or ML) and other options for 3-step estimation;
- Putting it All Together: A Complete Model with Missing Values.
- Model Selection:
- Selecting the number of components in a mixture model (n_components) with cross-validation;
- Selecting the number of components with the Parametric Bootstrapped Likelihood Ratio Test (BLRT);
- Fit indices: AIC, BIC and other metrics.
- Parameters, Bootstrapping and CI: a tutorial discussing how to:
- Access StepMix parameters;
- Bootstrap StepMix estimators;
- Quickly plot confidence intervals.
- Supervised and Semi-Supervised Learning with StepMix:
- Binary Classification;
- Multiclass Classification;
- Semi-Supervised Learning;
- Cross-Validation.
- Deriving p-values in StepMix: a tutorial demonstrating how to transform StepMix parameters into conventional regression coefficients and how to derive p-values. The tutorial covers models with:
- Continuous covariate;
- Binary covariate;
- Categorical covariate;
- Multiple covariates (different distributions);
- Binary distal outcome;