A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework
This repository provides the source code, algorithms, experimental setup, and results for the experimental review on imbalanced data streams published in the journal Machine Learning. The manuscript preprint is available at arXiv.
This website provides interactive plots to display the metrics over time and result tables for each experiment, algorithm, and benchmark.
The package src/main/java/experiments provides the scripts for binary and multi-class experiments. It comprises the following experiments:
Binary class experiment | Script |
---|---|
Static imbalance ratio | binary/Static_Imbalance_Ratio |
Increasing imbalance ratio | binary/Dynamic_Imbalance_Ratio_Increasing |
Increasing then decreasing imbalance ratio | binary/Dynamic_Imbalance_Ratio_Increasing_Decreasing |
Flipping imbalance ratio | binary/Dynamic_Imbalance_Ratio_Flipping |
Flipping then reflipping imbalance ratio | binary/Dynamic_Imbalance_Ratio_Flipping_Reflipping |
Instance-level difficulties | binary/Instance_Level_Difficulties |
Concept drift and static imbalance ratio | binary/Concept_Drift_Static_Imbalance_Ratio |
Concept drift and dynamic imbalance ratio | binary/Concept_Drift_Dynamic_Imbalance_Ratio_Increasing |
Real-world imbalanced datasets | binary/Datasets |
Multi-class experiment | Script |
---|---|
Static imbalance ratio | multiclass/Static_Imbalance_Ratio |
Dynamic imbalance ratio | multiclass/Dynamic_Imbalance_Ratio |
Concept drift and static imbalance ratio | multiclass/Concept_Drift_Static_Imbalance_Ratio |
Concept drift and dynamic imbalance ratio | multiclass/Concept_Drift_Dynamic_Imbalance_Ratio |
Real-world imbalanced datasets | multiclass/Datasets |
Semi-synthetic imbalanced datasets | multiclass/Semisynthetic |
The package src/main/java/moa/classifiers contains 24 state-of-the-art algorithms for data streams, including those inherited from the MOA 2021.07 dependency declared in the pom.xml file.
Algorithm | Script |
---|---|
IRL | meta.imbalanced.RebalanceStream |
C-SMOTE | meta.imbalanced.CSMOTE |
VFC-SMOTE | meta.imbalanced.VFCSMOTE |
CSARF | meta.CSARF |
GHVFDT | trees.GHVFDT |
HDVFDT | trees.HDVFDT |
ARF | meta.AdaptiveRandomForest |
KUE | meta.KUE |
LB | meta.LeveragingBag |
OBA | meta.OzaBagAdwin |
SRP | meta.StreamingRandomPatches |
ESOS-ELM | ann.meta.ESOS_ELM |
CALMID | active.CALMID |
MICFOAL | active.MicFoal |
ROSE | meta.imbalanced.ROSE |
OADA | meta.imbalanced.OnlineAdaBoost |
OADAC2 | meta.imbalanced.OnlineAdaC2 |
ARFR | meta.imbalanced.AdaptiveRandomForestResampling |
SMOTE-OB | meta.imbalanced.SMOTEOB |
OSMOTE | meta.imbalanced.OnlineSMOTEBagging |
OOB | meta.OOB |
UOB | meta.UOB |
ORUB | meta.imbalanced.OnlineRUSBoost |
OUOB | meta.imbalanced.OnlineUnderOverBagging |
The package src/main/java/moa/evaluation contains the performance evaluators. ImbalancedPerformanceEvaluator is used for binary class experiments and reports the G-Mean, AUC, and Kappa metrics. MultiClassImbalancedPerformanceEvaluator is used for multi-class experiments and reports the G-Mean, PMAUC, and Kappa metrics. Both evaluators also report the runtime (seconds), memory consumption (RAM-Hours), and the complete confusion matrix for subsequent analysis.
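As a rough illustration of two of the reported metrics, the sketch below computes G-Mean and Kappa from a confusion matrix using their standard textbook definitions. The class name `ConfusionMatrixMetrics` and the example matrix are hypothetical; the repository's evaluators may differ in details such as windowing or the handling of unseen classes.

```java
// Illustrative only: textbook definitions, not the repository's evaluator code.
final class ConfusionMatrixMetrics {

    /** Geometric mean of per-class recalls; rows are true classes, columns are predictions. */
    static double gMean(long[][] cm) {
        double product = 1.0;
        int seen = 0;
        for (int i = 0; i < cm.length; i++) {
            long rowTotal = 0;
            for (long v : cm[i]) rowTotal += v;
            if (rowTotal == 0) continue; // class never observed: skipped here (an assumption)
            product *= (double) cm[i][i] / rowTotal;
            seen++;
        }
        return seen == 0 ? 0.0 : Math.pow(product, 1.0 / seen);
    }

    /** Cohen's Kappa: agreement beyond what the class marginals would give by chance. */
    static double kappa(long[][] cm) {
        int k = cm.length;
        long total = 0, correct = 0;
        long[] rowTotals = new long[k];
        long[] colTotals = new long[k];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < k; j++) {
                total += cm[i][j];
                rowTotals[i] += cm[i][j];
                colTotals[j] += cm[i][j];
                if (i == j) correct += cm[i][j];
            }
        }
        if (total == 0) return 0.0;
        double observed = (double) correct / total;
        double expected = 0.0;
        for (int i = 0; i < k; i++) {
            expected += ((double) rowTotals[i] / total) * ((double) colTotals[i] / total);
        }
        return expected == 1.0 ? 0.0 : (observed - expected) / (1.0 - expected);
    }

    public static void main(String[] args) {
        long[][] cm = {{90, 10}, {5, 15}}; // made-up counts: 100 majority, 20 minority instances
        System.out.printf("G-Mean=%.4f Kappa=%.4f%n", gMean(cm), kappa(cm));
    }
}
```

With these made-up counts the recalls are 0.90 and 0.75, so G-Mean is about 0.8216 and Kappa about 0.5909, while plain accuracy would be 0.875. This gap is why skew-insensitive metrics are preferred for imbalanced streams.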
Complete csv results (median over 5 seeds) for all experiments, algorithms, and benchmarks reported in the manuscript are available to download, to support the transparency, reproducibility, and extensibility of the experimental study.
Complete csv results are also provided for each of the 5 seeds:
- Complete csv results for seed 123456789
- Complete csv results for seed 234567891
- Complete csv results for seed 345678912
- Complete csv results for seed 456789123
- Complete csv results for seed 567891234
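For readers who want to recompute the aggregated values from the per-seed files, the median over seeds can be sketched as follows. The class name `SeedAggregation` and the metric values are made up for illustration; the layout of the actual csv files is not assumed here.

```java
import java.util.Arrays;

// Illustrative only: aggregates one metric over several seeds by taking the median.
final class SeedAggregation {

    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("at least one value is required");
        }
        double[] sorted = values.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Hypothetical G-Mean values, one per seed.
        double[] gMeanPerSeed = {0.81, 0.79, 0.85, 0.80, 0.83};
        System.out.println(median(gMeanPerSeed)); // prints 0.81
    }
}
```

With an odd number of seeds the median is always one of the observed runs, so a single unusually good or bad seed does not shift the reported value.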
ARFF files are available to download for binary class datasets, multi-class datasets, and semi-synthetic datasets.
We use the MOA framework and its class hierarchy. Adding a new algorithm, generator, or evaluator is the same as adding it in MOA (see MOA documentation).
First, import the source code into your favorite IDE (Eclipse, VS Code, IntelliJ, etc.) using Git.
To add a new algorithm, e.g. MyAlgorithmName, create a new Java file at src/main/java/moa/classifiers/MyAlgorithmName.java. The class must extend the AbstractClassifier class and implement the public void trainOnInstanceImpl(Instance instance) and public double[] getVotesForInstance(Instance instance) methods.
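The division of labor between the two methods can be sketched without MOA on the classpath. The class below is a hypothetical stand-in, not a MOA classifier: `train` plays the role of trainOnInstanceImpl and `votes` the role of getVotesForInstance, for a trivial learner that only tracks class frequencies. In an actual MOA classifier the class index and weight would be read from the instance (instance.classValue() and instance.weight()), and AbstractClassifier requires further methods that are omitted here.

```java
import java.util.Arrays;

// Hypothetical stand-in for the core logic of a classifier; it does not use MOA types.
final class MajorityClassCore {

    private final double[] classWeights; // accumulated weight seen per class

    MajorityClassCore(int numClasses) {
        this.classWeights = new double[numClasses];
    }

    // Counterpart of trainOnInstanceImpl(Instance): update the model with one labeled example.
    void train(int classIndex, double weight) {
        classWeights[classIndex] += weight;
    }

    // Counterpart of getVotesForInstance(Instance): one non-negative score per class.
    // The caller predicts the class with the largest score.
    double[] votes() {
        return Arrays.copyOf(classWeights, classWeights.length);
    }

    public static void main(String[] args) {
        MajorityClassCore model = new MajorityClassCore(2);
        model.train(0, 1.0);
        model.train(0, 1.0);
        model.train(1, 1.0);
        System.out.println(Arrays.toString(model.votes())); // prints [2.0, 1.0]
    }
}
```

A learner like this always predicts the majority class, which makes it a useful worst-case baseline on imbalanced streams: its accuracy looks high while its minority-class recall is zero.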
To add a new generator, e.g. MyGeneratorName, create a new Java file at src/main/java/moa/streams/generators/MyGeneratorName.java. The class must implement the InstanceStream interface and the public InstanceExample nextInstance() method.
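The idea behind such a generator can be sketched in plain Java. The class `ImbalancedStreamSketch`, its `LabeledPoint` type, and the Gaussian class centers are all hypothetical choices for illustration; `next()` stands in for nextInstance(), and the other methods a MOA stream must provide (for example hasMoreInstances() and restart()) are left out.

```java
import java.util.Random;

// Hypothetical stand-in for a stream generator; it does not use MOA types.
final class ImbalancedStreamSketch {

    static final class LabeledPoint {
        final double[] features;
        final int label; // 1 = minority class, 0 = majority class

        LabeledPoint(double[] features, int label) {
            this.features = features;
            this.label = label;
        }
    }

    private final Random random;
    private final double minorityProbability;

    ImbalancedStreamSketch(long seed, double minorityProbability) {
        if (minorityProbability < 0.0 || minorityProbability > 1.0) {
            throw new IllegalArgumentException("probability must be in [0, 1]");
        }
        this.random = new Random(seed); // fixed seed gives a reproducible stream
        this.minorityProbability = minorityProbability;
    }

    // Counterpart of nextInstance(): draw the label first, then features around a class center.
    LabeledPoint next() {
        int label = random.nextDouble() < minorityProbability ? 1 : 0;
        double center = label == 1 ? 2.0 : -2.0; // arbitrary, well-separated centers
        double[] features = {
            center + random.nextGaussian(),
            center + random.nextGaussian()
        };
        return new LabeledPoint(features, label);
    }
}
```

Drawing the label before the features keeps the class prior an explicit parameter, so a static imbalance ratio is a constant `minorityProbability` and a dynamic one could be obtained by changing that value as the stream advances.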
To add a new performance metric, you can edit an existing evaluator (e.g. src/main/java/moa/evaluation/ImbalancedPerformanceEvaluator.java) to add the metric calculation. Alternatively, you can add a new evaluator, e.g. MyEvaluatorName. To do so, create a new Java file at src/main/java/moa/evaluation/MyEvaluatorName.java. The class must implement the ClassificationPerformanceEvaluator interface, and the public void addResult(Example<Instance> exampleInstance, double[] classVotes) and public Measurement[] getPerformanceMeasurements() methods.
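The accumulate-then-report pattern behind these two methods can be sketched in plain Java. The class `RecallEvaluatorSketch` and its measurement names are hypothetical: `addResult` mirrors the MOA method of the same name but takes the true class index directly, and `measurements` returns a name-to-value map in place of a Measurement[] array.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an evaluator; it does not use MOA types.
final class RecallEvaluatorSketch {

    private final long[][] confusion; // rows: true class, columns: predicted class

    RecallEvaluatorSketch(int numClasses) {
        this.confusion = new long[numClasses][numClasses];
    }

    // Counterpart of addResult(...): predict the class with the highest vote, then count it.
    void addResult(int trueClass, double[] classVotes) {
        int predicted = 0; // empty votes fall back to class 0 (an assumption)
        int limit = Math.min(classVotes.length, confusion.length);
        for (int c = 1; c < limit; c++) {
            if (classVotes[c] > classVotes[predicted]) predicted = c;
        }
        confusion[trueClass][predicted]++;
    }

    // Counterpart of getPerformanceMeasurements(): one named value per class.
    Map<String, Double> measurements() {
        Map<String, Double> out = new LinkedHashMap<>();
        for (int i = 0; i < confusion.length; i++) {
            long rowTotal = 0;
            for (long v : confusion[i]) rowTotal += v;
            // NaN marks a class that has not appeared yet.
            double recall = rowTotal == 0 ? Double.NaN : (double) confusion[i][i] / rowTotal;
            out.put("Recall class " + i, recall);
        }
        return out;
    }
}
```

Keeping the full confusion matrix as the only state means that any additional metric can be derived in the reporting method without touching the accumulation step.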
The next step is to compile the source code using Maven (pom.xml file). Use the command mvn package or your IDE options to build the jar file target/imbalanced-streams-1.0-jar-with-dependencies.jar.
Finally, use any of the scripts provided at src/main/java/experiments for the different groups of experiments and add your algorithm, generator, or evaluator. These scripts generate the command lines used to run the experiments.
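To give an idea of what a generated command line may look like, the sketch below assembles one for MOA's moa.DoTask entry point. The class `CommandLineSketch`, the choice of the EvaluateInterleavedTestThenTrain task, the stream, the instance count, and the output file name are assumptions for illustration; the provided scripts may use a different task and options.

```java
// Illustrative only: the repository's scripts may build their commands differently.
final class CommandLineSketch {

    static String buildCommand(String jar, String learner, String stream,
                               String evaluator, long instances, String outputCsv) {
        // MOA task options: -l learner, -s stream, -e evaluator, -i instance limit, -d dump file.
        String task = "EvaluateInterleavedTestThenTrain"
                + " -l (" + learner + ")"
                + " -s (" + stream + ")"
                + " -e (" + evaluator + ")"
                + " -i " + instances
                + " -d " + outputCsv;
        return "java -cp " + jar + " moa.DoTask \"" + task + "\"";
    }

    public static void main(String[] args) {
        System.out.println(buildCommand(
                "target/imbalanced-streams-1.0-jar-with-dependencies.jar",
                "meta.imbalanced.ROSE",            // algorithm from the table above
                "generators.RandomRBFGenerator",   // placeholder stream
                "ImbalancedPerformanceEvaluator",
                100000,                             // placeholder instance count
                "results.csv"));                    // placeholder output file
    }
}
```

Generating one command per combination of algorithm, stream, and seed makes each run independent, so the commands can be executed sequentially or distributed across machines.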
@article{aguiar2024survey,
author={Aguiar, Gabriel and Krawczyk, Bartosz and Cano, Alberto},
title={A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework},
journal={Machine Learning},
volume={113},
pages={4165-4243},
year={2024}
}