A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework

canoalberto/imbalanced-streams

License: GPL v3

This repository provides the source code, algorithms, experimental setup, and results for the experimental review on imbalanced data streams published in the journal Machine Learning (see Citation). The manuscript preprint is available on arXiv.

This website provides interactive plots to display the metrics over time and result tables for each experiment, algorithm, and benchmark.

Experiments

The package src/main/java/experiments provides the scripts for binary and multi-class experiments. It comprises the following experiments:

| Binary class experiment | Script |
| --- | --- |
| Static imbalance ratio | binary/Static_Imbalance_Ratio |
| Increasing imbalance ratio | binary/Dynamic_Imbalance_Ratio_Increasing |
| Increasing then decreasing imbalance ratio | binary/Dynamic_Imbalance_Ratio_Increasing_Decreasing |
| Flipping imbalance ratio | binary/Dynamic_Imbalance_Ratio_Flipping |
| Flipping then reflipping imbalance ratio | binary/Dynamic_Imbalance_Ratio_Flipping_Reflipping |
| Instance-level difficulties | binary/Instance_Level_Difficulties |
| Concept drift and static imbalance ratio | binary/Concept_Drift_Static_Imbalance_Ratio |
| Concept drift and dynamic imbalance ratio | binary/Concept_Drift_Dynamic_Imbalance_Ratio_Increasing |
| Real-world imbalanced datasets | binary/Datasets |

| Multi-class experiment | Script |
| --- | --- |
| Static imbalance ratio | multiclass/Static_Imbalance_Ratio |
| Dynamic imbalance ratio | multiclass/Dynamic_Imbalance_Ratio |
| Concept drift and static imbalance ratio | multiclass/Concept_Drift_Static_Imbalance_Ratio |
| Concept drift and dynamic imbalance ratio | multiclass/Concept_Drift_Dynamic_Imbalance_Ratio |
| Real-world imbalanced datasets | multiclass/Datasets |
| Semi-synthetic imbalanced datasets | multiclass/Semisynthetic |

Algorithms

The package src/main/java/moa/classifiers contains 24 state-of-the-art algorithms for data streams, including those inherited from the MOA 2021.07 dependency in the pom.xml file.

| Algorithm | Script |
| --- | --- |
| IRL | meta.imbalanced.RebalanceStream |
| C-SMOTE | meta.imbalanced.CSMOTE |
| VFC-SMOTE | meta.imbalanced.VFCSMOTE |
| CSARF | meta.CSARF |
| GHVFDT | trees.GHVFDT |
| HDVFDT | trees.HDVFDT |
| ARF | meta.AdaptiveRandomForest |
| KUE | meta.KUE |
| LB | meta.LeveragingBag |
| OBA | meta.OzaBagAdwin |
| SRP | meta.StreamingRandomPatches |
| ESOS-ELM | ann.meta.ESOS_ELM |
| CALMID | active.CALMID |
| MICFOAL | active.MicFoal |
| ROSE | meta.imbalanced.ROSE |
| OADA | meta.imbalanced.OnlineAdaBoost |
| OADAC2 | meta.imbalanced.OnlineAdaC2 |
| ARFR | meta.imbalanced.AdaptiveRandomForestResampling |
| SMOTE-OB | meta.imbalanced.SMOTEOB |
| OSMOTE | meta.imbalanced.OnlineSMOTEBagging |
| OOB | meta.OOB |
| UOB | meta.UOB |
| ORUB | meta.imbalanced.OnlineRUSBoost |
| OUOB | meta.imbalanced.OnlineUnderOverBagging |

Evaluators

The package src/main/java/moa/evaluation contains the performance evaluators.

ImbalancedPerformanceEvaluator is used for binary class experiments and reports the G-Mean, AUC, and Kappa metrics.
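To illustrate why these metrics are preferred over accuracy on imbalanced streams, the following is a minimal sketch that computes G-Mean and Cohen's Kappa from binary confusion-matrix counts. The class `BinaryImbalanceMetrics` and its method names are hypothetical and are not part of the repository; the evaluator's actual implementation may differ (for example, in how it windows or updates the counts), and AUC is omitted here.

```java
/** Illustrative helper (hypothetical, not part of the repository). */
final class BinaryImbalanceMetrics {

    /** G-Mean: geometric mean of sensitivity (positive recall) and specificity (negative recall). */
    static double gMean(long tp, long fn, long tn, long fp) {
        double sensitivity = ratio(tp, tp + fn);
        double specificity = ratio(tn, tn + fp);
        return Math.sqrt(sensitivity * specificity);
    }

    /** Cohen's Kappa: observed agreement corrected for the agreement expected by chance. */
    static double kappa(long tp, long fn, long tn, long fp) {
        double n = tp + fn + tn + fp;
        if (n == 0) {
            return 0.0;
        }
        double observed = (tp + tn) / n;
        double expected = ((tp + fn) / n) * ((tp + fp) / n)
                        + ((tn + fp) / n) * ((tn + fn) / n);
        return expected == 1.0 ? 0.0 : (observed - expected) / (1.0 - expected);
    }

    /** Safe division that returns 0 when the denominator is 0. */
    private static double ratio(long numerator, long denominator) {
        return denominator == 0 ? 0.0 : (double) numerator / denominator;
    }

    public static void main(String[] args) {
        // Made-up counts: 990 negatives, 10 positives, and a classifier that always predicts negative.
        long tp = 0, fn = 10, tn = 990, fp = 0;
        double accuracy = (double) (tp + tn) / (tp + fn + tn + fp);
        System.out.printf("accuracy=%.3f gMean=%.3f kappa=%.3f%n",
                accuracy, gMean(tp, fn, tn, fp), kappa(tp, fn, tn, fp));
    }
}
```

In the `main` example the classifier ignores the minority class entirely, so accuracy stays high while both G-Mean and Kappa collapse, which is the behavior that motivates reporting them.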

MultiClassImbalancedPerformanceEvaluator is used for multi-class experiments and reports the G-Mean, PMAUC, and Kappa metrics. Both evaluators also report the runtime (seconds), memory consumption (RAM-Hours), and the complete confusion matrix for later analysis.
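As a rough illustration of two of the reported quantities, the sketch below computes a multi-class G-Mean from a confusion matrix and a RAM-Hours value from memory and runtime. It assumes the common definitions: G-Mean as the geometric mean of the per-class recalls, and one RAM-Hour as one gigabyte of memory held for one hour. The class `MultiClassMetrics` is hypothetical, and skipping classes that have not yet appeared is a design choice of this sketch, not necessarily what the evaluator does.

```java
/** Illustrative helper (hypothetical, not part of the repository). */
final class MultiClassMetrics {

    /**
     * Multi-class G-Mean: geometric mean of the per-class recalls.
     * The matrix is indexed as matrix[actualClass][predictedClass].
     */
    static double gMean(long[][] matrix) {
        double logSum = 0.0;
        int seenClasses = 0;
        for (int c = 0; c < matrix.length; c++) {
            long total = 0;
            for (long count : matrix[c]) {
                total += count;
            }
            if (total == 0) {
                continue; // class not observed yet: skipped (assumption of this sketch)
            }
            if (matrix[c][c] == 0) {
                return 0.0; // a single class with zero recall drives the G-Mean to zero
            }
            logSum += Math.log((double) matrix[c][c] / total);
            seenClasses++;
        }
        return seenClasses == 0 ? 0.0 : Math.exp(logSum / seenClasses);
    }

    /** RAM-Hours, assuming 1 RAM-Hour = 1 GB of memory held for 1 hour. */
    static double ramHours(long bytes, double seconds) {
        double gigabytes = bytes / (1024.0 * 1024.0 * 1024.0);
        return gigabytes * (seconds / 3600.0);
    }
}
```

Summing logarithms instead of multiplying recalls directly avoids underflow when the number of classes is large.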

Results

Complete CSV results (median of 5 seeds) for all experiments, algorithms, and benchmarks reported in the manuscript are available for download to facilitate the transparency, reproducibility, and extensibility of the experimental study.
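For readers who want to reproduce the aggregation from the per-seed results, the median over seeds can be computed as in the sketch below. The class `SeedAggregation` is hypothetical, and the G-Mean values in `main` are made up for illustration rather than taken from the published results.

```java
import java.util.Arrays;

/** Illustrative helper (hypothetical, not part of the repository). */
final class SeedAggregation {

    /** Median of the metric values obtained with different random seeds. */
    static double median(double[] values) {
        if (values.length == 0) {
            throw new IllegalArgumentException("at least one value is required");
        }
        double[] sorted = values.clone(); // do not reorder the caller's array
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Made-up G-Mean values for one algorithm on one benchmark across 5 seeds.
        double[] gMeanPerSeed = {0.81, 0.79, 0.92, 0.80, 0.85};
        System.out.println("median G-Mean = " + median(gMeanPerSeed));
    }
}
```

The median is less sensitive than the mean to a single unusually good or bad seed, which matters with only 5 runs.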

Complete CSV results are also provided for each of the 5 individual seeds.

ARFF files are available to download for binary class datasets, multi-class datasets, and semi-synthetic datasets.

Summary figures: G-Mean vs Kappa (binary class experiments), PMAUC vs Kappa (multi-class experiments), and spiral barplots for both the binary class and multi-class experiments.

How to add a new algorithm, generator, or evaluator in the framework

We use the MOA framework and its class hierarchy. Adding a new algorithm, generator, or evaluator works the same way as adding it to MOA (see the MOA documentation).

First, import the source code into your favorite IDE (Eclipse, VS Code, IntelliJ, etc.) using Git.

To add a new algorithm, e.g. MyAlgorithmName, create a new Java file at src/main/java/moa/classifiers/MyAlgorithmName.java. The class must extend the AbstractClassifier class and implement the public void trainOnInstanceImpl(Instance instance) and public double[] getVotesForInstance(Instance instance) methods.
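The shape of such a class can be sketched as follows. To keep the example compilable without MOA on the classpath, it uses a stand-in base class (`SketchClassifier`, hypothetical) and plain `double[]`/`int` arguments in place of MOA's `Instance`. In the real framework the class would extend `AbstractClassifier`, which also expects a few housekeeping methods beyond the two named above (such as `resetLearningImpl()`); consult the MOA documentation for the exact list.

```java
import java.util.Arrays;

/** Stand-in for the role played by MOA's AbstractClassifier; NOT the real MOA class. */
abstract class SketchClassifier {
    abstract void trainOnInstanceImpl(double[] features, int classLabel);
    abstract double[] getVotesForInstance(double[] features);
}

/** Toy learner that votes according to the class frequencies seen so far. */
final class MyAlgorithmName extends SketchClassifier {

    private double[] classCounts = new double[0];

    @Override
    void trainOnInstanceImpl(double[] features, int classLabel) {
        // Grow the counter array when a previously unseen class label arrives.
        if (classLabel >= classCounts.length) {
            classCounts = Arrays.copyOf(classCounts, classLabel + 1);
        }
        classCounts[classLabel] += 1.0;
    }

    @Override
    double[] getVotesForInstance(double[] features) {
        // One vote weight per class; the largest entry is the predicted class.
        return classCounts.clone();
    }
}
```

A frequency-based learner like this always favors the majority class, so it is only useful as a baseline that shows where the training and voting logic belongs.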

To add a new generator, e.g. MyGeneratorName, create a new Java file at src/main/java/moa/streams/generators/MyGeneratorName.java. The class must implement the InstanceStream interface and the public InstanceExample nextInstance() method.
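A generator's core logic can be sketched in the same spirit. The example below emits a binary stream with a fixed imbalance ratio and uses a stand-in `LabeledPoint` type (hypothetical) instead of MOA's `InstanceExample`, so it compiles without MOA. A real generator would implement the full `InstanceStream` interface, which includes further methods such as `getHeader()` and `restart()`. The Gaussian class centers and the constructor parameters are assumptions chosen for illustration.

```java
import java.util.Random;

/** Stand-in for a labeled stream instance; NOT MOA's InstanceExample. */
final class LabeledPoint {
    final double[] features;
    final int classLabel;

    LabeledPoint(double[] features, int classLabel) {
        this.features = features;
        this.classLabel = classLabel;
    }
}

/** Toy generator producing a binary stream with a static imbalance ratio. */
final class MyGeneratorName {

    private final Random random;
    private final double minorityProbability;

    /** imbalanceRatio is the number of majority instances per minority instance, e.g. 99 for 99:1. */
    MyGeneratorName(double imbalanceRatio, long seed) {
        if (imbalanceRatio < 1.0) {
            throw new IllegalArgumentException("imbalanceRatio must be >= 1");
        }
        this.minorityProbability = 1.0 / (1.0 + imbalanceRatio);
        this.random = new Random(seed);
    }

    /** The synthetic stream is unbounded. */
    boolean hasMoreInstances() {
        return true;
    }

    LabeledPoint nextInstance() {
        // Class 1 is the minority class; each class is a Gaussian blob around its own center.
        int label = random.nextDouble() < minorityProbability ? 1 : 0;
        double center = label == 1 ? 2.0 : 0.0;
        double[] features = {
            center + random.nextGaussian(),
            center + random.nextGaussian()
        };
        return new LabeledPoint(features, label);
    }
}
```

Fixing the seed makes the stream reproducible, which is what allows the same benchmark to be replayed for every algorithm.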

To add a new performance metric, you can edit an existing evaluator (e.g. src/main/java/moa/evaluation/ImbalancedPerformanceEvaluator.java) to add the metric calculation. Alternatively, you can add a new evaluator, e.g. MyEvaluatorName. To do so, create a new Java file at src/main/java/moa/evaluation/MyEvaluatorName.java. The class must implement the ClassificationPerformanceEvaluator interface, including the public void addResult(Example<Instance> exampleInstance, double[] classVotes) and public Measurement[] getPerformanceMeasurements() methods.

The next step is to compile the source code using Maven (pom.xml file). Use the command mvn package or your IDE's build options to build the JAR file target/imbalanced-streams-1.0-jar-with-dependencies.jar.

Finally, use any of the scripts provided at src/main/java/experiments for the different groups of experiments and add your algorithm, generator, or evaluator. These scripts will generate the command lines used to run the experiments.
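As a rough idea of what generating a command line can look like, the sketch below assembles a MOA task string for one algorithm. It is an assumption about the general shape, not a copy of the repository's scripts: the class `ExperimentCommandSketch`, the choice of the `EvaluateInterleavedTestThenTrain` task, the stream, the instance limit, and the output path are all illustrative, and the provided scripts may use different tasks and options.

```java
/** Illustrative helper (hypothetical, not part of the repository). */
final class ExperimentCommandSketch {

    /** Builds a single MOA command line from its parts. */
    static String buildCommand(String jarPath, String learner, String stream,
                               String evaluator, long instanceLimit, String outputCsv) {
        String task = String.join(" ",
                "EvaluateInterleavedTestThenTrain",
                "-l", "(" + learner + ")",            // learner under moa.classifiers
                "-s", "(" + stream + ")",             // stream or generator
                "-e", "(" + evaluator + ")",          // performance evaluator
                "-i", Long.toString(instanceLimit),   // number of instances to process
                "-d", outputCsv);                     // file that receives the results
        return "java -cp " + jarPath + " moa.DoTask \"" + task + "\"";
    }

    public static void main(String[] args) {
        System.out.println(buildCommand(
                "target/imbalanced-streams-1.0-jar-with-dependencies.jar",
                "meta.imbalanced.CSMOTE",
                "generators.RandomRBFGenerator",      // illustrative stream choice
                "ImbalancedPerformanceEvaluator",
                100000L,                              // illustrative instance limit
                "results/CSMOTE.csv"));               // hypothetical output path
    }
}
```

Generating the commands as strings, rather than running the tasks in-process, makes it easy to distribute the runs across machines or a job scheduler.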

Citation

@article{aguiar2024survey,
  author={Aguiar, Gabriel and Krawczyk, Bartosz and Cano, Alberto},
  title={A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework},
  journal={Machine Learning},
  volume={113},
  pages={4165-4243},
  year={2024}
}
