One-Class Sampling Evaluation

Scripts and notebooks to benchmark one-class sampling strategies.

This repository contains scripts and notebooks to reproduce the experiments and analyses of the paper

Adrian Englhardt, Holger Trittenbach, Daniel Kottke, Bernhard Sick, Klemens Böhm, "Efficient SVDD sampling with approximation guarantees for the decision boundary", Machine Learning (2022), DOI: 10.1007/s10994-022-06149-0.

For more information about this research project, see also the one-class sampling project website.

The analysis and main results of the experiments can be found in the notebooks directory:

  • example_intro.ipynb: Figure 1
  • example.ipynb: Figure 4
  • eval_synthetic.ipynb: Figure 5
  • eval_dami.ipynb: Figure 6 and Table 2

To execute the notebooks, make sure to follow the Setup instructions and to download the raw results into data/output/.

Prerequisites

The experiments are implemented in Julia; some of the evaluation notebooks are written in Python. This repository contains code to set up the experiments, execute them, and analyze the results. The one-class classifiers and some other helper methods are implemented in two separate Julia packages: SVDD.jl and OneClassActiveLearning.jl. The one-class sampling strategies are implemented in OneClassSampling.jl.

Setup

First, clone the repository:

$ git clone https://github.com/englhardt/ocs-evaluation.git
  • Experiments require Julia 1.3.1; the requirements are defined in Manifest.toml. To instantiate, start julia in the ocs-evaluation directory with julia --project and run julia> ]instantiate. See the Julia documentation for general information on how to set up this project.
  • Notebooks require
    • Julia 1.3.1 (dependencies are already installed in the previous step)
    • Python 3.8 and pipenv. Run pipenv install to install all dependencies. The sketch after this list shows one possible end-to-end setup.
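
Putting these steps together, a minimal setup session could look like the following (a sketch, assuming Julia 1.3.1 and pipenv are already installed and on the PATH; the julia -e call is equivalent to running ]instantiate in the REPL):

$ git clone https://github.com/englhardt/ocs-evaluation.git
$ cd ocs-evaluation
$ julia --project -e 'using Pkg; Pkg.instantiate()'   # install Julia dependencies from Manifest.toml
$ pipenv install                                      # install Python dependencies for the notebooks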

Repo Overview

  • data
    • input
      • raw: contains the unprocessed data set collections literature and semantic, downloaded from the DAMI repository
      • dami: output directory of preprocess_data.jl
      • synthetic: output directory of generate_synthetic_data.jl
    • output: output directory of experiments; generate_experiments.jl creates the folder structure and experiments; run_experiments.jl writes results and log files
  • notebooks: jupyter notebooks to analyze experimental results
    • eval_dami.ipynb: Figure 6 and Table 2
    • eval_synthetic.ipynb: Figure 5
    • example_intro.ipynb: Figure 1
    • example.ipynb: Figure 4
  • scripts
    • config: configuration files for experiments
      • config.jl: high-level configuration for DAMI experiments, e.g., for number of workers
      • config_syn.jl: high-level configuration for synthetic data experiments, e.g., for number of workers
      • config_dami_large.jl: experiment config for large DAMI data sets
      • config_dami.jl: experiment config for small DAMI data sets
      • config_dami_baseline_gt.jl: experiment config for the ground-truth baseline
      • config_dami_baseline_prefiltering.jl: experiment config for the prefiltering baseline
      • config_dami_baseline_rand.jl: experiment config for the random sample baseline
      • config_dami_large_outperc.jl: experiment config for varying the outlier percentage on DAMI data sets
      • config_dami_outperc.jl: experiment config for varying the outlier percentage on small DAMI data sets
      • config_synthetic.jl: experiment config for synthetic data
      • config_precompute_parameters.jl: experiment config to precompute classifier hyperparameters for DAMI data
      • config_precompute_parameters_gt.jl: experiment config to precompute classifier hyperparameters for DAMI data with ground truth
      • config_precompute_parameters_syn.jl: experiment config to precompute classifier hyperparameters for synthetic data
      • config_warmup.jl: experiment config for precomputation warmup experiments
    • util/setup_workers.jl: utility script to set up multiple workers, see Infrastructure and Parallelization
    • util/evaluate.jl: utility script to evaluate the SVDD classifier on samples
    • generate_experiments.jl: generate experiments for one type of query strategy, e.g. DAMI
    • generate_synthetic_data.jl: generate synthetic data sets
    • precompute_parameters.jl: precompute classifier hyperparameters
    • precompute_parameters_gt.jl: precompute classifier hyperparameters with ground truth
    • preprocess_data.jl: preprocess DAMI data
    • run_experiments.jl: executes experiments

Reproduce Experiments

Here, we describe how to reproduce our experiments after completing the steps described in the Setup section above.

  1. Experiment execution

To manually rerun all our experiments, we provide two scripts: run.sh for the DAMI experiments and run_syn.sh for the experiments on synthetic data. Since experiment execution takes several days on modern machines, we also provide the raw results as a download. One can then skip the experiment execution and head straight to Step 2. The downloaded raw results must be extracted into data/output/, e.g., data/output/dami.

To reproduce the DAMI experiments, download semantic.tar.gz and literature.tar.gz containing the .arff files from the DAMI benchmark repository and extract them into data/input/raw/.../<data set> (e.g., data/input/raw/literature/ALOI/ or data/input/raw/semantic/Annthyroid/).
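
One possible way to unpack the data is sketched below (the download URLs are omitted here, and the archive layout may differ, so adjust the target paths so that the .arff files end up under data/input/raw/literature/ and data/input/raw/semantic/):

$ mkdir -p data/input/raw
$ tar -xzf literature.tar.gz -C data/input/raw/   # should yield e.g. data/input/raw/literature/ALOI/
$ tar -xzf semantic.tar.gz -C data/input/raw/     # should yield e.g. data/input/raw/semantic/Annthyroid/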

  2. Experiment evaluation

To analyze the results, run the Jupyter notebooks in the notebooks directory. Run the following commands to produce the figures and tables in the experiment section of the paper:

pipenv run eval
pipenv run eval_syn

Infrastructure and Parallelization

Experiment execution can be parallelized over several workers. In general, one can use any ClusterManager. In this case, the node that executes run_experiments.jl is the driver node. The driver node loads the experiments.jser and initiates a function call for each experiment on one of the workers via pmap. Edit scripts/config/config_syn.jl and scripts/config/config.jl to add remote machines and workers.
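
As an illustration only (not the actual contents of the shipped config files), adding workers with Julia's Distributed standard library typically looks like this; the host names and worker counts are placeholders:

using Distributed

addprocs(4)                                                              # local worker processes
addprocs([("user@node1", 8), ("user@node2", 8)]; exeflags="--project")   # remote workers via SSH (placeholder hosts)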

Authors

This package is developed and maintained by Adrian Englhardt.
