Source code for the implementation and evaluation of a Transformer-based Anomaly Detection on Streams of System Calls (Master’s Thesis)
The thesis aimed to develop a host-based intrusion detection system (HIDS) using a transformer-based model as an anomaly detector. The proposed system builds a model of normal behaviour by processing n-grams of system calls during training. The trained model is then used to detect anomalies when an attack occurs by measuring the deviation from the benign profile. The architecture of the transformer-based model consists of stacked decoders to allow a language modelling approach while processing the n-grams of system calls. The development and evaluation of the HIDS employed the Leipzig Intrusion Detection – Data Set (LID-DS) and framework.
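To make the n-gram and language-modelling idea concrete, here is a minimal sketch. It is not the code in this repository: the `CountingLanguageModel` class is a hypothetical, count-based stand-in for the transformer, and using the negative log-likelihood of the last call as the anomaly score is an assumption made for illustration only.

```python
from collections import Counter, defaultdict
from math import log


def ngrams(calls, n):
    """Return all length-n sliding windows over a sequence of system calls."""
    return [tuple(calls[i:i + n]) for i in range(len(calls) - n + 1)]


class CountingLanguageModel:
    """Toy stand-in for the transformer (hypothetical, illustration only).

    Estimates P(last call | preceding n-1 calls) from counts over benign data.
    """

    def __init__(self, n):
        self.n = n
        self.context_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, benign_calls):
        """Build the benign profile from a sequence of system call names."""
        self.vocab.update(benign_calls)
        for gram in ngrams(benign_calls, self.n):
            self.context_counts[gram[:-1]][gram[-1]] += 1

    def anomaly_score(self, gram):
        """Negative log-likelihood of the last call, with add-one smoothing."""
        counts = self.context_counts.get(gram[:-1], Counter())
        total = sum(counts.values())
        vocab_size = len(self.vocab) + 1  # one extra slot for unseen calls
        probability = (counts[gram[-1]] + 1) / (total + vocab_size)
        return -log(probability)


if __name__ == "__main__":
    benign = ["open", "read", "write", "close"] * 50
    model = CountingLanguageModel(n=3)
    model.fit(benign)
    # A window seen during training scores lower than one ending in an unseen call.
    print(model.anomaly_score(("open", "read", "write")))
    print(model.anomaly_score(("open", "read", "execve")))
```

A window whose last call was never observed after its context receives a higher score than a window from the benign profile; a threshold on that score would then separate normal from anomalous behaviour.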
src
├── cluster # scripts for running the IDS pipeline on an HPC cluster with Slurm
├── decision_engines # transformer and a modified AE decision engine, also contains the transformer model
├── features # several building blocks that could be used as input features for the model.
├── evaluation # script and notebooks used for creating the evaluation plots and tables
│ ├── fluctuation_analysis # utility classes and script for creating the ngram set experiments (sections 5.3 and 6.3 of my thesis)
│ │ └── cluster # scripts to run these analyses on the cluster
│ ├── js_functions # MongoDB custom js functions used to retrieve/aggregate saved results (lid_ds specifics)
│ ├── preliminary # scripts used to create the plots for the preliminary experiments (6.2)
│ └── primary # plots for the final full evaluation (6.4)
├── misc # scripts that can be used to run the AE and MLP based IDSs
├── Models # empty directory used as checkpoint for trained models
└── utils # helper functions to store and load trained models for specific epochs
Note
There are several features in the src/features directory that did not make it into my thesis due to limited project scope and changes in research direction.
- Set up your python environment (venv, ...)
- Install main dependency LID-DS from source
cd /path/for/installing/lidds
git clone git@github.com:LID-DS/LID-DS.git
cd LID-DS
git checkout 0f7760a4785f8758359227a1309be46b5d14955a # to ensure compatibility, last commit at the time of development
pip install -r requirements.txt # this is important as the setup script does not install all dependencies
pip install -e . # install LID-DS from source
- Install other dependencies
cd /path/to/this/project/LID-DS-TF
pip install -r requirements.txt
To run the IDS pipeline (2-3 min on my laptop) on the LID-DS dataset for a single example scenario with default configurations, use the following commands:
cd src
mkdir -p dataset/LID-DS-2019/
wget "https://cloud.scadsai.uni-leipzig.de/index.php/s/HLXiWssriRMt9pp/download?files=CVE-2017-7529.tar.gz" -O CVE-2017-7529.tar.gz \
&& tar -zxf CVE-2017-7529.tar.gz -C dataset/LID-DS-2019/ \
&& rm CVE-2017-7529.tar.gz
LID_DS_BASE="dataset" python ids_transformer_main.py
The default configuration can be found in the main function of ids_transformer_main.py.
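Before launching the pipeline, it can help to confirm that the scenario was extracted where the run command expects it. The sketch below is not part of this repository. It assumes the layout `<LID_DS_BASE>/LID-DS-2019/CVE-2017-7529` implied by the commands above, and that the archive unpacks into a directory named after the scenario; the helper `scenario_dir` is a hypothetical name.

```python
import os
from pathlib import Path


def scenario_dir(base, dataset="LID-DS-2019", scenario="CVE-2017-7529"):
    """Build the expected scenario path: <base>/<dataset>/<scenario> (assumed layout)."""
    return Path(base) / dataset / scenario


if __name__ == "__main__":
    # Fall back to the value used in the run command if the variable is unset.
    base = os.environ.get("LID_DS_BASE", "dataset")
    path = scenario_dir(base)
    if path.is_dir():
        print(f"found scenario at {path}")
    else:
        print(f"missing scenario directory: {path}")
```

Run it from the src directory, with LID_DS_BASE set as in the run command, so that the relative path resolves the same way.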