svgsponer/SqLoss

SqLoss: Sequence Regression with Squared Error Loss in All-Subsequence Space

This repository contains all the code to reproduce the experiments in the paper:

Efficient Sequence Regression by Learning Linear Models in All-Subsequence Space
Gsponer S., Smyth B., Ifrim G. (2017) Efficient Sequence Regression by Learning Linear Models in All-Subsequence Space. In: Ceci M., Hollmén J., Todorovski L., Vens C., Džeroski S. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2017. Lecture Notes in Computer Science, vol 10535. Springer, Cham https://doi.org/10.1007/978-3-319-71246-8_3

The repository consists of four parts:

  • SqLoss: Basic algorithm for linear sequence regression with squared error loss.

  • Simulation: Generation of toy sequence regression datasets and evaluation of SqLoss on them.

  • Dream5 challenge: Code to run and evaluate SqLoss on the DREAM5 - Transcription Factor Binding Affinity Challenge.

  • MMC: Code to run and evaluate SqLoss on the Microsoft Malware Classification Challenge.
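To make the phrase "linear sequence regression with squared error loss" concrete, the following is a minimal Python sketch of the objective only. It is not the repository's C++ implementation. It assumes features are counts of contiguous subsequences up to a small, hypothetical `max_len`, and it enumerates them explicitly, which is practical only for toy inputs. All function names and the example weights are illustrative.

```python
from collections import Counter


def subsequence_features(seq, max_len=3):
    """Count every contiguous subsequence of seq with up to max_len symbols."""
    feats = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(seq) - n + 1):
            feats[seq[i:i + n]] += 1
    return feats


def predict(weights, seq, max_len=3):
    """Linear model: sum of weight * count over the subsequences present."""
    feats = subsequence_features(seq, max_len)
    return sum(weights.get(f, 0.0) * c for f, c in feats.items())


def squared_error_loss(weights, sequences, targets, max_len=3):
    """Sum of squared residuals over the training set."""
    return sum((y - predict(weights, s, max_len)) ** 2
               for s, y in zip(sequences, targets))


# Hypothetical sparse weight vector: only two subsequences are non-zero.
weights = {"AC": 2.0, "G": -0.5}
sequences = ["ACGT", "GGAC", "TTTT"]
targets = [2.0, 1.0, 1.0]

loss = squared_error_loss(weights, sequences, targets)
print(loss)
```

The weight vector is kept as a dictionary so that only subsequences with non-zero weight are stored, which fits a model in which few subsequences are expected to matter.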

SqLoss

Requirements

SqLoss has the following dependencies:

  • cmake (>=2.8)
  • C++ compiler with C++11 support

Build

  • Replace the path placeholder in SqLoss/experiments/MMC.cpp.

  • To build SqLoss, run these commands in the source (SqLoss) directory:

mkdir build
cd build
cmake ..
make

These commands compile everything needed to reproduce the experiments. In particular, they create the executables sqloss_dream, sqloss_regression and sqloss_mmc.

Experiments

Here we provide a short overview of the scripts in each folder.

Simulated sequence regression:

  • generatToySequenceReg.py: Generates sequence regression datasets.
  • othermethods/SotA_methods.py: Runs state-of-the-art methods (scikit-learn) for comparison.
  • eval.py: Extracts results and creates various plots.
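As an illustration of what a toy sequence regression dataset can look like, here is a minimal sketch. It does not reproduce generatToySequenceReg.py. The alphabet, the planted motifs and their weights, the noise level, and the function name `make_toy_dataset` are all assumptions chosen for the example.

```python
import random


def make_toy_dataset(n_samples=100, seq_len=20, motifs=None, noise=0.1, seed=0):
    """Random DNA-like sequences whose target depends on planted motifs."""
    rng = random.Random(seed)
    # Hypothetical motifs: each one present in a sequence adds its weight.
    motifs = motifs or {"ACG": 1.0, "TT": -0.5}
    data = []
    for _ in range(n_samples):
        seq = "".join(rng.choice("ACGT") for _ in range(seq_len))
        signal = sum(w for m, w in motifs.items() if m in seq)
        target = signal + rng.gauss(0.0, noise)  # additive Gaussian noise
        data.append((seq, target))
    return data


dataset = make_toy_dataset(n_samples=5, seq_len=12)
for seq, y in dataset:
    print(f"{y:.3f}\t{seq}")
```

Because the target is a linear function of motif presence plus noise, a linear model over subsequence features can in principle recover the planted weights, which makes such data convenient for sanity checks.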

Dream 5 - TF Challenge:

  • prepare_dream5_data.sh: Performs the log2 transformation and cuts sequences as described in the paper.
  • run_all_dream.sh: Runs SqLoss for all 66 TFs (sqloss_dream must be in $PATH).
  • createEvalFile.sh: Creates the submission file in the format expected by DREAMtools.
  • compareDreamtool.py: Evaluates results with DREAMtools (external dependency).

Data is available at https://dreamchallenges.org/project/dream-5-tf-dna-motif-recognition-challenge/.
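The preparation step can be pictured with the short sketch below. It is not the content of prepare_dream5_data.sh. The `keep_length` parameter, the choice to keep the leading part of each sequence, and the function name `prepare_record` are assumptions; the actual cut positions are those described in the paper.

```python
import math


def prepare_record(sequence, intensity, keep_length):
    """Return (trimmed sequence, log2-transformed intensity)."""
    if intensity <= 0:
        raise ValueError("log2 is undefined for non-positive intensities")
    # Assumption: keep the first keep_length symbols of the sequence.
    return sequence[:keep_length], math.log2(intensity)


# Made-up example records: (probe sequence, raw intensity).
records = [("ACGTACGTAC", 8.0), ("TTGACCAGTA", 1024.0)]
prepared = [prepare_record(s, v, keep_length=6) for s, v in records]
for seq, value in prepared:
    print(seq, value)
```

Taking the logarithm compresses the wide dynamic range of raw intensities, so a squared error loss is less dominated by a few very bright probes.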

Microsoft Malware Classification:

  • create_kfolds.py: Creates stratified k-folds (depends on sklearn) based on the label file provided by Kaggle.
  • createTrainfile.py: Creates input file for SqLoss from individual malware files and k-fold label files.
  • eval.py: Evaluates obtained results.

Data and further information are available at https://www.kaggle.com/c/malware-classification.
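Stratified k-folds keep the class proportions roughly equal in every fold. The sketch below shows the idea with the Python standard library only. It is a stand-in for illustration and not the code in create_kfolds.py, which relies on scikit-learn (where `sklearn.model_selection.StratifiedKFold` serves this purpose). The function name `stratified_kfold` and the example labels are hypothetical.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, k=5, seed=0):
    """Assign each sample index to one of k folds, balancing labels per fold."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the indices of this class round-robin across the folds.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


# Made-up class labels: 10 samples of class 1, 5 each of classes 2 and 3.
labels = [1] * 10 + [2] * 5 + [3] * 5
folds = stratified_kfold(labels, k=5)
for i, fold in enumerate(folds):
    print(i, sorted(fold))
```

Each fold can then serve once as the held-out set while the remaining folds form the training set.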