Alpha

Nir Yona -- Alpha bioinformatics project using Python @ https://www.cs.tau.ac.il/~tamirtul

Full project paper SOON!

🪁 Tools

  • Python (pandas, sklearn, matplotlib/plotly, numpy, pickle, etc.)
  • Jupyter Notebook & Google Colab
  • ViennaRNA, Chimera UGEM

🎯 Goal

This project focused on creating features and training a model that predicts the re-initiation chances for E. coli operons.

Methods

Data

The data came from a library of ~13,000 synthetic variants created in this paper. The project focused on the randomized part of the data, as shown in this image:

Generating Features

To best predict the re-initiation chances, we created textual features that describe the sequence and its structure (such as occurrences of different nucleotides in certain positions and reading frames) and biological features known to be related to protein expression, and therefore to re-initiation chances (such as CAI, tAI, folding levels and more). Once the features are created, a Multiple Regression model is trained on them. Finally, we test the features on E. coli and use p-values to find the meaningful ones.
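As a rough illustration of the textual features described above, the sketch below counts overlapping dinucleotides and counts each nucleotide by codon position in a chosen reading frame. The function names and the exact feature definitions are assumptions made for this example, not the project's actual code; biological features such as CAI, tAI or folding energy are not covered here.

```python
from collections import Counter
from itertools import product

NUCLEOTIDES = "ACGT"


def dinucleotide_counts(seq):
    """Count overlapping dinucleotides (AA, AC, ..., TT) in a DNA sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    # Always return all 16 keys so every variant yields the same feature columns.
    return {a + b: counts.get(a + b, 0) for a, b in product(NUCLEOTIDES, repeat=2)}


def frame_nucleotide_counts(seq, frame=0):
    """Count each nucleotide at codon positions 1-3, reading from offset `frame`."""
    seq = seq.upper()[frame:]
    features = {}
    for pos in range(3):
        column = Counter(seq[pos::3])  # every third base, starting at this codon position
        for nuc in NUCLEOTIDES:
            features[f"{nuc}_pos{pos + 1}"] = column.get(nuc, 0)
    return features


def sequence_features(seq):
    """Merge both feature groups into one flat dict (one row of a feature table)."""
    features = dinucleotide_counts(seq)
    features.update(frame_nucleotide_counts(seq, frame=0))
    return features


example = sequence_features("ATGTTT")
```

Each variant then becomes one row of a feature table (for instance a pandas DataFrame built from a list of such dicts), which can be passed to the regression model.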

Training

We trained a Multiple Regression model on the library variants and their re-initiation chances. The model was trained on 60% of the data, tested on 20% and validated on the remaining 20%. After training and testing, we optimized the feature set to identify the features that affect the re-initiation chances the most. We used cross-validation to make sure we did not overfit the data.
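The 60/20/20 split and the regression fit can be sketched as follows. This is a minimal illustration on synthetic data, assuming ordinary least squares via NumPy as a stand-in for whatever regression class the notebooks actually use (for example `sklearn.linear_model.LinearRegression`); the helper names, the seed values and the 5-fold setting are all assumptions.

```python
import numpy as np


def split_60_20_20(n_samples, seed=0):
    """Return shuffled index arrays for the train (60%), test (20%) and validation (20%) sets."""
    order = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(0.6 * n_samples)
    n_test = int(0.2 * n_samples)
    return order[:n_train], order[n_train:n_train + n_test], order[n_train + n_test:]


def fit_linear(X, y):
    """Ordinary least squares with an intercept; the intercept is coef[0]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    return coef[0] + X @ coef[1:]


def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def cross_val_r2(X, y, k=5, seed=0):
    """k-fold cross-validation: fit on k-1 folds, score R^2 on the held-out fold."""
    folds = np.array_split(np.random.default_rng(seed).permutation(len(X)), k)
    scores = []
    for i in range(k):
        fit_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coef = fit_linear(X[fit_idx], y[fit_idx])
        scores.append(r_squared(y[folds[i]], predict(coef, X[folds[i]])))
    return scores


# Synthetic stand-in for the feature table: 500 variants, 3 features.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 2.0 + X @ np.array([1.5, -0.5, 0.0]) + rng.normal(scale=0.1, size=500)

train, test, val = split_60_20_20(len(X))
coef = fit_linear(X[train], y[train])
test_r2 = r_squared(y[test], predict(coef, X[test]))
val_r2 = r_squared(y[val], predict(coef, X[val]))
cv_scores = cross_val_r2(X[train], y[train], k=5)
```

Comparing the cross-validated scores on the training portion with the test and validation scores is one simple way to spot overfitting: a large gap suggests the model has memorized the training variants.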

Analyzing the Features on E. coli Operons

To make sure the chosen features are really significant, we tested them on endogenous E. coli operons. We created permutations of the operons and then calculated a p-value for each feature against the library operons.
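One common way to turn permutations into a p-value is an empirical, one-sided permutation test, sketched below. How the project actually permuted the operons and which statistic it compared is not stated here, so the shuffling scheme (shuffling single nucleotides), the TT-count feature and the function names are assumptions chosen for illustration.

```python
import random


def count_tt(seq):
    """Number of overlapping TT dinucleotides in the sequence."""
    return sum(1 for i in range(len(seq) - 1) if seq[i:i + 2] == "TT")


def permutation_p_value(seq, feature, n_permutations=1000, seed=0):
    """Empirical one-sided p-value: how often does a shuffled sequence
    score at least as high as the real one?"""
    rng = random.Random(seed)
    observed = feature(seq)
    letters = list(seq)
    at_least_as_high = 0
    for _ in range(n_permutations):
        rng.shuffle(letters)  # keeps nucleotide composition, destroys the order
        if feature("".join(letters)) >= observed:
            at_least_as_high += 1
    # The +1 correction keeps the estimate from ever being exactly zero.
    return (at_least_as_high + 1) / (n_permutations + 1)


clustered = "T" * 10 + "ACG" * 10   # all T's adjacent: unusually many TT pairs
no_tt = "ACGT" * 5                  # no TT pairs at all

p_clustered = permutation_p_value(clustered, count_tt)
p_no_tt = permutation_p_value(no_tt, count_tt)
```

A feature with a small p-value takes a value in the real sequence that random rearrangements of the same nucleotides rarely reach, which is the sense in which it is treated as significant.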

🎯 Results

Multiple Regression Model

The model was trained repeatedly, adding more features on every run, and the prediction results were as follows:

Features Analysis

A feature was considered significant only if its p-value was below 0.05.



The results indicate that, of all the features calculated, the number of occurrences of the TT dinucleotide has the strongest linear effect on the re-initiation chances.

The results are backed up by other papers showing similar relationships (such as Yang et al., 2019 or Nie, Wu, & Zhang, 2006).

🐉 Run

Create a virtual environment and install the requirements from requirements.txt, then run the desired notebook with jupyter notebook or the script with python.

py -m pip install -r requirements.txt
jupyter notebook