GitHub - mmalejky/proteins: Set of tools for protein sequence enrichment and protein domain finding

Set of command-line tools operating on FASTA files suitable for:

extend.py - protein sequence enrichment
scan-pfam.py - protein domain finding
fisher-test.py - measuring of protein enrichment significance

Pipfile and Pipfile.lock (for Pipenv) with all required libraries is included.
In addition, test script test.sh, sample input data example.fasta as well as all generated output files for this input are included.

The APIs used here are:

NCBI BLAST API - for finding corresponding extended protein sequences
HMMER hmmscan API - for finding PFAM protein domains

Usage scenario

Suppose that we have a given set of protein subsequences (e.g. from protein microarray) that are known to be derived from a larger set of protein sequences. And suppose that we are interested in what this larger set looks like, in particular what protein domains are present in it.

One possible solution to this problem is extending these subsequences to sequences that are known and very similar to them. Then a search for protein domains in these longer sequences need to be done. In the end, we can calculate significance of this set of domains in comparison to domains from input subsequences, using Fisher's exact test. It will be measured as a deviation from a null hypothesis, that in both sets of sequences, all types of domains should occur equally often.

The list of steps needed to obtain such a result is as follows:

You have an input FASTA file example.fasta with incomplete protein sequences.
Pass it to extend.py -i example.fasta to get extended protein sequences example-out.fasta from BLAST nr Database.
Pass the result file example-out.fasta to scan-pfam.py example-out.fasta example-out.csv.
Pass the input file example.fasta to scan-pfam.py example.fasta example.csv.
Pass both example.csv and example-out.csv to fisher-test.py.
At the end, we get probability of this domain distribution.

This whole set of instructions can be executed from test.sh example.fasta.

extend.py

Usage: extend.py -i <input_file> --evalue <evalue> --minident <minimal_percent_of_identity>

Search in NCBI nr (Non-redundant protein sequences) database for protein sequences that match with at least one sequence from input_file. The resulting file does not contain duplicates. The match criterion is determined by parameters:

minident - Minimal percent of amino acids pairs identity with respect to alignment length. Defaults to 90%.
evalue - Maximal expected number of chance matches in a random model. Defaults to 10E-9.

scan-pfam.py

Usage: scan_pfam.py <input_file> <output_file>

Generates .csv file of all domains found in input file according to PFAM database of protein domains.
The .csv output file has the following structure:

Protein sequence ids in table rows
Protein domain ids in table columns
At the intersection of row and column:
- 1 if this domain was found in this sequence
- 0 otherwise

fisher-test.py

Usage: fisher-test.py <file1> <file2>

Calculate Fisher exact test for protein domain distributions from .csv files: file1 and file2.
This is done using fisher_exact method from scipy.stats library, as well as own implementation.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Usage scenario

extend.py

scan-pfam.py

fisher-test.py

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
Pipfile		Pipfile
Pipfile.lock		Pipfile.lock
README.md		README.md
example-out.csv		example-out.csv
example-out.fasta		example-out.fasta
example.csv		example.csv
example.fasta		example.fasta
extend.py		extend.py
fisher-test.py		fisher-test.py
scan-pfam.py		scan-pfam.py
test.sh		test.sh

mmalejky/proteins

Folders and files

Latest commit

History

Repository files navigation

Usage scenario

extend.py

scan-pfam.py

fisher-test.py

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages