Find disordered proteins based on compositional similarity

qks1lver/disorderly

Disorderly.py

Compare protein sequences by their lengths and compositions

MIT License.
Requires Python 3+

To see the commands:

$ python3 disorderly.py -h

How to use it?

1. Prepare your query

Save your query sequences in a FASTA-format file.
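As a quick reminder of what a FASTA file looks like, here is a minimal sketch of a reader. It is not the parser used by disorderly.py; the function name read_fasta and the example sequences are hypothetical, made up for illustration. In FASTA, each record starts with a header line beginning with ">", followed by one or more lines of sequence.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration; not part of disorderly.py.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped; join them later.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up example records (the sequences are placeholders).
example = """>query-seq-1
MSEQNNTEMT
FQIQRIYTKD
>query-seq-2
MKKLLPT
""".splitlines()

records = list(read_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

The header text (without the ">") is what shows up as the sequence ID in the search result.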

2. Prepare your database

Your database is the set of sequences you want to compare against. It also starts as a FASTA file, but it must be converted to a .disorderdb file before it can be searched. Generate the .disorderdb file with the following command:

$ python3 disorderly.py -v -fb path/to/your_database.fasta

-v Verbose flag

-fb Database FASTA file

This will generate your_database.fasta.disorderdb in the same folder as your_database.fasta

3. Search

Each of your queries is compared only to sequences of the same length in the database. Once a same-length sequence is found, the Euclidean distance between the compositions of your query and the database sequence is computed. The output contains all the same-length sequences sorted by the Euclidean distance (low to high).
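The search logic described above can be sketched roughly as follows. This is an illustration, not the code in disorderly.py. It assumes that "composition" means the fraction of each of the 20 standard amino acids in the sequence, and it uses a plain dictionary as a stand-in for the .disorderdb file. The names composition and search and the toy sequences are hypothetical.

```python
import math
from collections import Counter

# Assumption: composition is measured over the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq):
    """Return the fraction of each standard amino acid in seq."""
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq.upper())
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]


def search(query, database):
    """Rank same-length database sequences by composition distance.

    database maps sequence ID -> sequence (a hypothetical in-memory
    stand-in for a .disorderdb file).
    """
    q = composition(query)
    hits = []
    for seq_id, seq in database.items():
        if len(seq) != len(query):
            continue  # only same-length sequences are compared
        c = composition(seq)
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(q, c)))
        hits.append((seq_id, distance))
    # Sort from most similar (smallest distance) to least similar.
    return sorted(hits, key=lambda hit: hit[1])


# Toy database; db-4 is skipped because its length differs from the query.
database = {"db-1": "AAGG", "db-2": "GAGA", "db-3": "AAAG", "db-4": "AAGGG"}
for seq_id, distance in search("AGAG", database):
    print(f"{seq_id}\t{distance:.3f}")
```

Note that a distance of 0.000 means the two sequences have identical composition, not necessarily identical residue order: in the toy data, AAGG and GAGA both sit at distance 0 from AGAG.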

This search is distributed over all the available CPUs!

$ python3 disorderly.py -v -i path/to/query.fasta -db path/to/your_database.fasta.disorderdb

-i Your query sequences in FASTA

-db The converted .disorderdb database

This will generate a .csv file named after your query, with a search suffix appended (e.g. for query.fasta, the result will be query_search-20180816190934-ABCD.csv). The result is written to the same directory as your query, and the -v verbose flag will print its location.

Alternatively, you can run everything all at once:

$ python3 disorderly.py -v -i query.fasta -fb your_database.fasta

The previous step-by-step instruction is meant to help you understand what is really going on.

Reading the result

Open the .csv file with a text editor or Excel

The format is (sequence IDs are the FASTA headers):

Queries Hits Distances
query-seq-1 database-seq-9 0.000
query-seq-1 database-seq-5 0.135
query-seq-1 database-seq-14 0.246
query-seq-2 database-seq-3 0.000
query-seq-2 database-seq-75 0.321
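If you would rather process the result in Python than in Excel, something like the following may work. It is a sketch under assumptions: it assumes the file is comma-delimited with a header row using the column names shown above (Queries, Hits, Distances), which you should check against your actual output. The function name best_hits is hypothetical, and the sample data simply mirrors the example table.

```python
import csv
import io

# Assumed layout: comma-delimited with a header row. Verify against your file.
sample = """Queries,Hits,Distances
query-seq-1,database-seq-9,0.000
query-seq-1,database-seq-5,0.135
query-seq-1,database-seq-14,0.246
query-seq-2,database-seq-3,0.000
query-seq-2,database-seq-75,0.321
"""


def best_hits(handle):
    """Return {query ID: (closest hit ID, distance)} from a result file."""
    best = {}
    for row in csv.DictReader(handle):
        query = row["Queries"]
        distance = float(row["Distances"])
        # Keep the hit with the smallest distance for each query.
        if query not in best or distance < best[query][1]:
            best[query] = (row["Hits"], distance)
    return best


result = best_hits(io.StringIO(sample))
for query, (hit, distance) in result.items():
    print(query, hit, distance)
```

For a real result file, replace io.StringIO(sample) with open("path/to/result.csv", newline="").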

How to get it? (Install)

No wheel currently :( , so just:

1. Download the .zip
2. Unpack it wherever you want
3. Find disorderly.py under src/ and run as described above

For Stanford folks

For those who run on MEMEX (or any of our servers that use SLURM):

Feel free to use the bash_run.sh file to submit jobs so that the search can run on multiple CPUs:

$ sbatch bash_run.sh -v -i query.fasta -fb your_database.fasta

NOTE: bash_run.sh must be in the same folder as disorderly.py

ALSO: bash_run.sh is currently configured to use the DPB partition and 24 cores (1 node on MEMEX). Edit the file with any editor to change this, e.g.:

#SBATCH -p dge    # To use the DGE partition
#SBATCH -c 12     # for 12 cores
