Dna2vec is an open-source library to train distributed representations of variable-length k-mers.
For more information, please refer to the paper: dna2vec: Consistent vector representations of variable-length k-mers
Note that this implementation has only been tested on Python 3.5.3, but we welcome any contributions or bug reporting to make it more accessible.
- Clone the
dna2vec
repository:git clone https://github.com/pnpnpn/dna2vec
- Install Python dependencies:
pip3 install -r requirements.txt
- Test the installation:
python3 ./scripts/train_dna2vec.py -c configs/small_example.yml
- Download
hg38
from https://hgdownload.cse.ucsc.edu/goldenPath/hg38/bigZips/hg38.chromFa.tar.gz. This will take a while as it's 938MB. - Untar with
tar -zxvf hg38.chromFa.tar.gz
. You should see FASTA files for chromosome 1 to 22:chr1.fa
,chr2.fa
, ...,chr22.fa
. - Move the 22 FASTA files to folder
inputs/hg38/
- Start the training with:
python3 ./scripts/train_dna2vec.py -c configs/hg38-20161219-0153.yml
- Wait for a couple of days ...
- Once the training is done, there should be a
dna2vec-<ID>.w2v
and a correspondingdna2vec-<ID>.txt
file in yourresults/
directory.
You can read pretrained dna2vec vectors pretrained/dna2vec-*.w2v
using
the class MultiKModel
in dna2vec/multi_k_model.py
. For example:
from dna2vec.multi_k_model import MultiKModel
filepath = 'pretrained/dna2vec-20161219-0153-k3to8-100d-10c-29320Mbp-sliding-Xat.w2v'
mk_model = MultiKModel(filepath)
You can fetch the vector representation of AAA
with:
>>> mk_model.vector('AAA')
array([ 0.023137 , 0.156295 , ...
Compute the cosine distance between two k-mers via dna2vec:
>>> mk_model.cosine_distance('AAA', 'GCT')
0.14546435594464155
>>> mk_model.cosine_distance('AAA', 'AAAA')
0.89000147450211231
The pre-trained data should cover all k-mers for 3 ≤ k ≤ 8
>>> [len(mk_model.model(k).vocab) for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
>>> [4**k for k in range(3,9)]
[64, 256, 1024, 4096, 16384, 65536]
I would love for you to fork and send me pull request for this project. Please contribute.
This software is licensed under the MIT license