DNA sequence hashing #144

Closed
avilella opened this issue Jun 19, 2020 · 3 comments

@avilella

avilella commented Jun 19, 2020

Feature request:

Given a DNA sequence, convert it into a number by hashing in base 2, as shown below:

mapping = {'A': '00', 'T': '01', 'G': '10', 'C': '11'}

Produce a value of zero if the sequence contains any non-ACGT characters.
For RNA, U could be treated as T in case there are Us.

Example:

cat file.fasta | seqkit hash -t DNA
>foo
348128344908321234
>bar
32049201394320

Maybe the output should be tabular as the output of seqkit fx2tab?

Maybe this is already somehow implemented internally in the deduplication code, not sure. It's useful (for me) when wanting to give a short(ish) numerical value that would be unique to each unique DNA sequence. Thx
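
For illustration, the 2-bit encoding proposed above could be sketched in Python roughly as follows. Only the mapping comes from the request; the function name `encode_2bit`, the upper-casing of the input, and the handling of an empty sequence are assumptions made for this sketch, not anything seqkit does.

```python
# Mapping taken from the feature request; everything else is an illustrative assumption.
MAPPING = {"A": "00", "T": "01", "G": "10", "C": "11"}


def encode_2bit(seq: str) -> int:
    """Concatenate the 2-bit code of each base and read the result as a base-2 integer.

    Returns 0 if the sequence contains any character other than A, C, G, T (or U).
    """
    bits = []
    # Assumption: input is upper-cased and U is treated as T, as suggested for RNA.
    for base in seq.upper().replace("U", "T"):
        code = MAPPING.get(base)
        if code is None:
            return 0  # non-ACGT character: the request asks for zero
        bits.append(code)
    # Assumption: an empty sequence also maps to 0.
    return int("".join(bits), 2) if bits else 0


print(encode_2bit("ACGT"))  # 00 11 10 01 -> 57
print(encode_2bit("ACGN"))  # 0
```

One caveat with this scheme: because A maps to 00, leading As do not change the value ("A" and "AA" both give 0, the same as an invalid sequence), so the number alone is only unique if the sequence length is kept alongside it.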

@shenwei356
Owner

Yes, it's easy to implement: just call a fast hash function on the sequence (string), though the result is irreversible.
Similarly, unikmer encode can convert DNA/RNA sequences (<=32 bp) to reversible uint64 values.
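
To show why a packing of at most 32 bases is reversible while a hash is not, here is a minimal Python sketch. It reuses the mapping from the request above, which is not necessarily the bit layout unikmer uses; `pack` and `unpack` are hypothetical names for this example.

```python
# Assumption: same base-to-code mapping as in the request, not unikmer's actual layout.
CODE = {"A": 0, "T": 1, "G": 2, "C": 3}
BASES = "ATGC"  # index i holds the base whose code is i


def pack(kmer: str) -> int:
    """Pack up to 32 bases into one integer that fits in 64 bits (2 bits per base)."""
    if len(kmer) > 32:
        raise ValueError("at most 32 bases fit into a uint64")
    value = 0
    for base in kmer:
        value = (value << 2) | CODE[base]  # raises KeyError on non-ACGT input
    return value


def unpack(value: int, k: int) -> str:
    """Recover the k bases from a packed value; k is needed because leading As are zeros."""
    out = []
    for _ in range(k):
        out.append(BASES[value & 3])  # lowest two bits are the last base
        value >>= 2
    return "".join(reversed(out))


packed = pack("GATTACA")
print(packed, unpack(packed, 7))
```

The round trip only works because every base keeps its own two bits and the length is known. A general hash function compresses sequences of any length into a fixed-size value, so the original sequence cannot be read back from it.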

@shenwei356
Owner

shenwei356 commented Jul 7, 2020

Implemented in seqkit fx2tab -s/--seq-hash. Note that it's case-sensitive.

For DNA/RNA transform, use seqkit seq --dna2rna/--rna2dna, and seqkit seq -l/--lower-case/-u/--upper-case for letter case.
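
As a rough illustration of what case sensitivity means here, the Python sketch below hashes the sequence string directly. MD5 from the standard library is used as a stand-in and is an assumption of this example; check the seqkit documentation for the function it actually uses. `seq_digest` and `normalize` are hypothetical helper names.

```python
import hashlib


def seq_digest(seq: str) -> str:
    """Hash the sequence exactly as written (stand-in hash: MD5, hex-encoded)."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()


def normalize(seq: str) -> str:
    """Upper-case the sequence and convert RNA to DNA so equivalent sequences hash alike."""
    return seq.upper().replace("U", "T")


# Same bases, different case: the digests differ.
print(seq_digest("ACGT") == seq_digest("acgt"))
# After normalizing case and U/T, the digests agree.
print(seq_digest(normalize("acgu")) == seq_digest("ACGT"))
```

In practice this corresponds to normalizing with the seqkit seq options mentioned above before computing the hash.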

@jolespin

I use this to get [id][hash]:

pv query.fasta.gz | seqkit fx2tab -s -n > id_to_hash.tsv
