DNA sequence hashing #144

avilella · 2020-06-19T12:05:57Z

Feature request:

Given a DNA sequence, convert it into a number by encoding each base as two bits, as shown below:

mapping = {'A': '00', 'T': '01', 'G': '10', 'C': '11'}

Produce a value of zero if the sequence contains any non-ACGT characters.
For RNA, U could be treated as T in case there are Us.
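A minimal Python sketch of the encoding proposed above, not seqkit code. The function name `encode_dna` is hypothetical, and upper-casing the input before encoding is an assumption that the request does not state.

```python
# 2-bit code per base, taken from the mapping in the request.
MAPPING = {"A": 0b00, "T": 0b01, "G": 0b10, "C": 0b11}


def encode_dna(seq: str) -> int:
    """Pack a sequence into an integer, two bits per base.

    Returns 0 if any character is outside ACGT (U is treated as T).
    Upper-casing the input is an assumption, not part of the request.
    """
    value = 0
    for base in seq.upper().replace("U", "T"):
        code = MAPPING.get(base)
        if code is None:
            return 0  # sentinel for non-ACGT input, as requested
        value = (value << 2) | code
    return value


print(encode_dna("ACGT"))  # 0b00111001 -> 57
print(encode_dna("ACGN"))  # 0
```

One caveat with this scheme: because A maps to 00, leading As do not change the value, so "A", "AA" and the zero sentinel for invalid input all collide. The number is only unique among sequences of the same known length.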

Example:

cat file.fasta | seqkit hash -t DNA
>foo
348128344908321234
>bar
32049201394320

Maybe the output should be tabular as the output of seqkit fx2tab?

Maybe this is already somehow implemented internally in the deduplication code, not sure. It's useful (for me) when wanting to give a short(ish) numerical value that would be unique to each unique DNA sequence. Thx


shenwei356 · 2020-06-19T15:05:58Z

Yes, it's easy to implement. Just call a fast hash function on any sequence (string), but it's irreversible.
Similarly, unikmer encode can convert DNA/RNA (<=32 bp) to reversible uint64 values.
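The difference between the two approaches can be sketched in Python. This is an illustration only: MD5 stands in for "a fast hash function" as an assumption, the 2-bit codes reuse the mapping from the request rather than unikmer's actual bit layout, and the names `digest`, `encode_kmer` and `decode_kmer` are hypothetical.

```python
import hashlib

CODES = {"A": 0, "T": 1, "G": 2, "C": 3}  # mapping from the request
BASES = "ATGC"                            # index = 2-bit code


def digest(seq: str) -> str:
    """Irreversible: works for any length, cannot be decoded."""
    return hashlib.md5(seq.encode("ascii")).hexdigest()


def encode_kmer(kmer: str) -> int:
    """Reversible: 2 bits per base, so at most 32 bases fit in 64 bits."""
    if len(kmer) > 32:
        raise ValueError("more than 32 bases do not fit in a uint64")
    value = 0
    for base in kmer:
        value = (value << 2) | CODES[base]
    return value


def decode_kmer(value: int, k: int) -> str:
    """Recover the k-mer; the length k must be stored separately."""
    bases = []
    for _ in range(k):
        bases.append(BASES[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


code = encode_kmer("GATTACA")
print(decode_kmer(code, 7))  # GATTACA
print(digest("GATTACA"))     # 32 hex characters, not decodable
```

The 32 bp limit follows directly from the arithmetic: 64 bits divided by 2 bits per base.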

shenwei356 · 2020-07-07T07:20:44Z

Implemented in seqkit fx2tab -s/--seq-hash, and it's case sensitive.

For DNA/RNA transform, use seqkit seq --dna2rna/--rna2dna, and seqkit seq -l/--lower-case/-u/--upper-case for letter case.
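To illustrate why case and alphabet matter for a case-sensitive hash, here is a small Python sketch that normalizes a sequence before hashing it. It is not what seqkit does internally; MD5 and the helper name `normalized_hash` are assumptions chosen for the example.

```python
import hashlib


def normalized_hash(seq: str) -> str:
    """Upper-case the sequence and map U to T before hashing,
    so that case and DNA/RNA spelling do not change the result."""
    canonical = seq.upper().replace("U", "T")
    return hashlib.md5(canonical.encode("ascii")).hexdigest()


# Without normalization, the same sequence in different case hashes differently.
raw_upper = hashlib.md5(b"ACGT").hexdigest()
raw_lower = hashlib.md5(b"acgt").hexdigest()
print(raw_upper == raw_lower)                              # False
print(normalized_hash("acgu") == normalized_hash("ACGT"))  # True
```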

jolespin · 2022-12-21T04:27:55Z

I use this to get [id][hash]

pv query.fasta.gz | seqkit fx2tab -s -n > id_to_hash.tsv

shenwei356 closed this as completed Jul 7, 2020
