Finds Non-overlapping Approximate Matches (NAMs) between query and reference sequences using strobemers

ksahlin/namfinder


namfinder: Fast computation of shared regions between sequences

2023-05-19: Namfinder is not ready for stable use yet. In some cases the program currently has a limiting time complexity (squared in the number of hits) for genome-size comparisons. I advise against running this software until this is fixed. This repo was made public only because the uLTRA long transcriptomic aligner depends on it.

Namfinder is a sequence (DNA/RNA) mapping tool used to find Non-overlapping Approximate Matches (NAMs). The output and usage mimic those of nucmer. You can think of NAMs as Maximal Exact Matches (MEMs) that tolerate some SNVs and small indels. NAMs are constructed from overlapping strobemer seeds.

Namfinder borrows the entire index construction codebase from strobealign (a short-read mapper), but uses it only for finding NAM seeds. Credits to @marcelm, @luispedro and @psj1997 for the optimized indexing implementation. Namfinder is a more optimized version of the previous proof-of-concept tool StrobeMap, which was implemented for the strobemers paper. It was renamed to avoid confusion with strobealign.

Features

  • Multithreading support
  • Fast indexing (2-5 minutes for a human-sized reference genome)
  • Output in MUMmer MEM tsv format

Table of contents

  1. Installation
  2. Usage
  3. Command-line options
  4. Index file
  5. Changelog
  6. Contributing
  7. Performance
  8. Credits
  9. Version info
  10. License

Installation

You need to have CMake, a recent g++ (tested with version 8) and zlib installed. Then do the following:

git clone https://github.com/ksahlin/namfinder
cd namfinder
cmake -B build -DCMAKE_C_FLAGS="-march=native" -DCMAKE_CXX_FLAGS="-march=native"
make -j -C build

The resulting binary is build/namfinder.

The binary is tailored to the CPU the compiler runs on. If it needs to run on other machines, use this cmake command instead for compatibility with most x86-64 CPUs in use today:

cmake -B build -DCMAKE_C_FLAGS="-msse4.2" -DCMAKE_CXX_FLAGS="-msse4.2"

Usage

Parameter -k is the strobe size and -s is the sub-k-mer size (used for thinning with syncmers). Set -s to the same value as -k for no thinning. Parameters -l and -u are the window minimum and window maximum used when sampling the downstream strobe. Only strobemers of order 2 can currently be used.

namfinder -k 10 -s 10 -l 11 -u 35 -C 500 -o nams.tsv ref.fa reads.f[a/q]
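
To illustrate why setting -s equal to -k disables thinning, here is a minimal Python sketch of open-syncmer selection. It is not namfinder's implementation: the function name, the lexicographic ordering of s-mers (real tools typically use a hash) and the offset t = (k - s) // 2 are assumptions made for readability.

```python
def open_syncmer_positions(seq: str, k: int, s: int) -> list[int]:
    """Return start positions of k-mers selected as open syncmers.

    Hypothetical helper for illustration only. A k-mer is kept when its
    smallest s-mer (lexicographic order here, an assumption) starts at
    offset t = (k - s) // 2 inside the k-mer.
    """
    if not 0 < s <= k:
        raise ValueError("need 0 < s <= k")
    t = (k - s) // 2
    selected = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # All s-mers contained in this k-mer, in left-to-right order.
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        # Offset of the first occurrence of the smallest s-mer.
        if smers.index(min(smers)) == t:
            selected.append(i)
    return selected


seq = "ACGTTGCATGACCGTAGGCTAACGTTAGC"
# With s == k each k-mer contains exactly one s-mer, so every k-mer is kept.
all_kmers = open_syncmer_positions(seq, k=10, s=10)
# With s < k only a subset of the k-mers is kept (thinning).
thinned = open_syncmer_positions(seq, k=10, s=6)
print(len(all_kmers), len(thinned))
```

With s equal to k, the only s-mer in each k-mer sits at offset 0, which is also t, so nothing is filtered out. With a smaller s, two adjacent k-mers cannot both pass the test in this sketch, so fewer seeds are kept.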

Credits

  • Some of the ideas for the index and NAM construction in namfinder were borrowed from: Sahlin, K. Strobealign: flexible seed size enables ultra-fast and accurate read alignment. Genome Biol 23, 260 (2022). https://doi.org/10.1186/s13059-022-02831-7
  • Big improvements were designed by @marcelm and @luispedro, and implemented by @marcelm and @psj1997 (forthcoming paper).

License

MIT license, see LICENSE.
