Illumina reads adapter screening utilities

This contains several utility programs that removes the adapters and low quality bases from Illumina reads.

Author:	Haibao Tang (tanghaibao)
Contributor:	Tristan Lefebure
Email:	[email protected]
License:	BSD

Contents

Installation
trimReads
sortPairedReads

Installation

The program depends on the excellent SeqAn library. Please download the library, and place seqan/ in the same folder.

Please note that trimReads is no longer compatible with seqan-1.4+. For back-ward compatibility, a copy of older seqan is now included as seqan.tgz.

To install, run:

$ tar zxvf seqan.tgz
$ make

trimReads

Functionality emulates cutadapt. The adapter sequences are identified through Waterman-Eggert algorithm implemented in SeqAn. The quality trimming are a simple algorithm that takes the quality values, deduct a user specified cutoff, and then finds the max-sum segment. This method guarantees that the average base quality is higher than the user cutoff.

There are other options to cut adapters, including cutadapt and FASTX_TOOLKIT. The main advantage of this program:

Accepts an adapter FASTA file
Fast, robust and flexibility
Qual/adapter trimming in one step
Can trim both 5`- and 3`- end

Just run:

trimReads

to see a list of program options:

Illumina reads trimming utility
Author: Haibao Tang <[email protected]>

Usage: trimReads [options] fastqfile

  -h, --help                   displays this help message
  -o, --outfile                Output file name. (default replace suffix with .trimmed.fastq)
  -f, --adapterfile            FASTA formatted file containing the adapters for removal  (default adapters.fasta)
  -s, --score                  Minimum score to call adapter match. Default scoring scheme for +1 match, -3 for mismatch/gapOpen/gapExtension. (default 15)
  -q, --quality-cutoff         Trim low-quality regions below quality cutoff. The algorithm is similar to the one used by BWA by finding a max-sum segment within the quality string. Set it to 0 to skip quality trimming.  (default 20)
  -m, --minimum-length         Discard trimmed reads that are shorter than LENGTH. (default 30)
  -Q, --quality-encoding       Read quality encoding for input file. 64 for Illumina, 33 for Sanger.  (default 64)
  -d, --discard-adapter-reads  Discard reads with adapter sequences rather than trim (default 0)

Find a list of adapters to remove (more will slow down search), default is adapters.fasta. When ready:

trimReads test.fastq

to get a trimmed file test.trimmed.fastq. To turn off the quality trimming, just set -q to 0:

trimReads -q 0 test.fastq

The detected adapter stretch will have quality values of AAAAAAAAAAA.... This will help you verify that the sequence masked is indeed adapters. For example:

@SNPSTER4:7:1:2:458#0/1 run=090205_SNPSTER4_0273_30GAUAAXX_PE
ATTGAAGTGTTTGGGGTTCAAACACCGACAGATCGGAAGAGCGGTTCAGCAGGAAAGCCGAGACACACATCGGTATCCGCTTTTTTTTTT
+
aba`aaa]a`aaaaaa]a_aa\aa`aa_^AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBB

sortPairedReads

This program sorts all read pairs into three sets:

Adapter set: the pairs with either /1 or /2 match adapters (in most cases both will match). These are fragments up to 1X read length.
Overlap set: the pairs with /1 and /2 having dovetail overlap. These are fragments up to 2X read length.
Clean set: survived the above two searches.

The reason for this sorting is to get rid of the short fragments (set 1 and set 2) commonly in the Illumina PE library. Some libraries are worse than others. The goal is to input the mated library within nominal insert size ranges.

Just run:

sortPairedReads

to see a list of program options:

Sort pairs of Illumina reads
Author: Haibao Tang <[email protected]>

Usage: sortPairedReads [options] fastqfile1 fastqfile2

  -h, --help               displays this help message
  -O, --nooverlap          Turn off overlapping reads detection, and do not create .overlap.fastq file. (default 0)
  -f, --adapterfile        FASTA formatted file containing the adapters for removal  (default adapters.fasta)
  -s, --adapterMatchScore  Minimum score to call adapter match. Default scoring scheme for +1 match, -3 for mismatch/gapOpen/gapExtension. (default 15)
  -t, --endMatchScore      Minimum score to call dovetail match. Default scoring scheme for +1 match, -3 for mismatch/gapOpen/gapExtension. (default 20)
  -Q, --quality-encoding   Read quality encoding for input file. 64 for Illumina, 33 for Sanger.  (default 64)
  -v, --verbose            Print alignments for debugging  (default 0)

For any given two fastq files, the output contains 4 files: fastqfile1.adapters.fastq (set 1), fastqfile1.overlap.fastq (set 2), fastqfile1.clean.fastq and fastqfile2.clean.fastq (set 3). For genome assembler inputs, I recommend discard set 1, treat set 2 as unmated, and treat set 3 as mated.

For example:

$ sortPairedReads s1.fastq s2.fastq
[0] Illumina_PE-1 found 0 times
[1] Illumina_PE-2 found 0 times
[2] Illumina_PE-1rc found 54 times
[3] Illumina_PE-2rc found 83 times

Processed 2500 sequences took 3.33262 seconds.

$ ls *.*.fastq
s1.clean.fastq  s2.clean.fastq  s1.adapters.fastq  s1.overlap.fastq

Turn -O on if you don't like .overlap.fastq:

$ sortPairedReads s1.fastq s2.fastq -O
...

$ ls *.*.fastq
s1.clean.fastq  s2.clean.fastq  s1.adapters.fastq

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
seqan-1.4-branch		seqan-1.4-branch
README.rst		README.rst
adapters.fasta		adapters.fasta
makefile		makefile
seqan.tgz		seqan.tgz
sortPairedReads.cc		sortPairedReads.cc
trimReads.cc		trimReads.cc

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Illumina reads adapter screening utilities

Installation

trimReads

sortPairedReads

About

Releases

Packages

Languages

tanghaibao/trimReads

Folders and files

Latest commit

History

Repository files navigation

Illumina reads adapter screening utilities

Installation

trimReads

sortPairedReads

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages