This contains several utility programs that removes the adapters and low quality bases from Illumina reads.
Author: | Haibao Tang (tanghaibao) |
---|---|
Contributor: | Tristan Lefebure |
Email: | [email protected] |
License: | BSD |
Contents
The program depends on the excellent SeqAn library.
Please download the library, and place seqan/
in the same folder.
Please note that trimReads
is no longer compatible with seqan-1.4+. For
back-ward compatibility, a copy of older seqan is now included as seqan.tgz
.
To install, run:
$ tar zxvf seqan.tgz $ make
Functionality emulates cutadapt. The adapter sequences are identified through Waterman-Eggert algorithm implemented in SeqAn. The quality trimming are a simple algorithm that takes the quality values, deduct a user specified cutoff, and then finds the max-sum segment. This method guarantees that the average base quality is higher than the user cutoff.
There are other options to cut adapters, including cutadapt and FASTX_TOOLKIT. The main advantage of this program:
- Accepts an adapter FASTA file
- Fast, robust and flexibility
- Qual/adapter trimming in one step
- Can trim both 5`- and 3`- end
Just run:
trimReads
to see a list of program options:
Illumina reads trimming utility Author: Haibao Tang <[email protected]> Usage: trimReads [options] fastqfile -h, --help displays this help message -o, --outfile Output file name. (default replace suffix with .trimmed.fastq) -f, --adapterfile FASTA formatted file containing the adapters for removal (default adapters.fasta) -s, --score Minimum score to call adapter match. Default scoring scheme for +1 match, -3 for mismatch/gapOpen/gapExtension. (default 15) -q, --quality-cutoff Trim low-quality regions below quality cutoff. The algorithm is similar to the one used by BWA by finding a max-sum segment within the quality string. Set it to 0 to skip quality trimming. (default 20) -m, --minimum-length Discard trimmed reads that are shorter than LENGTH. (default 30) -Q, --quality-encoding Read quality encoding for input file. 64 for Illumina, 33 for Sanger. (default 64) -d, --discard-adapter-reads Discard reads with adapter sequences rather than trim (default 0)
Find a list of adapters to remove (more will slow down search), default is adapters.fasta
. When ready:
trimReads test.fastq
to get a trimmed file test.trimmed.fastq. To turn off the quality trimming, just set -q
to 0
:
trimReads -q 0 test.fastq
The detected adapter stretch will have quality values of AAAAAAAAAAA...
.
This will help you verify that the sequence masked is indeed adapters. For
example:
@SNPSTER4:7:1:2:458#0/1 run=090205_SNPSTER4_0273_30GAUAAXX_PE ATTGAAGTGTTTGGGGTTCAAACACCGACAGATCGGAAGAGCGGTTCAGCAGGAAAGCCGAGACACACATCGGTATCCGCTTTTTTTTTT + aba`aaa]a`aaaaaa]a_aa\aa`aa_^AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBBBBBBBBBBBBBB
This program sorts all read pairs into three sets:
- Adapter set: the pairs with either /1 or /2 match adapters (in most cases both will match). These are fragments up to 1X read length.
- Overlap set: the pairs with /1 and /2 having dovetail overlap. These are fragments up to 2X read length.
- Clean set: survived the above two searches.
The reason for this sorting is to get rid of the short fragments (set 1 and set 2) commonly in the Illumina PE library. Some libraries are worse than others. The goal is to input the mated library within nominal insert size ranges.
Just run:
sortPairedReads
to see a list of program options:
Sort pairs of Illumina reads Author: Haibao Tang <[email protected]> Usage: sortPairedReads [options] fastqfile1 fastqfile2 -h, --help displays this help message -O, --nooverlap Turn off overlapping reads detection, and do not create .overlap.fastq file. (default 0) -f, --adapterfile FASTA formatted file containing the adapters for removal (default adapters.fasta) -s, --adapterMatchScore Minimum score to call adapter match. Default scoring scheme for +1 match, -3 for mismatch/gapOpen/gapExtension. (default 15) -t, --endMatchScore Minimum score to call dovetail match. Default scoring scheme for +1 match, -3 for mismatch/gapOpen/gapExtension. (default 20) -Q, --quality-encoding Read quality encoding for input file. 64 for Illumina, 33 for Sanger. (default 64) -v, --verbose Print alignments for debugging (default 0)
For any given two fastq files, the output contains 4 files: fastqfile1.adapters.fastq
(set 1),
fastqfile1.overlap.fastq
(set 2), fastqfile1.clean.fastq
and
fastqfile2.clean.fastq
(set 3). For genome assembler inputs, I recommend
discard set 1, treat set 2 as unmated, and treat set 3 as mated.
For example:
$ sortPairedReads s1.fastq s2.fastq [0] Illumina_PE-1 found 0 times [1] Illumina_PE-2 found 0 times [2] Illumina_PE-1rc found 54 times [3] Illumina_PE-2rc found 83 times Processed 2500 sequences took 3.33262 seconds. $ ls *.*.fastq s1.clean.fastq s2.clean.fastq s1.adapters.fastq s1.overlap.fastq
Turn -O
on if you don't like .overlap.fastq
:
$ sortPairedReads s1.fastq s2.fastq -O ... $ ls *.*.fastq s1.clean.fastq s2.clean.fastq s1.adapters.fastq