Extract SNP positions from BLAST XML and SAM formats

mojaveazure/SNP_Utils

SNP_Utils

SNP_Utils is a Python program that creates a VCF file for a list of SNPs given in an Illumina lookup table. It has three subroutines: CONFIG, BLAST, and SAM.

To get basic usage information, run the program without any arguments:

$ ./snp_utils.py
usage: snp_utils.py [-h] {CONFIG,BLAST,SAM} ...

optional arguments:
  -h, --help          show this help message and exit

Subroutine:
  Choose a subroutine

  {CONFIG,BLAST,SAM}  'BLAST' means we are running a BLAST search to find
                      SNPs, 'SAM' means we are using a SAM file, 'CONFIG' is
                      used to configure a BLAST search

More detailed information for each subroutine can be found by passing the subroutine name with the -h | --help flag, or by reading below.

Illumina Lookup Table

The Illumina lookup table is a two-column, headerless, tab-delimited table. The first column is the SNP ID; the second is the contextual sequence, with the SNP in brackets ([A/B]) showing the two states for the SNP (A and B).

Example

SNP_1   ACGTCACGATCGA[A/G]ACGTATGCGAAGTTCGCC
SNP_2   GCTAGACTACCAG[G/T]GTCACGATGCCGTCAGTC
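The lookup format described above can be read with a few lines of Python. This is a minimal sketch, not SNP_Utils' own parser: the function name `parse_lookup_line` and the returned dictionary keys are hypothetical, and it assumes single-base states and tab-delimited columns as described.

```python
import re

# Left flank, [A/B] states, right flank. Single-base states are an assumption.
LOOKUP_RE = re.compile(r"^([ACGTN]*)\[([ACGTN])/([ACGTN])\]([ACGTN]*)$", re.IGNORECASE)


def parse_lookup_line(line):
    """Split one lookup-table line into SNP ID, flanks, and the two states."""
    snp_id, sequence = line.rstrip("\n").split("\t")
    match = LOOKUP_RE.match(sequence.strip())
    if match is None:
        raise ValueError("Not a valid lookup sequence for %s: %s" % (snp_id, sequence))
    left, state_a, state_b, right = match.groups()
    return {
        "id": snp_id,
        "left": left,
        "alleles": (state_a, state_b),
        "right": right,
        # 0-based offset of the SNP within the contextual sequence
        "offset": len(left),
    }


if __name__ == "__main__":
    print(parse_lookup_line("SNP_1\tACGTCACGATCGA[A/G]ACGTATGCGAAGTTCGCC"))
```

The offset of the SNP within the sequence is the length of the left flank, which is what later steps need to turn an alignment start into a SNP position.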

CONFIG

The CONFIG subroutine is used only when running BLAST within SNP_Utils. This step is required before running BLAST, but not required before parsing a BLAST XML file. Options for CONFIG are as follows:

  • Choosing whether the reference is in FASTA or nucleotide BLAST database format and specifying the path to these files
  • E-value threshold
  • Maximum number of hits and HSPs
  • Percent identity to keep
  • Whether or not to keep the FASTA file generated from the Illumina lookup table

The configuration file is written in INI format.
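Because the configuration is plain INI, it can be inspected with Python's standard `configparser` module. The sketch below is illustrative only: the section name `blast` and every key name are hypothetical stand-ins for the options listed above, not the names SNP_Utils actually writes.

```python
import configparser

# Hypothetical INI contents: section and key names are assumptions for illustration.
EXAMPLE_CONFIG = """
[blast]
evalue = 1e-10
max_hits = 3
max_hsps = 3
identity = 99
keep_query = false
"""


def load_blast_config(text):
    """Read the (hypothetical) BLAST options from INI text into typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["blast"]
    return {
        "evalue": section.getfloat("evalue"),
        "max_hits": section.getint("max_hits"),
        "max_hsps": section.getint("max_hsps"),
        "identity": section.getfloat("identity"),
        "keep_query": section.getboolean("keep_query"),
    }


if __name__ == "__main__":
    print(load_blast_config(EXAMPLE_CONFIG))
```

To see the real option names, open the file that CONFIG generates.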

BLAST

The BLAST subroutine runs BLAST and parses the results to create a VCF file of SNPs from an Illumina lookup table. To run BLASTn within SNP_Utils, you must first configure it using the CONFIG subroutine; to parse a previously generated BLAST XML file, no configuration is needed. BLAST can also rank SNPs: high-ranking SNPs are those that had a low e-value and a high bit-score in BLAST. When ranking, only the highest-ranked hit for each SNP is kept; if hits tie for highest, all tied hits are kept. Options for BLAST are as follows:

  • Choosing either a BLAST config or XML file as input
  • Setting the basename for the output
  • Choosing whether or not we rank SNPs
  • Setting filtering schemes to eliminate duplicate SNPs

SAM

The SAM subroutine is used to parse a SAM file designed around the Illumina lookup table. This SAM file should be generated by read-mapping the Illumina lookup sequences to a reference genome. Options for SAM are as follows:

  • Choosing the SAM file and reference genome in FASTA format
  • Setting the basename for the output
  • Setting filtering schemes to eliminate duplicate SNPs
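To illustrate how a SNP position can be recovered from a SAM record, the sketch below walks the CIGAR string to map the SNP's offset in the lookup sequence onto the reference. It is a simplified illustration, not SNP_Utils' code: the function name `query_to_reference` is hypothetical, and it assumes a forward-strand alignment in which the bracketed SNP was replaced by a single base before mapping.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def query_to_reference(pos, cigar, query_offset):
    """Map a 0-based offset in the read to a 1-based reference position.

    pos is the SAM POS field (1-based leftmost aligned base). Returns None if
    the offset falls in an insertion or soft clip, or lies beyond the read.
    """
    ref = pos
    query = 0
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op in "M=X":  # consumes both read and reference
            if query <= query_offset < query + length:
                return ref + (query_offset - query)
            query += length
            ref += length
        elif op in "IS":  # consumes read only
            if query <= query_offset < query + length:
                return None
            query += length
        elif op in "DN":  # consumes reference only
            ref += length
        # H and P consume neither read nor reference
    return None


if __name__ == "__main__":
    # A SNP 13 bases into a read aligned at position 100
    print(query_to_reference(100, "32M", 13))
    print(query_to_reference(100, "10M2D22M", 13))
```

For a reverse-strand alignment the sequence in the SAM record is reverse-complemented, so the offset would have to be counted from the other end of the read; that case is left out here.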

Filtering Final VCF

SNP_Utils can filter found SNPs in an attempt to remove duplicate positions. Filtering works with both BLAST and SAM subroutines. The following filtering schemes are available:

  • Use a genetic map to keep only SNPs on the chromosome specified by the map. Use -b | --by-chrom and -m | --genetic-map to filter by chromosome
  • Set a minimum distance threshold in base pairs and keep the leftmost SNP when two are closer than this threshold. Use -t | --threshold to specify this distance in base pairs
  • Use genetic map distances to choose the SNP closest to this position proportionally. This option ignores SNPs on other chromosomes. Please note that with the BLAST subroutine, this filtering scheme requires either that a reference sequence be used instead of a BLAST database, or that the BLAST database was generated with the -parse_seqids flag. Use -d | --by-distance and -m | --genetic-map to filter by genetic map distance
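The minimum-distance scheme can be sketched as below. This is an assumption-laden illustration rather than the program's own logic: the function name `filter_by_threshold` is hypothetical, positions are compared only within the same chromosome, and each candidate is compared against the last position that was kept.

```python
def filter_by_threshold(positions, threshold):
    """Drop positions closer than `threshold` bp to the previously kept one.

    positions: (chromosome, position) tuples found for a single SNP ID.
    The leftmost position of each close pair is the one that survives.
    """
    kept = []
    for chrom, pos in sorted(positions):
        if kept and kept[-1][0] == chrom and pos - kept[-1][1] < threshold:
            continue  # too close to the kept position on its left, so skip it
        kept.append((chrom, pos))
    return kept


if __name__ == "__main__":
    found = [("chr1", 150), ("chr1", 100), ("chr1", 5000), ("chr2", 120)]
    print(filter_by_threshold(found, 1000))
```

Sorting first guarantees that the position kept from any close pair is the leftmost one.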

Outputs

SNP_Utils creates between one and six output files:

File name           Contents
config.ini          BLAST configuration file, generated by CONFIG
fasta_database.xml  XML results from running BLAST with SNP_Utils, generated by BLAST
lookup.fasta        Lookup table in FASTA format for running BLAST with SNP_Utils, generated by BLAST and CONFIG with -k
output.vcf          VCF file with confident SNP positions, generated by BLAST and SAM
output_masked.vcf   VCF file with masked ('N') alternate states; these are SNPs for which no position could be found given the Illumina lookup table, generated by BLAST and SAM if there were masked SNPs
output_failed.log   List of SNP IDs that could not be found at all, generated by BLAST and SAM if there were failures

Dependencies

SNP_Utils depends on the following:

The last four are all available on PyPI and can be installed using pip3 (included with Python 3.4 or greater).
