Skip to content

Latest commit

 

History

History

utils

The following README contains information on the classifier_parse.py

Developed by EJ Osborne and Z. Kronenberg

	  [email protected]; [email protected]

##############
USAGE: 
##############

For help:

    python classify_WHAM_vcf.py -h


Runs RandomForest classifier on WHAM output VCF files to classify structural
variant type. Appends WC and WP flags for user to explore structural variant
calls. The output is a VCF file written to standard out. The optional --filter
flag will aid in providing results of higher sensitivity of specificity. 
Leaving the option out returns SV calls for all of the data. 

positional arguments:
  VCF              User supplied VCF with WHAM variants; VCF needs AT field
                   data
  training_matrix  training dataset for classifier derived from simulated read
                   dataset

optional arguments:
  -h, --help       show this help message and exit
  --filter FILTER  optional arg for filtering type one of : ['sensitive',
                   'specific']; defaults to output all data if filtering if
                   argument is not supplied.

Typical usage:

  python classify_WHAM_vcf.py [VCF file] [training dataset] --filter [sensitive/specific]
  	 #In this default mode, the new VCF file with SV calls will be written
	 #standard out
  python classify_WHAM_vcf.py [VCF file] [training dataset] > [output VCF]
  	 # the standard out can be re-directed to an output VCF file with the
	 # '>' sign as above
  python classify_WHAM_vcf.py X.vcf WHAM_training_data.txt --filter sensitive > out.sensitive.vcf

#######
OUTPUT:
#######

The new output VCF file has two new FIELDS appended to the VCF file, WC and WP
which stand for:

WC - "WHAM CALL" : SV TYPE
WP - "WHAM PROBABILITY" : PROBABILITIES FROM RANDOM FOREST MODEL FOR EACH
   - IMPLEMENTED SV TYPE. CHECK DOCS FOR DETAILS ON WHICH TYPES ARE CURRENTLY
   - IMPLEMENTED. 


##############
DEPENDENCIES:
##############

Your python distirbution will need to run scikit-learn for the RandomForest 
modeling. Information can be obtained here: 

http:https://scikit-learn.org/stable/

All code was developed on Python2.7 Anaconda distribution:

http:https://continuum.io/downloads


##############
TRAINING DATA:
##############
We supply a training dataset deirved form simulated read data. The file,
with its supplied md5sum is:

     cb30db2b8dc0c6b2693a6a1595855272  WHAM_training_data.txt