utils
Folders and files
Name | Name | Last commit date | ||
---|---|---|---|---|
parent directory.. | ||||
The following README contains information on the classifier_parse.py Developed by EJ Osborne and Z. Kronenberg [email protected]; [email protected] ############## USAGE: ############## For help: python classify_WHAM_vcf.py -h Runs RandomForest classifier on WHAM output VCF files to classify structural variant type. Appends WC and WP flags for user to explore structural variant calls. The output is a VCF file written to standard out. The optional --filter flag will aid in providing results of higher sensitivity of specificity. Leaving the option out returns SV calls for all of the data. positional arguments: VCF User supplied VCF with WHAM variants; VCF needs AT field data training_matrix training dataset for classifier derived from simulated read dataset optional arguments: -h, --help show this help message and exit --filter FILTER optional arg for filtering type one of : ['sensitive', 'specific']; defaults to output all data if filtering if argument is not supplied. Typical usage: python classify_WHAM_vcf.py [VCF file] [training dataset] --filter [sensitive/specific] #In this default mode, the new VCF file with SV calls will be written #standard out python classify_WHAM_vcf.py [VCF file] [training dataset] > [output VCF] # the standard out can be re-directed to an output VCF file with the # '>' sign as above python classify_WHAM_vcf.py X.vcf WHAM_training_data.txt --filter sensitive > out.sensitive.vcf ####### OUTPUT: ####### The new output VCF file has two new FIELDS appended to the VCF file, WC and WP which stand for: WC - "WHAM CALL" : SV TYPE WP - "WHAM PROBABILITY" : PROBABILITIES FROM RANDOM FOREST MODEL FOR EACH - IMPLEMENTED SV TYPE. CHECK DOCS FOR DETAILS ON WHICH TYPES ARE CURRENTLY - IMPLEMENTED. ############## DEPENDENCIES: ############## Your python distirbution will need to run scikit-learn for the RandomForest modeling. Information can be obtained here: http:https://scikit-learn.org/stable/ All code was developed on Python2.7 Anaconda distribution: http:https://continuum.io/downloads ############## TRAINING DATA: ############## We supply a training dataset deirved form simulated read data. The file, with its supplied md5sum is: cb30db2b8dc0c6b2693a6a1595855272 WHAM_training_data.txt