geneparse
is a module that helps developers to parse multiple genetics file
format (e.g. Plink binary files, IMPUTE2 files, BGEN and VCF).
The tool requires a standard Python installation (3.6 or higher are supported) with the following modules:
The tool has been tested on Linux only, but should work on MacOS operating systems as well.
You can install or update geneparse
using pip
:
pip install -U geneparse
To test the module, just perform the following command:
$ python -m geneparse.tests
..sssss.........s..sssss.........ssssssss...ss.ss...s.................
.......................................................s....ss.....
----------------------------------------------------------------------
Ran 137 tests in 1.549s
OK (skipped=27)
Some genotype data require indexing for fast access. This can be done using geneparse.
$ python -m geneparse.index --help
usage: geneparse-indexer [-h] [--impute2 IMPUTE2 [IMPUTE2 ...]]
[--bgen BGEN [BGEN ...]] [--legacy]
Genotype file indexer.
optional arguments:
-h, --help show this help message and exit
IMPUTE2 index:
--impute2 IMPUTE2 [IMPUTE2 ...]
Index an IMPUTE2 genotype file format. The file can be
plain text or bgzipped.
BGEN index:
--bgen BGEN [BGEN ...]
Index a BGEN genotype file. This requires 'bgenix' to
be in the PATH.
--legacy Index the file using the '-with-rowid' option. This
flag enables compatibility with SQLITE prior to
version 3.8.2. See
https://bitbucket.org/gavinband/bgen/wiki/bgenix for
more information.
We provide a simple tool to extract genotypes from different format to either VCF or Binary plink files.
$ python -m geneparse.extract --help
usage: geneparse-extractor [-h] -f FORMAT [-e FILE] [-k FILE] [--maf] -o FILE
[--output-format FORMAT]
PARSER_ARGS [PARSER_ARGS ...]
Genotype file extractor. This tool will extract markers according to names or
to genomic locations.
optional arguments:
-h, --help show this help message and exit
Input Options:
-f FORMAT, --format FORMAT
The input file format.
PARSER_ARGS The arguments that will be passed to the genotype
parsers.
Extract Options:
-e FILE, --extract FILE
The list of markers to extract (one per line, no
header).
-k FILE, --keep FILE The list of samples to keep (one per line, no header).
--maf Check MAF and flip the allele coding if the MAF is
higher than 50%.
Output Options:
-o FILE, --output FILE
The output file (can be '-' for STDOUT when using VCF
or CSV as output format).
--output-format FORMAT
The output file format. Note that the extension will
be added if absent. Note that CSV is a long format
(hence it might take more disk space).
The parser arguments (PARSER_ARGS) are the same as the one in the API. For
example, the arguments for the Plink parser is 'prefix:PREFIX' (where PREFIX
is the prefix of the BED/BIM/FAM files).