
Hail

Hail is an open-source, scalable framework for exploring and analyzing genomic data.

The Hail project began in Fall 2015 to empower the worldwide genetics community to harness the flood of genomes to discover the biology of human disease. Since then, Hail has expanded to enable analysis of large-scale datasets beyond the field of genomics.

Here are two examples of projects powered by Hail:

  • The gnomAD team uses Hail as its core analysis platform. gnomAD is among the most comprehensive catalogues of human genetic variation in the world, and one of the largest genetic datasets. Analysis results are shared publicly and have had sweeping impact on biomedical research and the clinical diagnosis of genetic disorders.
  • The Neale Lab at the Broad Institute used Hail to perform QC and genome-wide association analysis of 2,419 phenotypes across 10 million variants and 337,000 samples from the UK Biobank in 24 hours. These results are also publicly available; see the Neale Lab blog for more information.

For genomics applications, Hail can (as sketched in the code example after this list):

  • flexibly import from and export to a variety of data and annotation formats, including VCF, BGEN, and PLINK
  • generate variant annotations like call rate, Hardy-Weinberg equilibrium p-value, and population-specific allele count; and import annotations in parallel through the annotation database, VEP, and Nirvana
  • generate sample annotations like mean depth, imputed sex, and TiTv ratio
  • generate new annotations from existing ones as well as genotypes, and use these to filter samples, variants, and genotypes
  • find Mendelian violations in trios, prune variants in linkage disequilibrium, analyze genetic similarity between samples, and compute sample scores and variant loadings using PCA
  • perform variant, gene-burden and eQTL association analyses using linear, logistic, and linear mixed regression, and estimate heritability
  • lots more!
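
For a concrete, if simplified, picture of what this looks like in practice, here is a minimal sketch using the Hail 0.2 Python API. The input path and the pheno phenotype field are hypothetical placeholders, and the filtering thresholds are illustrative only.

    import hail as hl

    hl.init()

    # Import a VCF into a MatrixTable (rows = variants, columns = samples).
    mt = hl.import_vcf('data/genotypes.vcf.bgz', reference_genome='GRCh38')

    # Add per-variant and per-sample QC annotations (call rate, HWE p-value, Ti/Tv, ...).
    mt = hl.variant_qc(mt)
    mt = hl.sample_qc(mt)

    # Filter variants and samples on the derived annotations.
    mt = mt.filter_rows((mt.variant_qc.call_rate > 0.95) &
                        (mt.variant_qc.p_value_hwe > 1e-6))
    mt = mt.filter_cols(mt.sample_qc.call_rate > 0.97)

    # PCA on HWE-normalized genotypes: sample scores and variant loadings.
    eigenvalues, scores, loadings = hl.hwe_normalized_pca(mt.GT, k=10,
                                                          compute_loadings=True)

    # Per-variant linear regression against a (hypothetical) phenotype column field.
    gwas = hl.linear_regression_rows(y=mt.pheno, x=mt.GT.n_alt_alleles(),
                                     covariates=[1.0])

Hail evaluates these transformations lazily and distributes the work over Spark, so the same script runs unchanged on a laptop or a cluster.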

Hail's functionality is exposed through Python and backed by distributed algorithms built on top of Apache Spark to efficiently analyze gigabyte-scale data on a laptop or terabyte-scale data on a cluster.

Users can script pipelines or explore data interactively in Jupyter notebooks that combine Hail's methods, PySpark's scalable SQL and machine learning algorithms, and Python libraries like pandas, scikit-learn, and Matplotlib. Hail also provides a flexible domain-specific language to express complex quality control and analysis pipelines with concise, readable code.
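
As a small illustrative sketch of that interoperability (assuming the Hail 0.2 API; the file path is a placeholder), per-sample annotations can be pulled into pandas and plotted with Matplotlib:

    import hail as hl
    import matplotlib.pyplot as plt

    hl.init()

    # Hypothetical input; compute per-sample QC annotations.
    mt = hl.import_vcf('data/genotypes.vcf.bgz')
    mt = hl.sample_qc(mt)

    # Per-sample annotations live on the columns; expose them as a Hail Table
    # and convert a small projection to a pandas DataFrame.
    samples = mt.cols()
    df = samples.select(call_rate=samples.sample_qc.call_rate,
                        mean_depth=samples.sample_qc.dp_stats.mean).to_pandas()

    # From here, standard Python tooling applies.
    plt.scatter(df.mean_depth, df.call_rate)
    plt.xlabel('mean depth')
    plt.ylabel('call rate')
    plt.show()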

To learn more, you can view our talks at Spark Summit East and Spark Summit West (below).

[Video: Hail talk at Spark Summit West 2017]

Using Hail

To get started using Hail on your own data or on public data (a minimal first-steps sketch follows this list):

  • install Hail using the instructions in Getting Started
  • read the Overview for a broad introduction to Hail
  • follow the Tutorials for examples of how to use Hail
  • check out the Python API for detailed information on the programming interface
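
As a rough sketch of those first steps (assuming a recent release installed with pip install hail and run locally; the path is a placeholder):

    # After installing, e.g.:  pip install hail
    import hail as hl

    # Start Hail; on a laptop this brings up a local Spark backend.
    hl.init()

    # Import a VCF and take a first look at the data.
    mt = hl.import_vcf('data/example.vcf.bgz')
    mt.describe()        # print the row/column/entry schema
    print(mt.count())    # (number of variants, number of samples)
    mt.rows().show(5)    # peek at the first few variants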

Support

There are many ways to get in touch with the Hail team if you need help using Hail, or if you would like to suggest improvements.