Prevailing clinical pipelines use only a small fraction of the available data: they rely on exome and genome data to characterize patients and search public databases for potential similarities. This is possible because standard formats have been developed for these data types (genome and variant), and those formats enable automation, search, and similarity matching.
The health of an individual organism results from a complex interplay between its genes and environment. Although great strides have been made in standardizing the representation of genetic information for exchange, there are no comparable standards to represent phenotypes (e.g. patient symptoms and disease features) or environmental factors (Figure 1). Phenotypic abnormalities of individual organisms are currently described in diverse places and in diverse formats: publications, databases, health records, registries, clinical trials, and even social media. However, the lack of standardization, accessibility, and computability among these contexts makes it extremely difficult to extract and use these data effectively, hindering the understanding of genetic and environmental contributions to disease. We are a multi-disciplinary, international scientific team composed of basic science researchers, data modelers, computer scientists, bioinformaticians, environmental scientists, and clinicians dedicated to maximizing the value of existing and new data. Here we propose a new unit of exchange for the open data marketplace that will make the sharing of phenotype data sustainable, persistent, and reusable without sacrificing the flexibility needed to express rapidly emerging threats to the environment and human health (e.g., Zika virus). This unit, called a “Phenopacket”, will contain the minimal amount of information necessary for describing one or more patients, cohorts, or populations in a computable way. The Phenopacket is citable, persistent, and available outside the paywall of journals, and can be read and written by anyone: data repositories, patient registries, drug companies, environmental monitoring agencies, and even patients themselves, to enable crowdsourced analysis. We will develop the standard and initial software tools to read, write, and validate Phenopackets.
The greater accessibility and computability of these data have the potential to revolutionize the field of healthcare analytics.
Genotypes have exchange formats (VCF, for example) that enable sequence similarity matching
Phenotypes need an exchange format to enable cross-species analysis
Phenotypes are difficult
- One model, many encodings
  - YAML, JSON, RDF, TSV
- Species-agnostic
  - Microbes, plants, humans
  - Clinical and basic research
- Support a variety of entities
  - Patients/individual organisms, cohorts, populations
  - Diseases
  - Papers
  - Genes, genotypes, alleles, variants
- Simple for simple cases
  - Bag of terms model
- Incremental expressivity
  - Temporality and causality
  - Quantitative and qualitative
  - Negation, severity, frequency, expressivity
- Ontology-smart
  - Rational composition (post-coordination)
  - Explicit semantics
- Findable
- Accessible outside paywalls and private data sources
- Attributable
- Interoperable and computable
- Reusable, exchangeable across contexts and disciplines
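Several of the requirements above can be illustrated with a small sketch. The Python below is not the PXF specification: the class and field names (`Packet`, `Phenotype`, `subject`, `term`, `negated`, `severity`), the phenotype labels, and the `EX:` term identifiers are hypothetical placeholders chosen for illustration. It shows one in-memory model rendered in two encodings (JSON and TSV), with a bag of terms as the simple case and optional qualifiers such as negation and severity layered on top.

```python
import csv
import io
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Set


@dataclass
class Phenotype:
    term: str                        # ontology term ID (placeholder, not a real HPO ID)
    label: str                       # human-readable name of the term
    negated: bool = False            # True means the feature was looked for and found absent
    severity: Optional[str] = None   # optional qualifier; omitted in the simple case


@dataclass
class Packet:
    id: str
    subject: str                     # the individual, cohort, or population described
    phenotypes: List[Phenotype] = field(default_factory=list)


def to_json(packet: Packet) -> str:
    """Encode the model as JSON."""
    return json.dumps(asdict(packet), indent=2, sort_keys=True)


def to_tsv(packet: Packet) -> str:
    """Encode the same model as one tab-separated row per phenotype."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(["subject", "term", "label", "negated", "severity"])
    for p in packet.phenotypes:
        writer.writerow(
            [packet.subject, p.term, p.label, str(p.negated).lower(), p.severity or ""]
        )
    return buf.getvalue()


def bag_of_terms(packet: Packet) -> Set[str]:
    """Simplest view: the set of term IDs asserted as present."""
    return {p.term for p in packet.phenotypes if not p.negated}


packet = Packet(
    id="example-1",
    subject="patient-1",
    phenotypes=[
        Phenotype("EX:0000001", "short stature", severity="severe"),
        Phenotype("EX:0000002", "seizure", negated=True),
    ],
)

print(to_json(packet))
print(to_tsv(packet))
print(bag_of_terms(packet))
```

Keeping the encoders separate from the model is the design choice being illustrated: a consumer that only needs the bag of terms can ignore the qualifiers, while richer tools can read them from the same record.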
It is important to understand the difference between the terms disease and phenotype. They are often used interchangeably, but that obscures the important relationship between the genetic causes of disease and the phenotypic effects of these underlying genetic phenomena.
We use phenotype to refer to a discrete feature, such as hyperglycemia, that is one observable component of a disease, such as diabetes mellitus type II.
The Phenotype Exchange Format (PXF) proposed here is designed to support “deep phenotyping”, a process wherein individual components of each phenotype are observed and documented 5.
To describe things and communicate, a community agrees upon names for organisms, diseases, anatomy, and phenotypes. The Tower of Babel problem is that multiple communities exist, each of which has developed its own nomenclature.
The PXF requires the use of a common ontology, a logically defined hierarchy of terms, that allows sophisticated algorithmic analysis over medically relevant abnormalities.
The Human Phenotype Ontology (HPO) 6 was built for this purpose and has been used for genomic diagnostics, translational research, genomic matchmaking, and systems biology applications 7–14. The HPO is developed in the context of the Monarch Initiative, an international team of computer scientists, clinicians, and biologists in the United States, Europe, and Australia; HPO is being translated into multiple languages to support international interoperability. Due to its extensive phenotypic coverage beyond other terminologies 15,16, HPO has recently been integrated into the Unified Medical Language System (UMLS) to support deep phenotyping in a variety of mainstream health care IT systems.
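To make concrete why a logically defined hierarchy enables algorithmic analysis, here is a minimal sketch under stated assumptions. The `PARENTS` table is a toy is-a hierarchy with hypothetical `EX:` identifiers, not an excerpt of the HPO, and Jaccard overlap of ancestor sets is only one simple similarity measure, not necessarily the one used by HPO-based tools. The point is that two different terms can still be scored as related because they share ancestors, which plain string matching cannot do.

```python
from typing import Dict, Iterable, Set

# Toy is-a hierarchy: child term -> set of parent terms (hypothetical IDs).
PARENTS: Dict[str, Set[str]] = {
    "EX:root": set(),
    "EX:metabolic": {"EX:root"},
    "EX:glucose": {"EX:metabolic"},
    "EX:hypoglycemia": {"EX:glucose"},
    "EX:hyperglycemia": {"EX:glucose"},
    "EX:skeletal": {"EX:root"},
}


def ancestors(term: str) -> Set[str]:
    """Return the term together with all of its ancestors."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def closure(terms: Iterable[str]) -> Set[str]:
    """Union of the ancestor sets of every term in a profile."""
    result: Set[str] = set()
    for term in terms:
        result |= ancestors(term)
    return result


def similarity(profile_a: Iterable[str], profile_b: Iterable[str]) -> float:
    """Jaccard overlap of the two profiles' ancestor closures."""
    a, b = closure(profile_a), closure(profile_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Sibling terms share three of five terms in the combined closure.
print(similarity({"EX:hypoglycemia"}, {"EX:hyperglycemia"}))
# Terms in different branches share only the root.
print(similarity({"EX:hypoglycemia"}, {"EX:skeletal"}))
```

With exact term matching, both comparisons would score zero; with the hierarchy, the sibling terms score higher than the unrelated pair.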
While great strides have been made in exchange formats for sequence and variation data (e.g. Variant Call Format; VCF 1), complementary standards for phenotypes and environment are urgently needed.
For individuals with rare and undiagnosed diseases, such standards could improve the speed and accuracy of diagnosis. For patients with common but hard-to-treat diseases, such standards can help us design personalized interventions and learn more about shared disease mechanisms 2.
We propose that when phenotypic abnormalities of individuals are described, whether in publications, databases, health records, or social media, these descriptions a) contain a minimum set of fields and b) be transmitted alongside genomic sequence data, such as in VCF, between clinics, authors, journals, and data repositories. The structure of the data in the exchange standard will be optimized for integration from these distributed contexts.
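A check for a minimum set of fields could look like the following sketch. The required field names and types in `REQUIRED_FIELDS` are assumptions made for illustration only; the actual minimum set is for the standard to define. The function collects every problem it finds rather than stopping at the first, so a submitter can fix a record in one pass.

```python
from typing import Any, Dict, List

# Hypothetical minimum set of fields and their expected types.
REQUIRED_FIELDS = {"id": str, "subject": str, "phenotypes": list}


def validate(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the record passes."""
    errors: List[str] = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            errors.append(f"wrong type for {name}: expected {expected.__name__}")
    # Each phenotype entry must at least carry an ontology term ID.
    phenotypes = record.get("phenotypes")
    if isinstance(phenotypes, list):
        for index, item in enumerate(phenotypes):
            if not isinstance(item, dict) or not item.get("term"):
                errors.append(f"phenotypes[{index}]: missing term")
    return errors


good = {
    "id": "example-1",
    "subject": "patient-1",
    "phenotypes": [{"term": "EX:0000001", "label": "short stature"}],
}
bad = {"id": "example-2", "phenotypes": [{"label": "seizure"}]}

print(validate(good))
print(validate(bad))
```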
The implementation of such a system will allow the sharing of phenotype data prospectively, as well as retrospectively. Increasing the volume of computable data across a diversity of systems will support large-scale computational disease analysis using the combined genotype and phenotype data.
Such a system is necessary because study sizes of well over 100,000 patients are thought to be required to effectively assess the role of rare variation in common disease 3 or to discover the genomic basis for a substantial portion of Mendelian diseases 4.
It is timely because studies of this power are now becoming financially and technologically tractable.