This repository has been archived by the owner on Dec 13, 2019. It is now read-only.

Identifiers

Jim Balhoff edited this page May 19, 2016 · 4 revisions

Consistent handling of identifiers is crucial for phenopackets. On the one hand, a phenopacket often refers to an entity outside the current packet, such as a variant or a person. On the other hand, phenopackets are structured so that identifier references are frequently used in place of nested objects.

This is shown in the following example:

persons:
  - id: "#1"
phenotype_profile:
  - entity: "#1"
    phenotype:
      types:
        - id: HP:0003560

The first phenotype association in the profile refers to patient 1 by an id reference, rather than by nesting one object inside the other. This makes it easier to partition a phenopacket into distinct components that can be communicated asynchronously.

We can also see in the above example that we are referring to an ontology class, HP:0003560, by its identifier.

It is therefore crucial that identifier references can be bound to the correct entities.
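As a minimal sketch of what binding a reference means within a single packet, the following Python snippet pairs each phenotype association with the person its `entity` field points at. The dictionary mirrors the example above as it might look after parsing; the function names `index_by_id` and `resolve_entities` are hypothetical and not part of any phenopackets library.

```python
# Hypothetical in-memory form of the example packet above (as parsed from YAML).
packet = {
    "persons": [{"id": "#1"}],
    "phenotype_profile": [
        {"entity": "#1", "phenotype": {"types": [{"id": "HP:0003560"}]}},
    ],
}


def index_by_id(entities):
    """Build a lookup table from identifier to entity object."""
    return {entity["id"]: entity for entity in entities}


def resolve_entities(packet):
    """Pair each phenotype association with the person its 'entity' refers to.

    The person is None when the reference points outside this packet.
    """
    persons = index_by_id(packet.get("persons", []))
    return [
        (persons.get(association["entity"]), association)
        for association in packet.get("phenotype_profile", [])
    ]


pairs = resolve_entities(packet)
```

Because the association only carries the identifier, the `persons` and `phenotype_profile` parts could be sent separately and joined later in the same way.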

Approach

The World Wide Web Consortium (W3C) provides many of the standards required for a successful identifier binding strategy. In particular, W3C standards identify entities (called resources in W3C terminology) by URIs or IRIs.

An example of an IRI is that provided by the Human Phenotype Ontology for the class Muscular dystrophy, which is:

http://purl.obolibrary.org/obo/HP_0003560

Note that as far as many technologies within the W3C stack are concerned, this URI is the identifier for that class. This eliminates the possibility of any clashes (for example, if another database were added that accidentally also chose to use the prefix "HP").

Many W3C standards also provide a means of compacting or shortening IRIs unambiguously within the context of a single document, through the declaration of prefixes. For example, a document that declares the prefix:

HP: http://purl.obolibrary.org/obo/HP_

can refer to the URI using the shortened form HP:0003560. This is known as a CURIE, or compact URI (although not all compacted URIs are CURIEs). Note that this shortened form is only valid within the context of the document that declares the HP prefix. Another document can use the prefix "HP" to stand for an imaginary "Human Proteome" database without confusion: when the documents are combined, the CURIEs are expanded to globally unambiguous URIs.
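The expansion step can be sketched in a few lines of Python. This is an illustration only, not the algorithm of any particular JSON-LD processor: `expand_curie` is a hypothetical helper, and the "Human Proteome" namespace `http://example.org/human-proteome/` is a made-up placeholder for the imaginary database mentioned above.

```python
def expand_curie(curie, prefixes):
    """Expand a CURIE to a full IRI using one document's prefix map."""
    prefix, sep, local_id = curie.partition(":")
    if not sep or prefix not in prefixes:
        # No colon, or a prefix this document does not declare:
        # return the value unchanged.
        return curie
    return prefixes[prefix] + local_id


# Prefix map of a document that uses "HP" for the Human Phenotype Ontology.
phenotype_doc = {"HP": "http://purl.obolibrary.org/obo/HP_"}
# Hypothetical second document that reuses "HP" for a "Human Proteome" database.
proteome_doc = {"HP": "http://example.org/human-proteome/"}

iri_a = expand_curie("HP:0003560", phenotype_doc)
iri_b = expand_curie("HP:0003560", proteome_doc)
```

The same CURIE yields two different IRIs, so the two documents can be combined after expansion without a clash.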

Phenopackets leverage a W3C standard called JSON-LD, and in particular JSON-LD context files.

A context file contains, among other things, a set of prefix mappings that determine how short tokens are expanded to IRIs. All PhenoPackets make use of a single centralized standard JSON-LD context file which contains default mappings; if required, this set can be extended or overridden within any one PhenoPacket.
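The layering of packet-level mappings over the defaults can be pictured as a dictionary merge. The sketch below is a simplification under stated assumptions: `DEFAULT_PREFIXES` is a one-entry stand-in for the real default context, and the `LAB` prefix and `example.org` namespaces are invented for illustration.

```python
# Stand-in for the shared default context; the real file defines more prefixes.
DEFAULT_PREFIXES = {
    "HP": "http://purl.obolibrary.org/obo/HP_",
}


def effective_prefixes(local_prefixes=None):
    """Return the default prefixes with a packet's own declarations on top."""
    merged = dict(DEFAULT_PREFIXES)      # copy so the defaults are never mutated
    merged.update(local_prefixes or {})  # local entries extend or override
    return merged


# Hypothetical packet that adds a new prefix "LAB".
extended = effective_prefixes({"LAB": "http://example.org/lab/"})
# Hypothetical packet that overrides the default meaning of "HP".
overridden = effective_prefixes({"HP": "http://example.org/human-proteome/"})
```

Local declarations win over defaults here, and the defaults themselves are left untouched for other packets.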

Default JSON-LD context

The default JSON-LD context defines prefixes both for commonly used ontologies, such as the HPO, and for commonly used databases, such as ClinVar.

This means that any phenopacket can refer to the class for Muscular dystrophy by the CURIE HP:0003560, or by the fully expanded http://purl.obolibrary.org/obo/HP_0003560.

Internal identifiers

Consider the case where we have a phenopacket describing a case study involving 3 people. These people are not registered in any global patient database (and even if they were, we may not wish to identify them).

For cases such as these, we can use internal identifiers. For example, we may want to refer to them as patients 1, 2 and 3. Of course, these numbers are entirely local to this study; another study may use numbers 1, 2 and 3 to refer to completely different people.

Here we recommend the use of hash-prefixed identifiers, such as:

  • #patient1
  • #patient2
  • #patient3

These start with a hash, followed by alphanumeric characters or the symbols -, _ or /. They must not include a colon (:), as this would cause them to be interpreted as CURIEs.
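A simple check for this shape might look as follows. The regular expression is one reading of the rule above and assumes that "alphanumeric" means ASCII letters and digits; `is_internal_id` is a hypothetical helper name.

```python
import re

# A hash, then one or more ASCII alphanumerics, '-', '_' or '/'.
# Colons are not in the character class, so CURIE-like strings are rejected.
INTERNAL_ID = re.compile(r"#[A-Za-z0-9_/-]+")


def is_internal_id(identifier):
    """Return True if the whole string is a hash-prefixed internal identifier."""
    return INTERNAL_ID.fullmatch(identifier) is not None
```

For instance, `is_internal_id("#patient1")` is true, while `is_internal_id("#patient:1")` is false because of the colon.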

IMPORTANT NOTE: when using YAML syntax, remember to quote these identifiers; otherwise they are interpreted as comments!

When these are expanded to URIs, the URI of the base document for the phenopacket is used (see later), which guarantees uniqueness for the internal identifiers, provided the base URI is unique.

In cases where we may sometimes want to refer to these individuals from outside the phenopacket, we recommend using a slash-prefixed identifier. This will be concatenated onto the base URI.

For example, if the document base URI is http://mypatientregistry, then an identifier /patient/1 will be expanded to http://mypatientregistry/patient/1.
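The expansion of hash-prefixed and slash-prefixed identifiers can be sketched with Python's standard `urllib.parse.urljoin`, which implements generic relative-reference resolution. Treating this as equivalent to what a JSON-LD processor does with the base URI is an assumption of this sketch, and `expand_local_id` is a hypothetical helper.

```python
from urllib.parse import urljoin

# Base URI taken from the example in this section.
BASE = "http://mypatientregistry/"


def expand_local_id(identifier, base=BASE):
    """Resolve a hash- or slash-prefixed identifier against the base URI."""
    return urljoin(base, identifier)


external = expand_local_id("/patient/1")
internal = expand_local_id("#patient1")
```

Note that under this kind of resolution a slash-prefixed identifier replaces any path already present in the base rather than being appended to it, so a base with no path component (as in the example) gives the least surprising result.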

Declaring a base URI

'@context':
    '@base': http://mypatientregistry/

We do not provide recommendations as to whether that URI must be a URL that is resolvable in a web browser. For example, some groups may choose to use UUID URNs.
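For groups that prefer a non-resolvable base, a UUID URN can be generated with Python's standard `uuid` module. This is only a sketch: `new_base_urn` is a hypothetical helper, and appending the hash-prefixed identifier by plain string concatenation is an assumption about how such a base would be combined with internal identifiers.

```python
import uuid


def new_base_urn():
    """Return a fresh, random base of the form 'urn:uuid:<uuid>'."""
    return uuid.uuid4().urn


base = new_base_urn()
# Assumption: the internal identifier is simply appended as a fragment.
patient_iri = base + "#patient1"
```

Since each generated base is unique with overwhelming probability, internal identifiers such as `#patient1` remain unambiguous across packets.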
