TypeDB Bio: Biomedical Knowledge Graph

veredsil/typedb-bio

Overview | Installation | Datasets | Examples | How You Can Help | Further Learning

Overview

TypeDB Bio is an open-source biomedical knowledge graph that supports research in areas such as drug discovery, precision medicine and drug repurposing. It gives biomedical researchers an intuitive way to query interconnected, heterogeneous biomedical data in a single place.

For example, by querying for the virus SARS-CoV-2, we can find the associated human protein, proteasome subunit alpha type-2 (PSMA2), a component of the proteasome that is implicated in SARS-CoV-2 replication, and its encoding gene (PSMA2). Additionally, we can identify the drug carfilzomib, a known inhibitor of the proteasome that could therefore be researched as a potential treatment for patients with COVID-19.

image

By examining these specific relationships and their attributes, we can further investigate any connected biological components and better understand their inter-relations. This helps researchers study the mechanisms of protein interactions, infections and the immune response, and find targets for the development of treatments or drugs more efficiently. We can also expand our search to include contextual information, as shown below:

image

The team behind TypeDB Bio is a partnership between GSK, Oxford PharmaGenesis and Vaticle.

The schema that models the underlying knowledge graph, together with the descriptive query language TypeQL, makes writing complex queries straightforward and intuitive. Furthermore, TypeDB's automated reasoning lets TypeDB Bio act as an intelligent biomedical database that infers implicit knowledge from the explicitly stored data. TypeDB Bio can understand biological facts, draw inferences from new findings and enforce research constraints, all at query (run) time.
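To make the idea of inferring implicit knowledge from explicit data more concrete, here is a toy sketch in plain Python. It is not TypeDB's reasoner and does not use TypeQL rule syntax. The relation names (`associated-with-protein`, `component-of`, `inhibits`, `candidate-treatment-for`) are hypothetical labels chosen for illustration; the facts mirror the SARS-CoV-2 example above.

```python
# Toy illustration only: plain Python, not TypeDB's reasoner or TypeQL rules.
# Relation names are hypothetical and do not come from the TypeDB Bio schema.

# Explicitly stored facts as (subject, relation, object) triples.
facts = {
    ("SARS-CoV-2", "associated-with-protein", "PSMA2 protein"),
    ("PSMA2 gene", "encodes", "PSMA2 protein"),
    ("PSMA2 protein", "component-of", "proteasome"),
    ("carfilzomib", "inhibits", "proteasome"),
}


def infer_candidate_drugs(facts):
    """Derive (drug, candidate-treatment-for, virus) triples.

    Rule: if a virus is associated with a protein, the protein is a
    component of a complex, and a drug inhibits that complex, then the
    drug is a candidate worth researching for that virus.
    """
    inferred = set()
    for virus, rel1, protein in facts:
        if rel1 != "associated-with-protein":
            continue
        for part, rel2, complex_name in facts:
            if rel2 != "component-of" or part != protein:
                continue
            for drug, rel3, target in facts:
                if rel3 == "inhibits" and target == complex_name:
                    inferred.add((drug, "candidate-treatment-for", virus))
    return inferred


if __name__ == "__main__":
    for triple in sorted(infer_candidate_drugs(facts)):
        print(triple)
```

In TypeDB itself, a rule of this kind would be declared in the schema and evaluated at query time, so the derived relation would not need to be stored.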

Installation

Prerequisites: Python >3.6, TypeDB Core 2.14.0, TypeDB Python Client API 2.14.3, TypeDB Studio 2.11.0

Clone this repo:

git clone https://github.com/vaticle/typedb-bio.git

Download the CORD-NER data set from this link and add it to this directory: Dataset/CORD_NER

Set up a virtual environment and install the dependencies:

cd <path/to/typedb-bio>/
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Start TypeDB:

typedb server

Start the migrator script:

python migrator.py -n 4 # insert using 4 threads

For help with the migrator script command line options:

python migrator.py -h

Now grab a coffee (or two) while the migrator builds the database and schema for you!
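For readers curious about what "insert using 4 threads" can look like in general, the following is a generic sketch of batching work across a thread pool. It is not the code in migrator.py: `split_into_batches`, `insert_batch`, `run_parallel` and the batch size are hypothetical names and values, and `insert_batch` is a placeholder that only counts items instead of writing to TypeDB.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def insert_batch(batch):
    """Placeholder for a real insert (hypothetical).

    A real migrator would open a write transaction here, run one insert
    query per item and commit. This stub only reports the batch size.
    """
    return len(batch)


def run_parallel(items, num_threads=4, batch_size=50):
    """Process all batches using num_threads worker threads."""
    batches = split_into_batches(items, batch_size)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # pool.map returns results in the same order as the input batches.
        return sum(pool.map(insert_batch, batches))


if __name__ == "__main__":
    total = run_parallel([f"row-{i}" for i in range(120)], num_threads=4)
    print(f"processed {total} rows")
```

Committing one transaction per batch rather than per row is a common trade-off between throughput and the amount of work lost if a single insert fails.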

Examples

TypeQL queries can be run in the TypeDB console, in TypeDB Studio or through the client APIs. However, we encourage running the queries in TypeDB Studio for the best visual experience.

# What are the drugs that interact with the genes associated with the virus SARS?

match 
$virus isa virus, has virus-name "SARS"; 
$gene isa gene; 
$drug isa drug; 
$rel1 ($gene, $virus) isa gene-virus-association; 
$rel2 ($gene, $drug) isa drug-gene-interaction; 
offset 0; limit 20;

image
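The same query can also be run through the Python client listed in the prerequisites. The sketch below assumes the 2.x `typedb.client` API and a server on the default address `localhost:1729`. The database name is not stated in this README, so it is left as a required argument, and `run_query` is a hypothetical helper name.

```python
# Minimal sketch, assuming the TypeDB Python client 2.x API.
# The database name must be supplied by the caller; it is not given in this README.

QUERY = """
match
$virus isa virus, has virus-name "SARS";
$gene isa gene;
$drug isa drug;
$rel1 ($gene, $virus) isa gene-virus-association;
$rel2 ($gene, $drug) isa drug-gene-interaction;
offset 0; limit 20;
"""


def run_query(database, address="localhost:1729"):
    """Run QUERY in a read transaction and return the IIDs of matched drugs."""
    # Imported lazily so this module can be loaded without the client installed.
    from typedb.client import TypeDB, SessionType, TransactionType

    with TypeDB.core_client(address) as client:
        with client.session(database, SessionType.DATA) as session:
            with session.transaction(TransactionType.READ) as tx:
                # Answers are consumed while the transaction is still open.
                return [answer.get("drug").get_iid()
                        for answer in tx.query().match(QUERY)]
```

Calling `run_query("<your-database-name>")` requires a running TypeDB server and a database that the migrator has already populated.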

Datasets

Currently the datasets we've integrated include:

  • CORD-NER: The CORD-19 dataset that the White House released has been annotated and made publicly available. It uses various NER methods to recognise named entities on CORD-19 with distant or weak supervision.
  • UniProt: We’ve downloaded the reviewed human subset, and ingested genes, transcripts and protein identifiers.
  • Coronaviruses: This is an annotated dataset of coronaviruses and their potential drug targets put together by Oxford PharmaGenesis based on literature review.
  • DGIdb: We’ve taken the Interactions TSV which includes all drug-gene interactions.
  • Human Protein Atlas: The Normal Tissue Data includes the expression profiles for proteins in human tissues.
  • Reactome: This dataset connects pathways and their participating proteins.
  • DisGeNET: We’ve taken the curated gene-disease-associations dataset, which contains associations from UniProt, CGI, ClinGen, Genomics England, CTD, PsyGeNET and Orphanet.
  • SemMed: This is a subset of the SemMed version 4.0 database.

In progress:

  • CORD-19: We incorporate the original corpus which includes peer-reviewed publications from bioRxiv, medRxiv and others.
    • TODO: write migrator script
  • TissueNet
    • TODO: ./Migrators/TissueNet/TissueNetMigrator.py incomplete: only migrates a single data file and is not called in ./migrator.py.

We plan to add many more datasets!

How You Can Help

This is an ongoing project and we need your help! If you want to contribute, you can help us with the following:

  • Migrate more data sources (e.g. clinical trials, DrugBank, Excelra)
  • Extend the schema by adding relevant rules
  • Create a website
  • Write tutorials and articles for researchers to get started

If you wish to get in touch, please talk to us on the #typedb-bio channel on our Discord (link here).

Further Learning
