Data Sources

The majority of data sources are either provided in this repository or automatically downloaded via scripts. However, some need to be manually obtained and saved into the folder structure prior to executing the analysis pipeline.

Automatic download

To download most of the data sources, simply execute:

./download.sh

and then (requires Python 3):

python3 download_psicquic.py

Manual download

The following data sources need to be manually obtained, since they are not (yet) publicly available:

  • Dana Farber CCSB HI-2012 PPI network (see below)

If you don't want to sign up to get this data, you can instead download one of the older Human Interactomes from their website and save it as HI_2012_PRE.tsv in the ppi/ folder. Note that the results from the analysis pipeline will differ for this network, since it is based on different data. All other results should remain identical. The pipeline will not run if this file is missing.

Protein-Protein Interaction Networks

Bossi & Lehner composite PPI network

From the supplementary section of the paper "Tissue specificity and the human protein interaction network".

This can be automatically downloaded via the download_data.sh script.

Dana Farber CCSB PPI network

The Human Interactome 2012 is still in preliminary form, so you have to sign up at the CCSB to download the Human Interactome database here.

Download the HI_2012_PRE.tsv file and save it into the ppis/ folder.
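Since the pipeline does not run without this file, it can help to check for manually obtained inputs up front. The following is a minimal sketch of such a check; the helper name `missing_inputs` is hypothetical and not part of the repository, and the example path is simply the one named above.

```python
import os

def missing_inputs(paths):
    """Return the subset of `paths` that does not exist as a regular file."""
    return [p for p in paths if not os.path.isfile(p)]

# Example path taken from the text above; adjust it to your checkout.
required = ["ppis/HI_2012_PRE.tsv"]
for path in missing_inputs(required):
    print("missing input file:", path)
```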

Havugimana et al. protein complexes PPI network

This is the protein complex network from the paper "A Census of Human Soluble Protein Complexes" by Havugimana et al.

Download the network from the supplemental information (Table S2) here.

The protein-protein interactions are in the Excel sheet "14K Denoised PPI". Copy the first two columns into a tab-separated text file with headers "Gene1" and "Gene2".
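If you prefer not to copy the columns by hand, the conversion can be sketched in Python. This is only an illustration: the function name `write_edge_list` is hypothetical, and it assumes the sheet rows (without the sheet's own header row) have already been read into Python by whatever spreadsheet reader you use.

```python
import csv
import io

def write_edge_list(rows, out):
    """Write the first two columns of each row as a tab-separated
    Gene1/Gene2 table and return the number of interactions written."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["Gene1", "Gene2"])
    written = 0
    for row in rows:
        if len(row) < 2 or not row[0] or not row[1]:
            continue  # skip blank or incomplete rows
        writer.writerow([str(row[0]).strip(), str(row[1]).strip()])
        written += 1
    return written

# Placeholder rows for illustration only (not real data from Table S2).
example_rows = [("GENE_A", "GENE_B", 0.9), ("GENE_C", "GENE_D", 0.8)]
buffer = io.StringIO()
write_edge_list(example_rows, buffer)
```

In practice `out` would be a file opened for writing in text mode instead of the in-memory buffer used here.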

Note:

Right now this network is already part of the repository. This might complicate giving free access to the repo, as I am currently unsure about the licensing of the PPI network from Havugimana et al.

string-db

The protein interaction data can be loaded from string-db automatically with the download_data.sh shell script.

If you have already downloaded the string-db file, or want to import it manually for other reasons, do the following:

Go to the Download section of string-db and download the protein.links.vX.XX.txt.gz file. This file contains all protein-protein interactions, identified by Ensembl IDs, with a reliability score per interaction.

Direct download link (for version 9.05): here

Save this file into the data/download folder, and then run the download_data.sh script, which will pre-filter the string-db interactions, keeping only human interactions with a score >= 0.7.
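The pre-filtering step can be illustrated with a short Python sketch. This is not the implementation used by download_data.sh. It assumes the usual layout of the links file: whitespace-separated columns `protein1 protein2 combined_score`, protein IDs prefixed with the NCBI taxonomy ID (9606 for human), and scores stored as integers on a 0-1000 scale, so that a score of 0.7 corresponds to 700. The function name `filter_string_links` is hypothetical.

```python
def filter_string_links(lines, taxon="9606", min_score=700):
    """Yield (protein1, protein2, score) for interactions where both
    proteins belong to `taxon` and the score is at least `min_score`.

    Assumes whitespace-separated columns and integer scores (0-1000)."""
    prefix = taxon + "."
    for line in lines:
        parts = line.split()
        if len(parts) != 3 or not parts[2].isdigit():
            continue  # header line or malformed record
        protein1, protein2, score = parts[0], parts[1], int(parts[2])
        if (protein1.startswith(prefix) and protein2.startswith(prefix)
                and score >= min_score):
            yield protein1, protein2, score
```

The function takes any iterable of text lines, so it could be fed from `gzip.open(path, "rt")` to read the compressed download without unpacking it first.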

PSICQUIC Composite network

Run the download_psicquic.py Python script in this folder to download the different PPI networks provided via PSICQUIC.

Protein Expression data sets

These are all downloaded and unpacked automatically via the download_data.sh script. The following just describes the data sources.

Human Protein Atlas (HPA)

From the Human Protein Atlas version 11, the normal_tissue.csv and subcellular_location.csv files are used.

Manual download here

GeneAtlas (now BioGPS)

This is RNA microarray data from Su AI et al. 2009. It can be downloaded from BioGPS here.

The files gnf1h-gcrma.zip and gnf1h-anntable.zip are needed: the first contains the actual expression data, while the second holds the gene annotation.

Illumina Body Map RNAseq

The RNAseq data from Illumina Body Map can be downloaded from EBI here.

The automatic script removes all filters: no gene selection is applied (in particular, the data is not restricted to protein-coding genes) and the cutoff is set to 0.

RNAseq Atlas from medicalgenomics.org

This data is also publicly available and can be downloaded from here.

This is also automatically downloaded with the download.sh script.

Mapping files

BioMart id mapping

Go to: ensembl.org

Choose "Ensembl Genes 71" (or current version) and table "Homo sapiens genes"

Include the following fields for the table:

  • Ensembl Gene ID
  • Ensembl Protein ID
  • Associated Gene Name
  • UniProt/SwissProt ID
  • HGNC ID(s)
  • EntrezGene ID

Export the table as CSV (choose "Unique results only") and save it as mapping/mart_export.csv.

Currently BioMart version 71 is provided in the repository.
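As a rough illustration of how the exported table can be consumed, the sketch below builds a lookup from Ensembl gene ID to gene name. It assumes the CSV header uses the field names exactly as listed above; the function name `load_gene_names` is hypothetical and not part of the pipeline.

```python
import csv
import io

def load_gene_names(handle):
    """Map each Ensembl Gene ID to its first Associated Gene Name.

    Assumes a CSV header containing the columns 'Ensembl Gene ID'
    and 'Associated Gene Name'."""
    mapping = {}
    for row in csv.DictReader(handle):
        gene_id = (row.get("Ensembl Gene ID") or "").strip()
        name = (row.get("Associated Gene Name") or "").strip()
        if gene_id and name:
            # A gene can appear once per protein ID; keep the first name.
            mapping.setdefault(gene_id, name)
    return mapping

# Placeholder content for illustration only (not real identifiers).
example = io.StringIO(
    "Ensembl Gene ID,Ensembl Protein ID,Associated Gene Name\n"
    "ENSG_X,ENSP_X1,NAME_X\n"
    "ENSG_X,ENSP_X2,NAME_X\n"
    "ENSG_Z,ENSP_Z1,\n"
)
gene_names = load_gene_names(example)
```

For the real file, pass `open("mapping/mart_export.csv", newline="")` instead of the in-memory example.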

HGNC mapping

Get the data from genenames.org. Go to Locus Group "protein-coding gene" and click "Custom". Choose only the following columns:

  • HGNC ID
  • Approved Symbol
  • Approved Name
  • Status
  • Entrez Gene ID
  • Ensembl Gene ID, plus the following columns from external sources:
  • Entrez Gene ID (supplied by NCBI)
  • UniProt ID (supplied by UniProt)
  • Ensembl ID (supplied by Ensembl)

Make sure to deselect (exclude) the status "Entry and Symbol Withdrawn".

Full URL to results: HGNC Mapping

This should return all 19060 genes.

Save this file as mapping/hgnc_downloads.txt. A version of this file is provided in the repository (not guaranteed to be up to date).