Analysis of tissue-specific protein-protein interaction networks

patflick/tsppi

PAPPI - Protein Atlas and protein-protein interaction networks

The code consists of a Python pipeline built on SQL (currently SQLite).

The pipeline imports the various data files, mostly in comma-separated values (CSV) or tab-separated values (TSV) format, into a common SQL database. It then maps gene IDs from Entrez, UniProt and Ensembl to HGNC symbols (names). The HGNC symbols serve as the main identifier in the rest of the pipeline and in the analysis.
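The import-then-map idea can be sketched as follows. This is a minimal illustration using only the standard library, not the code in src/; the function `import_delimited`, the table names, the column names and the toy IDs are all hypothetical.

```python
import csv
import io
import sqlite3


def import_delimited(conn, table, handle, delimiter=","):
    """Load a delimited text stream with a header row into a new SQLite table."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    # Quote identifiers, since column names in data files may contain spaces.
    columns = ", ".join('"{}" TEXT'.format(name.replace('"', '""')) for name in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, columns))
    placeholders = ", ".join("?" for _ in header)
    conn.executemany('INSERT INTO "{}" VALUES ({})'.format(table, placeholders), reader)
    conn.commit()


conn = sqlite3.connect(":memory:")

# Toy expression data (CSV) and a toy ID mapping (TSV); all values are made up.
expr = io.StringIO("Gene,Tissue,Level\nENSG_A,liver,High\nENSG_B,lung,Low\n")
import_delimited(conn, "expr", expr)
mapping = io.StringIO("ensembl_id\thgnc_symbol\nENSG_A\tGENE_A\n")
import_delimited(conn, "ensembl_to_hgnc", mapping, delimiter="\t")

# Replace the Ensembl ID with its HGNC symbol through an inner join.
rows = conn.execute(
    "SELECT m.hgnc_symbol, e.Tissue, e.Level FROM expr AS e "
    "JOIN ensembl_to_hgnc AS m ON m.ensembl_id = e.Gene"
).fetchall()
print(rows)
```

The inner join silently drops `ENSG_B`, which has no entry in the toy mapping table.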

A few genes are still lost in this pipeline approach, as they cannot be mapped to any HGNC symbol.
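One way to see which IDs are lost is a LEFT JOIN against the mapping table that keeps only the rows without a match. The sketch below assumes a hypothetical schema (`ppi`, `ensembl_to_hgnc`) and toy IDs; the real table names in the pipeline may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema with toy IDs, for illustration only.
conn.executescript("""
CREATE TABLE ppi (gene1 TEXT, gene2 TEXT);
CREATE TABLE ensembl_to_hgnc (ensembl_id TEXT, hgnc_symbol TEXT);
INSERT INTO ppi VALUES ('G1', 'G2'), ('G2', 'G3');
INSERT INTO ensembl_to_hgnc VALUES ('G1', 'SYM1'), ('G2', 'SYM2');
""")


def unmapped_ids(conn):
    """Return every ID used in the network that has no HGNC symbol."""
    query = """
    SELECT ids.gene
    FROM (SELECT gene1 AS gene FROM ppi UNION SELECT gene2 FROM ppi) AS ids
    LEFT JOIN ensembl_to_hgnc AS m ON m.ensembl_id = ids.gene
    WHERE m.hgnc_symbol IS NULL
    ORDER BY ids.gene
    """
    return [row[0] for row in conn.execute(query)]


print(unmapped_ids(conn))
```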

The Python pipeline is in the src/ folder. Call the init_pappi.py script to import all files. The paths to the various data files are also configured in that script.

Right now this script is still a bit messy and depends on hardcoded file paths and filenames.

The analysis folder contains R scripts for analysis of the data.

Edit the sql_config.R file to point to the correct SQLite file.

Get the data:

Create a folder for all the data files (preferably on a local drive rather than a network drive).

Human protein atlas

Go to the downloadable data section at proteinatlas.org and download the normal_tissue.csv.zip file.

Unpack it into the data folder.

string-db

Go to the Download section of string-db and download the protein.links.vX.XX.txt.gz file. This file contains all protein-protein interactions, identified by Ensembl IDs, with a reliability score per interaction.
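Reading such a links file could look roughly like this. The sketch assumes a layout of one header line followed by space-separated rows of two protein IDs and an integer score, with each ID prefixed by an NCBI taxon ID (9606 for human); check this against the file you actually download. The function names and the toy IDs are hypothetical.

```python
import io


def strip_taxon(identifier):
    """Drop a leading '<taxon>.' prefix, e.g. '9606.ENSP_A' -> 'ENSP_A'."""
    return identifier.split(".", 1)[-1]


def read_string_links(handle, taxon="9606", min_score=0):
    """Yield (protein1, protein2, score) for one organism above a score cutoff."""
    prefix = taxon + "."
    next(handle, None)  # skip the assumed header line
    for line in handle:
        fields = line.split()
        if len(fields) != 3:
            continue  # ignore blank or malformed lines
        protein1, protein2, score = fields
        if not (protein1.startswith(prefix) and protein2.startswith(prefix)):
            continue
        if int(score) >= min_score:
            yield strip_taxon(protein1), strip_taxon(protein2), int(score)


# Toy input in the assumed layout; the IDs are made up.
sample = io.StringIO(
    "protein1 protein2 combined_score\n"
    "9606.ENSP_A 9606.ENSP_B 900\n"
    "9606.ENSP_A 9606.ENSP_C 150\n"
    "10090.ENSMUSP_X 10090.ENSMUSP_Y 950\n"
)
links = list(read_string_links(sample, min_score=400))
print(links)
```

For the downloaded file, `gzip.open(path, "rt")` from the standard library provides a text handle that can be passed in place of `sample`, so the archive does not need to be unpacked first.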

Direct download link (version 9.05): here

CCSB network

Sign up at the CCSB and download the Human interactome database here.

Save it into the data folder.

MMC network

This is the protein complex network from the paper "A Census of Human Soluble Protein Complexes" by Havugimana et al.

Download the network from the supplemental information Table S2 here. The protein-protein interactions are in the

This network is imported as a CSV file holding only the two gene names, so these two columns have to be copied into a separate file for import into the PAPPI pipeline.
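Copying the two gene-name columns into a separate file could be scripted along these lines. This is a sketch that assumes the supplemental table has first been exported to CSV; the column names `gene_a` and `gene_b`, the output header and the toy rows are hypothetical and need to be adapted to the real table.

```python
import csv
import io


def extract_pairs(src, dst, col_a, col_b):
    """Copy two named columns of a CSV stream into a two-column CSV stream."""
    reader = csv.DictReader(src)
    writer = csv.writer(dst, lineterminator="\n")
    writer.writerow(["Gene1", "Gene2"])
    count = 0
    for row in reader:
        a, b = row[col_a].strip(), row[col_b].strip()
        if a and b:  # skip rows where either gene name is missing
            writer.writerow([a, b])
            count += 1
    return count


# Toy table with one extra column and one incomplete row; all values are made up.
src = io.StringIO("gene_a,gene_b,score\nGENE_A,GENE_B,0.9\nGENE_C,,0.4\n")
dst = io.StringIO()
written = extract_pairs(src, dst, "gene_a", "gene_b")
print(written, dst.getvalue())
```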

Right now this network is already part of the repository. This might complicate giving free access to the repo, as I am currently unsure about the licensing of the PPI network from Havugimana et al.

Mapping files

BioMart Gene<->Protein mapping

Get the Ensembl ID mapping from https://www.ensembl.org/biomart/martview/ using the Homo sapiens version 71 database and output the attributes:

  • Ensembl Gene ID
  • Ensembl Protein ID

with NO filters
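Since string-db identifies proteins by Ensembl protein IDs, this export can serve as a protein-to-gene lookup. The sketch below assumes the export is a CSV whose header row uses the attribute names listed above; the function name and the toy IDs are hypothetical.

```python
import csv
import io


def load_protein_to_gene(handle):
    """Build a dict from Ensembl protein ID to Ensembl gene ID."""
    mapping = {}
    for row in csv.DictReader(handle):
        gene = row["Ensembl Gene ID"].strip()
        protein = row["Ensembl Protein ID"].strip()
        # Without filters, genes that have no protein appear with an empty protein ID.
        if gene and protein:
            mapping[protein] = gene
    return mapping


# Toy export; the IDs are made up.
sample = io.StringIO(
    "Ensembl Gene ID,Ensembl Protein ID\n"
    "ENSG_A,ENSP_A1\n"
    "ENSG_A,ENSP_A2\n"
    "ENSG_B,\n"
)
protein_to_gene = load_protein_to_gene(sample)
print(protein_to_gene)
```

Several proteins can map to the same gene, so interactions between two proteins of one gene collapse into a self-loop once translated to gene level.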

BioMart Ensembl<->HGNC mapping

Go to: https://www.ensembl.org/biomart/martview

Choose "Ensembl Genes 71" and table "Homo sapiens genes"

Include the following fields for the table:

  • Ensembl Gene ID
  • Associated Gene Name
  • UniProt/SwissProt ID
  • HGNC ID(s)
  • EntrezGene ID

Export the table as CSV (and choose "Unique results only")

HGNC mapping

A few files for Gene-ID mapping/matching need to be downloaded and imported as well.

Get the data from https://www.genenames.org/cgi-bin/hgnc_stats. Go to Locus Group "protein-coding gene" and click "Custom". Choose only the columns:

  • HGNC ID
  • Approved Symbol
  • Approved Name
  • Status
  • Entrez Gene ID
  • Ensembl Gene ID

(and from external sources):

  • Entrez Gene ID (supplied by NCBI)
  • UniProt ID (supplied by UniProt)
  • Ensembl ID (supplied by Ensembl)

Make sure to deselect (exclude) the status: "Entry and Symbol Withdrawn"
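A quick sanity check on the downloaded file can confirm that no withdrawn entries slipped through. The sketch assumes a tab-separated file with a header row containing a `Status` column; the function name and the toy rows are hypothetical.

```python
import csv
import io


def withdrawn_rows(handle, status_column="Status"):
    """Return all rows whose status mentions 'withdrawn' (case-insensitive)."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if "withdrawn" in row[status_column].lower()]


# Toy download with one approved and one withdrawn entry; all values are made up.
sample = io.StringIO(
    "HGNC ID\tApproved Symbol\tStatus\n"
    "HGNC:1\tGENE_A\tApproved\n"
    "HGNC:2\tGENE_B\tEntry Withdrawn\n"
)
leftover = withdrawn_rows(sample)
print([row["Approved Symbol"] for row in leftover])
```

An empty list means the status filter was applied as intended.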

Full URL to results: BioMart Mapping

TODO:

  • write this readme, should include:

    • download locations for all downloadable data (e.g. HPA, string-db, CCSB)
    • how to import everything (which configs to edit how)
    • overview of code layout (modules)
    • explanation about different .R analysis scripts
  • write a more detailed documentation !?

  • ...
