The code consists of a Python pipeline built on SQL (currently SQLite).
The pipeline imports the various data files, mostly in comma-separated values (CSV) or tab-separated values (TSV) format, into a common SQL database. It then maps gene IDs from Entrez, UniProt and Ensembl to HGNC symbols (names).
The HGNC symbols are used as the main identifier in the rest of the pipeline and in the analysis.
A few genes are still lost with this approach, as they cannot be mapped to any HGNC symbol.
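The import and mapping steps described above can be sketched as follows. This is only a minimal illustration, not the actual pipeline code: the table names (`expression`, `hgnc_map`), the column names and the toy rows are assumptions made for the example, and an in-memory SQLite database stands in for the real one.

```python
import csv
import io
import sqlite3


def import_delimited(conn, table, handle, delimiter=","):
    """Load a CSV/TSV file object into a new table; the first row names the columns."""
    reader = csv.reader(handle, delimiter=delimiter)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)


def map_to_hgnc(conn):
    """Split the Entrez IDs into those with and those without an HGNC symbol."""
    rows = conn.execute(
        """
        SELECT e.entrez_id, m.hgnc_symbol
        FROM expression AS e
        LEFT JOIN hgnc_map AS m ON m.entrez_id = e.entrez_id
        """
    ).fetchall()
    mapped = {entrez: symbol for entrez, symbol in rows if symbol is not None}
    unmapped = [entrez for entrez, symbol in rows if symbol is None]
    return mapped, unmapped


# Toy data (hypothetical tables); the last Entrez ID is made up and has no mapping.
conn = sqlite3.connect(":memory:")
import_delimited(
    conn, "expression",
    io.StringIO("entrez_id,level\n7157,high\n672,low\n999999999,low\n"),
)
import_delimited(
    conn, "hgnc_map",
    io.StringIO("entrez_id\thgnc_symbol\n7157\tTP53\n672\tBRCA1\n"),
    delimiter="\t",
)
mapped, unmapped = map_to_hgnc(conn)
```

The `LEFT JOIN` keeps the rows without a match, which makes it easy to see which genes would be lost in the mapping.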
The Python pipeline is in the src/ folder. The init_pappi.py script has to be called in order to import all files. The paths to the various data files are also configured there.
Right now this script is still a bit messy and depends on hardcoded file paths and filenames.
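One possible way to tame the hardcoded paths is to collect them in a single mapping and check them before the import starts. The sketch below is a suggestion only and does not reflect how init_pappi.py is written: `DATA_FILES`, its keys and `resolve_data_files` are hypothetical names, and the STRING filename keeps the `vX.XX` placeholder from this README, which has to be replaced by the downloaded version.

```python
import tempfile
from pathlib import Path

# Hypothetical logical names mapped to the filenames mentioned in this README.
DATA_FILES = {
    "hpa_normal_tissue": "normal_tissue.csv",
    "string_links": "protein.links.vX.XX.txt.gz",  # placeholder version
}


def resolve_data_files(data_dir, filenames):
    """Return (paths, missing): the full path per key and the keys whose file is absent."""
    data_dir = Path(data_dir)
    paths = {key: data_dir / name for key, name in filenames.items()}
    missing = sorted(key for key, path in paths.items() if not path.is_file())
    return paths, missing


# Demo in a temporary folder that contains only one of the two files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "normal_tissue.csv").write_text("placeholder\n")
    paths, missing = resolve_data_files(tmp, DATA_FILES)
```

Reporting all missing files at once avoids a failure halfway through a long import.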
The analysis folder contains R scripts for analysis of the data.
Edit the sql_config.R file to point to the correct SQLite file.
Create a folder for all the data files (preferably on an internal drive rather than a network drive).
Go to the downloadable data section at proteinatlas.org and download the normal_tissue.csv.zip file.
Unpack it into the data folder.
Go to the Download section at string-db and download the protein.links.vX.XX.txt.gz file. This file contains all protein-protein interactions, using Ensembl IDs and a reliability score per interaction.
Direct download link (version 9.05): here
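To give an idea of how such a file could be read, here is a minimal sketch. It assumes a layout that is not stated in this README: a header line followed by whitespace-separated rows of two protein identifiers and an integer score, with each identifier prefixed by a taxon number (for example `9606.` for human). The identifiers and scores in the sample are placeholders, and `read_string_links` is a hypothetical helper, not part of the pipeline.

```python
import gzip
import io


def strip_taxon(identifier):
    """Drop a leading '<taxon>.' prefix, e.g. '9606.ENSP_A' -> 'ENSP_A'."""
    return identifier.split(".", 1)[-1]


def read_string_links(handle, min_score=0):
    """Yield (protein_a, protein_b, score) tuples with score >= min_score."""
    next(handle)  # skip the header line (assumed to be present)
    for line in handle:
        protein_a, protein_b, score = line.split()
        score = int(score)
        if score >= min_score:
            yield strip_taxon(protein_a), strip_taxon(protein_b), score


# Placeholder content, compressed in memory to mimic the .txt.gz download.
sample = (
    b"protein1 protein2 combined_score\n"
    b"9606.ENSP_A 9606.ENSP_B 490\n"
    b"9606.ENSP_A 9606.ENSP_C 198\n"
)
with gzip.open(io.BytesIO(gzip.compress(sample)), "rt") as handle:
    links = list(read_string_links(handle, min_score=400))
```

For the real file, `gzip.open(path, "rt")` with the path of the download would replace the in-memory sample.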
Sign up at the CCSB and download the Human interactome database here.
Save it into the data folder.
This is the protein complex network from the paper "A Census of Human Soluble Protein Complexes" by Havugimana et al.
Download the network from the supplemental information Table S2 here. The protein-protein interactions are in the
This network is imported as a CSV file holding only the two gene names, thus these have to be copied into a separate file for import into the PAPPI pipeline.
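Copying the two gene-name columns into a separate file could be scripted along these lines. This is a sketch under assumptions: the table is taken to be available as a tab-separated text export, and the column names `gene_a` and `gene_b` as well as the gene names in the sample are placeholders, since the real headers of Table S2 are not given here.

```python
import csv
import io


def extract_gene_pairs(source, target, col_a, col_b, delimiter="\t"):
    """Write the two named columns of `source` to `target` as CSV; return the row count."""
    reader = csv.DictReader(source, delimiter=delimiter)
    writer = csv.writer(target, lineterminator="\n")
    count = 0
    for row in reader:
        gene_a = row[col_a].strip()
        gene_b = row[col_b].strip()
        if gene_a and gene_b:  # skip rows where one of the names is missing
            writer.writerow([gene_a, gene_b])
            count += 1
    return count


# Placeholder table; the third row lacks the first gene name and is skipped.
source = io.StringIO(
    "gene_a\tgene_b\tscore\n"
    "GENE1\tGENE2\t0.9\n"
    "GENE3\tGENE4\t0.8\n"
    "\tGENE4\t0.1\n"
)
target = io.StringIO()
written = extract_gene_pairs(source, target, "gene_a", "gene_b")
output = target.getvalue()
```

With real files, `source` and `target` would be opened with `open(..., newline="")` as the `csv` module recommends.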
Right now this network is already part of the repository. This might complicate giving free access to the repo, as I am currently unsure about the licensing of the PPI network from Havugimana et al.
Get the Ensembl ID mapping from https://www.ensembl.org/biomart/martview/ using the Homo sapiens version 71 database and output the attributes:
- Ensembl Gene ID
- Ensembl Protein ID
with NO filters
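A mapping like this makes it possible to translate interactions given as Ensembl protein IDs into Ensembl gene IDs. The following sketch shows the idea; it assumes that the export is a CSV whose header uses the attribute names listed above, and the identifiers in the sample are placeholders. `load_protein_to_gene` and `translate_pairs` are hypothetical helpers.

```python
import csv
import io


def load_protein_to_gene(handle):
    """Build a dict from Ensembl protein ID to Ensembl gene ID."""
    mapping = {}
    for row in csv.DictReader(handle):
        protein = row["Ensembl Protein ID"].strip()
        if protein:  # genes without a protein product have an empty field
            mapping[protein] = row["Ensembl Gene ID"].strip()
    return mapping


def translate_pairs(pairs, protein_to_gene):
    """Translate protein pairs to gene pairs; count the pairs that cannot be mapped."""
    translated = []
    dropped = 0
    for protein_a, protein_b in pairs:
        gene_a = protein_to_gene.get(protein_a)
        gene_b = protein_to_gene.get(protein_b)
        if gene_a and gene_b:
            translated.append((gene_a, gene_b))
        else:
            dropped += 1
    return translated, dropped


# Placeholder export: two proteins of one gene, one protein of another, one gene without protein.
export = io.StringIO(
    "Ensembl Gene ID,Ensembl Protein ID\n"
    "ENSG_A,ENSP_A1\n"
    "ENSG_A,ENSP_A2\n"
    "ENSG_B,ENSP_B1\n"
    "ENSG_C,\n"
)
protein_to_gene = load_protein_to_gene(export)
gene_pairs, dropped = translate_pairs(
    [("ENSP_A1", "ENSP_B1"), ("ENSP_A2", "ENSP_UNKNOWN")], protein_to_gene
)
```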
Go to: https://www.ensembl.org/biomart/martview
Choose "Ensembl Genes 71" and table "Homo sapiens genes"
Include the following fields for the table:
- Ensembl Gene ID
- Associated Gene Name
- UniProt/SwissProt ID
- HGNC ID(s)
- EntrezGene ID
Export the table as CSV (and choose "Unique results only")
A few files for Gene-ID mapping/matching need to be downloaded and imported as well.
Get the data from https://www.genenames.org/cgi-bin/hgnc_stats
Go to Locus Group: "protein-coding gene" and click "Custom".
Choose only the columns:
- HGNC ID
- Approved Symbol
- Approved Name
- Status
- Entrez Gene ID
- Ensembl Gene ID
(and from external sources):
- Entrez Gene ID (supplied by NCBI)
- UniProt ID (supplied by UniProt)
- Ensembl ID (supplied by Ensembl)
Make sure to deselect (exclude) the status: "Entry and Symbol Withdrawn"
Full URL to results: BioMart Mapping
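Once downloaded, the HGNC table can serve as a lookup from any of the three external IDs to the approved symbol. The sketch below illustrates this under assumptions: the download is taken to be tab-separated with a header that matches the column names listed above, and the second sample row is a made-up withdrawn entry. `build_symbol_lookup` is a hypothetical helper, not the pipeline's own function.

```python
import csv
import io

SYMBOL_COLUMN = "Approved Symbol"
# Short source label -> column name as listed in this README (assumed to match the file header).
ID_COLUMNS = {
    "entrez": "Entrez Gene ID (supplied by NCBI)",
    "uniprot": "UniProt ID (supplied by UniProt)",
    "ensembl": "Ensembl ID (supplied by Ensembl)",
}


def build_symbol_lookup(handle, delimiter="\t"):
    """Map (source, external ID) to the approved HGNC symbol."""
    lookup = {}
    for row in csv.DictReader(handle, delimiter=delimiter):
        # Defensive check in case withdrawn entries were not excluded at download time.
        if row.get("Status", "Approved") != "Approved":
            continue
        symbol = row[SYMBOL_COLUMN].strip()
        for source, column in ID_COLUMNS.items():
            identifier = (row.get(column) or "").strip()
            if identifier:
                lookup[(source, identifier)] = symbol
    return lookup


header = "\t".join(["HGNC ID", SYMBOL_COLUMN, "Status", *ID_COLUMNS.values()])
sample = io.StringIO(
    header + "\n"
    "HGNC:11998\tTP53\tApproved\t7157\tP04637\tENSG00000141510\n"
    "HGNC:0\tOLDSYM\tSymbol Withdrawn\t\t\t\n"  # made-up withdrawn row
)
lookup = build_symbol_lookup(sample)
```

A call such as `lookup.get(("uniprot", "P04637"))` then returns the symbol, or `None` for an ID that cannot be mapped.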
- write this readme, should include:
  - download locations for all downloadable data (e.g. HPA, string-db, CCSB)
  - how to import everything (which configs to edit how)
  - overview of code layout (modules)
  - explanation about different .R analysis scripts
- write a more detailed documentation !?
- ...