
PDBClean helps create a curated ensemble of molecular structures


fatipardo/PDBClean-0.0.2


PDBCleanV2

With PDBCleanV2, users can create their own self-consistent structure dataset, enabling more straightforward comparison among structures. The library creates separate files for each biological assembly present in a structure file and standardizes chain names and numbering. Our goal is to provide researchers with a consistent dataset that facilitates their analysis.


PDBCleanV2 Workflow

We have created Jupyter Notebooks that provide a step-by-step guide for creating a curated ensemble of structures using PDBCleanV2.

flowchart of workflow

Download all structures that match the name and sequence of your molecule of interest.
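Once a list of matching entry IDs is known, fetching the files can be sketched as below. This is a minimal illustration and not the notebook's code: the helper names `rcsb_cif_url` and `download_cifs` are hypothetical, the sketch assumes classic four-character PDB IDs, and it relies on the public RCSB file download address `https://files.rcsb.org/download/<ID>.cif`. The search by name and sequence is not covered here.

```python
from pathlib import Path
from urllib.request import urlretrieve

RCSB_DOWNLOAD = "https://files.rcsb.org/download"


def rcsb_cif_url(pdb_id: str) -> str:
    """Build the download URL for one entry (hypothetical helper)."""
    pdb_id = pdb_id.strip().upper()
    # Assumption: classic four-character alphanumeric PDB IDs only.
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return f"{RCSB_DOWNLOAD}/{pdb_id}.cif"


def download_cifs(pdb_ids, out_dir):
    """Download each entry into out_dir, skipping files already present."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for pdb_id in pdb_ids:
        url = rcsb_cif_url(pdb_id)
        target = out / url.rsplit("/", 1)[-1]
        if not target.exists():
            urlretrieve(url, str(target))
        paths.append(target)
    return paths
```

Skipping files that already exist makes the download safe to re-run after an interruption.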

Note: This notebook sometimes does not render on the GitHub website. If that happens, download it and open it in your browser.

A CIF file may contain multiple biological assemblies within one asymmetric unit. In this step we separate these biological assemblies, and create one CIF file for each one. We also reduce the number of data blocks included in the CIF file.
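The idea of separating assemblies can be sketched in plain Python. This is a toy illustration, not PDBClean's implementation: `split_by_assembly` is a hypothetical function, the atom records are assumed to be already parsed into `(chain_id, line)` pairs, and the chain list for each assembly is assumed to be known (in mmCIF it is typically recorded in the `_pdbx_struct_assembly_gen` category). Symmetry operators are ignored.

```python
from collections import defaultdict


def split_by_assembly(atom_rows, assembly_chains):
    """Group atom lines by biological assembly.

    atom_rows: iterable of (chain_id, atom_line) pairs.
    assembly_chains: dict mapping assembly ID -> list of chain IDs.
    Returns a dict mapping assembly ID -> list of atom lines.
    """
    # Invert the mapping so each chain knows which assemblies it belongs to.
    chain_to_assemblies = defaultdict(list)
    for assembly_id, chains in assembly_chains.items():
        for chain in chains:
            chain_to_assemblies[chain].append(assembly_id)

    per_assembly = {assembly_id: [] for assembly_id in assembly_chains}
    for chain_id, line in atom_rows:
        # Chains not listed in any assembly are dropped.
        for assembly_id in chain_to_assemblies.get(chain_id, []):
            per_assembly[assembly_id].append(line)
    return per_assembly
```

Each list in the result would then be written to its own CIF file, one per assembly.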

The script goes over all the CIF files and collects all entities. The user can decide what Mol ID to assign them. In this example, we show the case in which we give a different ID to each entity found. This step is also important because it lists all the entities that were found in your ensemble, so it allows you to identify if there is a structure that doesn't belong. We show an example of this in this notebook.

Same as Step 2.1, but in this example we assign the same Mol ID to different entities. You may want to do this, for example, to give the same Mol ID to all ligands or to all water molecules. Doing so triggers a concatenation menu, which we show how to use.
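The bookkeeping behind this choice can be pictured as a mapping from entity names to Mol IDs. The sketch below is only an illustration: the function names and the entity names in the usage note are made up, and it does not reproduce PDBClean's interactive menu. It shows how one could find which Mol IDs are shared by more than one entity, which is the situation that calls for concatenation.

```python
from collections import defaultdict


def group_entities_by_mol_id(assignments):
    """Invert {entity_name: mol_id} into {mol_id: [entity_name, ...]}."""
    groups = defaultdict(list)
    for entity, mol_id in assignments.items():
        groups[mol_id].append(entity)
    return dict(groups)


def shared_mol_ids(assignments):
    """Return only the Mol IDs assigned to more than one entity."""
    groups = group_entities_by_mol_id(assignments)
    return {mol_id: ents for mol_id, ents in groups.items() if len(ents) > 1}
```

For example, with the hypothetical assignment `{"PROTEIN X": "A", "GLYCEROL": "L", "SULFATE ION": "L"}`, `shared_mol_ids` reports that `"L"` covers both ligands, while `"A"` is unique.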

Step 3 lets us give each entity whatever name we want, and makes sure that chains that are the same in different CIF files have a consistent name (we use sequence alignment to determine similarity). Entities or chains are sometimes mislabeled in deposited structures, so this step is recommended for identifying such cases. It can also be used to identify possible outliers by checking how each chain scores against our reference.
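The scoring idea can be sketched with the standard library. Note that this swaps in `difflib.SequenceMatcher` as a rough stand-in for a real sequence alignment, so the scores are not the ones PDBClean computes. The function names and the sequences in the usage note are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(seq_a, seq_b):
    """Rough similarity in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def best_reference(chain_seq, references):
    """Return (name, score) of the reference most similar to chain_seq.

    references: dict mapping chain name -> reference sequence.
    """
    scores = {name: similarity(chain_seq, ref) for name, ref in references.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

A chain whose best score is much lower than that of the other chains given the same name is a candidate outlier or mislabeled chain. For instance, with references `{"A": "MKTAYIAKQR", "B": "GSHMLEDPVV"}`, the query `"GSHMLEDPAV"` is matched to `"B"`.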

We divide the tutorial for this step into two parts. The second part shows how to generate the reference sequences and how to load them when running the script. Doing this can also speed up this step, because it allows the script to be run in parallel, in batches. This is particularly important when working with large datasets or with molecules that have many chains.

In this tutorial, we show how the reference sequence is selected by our script, and show how the user can modify it. It also shows how to load the reference sequences, creating the opportunity for running this step in parallel, in batches, speeding up the whole process.
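Batching itself is simple to sketch. The helpers below are hypothetical and independent of PDBClean: `make_batches` splits a list of file names into fixed-size groups, and `map_batches` applies a caller-supplied `worker` function to each group. A thread pool is used here only to keep the example short; for CPU-bound work such as alignment, separate processes or separate script invocations would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split items into consecutive lists of at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def map_batches(worker, batches, max_workers=4):
    """Run worker(batch) for every batch; results keep the batch order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, batches))
```

Because every batch loads the same pre-generated reference sequences, the batches do not depend on one another and can be processed in any order.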

Following Step 3, now that we have consistent chain (entity) naming among all structures in the ensemble, we want to make sure that the numbering is also consistent (that the same residue position has the same number in all structures).
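One way to picture consistent numbering: align each chain to a reference and number every residue by the reference position it aligns to. The sketch below assumes the two sequences are already aligned, with `-` marking gaps. The function name is hypothetical, and the handling of insertions (residues with no reference counterpart get the previous number plus an insertion index) is an assumption, not a description of PDBClean's behavior.

```python
def renumber_from_alignment(aligned_ref, aligned_chain, start=1):
    """Number chain residues by the reference position they align to.

    Returns a list of (residue, number, insertion_index) tuples, where
    insertion_index is 0 for residues aligned to a reference residue.
    """
    if len(aligned_ref) != len(aligned_chain):
        raise ValueError("aligned sequences must have the same length")

    numbering = []
    ref_pos = start - 1   # number of the last reference residue seen
    insertion = 0         # consecutive chain residues facing a reference gap
    for ref_char, chain_char in zip(aligned_ref, aligned_chain):
        if ref_char != "-":
            ref_pos += 1
            insertion = 0
        else:
            insertion += 1
        if chain_char != "-":
            numbering.append((chain_char, ref_pos, insertion))
    return numbering
```

With the aligned pair `"MKTAY-IA"` (reference) and `"M-TAYGIA"` (chain), the chain's `T` receives number 3 even though it is the second residue present, so position 3 means the same thing in every structure.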

This is also the last step! You have a curated dataset!

Note: There are more advanced curation steps and analysis that we will cover in future releases.

Other tools

Check project mini tutorial. This mini tutorial can be run after Step 2. Check_project checks whether a directory has been created; if not, it creates the directory and an info.txt file with the creation date.
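The behavior described above can be approximated in a few lines. This is a sketch under assumptions, not the actual Check_project code: the function name `check_project_sketch`, its return value, and the exact format of the date written to info.txt are all hypothetical.

```python
from datetime import datetime
from pathlib import Path


def check_project_sketch(project_dir):
    """Create project_dir with an info.txt if it does not exist yet.

    Returns True if the directory was created, False if it already existed.
    """
    path = Path(project_dir)
    if path.is_dir():
        return False
    path.mkdir(parents=True)
    created = datetime.now().isoformat(timespec="seconds")
    # Assumption: the real info.txt may use a different layout.
    (path / "info.txt").write_text(f"Created: {created}\n")
    return True
```

Calling it a second time on the same path leaves the directory and its info.txt untouched.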

Dataset Summary. This notebook can be run after doing step 0. It creates plots that summarize important information from your dataset such as organism of origin, resolution, year, and method used to solve the structure. The notebook also creates a pandas dataframe so users can create their own personalized plots.

Installation

We recommend installing PDBClean inside a virtual environment. We provide an environment.yml file with the libraries you will need. We also recommend installing Anaconda before using PDBClean, and we provide our tutorials as Jupyter notebooks. We have tested the installation on macOS.

  1. Download PDBClean from GitHub and install environment from YML file

git clone git@github.com:fatipardo/PDBClean-0.0.2.git

cd PDBClean-0.0.2

conda env create -f environment.yml

  2. Activate environment and install PDBClean

conda activate PDBCleanV2

python setup.py install

  3. Install Jupyter Notebook kernel

python -m ipykernel install --user --name PDBCleanV2 --display-name PDBCleanV2

  4. Running notebook:

cd Notebooks

jupyter notebook

  • Open any notebook you would like to run.
  • If Jupyter does not recognize the kernel, select ‘PDBCleanV2’ from the drop-down menu.

PDBClean team

The code in this repository is based on the code found here. The code was originally written by Frédéric Poitevin and Nicholas Corsepius. Fátima Pardo Avila and Liv Weiner created this repository. We all worked on this project while being part of the Levitt Lab at Stanford University.
