Add the ability for the `PDBManager` to perform interface-based chain filtering #333
graphein/ml/datasets/pdb_data.py
Outdated
@@ -23,6 +25,11 @@
)
from graphein.utils.dependencies import is_tool


PRIMARY_INTERCHAIN_CONTACT_ATOMS_FOR_FILTERING: List[str] = ["CA", "C4'"]
SECONDARY_INTERCHAIN_CONTACT_ATOMS_NOT_FOR_FILTERING: List[str] = ["H"]
PRIMARY_HYDROGEN_BOND_ATOMS_FOR_FILTERING: List[str] = ["N", "O", "N1", "N9", "N3", "C2", "C4", "C5", "C6"]
This should be vetted more carefully, as I initially chose these atom types heuristically.
What is this atom naming scheme? It doesn't ring any bells for me (graphein/graphein/protein/resi_atoms.py, line 276 in 281ce30: `ATOM_NUMBERING: Dict[str, int] = {`)
We already have these constants:
graphein/graphein/protein/resi_atoms.py, line 858 in 281ce30: `HYDROGEN_BOND_DONORS: Dict[str, Dict[str, int]] = {`
graphein/graphein/protein/resi_atoms.py, line 880 in 281ce30: `HYDROGEN_BOND_ACCEPTORS: Dict[str, Dict[str, int]] = {`
> What is this atom naming scheme? It doesn't ring any bells for me (graphein/graphein/protein/resi_atoms.py, line 276 in 281ce30: `ATOM_NUMBERING: Dict[str, int] = {`)
The `N`, `CA`, `O`, and `H` atoms belong to the regular protein vocabulary; all other types are nucleic acid residue atoms. My initial goal with this PR was to make a generic dataset chain filter for protein-protein, protein-nucleic acid, and nucleic acid-nucleic acid interactions, inspired by the dataset curation technique RoseTTAFold2NA uses for protein-nucleic acid structure prediction (https://www.biorxiv.org/content/10.1101/2022.09.09.507333v1.full.pdf, page 8). I am essentially trying to reproduce this filtering logic with the `PDBManager` (minus all the sequence alignments), and I thought a PR would be in order.
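To make the idea of interface-based chain filtering concrete, here is a minimal sketch in plain Python. It is not the code from this PR and does not use the `PDBManager` API: the function name `chains_with_interface`, the tuple layout of each atom, and the default `cutoff` of 8.0 Å are all assumptions chosen for illustration. Only the two representative atom names (`CA` for protein residues, `C4'` for nucleotides) are taken from the diff.

```python
import math
from itertools import combinations
from typing import List, Set, Tuple

# Representative atoms taken from the diff: CA (protein) and C4' (nucleic acid).
CONTACT_ATOMS = ("CA", "C4'")

# Hypothetical atom record for this sketch: (chain_id, atom_name, (x, y, z)).
Atom = Tuple[str, str, Tuple[float, float, float]]


def chains_with_interface(atoms: List[Atom], cutoff: float = 8.0) -> Set[str]:
    """Return the IDs of chains that have a representative atom within
    `cutoff` angstroms of a representative atom in a different chain.

    The 8.0 default is an assumed placeholder, not a value from the PR.
    """
    # Keep only the representative atoms used for contact detection.
    selected = [(chain, xyz) for chain, name, xyz in atoms if name in CONTACT_ATOMS]

    keep: Set[str] = set()
    # Brute-force pairwise check; fine for a sketch, quadratic in atom count.
    for (chain_a, xyz_a), (chain_b, xyz_b) in combinations(selected, 2):
        if chain_a != chain_b and math.dist(xyz_a, xyz_b) <= cutoff:
            keep.update((chain_a, chain_b))
    return keep


example_atoms: List[Atom] = [
    ("A", "CA", (0.0, 0.0, 0.0)),
    ("A", "N", (49.0, 0.0, 0.0)),   # ignored: not a representative atom
    ("B", "C4'", (5.0, 0.0, 0.0)),  # 5 angstroms from chain A's CA
    ("C", "CA", (50.0, 0.0, 0.0)),  # far from every other representative atom
]
print(chains_with_interface(example_atoms))
```

Because both atom names go through the same distance test, one filter covers protein-protein, protein-nucleic acid, and nucleic acid-nucleic acid pairs. A real implementation would likely replace the quadratic loop with a spatial index.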
Per a suggestion from a colleague, I have removed the C atoms from the hydrogen bond calculation, as these atoms are very rarely involved in the formation of h-bonds in proteins and NAs.
> The N, CA, O, and H atoms correspond to regular protein vocabulary, however, all other types correspond to nucleic acid residue atoms.
Got it, bells ring for me now :)
So these H-bond definitions do not account for sidechain-X hbonds, only backbone-backbone hbonds?
Right. Here's a naive question on my part: how frequent would you say sidechain-X hbonds are? If they are pretty common, perhaps we can simply add more protein and nucleic acid (NA) atom types to the list here?
Seemingly quite common!
The way I have designed this filtering logic, I am assuming that each (protein or NA) residue potentially contains the following atoms: `"N", "O", "N1", "N9", "N3"`. Given the prevalence of sidechain hbonds, which protein atom types (shared across all residue types) would be most reasonable to include to cover most of the possible hbonds mentioned in this article? The only other atom type I think we could include would be the carbon-beta (Cb) atoms.
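One way to see the trade-off discussed here is to compare the flat atom list with a residue-keyed lookup of the same type shape as the existing constants (`Dict[str, Dict[str, int]]`). The sketch below is only an illustration: `EXAMPLE_SIDECHAIN_DONORS` is a small hand-picked table, not the contents of graphein's `HYDROGEN_BOND_DONORS`, and `is_hbond_candidate` is a hypothetical helper.

```python
from typing import Dict, List

# Flat list from the discussion: the same atom names are assumed for every residue.
FLAT_HBOND_ATOMS: List[str] = ["N", "O", "N1", "N9", "N3"]

# Illustrative residue-keyed table: residue name -> atom name -> hydrogens
# the atom can donate. A hand-picked subset for this sketch only; it is NOT
# copied from graphein's HYDROGEN_BOND_DONORS.
EXAMPLE_SIDECHAIN_DONORS: Dict[str, Dict[str, int]] = {
    "SER": {"OG": 1},
    "THR": {"OG1": 1},
    "TYR": {"OH": 1},
    "LYS": {"NZ": 3},
}


def is_hbond_candidate(res_name: str, atom_name: str) -> bool:
    """Accept an atom if it is in the flat list, or if the residue-keyed
    table lists it as a sidechain donor for that specific residue."""
    if atom_name in FLAT_HBOND_ATOMS:
        return True
    return atom_name in EXAMPLE_SIDECHAIN_DONORS.get(res_name, {})


print(is_hbond_candidate("ALA", "N"))   # backbone atom, found in the flat list
print(is_hbond_candidate("SER", "OG"))  # sidechain atom, found only via the table
print(is_hbond_candidate("ALA", "CB"))  # carbon-beta, in neither
```

The flat list can only name atoms that exist in every residue, which is why it stops at backbone atoms. A residue-keyed table sidesteps that constraint at the cost of a per-residue lookup.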
Kudos, SonarCloud Quality Gate passed! 0 Bugs. No coverage information.
What does this implement/fix? Explain your changes
This allows one to select PDB complex chains satisfying certain interface contact or hydrogen bonding constraints.