keep_het parameter not working #373

Closed
davidkastner opened this issue Mar 12, 2024 · 2 comments

davidkastner commented Mar 12, 2024

Describe the bug
The config parameter keep_hets is currently not working. It seems keep_hets was recently changed from a bool to a list of strings naming the HETATM residues to keep, such as keep_hets=["HOH"]. However, after updating the parameter, the specified residues are still not included in the graph. The tutorial installed the newest version, Graphein 1.7.6. I haven't had a chance to test earlier versions to see when the keep_hets functionality broke, but I will update this ticket when I do.

To Reproduce
This can be seen in the tutorial example of 3EIY, which contains 112 waters. However, when we run:

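from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph

# Ask Graphein to keep water (HOH) HETATM residues
config = ProteinGraphConfig(keep_hets=["HOH"])
g = construct_graph(config=config, pdb_code="3eiy")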

None of the waters are included in the graph. If we print the nodes with g.nodes() and look at the last residues, we see that no waters were included:

['A:ALA:171', 'A:ASN:172', 'A:PHE:173', 'A:LYS:174', 'A:LYS:175']
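
To check the whole node list rather than just its tail, the residue names can be tallied. This is a minimal sketch, assuming node IDs follow the chain:resname:resnum pattern seen in the output above; count_residue_names is a hypothetical helper, not part of Graphein.

```python
from collections import Counter


def count_residue_names(node_ids):
    """Count residue names in node IDs shaped like 'chain:resname:resnum'.

    Hypothetical helper: the ID layout is assumed from the output above.
    """
    counts = Counter()
    for node_id in node_ids:
        parts = str(node_id).split(":")
        if len(parts) >= 3:
            counts[parts[1]] += 1
    return counts


# The last five node IDs reported in this issue
nodes = ['A:ALA:171', 'A:ASN:172', 'A:PHE:173', 'A:LYS:174', 'A:LYS:175']
counts = count_residue_names(nodes)
print(counts["HOH"])  # 0: no water nodes among these IDs
```

With a real graph, passing list(g.nodes()) should give the count over all nodes, so a non-zero counts["HOH"] would confirm that waters were kept.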

Expected behavior
If I understand correctly, the expected behavior of keep_hets would be for the waters to now be included in the graph representation.

Screenshots
Here is a screenshot of the representation of 3EIY, where we can see that only the protein residues are included.
[Image: Screenshot 2024-03-12 at 11 23 03 AM]

Desktop (please complete the following information):
This was reproduced using the Google Colab notebook with graphein-1.7.6 installed.
No other modifications were made to the tutorial.

Owner

a-r-j commented Mar 12, 2024

Hi @davidkastner, good catch. This is a slightly tricky issue to resolve.

I think the omission of the water nodes comes from here:

def convert_structure_to_centroids(df: pd.DataFrame) -> pd.DataFrame:

This is where we select the CA atoms to count as nodes. I think if you use granularity="atom", the waters will be present.
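
To illustrate why a CA-based selection drops waters, here is a toy sketch in plain Python. It does not use Graphein's code; the record layout and the select_alpha_carbons helper are assumptions made for illustration. A water residue has a single oxygen atom named "O", so it never matches a filter on atom name "CA".

```python
# Toy atom records; the field names are assumed for illustration only.
atoms = [
    {"record_name": "ATOM", "residue_name": "LYS", "residue_number": 175, "atom_name": "N"},
    {"record_name": "ATOM", "residue_name": "LYS", "residue_number": 175, "atom_name": "CA"},
    {"record_name": "ATOM", "residue_name": "LYS", "residue_number": 175, "atom_name": "C"},
    {"record_name": "HETATM", "residue_name": "HOH", "residue_number": 301, "atom_name": "O"},
    {"record_name": "HETATM", "residue_name": "HOH", "residue_number": 302, "atom_name": "O"},
]


def select_alpha_carbons(records):
    """Keep one node per residue by selecting only atoms named 'CA' (hypothetical helper)."""
    return [r for r in records if r["atom_name"] == "CA"]


# Residue-level nodes: only residues that own a CA atom survive.
ca_nodes = select_alpha_carbons(atoms)
print([r["residue_name"] for r in ca_nodes])  # ['LYS']

# Atom-level nodes: every atom is a node, so the waters remain.
water_nodes = [r for r in atoms if r["residue_name"] == "HOH"]
print(len(water_nodes))  # 2
```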

For heteroatoms, it can be tricky to consistently and universally define what the coarsened node should be. I think a good heuristic could be the center of mass (CoM) of the ligand for coarsened graphs. One workaround would be to write your own hetatm_df_processing_func that manipulates the HETATM DataFrame so that it contains a representative "CA".
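
The idea of a representative "CA" can be sketched in plain Python as follows. This is only an illustration under stated assumptions: add_representative_ca is a hypothetical function, the field names are assumed, and it uses the unweighted centroid of the atoms rather than a mass-weighted center of mass. A real hetatm_df_processing_func would need to apply the same logic to the DataFrame it receives.

```python
from collections import defaultdict


def add_representative_ca(hetatm_records):
    """Collapse each HETATM residue into one pseudo-'CA' record at its centroid.

    Hypothetical sketch: uses the unweighted mean of the atom coordinates,
    not a mass-weighted center of mass.
    """
    groups = defaultdict(list)
    for rec in hetatm_records:
        key = (rec["chain_id"], rec["residue_name"], rec["residue_number"])
        groups[key].append(rec)

    representatives = []
    for (chain_id, residue_name, residue_number), members in groups.items():
        n = len(members)
        representatives.append({
            "record_name": "HETATM",
            "chain_id": chain_id,
            "residue_name": residue_name,
            "residue_number": residue_number,
            "atom_name": "CA",  # relabelled so a CA-based selection keeps it
            "x_coord": sum(m["x_coord"] for m in members) / n,
            "y_coord": sum(m["y_coord"] for m in members) / n,
            "z_coord": sum(m["z_coord"] for m in members) / n,
        })
    return representatives


# One water (a single O atom) and a made-up two-atom ligand "LIG"
hetatms = [
    {"chain_id": "A", "residue_name": "HOH", "residue_number": 301,
     "atom_name": "O", "x_coord": 1.0, "y_coord": 2.0, "z_coord": 3.0},
    {"chain_id": "A", "residue_name": "LIG", "residue_number": 401,
     "atom_name": "C1", "x_coord": 0.0, "y_coord": 0.0, "z_coord": 0.0},
    {"chain_id": "A", "residue_name": "LIG", "residue_number": 401,
     "atom_name": "C2", "x_coord": 2.0, "y_coord": 4.0, "z_coord": 6.0},
]
reps = add_representative_ca(hetatms)
print([(r["residue_name"], r["atom_name"]) for r in reps])
```

For a water, the centroid is simply the oxygen position, so the unweighted choice makes no difference there. For larger ligands, weighting by atomic mass would move the representative point toward the heavier atoms.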

We looked into this quite extensively for protein-ligand graphs (see #164, mainly here: https://github.com/a-r-j/graphein/blob/d81fc2f77b3562f61f70f257ddf509d5102b8bf6/graphein/protein_ligand/graphs.py).

What's your application? Using graphein.protein.tensor.data.Protein should work reliably if it's ML-based.

@davidkastner
Author

Hi @a-r-j. I see the problem and agree it would be challenging to generalize! For my purposes, the atom representation will work well, as I am building graphs for QM cluster models extracted from proteins. Because the QM cluster models are small, the extra information afforded by the atom representation will be useful. I appreciate your response and will close the issue as resolved, but hopefully it will be a useful point of reference for others.
