keep_het parameter not working #373

davidkastner · 2024-03-12T15:31:33Z

Describe the bug
The config parameter keep_hets is currently not working. It seems keep_hets was recently changed from a bool to a list of strings, each naming a HETATM residue to keep, such as keep_hets=["HOH"]. However, after updating the parameter, the specified residues are still not included in the graph. The tutorial installed the newest version, graphein-1.7.6. I haven't had a chance to test earlier versions to see when the keep_hets functionality broke, but I will update this ticket when I do.

To Reproduce
This can be seen in the tutorial example of 3EIY, which contains 112 waters. However, when we run:

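from graphein.protein.config import ProteinGraphConfig
from graphein.protein.graphs import construct_graph

# Ask Graphein to keep water (HOH) HETATM residues
config = ProteinGraphConfig(keep_hets=["HOH"])
g = construct_graph(config=config, pdb_code="3eiy")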

None of the waters are included in the graph. If we print the nodes with g.nodes() and look at the last residues, we see that no waters were included:

['A:ALA:171', 'A:ASN:172', 'A:PHE:173', 'A:LYS:174', 'A:LYS:175']

Expected behavior
If I understand correctly, the expected behavior of keep_hets would be for the waters to now be included in the graph representation.

Screenshots
Here is a screenshot of the representation of 3EIY, where we can see that only the protein residues are included.

Desktop (please complete the following information):
This was reproduced using the Google Colab notebook with graphein-1.7.6 installed.
No other modifications were made to the tutorial.


a-r-j · 2024-03-12T15:49:33Z

Hi @davidkastner, good catch. This is a slightly tricky issue to resolve.

I think the omission of the water nodes comes from here:

graphein/graphein/protein/graphs.py

Line 199 in 6dae5ff

def convert_structure_to_centroids(df: pd.DataFrame) -> pd.DataFrame:

Where we select on CA atoms to count as nodes. I think if you use granularity="atom" the waters will be present.
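To illustrate why a CA-based selection drops the waters, here is a minimal plain-Python sketch. It does not use Graphein itself: the record fields, the node_ids helper, and the water residue numbers are all hypothetical stand-ins. A crystallographic water typically has only an O atom, so a filter on the atom name "CA" never matches it, while an atom-level selection keeps it.

```python
# Toy atom records standing in for ATOM/HETATM rows of a parsed PDB file.
# Field names and water residue numbers are illustrative, not Graphein's.
atoms = [
    {"record": "ATOM", "chain": "A", "residue": "LYS", "number": 175, "atom": "N"},
    {"record": "ATOM", "chain": "A", "residue": "LYS", "number": 175, "atom": "CA"},
    {"record": "ATOM", "chain": "A", "residue": "LYS", "number": 175, "atom": "C"},
    {"record": "HETATM", "chain": "A", "residue": "HOH", "number": 201, "atom": "O"},
    {"record": "HETATM", "chain": "A", "residue": "HOH", "number": 202, "atom": "O"},
]


def node_ids(records, granularity="CA"):
    """Return node labels for a given granularity (hypothetical helper).

    "atom" yields one node per atom; any other value is treated as an
    atom name, and only records with that atom name become nodes.
    """
    if granularity == "atom":
        return [
            f"{r['chain']}:{r['residue']}:{r['number']}:{r['atom']}"
            for r in records
        ]
    return [
        f"{r['chain']}:{r['residue']}:{r['number']}"
        for r in records
        if r["atom"] == granularity
    ]


ca_nodes = node_ids(atoms, "CA")      # waters have no CA atom, so they vanish
atom_nodes = node_ids(atoms, "atom")  # every atom, including water oxygens
print(ca_nodes)
print(atom_nodes)
```

In this toy input the residue-level selection keeps only the lysine, whereas the atom-level selection retains both water oxygens.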

For heteroatoms it can be tricky to consistently and universally define what the coarsened node should be. I think a good heuristic for coarsened graphs could be the centre of mass (CoM) of the ligand. One workaround would be to write your own hetatm_df_processing_func to manipulate the hetatm df so that it contains a representative "CA" for each heteroatom residue.
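The centre-of-mass idea could be sketched roughly as follows. This is plain Python over a list of dicts, not a drop-in Graphein function: a real hetatm_df_processing_func would operate on the HETATM DataFrame and its actual column names, which are not reproduced here. The field names, the add_representative_ca helper, and the short mass table are assumptions for illustration only.

```python
from collections import defaultdict

# Approximate atomic masses (u) for a few common elements; extend as needed.
MASSES = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "P": 30.974, "S": 32.06}


def add_representative_ca(hetatms):
    """Collapse each HETATM residue to one pseudo-atom named "CA".

    The pseudo-atom is placed at the mass-weighted centre of the residue's
    atoms, so a later selection on "CA" would keep one node per residue.
    """
    groups = defaultdict(list)
    for a in hetatms:
        groups[(a["chain"], a["residue"], a["number"])].append(a)

    representatives = []
    for (chain, residue, number), members in groups.items():
        total_mass = sum(MASSES[m["element"]] for m in members)
        com = tuple(
            sum(MASSES[m["element"]] * m["xyz"][i] for m in members) / total_mass
            for i in range(3)
        )
        representatives.append(
            {"chain": chain, "residue": residue, "number": number,
             "atom": "CA", "xyz": com}
        )
    return representatives


# Hypothetical input: one water (a single O) and a toy two-atom ligand "LIG".
hetatms = [
    {"chain": "A", "residue": "HOH", "number": 201, "element": "O", "xyz": (1.0, 2.0, 3.0)},
    {"chain": "A", "residue": "LIG", "number": 301, "element": "C", "xyz": (0.0, 0.0, 0.0)},
    {"chain": "A", "residue": "LIG", "number": 301, "element": "O", "xyz": (1.0, 0.0, 0.0)},
]
reps = add_representative_ca(hetatms)
for rep in reps:
    print(rep)
```

For a water with only an oxygen, the representative sits on the oxygen itself. For multi-atom ligands it sits at the centre of mass, which may not coincide with any real atom.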

We looked into this quite extensively for protein-ligand graphs (see #164, mainly here: https://github.com/a-r-j/graphein/blob/d81fc2f77b3562f61f70f257ddf509d5102b8bf6/graphein/protein_ligand/graphs.py).

What's your application? Using graphein.protein.tensor.data.Protein should work reliably if it's ML-based.

davidkastner · 2024-03-12T18:05:31Z

Hi @a-r-j. I see the problem and agree it would be challenging to generalize! For my purposes, the atom representation will work well, as I am building graphs for QM cluster models extracted from proteins. Since the QM cluster models are small, the extra information afforded by the atom representation will be useful. I appreciate your response and will close the issue as resolved, but hopefully it will be a useful point of reference for others.

davidkastner closed this as completed Mar 12, 2024
