Biomolecules
See recent articles
Showing new listings for Friday, 4 October 2024
- [1] arXiv:2410.02647 (cross-list from cs.LG) [pdf, html, other]
-
Title: Immunogenicity Prediction with Dual Attention Enables Vaccine Target SelectionComments: 18 pages, 11 tables, 5 figuresSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
Immunogenicity prediction is a central topic in reverse vaccinology for finding candidate vaccines that can trigger protective immune responses. Existing approaches typically rely on highly compressed features and simple model architectures, leading to limited prediction accuracy and poor generalizability. To address these challenges, we introduce ProVaccine, a novel deep learning solution with a dual attention mechanism that integrates pre-trained latent vector representations of protein sequences and structures. We also compile the most comprehensive immunogenicity dataset to date, encompassing over 9,500 antigen sequences, structures, and immunogenicity labels from bacteria, viruses, and tumors. Extensive experiments demonstrate that ProVaccine outperforms existing methods across a wide range of evaluation metrics. Furthermore, we establish a post-hoc validation protocol to assess the practical significance of deep learning models in tackling vaccine design challenges. Our work provides an effective tool for vaccine design and sets valuable benchmarks for future research.
Cross submissions (showing 1 of 1 entries)
- [2] arXiv:2309.16519 (replaced) [pdf, html, other]
-
Title: AtomSurf : Surface Representation for Learning on Protein StructuresComments: 10 pagesSubjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
While there has been significant progress in evaluating and comparing different representations for learning on protein data, the role of surface-based learning approaches remains not well-understood. In particular, there is a lack of direct and fair benchmark comparison between the best available surface-based learning methods against alternative representations such as graphs. Moreover, the few existing surface-based approaches either use surface information in isolation or, at best, perform global pooling between surface and graph-based architectures.
In this work, we fill this gap by first adapting a state-of-the-art surface encoder for protein learning tasks. We then perform a direct and fair comparison of the resulting method against alternative approaches within the Atom3D benchmark, highlighting the limitations of pure surface-based learning. Finally, we propose an integrated approach, which allows learned feature sharing between graphs and surface representations on the level of nodes and vertices $\textit{across all layers}$.
We demonstrate that the resulting architecture achieves state-of-the-art results on all tasks in the Atom3D benchmark, while adhering to the strict benchmark protocol, as well as more broadly on binding site identification and binding pocket classification. Furthermore, we use coarsened surfaces and optimize our approach for efficiency, making our tool competitive in training and inference time with existing techniques. Our code and data can be found online: $\texttt{this http URL}$ - [3] arXiv:2401.13858 (replaced) [pdf, html, other]
-
Title: Graph Diffusion Transformers for Multi-Conditional Molecular GenerationComments: Accepted by NeurIPS 2024 (Oral). 21 pages, 11 figures, 8 tablesSubjects: Machine Learning (cs.LG); Biomolecules (q-bio.BM)
Inverse molecular design with diffusion models holds great potential for advancements in material and drug discovery. Despite success in unconditional molecular generation, integrating multiple properties such as synthetic score and gas permeability as condition constraints into diffusion models remains unexplored. We present the Graph Diffusion Transformer (Graph DiT) for multi-conditional molecular generation. Graph DiT integrates an encoder to learn numerical and categorical property representations with the Transformer-based denoiser. Unlike previous graph diffusion models that add noise separately on the atoms and bonds in the forward diffusion process, Graph DiT is trained with a novel graph-dependent noise model for accurate estimation of graph-related noise in molecules. We extensively validate Graph DiT for multi-conditional polymer and small molecule generation. Results demonstrate the superiority of Graph DiT across nine metrics from distribution learning to condition control for molecular properties. A polymer inverse design task for gas separation with feedback from domain experts further demonstrates its practical utility.
- [4] arXiv:2403.07179 (replaced) [pdf, html, other]
-
Title: 3M-Diffusion: Latent Multi-Modal Diffusion for Language-Guided Molecular Structure GenerationSubjects: Machine Learning (cs.LG); Computation and Language (cs.CL); Biomolecules (q-bio.BM)
Generating molecular structures with desired properties is a critical task with broad applications in drug discovery and materials design. We propose 3M-Diffusion, a novel multi-modal molecular graph generation method, to generate diverse, ideally novel molecular structures with desired properties. 3M-Diffusion encodes molecular graphs into a graph latent space which it then aligns with the text space learned by encoder-based LLMs from textual descriptions. It then reconstructs the molecular structure and atomic attributes based on the given text descriptions using the molecule decoder. It then learns a probabilistic mapping from the text space to the latent molecular graph space using a diffusion model. The results of our extensive experiments on several datasets demonstrate that 3M-Diffusion can generate high-quality, novel and diverse molecular graphs that semantically match the textual description provided.