AutoDiff: Autoregressive Diffusion Modeling for Structure-based Drug Design

Xinze Li Penglei Wang Tianfan Fu Wenhao Gao Chengtao Li Leilei Shi Junhong Liu

Abstract

Structure-based drug design (SBDD), which aims to generate molecules that can bind tightly to the target protein, is an essential problem in drug discovery, and previous approaches have achieved initial success. However, most existing methods still suffer from invalid local structure or unrealistic conformation issues, which are mainly due to the poor leaning of bond angles or torsional angles. To alleviate these problems, we propose AutoDiff, a diffusion-based fragment-wise autoregressive generation model. Specifically, we design a novel molecule assembly strategy named conformal motif that preserves the conformation of local structures of molecules first, then we encode the interaction of the protein-ligand complex with an $SE(3)$ -equivariant convolutional network and generate molecules motif-by-motif with diffusion modeling. In addition, we also improve the evaluation framework of SBDD by constraining the molecular weights of the generated molecules in the same range, together with some new metrics, which make the evaluation more fair and practical. Extensive experiments on CrossDocked2020 demonstrate that our approach outperforms the existing models in generating realistic molecules with valid structures and conformations while maintaining high binding affinity.

Machine Learning, ICML

1 Introduction

Structure-based drug design (SBDD), which can be formulated as generating 3D molecules conditioned on protein pockets, is an important and challenging task in drug discovery (Bohacek et al., 1996). Compared to string-based (Bjerrum & Threlfall, 2017; Segler et al., 2018) and graph-based (You et al., 2018; Jin et al., 2018; Shi et al., 2020; Jin et al., 2020a) molecule generation, SBDD leverages the spatial geometric structure information and perceives how molecule interacts with protein pocket. Therefore, it can generate drug-like molecules with high binding affinities to the target. Recently, we have witnessed the success of deep generative models on this task, and most of the existing approaches can be roughly divided into two categories: autoregressive-based and diffusion-based.

For autoregressive-based approaches, early attempts generated 3D molecules by estimating the probability density of atoms’ occurrence in protein pocket and placing atoms of specific types and locations one by one (Luo et al., 2021; Liu et al., 2022). Subsequently, Peng et al. (2022) took the modeling of chemical bonds into consideration and achieved more practical atomic connections. However, atom-wise autoregressive approaches always force the model to generate chemically invalid intermediaries (Jin et al., 2018), yielding unrealistic fragments in the generated molecules. To tackle this problem, fragment-wise autoregressive approaches (Zhang et al., 2022; Zhang & Liu, 2023) were proposed, while these methods always suffer from error accumulation due to the poor learning capacity of torsional angle and the defective motif design strategy, which lead to invalid local structures or unrealistic conformations.

For diffusion-based approaches, some learned the distribution of atom types and positions from a standard Gaussian prior based on diffusion process (Guan et al., 2022; Lin et al., 2022; Schneuing et al., 2022), and some introduced the scaffold-arm decomposition prior into the diffusion modeling to improve the binding affinity of the generated molecules (Guan et al., 2023). However, diffusion-based approaches also tend to generate unrealistic local structures such as messy rings (Guan et al., 2023). In addition to SBDD, diffusion models have also been widely used in other biochemistry tasks such as molecule conformation prediction (Jing et al., 2022) and ligand-protein binding prediction (Corso et al., 2022; Lu et al., 2023), where diffusion models show promising modeling capacity of torsional angle.

To overcome the aforementioned challenges and limitations, we leverage the strength of diffusion models and motif-based autoregressive generation and propose AutoDiff, a novel conformal motif-based molecule generation method with diffusion modeling. Different from previous approaches (Zhang et al., 2022), we propose a novel conformal motif design strategy, which can alleviate the invalid structure and unrealistic conformation problems. In addition, we model the protein-ligand complex with an $SE(3)$ -equivariant convolutional network to learn the spatial geometric structure features and interaction information. At each generation step, we predict a connection site which can be either an atom or a bond for the current fragment and the motif library, respectively, then attach two predicted connection sites to form a new fragment, and the torsional angle is predicted with a probabilistic diffusion model at last. Thanks to the implicitly encoded conformation in the conformal motifs, the connection site-based attachment can perceive the local environment of the current pocket-ligand complex, therefore alleviating the error accumulation and generating more realistic molecules. Furthermore, we also improve the evaluation framework by constraining the molecular weights of the generated molecules in the same range, together with some new metrics, which can evaluate the structure validity and binding affinity more practically than before.

To summarize, the main contributions of this paper are three-fold:

•

Assembly strategy: we propose a new motif design strategy named conformal motif, which preserves all conformation information of local structures.
•

Generative method: we present a novel generation framework which makes use of the advantages of diffusion model and motif-based generation to design realistic molecules.
•

Experimental result: we improve the evaluation framework together with some new metrics, with which the SBDD models can be evaluated and compared more fairly and practically.

2 Related Work

Fragment-Wise Molecule Generation. Fragment-wise generation is prevalent since the chemical information is preserved in the substructures to produce realistic molecules. Jin et al. (2018) proposed a junction tree variational autoencoder for generating molecules with chemical motifs, it constructs a tree-structured scaffold first, and then combines the motifs of the tree into a molecule with a graph message passing network. Jin et al. (2020b) designed a multiple-property optimization approach in which the motif vocabulary with good properties is constructed first, then molecules are generated by expanding rationale graphs with graph generative models and optimized by fine-tuning to desirable properties with reinforcement learning models. Recently, fragment-based 3D generation approaches have shown promising capacity in drug design and lead optimization (Flam-Shepherd et al., 2022; Powers et al., 2022; Zhang et al., 2022; Zhang & Liu, 2023). Flam-Shepherd et al. (2022) used a hierarchical agent to generate 3D molecules guided by quantum mechanics with a reinforcement learning framework in an autoregressive fashion. Powers et al. (2022) learned how to attach fragments to a growing structure by recognizing realistic intermediates generated en route to a final ligand, which solved a 3D molecule optimization problem. For fragment-wise generation, the key is how to design the motif that can encode the chemical information and local topological structure appropriately. This is even more important to structure-based drug design, which needs to take the motif conformation into account to achieve realistic 3D structures.

Structure-Based Drug Design. Structure-based drug design (SBDD) generates target-aware molecules that bind to specific protein pockets. Fu et al. (2022) proposed a variant of genetic algorithm guided by reinforcement learning, which employs neural models to prioritize the profitable drug design steps with protein structure information as input. Ragoza et al. (2022) presented an atomic density grid representation of protein-ligand complex and learned the molecule distributions with a conditional variational autoencoder. Luo et al. (2021); Liu et al. (2022) generated 3D molecules by estimating the probability density of atoms’ occurrence in protein pocket and placing atoms of specific types and locations one by one. Peng et al. (2022) took the modeling of chemical bonds into consideration and achieved more practical atomic connections. Zhang et al. (2022); Zhang & Liu (2023) proposed a fragment-wise framework which generates molecules motif-by-motif. Another line of work focuses on diffusion-based approaches. Lin et al. (2022); Schneuing et al. (2022); Guan et al. (2022) learned the distribution of atom types and positions from a standard Gaussian prior based on the diffusion process. Guan et al. (2023) decomposed ligands into arms and scaffolds, then incorporated related prior knowledge into diffusion models for better molecule generation.

Diffusion Models. Recently, diffusion models have attracted considerable attention thanks to their promising generative results, which have been widely used in computer vision (Dhariwal & Nichol, 2021; Nichol et al., 2021; Rombach et al., 2022; Ceylan et al., 2023; Tumanyan et al., 2023), natural language processing (Li et al., 2022; Lovelace et al., 2022; Yuan et al., 2022; Lin et al., 2023), and speech modeling (Pascual et al., 2023; Guo et al., 2023), while remarkable success also has been achieved in the domain of biochemistry and drug design (Hoogeboom et al., 2022; Xu et al., 2022b; Jing et al., 2022; Corso et al., 2022; Lu et al., 2023). Jing et al. (2022) studied molecular conformation generation, which operates on the space of torsional angles via a diffusion process on the hypertorus and an extrinsic-to-intrinsic score model. Hoogeboom et al. (2022) generated 3D molecules, which learns to denoise a diffusion process with an equivariant network that jointly operates on both atom coordinates and atom types.

3 Method

In this section, we present AutoDiff, a diffusion-based fragment-wise autoregressive generation model for structure-based drug design. Firstly, we formulate the task of structure-based drug design (SBDD) formally in Section 3.1 and introduce conformal motif design strategy in Section 3.2. Then, we elaborate on the generation process based on the proposed conformal motif in Section 3.3. In the end, we derive the optimization objective to train our model in Section 3.4.

Algorithm 1 Conformal Motif Extraction (Section 3.2)

Input: A set of molecule graphs

\mathcal{D}=\{\mathcal{G}_{1},\mathcal{G}_{2},...,\mathcal{G}_{|\mathcal{D}|}\}

Output: Motif vocabulary

\mathcal{W}

\mathcal{W}_{r}\leftarrow\{\},\mathcal{W}_{c*}\leftarrow\{\}

for

\mathcal{G}(\mathcal{V},\mathcal{E})\in\mathcal{D}

\mathcal{W}_{\mathcal{G}}

= Disconnect(

\mathcal{G},\mathcal{R}_{\mathcal{G}})

for

\mathcal{F}(\mathcal{V}_{\mathcal{F}},\mathcal{E}_{\mathcal{F}})\in\mathcal{W}% _{\mathcal{G}}

if IsFusedRing(

\mathcal{F})

then

\mathcal{F}_{r}

= Decompose(

\mathcal{F}

)

\mathcal{W}_{r}\leftarrow\mathcal{W}_{r}\cup\mathcal{F}_{r}

else if IsChain(

\mathcal{F})

then

\mathcal{V}_{\mathcal{F}}^{*}\leftarrow\mathcal{V}_{\mathcal{F}}\cup\{a|a\in% \mathcal{V},\exists b\in\mathcal{V}_{\mathcal{F}},(a,b)\in\mathcal{E}\}

\mathcal{E}_{\mathcal{F}}^{*}\leftarrow\{(a,b)\in\mathcal{E}|a\in\mathcal{V}_{% \mathcal{F}}^{*},b\in\mathcal{V}_{\mathcal{F}}^{*}\}

\mathcal{W}_{c*}\leftarrow\mathcal{W}_{c*}\cup\{\mathcal{F}^{*}(\mathcal{V}_{% \mathcal{F}}^{*},\mathcal{E}_{\mathcal{F}}^{*})\}

else

\mathcal{W}_{r}\leftarrow\mathcal{W}_{r}\cup\{\mathcal{F}\}

end if

end for

\mathcal{W}\leftarrow\mathcal{W}_{r}\cup\mathcal{W}_{c*}

3.1 Problem Formulation

Structure-based drug design (SBDD) task can be formulated as a conditional generation task that generates 3D molecules conditioned on the given protein pocket. Specifically, the protein pocket can be represented as a set of atoms (with coordinates) $\mathcal{P}=\{(a_{P}^{i},\mathbf{r}_{P}^{i})\}_{i=1}^{N_{P}}$ , while the drug molecule can also be represented as a set of atoms $\mathcal{G}=\{(a_{G}^{i},\mathbf{r}_{G}^{i})\}_{i=1}^{N_{G}}$ , where $N_{P}$ and $N_{G}$ denotes the number of atoms in the pocket and the molecule, respectively; $\mathbf{r}^{i}\in\mathbb{R}^{3}$ is the coordinate of the $i$ -th heavy atom. With the definitions, the SBDD task can be re-formulated as learning a conditional distribution $p(\mathcal{G}|\mathcal{P})$ from the co-crystallized (or docked) 3D protein-ligand complex data.

3.2 Assembly: Conformal Motif

The motif-based generation is explored and applied in 2D generation (Jin et al., 2018, 2020a, 2020b; Fu et al., 2021) initially, and has achieved decent performance especially compared to atom-wise approaches. To construct motif vocabulary, molecules in a library are decomposed into disconnected fragments by breaking all the bridge bonds or rotatable bonds that will not violate chemical validity, and fragments with higher frequency than a threshold are selected as the building blocks, i.e., motifs. This strategy was also employed in 3D generation (Flam-Shepherd et al., 2022; Powers et al., 2022) and SBDD (Zhang et al., 2022; Zhang & Liu, 2023), while the results are not satisfied due to the invalid structures and conformations generated in the sampled molecules. The main reason is that the existing motif design strategy is defective since it only encodes part of the 3D topological information of local structures. Specifically, some 3D topological information of the surrounding environment of atoms in severed bonds will be lost during fragmentation, which leads to the annihilation of the correct local conformation when motifs are attached to each other and finally results in invalid structures or unrealistic conformations, as shown in Figure 1. Motivated by the analysis results, we propose a novel motif design strategy, i.e., conformal motif, in which the term “conformal” stems from hydromechanics and geometry, while we refer to fully preserving 3D topology information in our motifs. Concretely, we first detach all freely rotatable bonds $\mathcal{R}$ (precise definitions in Section B.1) to break molecules into fragments, then we use redundant dummy atoms to act as placeholders, which preserve the 3D topology information (mainly bond angles) of the surrounding environment implicitly for atoms of the severed bonds, and conformation of the motifs can be recovered with cheminformatics tools such as RDKit (Bento et al., 2020). Figure 1 shows that it avoids distorting local structures and helps to generate molecules with realistic structures and conformations. Furthermore, to explore more possible conformation flexibly, we further decompose the fused ring to reduce the motif size. The extracted conformal motifs can be divided into two categories: ring-like $\mathcal{W}_{r}$ and chain-like $\mathcal{W}_{c*}$ . In view of the combination explosion of adding dummy atoms on $\mathcal{W}_{r}$ , we only add dummy atoms on $\mathcal{W}_{c*}$ . Algorithm 1 shows the pseudo-code of the complete process. We also provide an example in Appendix B.2. To the best of our knowledge, conformal motif is the first motif strategy designed for the SBDD task that takes full conformation information into consideration.

Refer to caption — Figure 1: Illustration of the advantage of conformal motif (bottom) versus other methods (Zhang et al., 2022; Zhang & Liu, 2023) (top).

3.3 Method: AutoDiff

We first present the overall generation process of the proposed AutoDiff, together with some notion definitions in Section 3.3.1; then we show the $SE(3)$ -equivariant encoder in Section 3.3.2, which is employed as the main architecture to model the protein pocket-ligand complex; finally, we introduce each module from Section 3.3.3 to Section 3.3.5 in detail. For ease of exposition, we list all the terminologies and mathematical notations in Appendix A.

3.3.1 Overview of AutoDiff

Firstly, we will define some notions. A connection site is an atom in a severed bond or a bond in a fused ring during motif extraction. A candidate connection site is an atom or a bond that may become a connection site during the generation process. We denote all the candidate connection sites in the intermediate $\mathcal{G}^{(t)}$ or the motif vocabulary $\mathcal{W}$ as CCS, while some atoms or bonds in CCS of $\mathcal{W}$ are chemically equivalent and can be reduced to a simplified version, which is called Reduced Candidate Connection Sites (RCCS). It is worth noting that RCCS is the same as CCS in $\mathcal{G}^{(t)}$ . The item selected from RCCS for attachment in each generation step is termed as Focal Connection Site (FCS).

Overall, the process of which can be defined as follows:

	$\displaystyle\mathcal{G}^{(t)}=\phi(\mathcal{P}),$	$\displaystyle t=1,$		(1)
	$\displaystyle\mathcal{G}^{(t)}=\phi(\mathcal{G}^{(t-1)},\mathcal{P}),$	$\displaystyle t>1,$		(2)

where $\phi$ is our generation model. The generation of each motif consists of four steps (Figure 2): (1) an FCS prediction model is trained to predict the FCS in $\mathcal{G}^{(t-1)}$ ; (2) another FCS prediction model predicts the FCS in $\mathcal{W}$ , and the corresponding motif $\mathcal{W}^{(t-1)}$ is also determined simultaneously; (3) $\mathcal{W}^{(t-1)}$ will be attached to $\mathcal{G}^{(t-1)}$ by connecting the two FCSs; (4) learn the torsional angle with a diffusion-based model. In this way, we get $\mathcal{G}^{(t)}$ and the generation process continues until no FCS in the current ligand fragment can be found. Note that the generation of the first motif is different from the procedure presented above and we will introduce the details in Section 3.3.3.

3.3.2 Contextual Encoder

This section describes an $SE(3)$ -equivariant convolutional network-based encoder, which is employed as the main architecture to model the protein pocket-ligand complex. It is crucial to characterize the surrounding environmental information in the protein pocket for the SBDD task. In our approach, complexes of protein pockets and ligand fragments are collectively represented as heterogeneous geometric graphs $\mathcal{G}_{H}=(\mathcal{V}_{H},\mathcal{E}_{H})$ , where the vertex set $\mathcal{V}_{H}=(\mathcal{V}_{l},\mathcal{V}_{p})$ is the collection of all the heavy atoms of ligand fragment and protein pocket, while the edge set $\mathcal{E}_{H}=(\mathcal{E}_{ll},\mathcal{E}_{lp},\mathcal{E}_{pl},\mathcal{E% }_{pp})$ is constructed by cutting off the distance between atoms with thresholds of 5 $\mathring{A}$ ¹¹1The units is Angstrom $\mathring{A}$ ( $10^{-10}$ m)., 10 $\mathring{A}$ , 15 $\mathring{A}$ for ligand-ligand, ligand-pocket/pocket-ligand, and pocket-pocket atom pairs respectively. Unlike previous approaches (Peng et al., 2022; Zhang et al., 2022) that build interaction graphs only depending on distance or $k$ -nearest neighbors, we also preserve all the covalent bonds in ligands as edges to better model ligand-ligand atoms interactions.

$SE(3)$ -equivariant convolutional networks based on tensor products of irreducible representations of $SO(3)$ are used to encode $\mathcal{G}_{H}$ . At each interaction layer, messages are generated using a tensor product of spherical harmonic representations of the edge vector and node representation. Then, for each node $i$ of type $c_{a}(c_{a}\in\{l,p\})$ , it collects the message from its connected edges and updates its representation, which can be formulated as:

\mathbf{h}_{i}\leftarrow\mathbf{h}_{i}\underset{c\in\{l,p\}}{\oplus}\mathrm{BN% }^{(c_{a},c)}\left(\frac{1}{|\mathcal{N}_{i}^{(c)}|}\sum_{j\in\mathcal{N}_{i}^% {(c)}}Y(\mathbf{r}_{ij})\otimes_{\psi_{ij}}\mathbf{h}_{j}\right),

(3)

where $\mathbf{h}_{i}$ denotes node $i$ ’s representation, $\oplus$ denotes vector addition, $\otimes_{\psi_{ij}}$ denotes spherical tensor product with weight $\psi_{ij}$ , $\mathrm{BN}$ is the equivariant batch normalization, $\mathcal{N}_{i}^{(c)}$ denotes neighbors of node $i$ of type $c$ . $Y$ refer to spherical harmonics, $\mathbf{r}_{ij}$ is the direction vector of edge $e_{ij}$ , $\psi_{ij}=\Psi(\mathrm{h}_{ij},\mathrm{h}_{j},\mathrm{h}_{j})$ contains learnable weight of tensor product, where $\mathrm{h}_{ij}$ denotes the embedding of $e_{ij}$ , and $\mathrm{h}_{j}$ denotes scalar features of node $i$ . Note that the interaction layer contains three sublayers: two intra-interaction layers on $\mathcal{G}=(\mathcal{V}_{l},\mathcal{E}_{ll})$ and $\mathcal{P}=(\mathcal{V}_{p},\mathcal{E}_{pp})$ , one inter-interaction layer on $((\mathcal{V}_{l},\mathcal{V}_{p}),(\mathcal{E}_{pl},\mathcal{E}_{lp}))$ .

3.3.3 Generate the First Motif

How to generate the first motif is crucial to achieving successful generation, which consists of two steps: selection of the motif and placement in the protein pocket. We first train a model to predict the frontier in the pocket, which is defined as the pocket atom closest to the first motif’s centroid, and then a classifier is employed to predict the motif by taking the predicted frontier as input. So far, we have selected the first motif, while placing it in the pocket, namely pose prediction, is quite challenging. Previous methods used contact map to predict the position for the first motif (Zhang et al., 2022; Zhang & Liu, 2023), which tends to predict implausible poses and mislead the whole generation process due to the inappropriate modeling of the motif poses. To be specific, the same motif may have different relative positions to the same pocket or subpocket in different complexes, while the contact map-based approaches are forced to learn an average-like output from the training data consisting of various alternative poses, thus usually resulting in unrealistic molecule structures. In our approach, we develop a diffusion-based generative model to learn the distribution of various motif poses. Since there are no freely rotatable bonds in the motif, the pose lies in a 6-dimensional submanifold whose degree of freedom comes from translation and rotation. Therefore, we generate the conformation of the selected motif by RDKit (Bento et al., 2020) first, and a convolution of each motif atom with motif centroid $o$ is employed:

\mathbf{v}=\frac{1}{|\mathcal{V}_{l}|}\sum_{i\in\mathcal{V}_{l}}Y(\mathbf{r}_{% oi})\otimes_{\psi_{oi}}\mathbf{h}_{i},

(4)

where $\psi_{oi}=\Psi(\mathrm{h}_{oi},\mathrm{h}_{i})$ . The output $\mathbf{v}$ consists of 2 odd parity vectors and 2 even vectors. Translation and rotation of the molecule are predicted as:

	$\displaystyle\mathbf{tr}$	$\displaystyle=\frac{\mathbf{\bar{v}}_{odd}}{\left\\|\mathbf{\bar{v}}_{odd}% \right\\|}\times\mathrm{MLP}\left(\left\\|\mathbf{\bar{v}}_{odd}\right\\|,\mathbf% {s}_{t}\right),$		(5)
	$\displaystyle\mathbf{rot}$	$\displaystyle=\frac{\mathbf{\bar{v}}_{even}}{\left\\|\mathbf{\bar{v}}_{even}% \right\\|}\times\mathrm{MLP}\left(\left\\|\mathbf{\bar{v}}_{even}\right\\|,% \mathbf{s}_{t}\right),$		(6)

where $\mathbf{s}_{t}$ denotes the sinusoidal embeddings of the diffusion time ${t}$ .

3.3.4 FCS Prediction

The key step for fragment-wise autoregressive generation is selecting a motif $\mathcal{W}^{(t-1)}$ and attaching it to the current generated molecule $\mathcal{G}^{(t-1)}$ . Different from previous methods (Zhang et al., 2022; Zhang & Liu, 2023) that predict a motif first and then select the appropriate attachment by enumeration and scoring, we predict a connection site directly in the conformal motif vocabulary $\mathcal{W}$ , while $\mathcal{W}^{(t-1)}$ is determined simultaneously. In this way, the motif is predicted in a more fine-grained fashion, taking the atom-level contextual information into account. In the following, we will elaborate on FCS prediction of $\mathcal{G}^{(t-1)}$ and $\mathcal{W}$ , respectively.

FCS prediction of $\mathcal{G}^{(t-1)}$ . According to the definition in Section 3.3.1, CCS for $\mathcal{G}^{(t-1)}$ includes three parts: dummy atoms in chains, atoms in rings that do not violate the chemical validity if connected to other heavy atoms, and chemical bonds in rings that connect two candidate connection atoms. Therefore, we train a model to predict the FCS from CCS as follows:

	$\displaystyle P_{v_{i}}$	$\displaystyle=\sigma(\mathrm{MLP}(\mathbf{h}_{i})),$		(7)
	$\displaystyle P_{e_{u}}$	$\displaystyle=\sigma(\mathrm{MLP}([\mathbf{h}_{i},\mathbf{h}_{j}])),(v_{i},v_{% j})\in e_{u}.$		(8)

Note that edges are directed, and the predicted FCS can be an atom $v_{f}$ or an edge $e_{f}$ .

FCS prediction of $\mathcal{W}$ . As defined in Section 3.3.1, the construction of CCS for $\mathcal{W}$ is similar to $\mathcal{G}^{(t-1)}$ which also consists of three parts: dummy atoms in $\mathcal{W}_{c*}$ , atoms in $\mathcal{W}_{r}$ that do not violate the chemical validity if connected to other heavy atoms, and bonds in $\mathcal{W}_{r}$ that connect two candidate connection atoms. While some atoms or bonds in CCS are equivalent, which means that their centered neighbors of atoms and bonds are all the same, resulting in the same graph generated by connecting them to $\mathcal{G}^{(t-1)}$ . Therefore, equivalent atoms or bonds should be reduced in case of inducing aleatoric uncertainty in generation. To reduce the CCS into the RCCS, we first recognize equivalent atoms and bonds. Specifically, we traverse all the atom pairs $(v_{i},v_{j})$ in a motif and define corresponding graph pairs $(\mathcal{G}_{i}^{\mathcal{W}},\mathcal{G}_{j}^{\mathcal{W}})$ , which denote the motif graph centered in $v_{i}$ and $v_{j}$ , In the graph pair, $v_{i}$ and $v_{j}$ are assigned with a special label, while the other atoms and bonds are labeled with their corresponding element type or bond type, respectively. If the graph pairs $(\mathcal{G}_{i}^{\mathcal{W}},\mathcal{G}_{j}^{\mathcal{W}})$ are proven to be isomorphic under the graph isomorphism testing, the atoms $v_{i}$ and $v_{j}$ are equivalent. On the basis of this, equivalent edges are determined by whether their connected atoms are equivalent, in other words, $e_{ij}\equiv e_{mn}\iff(v_{i}\equiv v_{m})\land(v_{j}\equiv v_{n})$ , where $\equiv$ refers to equivalent. After recognizing the equivalent connection sites, we reduce the CCS into the RCCS. In the end, we use FCS in $\mathcal{G}^{(t-1)}$ to query another FCS in $\mathcal{W}$ , and employ two neural networks to make a query vector $Q$ and key vectors $K$ . Specifically, FCS is predicted by either of the following models:

	$\displaystyle P_{v}$	$\displaystyle=\underset{v\in\mathcal{V}_{\mathcal{W}}}{\mathrm{softmax}}\left(% \mathrm{MLP}_{Q}^{v}([\mathbf{h}_{\mathcal{\hat{G}}},\mathbf{h}_{v_{f}}])\cdot% \mathrm{MLP}_{K}^{v}(\mathbf{h}_{v})\right)$		(9)
	$\displaystyle P_{e}$	$\displaystyle=\underset{e\in\mathcal{E}_{\mathcal{W}}}{\mathrm{softmax}}\left(% \mathrm{MLP}_{Q}^{e}([\mathbf{h}_{\mathcal{\hat{G}}},\mathbf{h}_{e_{f}}])\cdot% \mathrm{MLP}_{K}^{e}(\mathbf{h}_{e})\right)$		(10)

where $\mathcal{\hat{G}}=\mathcal{G}^{(t-1)}$ . The type of FCS (atom or bond) is determined to be consistent with FCS in $\mathcal{G}^{(t-1)}$ to avoid mismatched connection. Up to now, the motif to be attached $\mathcal{W}^{(t-1)}$ is also determined and the new molecule $\mathcal{G}^{(t)}$ is generated by attaching $\mathcal{W}^{(t-1)}$ to $\mathcal{G}^{(t-1)}$ on the FCSs of both. To realize the conformation of $\mathcal{G}^{(t)}$ , we first represent the conformation of $\mathcal{G}^{(t-1)}$ as $C_{\mathcal{G}^{(t-1)}}$ , and sample a conformation of $\mathcal{G}^{(t)}$ that denoted as $\hat{C}_{\mathcal{G}^{(t)}}$ with RDKit (Bento et al., 2020), then we use Kabsch algorithm (Kabsch, 1976) to calculate the translation $\mathbf{t}$ and the rotation $\mathbf{R}$ that align conformation of $\mathcal{W}^{(t-2)}$ in $\hat{C}_{\mathcal{G}^{(t)}}$ to the conformation of $\mathcal{W}^{(t-2)}$ in $C_{\mathcal{G}^{(t-1)}}$ . Let $\mathbf{x}_{l}^{i}$ denote the position vector of atom $i$ in $\mathcal{W}^{(t-1)}$ , its position after attachment $\mathbf{\hat{x}}_{l}^{i}$ is calculated as:

\mathbf{\hat{x}}_{l}^{i}=\mathbf{R}\mathbf{x}_{l}^{i}+\mathbf{t}.

(11)

It should be noted that there may be additional freedom if the newly formed bond is rotatable, then the torsional angle should be predicted, which will be introduced in the next section.

3.3.5 Torsional angle prediction

Torsional angle prediction is the last but important module that determines the conformation of the generated molecules, since bond lengths, bond angles and small rings are essentially rigid, such that the flexibility of molecules lies almost entirely in the torsional angles at rotatable bonds, and it is also hard to learn due to the flexibility. Previous approaches (Zhang et al., 2022; Zhang & Liu, 2023) predict the change of the torsional angle with a regression model, which tends to learn an average-like implausible output and may lead to unrealistic conformation. Inspired by the results achieved in (Jing et al., 2022), we propose a diffusion-based model to characterize the distribution of torsional angles. To be specific, for bond $b$ , an $SE(3)$ -invariant scalar representing torsional score $T_{b}$ is generated by a convolution of each atom with the bond center $z$ :

{T}_{b}=\mathrm{MLP}\left(\frac{1}{|\mathcal{N}_{b}|}\sum_{a\in\mathcal{N}_{b}% }Y(\mathbf{r}_{za})\otimes Y^{2}(\mathbf{r}_{b})\otimes_{\gamma_{za}}\mathbf{h% }_{a}\right),

(12)

where $\gamma_{za}=\Gamma(\mathrm{h}_{za},\mathrm{h}_{a},\mathrm{h}_{i}+\mathrm{h}_{j% }),(v_{i},v_{j}\in e_{b})$ , where $e_{b}$ is the edge of bond $b$ .

3.4 Training

In the training stage, we use binary cross-entropy loss $\mathcal{L}_{fro}$ for the prediction of frontiers in pockets and cross-entropy loss $\mathcal{L}_{mot}$ for the first motif prediction, while binary cross entropy loss $\mathcal{L}_{CS}$ is used for connection site prediction, furthermore, $\mathcal{L}_{\mathbf{tr}}$ , $\mathcal{L}_{\mathbf{rot}}$ , $\mathcal{L}_{T}$ are the losses for the translation, rotation of the first motif and torsion of rotatable bond produced in generation. The total loss can be defined as follows:

\mathcal{L}=\mathcal{L}_{fro}+\mathcal{L}_{mot}+\mathcal{L}_{CS}+\mathcal{L}_{% \mathbf{tr}}+\mathcal{L}_{\mathbf{rot}}+\mathcal{L}_{T}.

(13)

We provide more implementation details in Appendix D.

4 Experiment

4.1 Experiment Setup

Dataset. In this paper, we train and evaluate our model with the CrossDock2020 (Francoeur et al., 2020) dataset, which contains 22.5 million poses of ligands docked into multiple similar binding pockets across the Protein Data Bank (Berman et al., 2000). In our experiments, the dataset is processed with the same procedure to (Guan et al., 2023).

Baselines. We compare our model with various state-of-the-art baselines: LiGAN (Ragoza et al., 2022) is a conditional variational autoencoder-based generation model. GraphBP (Liu et al., 2022), AR (Luo et al., 2021), and Pocket2Mol (Peng et al., 2022) are atom-wise autoregressive generation approaches. FLAG (Zhang et al., 2022) generates molecules fragment-by-fragment in an autoregressive fashion. TargetDiff (Guan et al., 2022) and DecompDiff (Guan et al., 2023) are diffusion-based generation methods.

Evaluation. To compare with the existing state-of-the-art generation models more fairly and practically, we improve the evaluation framework by constraining the molecular weights of the generated molecules in the same range (detailed in Appendix C), which is different from previous evaluations since there is a strong correlation between Vina Score and molecular weight (Xu et al., 2022a). Specifically, we evaluate the generated molecules from three perspectives: (1) molecular structure validity: we analyze the atom distance and bond angle respectively first, by calculating the Jensen-Shannon divergences (JSD) between the generated molecules and the reference set. In addition, we also calculate the JSD between the generated molecules and the force-filed optimized ones, which does not rely on a specific reference set and achieves a more generalized and realistic estimation. Furthermore, to evaluate the whole structure comprehensively, we propose a new metric called conformer RMSD, which is specified in Section 4.2. (2) pharmaceutical properties: we choose the two most commonly used metrics which are also important reference indicators for pharmaceutical chemists in practical development: Synthetic Accessibility (SA) and Quantitative Estimation of Drug-likeness (QED), which follow the setup of Guan et al. (2023). (3) binding affinity: we also evaluate the binding affinity of the generated molecules with AutoDock Vina (Eberhardt et al., 2021). Following the setup of Guan et al. (2023), we report both the mean and the median value of four metrics: Vina Score, Vina Min, Vina Dock, and High Affinity. Additionally, we propose another two metrics, i.e., Vina Score^∗ and Vina Min^∗, which are specified in Section 4.3.

4.2 Molecular Structure Validity Analysis

Firstly, we evaluate the structure validity by analyzing the distributions of all-atom distances and carbon-carbon pair distances and comparing them against the corresponding reference empirical distributions in Figure 3. For overall atom distances, AutoDiff achieves the lowest JSD compared to other autoregressive-based approaches and competitive performance compared to diffusion-based approaches, which are similar to the results of carbon-carbon pair distances scenario. In addition, we compute the bond angle distributions of the generated molecules and compare them against the reference set (Table 1, top rows), and similarly, AutoDiff achieves comparable performance as well.

Nevertheless, the capacity of JSD computed against the reference empirical distributions is limited to evaluate the structure validity exactly, since it prefers molecules that are structure-similar to the reference set, while not the ones that are dissimilar to the reference set but actually structure-valid. Therefore, we propose to compute JSD of the generated molecules against their force field optimized results rather than the reference set, as the outputs of the force field are generally considered to be approximate correct structures. The new metric is more generalized and alleviates the bias arose in the evaluation before. As shown in Table 1 (bottom rows), AutoDiff outperforms other baselines and achieves the best performance, and the results of JSD are close to the ones of the reference set, which means AutoDiff is capable of learning molecule structures with valid bond angles.

Table 1: JSD between bond angle distributions of the reference and the generated molecules (top), and of the generated and the force-field optimized molecules (bottom). The best two results are highlighted with bold text and underlined text, respectively.

Angle	Ref	li GAN	Graph BP	AR	Pocket 2Mol	Target Diff	Decomp Diff	FLAG	Ours
CCC	-	0.60	0.38	0.33	0.34	0.33	0.26	0.40	0.31
CCO	-	0.64	0.31	0.45	0.40	0.38	0.29	0.44	0.27
CNC	-	0.62	0.45	0.38	0.24	0.37	0.29	0.53	0.42
NCC	-	0.63	0.32	0.40	0.36	0.35	0.25	0.44	0.32
CC=O	-	0.65	0.36	0.48	0.36	0.36	0.25	0.45	0.30
CCC	0.14	0.49	0.23	0.32	0.31	0.36	0.38	0.31	0.22
CCO	0.20	0.62	0.31	0.47	0.43	0.44	0.43	0.40	0.24
CNC	0.20	0.46	0.25	0.31	0.29	0.28	0.31	0.29	0.23
NCC	0.20	0.52	0.20	0.37	0.34	0.34	0.35	0.32	0.25
CC=O	0.30	0.64	0.24	0.56	0.47	0.44	0.34	0.38	0.23

Table 2: Results of binding affinities and pharmaceutical properties. Top 2 results are highlighted with bold text and underlined text, respectively.

Methods	Vina Score(↓)		Vina Score^∗(↓)		Vina Min(↓)		Vina Min^∗(↓)		Vina Dock(↓)		High Affinity(↑)		QED (↑)		SA (↑)
Methods	Avg.	Med.	Avg.	Med.	Avg.	Med.	Avg.	Med.	Avg.	Med.	Avg.	Med.	Avg.	Med.	Avg.	Med.
Reference	-6.36	-6.46	-5.65	-5.94	-6.71	-6.49	-6.32	-6.18	-7.45	-7.26	–	–	0.48	0.47	0.73	0.74
LiGAN	–	–	–	–	–	–	–	–	-8.49	-8.39	0.64	0.69	0.35	0.30	0.54	0.52
GraphBP	–	–	–	–	–	–	–	–	-2.49	-3.96	0.10	0.03	0.49	0.50	0.49	0.49
AR	-4.98	-6.40	-3.37	-4.23	-6.51	-6.76	-5.33	-5.53	-7.67	-7.40	0.58	0.69	0.47	0.46	0.56	0.55
Pocket2Mol	-6.37	-6.56	-4.72	-4.88	-7.39	-7.54	-5.98	-6.26	-8.58	-8.63	0.68	0.79	0.54	0.54	0.71	0.71
FLAG	51.03	42.13	50.08	41.90	9.42	-2.23	8.63	-2.12	-5.49	-6.04	0.26	0.10	0.35	0.31	0.49	0.48
TargetDiff	-5.83	-6.36	-2.64	-3.79	-6.87	-6.89	-4.50	-4.84	-7.85	-7.94	0.60	0.60	0.50	0.50	0.59	0.58
DecompDiff	-3.76	-4.72	-2.33	-3.63	-5.29	-5.59	-4.34	-4.86	-7.03	-7.17	0.37	0.24	0.44	0.43	0.68	0.68
AutoDiff	-5.25	-5.33	-5.02	-5.18	-6.91	-7.06	-6.69	-6.83	-8.86	-8.94	0.73	0.77	0.57	0.58	0.76	0.77

To further evaluate the structure validity comprehensively in addition to separate analysis of atom distances and bond angles, we design another new metric conformer RMSD inspired by the conformer matching (Jing et al., 2022): for a molecule $\mathcal{G}$ we optimize its conformation $C_{\mathcal{G}}$ by force-field to obtain $C_{\mathcal{G}}^{\mathcal{FF}}$ , then we modify torsion angels of the $C_{\mathcal{G}}^{\mathcal{FF}}$ to match $C_{\mathcal{G}}$ . The optimal match $(\hat{C}_{\mathcal{G}}^{\mathcal{FF}},C_{\mathcal{G}})$ can be found by running a differential evolution optimization procedure over the torsion angles, and $\mathrm{RMSD}(\hat{C}_{\mathcal{G}}^{\mathcal{FF}},C_{\mathcal{G}})$ is defined as conformer RMSD. As shown in Figure 4, AutoDiff achieves the lowest conformer RMSD compared to all other baselines, which is also close to the result of the reference set. The result suggests that AutoDiff can generate molecules with more valid structures and conformations.

Case study. To better understand the structure validity, a case study is conducted and reported in Figure 5. We can see that FLAG tends to generate unrealistic structures, and TargetDiff as well as DecompDiff generate approximate valid structures but incorrect local details such as invalid bond angles, while AutoDiff can generate rational molecules with more realistic structures and conformations, which can be attributed to the superiority of the conformal motif strategy.

4.3 Binding Affinities and Pharmaceutical Properties

In accordance with practice, we evaluate the binding affinity by computing Vina Score, Vina Min, Vina Dock and High Affinity first. It should be noted that our experiments are conducted under the constraint of molecular weights. In Table 2, we can see that Pocket2Mol outperforms other models, while AutoDiff achieves competitive performance similar to Pocket2Mol and is better than other baselines.

However, the metrics Vina Score and Vina Min are not robust enough since they do not take the structure validity into account, which means molecules with unrealistic structures may still achieve decent Vina scores. To address this issue, we propose another two metrics, i.e., Vina Score^∗ and Vina Min^∗, which compute Vina scores for the molecules that are preprocessed with conformer matching (Jing et al., 2022) rather than the ones generated by models. These two new metrics can evaluate the binding affinity more practically which ensure the approximate correctness of molecular local structures and conformations (bond lengths and bond angles). Table 2 shows that results of Vina Score^∗ and Vina Min^∗ are worse than Vina Score and Vin Min for almost all the models, which are reasonable since the metrics are more strict than before due to taking structure validity into account when docking, while we can see that AutoDiff achieves the best performance and the Vina score values also fall into a good range. It is unexpected that Pocket2Mol acquires the second-best results which are also very impressive. Furthermore, AutoDiff also obtains the highest QED and SA scores. All the results again suggest that AutoDiff is suitable for SBDD task and it can generate more drug-like molecules with good binding affinities.

5 Conclusion

In this paper, we propose AutoDiff, a diffusion-based fragment-wise autoregressive generation approach, which can generate realistic molecules with valid structures and conformations based on the conformal motif. Moreover, we also improve the evaluation framework of SBDD, which can benchmark the generation models fairly and practically. In future work a fine-tuning module could be introduced to refine the intermediates during generation.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

Bento et al. (2020) Bento, A., Hersey, A., Félix, E., Landrum, G., Gaulton, A., Atkinson, F., Bellis, L., De Veij, M., and Leach, A. An open source chemical structure curation pipeline using rdkit. j cheminform 12: 51, 2020.
Berman et al. (2000) Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., and Bourne, P. E. The protein data bank. Nucleic acids research, 28(1):235–242, 2000.
Bjerrum & Threlfall (2017) Bjerrum, E. J. and Threlfall, R. Molecular generation with recurrent neural networks (rnns). arXiv preprint arXiv:1705.04612, 2017.
Bohacek et al. (1996) Bohacek, R. S., McMartin, C., and Guida, W. C. The art and practice of structure-based drug design: a molecular modeling perspective. Medicinal research reviews, 16(1):3–50, 1996.
Ceylan et al. (2023) Ceylan, D., Huang, C.-H. P., and Mitra, N. J. Pix2video: Video editing using image diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 23206–23217, 2023.
Corso et al. (2022) Corso, G., Stärk, H., Jing, B., Barzilay, R., and Jaakkola, T. Diffdock: Diffusion steps, twists, and turns for molecular docking. arXiv preprint arXiv:2210.01776, 2022.
Dhariwal & Nichol (2021) Dhariwal, P. and Nichol, A. Diffusion models beat gans on image synthesis. Advances in neural information processing systems, 34:8780–8794, 2021.
Eberhardt et al. (2021) Eberhardt, J., Santos-Martins, D., Tillack, A. F., and Forli, S. Autodock vina 1.2. 0: New docking methods, expanded force field, and python bindings. Journal of chemical information and modeling, 61(8):3891–3898, 2021.
Flam-Shepherd et al. (2022) Flam-Shepherd, D., Zhigalin, A., and Aspuru-Guzik, A. Scalable fragment-based 3d molecular design with reinforcement learning. arXiv preprint arXiv:2202.00658, 2022.
Francoeur et al. (2020) Francoeur, P. G., Masuda, T., Sunseri, J., Jia, A., Iovanisci, R. B., Snyder, I., and Koes, D. R. Three-dimensional convolutional neural networks and a cross-docked data set for structure-based drug design. Journal of chemical information and modeling, 60(9):4200–4215, 2020.
Fu et al. (2021) Fu, T., Xiao, C., Li, X., Glass, L. M., and Sun, J. Mimosa: Multi-constraint molecule sampling for molecule optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp. 125–133, 2021.
Fu et al. (2022) Fu, T., Gao, W., Coley, C., and Sun, J. Reinforced genetic algorithm for structure-based drug design. Advances in Neural Information Processing Systems, 35:12325–12338, 2022.
Guan et al. (2022) Guan, J., Qian, W. W., Peng, X., Su, Y., Peng, J., and Ma, J. 3d equivariant diffusion for target-aware molecule generation and affinity prediction. In The Eleventh International Conference on Learning Representations, 2022.
Guan et al. (2023) Guan, J., Zhou, X., Yang, Y., Bao, Y., Peng, J., Ma, J., Liu, Q., Wang, L., and Gu, Q. DecompDiff: Diffusion models with decomposed priors for structure-based drug design. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 11827–11846. PMLR, 23–29 Jul 2023.
Guo et al. (2023) Guo, Y., Du, C., Chen, X., and Yu, K. Emodiff: Intensity controllable emotional text-to-speech with soft-label guidance. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2023.
Hoogeboom et al. (2022) Hoogeboom, E., Satorras, V. G., Vignac, C., and Welling, M. Equivariant diffusion for molecule generation in 3d. In International conference on machine learning, pp. 8867–8887. PMLR, 2022.
Jin et al. (2018) Jin, W., Barzilay, R., and Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. In International conference on machine learning, pp. 2323–2332. PMLR, 2018.
Jin et al. (2020a) Jin, W., Barzilay, R., and Jaakkola, T. Hierarchical generation of molecular graphs using structural motifs. In International conference on machine learning, pp. 4839–4848. PMLR, 2020a.
Jin et al. (2020b) Jin, W., Barzilay, R., and Jaakkola, T. Multi-objective molecule generation using interpretable substructures. In International conference on machine learning, pp. 4849–4859. PMLR, 2020b.
Jing et al. (2022) Jing, B., Corso, G., Chang, J., Barzilay, R., and Jaakkola, T. Torsional diffusion for molecular conformer generation. Advances in Neural Information Processing Systems, 35:24240–24253, 2022.
Kabsch (1976) Kabsch, W. A solution for the best rotation to relate two sets of vectors. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography, 32(5):922–923, 1976.
Li et al. (2022) Li, X., Thickstun, J., Gulrajani, I., Liang, P. S., and Hashimoto, T. B. Diffusion-lm improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328–4343, 2022.
Lin et al. (2022) Lin, H., Huang, Y., Liu, M., Li, X., Ji, S., and Li, S. Z. Diffbp: Generative diffusion of 3d molecules for target protein binding. arXiv preprint arXiv:2211.11214, 2022.
Lin et al. (2023) Lin, Z., Gong, Y., Shen, Y., Wu, T., Fan, Z., Lin, C., Duan, N., and Chen, W. Text generation with diffusion language models: A pre-training approach with continuous paragraph denoise. In International Conference on Machine Learning, pp. 21051–21064. PMLR, 2023.
Liu et al. (2022) Liu, M., Luo, Y., Uchino, K., Maruhashi, K., and Ji, S. Generating 3d molecules for target protein binding. In International Conference on Machine Learning, pp. 13912–13924. PMLR, 2022.
Lovelace et al. (2022) Lovelace, J., Kishore, V., Wan, C., Shekhtman, E., and Weinberger, K. Q. Latent diffusion for language generation. arXiv preprint arXiv:2212.09462, 2022.
Lu et al. (2023) Lu, W., Zhang, J., Weifeng, H., Zhang, Z., Li, C., and Zheng, S. Dynamicbind: Predicting ligand-specific protein-ligand complex structure with a deep equivariant generative model. In NeurIPS 2023 Generative AI and Biology (GenBio) Workshop, 2023.
Luo et al. (2021) Luo, S., Guan, J., Ma, J., and Peng, J. A 3d generative model for structure-based drug design. Advances in Neural Information Processing Systems, 34:6229–6239, 2021.
Nichol et al. (2021) Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I., and Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv preprint arXiv:2112.10741, 2021.
Pascual et al. (2023) Pascual, S., Bhattacharya, G., Yeh, C., Pons, J., and Serrà, J. Full-band general audio synthesis with score-based diffusion. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE, 2023.
Peng et al. (2022) Peng, X., Luo, S., Guan, J., Xie, Q., Peng, J., and Ma, J. Pocket2mol: Efficient molecular sampling based on 3d protein pockets. In International Conference on Machine Learning, pp. 17644–17655. PMLR, 2022.
Powers et al. (2022) Powers, A. S., Yu, H. H., Suriana, P., and Dror, R. O. Fragment-based ligand generation guided by geometric deep learning on protein-ligand structure. bioRxiv, pp. 2022–03, 2022.
Ragoza et al. (2022) Ragoza, M., Masuda, T., and Koes, D. R. Generating 3d molecules conditional on receptor binding sites with deep generative models. Chemical science, 13(9):2701–2713, 2022.
Rombach et al. (2022) Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 10684–10695, 2022.
Schneuing et al. (2022) Schneuing, A., Du, Y., Harris, C., Jamasb, A., Igashov, I., Du, W., Blundell, T., Lió, P., Gomes, C., Welling, M., et al. Structure-based drug design with equivariant diffusion models. arXiv preprint arXiv:2210.13695, 2022.
Segler et al. (2018) Segler, M., Kogej, T., Tyrchan, C., and Waller, M. Generating focused molecule libraries for drug discovery with recurrent neural networks. acs cent sci 4 (1): 120–131. arXiv preprint arXiv:1701.0132, 9, 2018.
Shi et al. (2020) Shi, C., Xu, M., Zhu, Z., Zhang, W., Zhang, M., and Tang, J. Graphaf: a flow-based autoregressive model for molecular graph generation. arXiv preprint arXiv:2001.09382, 2020.
Tumanyan et al. (2023) Tumanyan, N., Geyer, M., Bagon, S., and Dekel, T. Plug-and-play diffusion features for text-driven image-to-image translation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1921–1930, 2023.
Xu et al. (2022a) Xu, M., Shen, C., Yang, J., Wang, Q., and Huang, N. Systematic investigation of docking failures in large-scale structure-based virtual screening. ACS omega, 7(43):39417–39428, 2022a.
Xu et al. (2022b) Xu, M., Yu, L., Song, Y., Shi, C., Ermon, S., and Tang, J. Geodiff: A geometric diffusion model for molecular conformation generation. arXiv preprint arXiv:2203.02923, 2022b.
You et al. (2018) You, J., Ying, R., Ren, X., Hamilton, W., and Leskovec, J. Graphrnn: Generating realistic graphs with deep auto-regressive models. In International conference on machine learning, pp. 5708–5717. PMLR, 2018.
Yuan et al. (2022) Yuan, H., Yuan, Z., Tan, C., Huang, F., and Huang, S. Seqdiffuseq: Text diffusion with encoder-decoder transformers. arXiv preprint arXiv:2212.10325, 2022.
Zhang & Liu (2023) Zhang, Z. and Liu, Q. Learning subpocket prototypes for generalizable structure-based drug design. arXiv preprint arXiv:2305.13997, 2023.
Zhang et al. (2022) Zhang, Z., Min, Y., Zheng, S., and Liu, Q. Molecule generation for target protein binding with structural motifs. In The Eleventh International Conference on Learning Representations, 2022.

Appendix A Terminologies and notations

We list terminologies and notations in Table 3

Table 3: Terminologies and notations.

Notations	Explanations
CCS	candidate connection sites
RCCS	reduced candidate connection sites
FCS	focal connection site
$\mathcal{P}$	protein pocket
$a_{P}^{i}$	the atom type of the $i$ -th atom in protein pocket
$\mathbf{r}_{P}^{i}$	the coordinate of the $i$ -th heavy atom in protein pocket
$\mathcal{G}$	drug molecule (ligand)
$a_{G}^{i}$	the atom type of the $i$ -th atom in drug molecule
$\mathbf{r}_{G}^{i}$	the coordinate of the $i$ -th heavy atom in drug molecule
$\mathcal{W}$	motif vocabulary
$\mathcal{W}_{r}$	ring-like motif vocabulary
$\mathcal{W}_{c*}$	chain-like motif vocabulary
$\mathcal{G}^{(t)}$	drug molecule after generating $t$ motifs
$\phi$	generation model
$\mathcal{G}_{H}$	$\mathcal{G}_{H}=(\mathcal{V}_{H},\mathcal{E}_{H})$ , heterogeneous geometric graphs
$\mathcal{V}_{H}$	$\mathcal{V}_{H}=(\mathcal{V}_{l},\mathcal{V}_{p})$ , collection of all the heavy atoms of ligand fragment (l) and protein pocket (p)
$\mathcal{E}_{H}$	edge set, $\mathcal{E}_{H}=(\mathcal{E}_{ll},\mathcal{E}_{lp},\mathcal{E}_{pl},\mathcal{E% }_{pp})$
$\mathrm{MLP}$	multiple layer perceptron
$\mathbf{h}_{i}$	node $i$ ’s representation
$\mathrm{h}_{i}$	node $i$ ’s scalar features
$\mathring{A}$	Angstrom ( $10^{-10}$ m)
$\oplus$	vector addition
$\otimes_{\psi}$	spherical tensor product with weight $\psi$
$\mathrm{BN}$	batch normalization
$\mathcal{N}_{i}^{(c)}$	neighbors of node $i$ of type $c$ in radius graphs.
$c_{a}$	atom type indicating whether the atom is in ligand or protein, $c_{a}\in\{l,p\}$ .
$\mathcal{W}^{(t)}$	the added motif at the $t$ -th iteration
$\mathbf{R}$	rotation matrix
$\|\|\cdot\|\|$	$l_{2}$ norm of a vector
$T_{b}$	torsion score of bond $b$

Appendix B Details in Conformal Motif Extraction

B.1 Definition of Freely Rotatable Bond

In this paper, the freely rotatable bond is defined as follows: if cutting a bond creates two connected components of the molecule, and each connected component has at least one atom that is not in the direction of the severed bond, then the bond is considered to be freely rotatable. We only count single bonds as rotatable. Different from previous definitions (Jing et al., 2022; Zhang et al., 2022), our definition guarantees that a freely rotatable bond is chemically rotatable and changes molecular conformation as it rotates.

B.2 Example of conformal motif extraction

In Figure 6, we provide an example of conformal motif extraction.

Appendix C Details of Molecular Weight Constraint

By analysis we found that the molecular weights of molecules generated by different models always vary differently, autoregressive-based approaches tend to generate small molecules in an atom-by-atom (or motif-by-motif) fashion, while diffusion-based approaches generate molecules in a one-shot fashion, they determine the number of atoms before generation, thus allowing for more flexible control of the molecular weight. Taking the correlation between vina score and molecular weight into consideration, we constrain the molecular weights of the generated molecules in the same range to conduct a fair evaluation. Considering the statistical number of generated molecules and the similar molecular weight distribution between the molecules generated by TargetDiff and the reference set, for each testing protein pocket $\mathcal{P}_{i}$ , we drop the top 20% and the bottom 20% of the molecules generated by TargetDiff according to the molecular weight and calculate the mean $\mu_{i}$ and standard deviation $\sigma_{i}$ of the remaining molecules. Then we define the valid molecular weight range $H_{i}$ as $[\mu_{i}-\sigma_{i},\mu_{i}+\sigma_{i}]$ . For protein pocket $\mathcal{P}_{i}$ , only the molecules whose molecular weight fall in the range $H_{i}$ will be evaluated.

For each model, 100 molecules are generated and sampled in the range $H_{i}$ , which are used to be evaluated in our experiments. Table 4 shows the molecular weights of molecules generated by various SBDD models under default settings and the molecular weight constraint settings.

Table 4: Molecular weights for various models under default settings (MolWt1) and the molecular weight constraint settings (MolWt2).

	liGAN	GraphBP	AR	Pocket2Mol	FLAG	TargetDiff	DecompDiff	AutoDiff
MolWt1	294.87±25.20	344.54±171.75	250.48±58.84	242.75±51.99	287.33±81.07	347.34±85.03	581.92±42.30	254.62±60.98
MolWt2	348.29±14.01	336.59±21.89	328.95±15.68	335.47±15.45	337.57±20.91	336.71±21.72	335.36±21.44	331.60±18.39

Appendix D Implementation Details

For node features of molecules, we use atom symbol, formal charge, number of explicit Hs, number of total Hs, and hybridization type. For node features of protein atoms, we use element types, the amino acids they belong to, and whether they are backbone or side-chain atoms. Edge features include the distances encoded with radial basis functions and bond type. The input scalar features of nodes and edges are concatenated with sinusoidal embeddings of diffusion time.

In the training stage, we first construct motif trees of molecules, then we traverse motif trees in a breadth-first (BFS) order to get a traverse sequence $S$ . We sample a mask ratio from the uniform distribution $\mathrm{U}[0,1]$ and mask the corresponding number of the last $K$ motifs in $S$ . Connection sites are determined during the masking procedure.