communications chemistry

ARTICLE
https://doi.org/10.1038/s42004-023-01054-6

Variational autoencoder-based chemical latent space for large molecular structures with 3D complexity

Toshiki Ochiai, Tensei Inukai, Manato Akiyama, Kairi Furui, Masahito Ohue, Nobuaki Matsumori, Shinsuke Inuki, Motonari Uesugi, Toshiaki Sunazuka, Kazuya Kikuchi, Hideaki Kakeya & Yasubumi Sakakibara

Abstract

The structural diversity of chemical libraries, which are systematic collections of compounds that have the potential to bind to biomolecules, can be represented by a chemical latent space. A chemical latent space is a projection of compound structures into a mathematical space based on several molecular features, and it can express the structural diversity within a compound library in order to explore a broader chemical space and generate novel compound structures for drug candidates. In this study, we developed a deep-learning method, called NP-VAE (Natural Product-oriented Variational Autoencoder), based on a variational autoencoder for managing hard-to-analyze datasets from DrugBank and large molecular structures such as natural compounds with chirality, an essential factor in the 3D complexity of compounds. NP-VAE succeeded in constructing a chemical latent space from large-sized compounds that could not be handled by existing methods, achieving higher reconstruction accuracy and demonstrating stable performance as a generative model across various indices. Furthermore, by exploring the acquired latent space, we succeeded in comprehensively analyzing a compound library containing natural compounds and generating novel compound structures with optimized functions.

It is estimated that there are approximately $10^{60}$ variations in all possible compound structures, even when limited to small molecules with a molecular weight of 500 or less. Structural diversity within compound libraries is crucial for discovering new pharmaceutical compounds, necessitating coverage of as many candidates as possible from this vast pool. The structural diversity of compounds in a compound library can be represented by a chemical latent space, which projects compound structures onto a mathematical space based on various molecular features, such as fingerprints. Building a chemical latent space that potentially contains numerous unknown compound structures beyond existing libraries is an important research topic in cheminformatics. On the other hand, many of the natural compounds produced by living organisms have complex and unique structures compared to conventional drugs, and they exhibit high biological activity. By applying state-of-the-art generative artificial intelligence techniques to the heterogeneous data from both the approved drug database and natural product compound libraries, it becomes possible to virtually generate and design novel compound structures that combine the characteristics of both types of data. This approach has become a global trend in cutting-edge drug discovery, known as DMTA (design-make-test-analyze).

Various deep-learning models have been developed with the aim of constructing chemical latent spaces and computationally generating new compound structures. The variational autoencoder (VAE) is one of the representative deep-learning methods for constructing chemical latent spaces. A VAE consists of two components: an encoder, which transforms input values into low-dimensional variables called latent variables, and a decoder, which transforms latent variables into corresponding output values. By exploring the chemical latent space, unknown compound structures not present in the training data can be generated. VAEs that handle compound structures as inputs and outputs are broadly classified into two types: SMILES-based and graph-based. Chemical VAE (CVAE), the earliest of the SMILES-based methods, takes as input SMILES strings, which represent compound structures as strings, and constructs a latent space by projecting a compound library into a low-dimensional space. CVAE was a pioneering study that applied a VAE to construct a latent space of chemical compounds. However, because CVAE outputs SMILES strings symbol by symbol, most of its output does not conform to chemical rules and fails to form a valid compound structure. CVAE therefore added an ad hoc step that validates the chemical structures output by the decoder and discards the invalid ones. Grammar VAE (GVAE) and Syntax-Directed VAE (SD-VAE) were developed as models capable of generating more valid compound structures by focusing on the grammatical structure of SMILES strings. Recently, studies applying the SMILES representation have again become active under the name of chemical language models (CLMs). These state-of-the-art methods adopt recurrent neural networks (RNNs) such as LSTM to learn the SMILES grammar from a large amount of pretraining data and apply transfer learning to the compound structures of interest. Nevertheless, SMILES-based models still generate invalid SMILES strings and hence require these invalid outputs to be filtered out. In addition, one crucial difference from the VAE approach is that these RNN-based models are merely generative: they do not explicitly construct a vector space (latent space) whose latent variables are projected from, and can be mapped back to, actual compounds. They therefore do not provide the capability to explore the latent space of structurally diverse compounds in the library.
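As a rough illustration of the encoder/decoder structure described in this paragraph, the following toy sketch samples a latent variable with the reparameterization trick and evaluates the KL term against a standard normal prior. It is not the NP-VAE architecture: the linear "encoder" and "decoder", their weights, and all function names are hypothetical placeholders chosen only to keep the example self-contained.

```python
import math
import random

def encode(x, w_mu, w_logvar):
    """Toy linear encoder (hypothetical): map an input vector to the mean and
    log-variance of a diagonal Gaussian over a lower-dimensional latent variable."""
    mu = [sum(w * xi for w, xi in zip(row, x)) for row in w_mu]
    logvar = [sum(w * xi for w, xi in zip(row, x)) for row in w_logvar]
    return mu, logvar

def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def decode(z, w_dec):
    """Toy linear decoder (hypothetical): map a latent vector back to input space."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in w_dec]

def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, logvar))

rng = random.Random(0)
x = [1.0, 0.0, 1.0, 1.0]                                    # 4-dimensional toy input
w_mu = [[0.5, 0.1, -0.2, 0.3], [0.0, 0.4, 0.2, -0.1]]       # 4 -> 2 (assumed weights)
w_logvar = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]     # log-variance 0, i.e. sigma = 1
w_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [1.0, -1.0]]   # 2 -> 4 (assumed weights)

mu, logvar = encode(x, w_mu, w_logvar)
z = reparameterize(mu, logvar, rng)   # 2-dimensional latent variable
x_hat = decode(z, w_dec)              # 4-dimensional reconstruction
kl = kl_to_standard_normal(mu, logvar)
```

The latent variable here has fewer dimensions than the input, which is the property that makes exploring a VAE latent space tractable.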

Graph-based models include Constrained Graph VAE (CG-VAE) and Junction Tree VAE (JT-VAE). These models represent compound structures as graphs defined by adjacency relationships between atoms, enabling them to generate completely valid outputs. In particular, JT-VAE achieved high reconstruction accuracy by treating molecular graphs not only as graph structures but also as tree structures. However, these models were all designed for small molecules and could not handle large compound structures due to their high spatial order. As a result, Hierarchical VAE (HierVAE) was developed to apply VAEs to larger compound structures. By handling compound structures in relatively large fragment units, HierVAE demonstrated high accuracy on datasets composed of large compounds with repeating structures. Nevertheless, challenges remain, such as the inability to consider stereochemistry and the difficulty of applying the model to massive and complex compound structures with diverse internal structures, like natural compounds. It is worth noting that natural compounds are quite different from massive compounds with uniform internal structures, like polymers.

Another state-of-the-art generative model is the flow-based model, which uses deep learning to project and generate compound structures. Flow-based models map input data to a latent space by repeatedly applying invertible functions, which guarantees that the inverse transformation exists. In other words, the reconstruction accuracy of flow-based models is always guaranteed to be 100%, regardless of the degree of learning. The graph-based flow model MoFlow represents compound structures as two types of bit matrices and can acquire completely invertible latent variables by applying a normalizing flow to each. However, unlike the case with VAEs, this guarantee does not necessarily indicate the accuracy or continuity of the resulting latent space. Furthermore, the latent space of flow-based models becomes high-dimensional in order to guarantee reversibility. Because the latent space has the same dimension as the input, exploring it is much harder than with VAEs, which project onto a lower-dimensional space. Moreover, flow-based learning can be quite unstable and is prone to gradient explosion as the input dimension increases. Therefore, the application of flow-based models to compound data is limited to small compounds.
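To see why invertibility gives perfect reconstruction irrespective of training, consider a single affine coupling step, a common building block of normalizing flows. The sketch below is a generic toy, not MoFlow's actual layers; the `scale` and `shift` functions are arbitrary, untrained stand-ins, and the input is assumed to have even length.

```python
import math

def coupling_forward(x, scale, shift):
    """One affine coupling step: the first half passes through unchanged and
    parameterizes an invertible affine map applied to the second half.
    Assumes len(x) is even."""
    half = len(x) // 2
    a, b = x[:half], x[half:]
    s = [scale(v) for v in a]
    t = [shift(v) for v in a]
    return a + [bi * math.exp(si) + ti for bi, si, ti in zip(b, s, t)]

def coupling_inverse(z, scale, shift):
    """Exact inverse of coupling_forward, using the same scale/shift functions."""
    half = len(z) // 2
    a, b = z[:half], z[half:]
    s = [scale(v) for v in a]
    t = [shift(v) for v in a]
    return a + [(bi - ti) * math.exp(-si) for bi, si, ti in zip(b, s, t)]

# Arbitrary "untrained" parameter functions (hypothetical).
scale = lambda v: 0.3 * v - 0.1
shift = lambda v: math.sin(v)

x = [0.2, -1.0, 3.0, 0.5]
z = coupling_forward(x, scale, shift)       # latent has the same dimension as x
x_rec = coupling_inverse(z, scale, shift)   # recovers x up to floating-point error
```

The round trip succeeds for any choice of `scale` and `shift`, which is why perfect reconstruction says nothing about how well the latent space is organized, and the latent vector is as long as the input, which is the dimensionality cost noted in the text.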

As discussed above, the deep-learning models developed thus far have struggled to handle large, complex and heterogeneous compound structures, such as natural compounds, effectively. Natural compounds produced by organisms such as microbes and plants often possess novel structures, and because they are produced during biological processes, many of them exhibit high biological activity. Indeed, natural products have been widely used as drugs, such as the antibiotics penicillin and streptomycin and the anticancer agents bryostatin and epothilone. Therefore, constructing a chemical latent space from a compound library that includes natural compounds plays a crucial role in drug discovery.

In this study, we developed a VAE-based deep-learning method, called Natural-Product Compound Variational Autoencoder (NP-VAE), to handle compounds with complex molecular structures, like natural compounds, and to acquire a chemical latent space onto which large molecular structures are projected. NP-VAE is a graph-based VAE that combines an algorithm for effectively decomposing compound structures into fragment units and converting them into tree structures with Extended Connectivity Fingerprints (ECFP) and Tree-LSTM, a type of recurrent neural network. NP-VAE, which has 12 million parameters, was developed through significant improvements to the algorithms of JT-VAE and HierVAE.

The first objective of this study is to obtain a highly interpretable chemical latent space that includes middle/large molecular structures like natural compounds using NP-VAE. We construct a latent space that includes the hard-to-analyze DrugBank and a large collection of natural compounds, which previous studies could not handle, and effectively perform statistical and comprehensive functional analysis of compound libraries. Furthermore, NP-VAE is developed to deal with chirality, which is an essential factor in the 3D structure of compounds. The second objective is to generate novel compound structures optimized for a target function (property) by exploring the acquired latent space. NP-VAE has a mechanism to train the chemical latent space incorporating functional information along with structural information. The obtained chemical latent space enables the design of optimized compound structures as molecular-targeted drugs by generating new compounds from the sub-space surrounding an existing pharmaceutical drug, such as an anticancer drug. Furthermore, by combining the generation of novel compound structures by NP-VAE with docking analysis, we demonstrate the usefulness of this method as an in-silico drug discovery tool.

Table 1 Performance comparison of generalization ability between NP-VAE and existing methods (baseline models).

Model	2D reconstruction accuracy (test)	Validity
NP-VAE	0.813	1.000
HierVAE	0.799	1.000
JT-VAE	0.585	1.000
CG-VAE	0.424	1.000
CVAE	0.215	0.931

The reconstruction accuracy and validity for test compounds in the evaluation dataset were compared. The reconstruction accuracy and validity of existing methods were taken from values reported in the previous study.

Results and discussions

Performance evaluation of NP-VAE as reconstruction and generative model. First, we evaluated the reconstruction accuracy of the proposed NP-VAE for test compounds that were not included in the training compound data, which is referred to as the generalization ability. Evaluating the generalization ability is crucial because it allows us to verify how accurately the constructed chemical latent space has been interpolated. Following the HierVAE study, which conducted an evaluation of the generalization ability, we used St. John et al.'s dataset (hereinafter referred to as the evaluation dataset). This dataset was divided into 76,000 training compounds, 5000 validation compounds, and 5000 test compounds, exactly the same as in the previous study. After training on the training compounds, the reconstruction accuracy and validity for test compounds were calculated. Reconstruction accuracy was determined using the Monte Carlo method for the 5000 test compounds. In other words, for each test compound, 10 encodings were performed, and 10 decodings were conducted for each encoding, resulting in 100 output compounds. The proportion of compound structures that matched between the input to the encoder and the output from the decoder was calculated. To determine validity, 1000 latent vectors were sampled from the prior distribution $\mathcal{N}(0, I)$, and after decoding each of them 100 times, the proportion of chemically valid output compounds was examined using RDKit. Four state-of-the-art compound VAEs, namely CVAE, CG-VAE, JT-VAE, and HierVAE, were compared as baseline models. As shown in Table 1, NP-VAE demonstrated higher reconstruction accuracy for the test compounds than the previous models. Additionally, since NP-VAE generates compounds in substructure units (fragments) rather than single-atom units, the generation success rate is always 100%. These results indicate that NP-VAE is a high-performance generative model, suggesting that the chemical latent space constructed by NP-VAE contains sufficient information to accurately estimate unknown compounds from known compounds.

Table 2 Comparison of the maximum number of atoms and molecular weight of the compounds included in the dataset.

Dataset	Maximum number of atoms	Maximum molecular weight
Drug-and-natural-product dataset	551	8272
Restricted dataset	100	1626
DrugBank	551	8272
Project dataset	457	6574
ZINC		500

In the maximum number of atoms, only non-hydrogen atoms were counted. Compared to the ZINC dataset used in previous studies, the drug-and-natural-product dataset (the combined drug and natural product dataset of DrugBank and the project dataset) includes compounds with more than 13 times larger molecular weights.
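The Monte Carlo protocol used for the generalization evaluation (10 encodings per test compound, 10 decodings per encoding, and validity from repeated decodings of prior samples) can be sketched as follows. This is only an illustration under stated assumptions: `encode`, `decode`, `sample_prior`, and `is_valid` are hypothetical callables standing in for a trained model and a chemistry toolkit check, compounds are compared as plain strings (a real evaluation would compare canonicalized structures), and the pooled proportion over all outputs is one plausible reading of how the matches are aggregated.

```python
import random

def reconstruction_accuracy(compounds, encode, decode, n_enc=10, n_dec=10):
    """Fraction of decoded outputs identical to the input, pooled over
    n_enc stochastic encodings and n_dec stochastic decodings per compound."""
    matches = 0
    total = 0
    for compound in compounds:
        for _ in range(n_enc):
            z = encode(compound)
            for _ in range(n_dec):
                matches += decode(z) == compound
                total += 1
    return matches / total

def validity(sample_prior, decode, is_valid, n_latent=1000, n_dec=100):
    """Fraction of chemically valid outputs when each of n_latent prior
    samples is decoded n_dec times."""
    valid = 0
    for _ in range(n_latent):
        z = sample_prior()
        for _ in range(n_dec):
            valid += bool(is_valid(decode(z)))
    return valid / (n_latent * n_dec)

# Toy stand-ins (hypothetical): the "latent" is the string plus Gaussian noise,
# and decoding corrupts the string when the noise is large.
rng = random.Random(0)

def toy_encode(smiles):
    return (smiles, rng.gauss(0.0, 1.0))

def toy_decode(z):
    smiles, noise = z
    return smiles if abs(noise) < 1.0 else smiles + "C"

acc = reconstruction_accuracy(["CCO", "c1ccccc1"], toy_encode, toy_decode)
```

With a real model, `is_valid` would typically wrap a toolkit call that attempts to parse and sanitize the decoded structure.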

Next, we compared the performance of NP-VAE as a generative model with that of state-of-the-art generative models: the flow-based model MoFlow; the SMILES-based method employing CharRNN (character-level RNN), which is also referred to as SM-RNN and demonstrated high performance in a previous study; and the VAE model HierVAE.

Since our primary motivation is to develop a VAE model capable of handling large and complex molecules for the construction of a chemical latent space, we prepared a compound library consisting of approximately 30,000 compounds. This library (hereinafter referred to as the drug-and-natural-product dataset) combines around 10,000 compounds from DrugBank, a public database containing numerous approved drug compounds, and approximately 20,000 compounds from the project dataset. The project dataset is an original compound library collected from various laboratories through the Ministry of Education, Culture, Sports, Science, and Technology-designated project "Frontier Research on Chemical Communications" (ref. 27), in which this research participated. The project dataset mainly includes natural compounds, and compared to compounds in the frequently used ZINC database, it contains a number of complex and large molecules (see Supplementary Fig. S1 for illustration). However, the state-of-the-art VAE models JT-VAE and HierVAE and the flow-based model MoFlow were unable to handle compounds of this larger size, so we had to prepare a restricted dataset on which all existing methods can be executed. This restricted dataset was constructed by first reducing the drug-and-natural-product dataset to compounds with fewer than 100 non-hydrogen atoms and then removing some compounds that caused errors with HierVAE. We first compared the maximum compound size between the drug-and-natural-product dataset and the restricted dataset. Table 2 shows the maximum number of atoms and the maximum molecular weight of the compounds included in the drug-and-natural-product dataset, the restricted dataset, and three other databases. Note that JT-VAE required an impractical amount of computation time and did not complete the calculations even with the restricted dataset; it was therefore excluded from all subsequent experiments.
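The size restriction described here (keeping only compounds with fewer than 100 non-hydrogen atoms) can be illustrated with a small filter. The sketch below is a simplification under stated assumptions: it counts heavy atoms from a flat molecular formula string rather than from a parsed structure, it does not handle parentheses, charges, or isotopes, and the helper names and the example library are hypothetical.

```python
import re

# Element symbol followed by an optional count, e.g. "C9", "H8", "O4", "Cl".
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")

def heavy_atom_count(formula):
    """Count non-hydrogen atoms in a flat molecular formula such as 'C9H8O4'."""
    return sum(int(count) if count else 1
               for element, count in _TOKEN.findall(formula)
               if element != "H")

def restrict(library, max_heavy_atoms=100):
    """Keep only entries with fewer than max_heavy_atoms non-hydrogen atoms."""
    return {name: formula for name, formula in library.items()
            if heavy_atom_count(formula) < max_heavy_atoms}

# Hypothetical mini-library: two small molecules and one large placeholder.
library = {
    "ethanol": "C2H6O",
    "aspirin": "C9H8O4",
    "large_placeholder": "C120H200O40",
}
restricted = restrict(library)
```

In practice the count would come from a cheminformatics toolkit operating on the actual structure; the formula-based count is used here only so that the example has no external dependencies.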

Next, we performed the process of randomly sampling 5000 latent vectors from the prior Gaussian distribution $\mathcal{N}(0, I)$ five times for each model. Then, we calculated and compared the following metrics proposed in benchmarking studies such as MOSES and GuacaMol.

Table 3 Comparison of NP-VAE, HierVAE, MoFlow and SM-RNN as generative models: For the generated compounds, 2D and 3D reconstruction accuracy, uniqueness, novelty, logP, SAscore, QED, Filters, SNN, molecular weight, NP-likeness, and the distance between compound property distributions, Frag, Scaf, IntDiv, and Phys div were calculated.

	NP-VAE	HierVAE	MoFlow
2D Reconstruction accuracy	0.871	0.438
3D Reconstruction accuracy
Uniqueness
Novelty
logP
QED
SAscore
Filters
SNN
MolWt
NP-likeness
Frag
Scaf
IntDiv
Phys div

The highest value for each accuracy index is shown in bold.
- Uniqueness: The proportion of unique molecules among the generated valid molecules. This serves as an indicator of the uniqueness of the generated compound structures; the value is low if the model has collapsed and generates only a few typical molecules.
- Novelty: The proportion of generated molecules that do not exist in the training set. A low value may indicate overfitting.
- logP: Represents the lipophilicity of a molecule. Moderate lipophilicity is required for pharmaceutical compounds.
- QED: An indicator representing the drug-likeness of a molecule. Since it is calculated based on existing oral drugs, it can be considered an indicator of oral drug-likeness. It is expressed as a value between 0 and 1, with values closer to 1 indicating structures that are more like oral pharmaceuticals.
- SAscore: A score representing the difficulty of synthesis based on molecular structure. It is expressed as a value between 1 and 10, with values closer to 10 indicating higher synthesis difficulty.
- Filters: The proportion of generated molecules that pass the filters used during the construction of the MOSES dataset to eliminate undesired structures. A lower value indicates that more molecules have abnormal structures.
- SNN (similarity to a nearest neighbor): The average Tanimoto coefficient between each generated molecule and the most similar molecule within the training data. This value decreases as the generated molecules deviate further from the distribution of the training data.
- MolWt: Molecular weight.
- NP-likeness: A measure of naturalness. The NP-likeness score is designed to estimate how closely a given molecule resembles known natural products.
- Frag: Comparison of the distribution of BRICS fragments between generated molecules and the training data. The value increases when both sets contain molecules with similar fragments.
- Scaf: Comparison of the distribution of primary structural elements within molecules, referred to as scaffolds. Frag and Scaf are both metrics used to measure the similarity between generated molecules and the training data at the level of substructure units.
- IntDiv: An evaluation of the structural diversity within the set of generated molecules. It is calculated by computing the Tanimoto coefficient between all pairs of generated molecules and taking the average.
- Phys div: The KL divergence between generated molecules and the training data in terms of physicochemical properties, calculated from physicochemical descriptors such as molecular weight, the number of aromatic rings, and the count of rotatable bonds. A higher value indicates better performance.
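Several of the set-based metrics in this list (Uniqueness, Novelty, SNN, and the pairwise similarity underlying IntDiv) reduce to a few lines of code. The sketch below is a simplified illustration, not the MOSES or GuacaMol implementation: molecules are represented as canonical strings, fingerprints as sets of on-bit indices, and all function names are hypothetical. Novelty is computed over the unique generated molecules, which is an assumption about the convention; MOSES, as far as we are aware, reports IntDiv as one minus the average pairwise similarity, so both quantities are returned separately.

```python
def uniqueness(valid_generated):
    """Fraction of distinct molecules among the valid generated molecules."""
    return len(set(valid_generated)) / len(valid_generated)

def novelty(generated, training):
    """Fraction of distinct generated molecules absent from the training set."""
    unique = set(generated)
    return len(unique - set(training)) / len(unique)

def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    fp_a, fp_b = set(fp_a), set(fp_b)
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def snn(generated_fps, training_fps):
    """Average similarity of each generated molecule to its nearest
    training-set neighbor."""
    return sum(max(tanimoto(g, t) for t in training_fps)
               for g in generated_fps) / len(generated_fps)

def mean_pairwise_similarity(fps):
    """Average Tanimoto coefficient over all distinct pairs of molecules."""
    pairs = [(i, j) for i in range(len(fps)) for j in range(i + 1, len(fps))]
    return sum(tanimoto(fps[i], fps[j]) for i, j in pairs) / len(pairs)

def internal_diversity(fps):
    """One minus the average pairwise similarity (assumed MOSES-style convention)."""
    return 1.0 - mean_pairwise_similarity(fps)

# Toy data (hypothetical strings and fingerprints).
generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
training = ["CCO"]
generated_fps = [{1, 2, 3}, {5, 6}]
training_fps = [{2, 3, 4}, {5, 6}]

u = uniqueness(generated)
n = novelty(generated, training)
s = snn(generated_fps, training_fps)
```

Real fingerprints would be generated by a cheminformatics toolkit; the set representation is used here only to keep the arithmetic visible.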

The results are shown in Table 3. First, the reconstruction accuracy for training compounds was compared among three models, NP-VAE, HierVAE, and MoFlow, which construct the latent space using an encoder and a decoder. This reconstruction accuracy indicates how accurately the compound library can be projected without information loss. Regarding the 2D reconstruction accuracy for the planar structure of compounds, NP-VAE significantly outperformed HierVAE. On the other hand, MoFlow achieved a 100% 2D reconstruction accuracy. Owing to the mechanism of flow-based models, their reconstruction accuracy is always guaranteed to be 100%, regardless of the degree of learning. However, unlike the case with VAEs, this does not necessarily indicate the accuracy or continuity of the resulting latent space. The latent space of flow-based models becomes high-dimensional in order to guarantee reversibility. Because the latent space has the same dimension as the input, exploring it is much harder than with VAEs, which project onto a lower-dimensional space. Moreover, flow-based learning can be quite unstable and is prone to gradient explosion as the input dimension increases. The 3D reconstruction accuracy of NP-VAE, which considers not only the planar structure but also the stereochemistry (chirality), exceeded

, while the other two models could not handle the 3D structures, and hence, their accuracy was not available. This result demonstrates that NP-VAE has high performance as a feature extractor.

Second, NP-VAE demonstrated stable performance as a generative model across almost all indices. In terms of uniqueness, novelty and logP, NP-VAE showed comparable performance to the top-performing models. Due to the large variance in

, the difference in

scores among models is not statistically significant. NP-VAE exhibited the highest QED, SAscore, and Filters scores, which represent the drug-likeness of a molecule, the difficulty of its synthesis, and the proportion that passed through a filter to eliminate undesired structures, respectively. On the other hand, SM-RNN differed from the other models by a factor of nearly two on the SNN, Scaf and Phys div metrics; however, these results are not informative. The compound structures generated by SM-RNN had a Novelty of only

, which implies that

of the structures were identical to the training data. It is obvious that these three metrics, plus the Frag score, would improve if the model output compound structures identical to the training data. In other words, SM-RNN is simply memorizing the training data, and its usefulness as a generative model for generating new structures is limited. Regarding molecular weight, we included a plot illustrating the size distributions of the molecules generated by all models in Supplementary Fig. S2. Each distribution is highly divergent, indicating the generation of diverse molecular weights. Regarding NP-likeness, MoFlow showed the highest value and was the only model with a positive score, while the other models take negative values. This is attributed to the fact that MoFlow generates extremely long straight-chain structures, which are considered abnormal, as indicated by the low Filters score. Moreover, since these structures have a Scaf value of 0, meaning they completely lack a scaffold, they are considered natural-product-like because their structures are unusual and rare compared to general compounds. The NP-likeness calculation method is therefore prone to assigning higher values in such cases. On the other hand, the NP-likeness score of the training data is -0.638, mainly due to the inclusion of numerous approved drug compounds from DrugBank. This value is close to the negative NP-likeness scores shown by the other three models. In conclusion, NP-VAE successfully generates compound structures with the highest scores in the desired drug-likeness metrics, such as QED, SAscore and Filters, while maintaining similarity to the training data at the level of fragments and scaffolds, as indicated by the Frag and Scaf values, and achieving high novelty of the generated structures.

Construction of chemical latent space with natural compounds. We constructed two chemical latent spaces using the entire drug-and-natural-product dataset: one obtained by training with only the structural information of the compounds, and the other obtained by training with both structural information and, as functional information, the NP-likeness score, which serves as a measure of naturalness. More precisely, the NP-likeness score is incorporated into the learning process as functional information through a component of the loss function. The NP-likeness score is a measure designed to estimate how closely a given molecule resembles known natural products. It is based on the distribution of fragments (substructures) in the molecule compared to a reference set of known natural products. A higher NP-likeness score suggests that a molecule is more 'natural product-like,' meaning that its structure is more similar to those of known natural products.
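One common way to fold a scalar property such as the NP-likeness score into VAE training is to add a property-prediction term to the usual reconstruction and KL terms. The text here states only that the score enters through a component of the loss function, so the squared-error form, the weights, and the function name in the following sketch are assumptions made for illustration rather than the formulation used by NP-VAE.

```python
def total_loss(recon_loss, kl_loss, predicted_score, target_score,
               kl_weight=1.0, property_weight=1.0):
    """Combine structural and functional terms into a single training loss.

    recon_loss      : reconstruction term for the compound structure
    kl_loss         : KL divergence between the posterior and the prior
    predicted_score : property value predicted from the latent variable
    target_score    : reference property value (e.g. an NP-likeness score)
    The squared-error property term and both weights are assumed, not
    taken from the source.
    """
    property_loss = (predicted_score - target_score) ** 2
    return recon_loss + kl_weight * kl_loss + property_weight * property_loss

# Training on structure only corresponds to switching the property term off.
structure_only = total_loss(2.0, 0.5, 1.0, 0.5, property_weight=0.0)
with_property = total_loss(2.0, 0.5, 1.0, 0.5)
```

Because the property term depends on the latent variable, minimizing it encourages compounds with similar scores to lie close together, which is consistent with the clustering by NP-likeness described for the second latent space.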

Fig. 1 Visualization of chemical latent spaces using t-SNE plot. a,

The higher the NP-likeness score, the more yellow it is, and the lower the score, the more purple it is. Compared to the chemical latent space (a) trained only on substructures, the chemical latent space (b) trained on both substructures and the NP-likeness score as functional information shows a more clustered distribution according to the NP-likeness score. Comparing the cases when plotting representative anticancer compounds in the chemical latent space trained only on substructures (c) and when plotting anticancer compounds in the chemical latent space trained on both substructures and NP-likeness scores (d), both show clustered distributions according to the anticancer drug classification, with chemotherapeutic drugs and molecular targeted drugs distributed separately. Focusing on the distribution of molecular targeted drugs (red frame), the distribution is more locally clustered when the NP-likeness score is included.


Fig. 2 Yessotoxin and its EGFR inhibitory activity. a Structure of Yessotoxin. Yessotoxin was first discovered in the 1980s from a scallop species called Patinopecten yessoensis, and since then, various derivatives have been found in crustaceans and algae. b Inhibition of EGF-stimulated EGFR phosphorylation by Yessotoxin. EGFR tyrosine kinase activities are expressed as a percentage of the maximal phosphorylation induced by EGF. AG1478 is a selective EGFR inhibitor and was used as a positive control. When Yessotoxin was at

, an inhibitory effect of over

was confirmed.

First, we visualized these two chemical latent spaces by reducing the latent variables to two dimensions using t-SNE. The results are shown in Fig. 1. In (a) and (b), compounds with higher NP-likeness scores are represented in yellow, while those with lower scores are depicted in purple. Compared with the latent space constructed using only the structural information of the compounds (Fig. 1a), gradients of NP-likeness can be observed in the latent space constructed by incorporating NP-likeness scores as functional information (Fig. 1b). To quantitatively assess the gradients of NP-likeness in the chemical latent space, we calculated the correlation between the embedding distance and the difference of NP-likeness scores for randomly sampled pairs of points in the latent space. The results shown in Supplementary Fig. S3 indicate that the correlation (Pearson correlation coefficient

) in the latent space incorporating NP-likeness scores is slightly higher than the correlation

in the latent space constructed using only the structural information.
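The quantitative check described above can be sketched as follows. This is a minimal illustration that assumes Euclidean distance between embeddings and the absolute difference of scores; the function names, the number of pairs, and the seed are hypothetical choices, not details taken from the article.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

def distance_score_correlation(latents, scores, n_pairs=1000, seed=0):
    """Correlate embedding distance with score difference over random pairs."""
    rng = random.Random(seed)
    distances, differences = [], []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(latents)), 2)  # two distinct compounds
        distances.append(math.dist(latents[i], latents[j]))
        differences.append(abs(scores[i] - scores[j]))
    return pearson(distances, differences)
```

A higher coefficient means that compounds far apart in the space also tend to differ more in NP-likeness, which is what a smooth gradient of the score would produce.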

When plotting representative anticancer drug compounds on these chemical latent spaces, we observed more clustered distributions for each class of anticancer drugs in the space incorporating NP-likeness scores (Fig. 1d) compared to the space constructed with only structural information (Fig. 1c). In particular, many molecular-targeted drugs were found to be locally distributed. The localized distribution of existing molecular-targeted drugs in the space incorporating NP-likeness scores is considered to be due to the lower NP-likeness scores of molecular-targeted drugs compared to other pharmaceutical compounds (Supplementary Fig. S4). Thus, if a functional value for the drug of interest can be added during the training of NP-VAE as a functional indicator, a chemical latent space can be obtained where the desired pharmaceutical candidate compounds are locally distributed, and novel functionally optimized structures are easier to explore.
Second, by utilizing these advantages of the constructed chemical latent space, we found that the natural compound Yessotoxin (Fig. 2a) included in the drug-and-natural-product dataset was located near existing molecular-targeted drugs. Based on this observation, we hypothesized that Yessotoxin, isolated from a scallop species called Patinopecten yessoensis, might function as a molecular-targeted drug. An assay confirmed that Yessotoxin exhibited weak EGFR inhibitory activity (Fig. 2b). This suggests that exploring the chemical latent space constructed by NP-VAE may also enable the discovery and annotation of unexpected compounds with similar pharmacological effects.

Interpolation in chemical latent space. For two existing drug compounds, we generated novel compound structures that exist between the two compounds by scanning the chemical latent space from one compound to the other. Let the latent variables of the starting and destination compounds be

and

, respectively. When exploring

equidistant points on the chemical latent space connecting the starting and destination points, the latent variable

of the

-th generated compound was derived from the following equation.

By inputting this latent variable into the decoder, new compound structures can be generated. Figure 3 shows the novel compound structures obtained by exploring the space between two existing drugs, with the starting compound being a biomolecule, a Nicotinamide adenine dinucleotide derivative, and the destination compound being the molecular-targeted drug Sorafenib. As shown in Fig. 3, the similarity to the starting compound gradually decreased as the scan moved away from it and approached the destination compound, while the similarity to the destination compound gradually increased. Moreover, the NP-likeness score of the generated compound structures gradually decreased as the scan progressed from a high to a low NP-likeness compound.
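The scan between two compounds can be sketched as linear interpolation between their latent vectors. The sketch assumes that the equidistant points lie on the straight segment between the two latent variables and that the endpoints themselves are excluded; the names `z_start`, `z_dest`, and `n` are hypothetical, and the decoding step is not shown.

```python
def interpolate(z_start, z_dest, n):
    """Return n equally spaced interior points between two latent vectors.

    Assumes straight-line interpolation with the endpoints excluded, so the
    i-th point (i = 1..n) sits at fraction i / (n + 1) of the way.
    """
    points = []
    for i in range(1, n + 1):
        t = i / (n + 1)
        points.append([s + t * (d - s) for s, d in zip(z_start, z_dest)])
    return points

# Each returned vector would then be passed to the decoder to obtain a structure.
for z in interpolate([0.0, 0.0], [4.0, 8.0], 3):
    print(z)
```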

Modification of compound structures by Bayesian optimization. We used Bayesian optimization with

to explore the chemical latent space and generate novel compound structures with optimized functional indicators. By setting an existing drug compound as the starting point and limiting the exploration range to the vicinity of the starting point in the chemical latent space, we generated novel compound structures with optimized functional indicators while maintaining structural similarity to the starting compound. The objective function to be maximized was set to the quantitative estimate of drug-likeness (QED), an indicator of oral drug-likeness. The correlation coefficient between the NP-likeness score and QED in the drug-and-natural-product dataset is -0.31, indicating a negative correlation. Therefore, it is expected that there is an increasing gradient of QED in the space with an NP-likeness decreasing gradient, and the chemical latent space constructed in

Fig. 3 Generation of compound structures between two existing drugs through interpolation. Interpolation of novel compound structures obtained by scanning the chemical latent space between two points, with the starting compound structure being a Nicotinamide adenine dinucleotide derivative from a biomolecule and the destination compound structure being Sorafenib, a molecular targeted drug. The three values below each compound structure represent, from left to right, the similarity to the starting compound, the similarity to the destination compound, and the NP-likeness score. As the compounds move closer to the destination point, the similarity to the starting compound gradually decreases, the similarity to the destination compound increases, and the NP-likeness score becomes lower.

Fig. 4 Generation of novel compound structures using Bayesian optimization. The objective function to be maximized was set as the quantitative estimate of drug-likeness (QED), and novel compound structures with improved functional indices were explored using Bayesian optimization. The two values below each compound structure represent, from left to right, the QED score and the similarity to the starting compound. In this case, the search space was limited to the vicinity of the target compound, and optimization was performed in both narrow and wide search ranges, examining the effects on the resulting compound structures depending on the search space. When the search range was small, it was possible to obtain novel compound structures with improved QED while maintaining the characteristic structure of the target compound. When the search range was expanded, changes in the characteristic structure were observed, and novel compound structures with significantly improved QED could be obtained.
this study can also be used for exploring oral drug candidate compound structures. Figure 4 shows the results of generating novel optimized compounds when the starting point for exploration is set to the peptide drug Octreotide and the anticancer drug Paclitaxel. When the exploration range is small, novel compound structures with optimized QED can be obtained while maintaining the characteristic structures of the target compounds. On the other hand, when the exploration range is expanded, although somewhat larger changes in the characteristic structures can be observed,
[Fig. 5 panel labels: (a) horizontal axis: Docking score; (b) Gefitinib, docking score: -7.02 (kcal/mol), generate new structures; (c)]

Fig. 5 Generating novel compound structures from the vicinity of Gefitinib and calculating the docking scores with EGFR. a Histogram with the number of generated compounds on the vertical axis and their docking scores on the horizontal axis. There were approximately 5700 novel compound structures with improved docking scores compared to Osimertinib, and about 1600 structures with improved scores compared to Gefitinib. b Novel generated compounds with top docking scores against EGFR. The numbers below the compound structures represent the docking scores. Among these, the majority of the structures contain a kinase-inhibiting quinazoline moiety, known to play a crucial role in EGFR interactions. In addition, it can be seen that the docking scores have been significantly improved due to the addition of other structural components. c Histogram of the docking scores for the virtual compounds generated by the machine-learning-based molecular generation tool, REINVENT (version 3.0).


Fig. 6 Docking poses between EGFR and Gefitinib, as well as EGFR and the novel generated compounds. a Docking pose of interaction between EGFR and gefitinib, (b) Docking pose of interaction between EGFR and the novel compound with the highest docking score, and (c) docking pose of interaction between EGFR and the novel compound with the second-highest docking score. Carbon atoms within

Å

of the ligand compound in the EGFR structure are shown in light blue, and the parts where interactions were confirmed in the simulation results are indicated by yellow dashed lines. While Gefitinib is observed to interact with methionine at position 793, the ligand with the highest docking score was confirmed to interact with methionine at position 793 , as well as arginine at position 841 and asparagine at position 842 . Additionally, for the ligand with the second-highest docking score, interactions were observed with methionine at position 790, cysteine at position 797, and alanine at position 743.

Table 4 Comparison of computational time between NP-VAE and existing methods.

Computational time per epoch (sec)

NP-VAE
HierVAE
MoFlow
SM-RNN

The computational time required for one epoch during training with the restricted dataset of drug-and-natural-product compounds is shown. The hardware specifications include Nvidia Tesla P100-SXM2, 16GB.
novel compound structures with significantly improved QED can be obtained. Natural product-derived drugs, such as Octreotide and Paclitaxel, are generally administered by injection. Therefore, improving the QED of such compounds is expected to enhance their properties as orally administered drugs, leading to increased convenience for patients. In addition, to quantitatively assess the effectiveness of Bayesian optimization, we repeated the optimization experiment for multiple points sampled from the latent space. When the exploration range was limited to compounds with a similarity of 0.6 or higher, the average improvement in the objective function QED was 0.046 with a standard deviation of 0.074. When the exploration range was expanded to compounds with a similarity of 0.2 or higher, the average improvement in the objective function QED rose to 0.538 with a standard deviation of 0.022.
There are a few papers on functional optimization through sampling in the latent space. OptiMol, proposed by Boitreaud et al., focused on an optimization strategy and targeted a specific aspect of drug discovery (binding affinities), whereas our NP-VAE model aims to deal with large, complex molecules with 3D structures and desired properties. The constrained Bayesian optimization method proposed by Griffiths et al. primarily used SMILES representations, which can lead to invalid outputs, making the handling of such outputs a significant focus of their work. In contrast, the NP-VAE model is designed to effectively decompose the input compound structures into fragment units and convert them into tree structures to handle large and complex 3D molecular structures. Chembo, proposed by Korovina et al., aimed to introduce a method for synthesizable recommendations beyond the SA score. The proposed Gaussian process-based approach has enough potential to be incorporated into functional optimization in NP-VAE.

Generation of drug candidates from chemical latent space with docking analysis. In the constructed chemical latent space, we generated approximately 10,000 novel compound structures from the vicinity of existing anticancer drug compounds. By performing docking analysis for the generated compound structures with the target proteins that interact with the original anticancer drug compounds, we searched for novel compound structures that are expected to have greater efficacy as molecular-targeted drugs. Schrödinger Glide was used as the docking analysis software. When generating novel compound structures from the

Fig. 7 Overall structure of NP-VAE. When the compound structure information is input to the Encoder, the latent variable is calculated based on the tree structure obtained by preprocessing. In the Decoder, the compound structure is calculated and output using a continuous algorithm with the latent variable as the input. During training, a pathway is used in parallel to predict the functional indices of the compound with the latent variable as input. This allows for the construction of a chemical latent space that takes into account not only structural information but also functional information.
space surrounding molecular-targeted drugs such as Gefitinib and Osimertinib, we successfully discovered multiple compound structures with significantly better docking scores than the original compounds, as shown in Fig. 5a. There were about 5700 novel compound structures with improved docking scores compared to Osimertinib and about 1600 structures with improved scores compared to Gefitinib. Figure 5b shows the top-ranking compound structures in terms of docking scores. Many of these structures share with Gefitinib the pyrimidine moiety, which is known to play an important role in the interaction with EGFR. However, the docking scores are greatly improved by the addition of other structures. These results suggest that, in the chemical latent space constructed by NP-VAE, it is possible to discover novel seed compounds with effects greater than the original drugs by exploring the vicinity of existing drug compounds. In addition, to compare the performance of our NP-VAE model with a baseline approach, we conducted the docking experiment using virtual compounds. These compounds were generated by another machine-learning-based molecular generation tool, REINVENT (version 3.0), developed by Blaschke et al. This model's architecture is based on a recurrent neural network with SMILES representations and is pre-trained on the ChEMBL chemical compounds database. Specifically, these virtual compounds were generated using the reinforcement learning reward of REINVENT with the QEPPI score. For the comparison, we displayed the histogram of the docking scores for the virtual compounds in Fig. 5c. From these docking simulations, the docking score for Gefitinib ranked in the top

for NP-VAE and in the top

for REINVENT.

Figure 6 shows the verification of the interaction between the newly generated drug candidate compounds and the target EGFR (PDB code: 4I22). While it is evident that Gefitinib interacts with methionine at position 793, the ligand with the best docking score was found to interact not only with methionine at position 793 but also with arginine at position 841 and asparagine at position 842. In addition, the ligand with the second-best docking score showed interactions with methionine at position 790, cysteine at position 797, and alanine at position 743. In fact, previous studies on molecular-targeted drugs have reported hydrogen bonding between Osimertinib and EGFR at positions 790 and

, and a covalent bond with Afatinib at position 797 has been reported. The novel compounds obtained in this study were found to demonstrate interactions consistent with these investigations, and it was shown that compound structures other than pyrimidine have a significant impact on the increase in binding strength with the target. The combination of NP-VAE and docking analysis in this method enables the verification of interactions between diverse compound structures and target proteins, and is expected to contribute not only to the discovery of novel pharmaceutical compounds but also to the elucidation of drug mechanisms of action and the acquisition of new insights.

Computational complexity. The training of generative models generally demands significant computational time, particularly for models with a large number of parameters like ours, which comprises tens of millions of parameters. Therefore, we conducted a comparison of computational times between NP-VAE and existing methods. Table 4 shows the computational time required for one epoch during training. NP-VAE proved to be faster compared to VAE-based methods such as HierVAE. However, it exhibited longer computational times when compared to non-VAE methods. This discrepancy can be attributed to the fact that all VAE models, including ours, adopt LSTM, a sequential computation process, which appears to be the bottleneck in terms of computational time.

Synthetic accessibility of natural product-like compounds. From Table 3, it can be observed that compounds with high NP-likeness scores tend to have even higher synthetic accessibility (SA) scores, which indicate greater difficulty of synthesis. Thus, a main issue in the context of handling large and complex compounds resembling natural products is the synthetic accessibility of the generated compounds. Many microbial natural products possess intricate and unique molecular architectures. Their complexity arises from the diverse and specialized biosynthetic pathways found in nature, and therefore presents a high degree of synthetic difficulty. Due to their complexity, the synthesis of natural products often requires many steps, which makes these compounds challenging to reproduce in the laboratory. A potential application for our NP-VAE model in addressing the synthetic accessibility challenge is the simplification of compound structures within the chemical latent space developed by NP-VAE. This involves searching for structures that are simpler, smaller in size, of the same bioactivity, and easier to synthesize, in the vicinity of known natural product structures in the chemical latent space. These candidates could then be subjected to further experimental validation. Confirming the effectiveness of this approach is the next crucial challenge.

Conclusion

We developed a VAE model capable of handling large molecular structures, such as natural compounds, with high accuracy, and constructed a chemical latent space that takes into account both structural and functional information including chirality. By using a large set of pharmaceutical compounds and natural compounds as compound libraries, we successfully constructed the chemical latent space incorporating pharmacological effects, enabling statistical and comprehensive analysis. NP-VAE demonstrated consistent performance as a generative model across various indices. Furthermore, by exploring the chemical latent space, we succeeded in generating novel compound structures with the desired functionality, and demonstrated that in silico selection of drug candidate compounds is possible by combining with docking analysis-based screening.

Material and methods

The architecture of NP-VAE. The overall structure of NP-VAE is shown in Fig. 7. NP-VAE consists of three components: preprocessing, Encoder, and Decoder. In preprocessing, the compound structure is decomposed into fragments according to certain rules and converted into a corresponding tree structure. In the Encoder, the tree structure obtained from preprocessing and the original compound structure are input to calculate the latent variable. In the Decoder, taking the latent variable as input, a tree structure is generated using a depth-first algorithm, and then converted back into the corresponding compound structure. The summary of the specific components in the NP-VAE architecture compared to existing methods such as HierVAE and JT-VAE is as follows.

Novel components of NP-VAE

Preprocessing Objectives in NP-VAE: Simplification of compound structure: The first objective is to convert the input compound structure into a simpler structure that can be more easily handled. Particularly when dealing with large molecular structures, calculating at the single-atom level, as JT-VAE does, would incur enormous time and space costs. To address this, we devised a procedure to capture compound substructures by decomposing them into several fragments.

Extraction of meaningful physicochemical features: The second objective is to reshape the compounds so that meaningful physicochemical features can be extracted. Aromatic rings like benzene, as well as functional groups deeply involved in the physicochemical properties, such as amide and carboxyl groups, should be treated as a single fragment rather than a sequence of individual atoms. The compound decomposition algorithm was determined based on these objectives.

Chirality handling: We have devised a method of managing and preserving the chirality of molecules, which is an essential factor in the 3D complexity of compounds.

Components inspired by existing methods. Variational inference: NP-VAE, like HierVAE and JT-VAE, employs variational inference to learn a continuous latent space.

The preprocessing of NP-VAE. There are two objectives in the preprocessing of NP-VAE. The first one is to convert the input compound structure into a simpler structure that can be more easily handled. Particularly when dealing with large molecular structures, calculating at the single-atom level would incur enormous time and space costs. To address this, we devised a procedure to capture compound substructures by decomposing them into several fragments. Also, the presence of loop structures in the molecular graph would require a significant computational cost during compound generation in the subsequent Decoder; thus, we aim to capture the structure as a tree without loops. The second objective is to reshape the compounds so that meaningful physicochemical features can be extracted. Aromatic rings like benzene, as well as functional groups deeply involved in the physicochemical properties, such as amide and carboxyl groups, should be treated as a single fragment rather than a sequence of individual atoms. The compound decomposition algorithm was determined based on these objectives.

In the preprocessing step, we first extract substructures fragmented from the entire compound structures according to the decomposition procedure (in Supplementary Methods), and save them as substructure labels while converting them into corresponding tree structures (Supplementary Fig. S5).

In the tree structure defined for a given compound structure, the number of nodes matches the number of substructures, and edges are drawn between substructures that neighbor each other within the compound. At each node of the tree, the ECFP calculated from the corresponding substructure is stored as a feature vector.
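A fragment tree of this kind can be sketched as a small recursive data structure. This is only an illustration: the class name `FragmentNode`, the field names, and the toy bit vectors are hypothetical, and in practice the feature would be an ECFP computed by a cheminformatics toolkit rather than a hand-written list.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class FragmentNode:
    label: str          # substructure label, e.g. a fragment SMILES
    feature: List[int]  # fingerprint bits for this fragment (placeholder values)
    children: List["FragmentNode"] = field(default_factory=list)

def count_nodes(node: FragmentNode) -> int:
    """Number of nodes in the tree, i.e. the number of substructures."""
    return 1 + sum(count_nodes(child) for child in node.children)

def depth_first_labels(node: FragmentNode) -> List[str]:
    """Visit the tree depth-first (parent before children) and collect labels."""
    labels = [node.label]
    for child in node.children:
        labels.extend(depth_first_labels(child))
    return labels

# Toy example: an aromatic ring with an amide and a carboxyl fragment attached.
root = FragmentNode("c1ccccc1", [1, 0, 1], [
    FragmentNode("C(N)=O", [0, 1, 1]),
    FragmentNode("C(=O)O", [1, 1, 0]),
])
print(count_nodes(root), depth_first_labels(root))
```

Because the structure has no loops, both encoding (children before parents) and depth-first generation in the Decoder can be written as simple recursions over the nodes.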

NP-VAE encoder. In the Encoder, features of compound structures are extracted by combining two processes (Supplementary Fig. S6). First, for each ECFP stored in the nodes of the tree structure, a feature vector $h$ is obtained using a Child-Sum Tree-LSTM. Let $C(j)$ be all the child nodes of node $j$, $x_j$ be the ECFP of node $j$, $h_j$ be the hidden state of node $j$ in the Tree-LSTM, $i_j$ be the input gate, $o_j$ be the output gate, $c_j$ be the memory cell, and $f_{jk}$ be the forget gate for child node $k$ of node $j$. The Child-Sum Tree-LSTM is defined by the following equations:

$$\tilde{h}_j = \sum_{k \in C(j)} h_k$$

$$i_j = \sigma\left(W^{(i)} x_j + U^{(i)} \tilde{h}_j + b^{(i)}\right)$$

$$f_{jk} = \sigma\left(W^{(f)} x_j + U^{(f)} h_k + b^{(f)}\right)$$

$$o_j = \sigma\left(W^{(o)} x_j + U^{(o)} \tilde{h}_j + b^{(o)}\right)$$

$$u_j = \tanh\left(W^{(u)} x_j + U^{(u)} \tilde{h}_j + b^{(u)}\right)$$

$$c_j = i_j \odot u_j + \sum_{k \in C(j)} f_{jk} \odot c_k$$

$$h_j = o_j \odot \tanh(c_j)$$

Here, $\odot$ represents the element-wise product, $\sigma$ is the sigmoid function, $u_j$ is the candidate value written into the memory cell, $W$ and $U$ are the weights learned in the fully connected layers, and $b$ are the learned constants (biases).
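The bottom-up recurrence of a Child-Sum Tree-LSTM can be sketched in plain Python as below. This is a minimal sketch of the standard formulation by Tai et al. (listed in the references), not the NP-VAE implementation: the helper names, the random initialisation, the tiny dimensions, and the toy 4-bit "fingerprints" are assumptions made for illustration.

```python
import math
import random


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def add(*vecs):
    return [sum(vals) for vals in zip(*vecs)]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-x)) for x in v]


def tanh(v):
    return [math.tanh(x) for x in v]


def hadamard(a, b):
    return [x * y for x, y in zip(a, b)]


def init_params(in_dim, hid_dim, rng):
    """Random W (hid x in), U (hid x hid), zero b for gates i, f, o, u."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                for _ in range(rows)]
    return {g: (mat(hid_dim, in_dim), mat(hid_dim, hid_dim), [0.0] * hid_dim)
            for g in "ifou"}


def gate(params, name, x, h, act):
    W, U, b = params[name]
    return act(add(matvec(W, x), matvec(U, h), b))


def tree_lstm(node, features, children, params, hid_dim):
    """Return (h_j, c_j) for `node`, processing its children first."""
    child_states = [tree_lstm(k, features, children, params, hid_dim)
                    for k in children[node]]
    # Sum of the children's hidden states (zero vector for a leaf).
    h_tilde = add([0.0] * hid_dim, *[h for h, _ in child_states])
    x = features[node]
    i = gate(params, "i", x, h_tilde, sigmoid)
    o = gate(params, "o", x, h_tilde, sigmoid)
    u = gate(params, "u", x, h_tilde, tanh)
    c = hadamard(i, u)
    for h_k, c_k in child_states:
        # One forget gate per child, computed from that child's hidden state.
        f_jk = gate(params, "f", x, h_k, sigmoid)
        c = add(c, hadamard(f_jk, c_k))
    h = hadamard(o, tanh(c))
    return h, c


# Toy tree: node 0 is the root with two leaf children.
rng = random.Random(0)
features = {0: [1, 0, 1, 0], 1: [0, 1, 1, 0], 2: [0, 0, 0, 1]}
children = {0: [1, 2], 1: [], 2: []}
params = init_params(in_dim=4, hid_dim=3, rng=rng)
h_root, c_root = tree_lstm(0, features, children, params, hid_dim=3)
```

The recursion visits leaves before their parents, so the vector returned for the root summarises the whole tree; this root vector is the quantity the Encoder combines with the whole-molecule features.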

Second, we compute the ECFP of the entire compound structure and input it into a multi-layer fully connected network to obtain an output vector; each fully connected layer has its own learned weights and biases.

Lastly, we sum the feature vector of the root node obtained from the Tree-LSTM and the output of the fully connected layers. Using random noise $\epsilon$, we then compute the latent variable $z$ from this sum via the reparameterization trick, again through fully connected layers with learned weights and biases.
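The final encoder step can be sketched as follows. The sketch assumes, as is common for variational autoencoders, that one fully connected layer produces the mean and another the log-variance of the latent distribution; the text here only states that fully connected layers and the reparameterization trick are used, so the function and parameter names (`encode_latent`, `W_mu`, `W_logvar`, and so on) are hypothetical.

```python
import math
import random


def linear(W, b, v):
    """Fully connected layer: W v + b."""
    return [sum(w * x for w, x in zip(row, v)) + bias
            for row, bias in zip(W, b)]


def encode_latent(h_root, h_mol, W_mu, b_mu, W_logvar, b_logvar, rng):
    """Sum the two encoder features, then sample z = mu + sigma * eps."""
    h = [a + b for a, b in zip(h_root, h_mol)]
    mu = linear(W_mu, b_mu, h)
    logvar = linear(W_logvar, b_logvar, h)
    # Random noise drawn from a standard normal distribution.
    eps = [rng.gauss(0.0, 1.0) for _ in mu]
    z = [m + math.exp(0.5 * lv) * e for m, lv, e in zip(mu, logvar, eps)]
    return z, mu, logvar


# Toy 2-dimensional example with identity / zero weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
z, mu, logvar = encode_latent(
    h_root=[1.0, 2.0], h_mol=[0.5, -1.0],
    W_mu=identity, b_mu=[0.0, 0.0],
    W_logvar=zeros, b_logvar=[0.0, 0.0],
    rng=random.Random(0),
)
```

Because the noise enters only through the additive term, gradients can flow through `mu` and `logvar`, which is the purpose of the reparameterization trick.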

NP-VAE decoder. In the Decoder, based on the input latent variable $z$, a tree structure is generated using a depth-first sequential algorithm and is then converted to a compound structure for output (Supplementary Fig. S7). The NP-VAE decoder consists of seven procedures: Root label prediction, Topological prediction, Bond prediction, Label prediction, Update of the variable $z$, Conversion to compound structure, and Chirality assignment. We briefly describe each procedure; for a full description of the NP-VAE algorithm, see the Supplementary Methods.

In the first step of the Decoder, Root label prediction, we predict the substructure label to assign to the initially generated root node. The label is selected from all the substructure labels obtained during the preprocessing of NP-VAE: the input latent variable $z$ is fed into fully connected layers, and a multi-class classification is performed. In Topological prediction, we predict whether to generate a new child node under the current node. If a child node is to be generated, we proceed to Bond prediction and Label prediction. If not, we terminate the Decoder process when the current node is the root; otherwise, we backtrack from the current node to its parent node. In Bond prediction, we predict the type of bond between the substructure of the current node and the substructure of the newly generated child node. In Label prediction, we predict the substructure label of the newly generated child node. After Label prediction or backtracking, we compute an updated variable $z'$ from $z$ using a fully connected layer with learned weights and biases.

The update equation involves $h_j$, the feature vector obtained by the Child-Sum Tree-LSTM computation, which represents the features at node $j$ after propagating the ECFPs stored in the nodes of the tentative tree structure. During child node generation, the features are transmitted through backward propagation from the root node to the leaf node, and the new child node is set as node $j$ (Supplementary Fig. S8a). During backtracking, after the backward propagation from the root node to the leaf node, a forward propagation from the leaf node to the root node is performed, and the parent node that is the backtrack destination is set as node $j$ (Supplementary Fig. S8b). In Conversion to compound structure, after the tree structure has been generated, the substructure labels of the nodes are connected and converted into the corresponding compound structure. Because the substructure labels already include the atoms that serve as bonding sites within each substructure and their bonding order, the compound structure can be uniquely determined from the generated tree structure (Supplementary Fig. S8c).
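The control flow of the depth-first generation described above can be sketched as follows. This is an illustration of the loop structure only: the five callables stand in for the neural prediction heads, their names and signatures are hypothetical, and the `max_nodes` guard is an added safety assumption that the text does not mention.

```python
def decode_tree(z, predict_root, wants_child, predict_bond,
                predict_label, update_z, max_nodes=50):
    """Depth-first tree generation with backtracking (illustrative only)."""
    root = {"label": predict_root(z), "bond": None,
            "parent": None, "children": []}
    current, n_nodes = root, 1
    while True:
        if n_nodes < max_nodes and wants_child(z, current):
            # Topological prediction chose to add a child:
            # predict the bond first, then the child's substructure label.
            bond = predict_bond(z, current)
            label = predict_label(z, current)
            child = {"label": label, "bond": bond,
                     "parent": current, "children": []}
            current["children"].append(child)
            current, n_nodes = child, n_nodes + 1
        elif current["parent"] is None:
            break  # no further child at the root: decoding ends
        else:
            current = current["parent"]  # backtrack to the parent node
        # After label prediction or backtracking, refresh the variable z.
        z = update_z(z, current)
    return root


# Scripted stand-ins for the prediction heads (toy values).
decisions = iter([True, True, False, False, True, False, False])
labels = iter(["amide", "methyl", "hydroxyl"])
root = decode_tree(
    z=[0.0, 0.0],
    predict_root=lambda z: "benzene",
    wants_child=lambda z, node: next(decisions),
    predict_bond=lambda z, node: "single",
    predict_label=lambda z, node: next(labels),
    update_z=lambda z, node: z,
)
```

With the scripted decisions, the root receives two children and the first child receives one child of its own, showing how a "no child" decision at a non-root node returns control to the parent.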

In Chirality assignment, ECFP with chirality information is used in the Encoder to handle the three-dimensional information of compounds. In the Decoder, the latent variable $z$ is input to a multi-layer fully connected network, which outputs the predicted ECFP value; each fully connected layer has its own learned weights and biases.

The dimension of the predicted ECFP is the same as the bit size of the ECFP. After the two-dimensional structure of the compound is output by the sequential algorithm described above, all possible stereoisomers are enumerated and their ECFPs are calculated. The Euclidean distance between each of these ECFPs and the predicted ECFP is computed, and the three-dimensional structure whose ECFP has the smallest distance is selected as the output compound structure.
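The selection step can be sketched as a nearest-neighbour search over the enumerated stereoisomers. The function name, the candidate names, and the 4-bit fingerprints below are hypothetical toy values; real ECFPs are much longer bit vectors.

```python
import math


def pick_stereoisomer(predicted_fp, candidates):
    """Return the candidate whose fingerprint is closest to predicted_fp.

    candidates -- list of (name, fingerprint) pairs, one per enumerated
                  stereoisomer; fingerprints are equal-length bit lists.
    """
    def distance(fp):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(predicted_fp, fp)))

    return min(candidates, key=lambda item: distance(item[1]))


# Toy example: the decoder's predicted fingerprint and two stereoisomers.
predicted = [0.9, 0.1, 0.8, 0.2]
candidates = [
    ("R-isomer", [1, 0, 1, 0]),
    ("S-isomer", [1, 0, 0, 1]),
]
best_name, best_fp = pick_stereoisomer(predicted, candidates)
```

In practice the candidate list could be produced with a cheminformatics toolkit such as RDKit, which offers stereoisomer enumeration and chirality-aware Morgan fingerprints; whether the authors used those particular routines is not stated in this section.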

Learning. During training, to ensure proper learning, even when an incorrect prediction in the decoding process would prevent reconstruction of the input data, the incorrect prediction is replaced with the correct one before feature values are propagated on the tree structure, so that learning can proceed. Additionally, so that the latent space constructed by NP-VAE accounts not only for the structural information of compounds but also for functional information such as bioactivity, the latent variable $z$ is input to a multi-layer fully connected network that predicts the activity value of the input compound; each fully connected layer has its own learned weights and biases.

By adding to the loss function a term that penalizes the difference between the predicted value and the true activity value, functional information is incorporated into the chemical latent space.

The loss function during NP-VAE training is a weighted sum of the cross-entropy loss (CE) calculated from each prediction task in the Decoder, the KL divergence ($D_{\mathrm{KL}}$) representing the distance between the distribution of latent variables and the Gaussian distribution, the binary cross-entropy loss (BCE) in three-dimensional structure prediction, and the mean squared error (MSE) in functional information prediction. The ground-truth values for Root label prediction, Topological prediction, Label prediction, and Bond prediction are each represented by a vector in which the index of the correct label is 1 and all others are 0; the true ECFP value and the true functional information serve as the targets of the BCE and MSE terms, respectively. The weights of the terms are hyperparameters used to adjust the contribution of each term.
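How such a weighted sum fits together can be sketched as below. The exact formula is not reproduced in this section, so several choices here are assumptions: the prior is taken to be a standard normal, the latent distribution a diagonal Gaussian, the cross-entropy terms are left unweighted, and the weight names `w_kl`, `w_bce`, and `w_mse` are hypothetical.

```python
import math


def cross_entropy(one_hot, probs, eps=1e-12):
    """CE between a one-hot target and predicted class probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(one_hot, probs))


def binary_cross_entropy(bits, probs, eps=1e-12):
    """Mean BCE between target fingerprint bits and predicted probabilities."""
    return -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
                for t, p in zip(bits, probs)) / len(bits)


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ) for a diagonal Gaussian."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def total_loss(ce_terms, kl, bce, mse_value, w_kl=1.0, w_bce=1.0, w_mse=1.0):
    """Weighted sum of the loss terms; the weights are hyperparameters."""
    return sum(ce_terms) + w_kl * kl + w_bce * bce + w_mse * mse_value


# Toy values for one training example.
ce_root = cross_entropy([0, 1, 0], [0.2, 0.7, 0.1])
ce_topo = cross_entropy([1, 0], [0.9, 0.1])
kl = kl_to_standard_normal(mu=[0.3, -0.2], logvar=[0.0, -0.1])
bce = binary_cross_entropy([1, 0, 1, 0], [0.8, 0.1, 0.6, 0.3])
err = mse([6.5], [6.1])
loss = total_loss([ce_root, ce_topo], kl, bce, err,
                  w_kl=0.5, w_bce=1.0, w_mse=2.0)
```

Raising `w_mse` relative to the other weights would push the latent space to organise more strongly around the activity value, at some cost to reconstruction.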

Assay for EGFR inhibitory activity of Yessotoxin. The EGFR tyrosine kinase assay was carried out in the same manner as previously reported. Briefly, EGFR was obtained as a membrane fraction of A431 cells, and an aliquot of a DMSO solution of Yessotoxin or AG1478, a selective EGFR inhibitor used as a positive control, was added to a HEPES buffer (pH 7.4) containing the A431 membrane fraction, […], angiotensin II, and EGF. After incubating the mixture at […] for 30 min, the kinase reactions were initiated by the addition of […] ATP. The reaction mixture was incubated at […] for 15 min, and the reaction was then stopped by the addition of TCA and BSA. After removing precipitated proteins by centrifugation, the radioactivity of the supernatant was counted with a liquid scintillation counter.

Docking analysis for EGFR. Protein-ligand docking analysis was conducted using Schrödinger Glide (version 2020-2). The complex structure of EGFR tyrosine kinase and Gefitinib (PDB ID: 2ITY, chain A) was used, with Gefitinib removed from the PDB file. The input files for the protein structure were prepared using the Protein Preparation Wizard in Schrödinger Maestro. The ligand was prepared from the SDF file by LigPrep, generating a 3D conformer. All tautomers were generated by LigPrep. As the docking site, a docking grid of […] Å × […] Å × […] Å from the center of Gefitinib in 2ITY chain A was specified. Docking was carried out using Glide SP mode, and the pose with the best Glide SP score was selected for each ligand.
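Selecting one pose per ligand can be sketched as a simple grouping step. The sketch assumes that a lower (more negative) docking score is better, which is the usual convention for Glide scores; the tuple layout and the numeric values are hypothetical and do not represent Glide's output format or results from this study.

```python
def best_pose_per_ligand(poses):
    """Keep the best-scoring pose of each ligand.

    poses -- iterable of (ligand_id, pose_id, score); lower score is better.
    Returns a dict: ligand_id -> (pose_id, score).
    """
    best = {}
    for ligand, pose, score in poses:
        if ligand not in best or score < best[ligand][1]:
            best[ligand] = (pose, score)
    return best


# Toy input: two ligands with several docked poses each.
poses = [
    ("ligand_A", 1, -6.2),
    ("ligand_A", 2, -7.4),
    ("ligand_B", 1, -5.1),
    ("ligand_B", 2, -4.8),
]
selected = best_pose_per_ligand(poses)
```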

Data availability

The evaluation dataset and the processed DrugBank dataset used in this study are available at https://github.com/toshikiochiai/NPVAE.

Two representative collections of compound structures within the project dataset, namely collections A and B, are also available at the same site. Most other compound structures in the project dataset are unpublished, and restrictions apply to the availability of these data, which were used under license for the current study and are therefore not publicly available. The data may, however, be available from the authors upon reasonable request. The model parameters of NP-VAE trained on the evaluation dataset and on the drug-and-natural-product dataset are all available at the above site.

The source data for the graphs in Figs. 2(b), 5(a), and 5(c) are available as Supplementary Data.

Code availability

The source code for the implementation of NP-VAE is available at https://github.com/toshikiochiai/NPVAE.

Received: 16 June 2023; Accepted: 6 November 2023; Published online: 16 November 2023

References

Bohacek, R. S., McMartin, C. & Guida, W. C. The art and practice of structure-based drug design: a molecular modeling perspective. Med. Res. Rev. 16, 3-50 (1996).
Rodrigues, T., Reker, D., Schneider, P. & Schneider, G. Counting on natural products for drug design. Nat. Chem. 8, 531-541 (2016).
Grisoni, F. et al. Combining generative artificial intelligence and on-chip synthesis for de novo drug design. Sci. Adv. 7, eabg3338 (2021).
Kingma, D. P. & Welling, M. Auto-encoding variational Bayes. Preprint at https://arxiv.org/abs/1312.6114 (2013).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268-276 (2018).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31-36 (1988).
Kusner, M. J., Paige, B. & Hernández-Lobato, J. M. Grammar variational autoencoder. Proc. 34th Int. Conf. Mach. Learn. 70, 1945-1954 (2017).
Dai, H., Tian, Y., Dai, B., Skiena, S. & Song, L. Syntax-directed variational autoencoder for structured data. Preprint at https://arxiv.org/abs/1802.08786 (2018).
Moret, M., Helmstädter, M., Grisoni, F., Schneider, G. & Merk, D. Beam search for automated design and scoring of novel ROR ligands with machine intelligence. Angew. Chem. Int. Ed. Engl. 60, 19477-19482 (2021).
Flam-Shepherd, D., Zhu, K. & Aspuru-Guzik, A. Language models can learn complex molecular distributions. Nat. Commun. 13, 3293 (2022).
Moret, M. et al. Leveraging molecular structure and bioactivity with chemical language models for de novo drug design. Nat. Commun. 14, 114 (2023).
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, (1997).
Skinnider, M. A. et al. Chemical language models enable navigation in sparsely populated chemical space. Nat. Mach. Intell. 3, 759-770 (2021).
Liu, Q., Allamanis, M., Brockschmidt, M. & Gaunt, A. L. Constrained graph variational autoencoders for molecule design. Preprint at https://arxiv.org/abs/1805.09076 (2018).
Jin, W., Barzilay, R. & Jaakkola, T. Junction tree variational autoencoder for molecular graph generation. Proc. 35th Int. Conf. Mach. Learn. 80, 2323-2332 (2018).
Jin, W., Barzilay, R. & Jaakkola, T. Hierarchical generation of molecular graphs using structural motifs. Proc. 37th Int. Conf. Mach. Learn. 119, (2020).
Rezende, D. & Mohamed, S. Variational inference with normalizing flows. Proc. 32nd Int. Conf. Mach. Learn. 37, 1530-1538 (2015).
Zang, C. & Wang, F. MoFlow: an invertible flow model for generating molecular graphs. Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. 37, 617-626 (2020).
Kakeya, H. Natural products-prompted chemical biology: phenotypic screening and a new platform for target identification. Nat. Prod. Rep. 33, 648-654 (2016).
Newman, D. J. & Cragg, G. M. Natural products as sources of new drugs over the nearly four decades from 01/1981 to 09/2019. J. Nat. Prod. 83, 770-803 (2020).
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742-754 (2010).
Tai, K. S., Socher, R. & Manning, C. D. Improved semantic representations from tree-structured long short-term memory networks. Proc. 53rd Annu. Meet. Assoc. Comput. Linguist. 7th Int. Jt. Conf. Nat. Lang. Process. 1, 1556-1566 (2015).
Wishart, D. S. et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. 46, D1074-D1082 (2018).
St John, P. C. et al. Message-passing neural networks for high-throughput polymer screening. J. Chem. Phys. 150, 234111 (2019).
Landrum, G. et al. RDKit: Open-Source Cheminformatics Software. https://www.rdkit.org/ (2016).
Segler, M. H. S., Kogej, T., Tyrchan, C. & Waller, M. P. Generating focused molecule libraries for drug discovery with recurrent neural networks. ACS Cent. Sci. 4, 120-131 (2018).
Frontier Research on Chemical Communications. Project URL: https://www.pharm.kyoto-u.ac.jp/fr_chemcomm/en/ (2021).
Irwin, J. J. & Shoichet, B. K. ZINC - a free database of commercially available compounds for virtual screening. J. Chem. Inf. Model. 45, 177-182 (2005).
Polykovskiy, D. et al. Molecular Sets (MOSES): a benchmarking platform for molecular generation models. Front. Pharmacol. 11, 565644 (2020).
Brown, N., Fiscato, M., Segler, M. H. S. & Vaucher, A. C. GuacaMol: benchmarking models for de novo molecular design. J. Chem. Inf. Model. 259, (2019).
Bickerton, G. R., Paolini, G. V., Besnard, J., Muresan, S. & Hopkins, A. L. Quantifying the chemical beauty of drugs. Nat. Chem. 4, 90-98 (2012).
Ertl, P. & Schuffenhauer, A. Estimation of synthetic accessibility score of drug-like molecules based on molecular complexity and fragment contributions. J. Cheminform. 1, 8 (2009).
Ertl, P., Roggo, S. & Schuffenhauer, A. Natural product-likeness score and its application for prioritization of compound libraries. J. Chem. Inf. Model. 48, (2008).
van der Maaten, L. & Hinton, G. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579-2605 (2008).
Murata, M., Kumagai, M., Lee, J. S. & Yasumoto, T. Isolation and structure of yessotoxin, a novel polyether compound implicated in diarrhetic shellfish poisoning. Tetrahedron Lett. 28, 5869-5872 (1987).
Bergstra, J., Bardenet, R., Bengio, Y. & Kégl, B. Algorithms for hyperparameter optimization. Adv. Neural Inf. Process. Syst. 24, 2546-2554 (2011).
Boitreaud, J., Mallet, V., Oliver, C. & Waldispühl, J. OptiMol: optimization of binding affinities in chemical space for drug discovery. J. Chem. Inf. Model. 60, (2020).
Griffiths, R. R. & Hernández-Lobato, J. M. Constrained Bayesian optimization for automatic chemical design using variational autoencoders. Chem. Sci. 11, 577-586 (2019).
Korovina, K. et al. Chembo: Bayesian optimization of small organic molecules with synthesizable recommendations. Proc. Twenty Third Int. Conf. Artif. Intell. Stat. PMLR 108, 3393-3403 (2020).
Friesner, R. A. et al. Extra precision glide: docking and scoring incorporating a model of hydrophobic enclosure for protein-ligand complexes. J. Med. Chem. 49, 6177-6196 (2006).
Cross, D. A. et al. AZD9291, an irreversible EGFR TKI, overcomes T790M-mediated resistance to EGFR inhibitors in lung cancer. Cancer Discov. 4, 1046-1061 (2014).
Blaschke, T. et al. REINVENT 2.0: an AI tool for de novo drug design. J. Chem. Inf. Model. 60, 5918-5922 (2020).
Gaulton, A. et al. The ChEMBL database. Nucleic Acids Res. 45, D945-D954 (2017).
Ohue, M., Kojima, Y. & Kosugi, T. Generating potential protein-protein interaction inhibitor molecules based on physicochemical properties. Molecules 28, 5652 (2023).
Gajiwala, K. S. et al. Insights into the aberrant activity of mutant EGFR kinase domain and drug recognition. Structure 21, 209-219 (2013).
Modjtahedi, H., Cho, B. C., Michel, M. C. & Solca, F. A comprehensive review of the preclinical efficacy profile of the ErbB family blocker afatinib in cancer. Naunyn Schmiedebergs Arch. Pharmacol. 387, 505-521 (2014).
Paz, B. et al. Yessotoxins, a group of marine polyether toxins: an overview. Mar. Drugs 6, 73-102 (2008).

Acknowledgements

This work was supported by a Grant-in-Aid for Scientific Research on Innovative Areas "Frontier Research on Chemical Communications" [no. 17H06410, no. 22H04901 (Y.S.)] from the Ministry of Education, Culture, Sports, Science and Technology of Japan. This work was also supported by a Grant-in-Aid for Transformative Research Area (A) "Latent Chemical Space" [23H04887 (M.O.), 23H04881 (K.K., N.M.), 23H04885 (Y.S.), and 23H04880 (M.O., K.K., Y.S.)] from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

We are grateful to the project members of "Frontier Research on Chemical Communications" for contributing to the construction of the database for the original compound library.

Author contributions

T.O. implemented the software, analysed data, and compared the method with existing methods. T.I. compared the method with existing methods and analysed data. M.A. analysed data. K.F. and M.O. performed the docking analysis. N.M. performed the EGFR inhibitory assay. S.I., M.U., and T.S. set up the database for the original compound library. H.K. and K.K. set up the database for the original compound library and edited the paper. Y.S. designed and supervised the research, analysed data, and wrote the paper. All authors read and approved the final manuscript.

Competing interests

The authors declare no competing interests.

Additional information

Supplementary information The online version contains supplementary material available at https://doi.org/10.1038/s42004-023-01054-6.

Correspondence and requests for materials should be addressed to Yasubumi Sakakibara.

Peer review information Communications Chemistry thanks Martin Vogt and Carlos Oliver for their contribution to the peer review of this work. A peer review file is available.

Reprints and permission information is available at http://www.nature.com/reprints

Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
© The Author(s) 2023

Department of Biosciences and Informatics, Keio University, Yokohama, Kanagawa 223-8522, Japan. Department of Computer Science, School of Computing, Tokyo Institute of Technology, Yokohama, Kanagawa 226-8501, Japan. Department of Chemistry, Graduate School of Science, Kyushu University, Fukuoka, Fukuoka 819-0395, Japan. Division of Medicinal Frontier Sciences, Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto, Kyoto 606-8501, Japan. Institute for Chemical Research and WPI-iCeMS, Kyoto University, Uji, Kyoto 611-0011, Japan. Omura Satoshi Memorial Institute and Graduate School of Infection Control Sciences, Kitasato University, Minato-ku, Tokyo 108-8641, Japan. Department of Applied Chemistry, Graduate School of Engineering, Osaka University, Suita, Osaka 565-0871, Japan. Immunology Frontier Research Centre, Osaka University, Suita, Osaka 565-0871, Japan. Department of Data Science, Kitasato University School of Frontier Engineering, Sagamihara, Kanagawa 252-0373, Japan. email: yasu@bio.keio.ac.jp
