
communications chemistry

Variational autoencoder-based chemical latent space for large molecular structures with 3D complexity

Toshiki Ochiai, Tensei Inukai, Manato Akiyama, Kairi Furui, Masahito Ohue, Nobuaki Matsumori, Shinsuke Inuki, Motonari Uesugi, Toshiaki Sunazuka, Kazuya Kikuchi, Hideaki Kakeya & Yasubumi Sakakibara

Abstract

The structural diversity of chemical libraries, which are systematic collections of compounds that have potential to bind to biomolecules, can be represented by chemical latent space. A chemical latent space is a projection of compound structures into a mathematical space based on several molecular features, and it can express structural diversity within a compound library in order to explore a broader chemical space and generate novel compound structures for drug candidates. In this study, we developed a deep-learning method, called NP-VAE (Natural Product-oriented Variational Autoencoder), based on a variational autoencoder for managing hard-to-analyze datasets from DrugBank and large molecular structures such as natural compounds with chirality, an essential factor in the 3D complexity of compounds. NP-VAE succeeded in constructing the chemical latent space from large-sized compounds that existing methods were unable to handle, achieving higher reconstruction accuracy and demonstrating stable performance as a generative model across various indices. Furthermore, by exploring the acquired latent space, we succeeded in comprehensively analyzing a compound library containing natural compounds and generating novel compound structures with optimized functions.

It is estimated that there are approximately 10^60 variations in all possible compound structures, even when limited to small molecules with a molecular weight of 500 or less. Structural diversity within compound libraries is crucial for discovering new pharmaceutical compounds, necessitating coverage of as many candidates as possible from this vast pool. The structural diversity of compounds in a compound library can be represented by a chemical latent space, which projects compound structures onto a mathematical space based on various molecular features, such as fingerprints. Building a chemical latent space that potentially contains numerous unknown compound structures beyond existing libraries is an important research topic in cheminformatics. On the other hand, many of the natural compounds produced by living organisms have complex and unique structures compared to conventional drugs, and they exhibit high biological activity. By applying state-of-the-art generative artificial intelligence techniques to the heterogeneous data from both the approved drug database and natural product compound libraries, it becomes possible to virtually generate and design novel compound structures that combine the characteristics of both types of data. This approach has become a global trend in cutting-edge drug discovery, known as DMTA (design-make-test-analyze).
Various deep-learning models have been developed with the aim of constructing chemical latent spaces and computationally generating new compound structures. The variational autoencoder (VAE) is one of the representative deep-learning methods for constructing chemical latent spaces. A VAE consists of two components: an encoder, which transforms input values into low-dimensional variables called latent variables, and a decoder, which transforms latent variables into corresponding output values. By exploring the chemical latent space, unknown compound structures not present in the training data can be generated. VAEs that handle compound structures as inputs and outputs are broadly classified into two types: SMILES-based and graph-based. Chemical VAE (CVAE), the earliest of the SMILES-based methods, takes SMILES strings, which represent compound structures as strings, as input and constructs a latent space by projecting a compound library into a low-dimensional space. CVAE was a pioneering study that applied VAE to constructing a latent space of chemical compounds. However, CVAE, which outputs SMILES strings symbol by symbol, faces the issue that most of the output does not conform to chemical rules and fails to form a valid compound structure. Therefore, CVAE added an ad-hoc one-step process to validate the chemical structures output from the decoder and discard the invalid ones. Grammar VAE (GVAE) and Syntax-Directed VAE (SD-VAE) were developed as models capable of generating more valid compound structures by focusing on the grammatical structure of SMILES strings. Recently, studies applying the SMILES representation have again become active in the form of chemical language models (CLMs). These state-of-the-art methods adopt recurrent neural networks (RNNs) such as LSTM to learn the SMILES grammar from a large amount of pretraining data and apply transfer learning to the compound structures of interest. Nevertheless, models based on the SMILES representation still generate invalid SMILES strings, and hence require these invalid outputs to be filtered out. In addition, one crucial difference from the VAE approach is that these RNN-based models are merely generative: they do not explicitly construct a vector space (latent space) whose latent variables are projected from, and can be reverse-mapped to, the actual compounds. This means they do not provide the capability to explore the latent space of structurally diverse compounds in the library.
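The encoder/decoder description above can be made concrete with the two ingredients of a textbook VAE: sampling a latent variable from the encoder's output and penalizing its divergence from the prior. The sketch below is a generic illustration in plain Python, not NP-VAE's code; the function names and the example `mu` and `log_var` values are hypothetical, and a diagonal Gaussian posterior with a standard normal prior is assumed.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), element-wise."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) for a diagonal Gaussian."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu = [0.5, -1.0, 0.0]        # hypothetical encoder output: mean
log_var = [0.0, -2.0, 0.0]   # hypothetical encoder output: log variance
z = reparameterize(mu, log_var, rng)      # latent variable passed to a decoder
kl = kl_to_standard_normal(mu, log_var)   # regularization term of the VAE loss
```

The KL term is what pulls encoded compounds toward a shared, continuous region of the latent space, which is the property that makes exploring that space meaningful.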

Graph-based models include Constrained Graph VAE (CGVAE) and Junction Tree VAE (JT-VAE). These models represent compound structures as graph structures defined by adjacency relationships between atoms, enabling them to generate completely valid outputs. Specifically, JT-VAE achieved high reconstruction accuracy by treating molecular graphs not only as graph structures but also as tree structures. However, these models were all designed for small molecules and could not handle large compound structures due to their high spatial order. As a result, Hierarchical VAE (HierVAE) was developed to apply VAE to larger compound structures. By handling compound structures in relatively large fragment units, HierVAE demonstrated high accuracy for datasets composed of large compounds with repeating structures. Nevertheless, challenges remain, such as the inability to consider stereochemistry and the difficulty of applying the model to massive and complex compound structures with diverse internal structures, like natural compounds. It is worth noting that natural compounds are quite different from massive compounds with uniform internal structures, like polymers.
Another state-of-the-art generative model is the flow-based model, which uses deep learning to project and generate compound structures. Flow-based models map to a latent space that guarantees inverse transformation by repeatedly applying invertible functions to input data. In other words, the reconstruction accuracy of flow-based models is always guaranteed to be 100%, regardless of the degree of learning. The graph-based flow model, MoFlow, represents compound structures as two types of bit matrices and can acquire completely invertible latent variables by applying a normalizing flow to each. However, unlike the case with VAEs, this does not necessarily indicate the accuracy or continuity of the resulting latent space. Furthermore, flow-based models have the problem that the latent space becomes high-dimensional in order to guarantee reversibility. Since a latent space with the same dimension as the input is constructed, exploring the space is much harder than with VAEs, which project onto a lower-dimensional space. Moreover, flow-based learning can be quite unstable, making it prone to gradient explosion when the input dimension increases. Therefore, the use of flow-based models for compound data is limited to small compounds.
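To illustrate why invertibility guarantees exact reconstruction "regardless of the degree of learning", the following sketch implements one affine coupling layer, a common building block of normalizing flows. It is a generic toy in plain Python and is not MoFlow's implementation; `toy_scale` and `toy_shift` are hypothetical, untrained stand-ins for the networks a real model would learn.

```python
import math


def coupling_forward(x, scale_fn, shift_fn):
    """Affine coupling: keep the first half, transform the second half."""
    half = len(x) // 2
    x1, x2 = x[:half], x[half:]
    s, t = scale_fn(x1), shift_fn(x1)
    y2 = [v * math.exp(si) + ti for v, si, ti in zip(x2, s, t)]
    return x1 + y2


def coupling_inverse(y, scale_fn, shift_fn):
    """Exact inverse: the untouched half reproduces the same scale and shift."""
    half = len(y) // 2
    y1, y2 = y[:half], y[half:]
    s, t = scale_fn(y1), shift_fn(y1)
    x2 = [(v - ti) * math.exp(-si) for v, si, ti in zip(y2, s, t)]
    return y1 + x2


# Hypothetical, untrained "networks": any function of the first half works.
def toy_scale(h):
    return [math.tanh(sum(h))] * len(h)


def toy_shift(h):
    return [0.3 * v for v in h]


x = [0.2, -1.5, 0.7, 2.0]          # even-length input
z = coupling_forward(x, toy_scale, toy_shift)
x_back = coupling_inverse(z, toy_scale, toy_shift)
```

Note that `z` has the same length as `x`: the layer never reduces dimensionality, which is the source of the high-dimensional latent space discussed above.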
As discussed above, deep learning models developed thus far have struggled to effectively handle large, complex and heterogeneous compound structures, such as natural compounds. Natural compounds produced by organisms such as microbes and plants often possess novel structures, and because they are produced during biological processes, many of them exhibit high biological activity. Indeed, natural products have been widely used as drugs, such as antibiotics like Penicillin and Streptomycin, and anticancer agents like Bryostatin and Epothilone. Therefore, constructing a chemical latent space from a compound library that includes natural compounds plays a crucial role in drug discovery.
In this study, we developed a VAE-based deep learning method, called Natural-Product Compound Variational Autoencoder (NP-VAE), to handle compounds with complex molecular structures like natural compounds and acquire a chemical latent space that projects large molecular structures. NP-VAE is a graph-based VAE that combines an algorithm for effectively decomposing compound structures into fragment units and converting them into tree structures with Extended Connectivity Fingerprints (ECFP) and Tree-LSTM, a type of recurrent neural network. NP-VAE, which has 12 million parameters, was developed through significant improvements to the algorithms of JT-VAE and HierVAE.
The first objective of this study is to obtain a highly interpretable chemical latent space that includes middle/large molecular structures like natural compounds using NP-VAE. We construct a latent space
Table 1 Performance comparison of generalization ability between NP-VAE and existing methods (baseline models).

Model      2D Reconstruction accuracy (test)   Validity
NP-VAE     0.813                               1.000
HierVAE    0.799                               1.000
JT-VAE     0.585                               1.000
CG-VAE     0.424                               1.000
CVAE       0.215                               0.931

The reconstruction accuracy and validity for test compounds in the evaluation dataset were compared. The reconstruction accuracy and validity of existing methods were taken from values reported in the previous study.
that includes hard-to-analyze DrugBank and a large collection of natural compounds, which previous studies could not handle, and effectively perform statistical and comprehensive functional analysis of compound libraries. Furthermore, NP-VAE is developed to deal with chirality, which is an essential factor in the 3D structure of compounds. The second objective is to generate novel compound structures optimized for the target function (property) by exploring the acquired latent space. NP-VAE has a mechanism to train the chemical latent space incorporating functional information along with structural information. The obtained chemical latent space enables the design of optimized compound structures as molecular-targeted drugs by generating new compounds from the sub-space surrounding an existing pharmaceutical drug, such as an anticancer drug. Furthermore, by combining NP-VAE-based generation of novel compound structures with docking analysis, we demonstrate the usefulness of this method as an in-silico drug discovery tool.

Results and discussions

Performance evaluation of NP-VAE as reconstruction and generative model. First, we evaluated the reconstruction accuracy of the proposed NP-VAE for test compounds that were not included in the training compound data, which is referred to as the generalization ability. Evaluating the generalization ability is crucial because it allows us to verify how accurately the constructed chemical latent space has been interpolated. Following the HierVAE study, which evaluated generalization ability, we used the dataset of St. John et al. (hereinafter referred to as the evaluation dataset). This dataset was divided into 76,000 training compounds, 5000 validation compounds, and 5000 test compounds, exactly the same as in the previous study. After training on the training compounds, the reconstruction accuracy and validity for test compounds were calculated. Reconstruction accuracy was determined using the Monte Carlo method for the 5000 test compounds. In other words, for each test compound, 10 encodings were performed, and 10 decodings were conducted for each encoding, resulting in 100 output compounds. The proportion of compound structures that matched between the input to the encoder and the output from the decoder was calculated. To determine validity, 1000 latent vectors were sampled from the prior distribution, and after decoding each of them 100 times, the proportion of chemically valid output compounds was examined using RDKit. Four state-of-the-art compound VAEs, namely CVAE, CG-VAE, JT-VAE, and HierVAE, were compared as baseline models. As shown in Table 1, NP-VAE demonstrated higher reconstruction accuracy for the test compounds compared to the previous models. Additionally, since NP-VAE generates compounds in substructure units (fragments) rather than single-atom units, the generation success rate is always 100%. These

Table 2 Comparison of the maximum number of atoms and molecular weight of the compounds included in the dataset.

Dataset                            Maximum number of atoms   Maximum molecular weight
Drug-and-natural-product dataset   551                       8272
Restricted dataset                 100                       1626
DrugBank                           551                       8272
Project dataset                    457                       6574
ZINC                               38                        500

In the maximum number of atoms, only non-hydrogen atoms were counted. Compared to the ZINC dataset used in previous studies, the drug-and-natural-product dataset (the combination of DrugBank and the project dataset) includes compounds with molecular weights more than 13 times larger.

results indicate that NP-VAE is a high-performance generative model, suggesting that the chemical latent space constructed by NP-VAE contains sufficient information to accurately estimate unknown compounds from known compounds.
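The Monte Carlo estimate of reconstruction accuracy described above (10 encodings per test compound, 10 decodings per encoding, then the fraction of outputs that match the input) can be sketched as follows. The `encode` and `decode` callables and the toy stand-ins are hypothetical placeholders, not NP-VAE's actual interface; a real evaluation would presumably compare a canonical structure representation rather than raw strings.

```python
import random


def reconstruction_accuracy(compounds, encode, decode, n_enc=10, n_dec=10):
    """Fraction of decoded outputs identical to the input compound."""
    matches = 0
    total = 0
    for compound in compounds:
        for _ in range(n_enc):
            z = encode(compound)           # stochastic encoding
            for _ in range(n_dec):
                total += 1
                if decode(z) == compound:  # stochastic decoding
                    matches += 1
    return matches / total if total else 0.0


# Toy stand-ins (hypothetical): the "latent" is the string itself,
# and the decoder corrupts it about 20% of the time.
rng = random.Random(0)


def toy_encode(compound):
    return compound


def toy_decode(z):
    return z if rng.random() < 0.8 else z + "?"


test_set = ["CCO", "c1ccccc1", "CC(=O)O"]
accuracy = reconstruction_accuracy(test_set, toy_encode, toy_decode)
```

With the default settings each compound contributes 100 trials, so the estimate averages over both sources of randomness instead of relying on a single encode/decode pass.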
Next, we compared the performance of NP-VAE as a generative model with that of state-of-the-art generative models: the flow-based model MoFlow; the SMILES-based method employing CharRNN (character-level RNN), which is also referred to as SM-RNN and demonstrated high performance in a previous study; and the VAE model HierVAE.
Since our primary motivation is to develop a VAE model capable of handling large and complex molecules for the construction of chemical latent space, we prepared a compound library consisting of approximately 30,000 compounds. This library combines around 10,000 compounds from DrugBank, a public database containing numerous approved drug compounds, and approximately 20,000 compounds from the project dataset (the combined library is hereinafter referred to as the drug-and-natural-product dataset). The project dataset is an original compound library collected from various laboratories through the Ministry of Education, Culture, Sports, Science, and Technology-designated project, "Frontier Research on Chemical Communications" (ref. 27), in which this research participated. The project dataset mainly includes natural compounds, and compared to compounds in the frequently used ZINC database, it contains a number of complex and large molecules (see Supplementary Fig. S1 for illustration). However, the state-of-the-art VAE models, JT-VAE and HierVAE, and the flow-based model MoFlow were unable to handle data for compounds of this larger size, so we had to prepare a restricted dataset on which all existing methods can be executed. This restricted dataset was constructed by first reducing the drug-and-natural-product dataset to compounds with fewer than 100 non-hydrogen atoms and further removing some compounds that caused errors with HierVAE. Therefore, we first compared the difference in maximum compound size between the drug-and-natural-product dataset and the restricted dataset. Table 2 shows the comparison of the maximum number of atoms and molecular weight of the compounds included in the drug-and-natural-product dataset, the restricted dataset, and three other databases. Note that JT-VAE took an impractical amount of computation time and did not complete the calculations even with the restricted dataset; therefore, it was excluded from all subsequent experiments.
Next, for each model, we randomly sampled 5000 latent vectors from the prior Gaussian distribution, repeating this process five times. Then, we calculated and compared the following metrics proposed in benchmarking studies such as MOSES and GuacaMol.
Table 3 Comparison of NP-VAE, HierVAE, MoFlow and SM-RNN as generative models: For the generated compounds, 2D and 3D reconstruction accuracy, uniqueness, novelty, logP, SAscore, QED, Filters, SNN, molecular weight, NP-likeness, and the distance between compound property distributions, Frag, Scaf, IntDiv, and Phys div were calculated.

Metric                       NP-VAE   HierVAE
2D Reconstruction accuracy   0.871    0.438

[Table rows: 2D Reconstruction accuracy, 3D Reconstruction accuracy, Uniqueness, Novelty, logP, QED, SAscore, Filters, SNN, MolWt, NP-likeness, Frag, Scaf, IntDiv, Phys div. Only the two values above survive.]

The highest value for each accuracy index is shown in bold.

  • Uniqueness: The proportion of unique molecules among the generated valid molecules. This serves as an indicator of the uniqueness of the generated compound structures, and the value will be low if the model has collapsed and generated only a few typical molecules.
  • Novelty: The proportion of generated molecules that do not exist in the training set. A low value may indicate overfitting.
  • LogP: Represents the lipophilicity of a molecule. A moderate lipophilicity is required for pharmaceutical compounds.
  • QED: An indicator representing the drug-likeness of a molecule. Since it is calculated based on existing oral drugs, it can be considered an indicator of oral drug-likeness. It is expressed as a value between 0 and 1, with values closer to 1 indicating structures that are more like oral pharmaceuticals.
  • SAscore: A score representing the difficulty of synthesis based on molecular structure. It is expressed as a value between 1 and 10, with values closer to 10 indicating higher synthesis difficulty.
  • Filters: The proportion of generated molecules that passed the filters, used during the construction of the MOSES dataset, that eliminate undesired structures. A lower value indicates that there are more molecules with abnormal structures.
  • SNN (similarity to a nearest neighbor): The average Tanimoto coefficient between each generated molecule and the most similar molecule within the training data. This value decreases as the generated molecules deviate further from the distribution of the training data.
  • MolWt: Molecular weight.
  • NP-likeness: A measure of naturalness. The NP-likeness score is designed to estimate how closely a given molecule resembles known natural products.
  • Frag: Comparison of the distribution of BRICS fragments between generated molecules and the training data. The value increases when both sets contain molecules with similar fragments.
  • Scaf: Comparison of the distribution of primary structural elements within molecules, referred to as scaffolds. Frag and Scaf are both metrics used to measure the similarity between generated molecules and the training data at the level of substructure units.
  • IntDiv: An evaluation of the structural diversity within the set of generated molecules. It is calculated by computing the Tanimoto coefficient between all pairs of generated molecules and taking the average.
  • Phys div: The KL divergence between generated molecules and the training data in terms of physicochemical properties, calculated from physicochemical descriptors such as molecular weight, the number of aromatic rings, and the count of rotatable bonds. A higher value indicates better performance.
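Several of the set-based metrics in the list above reduce to a few lines once molecules are given as identifiers and fingerprints. The sketch below follows the definitions as worded in the list and is not the MOSES or GuacaMol reference implementation; fingerprints are represented as plain sets of on-bit indices, and all example data are made up for illustration. Note that MOSES reports internal diversity as one minus the mean pairwise similarity, so only the mean similarity is computed here.

```python
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def uniqueness(generated):
    """Share of distinct molecules among the generated ones."""
    return len(set(generated)) / len(generated)


def novelty(generated, training):
    """Share of generated molecules that do not occur in the training set."""
    known = set(training)
    return sum(1 for g in generated if g not in known) / len(generated)


def snn(generated_fps, training_fps):
    """Mean similarity of each generated molecule to its nearest training molecule."""
    return sum(max(tanimoto(g, t) for t in training_fps)
               for g in generated_fps) / len(generated_fps)


def mean_pairwise_similarity(fps):
    """Average Tanimoto coefficient over all pairs of generated molecules."""
    pairs = list(combinations(fps, 2))
    return sum(tanimoto(a, b) for a, b in pairs) / len(pairs)


# Made-up example data: identifiers as strings, fingerprints as bit sets.
generated = ["CCO", "CCO", "CCN", "c1ccccc1"]
training = ["CCO", "CCC"]
generated_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8}]
training_fps = [{1, 2, 3}, {2, 3, 5}]

u = uniqueness(generated)
n = novelty(generated, training)
s = snn(generated_fps, training_fps)
m = mean_pairwise_similarity(generated_fps)
```

A model that simply replays its training set would score high on `snn` and low on `novelty`, which is why the two have to be read together.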
The results are shown in Table 3. First, the reconstruction accuracy for training compounds was compared among the three models that construct a latent space using an encoder and decoder: NP-VAE, HierVAE, and MoFlow. This reconstruction accuracy indicates how accurately the compound library can be projected without information loss. Regarding the 2D reconstruction accuracy for the planar structure of compounds, NP-VAE significantly outperformed HierVAE. On the other hand, MoFlow achieved a 2D reconstruction accuracy of 100%. By their mechanism, the reconstruction accuracy of flow-based models is always guaranteed to be 100%, regardless of the degree of learning. However, unlike the case with VAEs, this does not necessarily indicate the accuracy or continuity of the resulting latent space. Flow-based models have the problem that the latent space becomes high-dimensional in order to guarantee reversibility. Since a latent space with the same dimension as the input is constructed, exploring the space is much harder than with VAEs, which project onto a lower-dimensional space. Moreover, flow-based learning can be quite unstable, making it prone to gradient explosion when the input dimension increases. The 3D reconstruction accuracy of NP-VAE, which considers not only the planar structure but also the stereochemistry (chirality), exceeded […], while the other two models could not handle 3D structures, and hence their accuracy was not available. This result demonstrates that NP-VAE has high performance as a feature extractor.
Second, NP-VAE demonstrated stable performance as a generative model across almost all indices. In terms of uniqueness, novelty and logP, NP-VAE showed performance comparable to that of the top-performing models. Due to the large variance in […], the difference in scores among models is not statistically significant. NP-VAE exhibited the highest QED, SAscore, and Filters scores, which represent the drug-likeness of a molecule, the difficulty of its synthesis, and the proportion that passed through a filter to eliminate undesired structures,

respectively. On the other hand, SM-RNN scored nearly twice as high as the other models on the SNN, Scaf and Phys div metrics; however, these results are not informative. The compound structures generated by SM-RNN had a Novelty of only […], which implies that […] of the structures were identical to the training data. These three metrics, along with the Frag score, would obviously improve if a model output compound structures identical to the training data. In other words, SM-RNN is simply memorizing the training data, and its usefulness as a generative model for generating new structures is limited. Regarding molecular weight, we included a plot illustrating the size distributions of the molecules generated by all models in Supplementary Fig. S2. Each distribution is highly divergent, indicating the generation of diverse molecular weights. Regarding NP-likeness, MoFlow showed the highest value, being the only model with a positive score, while the other models take negative values. This is attributed to the fact that MoFlow generates extremely long straight-chain structures, which are considered abnormal, as indicated by the low Filters score. Moreover, since these structures have a Scaf value of 0, meaning they completely lack a scaffold, they are considered natural-product-like because they are unusual and rare compared to general compounds. The NP-likeness calculation method is therefore prone to assigning higher values in such cases. On the other hand, the NP-likeness score of the training data is -0.638, mainly due to the inclusion of numerous approved drug compounds from DrugBank. This value is close to the negative NP-likeness scores shown by the other three models.
In conclusion, NP-VAE successfully generates compound structures with the highest scores in the desired drug-likeness metrics, such as QED, SAscore and Filters, while maintaining similarity to the training data at the level of fragments and scaffolds, as indicated by the Frag and Scaf values, and achieving high novelty of generated structures.
Construction of chemical latent space with natural compounds. We constructed two chemical latent spaces using the entire drug-and-natural-product dataset: one obtained by training with only the structural information of the compounds, and the other obtained by training with both structural information and the NP-likeness score, which serves as a measure of naturalness, as functional information. More precisely, the NP-likeness score is incorporated into the learning process as functional information through a component of the loss function. The NP-likeness score is a measure designed to estimate how closely a given molecule resembles known natural products. It is based on the distribution of fragments (substructures) in the molecule compared to a reference set of known natural products. A higher NP-likeness score suggests that a molecule is more 'natural product-like,' meaning that its structure is more similar to those of known natural products.
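The text states only that the NP-likeness score enters training through a component of the loss function; the exact form is not given in this section. One common way to write such an objective, shown here purely as an assumption and not as the authors' formula, adds a property-regression term to the usual VAE loss:

```latex
\mathcal{L}(x, y) =
  \mathcal{L}_{\mathrm{rec}}(x, \hat{x})
  + \beta \, D_{\mathrm{KL}}\!\big(q(z \mid x) \,\|\, p(z)\big)
  + \lambda \, \big(f(z) - y\big)^{2}
```

Here $x$ is the input compound, $\hat{x}$ its reconstruction, $z$ the latent variable, and $y$ the NP-likeness score of $x$. The predictor $f$ and the weights $\beta$ and $\lambda$ are hypothetical symbols introduced for illustration. Under this reading, the third term is what would encourage compounds with similar NP-likeness to lie close together in the latent space.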
Fig. 1 Visualization of chemical latent spaces using t-SNE plots. In a and b, the higher the NP-likeness score, the more yellow the point, and the lower the score, the more purple. Compared to the chemical latent space (a) trained only on substructures, the chemical latent space (b) trained on both substructures and the NP-likeness score as functional information shows a more clustered distribution according to the NP-likeness score. When representative anticancer compounds are plotted in the chemical latent space trained only on substructures (c) and in the chemical latent space trained on both substructures and NP-likeness scores (d), both show clustered distributions according to the anticancer drug classification, with chemotherapeutic drugs and molecular-targeted drugs distributed separately. Focusing on the distribution of molecular-targeted drugs (red frame), the distribution is more locally clustered when the NP-likeness score is included.
Fig. 2 Yessotoxin and its EGFR inhibitory activity. a Structure of Yessotoxin. Yessotoxin was first discovered in the 1980s in the scallop species Patinopecten yessoensis, and since then various derivatives have been found in crustaceans and algae. b Inhibition of EGF-stimulated EGFR phosphorylation by Yessotoxin. EGFR tyrosine kinase activities are expressed as a percentage of the maximal phosphorylation induced by EGF. AG1478 is a selective EGFR inhibitor and was used as a positive control. When Yessotoxin was at , an inhibitory effect of over was confirmed.
First, we visualized these two chemical latent spaces by reducing the latent variables to two dimensions using t-SNE. The results are shown in Fig. 1. In (a) and (b), compounds with higher NP-likeness scores are shown in yellow, while those with lower scores are shown in purple. In contrast to the latent space constructed using only the structural information of the compounds (Fig. 1a), gradients of NP-likeness can be observed in the latent space constructed by incorporating NP-likeness scores as functional information (Fig. 1b). To quantitatively assess the gradients of NP-likeness in the chemical latent space, we calculated the correlation between the embedding distance and the difference in NP-likeness scores for randomly sampled pairs of points in the latent space. The results shown in Supplementary Fig. S3 indicate that the Pearson correlation coefficient in the latent space incorporating NP-likeness scores is slightly higher than that in the latent space constructed using only the structural information.
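The pairwise correlation analysis described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used: the Euclidean distance, the absolute score difference, the number of sampled pairs, and all function names are assumptions.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two latent vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


def distance_score_correlation(latents, scores, n_pairs=1000, seed=0):
    """Correlate embedding distance with score difference over random pairs.

    Assumption: the score difference is taken as an absolute value.
    """
    rng = random.Random(seed)
    distances, differences = [], []
    for _ in range(n_pairs):
        i, j = rng.sample(range(len(latents)), 2)  # two distinct compounds
        distances.append(euclidean(latents[i], latents[j]))
        differences.append(abs(scores[i] - scores[j]))
    return pearson(distances, differences)


# Usage on toy data where the score varies linearly along one latent axis.
toy_latents = [[float(k), 0.0] for k in range(20)]
toy_scores = [2.0 * k for k in range(20)]
r = distance_score_correlation(toy_latents, toy_scores, n_pairs=200)
print(r)
```

A coefficient near 1 means that compounds far apart in the latent space also differ strongly in score, which is the sense in which the embedding exhibits a score gradient.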

When plotting representative anticancer drug compounds on these chemical latent spaces, we observed more clustered distributions for each class of anticancer drugs in the space incorporating NP-likeness scores (Fig. 1d) compared to the space constructed with only structural information (Fig. 1c). In particular, many molecular-targeted drugs were found to be locally distributed. This localized distribution of existing molecular-targeted drugs in the space incorporating NP-likeness scores is likely explained by their lower NP-likeness scores compared to other pharmaceutical compounds (Supplementary Fig. S4). Thus, if a functional value for the drug of interest can be added as a functional indicator during the training of NP-VAE, a chemical latent space can be obtained in which the desired pharmaceutical candidate compounds are locally distributed and novel, functionally optimized structures are easier to explore.

Second, taking advantage of the constructed chemical latent space, we found that the natural compound Yessotoxin (Fig. 2a), which is included in the drug-and-natural-product dataset, was located near existing molecular-targeted drugs. Based on this observation, we hypothesized that Yessotoxin, isolated from the scallop species Patinopecten yessoensis, might function as a molecular-targeted drug. An assay confirmed that Yessotoxin exhibited weak EGFR inhibitory activity (Fig. 2b). This suggests that exploring the chemical latent space constructed by NP-VAE may also enable the discovery and annotation of unexpected compounds with similar pharmacological effects.
Interpolation in chemical latent space. For two existing drug compounds, we generated novel compound structures that lie between the two compounds by scanning the chemical latent space from one compound to the other. Starting from the latent variables of the starting and destination compounds, we explored equidistant points in the chemical latent space connecting the starting and destination points; the latent variable of each generated compound was derived from the following equation.
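The scan between two compounds can be illustrated with a short sketch. It assumes straight-line (linear) interpolation with evenly spaced interior points; the exact equation and symbols used in the paper are not reproduced here, and `interpolate_latents`, `z_start`, `z_end`, and `n_points` are hypothetical names.

```python
def interpolate_latents(z_start, z_end, n_points):
    """Return n_points equidistant latent vectors strictly between two endpoints.

    Assumption: the i-th point is z_start + i / (n_points + 1) * (z_end - z_start),
    so the endpoints themselves are excluded.
    """
    if n_points < 1:
        raise ValueError("n_points must be at least 1")
    if len(z_start) != len(z_end):
        raise ValueError("latent vectors must have the same dimension")
    points = []
    for i in range(1, n_points + 1):
        t = i / (n_points + 1)  # fraction of the way from start to end
        points.append([s + t * (e - s) for s, e in zip(z_start, z_end)])
    return points


# Usage: three interior points between two toy 2-D latent vectors.
path = interpolate_latents([0.0, 0.0], [4.0, 8.0], 3)
print(path)
```

Each returned vector would then be passed to the decoder to obtain a candidate structure. Other schemes, such as spherical interpolation, are also used with VAEs; the linear form is chosen here only because it matches the description of equidistant points.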
By inputting this latent variable into the decoder, new compound structures can be generated. Figure 3 shows the novel compound structures obtained by exploring the space between two existing drugs, with the starting compound being a biomolecule, a Nicotinamide adenine dinucleotide derivative, and the destination compound being the molecular-targeted drug Sorafenib. As shown in Fig. 3, the similarity to the starting compound gradually decreased as the scan moved away from it and approached the destination compound, while the similarity to the destination compound gradually increased. Moreover, the NP-likeness score of the generated compound structures gradually decreased as the scan progressed from a high to a low NP-likeness compound.