Augmenting large language models with chemistry tools

M. Bran, Andres; Cox, Sam; Schilter, Oliver; Baldassari, Carlo; White, Andrew D.; Schwaller, Philippe

doi:10.1038/s42256-024-00832-8


Article
Open access
Published: 08 May 2024


Nature Machine Intelligence volume 6, pages 525–535 (2024)


A preprint version of the article is available at arXiv.

Abstract

Large language models (LLMs) have shown strong performance in tasks across domains but struggle with chemistry-related problems. These models also lack access to external knowledge sources, limiting their usefulness in scientific applications. We introduce ChemCrow, an LLM chemistry agent designed to accomplish tasks across organic synthesis, drug discovery and materials design. By integrating 18 expert-designed tools and using GPT-4 as the LLM, ChemCrow augments the LLM performance in chemistry, and new capabilities emerge. Our agent autonomously planned and executed the syntheses of an insect repellent and three organocatalysts and guided the discovery of a novel chromophore. Our evaluation, including both LLM and expert assessments, demonstrates ChemCrow’s effectiveness in automating a diverse set of chemical tasks. Our work not only aids expert chemists and lowers barriers for non-experts but also fosters scientific advancement by bridging the gap between experimental and computational chemistry.


Main

In the last few years, large language models (LLMs)^1,2,3,4,5 have transformed various sectors by automating natural language tasks. A prime example of this is the introduction of GitHub Copilot in 2021⁶ and more recently StarCoder⁷, which provides proposed code completions based on the context of a file and open windows and increases developers’ productivity⁸. Most recent advances are based on the Transformer architecture⁹, introduced for neural machine translation and extended to various natural language processing tasks demonstrating remarkable few-shot and zero-shot performance². Nevertheless, it is crucial to recognize the limitations of LLMs, which often struggle with seemingly simple tasks like basic mathematics and chemistry operations^10,11. For instance, GPT-4 (ref. ¹²) and GPT-3.5 (ref. ¹³) cannot consistently and accurately multiply 12,345 × 98,765 or convert IUPAC names into the corresponding molecular graph¹⁴. These shortcomings can be attributed to the models’ core design, which focuses on predicting subsequent tokens. To address these limitations, one viable approach is to augment LLMs with dedicated external tools or plugins, such as a calculator for mathematical operations or OPSIN¹⁵ for IUPAC-to-structure conversion. These specialized tools provide exact answers, thereby compensating for the inherent deficiencies of LLMs in specific domains and enhancing their overall performance and applicability.
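The calculator mentioned above can be illustrated with a minimal sketch. The function name `calculator` and the choice to parse expressions with Python's `ast` module are assumptions made for this example; the article does not specify how such a tool is implemented. The point is only that a deterministic tool returns an exact product where token-by-token prediction may not.

```python
import ast
import operator

# Arithmetic operators this illustrative tool is allowed to evaluate.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str):
    """Evaluate a plain arithmetic expression exactly, without calling eval()."""

    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -evaluate(node.operand)
        raise ValueError(f"unsupported expression: {expression!r}")

    # Drop thousands separators so '12,345' parses as the integer 12345.
    tree = ast.parse(expression.replace(",", ""), mode="eval")
    return evaluate(tree.body)


print(calculator("12,345 * 98,765"))  # 1219253925
```

An agent would pass the expression to such a tool as text and receive the exact result back as an observation.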

Chemistry, as a field, has been impacted through expert-designed artificial intelligence (AI) systems that tackle specific problems, such as reaction prediction^{16,17,18,19,20}, retrosynthesis planning^{21,22,23,24,25,26,27}, molecular property prediction^{28,29,30,31,32}, de novo molecular generation^33,34, materials design^35,36 and, more recently, Bayesian optimization^37,38,39. Due to the nature of their training data, it has been shown that code-generating LLMs do possess some understanding of chemistry¹⁴, allowing them to adapt to observations, plan over multiple steps and respond correctly to intent in a chemical setting^{13,40,41,42,43,44}. Still, the automation levels achieved in chemistry remain relatively low compared to other domains, primarily due to its highly experimental nature, the lack of data and the limited scope and applicability of computational tools, even within their designated areas⁴⁵.

Integrating such tools tends to occur within isolated environments, such as RXN for Chemistry^{18,24,46,47,48} and AIZynthFinder^25,49,50, facilitated by corporate directives that promote integrability. Although most tools are developed by the open-source community or made accessible through application programming interfaces (APIs), their integration and interoperability pose considerable challenges for experimental chemists, mainly due to their lack of computational skill sets and the diversity of tools with steep learning curves, thereby preventing the full exploitation of their potential.

Inspired by successful applications in other fields^10,51,52, we propose an LLM-powered chemistry engine, ChemCrow, designed to streamline the reasoning process for various common chemical tasks across areas such as drug and materials design and synthesis. ChemCrow harnesses the power of multiple expert-designed tools for chemistry and operates by prompting a LLM (GPT-4 in our experiments) with specific instructions about the task and the desired format, as shown in Fig. 1a. The LLM is provided with a list of tool names, descriptions of their utility and details about the expected input/output. It is then instructed to answer a user-given prompt, using the tools provided when necessary. The model is guided to follow the Thought, Action, Action Input, Observation format⁴³, which requires it to reason about the current state of the task, consider its relevance to the final goal and plan the next steps accordingly, demonstrating its level of understanding. After the reasoning in the Thought step, the LLM requests a tool (preceded by the keyword ‘Action’) and the input for this tool (with the keyword ‘Action Input’). The text generation then pauses, and the program attempts to execute the requested function using the provided input. The result is returned to the LLM prepended by the keyword ‘Observation’, and the LLM proceeds to the Thought step again. It continues iteratively until the final answer is reached.
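The Thought, Action, Action Input, Observation loop described above can be sketched in a few lines. This is not ChemCrow's code: the function names, the regular expressions, the `Final Answer:` marker, the `name_to_smiles` tool and the scripted stand-in for the LLM are all assumptions introduced for illustration, so that the control flow can be run without access to a model.

```python
import re
from typing import Callable, Dict, List

# Matches one "Action" / "Action Input" pair in the model's reply.
ACTION_RE = re.compile(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)")
# Assumed marker that the model uses to end the loop.
FINAL_RE = re.compile(r"Final Answer:\s*(.+)", re.DOTALL)


def run_agent(llm: Callable[[str], str], tools: Dict[str, Callable[[str], str]],
              question: str, max_steps: int = 5) -> str:
    """Alternate between model text and tool calls until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)  # the model reasons and may request a tool
        transcript += reply + "\n"
        final = FINAL_RE.search(reply)
        if final:
            return final.group(1).strip()
        action = ACTION_RE.search(reply)
        if action is None:
            transcript += "Observation: could not parse an action\n"
            continue
        name, tool_input = action.group(1), action.group(2).strip()
        tool = tools.get(name)
        observation = tool(tool_input) if tool else f"unknown tool {name!r}"
        # The tool result is fed back to the model, prefixed by 'Observation'.
        transcript += f"Observation: {observation}\n"
    return "No final answer within the step limit."


def make_scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Stand-in for a real model: returns canned replies in order."""
    remaining = iter(replies)
    return lambda transcript: next(remaining)


# Hypothetical tool backed by a one-entry lookup table.
tools = {"name_to_smiles": lambda name: {"DEET": "CCN(CC)C(=O)c1cccc(C)c1"}.get(name, "not found")}

answer = run_agent(
    make_scripted_llm([
        "Thought: I need the structure.\nAction: name_to_smiles\nAction Input: DEET",
        "Thought: I have the structure.\nFinal Answer: CCN(CC)C(=O)c1cccc(C)c1",
    ]),
    tools,
    "What is the SMILES of DEET?",
)
print(answer)  # CCN(CC)C(=O)c1cccc(C)c1
```

Swapping the scripted function for a call to a real model leaves the loop unchanged, which is the property that makes it easy to add tools.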

Fig. 1: Overview and toolset.

This workflow, previously described in the ReAct⁴³ and MRKL⁵³ papers, effectively combines chain-of-thought reasoning with tools relevant to the tasks. As a result, and as will be shown in the following sections, the LLM transitions from a hyperconfident—although typically wrong—information source to a reasoning engine that is prompted to reflect on a task, act using a suitable tool to gather additional information, observe the tool’s responses and repeat this loop until the final answer is reached. Contemporaneously with this work, ref. ⁵⁴ describes a similar approach of augmenting an LLM with tools for accomplishing tasks in chemistry that are out of reach of GPT-4 alone. Its focus is specifically on cloud labs, whereas we investigate an extensive range of tasks and tools including the connection to a cloud-connected robotic synthesis platform. We implemented 18 tools, as shown in Fig. 1b and described in ‘Tools’, that endow ChemCrow not only with knowledge about molecular and reaction properties but also with the capacity to directly execute tasks in a physical lab. Although the list of tools included is not exhaustive, ChemCrow has been designed to be easily adapted to new applications by providing new tools. ChemCrow serves as an assistant to expert chemists while simultaneously lowering the entry barrier for non-experts by offering a simple interface to access accurate chemical knowledge. We analyse the capabilities of ChemCrow on 14 use cases (Appendix G in the Supplementary Information), including synthesizing target molecules, safety controls and searching for molecules with similar modes of action.

Results and discussion

Autonomous chemical synthesis

From user inputs such as ‘Plan and execute the synthesis of an insect repellent’ (Fig. 1a) and ‘Find a thiourea organocatalyst which accelerates the Diels-Alder reaction. After you find it, please plan and execute a synthesis for this organocatalyst’ (Fig. 2b), ChemCrow sequentially queried tools to find appropriate molecules, planned the syntheses and executed the syntheses on the cloud-connected, proprietary RoboRXN platform from IBM Research⁵⁵. Using RoboRXN, ChemCrow autonomously ran the syntheses of an insect repellent (DEET) and three known thiourea organocatalysts (Schreiner’s^56,57, Ricci’s⁵⁸ and Takemoto’s⁵⁹). The synthesized structures are shown in Fig. 2d and the detailed description of the tools in ‘Tools’. The four syntheses yielded the anticipated compounds successfully, demonstrating synthesis planning and execution-related LLM agent interactions with the physical world. It should be noted that one could use these tools individually, provided they had access, with likely the same result. ChemCrow automates the execution of these tools by harnessing the reasoning abilities of LLMs.

Fig. 2: Experimental validation.

Standardized synthesis procedures are key for successful execution. However, the predicted procedures⁴⁶ are not always directly executable on the RoboRXN platform; typical problems include ‘not enough solvent’ or ‘invalid purify action’. Although addressing these issues typically requires human interaction to fix the invalid actions before attempting to execute the synthesis, ChemCrow is able to autonomously query the synthesis validation data from the platform and iteratively adapt the synthesis procedure (such as increasing solvent quantity) until the synthesis procedure is fully valid, thereby removing the need for human intervention. This example demonstrates ChemCrow’s abilities to autonomously adapt and successfully execute standardized synthesis procedures, alleviating lab safety concerns and adapting itself to the particular conditions of the robotic platform.
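The validate-and-adapt cycle described above can be sketched as follows. Everything here is hypothetical: the RoboRXN interface is not shown in the article, so `mock_validate`, the `solvent_ml` field, the 5 mL threshold and the doubling rule are invented for illustration. In ChemCrow the LLM decides how to revise the procedure; a fixed rule stands in for that decision here.

```python
from typing import Callable, Dict, List

Procedure = Dict[str, float]


def mock_validate(procedure: Procedure) -> List[str]:
    """Hypothetical stand-in for the platform's validation call."""
    errors = []
    if procedure["solvent_ml"] < 5.0:  # threshold invented for the example
        errors.append("not enough solvent")
    return errors


def adapt_until_valid(procedure: Procedure,
                      validate: Callable[[Procedure], List[str]],
                      max_rounds: int = 10) -> Procedure:
    """Re-validate after each fix until the platform reports no errors."""
    current = dict(procedure)  # leave the caller's procedure untouched
    for _ in range(max_rounds):
        errors = validate(current)
        if not errors:
            return current
        for error in errors:
            if error == "not enough solvent":
                current["solvent_ml"] *= 2  # illustrative fix: double the solvent
            else:
                raise RuntimeError(f"no automatic fix for: {error}")
    raise RuntimeError("procedure still invalid after max_rounds")


fixed = adapt_until_valid({"solvent_ml": 1.0}, mock_validate)
print(fixed)  # {'solvent_ml': 8.0}
```

The round limit and the explicit failure for unknown errors keep the loop from running indefinitely when an issue has no automatic fix.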

Human–AI collaboration

Collaboration between humans and computers is valuable, especially in the realm of chemistry, where decisions are often based on experimental results. Here we demonstrate how such an interaction can lead to the discovery of a novel chromophore. For this example, ChemCrow was instructed to train a machine-learning model to help screen a library of candidate chromophores⁶⁰. As can be seen in Fig. 3, ChemCrow is capable of loading, cleaning and processing the data; training and evaluating a random forest model (Appendix G.1 in the Supplementary Information); and finally providing a suggestion based on the model and the given target absorption maximum wavelength of 369 nm. The proposed molecule (Fig. 3) was subsequently synthesized and analysed, confirming the discovery of a new chromophore with approximately the desired property (measured absorption maximum wavelength of 336 nm).

Fig. 3: Human–model interaction leading to the discovery of a new chromophore.

Evaluation across diverse chemical use cases

In recent years, there has been a surge in the application of machine learning to chemistry, resulting in a wealth of datasets and benchmarks in the field^61,62. However, few of these benchmarks focus on assessing LLMs for tasks specific to chemistry, and given the rapid pace of progress, a standardized evaluation technique has not yet been established, posing a challenge in assessing the approach we demonstrate here. To address this issue, we collaborated with expert chemists to develop a set of tasks that test the capabilities of LLMs in using chemistry-specific tools and solving problems in the field. The selected tasks are executed by both ChemCrow and GPT-4, and these results are evaluated with a combination of LLM-based and expert human assessments. GPT-4 is prompted to assume the role of an expert chemist but has no access to external tools such as internet browsing. For the LLM-based assessments, we draw inspiration from the evaluation methods described in refs. ^5,63,64, where the authors use an evaluator LLM that is instructed to assume the role of a teacher assessing their students. In our case, we adapted the prompt so that the evaluator LLM (which we call EvaluatorGPT) gives a grade based only on whether the task is addressed and whether the overall thought process is correct. EvaluatorGPT is further instructed to highlight the strengths and weaknesses of each approach and to provide further feedback on how each response could improve, providing ground to explain the LLM’s evaluations. Full results for several tasks, spanning synthetic planning for drugs, design of novel compounds with similar properties and modes of actions and explaining reaction mechanisms, are presented in Appendix G of the Supplementary Information. The full examples are also available at https://github.com/ur-whitelab/chemcrow-runs.

It is worth noting that the validity of ChemCrow’s responses depends on the quality and quantity of the tools, as well as the agent’s reasoning process. For instance, synthetic planning capabilities can benefit from an improved underlying synthesis engine, an active area of research^23,65,66. Even then, any tool becomes useless if the reasoning behind its usage is flawed or if garbage inputs are given. Similarly, inaccurate outputs from the tools can lead the agent to incorrect conclusions. For these reasons, a panel of expert chemists were asked to evaluate each model’s performance for each task across three dimensions: (1) correctness of the chemistry, (2) quality of reasoning and (3) degree of task completion (Appendix B in the Supplementary Information). As shown in Fig. 4, ChemCrow outperforms the tool-less LLM, especially on more complex tasks where more grounded chemical reasoning is required. Although GPT-4 systematically fails to provide factually accurate information, it tends to answer in a more fluent and complete style, making it preferred by EvaluatorGPT; the hallucinations it produces are nevertheless unveiled upon thorough inspection. Both systems perform similarly in ‘quality of reasoning’, an expected outcome given ChemCrow’s by-design reliance on GPT-4 for reasoning. As shown in Fig. 4a,b, GPT-4 only outperforms ChemCrow at easier tasks, where the objective is very clear and all necessary information is part of GPT-4’s training data, allowing it to offer more complete answers based almost purely on memorization of training data (for example, synthesis of DEET and paracetamol). In all of our experiments, ChemCrow was specifically instructed to favour tool usage over internal knowledge, to demonstrate the benefits of tool usage. Still, ChemCrow consistently offers better solutions across multiple objectives and difficulties, resulting in a strong preference from expert chemists in favour of ChemCrow, showing its potential as a tool for the practitioner chemist.

Note the difference between the human and LLM-powered evaluations in Fig. 4. Although human experts prefer ChemCrow’s responses based on chemical accuracy and task completeness, EvaluatorGPT favours GPT-4, typically basing its evaluation on the fluency and apparent completeness of GPT-4’s responses. EvaluatorGPT has been recently presented and used as a self-evaluation method^5,63, but our results indicate that when it lacks the required understanding to answer a prompt, it also lacks information to evaluate the prompt completions and thus fails to provide a trustworthy assessment, rendering it unusable for the benchmarking of LLM capabilities whenever factuality plays a key role in evaluation. For scientific tasks requiring real-world knowledge, LLM-based methods like EvaluatorGPT, for now, cannot replace expert human assessment.

Risk-mitigation strategies

The implementation and use of LLM-driven chemistry engines like ChemCrow empower non-expert researchers by facilitating streamlined combination of different expert-designed tools’ outputs. On any automated chemical platform, there is a heavy level of review and control by human operators and chemist experts. Nevertheless, it is crucial to ensure responsible development and use of LLM agents^67,68,69.

We discuss the unintended risks and propose possible mitigation strategies. These can be achieved through foresight and safeguards, while still promoting open and transparent science to enable broad oversight and feedback from the research community.

Unintended risks

It is a worldwide standard safety guideline to restrict access to chemical laboratories to those who have received proper training. Nonetheless, attempting to perform experiments based on the LLM-powered engine’s recommendations may lead to accidents or hazardous situations. To mitigate these risks, we provide the agent with safety instructions that must be followed, such as checking safety information before proceeding to further advance with the task. As shown in Fig. 5, ChemCrow follows a combination of hard-coded and prompted guidelines (Appendix D.2 in the Supplementary Information) to ensure safety. If the proposed reaction is deemed dangerous, execution stops. Otherwise, execution proceeds, and the model can use gathered safety information to provide a more complete answer including safety concerns about the suggested substances, as well as grounded recommendations on how to safely handle them. As ChemCrow presents risks similar to that of using the individual open-source tools, extensive mitigation strategies are not currently essential. Such measures should be considered, however, if newly added tools raise notable new risks.
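The stop-or-proceed behaviour described above can be sketched as a gate placed in front of execution. This is an illustration only, not the guidelines of Appendix D.2: the function `run_with_safety_gate`, the `SafetyReport` type, the blocked list and the hazard notes are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, FrozenSet, List


@dataclass
class SafetyReport:
    executed: bool
    notes: List[str] = field(default_factory=list)


def run_with_safety_gate(
    reagents: List[str],
    execute: Callable[[List[str]], None],
    blocked: FrozenSet[str],
    hazard_notes: Dict[str, str],
) -> SafetyReport:
    """Stop before execution if any reagent is blocked; otherwise run and report hazards."""
    flagged = [r for r in reagents if r in blocked]
    if flagged:
        # Hard-coded rule: execute() is never reached for a flagged reaction.
        return SafetyReport(False, [f"stopped: {r} is on the blocked list" for r in flagged])
    execute(reagents)
    # Safety information gathered along the way is attached to the answer.
    return SafetyReport(True, [f"{r}: {hazard_notes[r]}" for r in reagents if r in hazard_notes])


# Placeholder identifiers and notes, invented for illustration only.
BLOCKED = frozenset({"reagent-X"})
NOTES = {"reagent-A": "example handling note"}
executed_runs: List[List[str]] = []

ok = run_with_safety_gate(["reagent-A", "reagent-B"], executed_runs.append, BLOCKED, NOTES)
stopped = run_with_safety_gate(["reagent-A", "reagent-X"], executed_runs.append, BLOCKED, NOTES)
print(ok.executed, stopped.executed)  # True False
```

Placing the check in code rather than only in the prompt means a flagged reaction cannot be executed even if the model's reasoning goes wrong.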

Fig. 5: Safety guidelines provided by ChemCrow.

Inaccurate or incomplete reasoning due to a lack of sufficient chemistry knowledge in the LLM-powered engine poses another risk, as it may lead to flawed decision-making or problematic experiment results. One of the key points of this Article is that the integration of expert-designed tools can help mitigate the hallucination issues commonly associated with these models, thus reducing the risk of inaccuracy. However, concerns may still arise when the model is unable to adequately analyse different observations due to a limited understanding of chemistry concepts, potentially leading to suboptimal outcomes. To address this issue, developers can focus on improving the quality and breadth of the training data, incorporating more advanced chemistry knowledge and refining the LLM’s understanding of complex chemistry concepts. Additionally, a built-in validation or peer-review system, analogous to the reinforcement learning from human feedback implemented for GPT-3.5 (refs. ^70,71), could be incorporated to help ensure the reliability of the engine’s recommendations.

Encouraging users to critically evaluate the information provided by the LLM-powered engine and cross-reference it with established literature and expert opinions can further mitigate the risk of relying on flawed reasoning⁷². By combining these approaches, developers can work towards minimizing the impact of insufficient chemistry knowledge on the engine’s reasoning process and enhancing the overall effectiveness of LLM-powered chemistry engines⁷³ like ChemCrow.

Addressing intellectual property issues is crucial for the responsible development and use of generative AI models⁷⁴ like ChemCrow. Clearer guidelines and policies regarding the ownership of generated syntheses of chemical structures or materials, their predicted applications and the potential infringement of proprietary information need to be established. Collaboration with legal experts, as well as industry stakeholders, can help in navigating these complex issues and implementing appropriate measures to protect intellectual property.

In summary, it is crucial to carefully consider and address the potential drawbacks associated with LLM-powered chemistry engines such as ChemCrow, to ensure their safe and responsible application. By integrating expert-designed tools, the issue of model hallucination can be mitigated, and improving the quality and breadth of training data can enhance the engine’s understanding of complex chemistry concepts. Implementing effective mitigation strategies, such as access controls, safety guidelines and ethical policies, further contributes to minimizing risks and maximizing the positive impact of these engines on the field of chemistry. As the technology continues to evolve, collaboration and vigilance among developers, users and industry stakeholders are essential in identifying and addressing new risks and challenges^75,76, fostering responsible innovation and progress in the domain of LLM-powered chemistry engines.

Conclusion

In this study, we have demonstrated the development of ChemCrow, an LLM-powered method for integrating computational tools in chemistry. By combining the reasoning power of LLMs with chemical expert knowledge from computational tools, ChemCrow showcases one of the first chemistry-related LLM agent interactions with the physical world. ChemCrow has successfully planned and synthesized an insect repellent and three organocatalysts and guided the screening and synthesis of a chromophore with target properties. Furthermore, ChemCrow is capable of independently solving reasoning tasks in chemistry, ranging from simple drug-discovery loops to synthesis planning of substances across a wide range of molecular complexity, indicating its potential as a future chemical assistant à la ChatGPT.

Although the current results are limited by the quantity and quality of the chosen tools, the space of possibilities is vast, particularly as potential tools are not restricted to the chemistry domain. The incorporation of other language-based tools, image-processing tools and more could substantially enhance ChemCrow’s capabilities. Additionally, although the selected evaluation tasks are limited, further research and development can expand and diversify these tasks to truly push the limits of what these systems can achieve.

Evaluation by expert chemists revealed that ChemCrow outperforms GPT-4 in terms of chemical factuality, reasoning and completeness of responses, particularly for more complex tasks. Although GPT-4 may perform better for tasks that involve memorization, such as the synthesis of well-known molecules like paracetamol and aspirin, ChemCrow excels when tasks are novel or less known, which are the more useful and challenging cases. In contrast, LLM-powered evaluation tends to favour GPT-4, primarily due to the more fluent and complete-looking nature of its responses. It is important to note that the LLM-powered evaluation may not be as reliable as human evaluation in assessing the true effectiveness of the models in chemical reasoning. This discrepancy highlights the need for further refining evaluation methods to better capture the unique capabilities of systems like ChemCrow in solving complex, real-world chemistry problems.

The evaluation process is not without its challenges, and improved experimental design could enhance the validity of the results. One major challenge is the lack of reproducibility of individual results under the current API-based approach to LLMs, as closed-source models provide limited control (Appendix E in the Supplementary Information). Recent open-source models^77,78,79 offer a potential solution to this issue, albeit with a possible trade-off in reasoning power. Additionally, implicit bias in task selection and the inherent limitations of testing chemical logic behind task solutions on a large scale present difficulties for evaluating ML systems. Despite these challenges, our results demonstrate the promising capabilities and potential of systems like ChemCrow to serve as valuable assistants in chemical laboratories and to address chemical tasks across diverse domains.

Methods

LLMs

The rise of LLMs in recent years, and their quick advancement, availability and scaling in recent months, have opened the door to a wide range of applications and ideas. LLMs become more powerful still when used within frameworks designed to exploit their zero-shot reasoning capabilities, as demonstrated by architectures like ReAct⁴³ and MRKL⁵³. These architectures combine the demonstrated success of chain-of-thought⁴¹ reasoning with LLMs’ use of tools¹⁰. For our experiments, we used OpenAI’s GPT-4 (ref. ¹²) with a temperature of 0.1.

LLMs application framework, LangChain

LangChain⁸⁰ is a comprehensive framework designed to facilitate the development of language model applications by providing support for various modules, including access to various LLMs, prompts, document loaders, chains, indexes, agents, memory and chat functionality. With these modules, LangChain enables users to create various applications such as chatbots, question-answering systems, summarization tools and data-augmented generation systems. LangChain not only offers standard interfaces for these modules but also assists in integrating with external tools, experimenting with different prompts and models and evaluating the performance of generative models. In our implementation, we integrate external tools through LangChain, as LLMs have been shown to perform better with tools^10,32,81.

Tools

Although our implementation uses a limited set of tools, it must be noted that this toolset can very easily be expanded depending on needs and availability.

The tools used can be classified into general tools, molecular tools and chemical reaction tools.

General tools

WebSearch

The web search tool gives the language model access to relevant information from the web. Using SerpAPI⁸², the tool queries search engines and compiles a selection of impressions from the first page of Google search results. This allows the model to collect current and relevant information across a broad range of scientific topics. A distinctive feature of this tool is that it can serve as a starting point when the model encounters a query it cannot tackle or is unsure which tool to apply. Integrating this tool enables the language model to efficiently expand its knowledge base, streamline the process of addressing common scientific challenges and verify the accuracy and reliability of the information it offers. By default, the agent prefers LitSearch over the WebSearch tool.

LitSearch

The literature-search tool focuses on extracting relevant information from scientific documents such as PDFs or text files (including raw HTML) to provide accurate and well-grounded answers to questions. This tool utilizes the paper-qa Python package (https://github.com/whitead/paper-qa). By leveraging OpenAI Embeddings⁸³ and FAISS⁸⁴, a vector database, the tool embeds and searches through documents efficiently. A language model then aids in generating answers based on these embedded vectors.

The literature-search process involves embedding documents and queries into vectors and searching for the top k passages in the documents. Once these relevant passages have been identified, the tool creates a summary of each passage in relation to the query. These summaries are then incorporated into the prompt, allowing the language model to generate an informed answer. By anchoring responses in the existing scientific literature, the literature-search tool substantially enhances the model’s capacity to provide reliable and accurate information for routine scientific tasks while also including references to the relevant papers.
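
The retrieval step described above can be sketched as a brute-force nearest-neighbour search over embedding vectors. This is only a small stand-in written for illustration: the three-dimensional vectors, the passage labels and the function names are invented here, and it is not the paper-qa code, which relies on learned embeddings and a FAISS index to do the same job at scale.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_passages(query_vec: Sequence[float],
                   passage_vecs: Dict[str, Sequence[float]],
                   k: int = 2) -> List[Tuple[str, float]]:
    """Score every passage against the query and keep the k best matches."""
    scored = [(pid, cosine(query_vec, vec)) for pid, vec in passage_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Toy "embeddings"; real ones have hundreds or thousands of dimensions.
passages = {
    "thiourea catalysis": [0.9, 0.1, 0.0],
    "insect repellents": [0.1, 0.9, 0.1],
    "perovskite solar cells": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.0]
hits = top_k_passages(query, passages, k=2)
```

In the workflow described above, each retrieved passage would then be summarized with respect to the query and the summaries placed in the prompt.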

Python REPL

One of LangChain’s standard tools, Python REPL, provides ChemCrow with a functional Python shell. This tool enables the LLM to write and run Python code directly, making it easier to accomplish a wide range of complex tasks. These tasks can range from performing numerical computations to training AI models and performing data analysis.
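
To make the idea of a code-execution tool concrete, the following sketch runs a code string and returns whatever it printed, or the error message if it failed. It is an assumption-laden illustration rather than LangChain's implementation: `run_python` is a name chosen here, and the sketch applies no sandboxing, so it should only ever be given trusted code.

```python
import contextlib
import io


def run_python(code: str) -> str:
    """Execute a code string and return its printed output or the error text."""
    buffer = io.StringIO()
    namespace: dict = {}
    try:
        with contextlib.redirect_stdout(buffer):
            exec(code, namespace)  # no sandboxing: trusted input only
    except Exception as exc:
        # Returning the error as text lets a calling agent read and react to it.
        return f"{type(exc).__name__}: {exc}"
    return buffer.getvalue()


output = run_python(
    "masses = [12.011, 15.999, 15.999]\n"
    "print(round(sum(masses), 3))"
)
```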

Human

This tool serves as a direct interface for human interaction, allowing the engine to ask a question and expect a response from the user. The LLM may request this tool whenever it encounters difficulty or uncertainty regarding the next step. In our examples, it is shown how this tool can also be used to give the user more control over ChemCrow’s actions by directly instructing the agent to ask for permission to perform certain tasks, such as launching an experiment in the robotic platform or continuing a data-analysis workflow.
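
The permission pattern described above can be sketched as a small gate placed in front of any consequential action. The function names and the prompt wording are hypothetical; the `ask` parameter defaults to Python's built-in `input`, and is replaced by a lambda below only so the example runs without a person at the keyboard.

```python
from typing import Callable


def ask_permission(action: str, ask: Callable[[str], str] = input) -> bool:
    """Ask the user to approve an action; anything but y/yes counts as no."""
    reply = ask(f"The agent wants to: {action}. Proceed? [y/N] ")
    return reply.strip().lower() in {"y", "yes"}


def launch_if_approved(action: str,
                       launch: Callable[[], str],
                       ask: Callable[[str], str] = input) -> str:
    """Run `launch` only after explicit user approval."""
    if not ask_permission(action, ask):
        return "Cancelled by user."
    return launch()


# Scripted answers stand in for a human user in this example.
approved = launch_if_approved("launch the synthesis on the robot",
                              lambda: "Launched.", ask=lambda prompt: "y")
declined = launch_if_approved("launch the synthesis on the robot",
                              lambda: "Launched.", ask=lambda prompt: "")
```

Defaulting to refusal when the reply is empty or unclear is a deliberate choice in this sketch: an action with physical consequences should require an explicit yes.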

Molecule tools

Name2SMILES

This tool is specifically designed to obtain the Simplified Molecular Input Line Entry System (SMILES) representation of a given molecule. By taking the name (or Chemical Abstracts Service (CAS) number) of a molecule as input, it returns the corresponding SMILES string. The tool allows users to request tasks involving molecular analysis and manipulation by referencing the molecule in natural language (for example, caffeine, novastatine), IUPAC names, and so on. Our implementation queries chem-space⁸⁵ as a primary source and upon failure queries PubChem⁸⁶ and the IUPAC to SMILES converter OPSIN¹⁵ as a last option.
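
The primary-source-then-fallback behaviour described above can be sketched as a chain of resolvers tried in order. The three lookup functions below are hypothetical stand-ins that make no network requests; they do not reproduce the chem-space, PubChem or OPSIN interfaces, and only the ordering logic is the point of the example.

```python
from typing import Callable, Iterable, Optional

Resolver = Callable[[str], Optional[str]]


def name_to_smiles(name: str, resolvers: Iterable[Resolver]) -> str:
    """Try each resolver in order and return the first SMILES found."""
    for resolve in resolvers:
        try:
            smiles = resolve(name)
        except Exception:
            smiles = None  # treat a failed lookup like a miss and move on
        if smiles:
            return smiles
    raise LookupError(f"no SMILES found for {name!r}")


# Hypothetical stand-ins for the three sources.
def primary_lookup(name: str) -> Optional[str]:
    raise ConnectionError("service unavailable")


def secondary_lookup(name: str) -> Optional[str]:
    known = {"caffeine": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"}
    return known.get(name.lower())


def last_resort(name: str) -> Optional[str]:
    return None


smiles = name_to_smiles("Caffeine", [primary_lookup, secondary_lookup, last_resort])
```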

SMILES2Price

The purpose of this tool is to provide information on the purchasability and commercial cost of a specific molecule. By taking a molecule as input, it first utilizes molbloom⁸⁷ to check whether the molecule is available for purchase (in ZINC20 (ref. ⁸⁸)). Then, using the chem-space API⁸⁵, it returns the cheapest price available on the market, enabling the LLM to make informed decisions about the affordability and availability of the queried molecule towards the resolution of a given task.

Name2CAS

The tool is designed to determine the CAS number of a given molecule using various types of input references such as common names, IUPAC names or SMILES strings by querying the PubChem⁸⁶ database. The CAS number serves as a precise and universally recognized chemical identifier, enabling researchers to access relevant data and resources with ease and ensuring that they obtain accurate and consistent information about the target molecule⁸⁹.

Similarity

The primary function of this tool is to evaluate the similarity between two molecules, utilizing the Tanimoto similarity measure⁹⁰ based on the ECFP2 molecular fingerprints⁹¹ of the input molecules. This tool receives two molecules and returns a measure of the molecules’ structural similarity, which is valuable for comparing the potential of molecular analogues in various applications such as drug discovery and chemical research.
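
For two fingerprints viewed as sets of "on" bits A and B, the Tanimoto similarity is |A ∩ B| / (|A| + |B| − |A ∩ B|). The sketch below computes this for two made-up bit sets; generating real ECFP fingerprints requires a cheminformatics toolkit and is outside the scope of this example, and the bit indices shown are arbitrary.

```python
from typing import Set


def tanimoto(bits_a: Set[int], bits_b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Arbitrary bit indices standing in for two molecular fingerprints.
fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 7, 12, 20}
score = tanimoto(fp_a, fp_b)  # 3 shared bits out of 6 distinct bits
```

The score ranges from 0 (no shared bits) to 1 (identical fingerprints).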

ModifyMol

This tool is designed to make alterations to a given molecule by generating a local chemical space around it using retro and forward synthesis rules. It employs the SynSpace package⁹², originally applied in counterfactual explanations for molecular machine learning⁹³. The modification process utilizes 50 robust medicinal chemistry reactions⁹⁴, and the retrosynthesis is performed either via PostEra Manifold^18,95 (upon availability of an API key) or by reversing the 50 robust reactions. The purchasable building blocks come from the Purchasable Mcule supplier building block catalogues⁹⁶, although customization options are available. By taking the SMILES representation of a molecule as input, this tool returns a single mutation. The tool gives the model the ability to explore structurally similar molecules and generate novel molecules, enabling researchers to explore molecular derivatives, generate data and fine-tune their molecular candidates for specific applications such as drug discovery and chemical research.

PatentCheck

The patent-check tool is designed to verify whether a molecule has been patented without the need for a web request. It utilizes molbloom⁸⁷, a C library, to check strings against a bloom filter, making it an efficient tool to assess compounds against known databases. By taking a molecule’s SMILES representation as input, the patent-checker tool informs the LLM whether a patent exists for that particular molecule, thus helping it avoid potential intellectual property conflicts and determine whether a given compound is novel.
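
A bloom filter answers membership queries from a compact bit array: it can report false positives but never false negatives, which is why no web request is needed at query time. The class below is a minimal teaching sketch, not molbloom's implementation; the array size, the number of hash functions and the example SMILES strings are arbitrary choices and say nothing about the patent status of any compound.

```python
import hashlib
from typing import Iterable, Iterator


class BloomFilter:
    """Minimal bloom filter over strings (illustrative parameters)."""

    def __init__(self, n_bits: int = 1024, n_hashes: int = 3) -> None:
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8 + 1)

    def _positions(self, item: str) -> Iterator[int]:
        # Derive several bit positions by salting one hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def update(self, items: Iterable[str]) -> None:
        for item in items:
            self.add(item)

    def __contains__(self, item: str) -> bool:
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


known = BloomFilter()
known.update(["CCO", "c1ccccc1"])  # placeholder entries for the example
found = "CCO" in known
```

Because the filter compares strings, queries have to be written in the same canonical SMILES form that was used when the filter was built.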

FuncGroups

This tool is designed to identify functional groups within a given molecule by analysing a list of named SMILES Arbitrary Target Specification (SMARTS) patterns. By taking the SMILES representation of a single molecule as input, the functional-group finder searches for matches between the molecule’s structure and the predefined SMARTS patterns representing various functional groups.

Upon identifying these matches, the tool returns a list of functional groups present in the molecule. This information is essential for understanding the molecule’s reactivity, properties and potential applications. By providing a comprehensive overview of a molecule’s functional groups, the LLM can make informed decisions when designing experiments, synthesizing compounds or exploring new molecular candidates.

SMILES2Weight

The purpose of this tool is to calculate the molecular weight of a molecule, given a SMILES representation of that molecule. This tool utilizes RDKit⁹⁷ to get the exact molecular weight from a SMILES string.

Safety tools

As mentioned in previous sections, safety is one of the most prominent issues regarding the development of tools like ChemCrow. Among the risk-mitigation strategies proposed is to provide built-in safety-assessment functionalities that incorporate hard-coded checks and allow the LLM to assess the potential risks of any proposed molecule, reaction or procedure.

ControlledChemicalCheck

Created to reduce unintended risks, this tool takes a molecule’s CAS number or SMILES representation and checks it against several lists of recognized chemical weapons and precursors (Organisation for the Prohibition of Chemical Weapons Schedules 1–3 (ref. ⁹⁸) and The Australia Group’s Export Control List: Chemical Weapons Precursors⁹⁹). If the input molecule is not in any of these lists, the maximum similarity (using the MolSimilarity tool) between it and the molecules from the database is calculated, and a warning is given if this similarity is greater than 0.35. This tool is automatically invoked when a request is made for a synthesis method or execution for a given molecule. If the molecule is found on these lists, indicating that it could be a chemical weapon or a precursor, the agent immediately stops execution. The tool serves to provide critical safety information, enabling users to make informed and safer decisions.
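
The decision logic of this check (exact match stops execution, high similarity triggers a warning) can be sketched as follows. Only the 0.35 threshold comes from the description above; the identifiers, the return labels and the lookup-table similarity function are placeholders invented for the example, and no real controlled substances are listed.

```python
from typing import Callable, Iterable


def controlled_chemical_check(query: str,
                              controlled: Iterable[str],
                              similarity: Callable[[str, str], float],
                              threshold: float = 0.35) -> str:
    """Return 'stop' on an exact match, 'warn' if too similar, else 'ok'."""
    listed = list(controlled)
    if query in listed:
        return "stop"
    # Compare against every listed entry and keep the highest similarity.
    max_sim = max((similarity(query, ref) for ref in listed), default=0.0)
    return "warn" if max_sim > threshold else "ok"


# Placeholder identifiers and similarity scores, for illustration only.
toy_scores = {("mol_x", "listed_1"): 0.42, ("mol_y", "listed_1"): 0.10}


def toy_similarity(a: str, b: str) -> float:
    return toy_scores.get((a, b), 0.0)


results = [controlled_chemical_check(q, ["listed_1"], toy_similarity)
           for q in ("listed_1", "mol_x", "mol_y")]
```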

ExplosiveCheck

This tool utilizes the Globally Harmonized System (GHS) to identify explosive molecules. It queries the PubChem database using molecular identifiers like common name, IUPAC name or CAS number to determine whether a molecule’s GHS rating is ‘Explosive’. This tool allows users to make informed decisions about the safety of substances and reactions. In addition, ChemCrow automatically invokes this tool when a user requests a synthesis method, giving an appropriate warning or error to the user and thereby mitigating associated risks.

SafetySummary

This tool provides a general safety overview for any given molecule. It produces a safety summary by querying data from the PubChem database⁸⁶ and uses an LLM summarizer to highlight four central aspects: operational safety (potential risks for the operator: that is, health concerns of handling the given substance), GHS information (general hazards and recommendations to handle the substance), environmental risks and societal impact (whether the substance is a known controlled chemical). Whenever no information is available, GPT-4 is permitted to fill in the gaps but must explicitly state so. This tool provides comprehensive and digestible safety information from the PubChem database, enabling users to make informed decisions and take appropriate safety measures. Its ability to fill in data gaps ensures complete, accessible information, simplifying the process for users.

Chemical reaction tools

NameRXN

This tool, powered by the proprietary software NameRxn from NextMove Software¹⁰⁰, is designed to identify and classify a given chemical reaction based on its internal database of several hundred named reactions. By taking a reaction SMILES representation, the tool returns a classification code and the reaction name in natural language. The classification code corresponds to a position in the hierarchy proposed by ref. ¹⁰¹. This information is essential for understanding reaction mechanisms, selecting appropriate catalysts and optimizing experimental conditions.

ReactionPredict

The reaction prediction tool leverages the RXN4Chemistry API from IBM Research⁴⁸, which uses a transformer model tailored for predicting chemical reactions and retrosynthesis paths, based on the Molecular Transformer^18,24, and provides highly accurate predictions. The tool takes a set of reactants as input and returns the predicted product, giving the LLM accurate chemical information that typically cannot be obtained by a simple database query and instead requires the kind of abstract reasoning chemists are trained to perform. Although the API is free to use, registration is required.

ReactionPlanner

This powerful tool also employs the RXN4Chemistry API from IBM Research^18,24,48, utilizing the same Transformer approach for translation tasks as the reaction prediction tool but adding search algorithms to handle multistep synthesis and an action prediction algorithm that converts a reaction sequence into actionable steps in machine-readable format, including conditions, additives and solvents⁴⁶. To interface with ChemCrow, we added an LLM processing step that converts these machine-readable actions into natural language. The molecular synthesis planner is designed to assist the LLM in planning a synthetic route to prepare a desired target molecule. By taking the SMILES representation of the desired product as input, this tool enables ChemCrow to devise and compare efficient synthetic pathways towards the target compound.

ReactionExecute

This tool allows ChemCrow to interact directly with the physical world through a robotic chemistry lab platform. Also based on the RXN4Chemistry API, the tool allows the agent to plan, adapt and execute the synthesis of a given molecule. Internally, the tool requests a synthesis plan (using the RXNPlanner tool), obtains the action sequence to be executed on the robot and uses an LLM-powered loop to adapt the action sequence in response to errors and warnings. Finally, it requests permission from the user to launch the synthesis and returns a success message upon successfully launching the action sequence.
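
The control flow described above (validate the action sequence, repair it while problems remain, then ask before launching) might look roughly like the sketch below. Every callable is a hypothetical placeholder: `validate` stands in for the platform's error and warning checks, `repair` for the LLM-powered correction step, and `launch` for the call that starts the robot; none of them reflects the RXN4Chemistry API.

```python
from typing import Callable, List


def execute_synthesis(plan: List[str],
                      validate: Callable[[List[str]], List[str]],
                      repair: Callable[[List[str], List[str]], List[str]],
                      confirm: Callable[[List[str]], bool],
                      launch: Callable[[List[str]], None],
                      max_rounds: int = 3) -> str:
    """Repair the plan until it validates, then launch it if the user agrees."""
    for _round in range(max_rounds):
        issues = validate(plan)
        if not issues:
            break
        plan = repair(plan, issues)
    else:
        # The loop ran out of rounds without ever seeing a clean validation.
        return "Could not produce a valid action sequence."
    if not confirm(plan):
        return "Launch declined by user."
    launch(plan)
    return "Synthesis launched."


# Toy stand-ins so the sketch runs end to end.
def toy_validate(steps: List[str]) -> List[str]:
    return [] if "add solvent" in steps else ["missing solvent"]


def toy_repair(steps: List[str], issues: List[str]) -> List[str]:
    return ["add solvent"] + steps


launched: List[List[str]] = []
status = execute_synthesis(["add reactants", "stir"], toy_validate, toy_repair,
                           confirm=lambda steps: True, launch=launched.append)
```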

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

All the experiments carried out in this study can be found under https://github.com/ur-whitelab/chemcrow-runs (ref. ¹⁰²). Source data are provided with this paper.

Code availability

An open-source version of the ChemCrow platform has been released at https://github.com/ur-whitelab/chemcrow-public (ref. ¹⁰³), which includes the main agent setup and a subset of 12 tools used in the original implementation. Access to the proprietary GPT-4 API can be obtained through OpenAI.

References

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. Bert: pre-training of deep bidirectional transformers for language understanding. In Proc. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (eds Burstein, J. et al.) 4171–4186 (Association for Computational Linguistics, 2019).
Brown, T. et al. Language models are few-shot learners. Adv. Neural Inf. Process. Syst. 33, 1877–1901 (2020).
Bommasani, R. et al. On the opportunities and risks of foundation models. Preprint at https://arxiv.org/abs/2108.07258 (2021).
Chowdhery, A. et al. Palm: scaling language modeling with pathways. J. Mach. Learn. Res. 24, 1–113 (2023).
Bubeck, S. et al. Sparks of artificial general intelligence: early experiments with gpt-4. Preprint at https://arxiv.org/abs/2303.12712 (2023).
Github Copilot. GitHub https://copilot.github.com (2023).
Li, R. et al. Starcoder: may the source be with you! Trans. Mach. Learn. Res. https://openreview.net/pdf?id=KoFOg41haE (2023).
Ziegler, A. et al. Productivity assessment of neural code completion. In Proc. 6th ACM SIGPLAN International Symposium on Machine Programming (eds Chaudhuri, S. and Sutton, C.) 21–29 (ACM, 2022).
Vaswani, A. et al. Attention is all you need. In Proc. Advances in Neural Information Processing Systems 30 (eds. Guyon, I. et al.) 5999–6009 (Curran Associates, 2017).
Schick, T. et al. Toolformer: language models can teach themselves to use tools. In Proc. Advances in Neural Information Processing Systems 36 (eds. Oh, A. et al.) 68539–68551 (Curran Associates, 2023).
Castro Nascimento, C. M. & Pimentel, A. S. Do large language models understand chemistry? A conversation with ChatGPT. J. Chem. Inf. Model. 63, 1649–1655 (2023).
OpenAI. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2023).
Ouyang, L. et al. Training language models to follow instructions with human feedback. Adv. Neural Inf. Process. Syst. 35, 27730–27744 (2022).
White, A. D. et al. Assessment of chemistry knowledge in large language models that generate code. Digit. Discov. 2, 368–376 (2023).
Lowe, D. M., Corbett, P. T., Murray-Rust, P. & Glen, R. C. Chemical name to structure: Opsin, an open source solution. J. Chem. Inf. Model. 51, 739–753 (2011).
Coley, C. W., Barzilay, R., Jaakkola, T. S., Green, W. H. & Jensen, K. F. Prediction of organic reaction outcomes using machine learning. ACS Cent. Sci. 3, 434–443 (2017).
Coley, C. W. et al. A graph-convolutional neural network model for the prediction of chemical reactivity. Chem. Sci. 10, 370–377 (2019).
Schwaller, P. et al. Molecular transformer: a model for uncertainty-calibrated chemical reaction prediction. ACS Cent. Sci. 5, 1572–1583 (2019).
Pesciullesi, G., Schwaller, P., Laino, T. & Reymond, J.-L. Transfer learning enables the molecular transformer to predict regio-and stereoselective reactions on carbohydrates. Nat. Commun. 11, 4874 (2020).
Irwin, R., Dimitriadis, S., He, J. & Bjerrum, E. J. Chemformer: a pre-trained transformer for computational chemistry. Mach. Learn. Sci. Technol. 3, 015022 (2022).
Szymkuc, S. et al. Computer-assisted synthetic planning: the end of the beginning. Angew. Chem. Int. Ed. Engl. 55, 5904–5937 (2016).
Segler, M. H., Preuss, M. & Waller, M. P. Planning chemical syntheses with deep neural networks and symbolic AI. Nature 555, 604–610 (2018).
Coley, C. W. et al. A robotic platform for flow synthesis of organic compounds informed by AI planning. Science 365 (2019).
Schwaller, P. et al. Predicting retrosynthetic pathways using transformer-based models and a hyper-graph exploration strategy. Chem. Sci. 11, 3316–3325 (2020).
Genheden, S. et al. AiZynthFinder: a fast, robust and flexible open-source software for retrosynthetic planning. J. Cheminf. 12, 1–9 (2020).
Molga, K., Szymkuc, S. & Grzybowski, B. A. Chemist ex machina: advanced synthesis planning by computers. Acc. Chem. Res. 54, 1094–1106 (2021).
Schwaller, P. et al. Machine intelligence for chemical reaction space. Wiley Interdiscip. Rev. Comput. Mol. Sci. 12, e1604 (2022).
Mayr, A., Klambauer, G., Unterthiner, T. & Hochreiter, S. Deeptox: toxicity prediction using deep learning. Front. Environ. Sci. 3, 80 (2016).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. 59, 3370–3388 (2019).
Chithrananda, S., Grand, G. & Ramsundar, B. Chemberta: large-scale self-supervised pretraining for molecular property prediction. Preprint at https://arxiv.org/abs/2010.09885 (2020).
van Tilborg, D., Alenicheva, A. & Grisoni, F. Exposing the limitations of molecular machine learning with activity cliffs. J. Chem. Inf. Model. 62, 5938–5951 (2022).
Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A. & Smit, B. Leveraging large language models for predictive chemistry. Nat. Mach. Intell. 6, 161–169 (2024).
Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 4, 268–276 (2018).
Blaschke, T. et al. Reinvent 2.0: an AI tool for de novo drug design. J. Chem. Inf. Model. 60, 5918–5922 (2020).
Tao, Q., Xu, P., Li, M. & Lu, W. Machine learning for perovskite materials design and discovery. NPJ Comput. Mater. 7, 1–18 (2021).
Gómez-Bombarelli, R. et al. Design of efficient molecular organic light-emitting diodes by a high-throughput virtual screening and experimental approach. Nat. Mater. 15, 1120–1127 (2016).
Shields, B. J. et al. Bayesian reaction optimization as a tool for chemical synthesis. Nature 590, 89–96 (2021).
Torres, J. A. G. et al. A multi-objective active learning platform and web app for reaction optimization. J. Am. Chem. Soc. 144, 19999–20007 (2022).
Ramos, M. C., Michtavy, S. S., Porosoff, M. D. & White, A. D. Bayesian optimization of catalysts with in-context learning. Preprint at https://arxiv.org/abs/2304.05341 (2023).
Marra, G., Giannini, F., Diligenti, M. & Gori, M. Integrating learning and reasoning with deep logic models. In Proc. Machine Learning and Knowledge Discovery in Databases, Part II (eds. Hutter, F. et al.) 517–532 (Springer, 2020).
Wei, J. et al. Chain-of-thought prompting elicits reasoning in large language models. Adv. Neural Inf. Process. Syst. 35, 24824–24837 (2022).
Ho, N., Schmid, L. & Yun, S.-Y. Large language models are reasoning teachers. In Proc. 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds. Rogers, A. et al.) 14852–14882 (ACL, 2023).
Yao, S. et al. ReAct: synergizing reasoning and acting in language models. In Proc. 11th International Conference on Learning Representations (OpenReview, 2023).
Zelikman, E., Wu, Y., Mu, J. & Goodman, N. Star: bootstrapping reasoning with reasoning. Adv. Neural Inf. Process. Syst. 35, 15476–15488 (2022).
Zhao, Z.-W., del Cueto, M. & Troisi, A. Limitations of machine learning models when predicting compounds with completely new chemistries: possible improvements applied to the discovery of new non-fullerene acceptors. Digit. Discov. 1, 266–276 (2022).
Vaucher, A. C. et al. Inferring experimental procedures from text-based representations of chemical reactions. Nat. Commun. 12, 2573 (2021).
Schwaller, P. et al. Mapping the space of chemical reactions using attention-based neural networks. Nat. Mach. Intell. 3, 144–152 (2021).
RXN for Chemistry. rxn4Chemistry. GitHub https://github.com/rxn4chemistry/rxn4chemistry (2020).
Thakkar, A., Kogej, T., Reymond, J.-L., Engkvist, O. & Bjerrum, E. J. Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain. Chem. Sci. 11, 154–168 (2020).
Thakkar, A., Selmi, N., Reymond, J.-L., Engkvist, O. & Bjerrum, E. J. ‘Ring breaker’: neural network driven synthesis prediction of the ring system chemical space. J. Med. Chem. 63, 8791–8808 (2020).
Yang, Z. et al. Mm-react: prompting ChatGPT for multimodal reasoning and action. Preprint at https://arxiv.org/abs/2303.11381 (2023).
Shen, Y. et al. Hugginggpt: solving AI tasks with chatgpt and its friends in huggingface. Poster at Advances in Neural Information Processing Systems 36 (2023).
Karpas, E. et al. Mrkl systems: a modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. Preprint at https://arxiv.org/abs/2205.00445 (2022).
Boiko, D. A., MacKnight, R., Kline, B. & Gomes, G. Autonomous chemical research with large language models. Nature 624, 570–578 (2023).
RoboRXN. IBM https://research.ibm.com/science/ibm-roborxn/ (2021).
Wittkopp, A. & Schreiner, P. R. Metal-free, noncovalent catalysis of Diels-Alder reactions by neutral hydrogen bond donors in organic solvents and in water. Chem. Eur. J. 9, 407–414 (2003).
Wittkopp, A. & Schreiner, P. R. 无金属，非共价的中性氢键供体在有机溶剂和水中的 Diels-Alder 反应催化. 化学欧洲杂志. 9, 407–414 (2003).
Article Google Scholar
Schreiner, P. R. & Wittkopp, A. H-bonding additives act like Lewis acid catalysts. Org. Lett. 4, 217–220 (2002).
施赖纳，P.R. & 维特科普，A. - 键合添加剂表现出莱文酸催化剂的性质。有机信件 4, 217–220 (2002)。
Article Google Scholar
Herrera, R. P., Sgarzani, V., Bernardi, L. & Ricci, A. Catalytic enantioselective friedel-crafts alkylation of indoles with nitroalkenes by using a simple thiourea organocatalyst. Angew. Chem. Int. Ed. Engl. 44, 6576–6579 (2005).
赫拉拉，R. P.，萨尔扎尼，V.，伯纳尔迪，L. & 里奇，A. 使用简单的硫脲有机催化剂，吲哚的硝基烷基催化对映选择性 Friedel-Crafts 烷基化。《应用化学国际版》44 卷，6576-6579 页（2005 年）。
Article Google Scholar
Okino, T., Hoashi, Y. & Takemoto, Y. Enantioselective Michael reaction of malonates to nitroolefins catalyzed by bifunctional organocatalysts. J. Am. Chem. Soc. 125, 12672–12673 (2003).
秋野，T.，小林，Y. & 高本，Y. 使用双功能有机催化剂催化马尔诺酸与硝基烯烃的对映选择性迈克尔反应。美国化学学会志，125，12672–12673（2003 年）。
Article Google Scholar
Joung, J. F., Han, M., Jeong, M. & Park, S. DB for chromophore. figshare https://figshare.com/articles/dataset/DB_for_chromophore/12045567 (2020).
乔恩, J. F., 韩, M., 金, M. & 朴, S. 色团簇数据库。figshare https://figshare.com/articles/dataset/色团簇数据库/12045567 (2020)。
Lowe, D. M. Extraction of Chemical Structures and Reactions from the Literature. PhD thesis, Univ. of Cambridge (2012). 重试错误原因
Wu, Z. et al. Moleculenet: a benchmark for molecular machine learning. Chem. Sci. 9, 513–530 (2018).
吴，Z. 等。Moleculenet：分子机器学习的基准。化学科学。9，513–530（2018 年）。
Article Google Scholar
Liu, Y. et al. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proc. Conference on Empirical Methods in Natural Language Processing (eds. Bouamor, H. et al.) 2511–2522 (ACL, 2023). 重试错误原因
Eloundou, T., Manning, S., Mishkin, P. & Rock, D. GPTs are GPTs: an early look at the labor market impact potential of large language models. Preprint at https://arxiv.org/abs/2303.10130 (2023). 重试错误原因
Grzybowski, B. A., Badowski, T., Molga, K. & Szymkuc, S. Network search algorithms and scoring functions for advanced-level computerized synthesis planning. Wiley Interdiscip. Rev. Comput. Mol. Sci. 13, e1630 (2023). 重试错误原因
Article Google Scholar
Thakkar, A. et al. Artificial intelligence and automation in computer aided synthesis planning. React. Chem. Eng. 6, 27–51 (2021). 重试错误原因
Article Google Scholar
Urbina, F., Lentzos, F., Invernizzi, C. & Ekins, S. Dual use of artificial-intelligence-powered drug discovery. Nat. Mach. Intell. 4, 189–191 (2022).
乌布纳，F.，伦佐斯，F.，因韦尼齐，C. & 埃金斯，S. 人工智能驱动药物发现的双重用途。《自然机器智能》4，189-191（2022 年）。
Article Google Scholar
Urbina, F., Lentzos, F., Invernizzi, C. & Ekins, S. A teachable moment for dual-use. Nat. Mach. Intell. 4, 607–607 (2022). 重试错误原因
Article Google Scholar
Campbell, Q. L., Herington, J. & White, A. D. Censoring chemical data to mitigate dual use risk. Preprint at https://arxiv.org/abs/2304.10510 (2023).
坎贝尔，Q. L.，赫灵顿，J.与怀特，A. D. 对化学数据进行删减以减轻双重用途风险。预印本发布于 https://arxiv.org/abs/2304.10510（2023 年）。
Gao, L., Schulman, J. & Hilton, J. Scaling laws for reward model overoptimization. In Proc. International Conference on Machine Learning (eds Krause, A. et al.) 10835–10866 (PMLR, 2023).
高，L.，舒尔曼，J. & 希尔顿，J. 奖励模型过优化的缩放定律。在国际机器学习会议（编者：克劳斯，A.等人）第 10835-10866 页（PMLR，2023 年）。
Radford, A. et al. Improving language understanding by generative pre-training. OpenAI blog https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf (2018).
Radford, A. 等人。通过生成式预训练提高语言理解。OpenAI 博客，https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf（2018 年）。
Li, B. et al. Trustworthy AI: from principles to practices. ACM Comput. Surv. 55, 1–46 (2021). 重试错误原因
Google Scholar
Hocky, G. M. & White, A. D. Natural language processing models that automate programming will transform chemistry research and teaching. Dig. Discov. 1, 79–83 (2022). 重试错误原因
Article Google Scholar
Henderson, P. et al. Foundation models and fair use. Preprint at https://arxiv.org/abs/2303.15715 (2023). 重试错误原因
Askell, A., Brundage, M. & Hadfield, G. The role of cooperation in responsible AI development. Preprint at https://arxiv.org/abs/1907.04534 (2019). 重试错误原因
Neufville, R. D. & Baum, S. D. Collective action on artificial intelligence: a primer and review. Technol. Soc. 66, 101649 (2021). 重试错误原因
Article Google Scholar
Touvron, H. et al. Llama: open and efficient foundation language models. Preprint at https://arxiv.org/abs/2302.13971 (2023).
图弗隆，H. 等人。Llama：开放且高效的基语言模型。预印本位于 https://arxiv.org/abs/2302.13971（2023 年）。
Chiang, W.-L. et al. Vicuna: an open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. LMSYS Org. https://lmsys.org/blog/2023-03-30-vicuna/ (2023).
蒋，W.-L. 等。美洲驼：一个开源聊天机器人，以 90%*的 ChatGPT 质量让 GPT-4 印象深刻。LMSYS 组织。https://lmsys.org/blog/2023-03-30-vicuna/（2023 年）。
Mukherjee, S. et al. Orca: progressive learning from complex explanation traces of GPT-4. Preprint at https://arxiv.org/abs/2306.02707 (2023).
穆赫吉，S.等人。orca：从 GPT-4 的复杂解释轨迹进行渐进式学习。预印本发布于 https://arxiv.org/abs/2306.02707（2023 年）。
Chase, H. LangChain. GitHub https://github.com/hwchase17/langchain (2022). 重试错误原因
Press, O. et al. Measuring and narrowing the compositionality gap in language models. In Proc. Association for Computational Linguistics: EMNLP (eds. Bouamor, H. et al.) 5687–5711 (ACL, 2023). 重试错误原因
Google search API. SerpApi https://serpapi.com/ (2023). 重试错误原因
Neelakantan, A. et al. Text and code embeddings by contrastive pre-training. Preprint at https://arxiv.org/abs/2201.10005 (2022). 重试错误原因
Johnson, J., Douze, M. & Jégou, H. Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7, 535–547 (2019). 重试错误原因
Article Google Scholar
ChemSpace https://chem-space.com/ (2023). 重试错误原因
National Center for Biotechnology Information. PubChem. NIH https://pubchem.ncbi.nlm.nih.gov/ (2023). 重试错误原因
Medina, J. & White, A. D. Bloom filters for molecules. J. Cheminf. 15, 95 (2023). 重试错误原因
Article Google Scholar
Irwin, J. J. et al. Zinc20—a free ultralarge-scale chemical database for ligand discovery. J. Chem. Inf. Model. 60, 6065–6073 (2020).
伊尔温，J. J. 等。Zinc20——一种免费的超大规模化学数据库，用于配体发现。化学信息与模型杂志。60 卷，6065-6073 页（2020 年）。
Article Google Scholar
Chemical Abstracts Service. CAS registry number. CAS www.cas.org/content/cas-registry (2023).
化学摘要服务。CAS 登记号。CAS www.cas.org/content/cas-registry (2023)。
Tanimoto, T. T. An Elementary Mathematical Theory of Classification and Prediction (IBM, 1958).
田蒙，T. T. 一种基本的数学分类与预测理论（IBM，1958 年）。
Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
罗杰斯，D. & 黑恩，M. 扩展连接性指纹。《化学信息与模型杂志》. 50 卷，742-754 页（2010 年）。
Article Google Scholar
White, A. D. Synspace. GitHub https://github.com/whitead/synspace (2023).
怀特，A.D. Synspace. GitHub https://github.com/whitead/synspace (2023).
Wellawatte, G. P., Seshadri, A. & White, A. D. Model agnostic generation of counterfactual explanations for molecules. Chem. Sci. 13, 3697–3705 (2022).
Wellawatte, G. P., Seshadri, A. & White, A. D. 对分子的模型无关生成反事实解释. 化学科学. 13, 3697–3705 (2022).
Article Google Scholar
Hartenfeller, M. et al. A collection of robust organic synthesis reactions for in silico molecule design. J. Chem. Inf. Model. 51, 3093–3098 (2011).
哈滕费尔德，M. 等。用于计算机辅助分子设计的稳健有机合成反应集。化学信息与模型杂志。51，3093–3098（2011 年）。
Article Google Scholar
Yang, Q. et al. Molecular transformer unifies reaction prediction and retrosynthesis across pharma chemical space. Chem. Commun. 55, 12152–12155 (2019). 重试错误原因
Article Google Scholar
Purchasable Mcule. Mcule https://purchasable.mcule.com/ (2023). 重试错误原因
RDKit: open-source cheminformatics (RDKit, 2023); www.rdkit.org 重试错误原因
Chemical weapons convention, annex on chemicals, b. schedules of chemicals. OPCW www.opcw.org/chemical-weapons-convention/annexes/annex-chemicals/annex-chemicals (2024). 重试错误原因
The Australia Group. Australia Group common control lists: chemical weapons precursors. Department of Foreign Affairs and Trade www.dfat.gov.au/publications/minisite/theaustraliagroupnet/site/en/controllists.html (2023).
澳大利亚集团。澳大利亚集团共同控制清单：化学武器前体。外交贸易部 www.dfat.gov.au/publications/minisite/theaustraliagroupnet/site/en/controllists.html (2023)。
Namerxn (NextMove Software, 2023); www.nextmovesoftware.com/namerxn.html 重试错误原因
Carey, J. S., Laffan, D., Thomson, C. & Williams, M. T. Analysis of the reactions used for the preparation of drug candidate molecules. Org. Biomol. Chem. 4, 2337–2347 (2006).
Carey, J. S., Laffan, D., Thomson, C. & Williams, M. T. 药物候选分子制备中使用的反应的分析. Org. Biomol. Chem. 4, 2337–2347 (2006).
Article Google Scholar
Bran, A. & Cox, S. ur-whitelab/chemcrow-runs: Zendo release. Zenodo https://doi.org/10.5281/zenodo.10884645 (2024). 重试错误原因
Bran, A., Cox, S., White, A. & Schwaller, P. ur-whitelab/chemcrow-public: v0.3.24. Zenodo https://doi.org/10.5281/zenodo.10884639 (2024). 重试错误原因

Download references 重试错误原因

Acknowledgements

A.M.B., O.S. and P.S. acknowledge support from NCCR Catalysis (grant no. 180544), a National Centre of Competence in Research funded by the Swiss National Science Foundation. S.C. and A.D.W. acknowledge support from the National Science Foundation under grant no. 1751471. Research reported in this work was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award no. R35GM137966. We thank the wider RXN for Chemistry team for their support and for granting limited access to the platform for the sole purpose of executing the reported syntheses. We thank M. Lederbauer and J. Marulanda for helping with the illustrations in Fig. 1.

Funding

Open access funding provided by EPFL Lausanne.

Author information

These authors contributed equally: Andres M. Bran, Sam Cox.

Authors and Affiliations

Laboratory of Artificial Chemical Intelligence (LIAC), ISIC, EPFL, Lausanne, Switzerland
Andres M. Bran, Oliver Schilter & Philippe Schwaller
National Centre of Competence in Research (NCCR) Catalysis, EPFL, Lausanne, Switzerland
Andres M. Bran, Oliver Schilter & Philippe Schwaller
Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
Sam Cox & Andrew D. White
FutureHouse, San Francisco, CA, USA
Sam Cox & Andrew D. White
Accelerated Discovery, IBM Research – Europe, Rüschlikon, Switzerland
Oliver Schilter & Carlo Baldassari

Authors

Andres M. Bran
Sam Cox
Oliver Schilter
Carlo Baldassari
Andrew D. White
Philippe Schwaller

Contributions

A.M.B. and S.C. contributed to methodology, model creation, writing, visualization, guardrails and assessment. O.S. and C.B. contributed to methodology, laboratory experiments and assessment. A.D.W. contributed to conceptualization, methodology, model creation, writing, funding and project supervision. P.S. contributed to conceptualization, methodology, model creation, assessment, writing, funding and project supervision.

Corresponding authors

Correspondence to Andrew D. White or Philippe Schwaller.

Ethics declarations

Competing interests

A.D.W. has served as a paid consultant for evaluating AI model safety at OpenAI. The other authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Michael Heinzinger and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Discussion and Figs. 1–18.

Reporting Summary

Source data

Source Data Fig. 1

Unprocessed evaluation data.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

M. Bran, A., Cox, S., Schilter, O. et al. Augmenting large language models with chemistry tools. Nat Mach Intell 6, 525–535 (2024). https://doi.org/10.1038/s42256-024-00832-8

Received: 13 September 2023
Accepted: 27 March 2024
Published: 08 May 2024
Issue Date: May 2024
DOI: https://doi.org/10.1038/s42256-024-00832-8