ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis 用于文本挖掘和 MOF 合成预测的 ChatGPT 化学助手
^(†){ }^{\dagger} Department of Chemistry, University of California, Berkeley, California 94720, United States ^(†){ }^{\dagger} 美国加州大学伯克利分校化学系,加利福尼亚州,94720‡\ddagger Kavli Energy Nanoscience Institute, University of California, Berkeley, California 94720, United States ‡\ddagger 美国加州大学伯克利分校卡弗利能源纳米科学研究所,加利福尼亚州,94720§ Bakar Institute of Digital Materials for the Planet, College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, United States^(∙){ }^{\bullet} Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, United States ^(∙){ }^{\bullet} 美国加州大学伯克利分校电子工程与计算机科学系,加利福尼亚州,94720† Department of Mathematics, University of California, Berkeley, California 94720, United States † 加利福尼亚大学伯克利分校数学系,美国加利福尼亚州,94720# Department of Statistics, University of California, Berkeley, California 94720, United States # 加利福尼亚大学伯克利分校统计系,美国加利福尼亚州,94720§\S School of Information, University of California, Berkeley, California 94720, United States §\S 美国加州大学伯克利分校信息学院,加利福尼亚州,94720" KACST-UC Berkeley Center of Excellence for Nanomaterials for Clean Energy Applications, King Abdulaziz City for Science and Technology, Riyadh 11442, Saudi Arabia "沙特阿拉伯利雅得 11442,阿卜杜勒阿齐兹国王科技城,KACST-加州大学伯克利分校清洁能源应用纳米材料卓越中心
The dream of chemists is to create matter in the hope of advancing human knowledge for the betterment of society. ^(1,2){ }^{1,2} As we stand on the precipice of the age of Artificial General Intelligence (AGI), the potential for synergy between AI and chemistry is vast and promising. ^(3,4){ }^{3,4} The idea of creating AI-powered chemistry assistants offers unprecedented opportunities to revolutionize the landscape of chemistry research by applying knowledge across various disciplines, efficiently processing laborintensive and time-consuming tasks, such as literature searches, compound screening and data analysis. AI-powered chemistry may ultimately transcend the limits of human cognition. ^(5-8){ }^{5-8} 化学家的梦想是创造物质,希望推动人类知识的发展,从而改善社会。 ^(1,2){ }^{1,2} 当我们站在人工通用智能(AGI)时代的悬崖边上时,人工智能与化学之间的协同潜力巨大,前景广阔。 ^(3,4){ }^{3,4} 创建人工智能驱动的化学助手的想法提供了前所未有的机会,通过应用各学科的知识,高效处理劳动密集型和耗时的任务,如文献检索、化合物筛选和数据分析,彻底改变化学研究的面貌。人工智能驱动的化学最终可能会超越人类认知的极限。 ^(5-8){ }^{5-8}
Identifying chemical information for compounds, including ideal synthesis conditions and physical and chemical properties, has been a critical endeavor in chemistry research. The comprehensive summary of chemical information from literature reports, such as publications and patents, and their subsequent storage in an organized database format is the next logical and necessary step toward discovery of materials. ^(9){ }^{9} The challenge lies in efficiently mining the vast amount of available literature to obtain valuable information and insights. Traditionally, specialized natural language processing (NLP) models have been employed to address this issue. ^(10-14){ }^{10-14} However, these approaches can be labor-intensive and necessitate expertise in coding, computer science, and data science. Furthermore, they are less generalizable, requiring rewriting the program when the target changes. The advent of large language models (LLMs), such as GPT-3, GPT-3.5 and GPT-4, has the potential to fundamentally transform this process and revolutionize the routine of chemistry research in the next decade., 15-18 确定化合物的化学信息,包括理想的合成条件以及物理和化学特性,一直是化学研究中的一项重要工作。从出版物和专利等文献报告中全面总结化学信息,并随后以有组织的数据库格式进行存储,是发现材料的下一个合乎逻辑的必要步骤。 ^(9){ }^{9} 所面临的挑战在于如何高效地挖掘大量可用文献,以获得有价值的信息和见解。传统上,人们采用专门的自然语言处理 (NLP) 模型来解决这一问题。 ^(10-14){ }^{10-14} 然而,这些方法可能是劳动密集型的,并且需要编码、计算机科学和数据科学方面的专业知识。此外,这些方法的通用性较差,当目标发生变化时需要重写程序。大型语言模型(LLMs)的出现,如 GPT-3、GPT-3.5 和 GPT-4,有可能从根本上改变这一过程,并在未来十年彻底改变化学研究的常规方法。
Figure 1. Schematics of ChatGPT Chemistry Assistant workflow having three different processes employing ChatGPT and ChemPrompt for efficient text mining and summarization of MOF synthesis conditions from a diverse set of published research articles. Each process is distinctively labeled with red, blue, and green dots respectively. To illustrate, Process 1 initiates with “Published Research Articles”, proceeds to “Human Preselection”, moves onto the “Synthesis Paragraph”, integrates “ChatGPT with ChemPrompt”, and culminates in “Tabulated Data”. Steps shared among multiple processes are indicated with corresponding color-coded dots. The two-snakes logo of Python is included to indicate the use of the Python programming language, with the logo’s credit attributed to the Python Software Foundation (PSF). 图 1.ChatGPT Chemistry Assistant 工作流程示意图,其中有三个不同的流程,采用 ChatGPT 和 ChemPrompt 从不同的已发表研究文章中对 MOF 合成条件进行有效的文本挖掘和总结。每个流程分别用红色、蓝色和绿色圆点标注。例如,流程 1 从 "已发表的研究文章 "开始,然后是 "人类预选",接着是 "合成段落",最后是 "ChatGPT 与 ChemPrompt",最后是 "制表数据"。多个过程共享的步骤用相应的彩色圆点表示。Python 的双蛇徽标表示使用了 Python 编程语言,该徽标归功于 Python 软件基金会 (PSF)。
Herein, we demonstrate that LLMs, including ChatGPT based on the GPT-3.5 and GPT-4 model, can act as chemistry assistants to collaborate with human researchers, facilitating text mining and data analysis to accelerate the research process. To harness the power of what we termed as the ChatGPT Chemistry Assistant (CCA), we provide a comprehensive guide on ChatGPT prompt engineering for chemistry-related tasks, making it accessible to researchers regardless of their familiarity with machine learning, thus bridging the gap between chemists and computer scientists. In this report, we present (1) A novel approach to using ChatGPT for text mining the synthesis conditions of metal-organic frameworks (MOFs), which can be easily 在此,我们展示了 LLMs(包括基于 GPT-3.5 和 GPT-4 模型的 ChatGPT)可以作为化学助手与人类研究人员合作,促进文本挖掘和数据分析,从而加快研究进程。为了利用我们称之为 ChatGPT 化学助手(CCA)的强大功能,我们为化学相关任务的 ChatGPT 提示工程提供了全面的指导,使研究人员无论是否熟悉机器学习都能使用它,从而缩小了化学家与计算机科学家之间的差距。在本报告中,我们将介绍 (1) 一种使用 ChatGPT 对金属有机框架(MOFs)的合成条件进行文本挖掘的新方法,这种方法可以轻松地
generalizable to other contexts requiring minimal coding knowledge and operating primarily on verbal instructions. (2) Assessment of ChatGPT’s intelligence in literature text mining through accuracy evaluation and its ability for data refinement. (3) Utilization of the chemical synthesis reaction dataset obtained from text mining to train a model capable of predicting reaction results as crystalline powder or single crystals. Furthermore, we demonstrate that the CCA chatbot can be tuned to specialize in answering questions related to MOF synthesis based on literature conditions, with minimal hallucinations. This study underscores the transformative potential of ChatGPT and other LLMs in the realm of chemistry research, offering new avenues for collaboration and accelerating scientific discovery. (1) ChatGPT 可在其他情况下使用,只需最低限度的编码知识,并主要根据口头指令运行。(2) 通过准确性评估和数据提炼能力评估 ChatGPT 在文献文本挖掘方面的智能。(3) 利用从文本挖掘中获得的化学合成反应数据集,训练一个能够预测反应结果为结晶粉末或单晶的模型。此外,我们还证明了 CCA 聊天机器人可以根据文献条件进行调整,以专门回答与 MOF 合成有关的问题,并尽量减少幻觉。这项研究强调了 ChatGPT 和其他 LLMs 在化学研究领域的变革潜力,为合作和加速科学发现提供了新途径。
MATERIALS AND METHODS 材料和方法
Design Considerations for ChatGPT-Based Text Mining. In curating research papers for ChatGPT to read and extract information, it is imperative to account for the diversity in MOF synthesis conditions, such as variations in metal sources, linkers, solvents, and equipment, as well as the different writing styles employed. Notably, the absence of a standardized format for reporting MOF synthesis conditions leads to variable reporting templates by research groups and journals. Indeed, by incorporating a broad spectrum of narrative styles, we can examine ChatGPT’s robustness in processing information from heterogeneous sources. On the other hand, it is essential to recognize that the challenge of establishing unambiguous criteria to identify MOF compounds in the literature may lead to the inadvertent inclusion of some non-MOF compounds reported in earlier publications that are non-porous inorganic complexes and amorphous coordination polymers (included in some MOF datasets). As such, maintaining a balance between quality and quantity is vital, and prioritizing the selection of high-quality and well-cited papers, rather than incorporating all associated papers indiscriminately can ensure that the text mining of MOF synthesis conditions yields reliable and accurate data. 基于 ChatGPT 的文本挖掘设计注意事项。在整理供 ChatGPT 阅读和提取信息的研究论文时,必须考虑到 MOF 合成条件的多样性,如金属源、连接剂、溶剂和设备的变化,以及所采用的不同写作风格。值得注意的是,由于缺乏报告 MOF 合成条件的标准格式,导致研究小组和期刊的报告模板各不相同。事实上,通过采用广泛的叙述风格,我们可以检验 ChatGPT 在处理来自不同来源信息时的稳健性。另一方面,我们必须认识到,建立明确的标准来识别文献中的 MOF 化合物是一项挑战,这可能会导致在早期出版物中报告的一些非 MOF 化合物被无意中纳入,这些化合物属于无孔无机复合物和无定形配位聚合物(已纳入某些 MOF 数据集)。因此,保持质量和数量之间的平衡至关重要,优先选择高质量和引用充分的论文,而不是不加区分地纳入所有相关论文,可以确保 MOF 合成条件的文本挖掘产生可靠、准确的数据。
Moreover, papers discussing post-synthetic modifications, catalytic reactions of MOFs, and MOF composites are not directly pertinent to our objective of identifying MOF synthesis conditions. Hence, such papers have been excluded. Another consideration is that MOFs can be synthesized as both microcrystalline powders and single crystals, both of which should be regarded as valid candidates for our dataset. Utilizing the above-mentioned selection criteria, we narrowed our selection to 228 papers from an extensive pool of MOF papers, retrieved from Web of Science, Cambridge Structure Database MOF subset, ^(19){ }^{19} and the CoreMOF database. ^(20,21){ }^{20,21} This sample represents a diverse range of MOF synthesis conditions and narrative styles. 此外,讨论合成后修饰、MOF 的催化反应和 MOF 复合材料的论文与我们确定 MOF 合成条件的目标并不直接相关。因此,我们排除了此类论文。另一个考虑因素是,MOF 既可以微晶粉末的形式合成,也可以单晶的形式合成。根据上述选择标准,我们从科学网、剑桥结构数据库MOF子集、 ^(19){ }^{19} 和CoreMOF数据库检索到的大量MOF论文中筛选出228篇论文。 ^(20,21){ }^{20,21} 这个样本代表了不同的MOF合成条件和叙述风格。
To enable ChatGPT to process each paper, we devised three different approaches analogous to human paper reading: (1) locating potential sections containing synthesis conditions within the document, (2) confirming the presence of synthesis conditions in the identified sections, and (3) extracting synthesis parameters one by one. For our ChatGPT Chemistry Assistant, these steps are accomplished through filtering, classification, and summarization (Figure 1). 为了让 ChatGPT 能够处理每篇论文,我们设计了三种类似于人类阅读论文的不同方法:(1)在文档中查找包含合成条件的潜在章节;(2)在确定的章节中确认合成条件的存在;(3)逐一提取合成参数。对于我们的 ChatGPT 化学助手来说,这些步骤是通过过滤、分类和总结来完成的(图 1)。
Figure 2. Illustration of a carefully designed ChemPrompt (shown on the left), encapsulating all three fundamental principles of ChemPrompt Engineering (shown on the right). The prompt guides ChatGPT to systematically extract and summarize synthesis conditions from a specified section in a research article, organizing the data into a well-structured table. 图 2.精心设计的化学提示图(左图),囊括了化学提示工程的所有三项基本原则(右图)。该提示引导 ChatGPT 从研究文章的指定章节中系统地提取和总结合成条件,并将数据组织成结构良好的表格。
In Process 1, we developed prompts to guide ChatGPT in summarizing text from designated experimental sections contained in those papers. To replace the need for human intervention to obtain synthesis sections, in Process 2 , we designed a method for ChatGPT to categorize text inputs as either “experimental section” or “non-experimental section”, enabling it to generate experimental sections for summarization. In Process 3, we further devised a technique to swiftly eliminate irrelevant paper sections, such as references, titles, and acknowledgments, which are unlikely to encompass comprehensive synthesis conditions. This accelerates processing speed for the later classification task. As such, in Process 1, ChatGPT is solely responsible for summarizing and tabulating synthesis conditions and requires one or more paragraphs of experimental text as input, while Process 2 and 3 can be considered as an “automated paper reading system”. While Process 2 entails a thorough examination of the entire paper to scrutinize each section, the more efficient Process 3 rapidly scans the entire paper, removing the least relevant portions, thereby reducing the number of paragraphs that ChatGPT must meticulously analyze. 在流程 1 中,我们开发了一些提示,引导 ChatGPT 总结这些论文中指定实验部分的文本。为了取代人工干预来获取合成部分,在流程 2 中,我们设计了一种方法,让 ChatGPT 将输入的文本分为 "实验部分 "或 "非实验部分",从而生成实验部分进行摘要。在流程 3 中,我们进一步设计了一种技术,可以快速剔除与论文无关的部分,如参考文献、标题和致谢,因为这些部分不太可能包含全面的综合条件。这为后面的分类任务加快了处理速度。因此,在流程 1 中,ChatGPT 只负责汇总和制表综合条件,并要求输入一段或多段实验文本,而流程 2 和 3 则可视为 "自动阅卷系统"。流程 2 需要彻底检查整篇论文,仔细研究每个部分,而效率更高的流程 3 则快速扫描整篇论文,删除最不相关的部分,从而减少 ChatGPT 必须仔细分析的段落数量。
Prompt Engineering. In the realm of chemistry-related tasks, ChatGPT’s performance can be significantly enhanced by employing prompt engineering (PE)—a meticulous approach to designing prompts that steer ChatGPT towards generating precise and pertinent information. We propose three fundamental principles in prompt engineering for chemistry-focused applications, denoted as ChemPrompt Engineering: 提示工程。在化学相关任务领域,通过使用提示工程(PE)--一种精心设计提示的方法,引导 ChatGPT 生成精确而相关的信息--可以显著提高 ChatGPT 的性能。我们提出了针对化学应用的提示工程的三个基本原则,称为化学提示工程(ChemPrompt Engineering):
(1) Minimizing Hallucination, which entails the formulation of prompts to avoid eliciting fabricated or misleading content from ChatGPT. This is particularly important in the field of chemistry, where the accuracy of information can have significant implications on research outcomes and safety. For instance, when asked to provide synthesis conditions for MOFs without any additional prompt or context, ChatGPT may recognize that MOF-99999 does not exist but will generate fabricated conditions for existing compounds with names like MOF-41, MOF-419, and MOF-519. We should note that with additional prompts followed after the question, it is possible to minimize hallucination and enforce ChatGPT to answer the questions based on its knowledge (Table 1 and Table 2). Furthermore, we demonstrate that with well-designed prompts and context, hallucination occurrences can be minimized (Supporting Information, Section S2.1). We note that this should be the first and foremost principle to follow when designing prompts for ChatGPT to perform in handling text and questions relevant to chemical information. (1) 尽量减少幻觉,这需要制定提示,避免从 ChatGPT 中引出捏造或误导性的内容。这一点在化学领域尤为重要,因为信息的准确性会对研究成果和安全性产生重大影响。例如,在没有任何额外提示或上下文的情况下要求提供 MOF 的合成条件时,ChatGPT 可能会识别出 MOF-99999 并不存在,但会生成名称为 MOF-41、MOF-419 和 MOF-519 的现有化合物的合成条件。我们应该注意的是,如果在问题后跟上额外的提示,就有可能最大限度地减少幻觉,并强制 ChatGPT 根据自己的知识回答问题(表 1 和表 2)。此外,我们还证明,通过精心设计的提示和语境,可以最大限度地减少幻觉的发生(佐证资料,第 S2.1 节)。我们注意到,在为 ChatGPT 设计处理与化学信息相关的文本和问题的提示时,这应该是首要遵循的原则。
(2) Implementing Detailed Instructions, whereby explicit directions are provided in the prompt to assist ChatGPT in understanding the context and desired response format. By incorporating detailed guidance and context into the prompts, we can facilitate a more focused and accurate response from ChatGPT. In chemistry-related tasks, this approach narrows down the (2) 实施详细说明,即在提示中提供明确的指示,以帮助 ChatGPT 理解上下文和所需的回答格式。通过在提示中加入详细的指导和上下文,我们可以帮助 ChatGPT 做出更有针对性和更准确的回答。在化学相关任务中,这种方法缩小了
Table 1. Assessment of hallucination in ChatGPT response without prompt engineering. 表 1.对 ChatGPT 反应中的幻觉进行评估,但未及时进行工程设计。
Query 查询
ChatGPT Response (a) ChatGPT 答复 (a)
Which metal is used in the synthesis of MOF-5? 在合成 MOF-5 时使用了哪种金属?
Zinc (Correct) 锌(正确)
Which metal is used in the synthesis of MOF-519? 在合成 MOF-519 时使用了哪种金属?
Zirconium (Incorrect) 锆(不正确)
What is the linker used in the synthesis of MOF-99999? 合成 MOF-99999 时使用的连接剂是什么?
I don't know (Correct) 我不知道(正确)
What is the linker used in the synthesis of MOF-419? 合成 MOF-419 时使用的连接剂是什么?
Terephthalic acid (Incorrect) 对苯二甲酸(不正确)
What is the linker used in the synthesis of ZIF-8? 合成 ZIF-8 时使用的连接剂是什么?
2-methylimidazole (Correct) 2-甲基咪唑(正确)
Query ChatGPT Response (a)
Which metal is used in the synthesis of MOF-5? Zinc (Correct)
Which metal is used in the synthesis of MOF-519? Zirconium (Incorrect)
What is the linker used in the synthesis of MOF-99999? I don't know (Correct)
What is the linker used in the synthesis of MOF-419? Terephthalic acid (Incorrect)
What is the linker used in the synthesis of ZIF-8? 2-methylimidazole (Correct)| Query | ChatGPT Response (a) |
| :--- | :--- |
| Which metal is used in the synthesis of MOF-5? | Zinc (Correct) |
| Which metal is used in the synthesis of MOF-519? | Zirconium (Incorrect) |
| What is the linker used in the synthesis of MOF-99999? | I don't know (Correct) |
| What is the linker used in the synthesis of MOF-419? | Terephthalic acid (Incorrect) |
| What is the linker used in the synthesis of ZIF-8? | 2-methylimidazole (Correct) |
Table 2. Improvements in ChatGPT response accuracy utilizing a basic prompt engineering strategy. 表 2.利用基本的提示工程策略提高 ChatGPT 响应的准确性。
Initial Query 初始查询
Guided Prompt 指导提示
ChatGPT Response (a) ChatGPT 答复 (a)
Which metal is used in the synthesis of MOF-5? 在合成 MOF-5 时使用了哪种金属?
If you're uncertain, please reply with 'I do not know'. 如果您不确定,请回复 "我不知道"。
锌(正确)
Zinc
(Correct)
Zinc
(Correct)| Zinc |
| :--- |
| (Correct) |
Which metal is used in the synthesis of MOF-519? 在合成 MOF-519 时使用了哪种金属?
I don't know (Correct) 我不知道(正确)
What is the linker used in the synthesis of MOF-99999? 合成 MOF-99999 时使用的连接剂是什么?
I don't know (Correct) 我不知道(正确)
What is the linker used in the synthesis of MOF-419? 合成 MOF-419 时使用的连接剂是什么?
I don't know (Correct) 我不知道(正确)
What is the linker used in the synthesis of ZIF-8? 合成 ZIF-8 时使用的连接剂是什么?
2-methylimidazole (Correct) 2-甲基咪唑(正确)
Initial Query Guided Prompt ChatGPT Response (a)
Which metal is used in the synthesis of MOF-5? If you're uncertain, please reply with 'I do not know'. "Zinc
(Correct)"
Which metal is used in the synthesis of MOF-519? I don't know (Correct)
What is the linker used in the synthesis of MOF-99999? I don't know (Correct)
What is the linker used in the synthesis of MOF-419? I don't know (Correct)
What is the linker used in the synthesis of ZIF-8? 2-methylimidazole (Correct)| Initial Query | Guided Prompt | ChatGPT Response (a) |
| :---: | :---: | :---: |
| Which metal is used in the synthesis of MOF-5? | If you're uncertain, please reply with 'I do not know'. | Zinc <br> (Correct) |
| Which metal is used in the synthesis of MOF-519? | | I don't know (Correct) |
| What is the linker used in the synthesis of MOF-99999? | | I don't know (Correct) |
| What is the linker used in the synthesis of MOF-419? | | I don't know (Correct) |
| What is the linker used in the synthesis of ZIF-8? | | 2-methylimidazole (Correct) |
potential answer space and reduces the likelihood of irrelevant or ambiguous responses. For example, we can specify not to include any organic linker synthesis conditions and focus solely on MOF synthesis (Supporting Information, Figure S8). In this case, we found that ChatGPT can recognize the features of organic linker synthesis and differentiate them from MOF synthesis. With proper prompts, information from organic linker synthesis will not be included. Additionally, instructions can provide step-by-step guidance, which has proven effective when multiple tasks are included in one prompt (Supporting Information, Section S2.2). 潜在的答案空间,减少了不相关或模棱两可的回答的可能性。例如,我们可以指定不包含任何有机连接体合成条件,而只关注 MOF 合成(佐证资料,图 S8)。在这种情况下,我们发现 ChatGPT 可以识别有机连接体合成的特征,并将其与 MOF 合成区分开来。有了适当的提示,有机连接体合成的信息就不会被包含在内。此外,说明可以提供逐步指导,这在一个提示中包含多个任务时被证明是有效的(佐证资料,第 S2.2 节)。
(3) Requesting Structured Output, which includes the incorporation of an organized and well-defined response template or instruction to facilitate data extraction. We emphasize that this principle is particularly valuable in the context of chemistry, where data can often be complex and multifaceted. Structured output enables the efficient extraction and interpretation of critical information, which in turn can significantly contribute to the advancement of research and knowledge in the field. Take synthesis condition extraction as an example, without clear instructions on the formatted output, ChatGPT can generate a table, list-like bullet points, or a paragraph, with the order of parameters such as reaction temperature, reaction time, and solvent volume not being uniform, making it challenging for later sorting and storage of the data. This can be easily improved by explicitly asking it to generate a table and providing a fixed header to start with prompt (Supporting Information, Section S2.3). By incorporating these principles, the resulting prompt can ensure that ChatGPT yields accurate and reliable results, ultimately enhancing its utility in tackling complex chemistry-related tasks (Figure 2). We further employ the idea of interactive prompt refinement, in which we start with asking ChatGPT to write a prompt to instruct itself by giving it preliminary descriptions and information (Supporting Information, Figure S15). Through conversation, we add more specific details and considerations to the prompt, testing it with some texts, and once we obtain output, we provide feedback to ChatGPT and ask it to improve the quality of the prompt (Supporting Information, Section S2.4). (3) 要求结构化输出,包括采用有组织的、定义明确的回复模板或指令,以方便数据提取。我们强调,这一原则在化学领域尤为重要,因为化学数据往往是复杂和多方面的。结构化的输出能够有效提取和解释关键信息,进而极大地促进该领域研究和知识的发展。以合成条件提取为例,如果没有明确的格式化输出说明,ChatGPT 可能会生成表格、列表式要点或段落,其中反应温度、反应时间和溶剂体积等参数的顺序并不统一,这给后期的数据分类和存储带来了挑战。通过明确要求其生成表格并提供一个固定的标题来开始提示,可以很容易地改善这一问题(佐证资料,第 S2.3 节)。通过结合这些原则,生成的提示可以确保 ChatGPT 得出准确可靠的结果,最终提高其在处理复杂的化学相关任务时的实用性(图 2)。我们进一步采用了交互式提示改进的理念,即首先让 ChatGPT 写出提示,通过提供初步描述和信息来指导自己(佐证资料,图 S15)。通过对话,我们为提示添加了更多具体细节和注意事项,并用一些文本对其进行测试,一旦获得输出,我们就会向 ChatGPT 提供反馈,并要求它改进提示的质量(佐证资料,第 S2.4 节)。
As there has been almost no literature systematically discussing prompt engineering in Chemistry, and the fact that this field is relatively new, we provide a comprehensive step-by-step ChemPrompt Engineering guide for beginners to start with, including numerous chemistry-related examples in the Supporting Information, Section S2. At present, everyone is at the same starting point, and no one possesses exclusive expertise in this area. It is our hope that this work will stimulate the development of more powerful prompt engineering skills and help every chemist quickly understand the art of ChemPrompt Engineering, thereby advancing the field of chemistry at large. 由于几乎没有文献系统地讨论过化学中的 "提示工程",而这一领域又相对较新,因此我们为初学者提供了一份循序渐进的 "化学提示工程 "入门指南,其中包括大量与化学相关的示例(见 "辅助信息 "第 S2 节)。目前,每个人的起点都是一样的,没有人在这一领域拥有独一无二的专业知识。我们希望这项工作能激励人们开发更强大的提示工程技能,帮助每一位化学家快速了解化学提示工程的艺术,从而推动整个化学领域的发展。
Process 1: Synthesis Conditions Summarization. One revolutionary aspect of ChatGPT is its specialized domain knowledge due to its extensive pre-trained text corpus, which enables an understanding of chemical nomenclature and reaction conditions. ^(18){ }^{18} In contrast to traditional NLP methods, ChatGPT requires no additional training for named entity recognition, and can readily identify inorganic metal sources, organic linkers, solvents, and other compounds within a given experimental text. Another notable feature is ChatGPT’s ability to recognize and associate compound abbreviations (e.g., DMF) with their full names ( N,NN, N-dimethylformamide) within the context of MOF synthesis (Supporting Information, Figure S5). This capability is crucial as the use of different abbreviations for the same compound can inflate the number of “unique compounds” in the dataset post text mining, leading to redundancy without providing new information. This challenge is difficult to address using traditional NLP methods or packages, as no model can inherently discern that DMF and N,N\mathrm{N}, \mathrm{N}-dimethylformamide are the same compound without a manually curated dictionary of chemical abbreviations. Although ChatGPT may not cover all abbreviations, its proficiency in identifying and associating the most common ones such as DEF, DI water, EtOH, and CH_(3)CN\mathrm{CH}_{3} \mathrm{CN} with their full names enhances data consistency and reduces redundancy. This, in turn, facilitates data retrieval and analysis, ensuring that different names of the same compound are treated as a single entity with its unique chemical identity and information. 流程 1:合成条件总结。ChatGPT 的一个革命性特点是其专业领域知识,这得益于其广泛的预训练文本语料库,该语料库能够理解化学术语和反应条件。 ^(18){ }^{18} 与传统的 NLP 方法相比,ChatGPT 不需要额外的命名实体识别训练,就能轻松识别给定实验文本中的无机金属源、有机连接剂、溶剂和其他化合物。另一个显著特点是 ChatGPT 能够识别并关联 MOF 合成背景下的化合物缩写(如 DMF)和全名( N,NN, N -二甲基甲酰胺)(辅助信息,图 S5)。这种能力至关重要,因为对同一化合物使用不同的缩写会增加文本挖掘后数据集中 "唯一化合物 "的数量,从而导致冗余而不提供新信息。使用传统的 NLP 方法或软件包很难解决这一难题,因为如果没有人工编辑的化学缩写字典,任何模型都无法从本质上辨别 DMF 和 N,N\mathrm{N}, \mathrm{N} 二甲基甲酰胺是同一种化合物。虽然 ChatGPT 可能无法涵盖所有缩写,但它能够熟练识别最常见的缩写,并将其与全名关联起来,如 DEF、去离子水、EtOH 和 CH_(3)CN\mathrm{CH}_{3} \mathrm{CN} 等,从而增强了数据的一致性并减少了冗余。这反过来又促进了数据检索和分析,确保同一化合物的不同名称被视为具有独特化学特性和信息的单一实体。
Our first goal is to develop a ChatGPT-based AI assistant that demonstrates high performance in converting a given experimental section paragraph into a table containing all synthesis parameters (Supporting Information, Figure S22). To design the prompt for this purpose, we incorporate the three principles discussed earlier into ChemPrompt Engineering (Figure 2). The rationale for using tabulation as the output for synthesis condition summarization is that the tabular format simplifies subsequent data sorting, analysis, and storage. In terms of the choice of 11 synthesis parameters, we include those deemed most important and non-negligible for each MOF synthesis. Specifically, these parameters encompass metal sources and quantities, dictating metal centers in the framework and their relative concentrations; the linker and its quantity, which affect connectivity and pore size within the MOF; the modulator and its quantity or volume, which can fine-tune the MOF’s structure by impacting the nucleation and growth of the MOF in the reaction; the solvent and its volume, which can influence both the crystallization process and the final MOF structure; and the reaction temperature and duration, which are vital parameters governing the kinetics and thermodynamics of MOF formation in each synthesis. In our prompt, we also account for the fact that some papers may report multiple synthesis conditions for the same compound and instruct ChatGPT to use multiple rows to include each variation. For multiple units of the same synthesis parameters, such as when molarity mass and weight mass are both reported, we encourage ChatGPT to include them in the same cell, separated by a comma, which can be later streamlined depending on the needs. If any information is not provided in the sections, e.g., most MOF reactions may not involve the use of modulators and some papers may not specify the reaction time, we expect ChatGPT to answer “N/A” for that parameter. Importantly, to eliminate non-MOF synthesis conditions such as organic linker synthesis, post-synthetic modification, or catalysis reactions, which are not helpful for studying MOF synthesis reactions, we simply add one line of narrative instruction, asking ChatGPT to ignore these types of reactions and focus solely on MOF synthesis parameters. Notably, this natural 我们的第一个目标是开发一个基于 ChatGPT 的人工智能助手,它能将给定的实验部分段落高效地转换成包含所有合成参数的表格(佐证资料,图 S22)。为了设计这样的提示,我们将前面讨论过的三个原则纳入了化学提示工程(图 2)。使用表格作为合成条件汇总输出的理由是,表格格式简化了后续的数据整理、分析和存储。在 11 个合成参数的选择方面,我们包括了每个 MOF 合成中最重要且不可忽略的参数。具体来说,这些参数包括:金属源及其数量(决定框架中的金属中心及其相对浓度);连接剂及其数量(影响 MOF 内的连通性和孔径大小);调节剂及其数量或体积(通过影响 MOF 在反应中的成核和生长,对 MOF 结构进行微调);溶剂及其体积,它可以影响结晶过程和 MOF 的最终结构;以及反应温度和持续时间,它们是影响每次合成中 MOF 形成的动力学和热力学的重要参数。在我们的提示中,我们还考虑到一些论文可能会报告同一化合物的多种合成条件,并指示 ChatGPT 使用多行来包含每种变化。对于同一合成参数的多个单位,如同时报告摩尔质量和重量质量时,我们鼓励 ChatGPT 将其包含在同一单元格中,中间用逗号隔开,以后可根据需要进行精简。 如果章节中没有提供任何信息,例如大多数 MOF 反应可能不涉及调制剂的使用,有些论文可能没有说明反应时间,我们希望 ChatGPT 对该参数的回答为 "N/A"。重要的是,为了排除有机连接体合成、合成后修饰或催化反应等非 MOF 合成条件,我们只需添加一行叙述性说明,要求 ChatGPT 忽略这些类型的反应,只关注 MOF 合成参数。值得注意的是,这种自然
language-based instruction is highly convenient, requiring no complex and laborious rule-based code to identify unwanted cases and filter them out, and is friendly to researchers without coding experience. 基于语言的教学非常方便,不需要复杂费力的规则代码来识别和过滤不需要的案例,对没有编码经验的研究人员也很友好。
The finalized prompts for Process 1 consist of three parts: (i) a request for ChatGPT to summarize and tabulate the reaction conditions, and only use the text or information provided by humans, which adheres to Principle 1 to minimize hallucination; (ii) a specification of the output table’s structure, enumerating expectations and handling instructions, which follows Principles 2 and 3 for detailed instructions and structured output requests; and (iii) the context, consisting of MOF synthesis reaction condition paragraphs from experimental sections or supporting information in research articles. Note that parts (i) and (ii) are fixed prompts, while part (iii) is considered as “input.” The combined prompt results in a single question-and-answer interaction, allowing ChatGPT to generate a summarization of the given synthesis conditions as output. 流程 1 的最终提示由三部分组成:(i) 要求 ChatGPT 对反应条件进行总结和制表,并只使用人类提供的文本或信息,这符合原则 1,以最大限度地减少幻觉;(ii) 输出表的结构说明,列举了期望值和处理说明,这符合原则 2 和 3,即详细说明和结构化输出请求;(iii) 上下文,包括实验部分的 MOF 合成反应条件段落或研究文章中的辅助信息。请注意,第(i)和(ii)部分是固定提示,而第(iii)部分被视为 "输入"。综合提示产生一个单一的问答交互,允许 ChatGPT 生成给定合成条件的摘要作为输出。
Process 2: Synthesis Paragraph Classification. The next question to be answered is, “if ChatGPT is given an entire research article, can it correctly locate the sections of experimental sections?” The objective of Process 2 is to accept an entire research paper as input and selectively forward paragraphs containing chemical experiment details to the next assistant for summarization. However, locating the experimental synthesis section within a research paper is a complex task, as simple techniques such as keyword searches often prove insufficient. For instance, the synthesis of MOFs may be embedded within the supporting information or combined with organic linker synthesis. In earlier publications, synthesis information might appear as a footnote. Furthermore, different journals or research groups utilize varying section titles, including “Experimental,” “Methods,” “General Methods and Materials,” “Experimental methods,” “Synthesis and Characterization,” “Synthetic Procedures,” “Methods Summary,” and more. Manually enumerating each case is labor-intensive, especially when synthesis paragraphs may be dispersed with non-MOF synthesis, characterization conditions, or instrument details. Even a human might take considerable time to identify the correct section. 过程 2:综合段落分类。接下来要回答的问题是:"如果给 ChatGPT 一整篇研究文章,它能否正确定位实验部分的章节?流程 2 的目标是接受整篇研究论文作为输入,并选择性地将包含化学实验细节的段落转发给下一位助手进行总结。然而,在研究论文中查找实验合成部分是一项复杂的任务,简单的技术(如关键词搜索)往往证明是不够的。例如,MOFs 的合成可能包含在辅助信息中,也可能与有机连接体的合成结合在一起。在早期出版物中,合成信息可能作为脚注出现。此外,不同期刊或研究小组使用不同的章节标题,包括 "实验"、"方法"、"一般方法和材料"、"实验方法"、"合成和表征"、"合成步骤"、"方法摘要 "等。手动枚举每个案例是一项劳动密集型工作,尤其是当合成段落可能分散在非 MOF 合成、表征条件或仪器细节中时。即使是人,也可能需要相当长的时间才能识别出正确的部分。
To address this challenge and enable ChatGPT to accurately discern synthesis details within a lengthy research paper, we draw inspiration from the human process. A chemistry Ph.D. student, when asked to locate the MOF synthesis section in a new research paper, would typically start with the first paragraph and ask themselves if it contains synthesis parameters. They would then draw upon prior knowledge from previously read papers to determine if the section is experimental. This process is repeated paragraph by paragraph until the end of the supporting information is reached, with no guarantee that additional synthesis details will not be encountered later. To train ChatGPT similarly, we prompt it to read paper sections incrementally, focusing on one or two paragraphs at a time. Using a few-shot prompt strategy, we provided ChatGPT with a couple of example cases of both synthesis and non-synthesis paragraphs and asked it to classify the sections it reads as either “Yes” (synthesis paragraph) or “No” (non-synthesis paragraph). The ChatGPT Chemistry Assistant would then continue processing the research paper section by section, passing only the paragraphs labeled as “Yes” to the following assistant for summarization. 为了应对这一挑战,使 ChatGPT 能够在冗长的研究论文中准确辨别合成细节,我们从人类的研究过程中汲取了灵感。一名化学博士生被要求在一篇新的研究论文中找到 MOF 合成部分时,通常会从第一段开始,询问自己该段是否包含合成参数。然后,他们会利用以前阅读过的论文中的知识来确定该部分是否是实验性的。这个过程会逐段重复,直到辅助信息结束,但不能保证以后不会遇到其他的合成细节。为了对 ChatGPT 进行类似的训练,我们提示它逐步阅读论文章节,每次只读一到两段。我们采用少量提示的策略,向 ChatGPT 提供了几个合成段落和非合成段落的示例,并要求它将所阅读的部分归类为 "是"(合成段落)或 "否"(非合成段落)。然后,ChatGPT 化学助手将继续逐节处理研究论文,只将标为 "是 "的段落交给下一位助手进行总结。
This few-shot prompt strategy is more convenient than traditional approaches, which require researchers to manually identify and label a large number of paragraphs as “Synthesis Paragraphs” and train their models accordingly. In fact, ChatGPT can even perform such classification using a zero-shot prompt strategy with detailed descriptions of what a “Synthesis Paragraph” should look like and contain. However, we have found that providing four or five short examples in a few-shot prompt strategy enables ChatGPT to identify the features of synthesis paragraphs more effectively, streamlining the classification process (Supporting Information, Figure S24). 与传统的方法相比,这种寥寥数语的提示策略更为方便,因为传统方法需要研究人员手动识别大量段落并将其标记为 "合成段落",然后再对模型进行相应的训练。事实上,ChatGPT 甚至可以使用零镜头提示策略来进行这种分类,并详细描述 "合成段落 "应该是什么样子和包含哪些内容。然而,我们发现,在少量提示策略中提供四五个简短的例子可以让 ChatGPT 更有效地识别综合段落的特征,从而简化分类过程(佐证信息,图 S24)。
The finalized prompt for Process 2 comprises three parts: (i) a request for ChatGPT to determine whether the provided context includes a comprehensive MOF synthesis, answering only with “Yes” or “No”; (ii) some example contexts labeled as “Yes” and other labeled as “No”; (iii) the context to be classified, consisting of one or more research article paragraphs. Similar to Process 1’s prompt, parts (i) and (ii) are fixed, while part (iii) is replaced with independent sections from the paper to be classified. The entire research article is parsed into sections of 100-500 words, which are iteratively incorporated into the prompt and sent separately to ChatGPT for a “Yes” or “No” response. Each prompt represents a one-time conversation, and ChatGPT cannot view answers from previous prompts, preventing potential bias in its decision-making for the current prompt. 流程 2 的最终提示包括三个部分:(i) 要求 ChatGPT 确定所提供的上下文是否包含全面的 MOF 综述,只回答 "是 "或 "否";(ii) 一些标注为 "是 "的示例上下文和其他标注为 "否 "的示例上下文;(iii) 要分类的上下文,由一个或多个研究文章段落组成。与流程 1 的提示类似,第(i)和(ii)部分是固定的,而第(iii)部分则由要分类的论文中的独立段落取代。整篇研究文章被解析为 100-500 字的段落,这些段落被反复纳入提示中,并分别发送给 ChatGPT,以获得 "是 "或 "否 "的回复。每个提示都是一次性对话,ChatGPT 无法查看以前提示的答案,从而避免了对当前提示的决策可能产生的偏差。
Process 3: Text Embeddings for Search and Filtering. Text embeddings are high-dimensional vector representations of text that capture semantic information, enabling quantification of the relatedness of textual content. ^(22,23){ }^{22,23} The distance between these vectors in the embedded space correlates with the semantic similarity between corresponding text strings, with smaller distances indicating greater relatedness. ^(24,25){ }^{24,25} While Process 2 can automatically read and summarize papers, it must evaluate every section to identify synthesis paragraphs. To expedite this process, we developed Process 3 , which filters sections least likely to contain synthesis parameters using OpenAI embeddings before exposing the article to classification assistant in Process 2. To achieve this, we employed a two-step approach to construct Process 3: first, parsing all papers and converting each segment into embeddings; and second, calculating and ranking the similarity scores of each segment based on their relevance to a predefined prompt encapsulating synthesis parameter. 流程 3:用于搜索和筛选的文本嵌入。文本嵌入是文本的高维向量表示,可捕获语义信息,从而量化文本内容的相关性。 ^(22,23){ }^{22,23} 嵌入空间中这些向量之间的距离与相应文本字符串之间的语义相似性相关,距离越小表示相关性越高。 ^(24,25){ }^{24,25} 虽然流程 2 可以自动阅读和总结论文,但它必须评估每个部分,以识别综合段落。为了加快这一过程,我们开发了流程 3,该流程使用 OpenAI 嵌入过滤最不可能包含合成参数的部分,然后再将文章提供给流程 2 中的分类助手。为此,我们采用两步法构建流程 3:首先,解析所有论文并将每个段落转换为嵌入式;其次,根据每个段落与预定义的包含合成参数的提示的相关性,计算每个段落的相似性分数并进行排序。
Figure 3. Two-dimensional visualization of 18,248 text segment embeddings, with each point representing a text segment from the research articles selected. Color coding denotes thematic categories: red for “synthesis”, green for “gas sorption”, yellow for “literature reference”, blue for “crystallographic data”, purple for “structural analysis”, orange for “characterization”, and grey for other text segments not emphasized in this study. 图 3.18,248 个文本片段嵌入的二维可视化图,每个点代表所选研究文章中的一个文本片段。彩色编码表示主题类别:红色表示 "合成",绿色表示 "气体吸附",黄色表示 "文献参考",蓝色表示 "晶体学数据",紫色表示 "结构分析",橙色表示 "表征",灰色表示本研究未强调的其他文本片段。
In particular, we partitioned the 228 research articles into 18,248 individual text segments (Supporting Information, Figure S30-S32). Each segment was converted into a 1536-dimensional text embedding using OpenAI’s text-embedding-ada-002, a simple but efficient model for this process (Supporting Information, Figure S33-S35). These vectors were stored for future use. To identify segments 具体而言,我们将 228 篇研究文章划分为 18,248 个单独的文本片段(佐证资料,图 S30-S32)。我们使用 OpenAI 的 text-embedding-ada-002 将每个片段转换为 1536 维的文本嵌入,这是一个简单而高效的模型(佐证资料,图 S33-S35)。这些向量被储存起来,以备将来使用。识别片段
most and least likely to contain synthesis parameters, we employed interactive prompt refinement strategy (Supporting Information, Section S2.4), consulting with ChatGPT to optimize the prompt. The prompt used in Process 3, unlike previous prompts, served as a text segment for search and similarity comparison rather than instructing ChatGPT (Supporting Information, Figure S25). Next, the embeddings of all 18,248 text segments were compared with the prompt’s embedding, and a relevance score was assigned to each segment based on the cosine similarity between the two embeddings. Highly relevant segments were passed on to classification assistant for further processing, while low similarity segments were filtered out (Figure 1). 我们采用了交互式提示改进策略(佐证资料,第 S2.4 节),与 ChatGPT 协商以优化提示。与之前的提示不同,流程 3 中使用的提示是作为搜索和相似性比较的文本片段,而不是指示 ChatGPT(佐证信息,图 S25)。接下来,将所有 18248 个文本片段的嵌入与提示的嵌入进行比较,并根据两个嵌入之间的余弦相似度为每个片段分配相关性分数。相关性高的片段会交给分类助手进一步处理,而相似性低的片段则会被过滤掉(图 1)。
To evaluate the effectiveness of this approach, we conducted a visual exploration of our embedding data (Figure 3). By reducing the vectors’ dimensionality, we observed distinct clusters corresponding to different topics. Notably, we identified distinct clusters related to topics like “gas sorption”, “literature reference”, “characterization”, “structural analysis” and “crystallographic data”, which were separate from the “synthesis” cluster. This observation strongly supports the efficiency of our embedding-based filtering strategy. However, this strategy, while effective at filtering out less relevant text and passing segments of mid to high relevance to the subsequent classification assistant, cannot directly search for synthesis paragraphs to feed to the summarization assistant, thus bypassing the classification assistant. In other words, the searching-to-classifying-to-summarizing pipeline cannot be simplified to a searching-to-summarizing pathway due to the inherent search limitations of the embeddings. As shown in Figure 3, embeddings alone may not accurately identify all relevant “synthesis” sections, particularly when they contain additional information such as characterization and sorption data. The presence of these elements in a synthesis section can reduce its similarity score and its proximity to the center of the “synthesis” cluster. Points between the “synthesis” and “characterization” or “crystallographic data” clusters may not have the highest similarity scores and could be missed. However, by filtering only the lowest scores, mid-relevance points are retained and passed to the classification assistant, which can more accurately classify ambiguous content. 为了评估这种方法的有效性,我们对嵌入数据进行了可视化探索(图 3)。通过降低向量的维度,我们观察到了与不同主题相对应的不同聚类。值得注意的是,我们发现了与 "气体吸附"、"文献参考"、"表征"、"结构分析 "和 "晶体学数据 "等主题相关的不同聚类,它们与 "合成 "聚类是分开的。这一观察结果有力地证明了我们基于嵌入的过滤策略的效率。不过,这种策略虽然能有效地过滤掉相关性较低的文本,并将相关性中高的段落传递给后续的分类助手,但却不能直接搜索合成段落以提供给摘要助手,从而绕过分类助手。换句话说,由于嵌入式固有的搜索限制,从搜索到分类再到摘要的流程无法简化为从搜索到摘要的路径。如图 3 所示,仅靠嵌入式可能无法准确识别所有相关的 "合成 "部分,尤其是当这些部分包含表征和吸附数据等附加信息时。合成部分中这些元素的存在会降低其相似性得分和与 "合成 "群组中心的接近程度。在 "合成 "和 "表征 "或 "结晶数据 "聚类之间的点可能没有最高的相似性得分,因此可能会被忽略。不过,通过只过滤最低分数,中等相关性的点就会被保留下来,并传递给分类助手,从而能更准确地对模糊内容进行分类。
ChatGPT-Assisted Python Code Generation and Data Processing. Rather than relying on singular, time-consuming conversations with web-based ChatGPT to process textual data from a multitude of research articles, OpenAI’s GPT-3.5-turbo, which is identical to the one underpinning the ChatGPT product, facilitates a more efficient approach, as it incorporates an Application Programming Interface (API), enabling batch processing of text from an extensive array of articles. This is achieved through iterative context and prompt submissions to ChatGPT, followed by the collection of its responses (Supporting Information, Section S3.4). ChatGPT 辅助 Python 代码生成和数据处理。OpenAI的GPT-3.5-turbo与ChatGPT产品的基础相同,它不依赖于与基于网络的ChatGPT进行耗时的单一对话来处理来自大量研究文章的文本数据,而是采用了一种更高效的方法,因为它集成了应用编程接口(API),能够批量处理来自大量文章的文本。这是通过迭代上下文并及时提交给 ChatGPT,然后收集其回复来实现的(佐证资料,第 S3.4 节)。
Specifically, our approach involves having ChatGPT to create Python scripts for parsing academic papers, generating prompts, executing text processing through Processes 1, 2, and 3, and collating the responses into cleaned, tabulated data (Supporting Information, Figures S28-S39). Traditionally, such a process could necessitate substantial coding experience and be time-consuming. However, we leverage the code generation capabilities of ChatGPT to establish Processes 1, 2, and 3 for batch processing using OpenAI’s APIs, namely, gpt-3.5-turbo and text-embedding-ada-002. In essence, researchers only need to express their requirements for each model in natural language - specifying inputs and desired outputs - and ChatGPT will generate the appropriate Python code (Supporting Information, Section S3.5). This code can be copied, pasted, and executed in the relevant environment. Notably, even in the event of an error, ChatGPT, especially when equipped with the GPT-4 model, can assist in code revision. We note that while coding assistance from ChatGPT may not be necessary for those with coding experience, it does provide an accessible platform for individuals lacking such experience to engage in the process. Given the simplicity and straightforwardness of the logic involved in Processes 1, 2, and 3, ChatGPT-generated Python code exhibits minimal errors and significantly accelerates the programming process. 具体来说,我们的方法是让 ChatGPT 创建 Python 脚本,用于解析学术论文、生成提示、通过流程 1、2 和 3 执行文本处理,并将回复整理成经过清理的表格数据(佐证信息,图 S28-S39)。传统上,这样的过程可能需要丰富的编码经验,而且非常耗时。然而,我们利用 ChatGPT 的代码生成功能,使用 OpenAI 的应用程序接口(即 gpt-3.5-turbo 和 text-embedding-ada-002)建立了用于批量处理的流程 1、2 和 3。本质上,研究人员只需用自然语言表达他们对每个模型的要求--指定输入和期望输出--ChatGPT 就会生成相应的 Python 代码(佐证信息,第 S3.5 节)。这些代码可以复制、粘贴并在相关环境中运行。值得注意的是,即使出现错误,ChatGPT(尤其是配备 GPT-4 模型时)也能协助修改代码。我们注意到,虽然对于那些有编码经验的人来说,可能不需要 ChatGPT 的编码协助,但它确实为缺乏编码经验的人提供了一个参与编码过程的平台。鉴于流程 1、2 和 3 所涉及的逻辑简单明了,ChatGPT 生成的 Python 代码错误极少,大大加快了编程过程。
Figure 4. Schematic representation of the diverse data unification tasks managed either directly by ChatGPT or through Python code written by ChatGPT. The figure distinguishes between simpler tasks handled directly by ChatGPT, such as standardizing chemical notation, and converting time and temperature units in reactions. More complex tasks, such as matching linker abbreviations to their full names, converting these to SMILES codes, classifying product morphology, and calculating metal amounts, are accomplished via Python code generated by ChatGPT. The Python logo displayed is credited to PSF. 图 4.由 ChatGPT 直接管理或通过 ChatGPT 编写的 Python 代码管理的各种数据统一任务示意图。图中区分了由 ChatGPT 直接处理的简单任务,如标准化化学符号、转换反应中的时间和温度单位。而更复杂的任务,如将连接体缩写与全名匹配、转换为 SMILES 代码、产品形态分类和计算金属量等,则通过 ChatGPT 生成的 Python 代码完成。显示的 Python 徽标归 PSF 所有。
ChatGPT also aids in entity resolution post text mining (Figure 4). This step involves standardizing data formats including units, notation, and compound representations. For each task, we designed a specific prompt for ChatGPT to handle data directly or a specialized Python code generated by ChatGPT. More details on designing prompts to handle different synthesis parameters are available in a cookbook style in Supporting Information, Section S4. In simpler cases, ChatGPT can directly handle conversions such as time and reaction temperature. For complex calculations, we take advantage of ChatGPT in generating Python code. For instance, to calculate the molar mass of each metal source, ChatGPT can generate the appropriate Python code based on the given compound formulas. For harmonizing notation of compound pairs or mixtures, ChatGPT can standardize different notations to a unified format, facilitating subsequent data processing. ChatGPT 还有助于文本挖掘后的实体解析(图 4)。这一步涉及数据格式的标准化,包括单位、符号和复合表示法。对于每项任务,我们都为 ChatGPT 设计了特定的提示,以便直接处理数据或由 ChatGPT 生成专门的 Python 代码。关于如何设计处理不同合成参数的提示的更多细节,请参阅 "辅助信息 "第 S4 节中的 "烹饪手册"。在较简单的情况下,ChatGPT 可以直接处理时间和反应温度等换算。对于复杂的计算,我们利用 ChatGPT 生成 Python 代码。例如,要计算每个金属源的摩尔质量,ChatGPT 可以根据给定的化合物公式生成相应的 Python 代码。为了统一化合物对或混合物的符号,ChatGPT 可以将不同的符号标准化为统一的格式,方便后续的数据处理。
To standardize compound representations, we employ the Simplified Molecular Input Line Entry System (SMILES). We faced challenges with some synthesis procedures, where only abbreviations were provided. To overcome this, we designed prompts for ChatGPT to search for the full names of given abbreviations. We then created a dictionary linking each unique PubChem Compound identification number (CID) or Chemical Abstracts Service (CAS) number to multiple full names and abbreviations and generated the corresponding SMILES code. We note that for complicated linkers or those with missing full 为了使化合物表示标准化,我们采用了简化分子输入行输入系统(SMILES)。我们在一些只提供缩写的合成程序中遇到了挑战。为了解决这个问题,我们为 ChatGPT 设计了提示,以便搜索给定缩写的全名。然后,我们创建了一个字典,将每个唯一的 PubChem 化合物识别码(CID)或化学文摘社(CAS)编号与多个全名和缩写联系起来,并生成相应的 SMILES 代码。我们注意到,对于复杂的链接器或那些缺少全
names, inappropriate nomenclature or non-existent CID or CAS numbers, ^(26-33){ }^{26-33} manual intervention was occasionally necessary to generate SMILES codes for such chemicals (Supporting Information, Figure S50-S54). However, most straightforward cases were handled efficiently by ChatGPT’s generated Python code. As a result, we achieved uniformly formatted data, ready for subsequent evaluation and utilization. ^(26-33){ }^{26-33} 有时需要人工干预才能为这些化学品生成 SMILES 代码(佐证资料,图 S50-S54)。不过,ChatGPT 生成的 Python 代码可以高效地处理大多数简单的情况。因此,我们获得了统一格式的数据,为后续评估和利用做好了准备。
RESULTS AND DISCUSSION 结果与讨论
Evaluation of Text Mining Performance. We began our performance analysis by first evaluating the execution time consumption for each process (Figure 5a). As previously outlined, the ChatGPT assistant in Process 1 exclusively accepts preselected experimental sections for summarization. Consequently, Process 1 requires human intervention for the identification and extraction of the synthesis section from a paper to operate autonomously. As illustrated in Figure 5a, this process can vary in duration based on the length and structure of the document and its supporting information file. In our study, the complete selection procedure spanned 12 hours for 228 papers, averaging around 2.5 minutes per paper. This period must be considered as the requisite time for Process 1’s execution. For summarization tasks, ChatGPT Chemistry Assistant demonstrated an impressive performance, taking an average of 13 seconds per paper. This is noteworthy considering that certain papers in the dataset contained more than 20 MOF compounds, and human summarization in the traditional way without AI might consume a significantly larger duration. By accelerating the summarization process, we alleviate the burden of repetitive work and free up valuable time for researchers. 文本挖掘性能评估。我们首先评估了每个进程的执行时间消耗(图 5a),然后开始了性能分析。如前所述,流程 1 中的 ChatGPT 助手只接受预选的实验部分进行总结。因此,流程 1 需要人工干预从论文中识别和提取综述部分才能自主运行。如图 5a 所示,这一过程的持续时间会根据文件及其辅助信息文件的长度和结构而有所不同。在我们的研究中,228 篇论文的整个筛选过程耗时 12 小时,平均每篇论文耗时约 2.5 分钟。这段时间必须被视为流程 1 执行的必要时间。在摘要任务方面,ChatGPT 化学助手表现出色,平均每篇论文只需 13 秒。值得注意的是,考虑到数据集中的某些论文包含 20 多种 MOF 化合物,如果不使用人工智能,以传统方式进行人工总结可能会耗时更长。通过加速摘要过程,我们减轻了重复性工作的负担,为研究人员腾出了宝贵的时间。
In contrast, Process 2 operates in a fully automated manner, integrating the classification and result-passing processes to the next assistant for summarization. There is no doubt that it outperforms the manual identification and summarization combination of Process 1 in terms of speed due to ChatGPT’s superior text processing capabilities. Lastly, Process 3, as anticipated, is the fastest due to the incorporation of section filtering powered by embedding, reducing the classification tasks, and subsequently enhancing the speed. The efficiency of Process 3 can be further optimized by storing the embeddings locally as a CSV file during the first reading of a paper, which reduces the processing time by 15-20 seconds (28%-37% faster) in subsequent readings. This provides a convenient solution in scenarios necessitating repeated readings for comparison or extraction of diverse information. 相比之下,流程 2 采用全自动方式运行,将分类和结果传递给下一个助手进行总结。毫无疑问,由于 ChatGPT 的超强文本处理能力,它在速度上超过了流程 1 的人工识别和总结组合。最后,流程 3 正如预期的那样是速度最快的,因为它采用了嵌入式分段过滤技术,减少了分类任务,从而提高了速度。流程 3 的效率还可以进一步优化,即在第一次阅读论文时,将嵌入内容以 CSV 文件的形式存储在本地,从而在后续阅读中将处理时间缩短 15-20 秒(快 28%-37%)。这为需要重复阅读以进行比较或提取不同信息的情况提供了便捷的解决方案。
To evaluate the accuracy of the three processes in text mining, instead of sampling, we conducted a comprehensive analysis of the entire result dataset. In particular, we manually wrote down the ground truth for all 11 parameters for approximately 800 compounds reported in all papers across the three processes, which was used to judge the text mining output. This involved the grading of nearly 26,000 synthesis parameters by us. Each synthesis parameter was assigned one of three labels: True Positive (TP, correct identification of synthesis parameters by ChatGPT), False Positive (FP, incorrect assignment of a compound to the wrong synthesis parameter or extraction of irrelevant information), and False Negative (FN, failure of ChatGPT to extract some synthesis parameters). Notably, a special rule for assigning labels on modulators, most of which were anticipated to be acid and base, was introduced to accommodate the neutral solvents in a mixed solvent system, due to the inherent challenges in distinguishing between co-solvents and modulators. For instance, in a DMF: H_(2)O=10:1\mathrm{H}_{2} \mathrm{O}=10: 1 solution, the role of H_(2)O\mathrm{H}_{2} \mathrm{O} becomes ambiguous. In such situations, we labeled the result as a TP if H_(2)O\mathrm{H}_{2} \mathrm{O} was considered either as a solvent or modulator. However, we labeled it as FP or FN if it appeared or was absent in both solvent and modulator columns. Nevertheless, acids and bases were still classified as modulators, and if labeled as solvents, they were graded as FP. 为了评估这三个过程在文本挖掘中的准确性,我们没有进行抽样,而是对整个结果数据集进行了综合分析。特别是,我们手工写下了三种流程中所有论文中报告的约 800 种化合物的所有 11 个参数的基本真实值,并以此来判断文本挖掘的输出结果。这涉及到我们对近 26,000 个合成参数的分级。每个合成参数都有三个标签:真阳性(TP,ChatGPT 对合成参数的正确识别)、假阳性(FP,将化合物分配给错误的合成参数或提取无关信息)和假阴性(FN,ChatGPT 未能提取某些合成参数)。值得注意的是,由于在区分共溶剂和调制剂方面存在固有的挑战,因此引入了一条特殊的规则来分配调制剂的标签,以适应混合溶剂体系中的中性溶剂。例如,在 DMF: H_(2)O=10:1\mathrm{H}_{2} \mathrm{O}=10: 1 溶液中, H_(2)O\mathrm{H}_{2} \mathrm{O} 的作用就变得模糊不清。在这种情况下,如果将 H_(2)O\mathrm{H}_{2} \mathrm{O} 视为溶剂或调节剂,我们就将结果标记为 TP。但是,如果 H_(2)O\mathrm{H}_{2} \mathrm{O} 在溶剂和调节剂列中出现或不出现,我们则将其标记为 FP 或 FN。不过,酸和碱仍被归类为调节剂,如果被标记为溶剂,则被评为 FP。
The distribution of TP labels counted for each of the 11 synthesis parameters across all papers is presented in Figure 5b. It should be noted that not all MOF synthesis conditions necessitate reporting of all 11 parameters; for instance, some syntheses do not involve modulators, and in such cases, we asked ChatGPT to assign an “N/A” to the corresponding column and its amount. Subsequently, we computed the precision, recall, and F1 scores for each parameter across all three processes, illustrated in Figure 5c and d. All processes demonstrated commendable performance in identifying compound names, metal source names, linker names, modulator names, and solvent names. However, they encountered difficulties in accurately determining the quantities or volumes of the chemicals involved. Meanwhile, parameters like reaction temperature and reaction time, which usually have fixed patterns (e.g., units such as ^(@)C{ }^{\circ} \mathrm{C}, hours), were accurately identified by all processes, resulting in high recall, precision, and F1 scores. The lowest scores were associated with the recall of solvent volumes. This is because ChatGPT often captured only one volume in mixed solvent systems instead of multiple volumes. Moreover, in some literatures, the stock solution was used for dissolving metals and linkers, and in principle these volumes should be added to the total volume and unfortunately, ChatGPT lacked the ability to report the volume for each portion in these cases. 图 5b 显示了所有论文中 11 个合成参数的 TP 标签分布情况。需要注意的是,并非所有 MOF 合成条件都需要报告所有 11 个参数;例如,有些合成不涉及调制剂,在这种情况下,我们要求 ChatGPT 在相应列及其数量上标注 "N/A"。随后,我们计算了所有三个流程中每个参数的精确度、召回率和 F1 分数,如图 5c 和 d 所示。所有流程在识别化合物名称、金属源名称、连接剂名称、调节剂名称和溶剂名称方面都表现出色,值得称赞。不过,它们在准确确定所涉化学品的数量或体积方面遇到了困难。同时,反应温度和反应时间等参数通常有固定的模式(例如,单位如 ^(@)C{ }^{\circ} \mathrm{C} , 小时),但所有过程都能准确识别,因此召回率、精确度和 F1 分数都很高。得分最低的是溶剂卷的召回率。这是因为 ChatGPT 在混合溶剂系统中通常只捕捉到一个体积,而不是多个体积。此外,在一些文献中,储备溶液被用于溶解金属和连接剂,原则上这些体积应该加到总体积中,但遗憾的是,在这些情况下,ChatGPT 缺乏报告各部分体积的能力。
Nevertheless, it should be noted that our instructions did not intend for ChatGPT to perform arithmetic operations in these cases, as the mathematical reasoning of the large languages models is limited, and the diminishment of the recall scores is unavoidable. In other instances, only one exemplary synthesis condition for MOF was reported, and then for similar MOFs, the paper would only state “following similar procedures”. In such cases, while occasionally ChatGPT could duplicate conditions, most of the time it recognized solvents, reaction temperature, and reaction time as “N/A”, which was graded as a FN, thus reducing the recall scores across all processes. 尽管如此,应该指出的是,我们的说明并不打算让 ChatGPT 在这些情况下进行算术运算,因为大型语言模型的数学推理能力有限,召回分数的降低是不可避免的。在其他情况下,论文只报告了 MOF 的一个示范合成条件,然后对于类似的 MOF,论文只说明 "遵循类似的程序"。在这种情况下,虽然 ChatGPT 偶尔会重复一些条件,但大多数情况下都会将溶剂、反应温度和反应时间识别为 "不适用",并将其评为 FN,从而降低了所有过程的回忆分数。
Despite these irregularities, which were primarily attributable to informal synthesis reporting styles, the precision, recall, and F1 scores for all three processes remained impressively high, with less than 9.8%9.8 \% of NP and 0 cases of hallucination detected by human evaluators. We further calculated the average and standard deviation of each process on precision, recall, 尽管出现了这些主要归因于非正式综合报告风格的不规范现象,但所有三个流程的精确度、召回率和 F1 分数仍然很高,人类评估人员检测到的 NP 和幻觉案例分别少于 9.8%9.8 \% 和 0 个。我们进一步计算了每个流程在精确度、召回率和 F1 分数上的平均值和标准偏差、
and F1 scores, respectively, as shown in Figure 5c. By considering and averaging precision, recall, and F1 scores across the 11 parameters, given their equal importance in evaluating overall performance of the process, we found that all three processes achieved impressive precision (> 95%), recall (> 90%), and F1 scores (> 92%). 和 F1 分数,如图 5c 所示。考虑到精确度、召回率和 F1 分数在评估流程整体性能方面的同等重要性,我们对 11 个参数的精确度、召回率和 F1 分数进行了考虑和平均,发现所有三个流程都达到了令人印象深刻的精确度(> 95%)、召回率(> 90%)和 F1 分数(> 92%)。
The performance metrics of Process 1 substantiated our hypothesis that ChatGPT excels in summarization tasks. Upon comparing the performance of Processes 2 and 3 - both of which are fully automated paper-reading systems capable of generating datasets from PDFs with a single click - we observed that Process 2, by meticulously examining every paragraph across all papers, ensures high precision and recall by circumventing the omission of any synthesis paragraphs or extraction of incorrect data from irrelevant sections. Conversely, while Process 3’s accuracy is marginally lower than that of Process 2, it provides a significant reduction in processing time, thus enabling faster paper reading while maintaining acceptable accuracy, courtesy of its useful filtration process. 流程 1 的性能指标证实了我们的假设,即 ChatGPT 擅长摘要任务。流程 2 和流程 3 都是全自动阅卷系统,只需点击一下就能从 PDF 文档中生成数据集,通过比较这两个系统的性能,我们发现流程 2 通过仔细检查所有论文的每个段落,避免了遗漏任何综合段落或从无关部分提取错误数据的情况,从而确保了高精确度和高召回率。相反,虽然流程 3 的准确率略低于流程 2,但由于其过滤过程非常有用,因此大大缩短了处理时间,从而在保持可接受准确率的同时加快了论文阅读速度。
To the best of our knowledge, these scores surpass most of other models in text mining in the MOF-related domain. ^(11,13,14,){ }^{11,13,14,}^(34,35){ }^{34,35} Notably, the entire workflow, established via code and programs generated from ChatGPT, can be assembled by one or two researchers with only basic coding proficiency in a period as brief as a week, whilst maintaining remarkable performance. The successful establishment of this innovative ChatGPT Chemistry Assistant workflow including the ChemPrompt 据我们所知,这些分数超过了 MOF 相关领域文本挖掘中的大多数其他模型。 ^(11,13,14,){ }^{11,13,14,}^(34,35){ }^{34,35} 值得注意的是,整个工作流程是通过 ChatGPT 生成的代码和程序建立的,只需一两个具备基本编码能力的研究人员就能在短短一周内完成组装,同时保持出色的性能。成功建立这一创新的 ChatGPT 化学助手工作流程,包括 ChemPrompt
Engineering system, which harnesses AI for processing chemistry-related tasks, promises to significantly streamline scientific research. It liberates researchers from routine laborious work, enabling them to concentrate on more focused and innovative tasks. Consequently, we anticipate that this approach will catalyze potentially revolutionary shifts in research practices through the integration of AI-powered tools. 利用人工智能处理化学相关任务的工程系统有望大大简化科学研究。它将研究人员从日常繁重的工作中解放出来,使他们能够专注于更有针对性和创新性的任务。因此,我们预计这种方法将通过整合人工智能驱动的工具,促进研究实践发生潜在的革命性转变。
Prediction Modeling of MOF Synthesis Outcomes. Given the large quantity of synthesis conditions obtained through our ChatGPT-based text mining programs, our aim is to utilize this data to investigate, comprehend, and predict the crystallization conditions of a material of interest. Specifically, our goal was to determine the crystalline state based on synthesis conditions - we seek to discern which synthesis conditions will yield MOFs in the form of single crystals, and which conditions are likely to yield non-single crystal forms of MOFs, such as microcrystalline powder or solids. MOF 合成结果的预测建模。鉴于我们通过基于 ChatGPT 的文本挖掘程序获得了大量合成条件,我们的目标是利用这些数据来调查、理解和预测相关材料的结晶条件。具体来说,我们的目标是根据合成条件确定结晶状态--我们试图辨别哪些合成条件会产生单晶形式的 MOF,哪些条件可能会产生非单晶形式的 MOF,如微晶粉末或固体。
With this objective in mind, we identified the need for a label signifying the crystalline state of the resulting MOF for each synthesis condition, thereby forming a target variable for prediction. Fortunately, nearly all research papers in the MOF field consistently include the description of crystal morphological characteristics such as the color and shape of as-synthesized MOFs (e.g. yellow needle crystals, red solid, sky-blue powdered product). This facilitated in re-running our processes with the same synthesis paragraphs as input and modifying the prompt to instruct ChatGPT to extract the description of reaction products, summarizing and categorizing them (Supporting Information, Figure S23 and Figure S47). The final label for each condition will either be Single-Crystal (SC) or Polycrystalline ( P ), and our objective is to construct a machine learning model capable of accurately predicting whether a given condition will yield SC or P. Furthermore, we recognized that the crystallization process is intrinsically linked with the synthesis method (e.g., vapor diffusion, solvothermal, conventional, microwaveassisted method). Thus, we incorporated an additional synthesis variable, “Synthesis Method”, to categorize each synthesis condition into four distinct groups. Extracting the reaction type variable for each synthesis condition can be achieved using the same input but a different few-shot prompt to guide our ChatGPT-based assistants for classification and summarization, subsequently merging this data with the existing dataset. This process parallels the method for obtaining MOF crystalline state outcomes, and both processes can be unified in a single prompt. Moreover, as the name of the MOF is a user-defined term and does not influence the synthesis result, we have excluded this variable for the purposes of prediction modeling. 有鉴于此,我们认为有必要为每种合成条件下生成的 MOF 的结晶状态贴上标签,从而形成预测的目标变量。幸运的是,几乎所有 MOF 领域的研究论文都包含对晶体形态特征的描述,如合成 MOF 的颜色和形状(如黄色针状晶体、红色固体、天蓝色粉末状产品)。这有助于我们以相同的合成段落作为输入重新运行流程,并修改提示,指示 ChatGPT 提取反应产物的描述,对其进行总结和分类(证明资料,图 S23 和图 S47)。每个条件的最终标签将是单晶(SC)或多晶(P),我们的目标是构建一个机器学习模型,能够准确预测给定条件下产生的是单晶还是多晶。此外,我们认识到结晶过程与合成方法(如蒸汽扩散法、溶热法、传统法、微波辅助法)有着内在联系。因此,我们加入了一个额外的合成变量 "合成方法",将每种合成条件分为四组。提取每个合成条件的反应类型变量时,可以使用相同的输入但不同的寥寥数语提示,以指导我们基于 ChatGPT 的助手进行分类和汇总,随后将这些数据与现有数据集合并。这一过程与获取 MOF 结晶状态结果的方法类似,两个过程都可以统一到一个提示中。 此外,由于 MOF 名称是用户定义的术语,不会影响合成结果,因此我们在预测建模时排除了这一变量。
After unifying and organizing the data to incorporate 11 synthesis parameter variables and 1 synthesis outcome target variable, we designed respective descriptors for each synthesis parameter capable of robustly representing the diversity and complexity in the synthesis conditions and facilitating the transformation of these variables into features suitable for machine learning algorithms. A total of six sets of chemical descriptors were formulated for the metal node(s), linker(s), modulator(s), solvent(s), their respective molar ratios, and the reaction condition(s) - aligning with the extracted synthesis parameters (Supporting Information, Section S5). ^(36-40){ }^{36-40} These MOF-tailored, hierarchical descriptors have been previously shown to perform well in various prediction tasks. ^(13,41)To{ }^{13,41} \mathrm{To} distill the most pertinent features and streamline the model, a recursive feature elimination (REF) with 5-fold cross-validation was performed on 80%80 \% of the total data. The rest was preserved as a held out set unseen during the learning process for independent evaluation (Figure 6a). This down-selection process reduced the number of descriptors from 70 to 33, thereby preserving comparative model performance on the held out set while removing the non-informative features that can lead to overfitting (Supporting Information Section S5). 在将 11 个合成参数变量和 1 个合成结果目标变量统一和组织数据后,我们为每个合成参数设计了各自的描述符,这些描述符能够稳健地表示合成条件的多样性和复杂性,并有助于将这些变量转化为适合机器学习算法的特征。针对金属节点、连接剂、调节剂、溶剂、它们各自的摩尔比和反应条件共制定了六套化学描述符,与提取的合成参数保持一致(辅助信息,第 S5 节)。 ^(36-40){ }^{36-40} 这些MOF定制的分层描述符先前已被证明在各种预测任务中表现出色。 ^(13,41)To{ }^{13,41} \mathrm{To} 提炼出最相关的特征并精简模型,对 80%80 \% 总数据进行了5倍交叉验证的递归特征消除(REF)。其余数据作为学习过程中未见的保留集,用于独立评估(图 6a)。这种向下选择过程将描述符的数量从 70 个减少到 33 个,从而保留了保留集上的比较模型性能,同时删除了可能导致过度拟合的非信息特征(佐证信息第 S5 节)。
Subsequently, we constructed a machine learning model to train for synthesis conditions to predict if a given synthesis condition can yield single crystals. A binary classifier was trained based on a random forest model (Supporting Information, Section S5). The random forest (RF) is an ensemble of decision trees, whose independent predictions are max voted in the classification case to arrive at the more precise prediction. ^(42){ }^{42} In our study, we trained an RF classifier to predict crystalline states from synthesis parameters, given its ability to work with both continuous and categorical data, its advantage in ranking important features towards prediction, its robustness against noisy data, ^(43){ }^{43} and its demonstrated efficacy in various chemistry applications such as chemical property estimation, ^(44-47){ }^{44-47} spectroscopic analysis, ^(48-51){ }^{48-51} and material characterization and discovery. ^(52){ }^{52} 随后,我们构建了一个机器学习模型来训练合成条件,以预测给定的合成条件是否能产生单晶。我们根据随机森林模型训练了一个二元分类器(佐证资料,第 S5 节)。随机森林(RF)是决策树的集合,在分类情况下,对其独立预测进行最大投票,以得出更精确的预测。 ^(42){ }^{42} 在我们的研究中,我们训练了一个 RF 分类器来根据合成参数预测结晶状态,这是因为 RF 既能处理连续数据,也能处理分类数据,它在对预测的重要特征进行排序方面具有优势,对噪声数据具有鲁棒性, ^(43){ }^{43} 而且它在各种化学应用(如化学性质估计、 ^(44-47){ }^{44-47} 光谱分析、 ^(48-51){ }^{48-51} 以及材料表征和发现)中的功效已得到证实。 ^(52){ }^{52}
The dimension-reduced data was randomly divided into different training sizes; for each train test split, optimal hyperparameters, in particular, number of tree estimators and minimum samples required for leaf split, were determined with 5 -fold cross validation of the training set. Model performance was gauged in terms of class weighted accuracy, precision, recall, and F1 score over 10 runs on the held out set and test set (Figure 6b and Supporting Information, Figure S64). The model converged to an average accuracy of 87%87 \% and an F1 score of 92%92 \% on the held out set, indicating a reasonable performance in the presence of the imbalanced classification challenge. 降维后的数据被随机分成不同的训练大小;对于每个训练测试分割,通过对训练集进行 5 倍交叉验证来确定最佳超参数,特别是树估计器的数量和叶片分割所需的最小样本。在训练集和测试集上运行 10 次后,以类加权准确率、精确度、召回率和 F1 分数来衡量模型性能(图 6b 和证明资料,图 S64)。该模型在保留集上的平均准确率为 87%87 \% ,F1 分数为 92%92 \% ,这表明该模型在不平衡分类挑战中表现出了合理的性能。
Following the creation of the predictive model, our objective was to apply this model for descriptor analysis to illuminate the factors impacting MOF crystalline outcomes. This aids in discerning which features in the synthesis protocol are more crucial in determining whether a synthesis condition will yield MOF single crystals. Although the random forest model is not inherently interpretable, we probed the relative importance of descriptors used in building the model. One potential measure of a descriptor’s importance is the percent decrease in the model’s accuracy score when values for that descriptor are randomly shuffled and the model is retrained. We found that among the descriptors involved, the top ten most influential descriptors are key in predicting MOF crystallization outcomes (Figure 6c). In fact, these descriptors broadly align with the chemical intuition and our understanding on MOF crystal growth. ^(53,54){ }^{53,54} For example, the descriptors related to stoichiometry of the MOF synthesis, namely the “modulator to metal ratio”, “solvent to metal ratio”, and “linker to metal ratio”, take 建立预测模型后,我们的目标是将该模型用于描述符分析,以阐明影响 MOF 结晶结果的因素。这有助于分辨合成方案中哪些特征对确定合成条件是否会产生 MOF 单晶更为关键。虽然随机森林模型本身并不具有可解释性,但我们还是对用于建立模型的描述符的相对重要性进行了探究。衡量描述符重要性的一个潜在标准是,当该描述符的值被随机洗牌并重新训练模型时,模型准确度得分的下降百分比。我们发现,在所涉及的描述符中,影响最大的前十个描述符是预测 MOF 结晶结果的关键(图 6c)。事实上,这些描述符与化学直觉和我们对 MOF 晶体生长的理解基本一致。 ^(53,54){ }^{53,54} 例如,与MOF合成的化学计量学有关的描述符,即 "调制剂与金属的比例"、"溶剂与金属的比例 "和 "连接剂与金属的比例",在预测MOF结晶结果时占据了重要地位(图6c)。
precedence in the ranking. These descriptors reflect the vital role of precise stoichiometric control in MOF crystal formation, and they directly impact the crystallization process, playing critical roles in determining the quality and morphology of the MOF crystals. 在排序中处于优先地位。这些描述符反映了精确的化学计量控制在 MOF 晶体形成中的重要作用,它们直接影响结晶过程,在决定 MOF 晶体的质量和形态方面起着关键作用。
Following closely is the descriptor “time”, and it highlights the significant role of reaction duration in the crystallization process. Additionally, the “metal valence” descriptor emphasizes the key role of the nature and reactivity of the metal ions used in MOF synthesis. The valence directly influences the secondary building units (SBUs) and the final crystalline state of the MOF. In the meantime, descriptors related to the molecular and the linker can impact the kinetics of the synthesis, influencing the orderliness of crystal growth. Together, this result provides a greater understanding of the crucial factors affecting 紧随其后的是 "时间 "描述符,它强调了反应持续时间在结晶过程中的重要作用。此外,"金属价 "描述符强调了在 MOF 合成中使用的金属离子的性质和反应性的关键作用。价态直接影响次生结构单元(SBU)和 MOF 的最终结晶状态。同时,与分子和连接体有关的描述符会影响合成的动力学,从而影响晶体生长的有序性。总之,这一结果使我们对影响 MOF 的关键因素有了更深入的了解。
the crystallization of MOFs and will aid in the design and optimization of synthesis conditions for the targeted preparation of single-crystal or polycrystalline MOFs (Figure 6d). 这将有助于设计和优化合成条件,从而有针对性地制备单晶或多晶 MOFs(图 6d)。
Interrogating the Synthesis Dataset via a Chatbot. Having utilized text mining techniques to construct a comprehensive MOF Synthesis Dataset, our aim was to leverage this resource to its fullest potential. To enhance data accessibility and aid in the interpretation of its intricate contents, we embarked on a journey to convert this dataset into an interactive and userfriendly dialogue system, which effectively converts the dataset to dialogue. The resulting chatbot is part of the umbrella concept of ChatGPT Chemistry Assistant thus serving as a reliable and fact-based assistant in chemistry, proficient in addressing a broad spectrum of queries pertaining to chemical reactions, in particular MOF synthesis. Unlike typical and more general web-based ChatGPT provided by OpenAI, which may suffer from limitations such as the inability to access the most recent data and a propensity for hallucinatory errors. This chatbot is grounded firmly in the factual data contained within the MOF synthesis dataset from text mining and is engineered to ensure that responses during conversations are based on accurate information and synthesis conditions derived from text mining the literature (Supporting Information, Section S6). 通过聊天机器人询问合成数据集。在利用文本挖掘技术构建了全面的 MOF 合成数据集之后,我们的目标是充分利用这一资源。为了提高数据的可访问性并帮助解释其中错综复杂的内容,我们开始了将该数据集转换为交互式用户友好对话系统的旅程,该系统可有效地将数据集转换为对话。由此产生的聊天机器人是 "ChatGPT 化学助手 "这一总体概念的一部分,它是一个可靠的、基于事实的化学助手,能熟练解决与化学反应,特别是 MOF 合成有关的各种问题。与 OpenAI 提供的典型和更一般的基于网络的 ChatGPT 不同,后者可能存在一些局限性,如无法访问最新数据和容易出现幻觉错误。该聊天机器人以文本挖掘中的 MOF 合成数据集所包含的事实数据为坚实基础,并经过精心设计,以确保对话期间的回复基于准确的信息和从文本挖掘文献中得出的合成条件(佐证资料,第 S6 节)。
In particular, to construct the chemistry chatbot, our initial step was the creation of distinct entries corresponding to each MOF we identified from the text mining, which encompasses a comprehensive array of synthesis parameters, such as the reaction time, temperature, metal, and linker, among others, using the dataset we have. Recognizing the value of bibliographic context, we compiled a list of paper information, such as authors, DOI, and publication years, collated from Web of Science, into each section (Supporting Information, Table S3). Subsequently, we generated embeddings for each of these information cards of different compounds, thereby constructing an embedding dataset (Figure 7). When a user asks a question, if it is the first query, the system first navigates to the embedding dataset to locate the most relevant information card using the question’s embedding, which is based on a similarity score calculation and is similar to the foundation of Process 3 in text mining. The information of the highest-ranking entry is then dispatched to the prompt engineering module of MOF chatbot, guiding it to construct responses centered solely around the given synthesis information. 特别是,为了构建化学聊天机器人,我们的第一步是创建与我们从文本挖掘中确定的每种 MOF 相对应的不同条目,其中包括一系列全面的合成参数,如反应时间、温度、金属和连接剂等。由于认识到文献背景的价值,我们将从 Web of Science 收集到的作者、DOI 和出版年等论文信息编入了每个章节(佐证信息,表 S3)。随后,我们为这些不同化合物的信息卡生成了嵌入,从而构建了一个嵌入数据集(图 7)。当用户提问时,如果是第一次查询,系统首先会导航到嵌入数据集,利用问题的嵌入找到最相关的信息卡,这个过程基于相似度得分计算,类似于文本挖掘中过程 3 的基础。然后,排名最高的条目信息会被发送到 MOF 聊天机器人的提示工程模块,引导其仅围绕给定的综合信息构建回复。
To mitigate the possibility of hallucination, the chatbot is programmed to refrain from addressing queries that fall outside the scope of the dataset. Instead, it encourages the user to rephrase the question (Supporting Information, Figure S69). It’s worth noting that, following the initial query, the chatbot ‘memorizes’ the conversation context by being presented with the context of prior interactions between user and itself. This includes the synthesis context and paper information identified from the initial query, ensuring that the answers to subsequent queries are also based on factual information from the dataset. Consequently, this strategy guarantees that responses to ensuing queries are contextually accurate, being grounded in the facts outlined in the synthesis dataset and corresponding paper information (Figure 7 and Supporting Information, Figures S71-S74). 为了减少出现幻觉的可能性,聊天机器人在编程时会避免回答数据集范围之外的问题。相反,它鼓励用户重新措辞提问(佐证资料,图 S69)。值得注意的是,在用户提出初始询问后,聊天机器人会通过展示用户与聊天机器人之前的交互上下文来 "记忆 "对话上下文。这包括从初始查询中识别出的综合语境和纸质信息,确保对后续查询的回答也是基于数据集中的事实信息。因此,这一策略保证了对后续查询的回复在语境上的准确性,并以综合数据集和相应论文信息中概述的事实为基础(图 7 和辅助信息,图 S71-S74)。
By virtue of its design, the chatbot addresses the challenge of enhancing data accessibility and interpretation. It accomplishes this by delivering synthesis parameters and procedures in a clear and comprehensible manner. Furthermore, it ensures data integrity and traceability by providing DOI links to the original papers, guiding users directly to the source of information. This functionality proves particularly beneficial for newcomers to the field. By leveraging ChatGPT’s general knowledge base, they can receive guided instructions through the synthesis process, even when faced with a procedure in a journal that is ambiguously or vaguely described. In this case, the user can consult ChatGPT to “chat with the paper” for a more precise explanation, thereby simplifying the learning process and facilitating a more efficient understanding of complex synthesis procedures. This capability fosters independent learning and expedites comprehension of intricate synthesis procedures, reinforcing ChatGPT’s role as a valuable assistant in the field of chemistry research. 聊天机器人的设计解决了提高数据可获取性和解释性的难题。为此,它以清晰易懂的方式提供合成参数和程序。此外,它还通过提供原始论文的 DOI 链接来确保数据的完整性和可追溯性,引导用户直接找到信息来源。这一功能对于该领域的新手来说尤其有益。通过利用 ChatGPT 的常识库,他们可以在合成过程中获得指导,即使面对的是期刊中描述不明确或含糊不清的程序。在这种情况下,用户可以查阅 ChatGPT 与论文 "聊天",以获得更精确的解释,从而简化学习过程,更有效地理解复杂的合成过程。这种功能促进了自主学习,加快了对复杂合成过程的理解,强化了 ChatGPT 在化学研究领域的重要助手作用。
Exploring Adaptability and Versatility in Large Language Models. The adaptability of LLM-based programs, a hallmark feature distinguishing them from traditional NLP programs, lies in their inherent ability to modify search targets or tasks simply by adjusting the input prompt. Whereas traditional NLP models may necessitate a complete overhaul of rules and coding in the event of task modifications, programs powered by ChatGPT and some other LLMs utilize a more intuitive approach. A simple change in narrative language within the prompt can adequately steer the model towards the intended task, obviating the need for elaborate code adjustments. 探索大型语言模型的适应性和多样性。基于 LLM 的程序的适应性是其区别于传统 NLP 程序的标志性特征,这在于它们只需调整输入提示即可修改搜索目标或任务的固有能力。如果要修改任务,传统的 NLP 模型可能需要对规则和编码进行全面修改,而由 ChatGPT 和其他一些 LLMs 提供支持的程序则采用了更为直观的方法。只需对提示中的叙述语言进行简单的更改,就能充分引导模型完成预期任务,而无需对代码进行复杂的调整。
However, we do recognize limitations within the current workflow, particularly concerning token limitations. Research articles for text mining were parsed into short snippets due to 4096 token limit from GPT-3.5-turbo, since longer research articles can extend to 20,000-40,00020,000-40,000 tokens. This fragmentation may inadvertently result in the undesirable segmentation of synthesis paragraphs or other sections containing pertinent information. To alleviate this, we envision that a large language model that can process higher token memory ^(61,62){ }^{61,62} such as GPT-4-32K (OpenAI), or Claude-v1 (Anthropic) will be very helpful, since each time it reads the entire paper rather than just sections, which can further increase its accuracy by avoiding undesirable segmentation of the synthesis paragraph or other targeted paragraph containing information. Longer reading capabilities will also have the added benefit of reducing the number of tokens used in repeated questions, thus enhancing processing times. As we continue to refine our workflow, we believe that there are further opportunities for improvement. For instance, parts of the fixed prompt could be more concise to save tokens, and the examples in the few-shot prompt can be further optimized to reduce total tokens. Given that each paper may have around 100 segments, such refinements could dramatically reduce time and costs, particularly for classification and summarization tasks, which must process every section with the same fixed prompt, especially for few-shot instructions. 不过,我们也认识到当前工作流程的局限性,特别是在标记限制方面。由于 GPT-3.5-turbo 的令牌限制为 4096 个,用于文本挖掘的研究文章被解析为较短的片段,因为较长的研究文章可以扩展到 20,000-40,00020,000-40,000 个令牌。这种碎片化可能会无意中导致包含相关信息的综合段落或其他部分被分割成不理想的片段。为了缓解这一问题,我们设想采用能够处理更高的标记记忆 ^(61,62){ }^{61,62} 的大型语言模型,如 GPT-4-32K (OpenAI) 或 Claude-v1 (Anthropic) 将非常有帮助,因为每次它都会阅读整篇论文,而不仅仅是章节,这可以避免对包含信息的合成段落或其他目标段落进行不必要的分割,从而进一步提高其准确性。更长的阅读能力还能减少重复问题中使用的标记数量,从而缩短处理时间。随着工作流程的不断完善,我们相信还有更多改进的机会。例如,固定提示中的部分内容可以更加简洁,以节省令牌;少量提示中的示例可以进一步优化,以减少令牌总数。鉴于每篇论文可能有大约 100 个部分,这种改进可以大大减少时间和成本,特别是对于分类和摘要任务,因为它们必须用相同的固定提示来处理每个部分,尤其是少量提示。
Figure 7. Integrated workflow of the MOF chatbot transforming comprehensive synthesis datasets into contextually accurate dialogue systems and demonstration of conversation with the data-driven chatbot. The process ensures enhanced data accessibility, interpretation, and facilitates independent learning in the field of chemistry research. 图 7.MOF 聊天机器人将综合合成数据集转化为语境准确的对话系统的集成工作流程,以及与数据驱动聊天机器人的对话演示。该流程可确保提高数据的可访问性和解释性,并促进化学研究领域的自主学习。
Furthermore, language versatility, a crucial aspect in the realm of text mining, is seamlessly addressed by LLMs. Traditional NLP models, trained in a specific language, often struggle when the task requires processing text data in another language. For example, if the model is trained on English data, it may require substantial adjustments or even a complete rewrite to process text data in Arabic, Chinese, French, German, French, Japanese, Korean and some other languages. However, with LLMs that can handle multiple languages, such as ChatGPT, we showed that researchers just need to slightly alter the instructions or prompts to achieve the goal, without the necessity of substantial code modifications (Supporting Information, Figure S55-S58). 此外,语言通用性是文本挖掘领域的一个重要方面,LLMs无缝地解决了这一问题。当任务需要处理另一种语言的文本数据时,以特定语言训练的传统 NLP 模型往往会陷入困境。例如,如果模型是根据英语数据训练的,那么在处理阿拉伯语、汉语、法语、德语、法语、日语、韩语和其他一些语言的文本数据时,可能需要进行大量调整,甚至完全重写。但是,对于可以处理多种语言的 LLMs,例如 ChatGPT,我们的研究表明,研究人员只需对指令或提示稍作改动即可实现目标,而无需对代码进行大量修改(佐证信息,图 S55-S58)。
The adaptable nature of LLMs can further extend versatility in handling diverse tasks. We demonstrated how prompts can be changed to direct ChatGPT to parse and summarize different types of information from the same pool of research articles. For instance, with minor modification of the prompts, we show that our ChatGPT Chemistry Assistants have the potential to be instructed to summarize diverse information such as thermal stability, BET surface area, CO_(2)\mathrm{CO}_{2} uptake, crystal parameters, water stability, and even MOF structure or topology (Supporting Information, Section S4). This adaptability was previously a labor-intensive process, requiring experienced specialists to manually collect or establish training sets for text mining each type of information. ^(11,13,35,41,63-66){ }^{11,13,35,41,63-66} LLMs 的适应性可进一步扩展处理不同任务的通用性。我们演示了如何通过更改提示来指导 ChatGPT 从相同的研究文章库中解析和总结不同类型的信息。例如,只要对提示稍作修改,我们就可以演示如何指导 ChatGPT 化学助手总结各种信息,如热稳定性、BET 表面积、 CO_(2)\mathrm{CO}_{2} 吸收、晶体参数、水稳定性,甚至 MOF 结构或拓扑(佐证资料,第 S4 节)。这种适应性以前是一个劳动密集型过程,需要经验丰富的专家手动收集或建立训练集,以便对每种类型的信息进行文本挖掘。 ^(11,13,35,41,63-66){ }^{11,13,35,41,63-66}
Moreover, the utility of this approach can benefit the broader chemistry domain: it is capable of not only facilitating data mining in research papers addressing MOF synthesis but also extending to all chemistry papers with the accorded modifications. By fine-tuning the prompt, the ChatGPT Chemistry Assistant can effectively extract and tabulate data from diverse fields such as organic synthesis, biochemistry preparations, perovskite preparations, polymer synthesis, and more. This capability underscores the versatility of the ChatGPT-based assistant, not only in terms of subject matter but also in the level of detail it can handle. In the event that key parameters for data extraction are not explicitly defined, ChatGPT can be prompted to suggest parameters based on its trained understanding of the text. This level of adaptability and interactivity is unparalleled in traditional NLP models, highlighting a key advantage of the ChatGPT approach. The shift from a code-intensive approach to a natural language instruction approach democratizes the process of data mining, making it accessible even to those with less coding expertise, makes it an innovative and powerful solution for diverse data mining challenges. 此外,这种方法的实用性还能惠及更广泛的化学领域:它不仅能促进针对 MOF 合成的研究论文的数据挖掘,还能扩展到所有经过相应修改的化学论文。通过对提示进行微调,ChatGPT 化学助手可以有效地从有机合成、生物化学制备、包晶石制备、聚合物合成等不同领域提取数据并制成表格。这一功能凸显了基于 ChatGPT 的助手的多功能性,不仅在主题方面,而且在可处理的详细程度方面都是如此。如果数据提取的关键参数没有明确定义,ChatGPT 可以根据其对文本的理解提出参数建议。这种适应性和交互性在传统的 NLP 模型中是无与伦比的,凸显了 ChatGPT 方法的关键优势。从代码密集型方法到自然语言指导方法的转变使数据挖掘过程民主化,即使是代码专业知识较少的人也能使用,这使它成为应对各种数据挖掘挑战的创新而强大的解决方案。
CONCLUDING REMARKS 结束语
Our research has successfully demonstrated the potential of LLMs, particularly GPT models, in the domain of chemistry research. We presented a ChatGPT Chemistry Assistant, which includes three different but connected approaches to text mining with ChemPrompt Engineering: Process 3 is capable of conducting search and filtration, Processes 2 and 3 both classify synthesis paragraphs, and Processes 1, 2 and 3 are capable of summarizing synthesis conditions into structured datasets. Enhanced by three fundamental principles of prompt engineering specific to chemistry text processing, coupled with the interactive prompt refinement strategy, the ChatGPT-based assistant have substantially advanced the extraction and analysis of MOF synthesis literature, with precision, recall, and F1 scores exceeding 90%. 我们的研究成功证明了 LLMs,尤其是 GPT 模型在化学研究领域的潜力。我们展示了 ChatGPT 化学助手,它包括三种不同但相互关联的方法,利用 ChemPrompt Engineering 进行文本挖掘:流程 3 可以进行搜索和过滤,流程 2 和 3 都可以对合成段落进行分类,流程 1、2 和 3 可以将合成条件汇总为结构化数据集。基于 ChatGPT 的助手通过化学文本处理特有的提示工程的三个基本原则,再加上交互式提示改进策略,大大推进了 MOF 合成文献的提取和分析,其精确度、召回率和 F1 分数均超过 90%。
We elucidated two crucial insights from the dataset of synthesis conditions. First, the data can be employed to construct predictive models for reaction outcomes, which shed light into the key experimental factors that influence the MOF crystallization process. Second, it is possible to create a MOF chatbot that can provide accurate answers based on text mining, thereby improving access to the synthesis dataset, and achieving a data-to-dialogue transition. This investigation illustrates the potential for rapid advancement inherent to ChatGPT and other LLMs as a proof-of-concept. 我们从合成条件数据集中获得了两个重要启示。首先,可以利用这些数据构建反应结果预测模型,从而揭示影响 MOF 结晶过程的关键实验因素。其次,可以创建一个 MOF 聊天机器人,在文本挖掘的基础上提供准确的答案,从而改善合成数据集的访问,实现从数据到对话的转变。这项调查说明了 ChatGPT 和其他 LLMs 作为概念验证所固有的快速进步的潜力。
On a fundamental level, this study provides guidance on interacting with LLMs to serve as AI assistants for chemists, accelerating research with minimal prerequisite coding expertise and thus bridging the gap between chemistry and the realms of computational and data science more effectively. Through interaction and chatting, the code and design of experiments can be modified, democratizing data mining and enhancing the landscape of scientific research. Our work sets a foundation for further exploration and application of LLMs across various scientific domains, paving the way for a new era of AI-assisted chemistry research. 从根本上讲,这项研究为与 LLMs 交互提供了指导,使其成为化学家的人工智能助手,以最少的前提编码专业知识加速研究,从而更有效地弥合化学与计算和数据科学领域之间的差距。通过互动和聊天,可以修改代码和实验设计,实现数据挖掘的民主化,改善科学研究的面貌。我们的工作为LLMs在各个科学领域的进一步探索和应用奠定了基础,为人工智能辅助化学研究的新时代铺平了道路。
ASSOCIATED CONTENT 相关内容
Supporting Information. Detailed instructions and design principles for ChemPrompt Engineering, as well as the specifics of the prompts employed in the ChatGPT Chemistry Assistant for text mining and other chemistry-related tasks. Additional information on the ChatGPT-assisted coding and data processing methods. An extensive explanation of the machine learning models and methods used, as well as the steps involved in setting up the MOF chatbot based on the MOF synthesis condition dataset. This material is available free of charge via the Internet at http://pubs.acs.org. 辅助信息。化学提示工程的详细说明和设计原则,以及用于文本挖掘和其他化学相关任务的 ChatGPT 化学助手中使用的提示的具体内容。有关 ChatGPT 辅助编码和数据处理方法的更多信息。广泛解释所使用的机器学习模型和方法,以及基于 MOF 合成条件数据集设置 MOF 聊天机器人所涉及的步骤。本资料可通过互联网 http://pubs.acs.org 免费获取。
AUTHOR INFORMATION 作者信息
Corresponding Author 通讯作者
Omar M. Yaghi - Department of Chemistry; Kavli Energy Nanoscience Institute; and Bakar Institute of Digital Materials for the Planet, College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, United States; UC Berkeley-KACST Joint Center of Excellence for Nanomaterials for Clean Energy Applications, King Abdulaziz City for Science and Technology, Riyadh 11442, Saudi Arabia; orcid.org/0000-0002-5611-3325; Email: yaghi@berkeley.edu Omar M. Yaghi - 化学系;Kavli 能源纳米科学研究所;美国加州大学伯克利分校计算、数据科学与社会学院巴卡尔地球数字材料研究所(College of Computing, Data Science, and Society, University of California, Berkeley, California 94720);加州大学伯克利分校-KACST 纳米材料清洁能源应用联合卓越中心(UC Berkeley-KACST Joint Center of Excellence for Nanomaterials for Clean Energy Applications, King Abdulaziz City for Science and Technology, Riyadh 11442, Saudi Arabia);orcid.org/0000-0002-5611-3325;电子邮件:yaghi@berkeley.edu
Other Authors 其他作者
Zhiling Zheng - Department of Chemistry; Kavli Energy Nanoscience Institute; and Bakar Institute of Digital Materials for the Planet, College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, United States; orcid.org/0000-0001-6090-2258 Zhiling Zheng - 美国加利福尼亚州伯克利 94720 加州大学伯克利分校计算、数据科学与社会学院化学系、卡弗里能源纳米科学研究所和巴卡尔地球数字材料研究所;orcid.org/0000-0001-6090-2258
Oufan Zhang - Department of Chemistry, University of California, Berkeley, California 94720, United States Oufan Zhang - 加利福尼亚大学伯克利分校化学系,美国加利福尼亚州 94720
Christian Borgs - Bakar Institute of Digital Materials for the Planet, College of Computing, Data Science, and Society; Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, United States; orcid.org/0000-0001-5653-0498 Christian Borgs - Bakar 地球数字材料研究所,计算、数据科学与社会学院;电子工程与计算机科学系,加利福尼亚大学伯克利分校,美国加利福尼亚州 94720;orcid.org/0000-0001-5653-0498
Jennifer T. Chayes - Bakar Institute of Digital Materials for the Planet, College of Computing, Data Science, and Society; Department of Electrical Engineering and Computer Sciences; Department of Mathematics; Department of Statistics; and School of Information, University of California, Berkeley, California 94720, United States; orcid.org/0000-0003-4020-8618 Jennifer T. Chayes - 美国加州大学伯克利分校计算、数据科学与社会学院巴卡尔地球数字材料研究所;电子工程与计算机科学系;数学系;统计系;信息学院,加利福尼亚州伯克利,94720;orcid.org/0000-0003-4020-8618
ACKNOWLEDGMENTS 致谢
Z.Z. extends special gratitude to Jiayi Weng (OpenAI) for valuable discussions on harnessing the potential of ChatGPT. In addition, Z.Z. acknowledges the inspiring guidance and input from Kefan Dong (Stanford University), Long Lian (University of California, Berkeley), and Yifan Deng (Carnegie Mellon University), all of whom contributed to shaping the study’s design and enhancing the performance of ChatGPT. We express our appreciation to Dr. Nakul Rampal from the Yaghi Lab for insightful discussions. Our gratitude is also extended for the financial support received from the Defense Advanced Research Projects Agency (DARPA) under contract HR0011-21-C-0020. O.Z. acknowledges funding and extends thanks for the support provided by the National Institute of Health (NIH) under Grant 5R01GM127627-04. Additionally, Z.Z. thanks for the financial support received through a Kavli ENSI Graduate Student Fellowship and the Bakar Institute of Digital Materials for the Planet (BIDMaP). his work is independently developed by the University of California, Berkeley research team and not affiliated, endorsed, or sponsored by OpenAI. Z.Z.特别感谢翁佳怡(OpenAI)就如何利用 ChatGPT 的潜力进行的宝贵讨论。此外,Z.Z.还要感谢 Kefan Dong(斯坦福大学)、Long Lian(加州大学伯克利分校)和 Yifan Deng(卡内基梅隆大学)提供的启发性指导和意见,他们都为本研究的设计和提高 ChatGPT 的性能做出了贡献。我们对 Yaghi 实验室的 Nakul Rampal 博士的深入讨论表示感谢。我们还要感谢美国国防部高级研究计划局(DARPA)根据 HR0011-21-C-0020 合同提供的资金支持。O.Z. 感谢美国国立卫生研究院(NIH)根据 5R01GM127627-04 号拨款提供的资金支持。此外,Z.Z. 还感谢 Kavli ENSI Graduate Student Fellowship 和 Bakar Institute of Digital Materials for the Planet (BIDMaP) 提供的资金支持。他的工作由加州大学伯克利分校研究团队独立完成,与 OpenAI 无关,也未得到 OpenAI 的认可或赞助。
REFERENCES 参考文献
Yaghi, O. M.; O’Keeffe, M.; Ockwig, N. W.; Chae, H. K.; Eddaoudi, M.; Kim, J., Reticular synthesis and the design of new materials. Nature 2003, 423 (6941), 705-714. Yaghi,O. M.;O'Keeffe,M.;Ockwig,N. W.;Chae,H. K.;Eddaoudi,M.;Kim,J.,网状合成和新材料的设计。自然》,2003 年,423 (6941),705-714。
Matlin, S. A.; Mehta, G.; Hopf, H.; Krief, A., The role of chemistry in inventing a sustainable future. Nat. Chem. 2015, 7 (12), 941-943. Matlin, S. A.; Mehta, G.; Hopf, H.; Krief, A., The role of chemistry in inventing a sustainable future.Nat.Chem.2015, 7 (12), 941-943.
Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y. T.; Li, Y.; Lundberg, S. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv 10.48550/arXiv.2303.12712 (accessed 2023-04-13). Bubeck, S.; Chandrasekaran, V.; Eldan, R.; Gehrke, J.; Horvitz, E.; Kamar, E.; Lee, P.; Lee, Y. T.; Li, Y.; Lundberg, S. 人工通用智能的火花:ArXiv 10.48550/arXiv.2303.12712 (accessed 2023-04-13).
Aspuru-Guzik, A.; Lindh, R.; Reiher, M., The matter simulation ® evolution. ACS Cent. Sci. 2018, 4 (2), 144-152. Aspuru-Guzik, A.; Lindh, R.; Reiher, M., The matter simulation ® evolution.ACS Cent.Sci. 2018, 4 (2), 144-152.
Chen, H.; Engkvist, O.; Wang, Y.; Olivecrona, M.; Blaschke, T., The rise of deep learning in drug discovery. Drug Discov. Today 2018, 23 (6), 1241-1250. Chen, H.; Engkvist, O.; Wang, Y.; Olivecrona, M.; Blaschke, T., The rise of deep learning in drug discovery.Drug Discov.Today 2018, 23 (6), 1241-1250.
Kaspar, C.; Ravoo, B.; van der Wiel, W. G.; Wegner, S.; Pernice, W., The rise of intelligent matter. Nature 2021, 594 (7863), 345-355. Kaspar, C.; Ravoo, B.; van der Wiel, W. G.; Wegner, S.; Pernice, W., The rise of intelligent matter.自然》2021 年第 594 (7863) 期,345-355 页。
Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A., Automatic chemical design using a data-driven continuous representation of molecules. ACS Cent. Sci. 2018, 4 (2), 268-276. Gómez-Bombarelli, R.; Wei, J. N.; Duvenaud, D.; Hernández-Lobato, J. M.; Sánchez-Lengeling, B.; Sheberla, D.; Aguilera-Iparraguirre, J.; Hirzel, T. D.; Adams, R. P.; Aspuru-Guzik, A., 使用数据驱动的分子连续表示自动化学设计。ACS Cent.Sci. 2018, 4 (2), 268-276.
Firat, M., What ChatGPT means for universities: Perceptions of scholars and students. J. Appl. Learn. Teach. 2023, 6 (1), 1-7. Firat, M., What ChatGPT means for universities:学者和学生的看法。J. Appl.Teach.2023, 6 (1), 1-7.
Lyu, H.; Ji, Z.; Wuttke, S.; Yaghi, O. M., Digital reticular chemistry. Chem 2020, 6 (9), 2219-2241. Lyu, H.; Ji, Z.; Wuttke, S.; Yaghi, O. M., Digital reticular chemistry.化学 2020》,6 (9),2219-2241。
Jensen, Z.; Kim, E.; Kwon, S.; Gani, T. Z.; Román-Leshkov, Y.; Moliner, M.; Corma, A.; Olivetti, E., A machine learning approach to zeolite synthesis enabled by automatic literature data extraction. ACS Cent. Sci. 2019, 5 (5), 892-899. Jensen, Z.; Kim, E.; Kwon, S.; Gani, T. Z.; Román-Leshkov, Y.; Moliner, M.; Corma, A.; Olivetti, E., 通过自动文献数据提取实现沸石合成的机器学习方法。ACS Cent.Sci. 2019, 5 (5), 892-899.
Park, S.; Kim, B.; Choi, S.; Boyd, P. G.; Smit, B.; Kim, J., Text mining metal-organic framework papers. J. Chem. Inf. Model. 2018, 58 (2), 244-251. Park, S.; Kim, B.; Choi, S.; Boyd, P. G.; Smit, B.; Kim, J., 文本挖掘金属有机框架论文。J. Chem.Inf.Model.2018, 58 (2), 244-251.
Park, H.; Kang, Y.; Choe, W.; Kim, J., Mining Insights on Metal-Organic Framework Synthesis from Scientific Literature Texts.J.Chem. Inf. Model. 2022, 62 (5), 1190-1198. Park, H.; Kang, Y.; Choe, W.; Kim, J., Mining Insights on Metal-Organic Framework Synthesis from Scientific Literature Texts.J.Chem. Inf.Inf.Model.2022, 62 (5), 1190-1198.
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A., Language models are few-shot learners. NIPS 2020, 33, 1877-1901. Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A., Language models are few-shot learners.Nips 2020, 33, 1877-1901.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I., Language models are unsupervised multitask learners. OpenAI blog 2019, 1 (8), 9. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I., Language models are unsupervised multitask learners.OpenAI blog 2019, 1 (8), 9.
Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I., Improving language understanding by generative pre-training. 2018. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I., Improving language understanding by generative pre-training.2018.
Jablonka, K. M.; Schwaller, P.; Ortega-Guerrero, A.; Smit, B. Is GPT-3 all you need for low-data discovery in chemistry? ChemRxiv 10.26434/chemrxiv-2023-fw8n4 (accessed 2023-02-14). Jablonka,K. M.;Schwaller,P.;Ortega-Guerrero,A.;Smit,B. GPT-3 是化学低数据发现所需要的吗?ChemRxiv 10.26434/chemrxiv-2023-fw8n4 (accessed 2023-02-14).
Moghadam, P. Z.; Li, A.; Wiggin, S. B.; Tao, A.; Maloney, A. G.; Wood, P. A.; Ward, S. C.; Fairen-Jimenez, D., Development of a Cambridge Structural Database subset: a collection of metal-organic frameworks for past, present, and future. Chem. Mater. 2017, 29 (7), 2618-2625. Moghadam, P. Z.; Li, A.; Wiggin, S. B.; Tao, A.; Maloney, A. G.; Wood, P. A.; Ward, S. C.; Fairen-Jimenez, D., Development of a Cambridge Structural Database subset: a collection of metal-organic frameworks for past, present, and future.Chem.Mater.2017, 29 (7), 2618-2625.
Chung, Y. G.; Camp, J.; Haranczyk, M.; Sikora, B. J.; Bury, W.; Krungleviciute, V.; Yildirim, T.; Farha, O. K.; Sholl, D. S.; Snurr, R. Q., Computation-ready, experimental metal-organic frameworks: A tool to enable high-throughput screening of nanoporous crystals. Chem. Mater. 2014, 26 (21), 6185-6192. Chung, Y. G.; Camp, J.; Haranczyk, M.; Sikora, B. J.; Bury, W.; Krungleviciute, V.; Yildirim, T.; Farha, O. K.; Sholl, D. S.; Snurr, R. Q., Computation-ready, experimental metal-organic frameworks:实现纳米多孔晶体高通量筛选的工具。Chem.Mater.2014, 26 (21), 6185-6192.
Chung, Y. G.; Haldoupis, E.; Bucior, B. J.; Haranczyk, M.; Lee, S.; Zhang, H.; Vogiatzis, K. D.; Milisavljevic, M.; Ling, S.; Camp, J. S., Advances, updates, and analytics for the computation-ready, experimental metal-organic framework database: CoRE MOF 2019. J. Chem. Eng. Data 2019, 64 (12), 5985-5998. Chung,Y. G.;Haldoupis,E.;Bucior,B. J.;Haranczyk,M.;Lee,S.;Zhang,H.;Vogiatzis,K. D.;Milisavljevic,M.;Ling,S.;Camp,J. S.,计算就绪的实验金属有机框架数据库的进展、更新和分析:CoRE MOF 2019.J. Chem.Eng.Data 2019, 64 (12), 5985-5998.
Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient estimation of word representations in vector space. arXiv 10.48550/arXiv.1301.3781 (accessed 2013-09-07).
Le, Q.; Mikolov, T. In Distributed representations of sentences and documents, International conference on machine learning, PMLR: 2014; pp 1188-1196.
Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; Dean, J., Distributed representations of words and phrases and their compositionality. NIPS 2013, 26. Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; Dean, J., Distributed Representations of words and phrases and their compositionality.NIPS 2013, 26.
Kusner, M.; Sun, Y.; Kolkin, N.; Weinberger, K. In From word embeddings to document distances, International conference on machine learning, PMLR: 2015; pp 957-966.
Gong, W.; Xie, H.; Idrees, K. B.; Son, F. A.; Chen, Z.; Sha, F.; Liu, Y.; Cui, Y.; Farha, O. K., Water sorption evolution enabled by reticular construction of zirconium metal-organic frameworks based on a unique [2.2] paracyclophane scaffold. J. Am. Chem. Soc. 2022, 144 (4), 18261834. Gong,W.;Xie,H.;Idrees,K. B.;Son,F. A.;Chen,Z.;Sha,F.;Liu,Y.;Cui,Y.;Farha,O. K.,基于独特[2.2]对位环链支架的锆金属有机框架网状结构的水吸附演化。J. Am.J. Am.2022, 144 (4), 18261834.
Hanikel, N.; Kurandina, D.; Chheda, S.; Zheng, Z.; Rong, Z.; Neumann, S. E.; Sauer, J.; Siepmann, J. I.; Gagliardi, L.; Yaghi, O. M., MOF Linker Extension Strategy for Enhanced Atmospheric Water Harvesting. ACS Cent. Sci. 2023, 9 (3), 551-557. Hanikel, N.; Kurandina, D.; Chheda, S.; Zheng, Z.; Rong, Z.; Neumann, S. E.; Sauer, J.; Siepmann, J. I.; Gagliardi, L.; Yaghi, O. M., MOF Linker Extension Strategy for Enhanced Atmospheric Water Harvesting.ACS Cent.2023, 9 (3), 551-557.
Liu, T.-F.; Feng, D.; Chen, Y.-P.; Zou, L.; Bosch, M.; Yuan, S.; Wei, Z.; Fordham, S.; Wang, K.; Zhou, H.-C., Topology-guided design and syntheses of highly stable mesoporous porphyrinic zirconium metal-organic frameworks with high surface area.J. Am. Chem. Soc. 2015, 137 (1), 413-419. Liu,T.-F.;Feng,D.;Chen,Y.-P.;Zou,L.;Bosch,M.;Yuan,S.;Wei,Z.;Fordham,S.;Wang,K.;Zhou,H.-C.,Topology-guided design and syntheses of highly stable mesoporous porphyrinic zirconium metal-organic frameworks with high surface area.J. Am. Chem.Chem.2015, 137 (1), 413-419.
Bloch, E. D.; Murray, L. J.; Queen, W. L.; Chavan, S.; Maximoff, S. N.; Bigi, J. P.; Krishna, R.; Peterson, V. K.; Grandjean, F.; Long, G. J., Selective binding of O2\mathrm{O2} over N2 in a redox-active metal-organic framework with open iron (II) coordination sites. J. Am. Chem. Soc. 2011, 133 (37), 14814-14822. Bloch,E. D.;Murray,L. J.;Queen,W. L.;Chavan,S.;Maximoff,S. N.;Bigi,J. P.;Krishna,R.;Peterson,V. K.;Grandjean,F.;Long,G. J.,在具有开放铁(II)配位位点的氧化还原活性金属有机框架中, O2\mathrm{O2} 与 N2 的选择性结合。J. Am.J. Am.2011,133 (37),14814-14822。
Furukawa, H.; Go, Y. B.; Ko, N.; Park, Y. K.; Uribe-Romo, F. J.; Kim, J.; O’Keeffe, M.; Yaghi, O. M., Isoreticular expansion of metal-organic frameworks with triangular and square building units and the lowest calculated density for porous crystals. Inorg. Chem. 2011, 50 (18), 9147-9152. Furukawa, H.; Go, Y. B.; Ko, N.; Park, Y. K.; Uribe-Romo, F. J.; Kim, J.; O'Keeffe, M.; Yaghi, O. M., Isoreticular expansion of metal-organic frameworks with triangular and square building units and the lowest calculated density for porous crystals.Inorg.Chem.2011, 50 (18), 9147-9152.
Zheng, Z.; Rong, Z.; Iu-Fan Chen, O.; Yaghi, O. M., Metal-Organic Frameworks with Rod Yttrium Secondary Building Units. Isr. J. Chem. 2023, e202300017. Zheng,Z.;Rong,Z.;Iu-Fan Chen,O.;Yaghi,O. M.,Metal-Organic Frameworks with Rod Yttrium Secondary Building Units.Isr.J. Chem.2023, e202300017.
Reinsch, H.; van der Veen, M. A.; Gil, B.; Marszalek, B.; Verbiest, T.; De Vos, D.; Stock, N., Structures, sorption characteristics, and nonlinear optical properties of a new series of highly stable aluminum MOFs. Chem. Mater. 2013, 25 (1), 17-26. Reinsch,H.;van der Veen,M. A.;Gil,B.;Marszalek,B.;Verbiest,T.;De Vos,D.;Stock,N.,一系列新的高稳定性铝 MOFs 的结构、吸附特性和非线性光学特性。Chem.Mater.2013, 25 (1), 17-26.
Hu, Z.; Pramanik, S.; Tan, K.; Zheng, C.; Liu, W.; Zhang, X.; Chabal, Y. J.; Li, J., Selective, sensitive, and reversible detection of vapor-phase high explosives via two-dimensional mapping: A new strategy for MOF-based sensors. Cryst. Growth Des. 2013, 13 (10), 4204-4207. Hu, Z.; Pramanik, S.; Tan, K.; Zheng, C.; Liu, W.; Zhang, X.; Chabal, Y. J.; Li, J., 通过二维映射对气相烈性炸药进行选择性、灵敏性和可逆性检测:基于 MOF 的传感器的新策略。Cryst.Growth Des.2013, 13 (10), 4204-4207.
Glasby, L. T.; Gubsch, K.; Bence, R.; Oktavian, R.; Isoko, K.; Moosavi, S. M.; Cordiner, J. L.; Cole, J. C.; Moghadam, P. Z., DigiMOF: A Database of Metal-Organic Framework Synthesis Information Generated via Text Mining. Chem. Mater. 2023. Glasby,L. T.;Gubsch,K.;Bence,R.;Oktavian,R.;Isoko,K.;Moosavi,S. M.;Cordiner,J. L.;Cole,J. C.;Moghadam,P. Z.,DigiMOF:通过文本挖掘生成的金属有机框架合成信息数据库。Chem.Mater.2023.
Nandy, A.; Duan, C.; Kulik, H. J., Using machine learning and data mining to leverage community knowledge for the engineering of stable metal-organic frameworks. J. Am. Chem. Soc. 2021, 143 (42), 17535-17547. Nandy, A.; Duan, C.; Kulik, H. J., Using machine learning and data mining to leverage community knowledge for the engineering of stable metal-organic frameworks.J. Am.Chem.2021, 143 (42), 17535-17547.
Shannon, R. D., Revised effective ionic radii and systematic studies of interatomic distances in halides and chalcogenides. Acta Crystallogr. A. 1976, 32 (5), 751-767. Shannon, R. D., Revised effective ionic radii and systematic studies of interatomic distances in halides and chalcogenides.Acta Crystallogr.A. 1976, 32 (5), 751-767.
Haynes, W. M., CRC handbook of chemistry and physics. CRC press: Boca Raton, FL, 2016. Haynes, W. M., CRC Handbook of chemistry and physics.CRC press:Boca Raton, FL, 2016.
Pauling, L., The nature of the chemical bond. IV. The energy of single bonds and the relative electronegativity of atoms. J. Am. Chem. Soc. 1932, 54 (9), 3570-3582. Pauling, L., The nature of the chemical bond.IV.单键的能量和原子的相对电负性。J. Am.J. Am.1932, 54 (9), 3570-3582.
Nguyen, K. T.; Blum, L. C.; Van Deursen, R.; Reymond, J. L., Classification of organic molecules by molecular quantum numbers. ChemMedChem 2009, 4 (11), 1803-1805. Nguyen, K. T.; Blum, L. C.; Van Deursen, R.; Reymond, J. L., 按分子量子数对有机分子进行分类。ChemMedChem 2009, 4 (11), 1803-1805.
Deursen, R. v.; Blum, L. C.; Reymond, J.-L., A searchable map of PubChem. J. Chem. Inf. Model. 2010, 50 (11), 1924-1934. Deursen, R. v.; Blum, L. C.; Reymond, J.-L., A searchable map of PubChem.J. Chem.Inf.Model.2010, 50 (11), 1924-1934.
Batra, R.; Chen, C.; Evans, T. G.; Walton, K. S.; Ramprasad, R., Prediction of water stability of metal-organic frameworks using machine learning. Nat. Mach. 2020, 2 (11), 704-710. Batra, R.; Chen, C.; Evans, T. G.; Walton, K. S.; Ramprasad, R., Prediction of water stability of metal-organic frameworks using machine learning.Nat.Mach.2020, 2 (11), 704-710.
Ho, T. K. In Random decision forests, Proceedings of 3rd international conference on document analysis and recognition, IEEE: 1995; pp 278-282.
Kaiser, T. M.; Burger, P. B., Error tolerance of machine learning algorithms across contemporary biological targets. Molecules 2019, 24 (11), 2115. Kaiser, T. M.; Burger, P. B., 当代生物靶标机器学习算法的容错性。分子 2019, 24 (11), 2115.
Meyer, J. G.; Liu, S.; Miller, I. J.; Coon, J. J.; Gitter, A., Learning drug functions from chemical structures with convolutional neural networks and random forests. J. Chem. Inf. Model. 2019, 59 (10), 4438-4449. Meyer, J. G.; Liu, S.; Miller, I. J.; Coon, J. J.; Gitter, A., Learning drug functions from chemical structures with convolutional neural networks and random forests.J. Chem.Inf.Model.2019, 59 (10), 4438-4449.
Rajappan, R.; Shingade, P. D.; Natarajan, R.; Jayaraman, V. K., Quantitative Structure- Property Relationship (QSPR) Prediction of Liquid Viscosities of Pure Organic Compounds Employing Random Forest Regression. Ind. Eng. Chem. Res. 2009, 48 (21), 9708-9712. Rajappan, R.; Shingade, P. D.; Natarajan, R.; Jayaraman, V. K., Quantitative Structure- Property Relationship (QSPR) Prediction of Liquid Viscosities of Pure Organic Compounds Employing Random Forest Regression.Ind.Ind.Chem.Res.2009,48 (21),9708-9712。
Kapsiani, S.; Howlin, B. J., Random forest classification for predicting lifespan-extending chemical compounds. Sci. Rep. 2021, 11 (1), 113. Kapsiani, S.; Howlin, B. J., Random forest classification for predicting lifespan-extending chemical compounds.Sci. Rep. 2021, 11 (1), 113.
Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P., Random forest: a classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. Comput. Sci. 2003, 43 (6), 1947-1958. Svetnik, V.; Liaw, A.; Tong, C.; Culberson, J. C.; Sheridan, R. P.; Feuston, B. P., Random forest: a classification and regression tool for compound classification and QSAR modeling.J. Chem.Inf.Comput.2003, 43 (6), 1947-1958.
Franklin, E. B.; Yee, L. D.; Aumont, B.; Weber, R. J.; Grigas, P.; Goldstein, A. H., Ch3MS-RF: a random forest model for chemical characterization and improved quantification of unidentified atmospheric organics detected by chromatography-mass spectrometry techniques. Atmos. Meas. Tech. 2022, 15 (12), 3779-3803. Franklin, E. B.; Yee, L. D.; Aumont, B.; Weber, R. J.; Grigas, P.; Goldstein, A. H., Ch3MS-RF: a random forest model for chemical characterization and improved quantification of unidentified atmospheric organics detected by chromatography-mass spectrometry techniques. Atmos.Atmos.Meas.技术。2022, 15 (12), 3779-3803.
de Santana, F. B.; Neto, W. B.; Poppi, R. J., Random forest as one-class classifier and infrared spectroscopy for food adulteration detection. Food Chem. 2019, 293, 323-332. de Santana, F. B.; Neto, W. B.; Poppi, R. J., Random forest as one-class classifier and infrared spectroscopy for food adulteration detection.Food Chem.2019, 293, 323-332.
Seifert, S., Application of random forest based approaches to surface-enhanced Raman scattering data. Sci. Rep. 2020, 10 (1), 1-11. Seifert, S.,基于随机森林的表面增强拉曼散射数据应用方法。科学报告,2020,10 (1),1-11。
Torrisi, S. B.; Carbone, M. R.; Rohr, B. A.; Montoya, J. H.; Ha, Y.; Yano, J.; Suram, S. K.; Hung, L., Random forest machine learning models for interpretable X-ray absorption near-edge structure spectrum-property relationships. Npj Comput. Mater. 2020, 6 (1), 109. Torrisi, S. B.; Carbone, M. R.; Rohr, B. A.; Montoya, J. H.; Ha, Y.; Yano, J.; Suram, S. K.; Hung, L., Random forest machine learning models for interpretable X-ray absorption near-edge structure spectrum-property relationships.Npj Comput.Mater.2020, 6 (1), 109.
Ahneman, D. T.; Estrada, J. G.; Lin, S.; Dreher, S. D.; Doyle, A. G., Predicting reaction performance in C-N cross-coupling using machine learning. Science 2018, 360 (6385), 186-190. Ahneman, D. T.; Estrada, J. G.; Lin, S.; Dreher, S. D.; Doyle, A. G., 使用机器学习预测 C-N 交叉偶联反应性能。Science 2018, 360 (6385), 186-190.
Yaghi, O. M.; Kalmutzki, M. J.; Diercks, C. S., Introduction to reticular chemistry: metal-organic frameworks and covalent organic frameworks. John Wiley & Sons: 2019. Yaghi,O. M.;Kalmutzki,M. J.;Diercks,C. S.,《网状化学导论:金属有机框架和共价有机框架》。John Wiley & Sons: 2019.
Han, Y.; Yang, H.; Guo, X., Synthesis methods and crystallization of MOFs. Synthesis Methods and Crystallization 2020, 1-23. Han,Y.;Yang,H.;Guo,X.,MOFs 的合成方法与结晶。合成方法与结晶2020,1-23。
Gándara, F.; Furukawa, H.; Lee, S.; Yaghi, O. M., High methane storage capacity in aluminum metal-organic frameworks.J. Am. Chem. Soc. 2014, 136 (14), 5271-5274. Gándara, F.; Furukawa, H.; Lee, S.; Yaghi, O. M., High methane storage capacity in aluminum metal-organic frameworks.J. Am. Chem.Chem.2014, 136 (14), 5271-5274.
Rowsell, J. L.; Yaghi, O. M., Effects of functionalization, catenation, and variation of the metal oxide and organic linking units on the lowpressure hydrogen adsorption properties of metal- organic frameworks. J. Am. Chem. Soc. 2006, 128 (4), 1304-1315. Rowsell, J. L.; Yaghi, O. M.,金属氧化物和有机连接单元的官能化、猫化和变化对金属有机框架低压氢吸附特性的影响。J. Am.J. Am.2006, 128 (4), 1304-1315.
Li, M.-Y.; Wang, F.; Zhang, J., Zeolitic tetrazolate-imidazolate frameworks with SOD topology for room temperature fixation of CO2 to cyclic carbonates. Cryst. Growth Des. 2020, 20 (5), 2866-2870. Li, M.-Y.; Wang, F.; Zhang, J., Zeolitic tetrazolate-imidazolate frameworks with SOD topology for room temperature fixation of CO2 to cyclic carbonates.Cryst.Growth Des.2020, 20 (5), 2866-2870.
Zheng, Z.; Alawadhi, A. H.; Yaghi, O. M., Green Synthesis and Scale-Up of MOFs for Water Harvesting from Air. Mol. Front. J. 2023, 1-20. 59. Köppen, M.; Meyer, V.; Ångström, J.; Inge, A. K.; Stock, N., Solvent-dependent formation of three new Bi-metal-organic frameworks using a tetracarboxylic acid. Cryst. Growth Des. 2018, 18 (7), 4060-4067. Zheng,Z.;Alawadhi,A. H.;Yaghi,O. M.,用于从空气中收集水的 MOFs 的绿色合成与放大。Mol. Front.J. 2023, 1-20.J. 2023, 1-20.59.Köppen, M.; Meyer, V.; Ångström, J.; Inge, A. K.; Stock, N., Solvent-dependent formation of three new Bi-metal-organic frameworks using a tetracarboxylic acid.Cryst.Growth Des.2018, 18 (7), 4060-4067.
Ma, K.; Cheung, Y. H.; Xie, H.; Wang, X.; Evangelopoulos, M.; Kirlikovali, K. O.; Su, S.; Wang, X.; Mirkin, C. A.; Xin, J. H., Zirconium-Based Metal-Organic Frameworks as Reusable Antibacterial Peroxide Carriers for Protective Textiles. Chem. Mater. 2023, 35 (6), 2342-2352. Ma, K.; Cheung, Y. H.; Xie, H.; Wang, X.; Evangelopoulos, M.; Kirlikovali, K. O.; Su, S.; Wang, X.; Mirkin, C. A.; Xin, J. H., Zirconium-Based Metal-Organic Frameworks as Reusable Antibacterial Peroxide Carriers for Protective Textiles.Chem.Mater.2023, 35 (6), 2342-2352.
Bulatov, A.; Kuratov, Y.; Burtsev, M. S. Scaling Transformer to 1M tokens and beyond with RMT. arXiv 10.48550/arXiv.2304.11062 (accessed 2023-04-19).
Colón, Y. J.; Gomez-Gualdron, D. A.; Snurr, R. Q., Topologically guided, automated construction of metal-organic frameworks and their evaluation for energy-related applications. Cryst. Growth Des. 2017, 17 (11), 5801-5810. Colón, Y. J.; Gomez-Gualdron, D. A.; Snurr, R. Q., Topologically guided, automated construction of metal-organic frameworks and their evaluation for energy-related applications.Cryst.Growth Des.2017, 17 (11), 5801-5810.
Nandy, A.; Yue, S.; Oh, C.; Duan, C.; Terrones, G. G.; Chung, Y. G.; Kulik, H. J., A database of ultrastable MOFs reassembled from stable fragments with machine learning models. Matter 2023, 6 (5), 1585-1603. Nandy,A.;Yue,S.;Oh,C.;Duan,C.;Terrones,G.G.;Chung,Y.G.;Kulik,H.J.,利用机器学习模型从稳定片段重新组装的超稳定 MOFs 数据库。Matter 2023, 6 (5), 1585-1603.
Suyetin, M., The application of machine learning for predicting the methane uptake and working capacity of MOFs. Faraday Discuss. 2021, 231, 224-234. Suyetin, M., The application of machine learning for predicting the methane uptake and working capacity of MOFs.Faraday Discuss.2021, 231, 224-234.
Nandy, A.; Terrones, G.; Arunachalam, N.; Duan, C.; Kastner, D. W.; Kulik, H. J., MOFSimplify, machine learning models with extracted stability data of three thousand metal-organic frameworks. Sci. Data 2022, 9 (1), 74. Nandy,A.;Terrones,G.;Arunachalam,N.;Duan,C.;Kastner,D.W.;Kulik,H.J.,MOFSimplify,使用提取的三千个金属有机框架稳定性数据的机器学习模型。科学数据 2022》,9 (1),74。
Supporting Information 辅助信息
ChatGPT Chemistry Assistant for Text Mining and Prediction of MOF Synthesis 用于文本挖掘和 MOF 合成预测的 ChatGPT 化学助手
Zhiling Zheng, ^(†,ℏ,§){ }^{\dagger, \hbar, \S} Oufan Zhang, ^(†){ }^{\dagger} Christian Borgs, ^(§,diamond){ }^{\S, \diamond} Jennifer T. Chayes, ^(§,⊘,††,dots,§§){ }^{\S, \oslash, \dagger \dagger, \ldots, \S \S} Omar M. Yaghi ^(†,!in,xi,||,**){ }^{\dagger, \notin, \xi, \|, *} Zhiling Zheng、 ^(†,ℏ,§){ }^{\dagger, \hbar, \S} Oufan Zhang、 ^(†){ }^{\dagger} Christian Borgs、 ^(§,diamond){ }^{\S, \diamond} Jennifer T. Chayes、 ^(§,⊘,††,dots,§§){ }^{\S, \oslash, \dagger \dagger, \ldots, \S \S} Omar M. Yaghi ^(†,!in,xi,||,**){ }^{\dagger, \notin, \xi, \|, *}
^(†){ }^{\dagger} Department of Chemistry, University of California, Berkeley, California 94720, United States ^(†){ }^{\dagger} 美国加州大学伯克利分校化学系,美国加利福尼亚州,94720^(‡){ }^{\ddagger} Kavli Energy Nanoscience Institute, University of California, Berkeley, California 94720, United States ^(‡){ }^{\ddagger} 美国加州大学伯克利分校卡弗利能源纳米科学研究所,加利福尼亚州,94720§ Bakar Institute of Digital Materials for the Planet, College of Computing, Data Science, and Society, University of California, Berkeley, California 94720, United States^(∙){ }^{\bullet} Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, United States ^(∙){ }^{\bullet} 美国加州大学伯克利分校电子工程与计算机科学系,加利福尼亚州,94720†† Department of Mathematics, University of California, Berkeley, California 94720, United States 美国加州大学伯克利分校数学系,加利福尼亚州,94720# Department of Statistics, University of California, Berkeley, California 94720, United States # 加利福尼亚大学伯克利分校统计系,美国加利福尼亚州,94720^("§ S School of Information, University of California, Berkeley, California 94720, United States "){ }^{\text {§ S School of Information, University of California, Berkeley, California 94720, United States }} " KACST-UC Berkeley Center of Excellence for Nanomaterials for Clean Energy Applications, King Abdulaziz City for Science and Technology, Riyadh 11442, Saudi Arabia ^("§ S School of Information, University of California, Berkeley, California 94720, United States "){ }^{\text {§ S School of Information, University of California, Berkeley, California 94720, United States }} " KACST-UC Berkeley 清洁能源应用纳米材料卓越中心,阿卜杜勒阿齐兹国王科技城,沙特阿拉伯利雅得 11442* To whom correspondence should be addressed: yaghi@berkeley.edu * 通信收件人:yaghi@berkeley.edu