Report

Predicting reaction performance in C–N cross-coupling using machine learning

Science

15 Feb 2018

Vol 360, Issue 6385

pp. 186-190

DOI: 10.1126/science.aar5169

A guide for catalyst choice in the forest

Chemists often discover reactions by applying catalysts to a series of simple compounds. Tweaking those reactions to tolerate more structural complexity in pharmaceutical research is time-consuming. Ahneman et al. report that machine learning can help. Using a high-throughput data set, they trained a random forest algorithm to predict which specific palladium catalysts would best tolerate isoxazoles (cyclic structures with an N–O bond) during C–N bond formation. The predictions also helped to guide analysis of the catalyst inhibition mechanism.

Science, this issue p. 186

Abstract

Machine learning methods are becoming integral to scientific inquiry in numerous disciplines. We demonstrated that machine learning can be used to predict the performance of a synthetic reaction in multidimensional chemical space using data obtained via high-throughput experimentation. We created scripts to compute and extract atomic, molecular, and vibrational descriptors for the components of a palladium-catalyzed Buchwald-Hartwig cross-coupling of aryl halides with 4-methylaniline in the presence of various potentially inhibitory additives. Using these descriptors as inputs and reaction yield as output, we showed that a random forest algorithm provides significantly improved predictive performance over linear regression analysis. The random forest model was also successfully applied to sparse training sets and out-of-sample prediction, suggesting its value in facilitating adoption of synthetic methodology.

Machine learning (ML) is the study and construction of computer algorithms that can learn from data (1). The ability of these algorithms to detect meaningful patterns has led to their adoption across a wide range of applications in science and technology, from autonomous vehicle control to recommender systems (2). ML has also been successfully applied in the biomedical sciences to enhance the virtual screening of libraries of druglike molecules for biological function (3–5). However, its application to the chemical sciences, and synthetic organic chemistry in particular, has been limited (6, 7). Prior efforts have focused primarily on using ML to assist with synthetic planning via retrosynthetic pathways or to predict the products of chemical reactions given a set of reactants and conditions (8–11). Applications of ML to predict the performance of a given reaction, however, are rare. Studies in the area of heterogeneous catalysis have used ML to predict reaction performance when only a single component is varied (12, 13). Two recent studies have advanced the field by evaluating predictions in multidimensional chemical space, although these studies performed a binary classification of reaction success (14, 15). The use of regression-based ML to predict reaction yields in multidimensional chemical space could provide chemists with a powerful tool to navigate the adoption of synthetic methodology.

The many challenges in applying ML to reaction performance have previously hindered its use in the field of chemical synthesis. Implementation of these algorithms has historically been complicated for nonspecialists. Further, the amount of data required to obtain statistically meaningful results grows exponentially with the number of dimensions under study, a problem known as the “curse of dimensionality” (1). Given the multidimensionality of chemical structure and reactivity, it has been difficult to generate enough data or to get access to sufficiently complete and consistent data from databases to warrant implementation of these algorithms (14). Fortunately, over the past decade, high-throughput experimentation (HTE) has emerged as a powerful tool in industry and academia for reaction optimization and discovery (16, 17). We sought to evaluate whether ML could be applied to the scale of data available to modern HTE and enable yield prediction in multidimensional chemical space.

Linear regression is the traditional tool for reaction prediction and analysis in both industry and academia (18). In this approach, the user assumes a linear relationship between reaction input (e.g., catalyst descriptors) and output (e.g., product selectivity) and hand-selects input variables on the basis of specific mechanistic hypotheses (19, 20). A strength of linear regression is its interpretability: A good fit between reagent descriptors and output supports mechanistic inferences, such as in the seminal Hammett linear free-energy relationship (21).

The models obtained from linear regression analysis have also been used for prediction. Recently, Sigman and co-workers have applied multivariate linear and polynomial regression analyses to optimize reaction selectivity by predicting catalyst, ligand, and substrate effects (22–24). Predicting yield tends to be more difficult; whereas product selectivity is determined by a small number of elementary steps, many on- and off-cycle events can substantially alter reaction yield. ML approaches accept numerous input descriptors without recourse to a mechanistic hypothesis and evaluate functions with greater flexibility to match patterns in data. We postulated that ML might outperform regression analysis for yield prediction and circumvent the challenge of selecting mechanistically relevant descriptors for large and multidimensional data sets. Here, we report that a random forest ML model trained on multidimensional chemical data can be used to predict the performance of a Buchwald-Hartwig amination reaction conducted in the presence of potentially inhibitory additives and to infer underlying reactivity. We have taken steps to automate reaction parameterization and modeling with the aim of making this tool accessible to the synthetic chemistry community.

We selected the Pd-catalyzed Buchwald-Hartwig reaction as our test reaction for model development because of its broad value in pharmaceutical synthesis (Fig. 1A) (25). Nevertheless, the application of this reaction to complex drug-like molecules remains challenging (26). One limitation is the poor performance of substrates possessing five-membered heterocycles that contain heteroatom-heteroatom bonds, such as isoxazoles. These heterocycles have drug-like characteristics but are underrepresented in successful drug candidates (27). Thus, we sought to use ML to predict the performance of the Buchwald-Hartwig reaction in the presence of isoxazoles. Rather than evaluate the coupling of a collection of substrates directly bearing the heterocycle functionality, we pursued a Glorius fragment additive screening approach (28) wherein we evaluated the effects of isoxazole fragment additives on the amination of different aryl and heteroaryl halides. This method cannot always account for the full impact of a structural motif embedded within a substrate. However, the Glorius approach allowed us to test 345 diverse structural interactions between isoxazoles and aryl and heteroaryl halides. This large array would not be possible with whole molecules because of the necessity of synthesizing and isolating all possible products for quantification in this study. We conducted the coupling reactions using the ultra–high-throughput setup recently developed in the Merck Research Laboratories for nanomole-scale experimentation in 1536-well plates (16). Use of the Mosquito robot enabled simultaneous evaluation of more reaction dimensions than could previously be examined by classical statistical analysis. Three 1536-well plates consisting of a full matrix of 15 aryl and heteroaryl halides, 4 Buchwald ligands, 3 bases, and 23 isoxazole additives generated a total of 4608 reactions (including controls). The yields of these reactions were used as the model output. Approximately 30% of the reactions failed to deliver any product, with the remainder quite evenly spread over the range of yields (fig. S7).
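The size of this screen follows directly from the component counts given above. The short Python sketch below enumerates the full factorial matrix to make the arithmetic explicit. The component labels are placeholders invented for illustration, and reading the wells left over after the full matrix as controls is only an inference from the "including controls" remark, not a plate layout taken from the study.

```python
from itertools import product

# Component counts as stated in the text; the labels themselves are placeholders.
aryl_halides = [f"ArX-{i}" for i in range(1, 16)]     # 15 aryl/heteroaryl halides
ligands = [f"L{i}" for i in range(1, 5)]              # 4 Buchwald ligands
bases = [f"base-{i}" for i in range(1, 4)]            # 3 bases
additives = [f"isoxazole-{i}" for i in range(1, 24)]  # 23 isoxazole additives

# Full factorial matrix: one entry per (halide, ligand, base, additive) combination.
full_matrix = list(product(aryl_halides, ligands, bases, additives))

# Halide/additive pairs: the structural interactions probed by the fragment screen.
halide_additive_pairs = list(product(aryl_halides, additives))

wells_available = 3 * 1536  # three 1536-well plates

print(len(full_matrix))                    # 4140
print(len(halide_additive_pairs))          # 345
print(wells_available)                     # 4608
print(wells_available - len(full_matrix))  # 468 wells beyond the full matrix
```

The 345 halide/additive pairs match the number of structural interactions quoted above, and the 4608 wells match the stated reaction total.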

Fig. 1 Application of ML to reaction prediction.
(A) A Buchwald-Hartwig amination was used as a model reaction for data generation with simultaneous evaluation of four dimensions. The impact of 23 isoxazole additives on the amination reaction was investigated according to a Glorius fragment screening approach. Full structures are provided in fig. S1. Me, methyl; X, any halide; equiv, equivalent; DMSO, dimethyl sulfoxide; L, ligand; OTf, triflate; i-Pr, isopropyl; R, H or alkyl group; t-Bu, *tert*-butyl; BTMG, t-butyltetramethylguanidine; MTBD, methyltriazabicyclodecene; Et, ethyl. (B) Software was built to automate feature generation. Molecular, atomic, and vibrational property calculations were performed using Spartan (with density functional B3LYP and basis set 6-31G*), and these features were subsequently extracted from the resulting text files to generate a modeling data table filled with descriptors and yields. To include vibrational modes as descriptors, we compared molecular vibrations for all compounds in a class on the basis of atomic movements. To more appropriately include the movement of heavy atoms, we multiplied each atom’s movement by its atomic mass. Vibrational mode vectors were compared using Pearson correlations. Only vibrational modes with R² > 0.5 and with values greater than any other entry in the same row and column were treated as matching vibrations. If the first molecule in the set (chosen arbitrarily) shared a particular matched vibration with all others in the group, that vibrational mode was considered to be conserved. In this case, the vibration’s frequency and intensity were included in the modeling data table. SA, surface area; V1 through V5, vibrational modes 1 through 5; *, shared atom.
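The vibrational-mode matching rule in (B) can be sketched in a few lines of Python. This is an illustrative reading of the caption, not the authors' software: the data layout (one list of per-atom displacement tuples per mode), the function names, and the toy two-atom example are all assumptions. A mode pair counts as matched when its squared Pearson correlation exceeds 0.5 and is the largest entry in both its row and its column.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def mass_weight(displacements, masses):
    """Scale each atom's (dx, dy, dz) displacement by its atomic mass and flatten."""
    return [m * d for (dx, dy, dz), m in zip(displacements, masses) for d in (dx, dy, dz)]

def match_modes(modes_a, modes_b, threshold=0.5):
    """Return {i: j} where mode i of molecule A matches mode j of molecule B."""
    r2 = [[pearson(a, b) ** 2 for b in modes_b] for a in modes_a]
    matches = {}
    for i, row in enumerate(r2):
        for j, value in enumerate(row):
            best_in_row = all(value > other for jj, other in enumerate(row) if jj != j)
            best_in_col = all(value > r2[k][j] for k in range(len(r2)) if k != i)
            if value > threshold and best_in_row and best_in_col:
                matches[i] = j
    return matches

# Toy example (hypothetical): two shared atoms, two modes per molecule.
masses = [12.0, 16.0]
raw_a = [[(1, 0, 0), (-1, 0, 0)],             # mode 0: motion along x
         [(0, 1, 0), (0, -1, 0)]]             # mode 1: motion along y
raw_b = [[(0.1, 0.9, 0), (-0.1, -0.9, 0)],    # mode 0: mostly along y
         [(0.95, 0.05, 0), (-0.95, -0.05, 0)]]  # mode 1: mostly along x

modes_a = [mass_weight(mode, masses) for mode in raw_a]
modes_b = [mass_weight(mode, masses) for mode in raw_b]

print(match_modes(modes_a, modes_b))  # {0: 1, 1: 0}
```

A mode of the first molecule would then be treated as conserved only if it finds a match in every other molecule of the class, at which point its frequency and intensity enter the data table.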

Next we turned to the selection of appropriate descriptors. In linear regression analysis, this selection is typically done by hand according to a mechanistic hypothesis, with principal component analysis sometimes being used to reduce the parameter set to an uncorrelated and statistically tractable number (29). For the ML model, we sought a set of descriptors that adequately characterizes the differences among the reactions without recourse to a specific hypothesis. For reasons of internal consistency and descriptor availability, calculated properties were used. To avoid prohibitively time-consuming analysis and logging of computational data, we developed software to submit molecular, atomic, and vibrational property calculations to Spartan and subsequently extract these features from the resulting text files for accessibility to a general user (Fig. 1B). The program requires only the input of reagent structures in the Spartan graphical user interface and specification of the reaction components in a Python script; it is applicable to any reaction type. The program then generates the data table that can be used for modeling. In total, 120 descriptors were extracted by the software to characterize each reaction (section III in the supplementary materials).

With these data in hand, we evaluated the predictive accuracies of linear regression and an array of ML methods using 70% of the data as a training set to predict the remaining 30% (test set) (Fig. 2A). For the linear regression models, we evaluated dimension reduction by removing correlated descriptors, as well as various regularization methods [such as LASSO (least absolute shrinkage and selection operator), ridge regression, and elastic net], but none generated good predictive performance. Turning to supervised ML models, we found that k-nearest neighbors, support vector machines, and a Bayes generalized linear model provided no improvement over a linear regression model. However, a single-layer neural network delivered substantial improvement over these methods. Moreover, we found that the random forest algorithm provided even better predictive performance. The test-set root mean square error (RMSE) for the random forest model was 7.8%, with a coefficient of determination R² value of 0.92. A significant proportion of this variation is likely attributable to experimental and analytical error. Random forest algorithms operate by randomly sampling the data and constructing decision trees, which are then aggregated to generate an overall prediction (30). By combining a large number of low-precision models, the algorithm can deliver high predictive accuracy without succumbing to overfitting.
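To make the bootstrap-and-aggregate idea concrete, here is a deliberately minimal Python sketch. It is not the model used in this study: each "tree" is a depth-1 stump, whereas real random forests grow much deeper trees, and the data are synthetic with an invented step-shaped yield response. The names `fit_stump`, `fit_forest`, `predict`, `rmse`, and `r_squared` are illustrative. The sketch trains on 70% of the rows and reports RMSE and R² on the remaining 30%, mirroring the evaluation described above.

```python
import random
from math import sqrt

def fit_stump(rows, targets, features):
    """Find the single split (feature, threshold) that minimizes squared error."""
    best = None
    for f in features:
        for threshold in sorted({row[f] for row in rows}):
            left = [t for row, t in zip(rows, targets) if row[f] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[f] > threshold]
            if not left or not right:
                continue
            left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((t - left_mean) ** 2 for t in left)
                   + sum((t - right_mean) ** 2 for t in right))
            if best is None or sse < best[0]:
                best = (sse, f, threshold, left_mean, right_mean)
    if best is None:  # every candidate split was degenerate: predict the mean
        mean = sum(targets) / len(targets)
        return (features[0], float("inf"), mean, mean)
    return best[1:]

def fit_forest(rows, targets, n_trees, n_split_features, rng):
    """Train each stump on a bootstrap resample and a random feature subset."""
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        features = rng.sample(range(len(rows[0])), n_split_features)
        forest.append(fit_stump([rows[i] for i in idx], [targets[i] for i in idx], features))
    return forest

def predict(forest, row):
    """Aggregate by averaging the predictions of all stumps."""
    votes = [lm if row[f] <= thr else rm for f, thr, lm, rm in forest]
    return sum(votes) / len(votes)

def rmse(observed, predicted):
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))

def r_squared(observed, predicted):
    mean = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return 1 - ss_res / sum((o - mean) ** 2 for o in observed)

# Synthetic data (assumption): yield depends on a threshold in descriptor 0 only.
rng = random.Random(0)
rows = [[rng.random() for _ in range(3)] for _ in range(200)]
yields = [(80.0 if row[0] > 0.5 else 10.0) + rng.gauss(0, 3) for row in rows]

n_train = len(rows) * 7 // 10  # 70/30 split
forest = fit_forest(rows[:n_train], yields[:n_train], n_trees=30, n_split_features=2, rng=rng)

test_obs = yields[n_train:]
test_pred = [predict(forest, row) for row in rows[n_train:]]
train_mean = sum(yields[:n_train]) / n_train

forest_rmse = rmse(test_obs, test_pred)
baseline_rmse = rmse(test_obs, [train_mean] * len(test_obs))  # predict-the-mean baseline
forest_r2 = r_squared(test_obs, test_pred)
print(f"forest RMSE {forest_rmse:.1f}, baseline RMSE {baseline_rmse:.1f}, R^2 {forest_r2:.2f}")
```

Each stump is a crude predictor on its own, since some never see the informative descriptor. The averaged ensemble still tracks the step in the response, which is the sense in which many low-precision models combine into a more accurate one.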

Fig. 2 Test set performance plots.
(A) Observed versus predicted plots for various ML algorithms and linear regression analysis. For all the models, a 70/30 split of training and test data, with k-fold cross-validation on the training data, was performed to measure each model’s generalizability to an independent data set. Only test set data are shown in plots. kNN, k-nearest neighbor; SVM, support vector machine; GLM, generalized linear model; dashed line, y = x line; solid line, Loess best-fit curve. (B) Test set performance of the random forest model with sparse data. A gradual erosion in predictive accuracy occurred from 70% of the data (the entire training set) down to 2.5% of the full data set. The smaller training sets were selected randomly from the original training data.

Nevertheless, ML tends to encounter predictive limitations when substantially different reaction conditions are used in the test set. This problem is exacerbated by the presence of activity cliffs, which are areas in reaction space where modest changes in chemical structure can lead to notable changes in reaction outcome (31). The tendency of ML algorithms to overfit and the presence of activity cliffs necessitate the collection of local reaction data (see fig. S30 for prediction of ArI and ArCl reaction outcomes from ArBr training data). One method for maximizing the extrapolative ability of a model is to use training data spread across the chemical space of interest. The ability to perform accurate prediction under sparsity effectively increases the reaction space that can be explored with the same number of experiments. For the random forest model, we were surprised to discover that enhanced predictive power over other methods could be achieved with a markedly smaller subset of the training data (Fig. 2B). With training on only 5% of the reaction data, the random forest algorithm outperformed linear regression using 70% of the same reaction data. Because 5% of the data set is only 230 experiments, these results indicate that ML can offer improvements in prediction on a scale routinely pursued in the course of reaction optimization and scope elucidation.
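The sparse-data experiment amounts to drawing progressively smaller random subsets from the training pool while keeping the test set fixed. The sketch below shows the bookkeeping under stated assumptions: reaction identifiers are plain integers, the intermediate fractions are illustrative rather than taken from Fig. 2B, and truncating fractional sizes with `int` is an arbitrary choice.

```python
import random

TOTAL_REACTIONS = 4608  # full data set size stated in the text

# Fractions of the full data set; 0.70 is the whole training set.
# Intermediate values are illustrative assumptions.
fractions = [0.70, 0.50, 0.30, 0.20, 0.10, 0.05, 0.025]

rng = random.Random(0)
all_ids = list(range(TOTAL_REACTIONS))
rng.shuffle(all_ids)

# Fixed 70/30 split; every sparse subset is drawn from the training pool only.
n_train = int(0.70 * TOTAL_REACTIONS)
train_ids, test_ids = all_ids[:n_train], all_ids[n_train:]

subsets = {}
for frac in fractions:
    size = min(int(frac * TOTAL_REACTIONS), n_train)
    subsets[frac] = rng.sample(train_ids, size)  # random draw without replacement

for frac, ids in subsets.items():
    print(f"{frac:>6.1%} of the data -> {len(ids)} training reactions")
```

With these numbers the 5% subset contains 230 reactions, the figure quoted above, and the 2.5% subset contains 115. A model would be refit on each subset and scored on the same held-out test reactions.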

We next explored the ability of a random forest model to predict outcomes for reactions containing additives not included in the training data. If effective out-of-sample prediction was possible, ML could predict the effect of a new isoxazole or aryl halide structure on the outcome of a Buchwald-Hartwig amination and identify the combination of base and ligand that would deliver the highest yield. To this end, we evaluated whether the results for 15 additives could be used to predict the outcomes with 8 distinct additives (Fig. 3A). On average, the out-of-sample RMSE was 11.3%, with an R² value of 0.83 (Fig. 3B). None of the additives created significant systematic deviations from what was predicted by the model. The high predictive ability of the model suggests that the effects of these substituents on reaction outcome were captured well by the descriptors. However, as additive consumption was not included in the output, the algorithm is likely to encounter predictive limitations when applied to substrates with embedded isoxazoles.
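Out-of-sample evaluation of this kind differs from a random split in that whole additives, not individual reactions, are withheld. A minimal sketch of such a grouped split follows. Reactions are represented here as hypothetical index tuples over the full factorial matrix without controls, and every additive that is not held out is kept for training, which is an assumption of the sketch.

```python
from itertools import product

# Hypothetical reaction records: (aryl_halide, ligand, base, additive) index tuples.
reactions = list(product(range(1, 16), range(1, 5), range(1, 4), range(1, 24)))

held_out_additives = set(range(16, 24))  # isoxazoles 16 to 23, as in Fig. 3A

train = [r for r in reactions if r[3] not in held_out_additives]
test = [r for r in reactions if r[3] in held_out_additives]

# No additive appears on both sides, so the test set is out-of-sample
# with respect to the additive dimension.
shared = {r[3] for r in train} & {r[3] for r in test}
print(len(train), len(test), shared)  # 2700 1440 set()
```

A random 70/30 split would instead place reactions of every additive on both sides, so it cannot show whether the descriptors generalize to an unseen isoxazole.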

Fig. 3 Additive prediction.
(A) Isoxazoles in the additive training set (1 to 6 and 8 to 15) were used to predict the performance of isoxazoles 16 to 23 in the test set. Ph, phenyl; Bn, benzyl. (B) Out-of-sample performance of the random forest model from (A). Test set data are shown.

Having obtained a predictive model, we sought to determine whether it could be used to guide mechanistic analysis. Unlike a linear regression model, the random forest model is challenging to interpret directly. We therefore evaluated the relative importance of descriptors used to construct the model. One such measure of a descriptor’s importance is the percent increase in the model’s mean square error (MSE) when values for that descriptor are randomly shuffled and the model is retrained (1). We found that four of the five most important descriptors in predicting reaction outcomes were the additive’s *C-3 nuclear magnetic resonance (NMR) shift (where the asterisk indicates a shared atom), lowest unoccupied molecular orbital (LUMO) energy, and *O-1 and *C-5 electrostatic charges (Fig. 4A). These features are not sufficient to obtain a predictive linear model (fig. S24). Taken together, the descriptors suggest that the propensity of the additive to act as an electrophile influences reaction outcomes (32–34). We hypothesized that competitive oxidative addition of the isoxazole could be a source of deleterious side reactivity. Although oxidative addition of Pd to isoxazoles is not known (35), such an elementary step has been reported previously for other transition metals (36).

Fig. 4 Model analysis.
(A) The 10 most important descriptors of the trained random forest model determined by measuring the percent increase in the MSE upon reshuffling of the values of a given descriptor and retraining of the model. * indicates a shared atom. E, energy; HOMO, highest occupied molecular orbital; V, vibration. (B) Isoxazoles and the set of reactions designed to test the hypothesis that Pd undergoes oxidative addition to certain additives, leading to diminished yield of the Buchwald-Hartwig amination. ppm, parts per million; rt, room temperature. (C) ³¹P-NMR spectra for the reactions depicted in (B). Spectrum 2 shows the generation of a new Pd species, designated 2b, upon reaction of Pd(PPh₃)₄ with 1b. Species 2b is characterized by a pair of doublets with equal integration and a coupling constant (J) consistent with two *cis* phosphines (²J_PP = 37 Hz, where ²J_PP is the geminal phosphorus coupling constant). HRMS analysis of the reaction mixture indicates the presence of Pd(1b)(PPh₃)₂ (2b, [M + 1]⁺ = 750.13).

To evaluate this proposal, we conducted a series of experiments with isoxazoles 1a and 1b, which possess the smallest and largest predicted *C-3 NMR chemical shifts of the additives in the test set, respectively (Fig. 4B). As shown in Fig. 4C, spectrum 1, isoxazole 1a underwent no reaction with tetrakis(triphenylphosphine)palladium(0) [Pd(PPh₃)₄] in benzene at room temperature. On the other hand, with isoxazole 1b, a new species was observed within 1 hour (Fig. 4C, spectrum 2). High-resolution mass spectrometry (HRMS) and spectroscopic (³¹P, ¹³C, and ¹H NMR) analyses provided strong evidence that isoxazole 1b underwent oxidative addition at the N–O bond (section VI in the supplementary materials). Going further, we investigated how isoxazoles 1a and 1b performed in competition with an aryl halide. When 1a was mixed with aryl bromide 1c, formation of only the aryl bromide oxidative adduct (2c) was observed (Fig. 4C, spectrum 3). However, when isoxazole 1b was subjected to the same competition experiment, the oxidative adducts of both the aryl bromide 1c and isoxazole 1b were observed in roughly equal amounts (Fig. 4C, spectrum 4). These data are consistent with the hypothesis that electrophilic isoxazole additives can undergo N–O oxidative addition to Pd(0) as a deleterious side reaction, causing diminished yields of the desired Buchwald-Hartwig aminations. Although such a hypothesis could have been obtained by alternate means, this study highlights how measuring the influence of a large collection of descriptors for their predictive ability in an ML algorithm can be used to generate hypotheses for further mechanistic inquiry. Although one should be hesitant to perform direct causal inference, this approach could be particularly enabling for larger and higher-dimensional data sets wherein it would be challenging or impossible to intuit a unified mechanism.

Vast resources and time are currently expended on the development of synthetic methods and their application to complex molecule synthesis, often in a largely ad hoc manner. Here we have shown that simple atomic, molecular, and vibrational descriptors that can be automatically extracted from the text files of Spartan calculations can be used as input for a random forest model to predict yields of multidimensional chemical data. We expect that this approach, coupled with advances in HTE and analysis with whole-molecule systems, will prove to be of broad utility in facilitating the adoption of synthetic methods by enabling prediction of a new substrate’s performance under given conditions or prediction of the optimal conditions for a new substrate.

Acknowledgments

We thank K. Chuang and M. Keiser of the University of California, San Francisco, for help troubleshooting the neural network implementation and K. Wu of Princeton University and R. Sheridan and Z. Peng of Merck Research Laboratories for helpful discussions. Funding: Financial support was provided by Princeton University, an Amgen Young Investigator Award, and a Camille Dreyfus Teacher-Scholar Award. Author contributions: D.T.A., J.G.E., and S.L. performed the experiments. D.T.A. wrote the code. All authors designed the experiments, analyzed the data, and wrote the manuscript. Competing interests: None declared. Data and materials availability: All code and data used to produce the reported results can be found online at https://github.com/doylelab/rxnpredict. Additional HTE yields and model analysis data are available in the supplementary materials.

Supplementary Material

Summary

Materials and Methods

Supplementary Text

Figs. S1 to S34

Tables S1 to S4

References (37–43)

References and Notes

T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2009).

Authors

Derek T. Ahneman https://orcid.org/0000-0002-7461-0077

Jesús G. Estrada

Shishi Lin https://orcid.org/0000-0002-8225-5394

Spencer D. Dreher* https://orcid.org/0000-0002-5094-1218 spencer_dreher@merck.com

Abigail G. Doyle* https://orcid.org/0000-0002-6641-0833 spencer_dreher@merck.com
