

Annual Review of Economics

Machine Learning Methods That Economists Should Know About

Susan Athey and Guido W. Imbens
Graduate School of Business, Stanford University, Stanford, California 94305, USA; email: athey@stanford.edu, imbens@stanford.edu

Stanford Institute for Economic Policy Research, Stanford University, Stanford, California 94305, USA

National Bureau of Economic Research, Cambridge, Massachusetts 02138, USA

Department of Economics, Stanford University, Stanford, California 94305, USA

Annu. Rev. Econ. 2019. 11:685-725

First published as a Review in Advance on June 10, 2019

The Annual Review of Economics is online at economics.annualreviews.org

Copyright (c) 2019 by Annual Reviews. All rights reserved

Keywords: machine learning, causal inference, econometrics

JEL code: C30

Abstract

We discuss the relevance of the recent machine learning (ML) literature for economics and econometrics. First we discuss the differences in goals, methods, and settings between the ML literature and the traditional econometrics and statistics literatures. Then we discuss some specific methods from the ML literature that we view as important for empirical researchers in economics. These include supervised learning methods for regression and classification, unsupervised learning methods, and matrix completion methods. Finally, we highlight newly developed methods at the intersection of ML and econometrics that typically perform better than either off-the-shelf ML or more traditional econometric methods when applied to particular classes of problems, including causal inference for average treatment effects, optimal policy estimation, and estimation of the counterfactual effect of price changes in consumer choice models.

1. INTRODUCTION

In the abstract of his provocative 2001 paper in Statistical Science, the Berkeley statistician Leo Breiman (2001b, p. 199) writes about the difference between model-based and algorithmic approaches to statistics:
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.
Breiman (2001b, p. 199) goes on to claim that,
The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
Breiman's (2001b) characterization no longer applies to the field of statistics. The statistics community has by and large accepted the machine learning (ML) revolution that Breiman refers to as the algorithm modeling culture, and many textbooks discuss ML methods alongside more traditional statistical methods (e.g., Hastie et al. 2009, Efron & Hastie 2016). Although the adoption of these methods in economics has been slower, they are now beginning to be widely used in empirical work and are the topic of a rapidly increasing methodological literature. In this review, we want to make the case that economists and econometricians also, as Breiman writes about the statistics community, "need to move away from exclusive dependence on data models and adopt a more diverse set of tools." We discuss some of the specific tools that empirical researchers would benefit from, and that we feel should be part of the standard graduate curriculum in econometrics if, as Breiman writes and we agree with, "our goal as a field is to use data to solve problems;" if, in other words, we view econometrics as, in essence, decision making under uncertainty (e.g., Chamberlain 2000); and if we wish to enable students to communicate effectively with researchers in other fields where these methods are routinely being adopted. Although relevant more generally, the methods developed in the ML literature have been particularly successful in big data settings, where we observe information on a large number of units, many pieces of information on each unit, or both, and often outside the simple setting with a single cross-section of units. For such settings, ML tools are becoming the standard across disciplines, so the economist's toolkit needs to adapt accordingly while preserving the traditional strengths of applied econometrics.
Why has the acceptance of ML methods been so much slower in economics compared to the broader statistics community? A large part of it may be the culture as Breiman refers to it. Economics journals emphasize the use of methods with formal properties of a type that many of the ML methods do not naturally deliver. This includes large sample properties of estimators and tests, including consistency, normality, and efficiency. In contrast, the focus in the ML literature is often on working properties of algorithms in specific settings, with the formal results being of a different type, e.g., guarantees of error rates. There are typically fewer theoretical results of the type traditionally reported in econometrics papers, although recently there have been some major advances in this area (Wager & Athey 2017, Farrell et al. 2018). There are no formal results that show that, for supervised learning problems, deep learning or neural net methods are uniformly superior to regression trees or random forests, and it appears unlikely that general results for such comparisons will soon be available, if ever.
Although the ability to construct valid large-sample confidence intervals is important in many cases, one should not out-of-hand dismiss methods that cannot deliver them (or, possibly, that cannot yet deliver them) if these methods have other advantages. The demonstrated ability to outperform alternative methods on specific data sets in terms of out-of-sample predictive power is valuable in practice, even though such performance is rarely explicitly acknowledged as a goal or assessed in econometrics. As Mullainathan & Spiess (2017) highlight, some substantive problems are naturally cast as prediction problems, and assessing their goodness of fit on a test set may be sufficient for the purposes of the analysis in such cases. In other cases, the output of a prediction problem is an input to the primary analysis of interest, and statistical analysis of the prediction component beyond convergence rates is not needed. However, there are also many settings where it is important to provide valid confidence intervals for a parameter of interest, such as an average treatment effect. The degree of uncertainty captured by standard errors or confidence intervals may be a component in decisions about whether to implement the treatment. We argue that, in the future, as ML tools are more widely adopted, researchers should articulate clearly the goals of their analysis and why certain properties of algorithms and estimators may or may not be important.
A major theme of this review is that, even though there are cases where using simple offthe-shelf algorithms from the ML literature can be effective (for examples, see Mullainathan & Spiess 2017), there are also many cases where this is not the case. The ML techniques often require careful tuning and adaptation to effectively address the specific problems that economists are interested in. Perhaps the most important type of adaptation is to exploit the structure of the problems, e.g., the causal nature of many estimands; the endogeneity of variables; the configuration of data such as panel data; the nature of discrete choice among a set of substitutable products; or the presence of credible restrictions motivated by economic theory, such as monotonicity of demand in prices or other shape restrictions (Matzkin 1994, 2007). Statistics and econometrics have traditionally put much emphasis on these structures and developed insights to exploit them, whereas ML has often put little emphasis on them. Exploitation of these insights, both substantive and statistical, which, in a different form, is also seen in the careful tuning of ML techniques for specific problems such as image recognition, can greatly improve their performance. Another type of adaptation involves changing the optimization criteria of ML algorithms to prioritize considerations from causal inference, such as controlling for confounders or discovering treatment effect heterogeneity. Finally, techniques such as sample splitting [using different data to select models than to estimate parameters (e.g., Athey & Imbens 2016, Wager & Athey 2017)] and orthogonalization (e.g., Chernozhukov et al. 2016a) can be used to improve the performance of ML estimators, in some cases leading to desirable properties such as asymptotic normality of ML estimators (e.g., Athey et al. 2016b, Farrell et al. 2018).
In this review, we discuss a list of tools that we feel should be part of the empirical economist's toolkit and should be covered in the core econometrics graduate courses. Of course, this is a subjective list, and given the speed with which this literature is developing, the list will rapidly evolve. Moreover, we do not give a comprehensive discussion of these topics; rather, we aim to provide an introduction to these methods that conveys the main ideas and insights, with references to more comprehensive treatments. First on our list is nonparametric regression, or in the terminology of the ML literature, supervised learning for regression problems. Second, we discuss supervised learning for classification problems or, closely related but not quite the same, nonparametric regression for discrete response models. This is the area where ML methods have had perhaps their biggest successes. Third, we discuss unsupervised learning, or clustering analysis and density estimation. Fourth, we analyze estimates of heterogeneous treatment effects and optimal policies mapping from individuals' observed characteristics to treatments. Fifth, we discuss ML approaches to experimental design, where bandit approaches are starting to revolutionize effective experimentation, especially in online settings. Sixth, we discuss the matrix completion problem, including its application to causal panel data models and problems of consumer choice among a discrete set of products. Finally, we discuss the analysis of text data.
We note that there are a few other recent reviews of ML methods aimed at economists, often with more empirical examples and references to applications than we discuss in this review. Varian (2014) provides an early high-level discussion of a selection of important ML methods. Mullainathan & Spiess (2017) focus on the benefits of supervised learning methods for regression and discuss the prevalence of problems in economics where prediction methods are appropriate. Athey (2017) and Athey et al. (2017c) provide a broader perspective with more emphasis on recent developments in adapting ML methods for causal questions and general implications for economics. Gentzkow et al. (2017) provide an excellent recent discussion of methods for text analyses with a focus on economics applications. In the computer science and statistics literatures, there are also several excellent textbooks, with different levels of accessibility to researchers with a social science background, including the work of Efron & Hastie (2016); Hastie et al. (2009), who provide a more comprehensive text from a statistics perspective; Burkov (2019), who provides a very accessible introduction; Alpaydin (2009); and Knox (2018); all of these works take more of a computer science perspective.

2. ECONOMETRICS AND MACHINE LEARNING: GOALS, METHODS, AND SETTINGS

In this section, we introduce some of the general themes of this review. What are the differences in the goals and concerns of traditional econometrics and the ML literature, and how do these goals and concerns affect the choices among specific methods?

2.1. Goals

The traditional approach in econometrics, as exemplified in leading texts such as those of Greene (2000), Angrist & Pischke (2008), and Wooldridge (2010), is to specify a target, an estimand, that is a functional of a joint distribution of the data. The target is often a parameter of a statistical model that describes the distribution of a set of variables (typically conditional on some other variables) in terms of a set of parameters, which can be a finite or infinite set. Given a random sample from the population of interest, the parameter of interest and the nuisance parameters are estimated by finding the parameter values that best fit the full sample, using an objective function such as the sum of squared errors or the likelihood function. The focus is on the quality of the estimators of the target, traditionally measured through large sample efficiency. There is often also interest in constructing confidence intervals. Researchers typically report point estimates and standard errors.
In contrast, in the ML literature, the focus is typically on developing algorithms [a widely cited paper by Wu et al. (2008) has the title "Top 10 Algorithms in Data Mining"]. The goal for the algorithms is typically to make predictions about some variables given others or to classify units on the basis of limited information, for example, to classify handwritten digits on the basis of pixel values.
In a very simple example, suppose that we model the conditional distribution of some outcome $Y_i$ given a vector-valued regressor or feature $X_i$. Suppose that we are confident that
$$Y_i \mid X_i \sim \mathcal{N}\big(\beta^\top X_i, \sigma^2\big).$$
We could estimate $\beta$ by least squares, that is, as
$$\hat\beta_{\rm ls} = \arg\min_\beta \sum_{i=1}^N \big(Y_i - \beta^\top X_i\big)^2.$$
Most introductory econometrics texts would focus on the least squares estimator without much discussion. If the model is correct, then the least squares estimator has well-known attractive properties: It is unbiased, it is the best linear unbiased estimator, it is the maximum likelihood estimator, and thus it has large sample efficiency properties.
In ML settings, the goal may be to make a prediction for the outcome for new units on the basis of their regressor values. Suppose that we are interested in predicting the value of $Y_{N+1}$ for a new unit $N+1$, on the basis of the regressor values for this new unit, $X_{N+1}$. Suppose that we restrict ourselves to linear predictors, so that the prediction is
$$\hat Y_{N+1} = \hat\beta^\top X_{N+1},$$
for some estimator $\hat\beta$. The loss associated with this decision may be the squared error
$$\big(Y_{N+1} - \hat Y_{N+1}\big)^2 = \big(Y_{N+1} - \hat\beta^\top X_{N+1}\big)^2.$$
The question now is how to come up with estimators $\hat\beta$ that have good properties associated with this loss function. This need not be the least squares estimator. In fact, when the dimension of the features exceeds two, we know from decision theory that we can do better in terms of expected squared error than the least squares estimator. The latter is not admissible; that is, there are other estimators that dominate the least squares estimator.

2.2. Terminology

One source of confusion is the use of new terminology in ML for concepts that have well-established labels in the older literatures. In the context of a regression model, the sample used to estimate the parameters is often referred to as the training sample. Instead of the model being estimated, it is being trained. Regressors, covariates, or predictors are referred to as features. Regression parameters are sometimes referred to as weights. Prediction problems are divided into supervised learning problems, where we observe both the predictors (features) $X_i$ and the outcome $Y_i$, and unsupervised learning problems, where we only observe the $X_i$ and try to group them into clusters or otherwise estimate their joint distribution. Unordered discrete response problems are generally referred to as classification problems.

2.3. Validation and Cross-Validation

In most discussions of linear regression in econometric textbooks, there is little emphasis on model validation. The form of the regression model, be it parametric or nonparametric, and the set of regressors are assumed to be given from the outside, e.g., economic theory. Given this specification, the task of the researcher is to estimate the unknown parameters of this model. Much emphasis is placed on doing this estimation step efficiently, typically operationalized through definitions of large sample efficiency. If there is discussion of model selection, it is often in the form of testing null hypotheses concerning the validity of a particular model, with the implication that there is a true model that should be selected and used for subsequent tasks.
Consider the regression example in the previous section. Let us assume that we are interested in predicting the outcome for a new unit, randomly drawn from the same population as our sample was drawn from. As an alternative to estimating the linear model with an intercept and a scalar regressor $X_i$, we could estimate the model with only an intercept. Certainly, if $\beta = 0$, then that model would lead to better predictions. By the same argument, if the true value of $\beta$ were close but not exactly equal to zero, then we would still do better leaving $X_i$ out of the regression. Out-of-sample cross-validation can help guide such decisions. There are two components of the problem that are important for this ability. First, the goal is predictive power, rather than estimation of a particular structural or causal parameter. Second, the method uses out-of-sample comparisons, rather than in-sample goodness-of-fit measures. This ensures that we obtain unbiased comparisons of the fit.
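To make this concrete, here is a minimal sketch (ours, not from the paper) of using a held-out sample to choose between the intercept-only predictor and the simple linear regression; the simulated data-generating process, with a slope close to zero, is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a scalar regressor with a true slope close to zero.
n = 200
x = rng.normal(size=n)
y = 0.05 * x + rng.normal(size=n)

# Split into a training sample and a held-out test sample.
x_tr, x_te = x[:100], x[100:]
y_tr, y_te = y[:100], y[100:]

# Model 1: intercept only (predict the training mean).
pred_const = np.full_like(y_te, y_tr.mean())

# Model 2: least squares with intercept and slope.
beta1, beta0 = np.polyfit(x_tr, y_tr, deg=1)
pred_linear = beta0 + beta1 * x_te

# Out-of-sample mean squared errors decide which model to use.
mse_const = np.mean((y_te - pred_const) ** 2)
mse_linear = np.mean((y_te - pred_linear) ** 2)
print(f"intercept only: {mse_const:.3f}, linear: {mse_linear:.3f}")
```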

2.4. Overfitting, Regularization, and Tuning Parameters

The ML literature is much more concerned with overfitting than the standard statistics or econometrics literatures. Researchers attempt to select flexible models that fit well, but not so well that out-of-sample prediction is compromised. There is much less emphasis on formal results that particular methods are superior in large samples (asymptotically); instead, methods are compared on specific data sets to see what works well. A key concept is that of regularization. As Vapnik (2013, p. 9) writes, "Regularization theory was one of the first signs of the existence of intelligent inference."
Consider a setting with a large set of models that differ in their complexity, measured, for example, as the number of unknown parameters in the model or, more subtly, through the Vapnik-Chervonenkis (VC) dimension that measures the capacity or complexity of a space of models. Instead of directly optimizing an objective function, say, minimizing the sum of squared residuals in a least squares regression setting or maximizing the logarithm of the likelihood function, a term is added to the objective function to penalize the complexity of the model. There are antecedents of this practice in the traditional econometrics and statistics literatures. One is that, in likelihood settings, researchers sometimes add a term to the logarithm of the likelihood function equal to minus the logarithm of the sample size times the number of free parameters divided by two, leading to the Bayesian information criterion, or simply minus the number of free parameters, leading to the Akaike information criterion. In Bayesian analyses of regression models, the use of a prior distribution on the regression parameters, centered at zero, independent across parameters with a constant prior variance, is another way of regularizing estimation that has a long tradition. The modern approaches to regularization are different in that they are more data driven, with the amount of regularization determined explicitly by the out-of-sample predictive performance rather than by, for example, a subjectively chosen prior distribution.
Consider a linear regression model with $K$ regressors,
$$Y_i = \beta^\top X_i + \varepsilon_i, \qquad \varepsilon_i \mid X_i \sim \mathcal{N}(0, \sigma^2).$$
Suppose that we also have a prior distribution for the slope coefficients $\beta_k$, with the prior $\beta_k \sim \mathcal{N}(0, \tau^2)$ for all $k$, and $\beta_k$ independent of $\beta_l$ for any $l \neq k$. (This may be more plausible if we first normalize the features and outcome to have mean zero and unit variance. We assume that this has been done.) Given the value for the variance of the prior distribution, $\tau^2$, the posterior mean for $\beta$ is the solution to
$$\min_\beta \sum_{i=1}^N \big(Y_i - \beta^\top X_i\big)^2 + \lambda \sum_{k=1}^K \beta_k^2,$$
where $\lambda = \sigma^2 / \tau^2$. One version of an ML approach to this problem is to estimate $\beta$ by minimizing the same penalized objective,
$$\sum_{i=1}^N \big(Y_i - \beta^\top X_i\big)^2 + \lambda \sum_{k=1}^K \beta_k^2.$$
The only difference is in the way the penalty parameter $\lambda$ is chosen. In a formal Bayesian approach, this reflects the (subjective) prior distribution on the parameters, and it would be chosen a priori. In an ML approach, $\lambda$ would be chosen through out-of-sample cross-validation to optimize the out-of-sample predictive performance. This is closer to an empirical Bayes approach, where the data are used to estimate the prior distribution (e.g., Morris 1983).
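A minimal sketch of this contrast, assuming scikit-learn is available: the same ridge objective is estimated once with a fixed, prior-style penalty and once with the penalty chosen by cross-validation. The simulated design and penalty grid are illustrative choices, not part of the original discussion.

```python
import numpy as np
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Illustrative data: K = 50 normalized features, only a few with signal.
n, k = 300, 50
X = StandardScaler().fit_transform(rng.normal(size=(n, k)))
beta = np.zeros(k)
beta[:5] = 1.0
y = X @ beta + rng.normal(size=n)
y = y - y.mean()  # demean the outcome, as in the text

# A formal Bayesian would fix lambda = sigma^2 / tau^2 a priori ...
ridge_fixed = Ridge(alpha=10.0).fit(X, y)

# ... whereas the ML approach picks lambda by out-of-sample cross-validation.
lambdas = np.logspace(-3, 3, 25)
ridge_cv = RidgeCV(alphas=lambdas, cv=5).fit(X, y)

print("fixed-lambda coefficient norm:", np.linalg.norm(ridge_fixed.coef_))
print("cross-validated lambda:", ridge_cv.alpha_)
```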

2.5. Sparsity

In many settings in the ML literature, the number of features is substantial, both in absolute terms and relative to the number of units in the sample. However, there is often a sense that many of the features are of minor importance, if not completely irrelevant. The problem is that we may not know ex ante which of the features matter and which can be dropped from the analysis without substantially hurting the predictive power.
Hastie et al. (2015) discuss what they call the sparsity principle:

Assume that the underlying true signal is sparse and we use an $\ell_1$ penalty to try to recover it. If our assumption is correct, we can do a good job in recovering the true signal.... But if we are wrong - the underlying truth is not sparse in the chosen bases - then the $\ell_1$ penalty will not work well. However, in that instance, no method can do well, relative to the Bayes error. (Hastie et al. 2015, page 24)
Exact sparsity is in fact stronger than is necessary; in many cases it is sufficient to have approximate sparsity, where most of the explanatory variables have very limited explanatory power, even if not zero, and only a few of the features are of substantial importance (see, for example, Belloni et al. 2014).
Traditionally, in the empirical literature in social sciences, researchers limited the number of explanatory variables by hand, rather than choosing them in a data-dependent manner. Allowing the data to play a bigger role in the variable selection process appears to be a clear improvement, even if the assumption that the underlying process is at least approximately sparse is still a very strong one, and even if inference in the presence of data-dependent model selection can be challenging.

2.6. Computational Issues and Scalability

Compared to the traditional statistics and econometrics literatures, the ML literature is much more concerned with computational issues and the ability to implement estimation methods with large data sets. Solutions that may have attractive theoretical properties in terms of statistical efficiency but that do not scale well to large data sets are often discarded in favor of methods that can be implemented easily in very large data sets. This can be seen in the discussion of the relative merits of least absolute shrinkage and selection operator (LASSO) versus subset selection in linear regression settings. In a setting with a large number of features that might be included in the analysis, subset selection methods focus on selecting a subset of the regressors and then estimating the parameters of the regression function by least squares. However, LASSO has computational advantages. It can be implemented by adding a penalty term that is proportional to the sum of the absolute values of the parameters. A major attraction of LASSO is that there are effective methods for calculating the LASSO estimates with the number of regressors in the millions. Best subset selection regression, in contrast, is an NP-hard problem. Until recently, it was thought that this was only feasible in settings with the number of regressors in the 30s, although current research (Bertsimas et al. 2016) suggests that it may be feasible with the number of regressors in the 1,000s. This has reopened a new, still unresolved debate on the relative merits of LASSO versus best subset selection (see Hastie et al. 2017) in settings where both are feasible. There are some indications that, in settings with a low signal-to-noise ratio, as is common in many social science applications, LASSO may have better performance, although there remain many open questions. In many social science applications, the scale of the problems is such that best subset selection is also feasible, and the computational issues may be less important than these substantive aspects of the problems.
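As a rough illustration of the computational point, the sketch below (with invented dimensions and data) fits a cross-validated LASSO with a couple of thousand candidate regressors using scikit-learn's coordinate-descent implementation; an exhaustive best subset search at this scale would not be feasible.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(2)

# Illustrative sparse design: 500 observations, 2,000 candidate regressors,
# of which only 10 actually matter.
n, k = 500, 2000
X = rng.normal(size=(n, k))
beta = np.zeros(k)
beta[:10] = 1.0
y = X @ beta + rng.normal(size=n)

# LassoCV fits the l1-penalized regression along a path of penalty levels and
# selects the level by cross-validation; this remains fast even though an
# exhaustive search over subsets of 2,000 regressors would be infeasible.
lasso = LassoCV(cv=5, n_alphas=50).fit(X, y)
print("selected penalty:", lasso.alpha_)
print("number of nonzero coefficients:", np.sum(lasso.coef_ != 0))
```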
A key computational optimization tool used in many ML methods is stochastic gradient descent (SGD) (Bottou 1998, 2012; Friedman 2002). It is used in a wide variety of settings, including in optimizing neural networks and estimating models with many latent variables (e.g., Ruiz et al. 2017). The idea is very simple. Suppose that the goal is to estimate a parameter $\theta$, and that the estimation approach entails finding the value that minimizes an empirical loss function, where $L_i(\theta)$ is the loss for observation $i$, and the overall loss is the sum $L(\theta) = \sum_{i=1}^N L_i(\theta)$, with derivative $\nabla L(\theta) = \sum_{i=1}^N \nabla L_i(\theta)$. Classic gradient descent methods involve an iterative approach, where $\theta_{k+1}$ is updated from $\theta_k$ as follows:
$$\theta_{k+1} = \theta_k - \eta_k \nabla L(\theta_k) = \theta_k - \eta_k \sum_{i=1}^N \nabla L_i(\theta_k),$$
where $\eta_k$ is the learning rate, often chosen optimally through line search. More sophisticated optimization methods multiply the first derivative by the inverse of the matrix of second derivatives or estimates thereof.
The challenge with this approach is that it can be computationally expensive. The computational cost is in evaluating the full derivative $\nabla L(\theta)$ and even more in optimizing the learning rate $\eta_k$. The idea behind SGD is that it is better to take many small steps that are noisy but, on average, in the right direction than it is to spend equivalent computational cost in very accurately figuring out in what direction to take a single small step. More specifically, SGD uses the fact that the average of $\nabla L_i(\theta)$ for a random subset of the sample is an unbiased (but noisy) estimate of the gradient. For example, dividing the data randomly into 10 subsets or batches, with $B_i \in \{1, \ldots, 10\}$ denoting the subset unit $i$ belongs to, one could do 10 steps of the type
$$\theta_{k+1} = \theta_k - \eta_k \sum_{i: B_i = k} \nabla L_i(\theta_k), \qquad k = 1, \ldots, 10,$$
with a deterministic learning rate $\eta_k$. After the 10 iterations, one could reshuffle the data set and then repeat. If the learning rate $\eta_k$ decreases at an appropriate rate, then under relatively mild assumptions, SGD converges almost surely to a global minimum when the objective function is convex or pseudoconvex and otherwise converges almost surely to a local minimum. Bottou (2012) provides an overview and practical tips for implementation.
The idea can be pushed even further in the case where the loss $L_i(\theta)$ is itself an expectation. We can consider evaluating this expectation using Monte Carlo integration. However, rather than taking many Monte Carlo draws to get an accurate approximation to the integral, we can instead take a small number of draws or even a single draw. This type of approximation is used in economic applications by Ruiz et al. (2017) and Hartford et al. (2016).
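The batch updates above translate directly into code. The following sketch (ours) runs mini-batch SGD on a least squares loss with a deterministic, decreasing learning rate; the loss, batch count, and schedule are illustrative choices rather than recommendations.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative least squares problem: L_i(theta) = (y_i - x_i'theta)^2 / 2.
n, k = 1000, 10
X = rng.normal(size=(n, k))
theta_true = rng.normal(size=k)
y = X @ theta_true + rng.normal(size=n)

def batch_gradient(theta, idx):
    """Average of the per-observation gradients over the batch idx."""
    resid = X[idx] @ theta - y[idx]
    return X[idx].T @ resid / len(idx)

theta = np.zeros(k)
n_batches = 10
for epoch in range(50):
    perm = rng.permutation(n)   # reshuffle the data each pass
    eta = 0.1 / (1 + epoch)     # deterministic, decreasing learning rate
    for batch in np.array_split(perm, n_batches):
        theta = theta - eta * batch_gradient(theta, batch)

print("distance to the true theta:", np.linalg.norm(theta - theta_true))
```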

2.7. Ensemble Methods and Model Averaging

Another key feature of the ML literature is the use of model averaging and ensemble methods (e.g., Dietterich 2000). In many cases, a single model or algorithm does not perform as well as a combination of possibly quite different models, averaged using weights (sometimes called votes) obtained by optimizing out-of-sample performance. A striking example is the Netflix Prize competition (Bennett & Lanning 2007), where all of the top contenders use combinations of models and often averages of many models (Bell & Koren 2007). There are two related ideas in the traditional econometrics literature. Obviously, Bayesian analysis implicitly averages over the posterior distribution of the parameters. Mixture models are also used to combine different parameter values in a single prediction. However, in both cases, this model averaging involves averaging over similar models, typically with the same specification, that are only different in terms of parameter values. In the modern literature, and in the top entries in the Netflix Prize competition, the models that are averaged over can be quite different, and the weights are obtained by optimizing out-of-sample predictive power, rather than in-sample fit.
For example, one may have three predictive models, one based on a random forest, leading to predictions $\hat Y_i^{\rm rf}$; one based on a neural net, with predictions $\hat Y_i^{\rm nn}$; and one based on a linear model estimated by LASSO, leading to $\hat Y_i^{\rm lasso}$. Then, using a test sample, one can choose weights $p^{\rm rf}$, $p^{\rm nn}$, and $p^{\rm lasso}$ by minimizing the sum of squared residuals in the test sample:
$$\min_{p^{\rm rf},\, p^{\rm nn},\, p^{\rm lasso}} \sum_{i \in \text{test sample}} \big(Y_i - p^{\rm rf}\hat Y_i^{\rm rf} - p^{\rm nn}\hat Y_i^{\rm nn} - p^{\rm lasso}\hat Y_i^{\rm lasso}\big)^2,$$
subject to the weights being nonnegative and summing to one.
One may also estimate weights based on regression of the outcomes in the test sample on the predictors from the different models, without imposing that the weights sum to one and are nonnegative. Random forests, neural nets, and LASSO have distinct strengths and weaknesses in terms of how well they deal with the presence of irrelevant features, nonlinearities, and interactions. As a result, averaging over these models may lead to out-of-sample predictions that are strictly better than predictions based on a single model.
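A minimal sketch of this kind of ensembling, assuming scikit-learn and SciPy: three component models are fit on a training sample, and nonnegative weights (rescaled to sum to one as a simple approximation to the constraint) are chosen by least squares on a held-out test sample. The data and model settings are placeholders.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Placeholder data with a mix of linear and nonlinear signal.
n, k = 1000, 10
X = rng.normal(size=(n, k))
y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.5 * X[:, 2] * X[:, 3] + rng.normal(size=n)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Fit the three component models on the training sample.
models = {
    "rf": RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr),
    "nn": MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000,
                       random_state=0).fit(X_tr, y_tr),
    "lasso": LassoCV(cv=5).fit(X_tr, y_tr),
}

# Stack test-sample predictions and choose nonnegative least squares weights,
# then rescale so the weights sum to one (a crude way to impose the constraint).
P = np.column_stack([m.predict(X_te) for m in models.values()])
w, _ = nnls(P, y_te)
w = w / w.sum()
print(dict(zip(models.keys(), np.round(w, 3))))
```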
In a panel data context (Athey et al. 2019), one can use ensemble methods combining various forms of synthetic control and matrix completion methods and find that the combinations outperform the individual methods.

2.8. Inference

The ML literature has focused heavily on out-of-sample performance as the criterion of interest. This has come at the expense of one of the concerns that the statistics and econometrics literatures have traditionally focused on, namely, the ability to do inference, e.g., construct confidence intervals that are valid, at least in large samples. Efron & Hastie (2016, p. 209) write:
Prediction, perhaps because of its model-free nature, is an area where algorithmic developments have run far ahead of their inferential justification.
Although there has recently been substantial progress in the development of methods for inference for low-dimensional functionals in specific settings [e.g., the work of Wager & Athey (2017) in the context of random forests and of Farrell et al. (2018) in the context of neural networks], it remains the case that, for many methods, it is currently impossible to construct confidence intervals that are valid, even if only asymptotically. One question is whether this ability to construct confidence intervals is as important as the traditional emphasis on it in the econometric literature suggests. For many decision problems, it may be that prediction is of primary importance, and inference is at best of secondary importance. Even in cases where it is possible to do inference, it is important to keep in mind that the requirements that ensure this ability often come at the expense of predictive performance. One can see this tradeoff in traditional kernel regression, where the bandwidth that optimizes expected squared error balances the tradeoff between the square of the bias and the variance, so that the optimal estimators have an asymptotic bias that invalidates the use of standard confidence intervals. This can be fixed by using a bandwidth that is smaller than the optimal one, so that the asymptotic bias vanishes, but it does so explicitly at the expense of increasing the variance.

3. SUPERVISED LEARNING FOR REGRESSION PROBLEMS

One of the canonical problems in both the ML and econometric literatures is that of estimating the conditional mean of a scalar outcome given a set of covariates or features. Let $Y_i$ denote the outcome for unit $i$, and let $X_i$ denote the $K$-component vector of covariates or features. The conditional expectation is
$$g(x) = \mathbb{E}\big[Y_i \mid X_i = x\big].$$
Compared to the traditional econometric textbooks (e.g., Greene 2000, Angrist & Pischke 2008, Wooldridge 2010), there are some conceptual differences in the ML literature (for discussion, see Mullainathan & Spiess 2017). In the settings considered in the ML literature, there are often many covariates, sometimes more than there are units in the sample. There is no presumption in the ML literature that the conditional distribution of the outcomes given the covariates follows a particular parametric model. The derivatives of the conditional expectation for each of the covariates, which in the linear regression model correspond to the parameters, are not of intrinsic interest. Instead, the focus is on out-of-sample predictions and their accuracy. Furthermore, there is less of a sense that the conditional expectation is monotone in each of the covariates compared to many economic applications. There is often concern that the conditional expectation may be an extremely nonmonotone function with some higher-order interactions of substantial importance.
The econometric literature on estimating the conditional expectation $g(\cdot)$ is also huge. Parametric methods for estimating $g(\cdot)$ often use least squares. Since the work of Bierens (1987), kernel regression methods have become a popular alternative when more flexibility is required, and series or sieve methods have subsequently gained interest (for a survey, see Chen 2007). These methods have well-established large sample properties, allowing for the construction of confidence intervals. Simple nonnegative kernel methods are viewed as performing very poorly in settings with high-dimensional covariates, with the difference $\hat g(x) - g(x)$ of order $O_p\big(N^{-2/(K+4)}\big)$. This rate can be improved by using higher-order kernels and assuming the existence of many derivatives of $g(\cdot)$, but practical experience with high-dimensional covariates has not been satisfactory for these methods, and applications of kernel methods in econometrics are generally limited to low-dimensional settings.
The differences in performance between some of the traditional methods such as kernel regression and the modern methods such as random forests are particularly pronounced in sparse settings with a large number of more or less irrelevant covariates. Random forests are effective at picking up on the sparsity and ignoring the irrelevant features, even if there are many of them, while the traditional implementations of kernel methods essentially waste degrees of freedom on accounting for these covariates. Although it may be possible to adapt kernel methods for the presence of irrelevant covariates by allowing for covariate-specific bandwidths, in practice there has been little effort in this direction. A second issue is that the modern methods are particularly good at detecting severe nonlinearities and high-order interactions. The presence of such high-order interactions in some of the success stories of these methods should not blind us to the fact that, with many economic data, we expect high-order interactions to be of limited importance. If we try to predict earnings for individuals, then we expect the regression function to be monotone in many of the important predictors such as education and prior earnings variables, even for homogeneous subgroups. This means that models based on linearizations may do well in such cases relative to other methods, compared to settings where monotonicity is fundamentally less plausible, as, for example, in an image recognition problem. This is also a reason for the superior performance of locally linear random forests (Friedberg et al. 2018) relative to standard random forests.
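A small simulation of the kind sketched below (our illustration, not from the paper) makes the point about irrelevant covariates: as noise features are added, cross-validated mean squared error deteriorates quickly for k-nearest-neighbor regression but much less so for a random forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)

n = 1000
x_signal = rng.uniform(-2, 2, size=(n, 2))   # two covariates that matter
y = x_signal[:, 0] ** 2 + x_signal[:, 1] + rng.normal(scale=0.5, size=n)

for n_noise in [0, 10, 50]:
    # Append irrelevant covariates with no predictive power.
    X = np.hstack([x_signal, rng.normal(size=(n, n_noise))])
    for name, model in [("knn", KNeighborsRegressor(n_neighbors=10)),
                        ("forest", RandomForestRegressor(n_estimators=200,
                                                         random_state=0))]:
        mse = -cross_val_score(model, X, y, cv=5,
                               scoring="neg_mean_squared_error").mean()
        print(f"{n_noise:3d} noise covariates, {name:6s}: MSE {mse:.2f}")
```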
We discuss four specific sets of methods, although there are many more, including variations on the basic methods. First, we discuss methods where the class of models considered is linear in the covariates, and the question is solely about regularization. Second, we discuss methods based on partitioning the covariate space using regression trees and random forests. Third, we discuss neural nets, which were the focus of a small econometrics literature in the 1990s (Hornik et al. 1989, White 1992) but more recently have become a very prominent part of the literature on ML in various subtle reincarnations. Fourth, we discuss boosting as a general principle.

3.1. Regularized Linear Regression: LASSO, Ridge, and Elastic Nets

Suppose that we consider approximations to the conditional expectation that have a linear form
$$g(x) = \beta^\top x,$$
after the covariates and the outcome are demeaned, and the covariates are normalized to have unit variance. The traditional method for estimating the regression function in this case is least squares, with
$$\hat\beta_{\rm ls} = \arg\min_\beta \sum_{i=1}^N \big(Y_i - \beta^\top X_i\big)^2.$$
However, if the number of covariates $K$ is large relative to the number of observations $N$, then the least squares estimator does not even have particularly good repeated sampling properties as an estimator for $\beta$, let alone good predictive properties. In fact, with $K \geq 3$, the least squares estimator is not even admissible and is dominated by estimators that shrink toward zero. With $K$ very large, possibly even exceeding the sample size $N$, the least squares estimator has particularly poor properties, even if the conditional mean of the outcome given the covariates is in fact linear.
Even with $K$ modest in magnitude, the predictive properties of the least squares estimator may be inferior to those of estimators that use some amount of regularization. One common form of regularization is to add a penalty term that shrinks the $\beta_k$ toward zero and minimize
$$\sum_{i=1}^N \big(Y_i - \beta^\top X_i\big)^2 + \lambda \sum_{k=1}^K |\beta_k|^q,$$
where $q \geq 0$ and $\lambda > 0$. For $q = 1$, this corresponds to LASSO (Tibshirani 1996). For $q = 2$, this corresponds to ridge regression (Hoerl & Kennard 1970). As $q \rightarrow 0$, the solution penalizes the number of nonzero covariates, leading to best subset regression (Miller 2002, Bertsimas et al. 2016). In addition, there are many hybrid methods and modifications, including elastic nets, which combine penalty terms from LASSO and ridge (Zou & Hastie 2005); the relaxed LASSO, which combines least squares estimates from the subset selected by LASSO and the LASSO estimates themselves (Meinshausen 2007); least angle regression (Efron et al. 2004); the Dantzig selector (Candès & Tao 2007); and the non-negative garrotte (Breiman 1993).
There are a couple of important conceptual differences among these three special cases, subset selection, LASSO, and ridge regression (for a recent discussion, see Hastie et al. 2017). First, both best subset and LASSO lead to solutions with a number of the regression coefficients exactly equal to zero, a sparse solution. For the ridge estimator, in contrast, all of the estimated regression coefficients will generally differ from zero. It is not always important to have a sparse solution, and the variable selection that is implicit in these solutions is often overinterpreted. Second, best subset regression is computationally hard (NP-hard) and, as a result, not feasible in settings with $N$ and $K$ large, although progress has recently been made in this regard (Bertsimas et al. 2016). LASSO and ridge regression have a Bayesian interpretation. Ridge regression gives the posterior mean and mode under a normal model for the conditional distribution of $Y_i$ given $X_i$, and normal prior distributions for the parameters. LASSO gives the posterior mode given Laplace prior distributions. However, in contrast to formal Bayesian approaches, the coefficient $\lambda$ on the penalty term is, in the modern literature, chosen through out-of-sample cross-validation, rather than subjectively through the choice of prior distribution.
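The practical difference between the $q = 1$ and $q = 2$ penalties is easy to see in code. The sketch below, on an invented approximately sparse design, fits LASSO, ridge, and an elastic net with cross-validated penalties and counts how many coefficients are set exactly to zero.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)

# Illustrative approximately sparse design: a few large coefficients, many tiny ones.
n, k = 400, 100
X = StandardScaler().fit_transform(rng.normal(size=(n, k)))
beta = np.zeros(k)
beta[:5] = 2.0
beta[5:] = 0.01
y = X @ beta + rng.normal(size=n)
y = y - y.mean()

fits = {
    "lasso (q=1)": LassoCV(cv=5).fit(X, y),
    "ridge (q=2)": RidgeCV(alphas=np.logspace(-3, 3, 25), cv=5).fit(X, y),
    "elastic net": ElasticNetCV(cv=5).fit(X, y),
}
for name, fit in fits.items():
    zeros = np.sum(fit.coef_ == 0)
    print(f"{name}: {zeros} of {k} coefficients exactly zero")
```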

3.2. Regression Trees and Forests

Regression trees (Breiman et al. 1984) and their extension, random forests (Breiman 2001a), have become very popular and effective methods for flexibly estimating regression functions in settings where out-of-sample predictive power is important. They are considered to have great out-of-the-box performance without requiring subtle tuning. Given a sample $(X_i, Y_i)$, for $i = 1, \ldots, N$, the idea is to split the sample into subsamples and estimate the regression function within the subsamples simply as the average outcome. The splits are sequential and based on a single covariate $X_{ik}$ at a time exceeding a threshold $t$. Starting with the full training sample, consider a split based on feature or covariate $k$ and threshold $t$. The sum of in-sample squared errors before the split is
$$Q = \sum_{i=1}^N \big(Y_i - \bar Y\big)^2, \qquad \text{where } \bar Y = \frac{1}{N} \sum_{i=1}^N Y_i.$$
After a split based on covariate $k$ and threshold $t$, the sum of in-sample squared errors is
$$Q(k, t) = \sum_{i: X_{ik} \le t} \big(Y_i - \bar Y^{\,\mathrm{l}}_{k,t}\big)^2 + \sum_{i: X_{ik} > t} \big(Y_i - \bar Y^{\,\mathrm{r}}_{k,t}\big)^2,$$
where (with l and r denoting left and right)
$$\bar Y^{\,\mathrm{l}}_{k,t} = \frac{\sum_{i: X_{ik} \le t} Y_i}{\sum_{i=1}^N \mathbf{1}\{X_{ik} \le t\}} \qquad \text{and} \qquad \bar Y^{\,\mathrm{r}}_{k,t} = \frac{\sum_{i: X_{ik} > t} Y_i}{\sum_{i=1}^N \mathbf{1}\{X_{ik} > t\}}$$
are the average outcomes in the two subsamples. We split the sample using the covariate $k$ and threshold $t$ that minimize the average squared error over all covariates $k$ and all thresholds $t$. We then repeat this, optimizing also over the subsamples or leaves. At each split, the average squared error is further reduced (or stays the same). We therefore need some regularization to avoid the overfitting that would result from splitting the sample too many times. One approach is to add a penalty term to the sum of squared residuals that is linear in the number of subsamples (the leaves). The coefficient on this penalty term is then chosen through cross-validation. In practice, a very deep tree is estimated, and then pruned to a more shallow tree using cross-validation to select the optimal tree depth. The sequence of first growing and then pruning the tree avoids splits that may be missed because their benefits rely on subtle interactions.
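The split criterion can be written in a few lines. The sketch below (our own, simplified to a single split rather than a full recursive tree) searches over all covariates and candidate thresholds for the split minimizing the in-sample sum of squared errors.

```python
import numpy as np

def best_split(X, y):
    """Return the covariate k, threshold t, and value of Q(k, t) minimizing the criterion."""
    best = (None, None, np.inf)
    for k in range(X.shape[1]):
        # Candidate thresholds: midpoints between sorted covariate values.
        values = np.unique(X[:, k])
        for t in (values[:-1] + values[1:]) / 2:
            left, right = y[X[:, k] <= t], y[X[:, k] > t]
            q = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if q < best[2]:
                best = (k, t, q)
    return best

rng = np.random.default_rng(7)
X = rng.uniform(size=(200, 3))
y = np.where(X[:, 1] > 0.5, 2.0, 0.0) + rng.normal(scale=0.3, size=200)
print(best_split(X, y))  # should pick covariate 1 with a threshold near 0.5
```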
An advantage of a single tree is that it is easy to explain and interpret results. Once the tree structure is defined, the prediction in each leaf is a sample average, and the standard error of that sample average is easy to compute. However, it is not, in general, true that the sample average of the mean within a leaf is an unbiased estimate of what the mean would be within that same leaf in a new test set. Since the leaves were selected using the data, the leaf sample means in the training data will tend to be more extreme (in the sense of being different from the overall sample mean) than in an independent test set. Athey & Imbens (2016) suggest sample splitting as a way to avoid this issue. If a confidence interval for the prediction is desired, then the analyst can simply split the data in half. One half of the data are used to construct a regression tree. Then, the partition implied by this tree is taken to the other half of the data, where the sample mean within a given leaf is an unbiased estimate of the true mean value for the leaf.
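A minimal sketch of this sample-splitting idea, assuming scikit-learn's DecisionTreeRegressor: the partition is learned on one half of the data, and leaf means and standard errors are then recomputed on the other half.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(8)
X = rng.normal(size=(2000, 5))
y = X[:, 0] + (X[:, 1] > 0) + rng.normal(size=2000)

# Half the data builds the partition; the other half estimates leaf means.
X_build, X_est, y_build, y_est = train_test_split(X, y, test_size=0.5, random_state=0)
tree = DecisionTreeRegressor(max_leaf_nodes=8, random_state=0).fit(X_build, y_build)

# apply() maps each observation to the id of the leaf it falls in.
leaf_est = tree.apply(X_est)
for leaf in np.unique(leaf_est):
    y_leaf = y_est[leaf_est == leaf]
    mean = y_leaf.mean()
    se = y_leaf.std(ddof=1) / np.sqrt(len(y_leaf))
    print(f"leaf {leaf}: honest mean {mean:.2f} (s.e. {se:.2f}, n = {len(y_leaf)})")
```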
Although trees are easy to interpret, it is important not to go too far in interpreting the structure of the tree, including the selection of variables used for the splits. Standard intuitions from econometrics about omitted variable bias can be useful in this case. Particular covariates that have strong associations with the outcome may not show up in splits because the tree splits on covariates highly correlated with those covariates.
One way to interpret a tree is that it is an alternative to kernel regression. Within each tree, the prediction for a leaf is simply the sample average outcome within the leaf. Thus, we can think of the leaf as defining the set of nearest neighbors for a given target observation in a leaf, and the estimator from a single regression tree is a matching estimator with nonstandard ways of selecting the nearest neighbor to a target point. In particular, the neighborhoods will prioritize some covariates over others in determining which observations qualify as nearby. Figure 1 illustrates the difference between kernel regression and a tree-based matching algorithm for the case of two

Figure 1
(a) Euclidean neighborhood for $k$-nearest neighbor (KNN) matching. (b) Tree-based neighborhood.
Kernel regression will create a neighborhood around a target observation based on the Euclidean distance to each point, while tree-based neighborhoods will be rectangles. In addition, a target observation may not be in the center of a rectangle. Thus, a single tree is generally not the best way to predict outcomes for any given test point $x$. When a prediction tailored to a specific target observation is desired, generalizations of tree-based methods can be used.
For better estimates of the regression function, random forests (Breiman 2001a) build on the regression tree algorithm. A key issue that random forests address is that the estimated regression function from a single tree is discontinuous, with more substantial jumps than one might like. Random forests induce smoothness by averaging over a large number of trees. These trees differ from each other in two ways. First, each tree is based not on the original sample, but on a bootstrap sample [known as bagging (Breiman 1996)] or, alternatively, on a subsample of the data. Second, the splits at each stage are not optimized over all possible covariates, but rather over a random subset of the covariates, which changes with every split. These two modifications lead to sufficient variation across the trees that the average is relatively smooth (although still discontinuous) and, more importantly, has better predictive power than a single tree.
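Both sources of randomization correspond directly to options in standard implementations. A schematic example (scikit-learn, with simulated data and tuning choices that are ours):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 20))
y = np.maximum(X[:, 0], 0) + X[:, 1] * X[:, 2] + rng.normal(size=2000)

forest = RandomForestRegressor(
    n_estimators=500,      # average over many trees to smooth the fit
    max_features="sqrt",   # each split searches over a random subset of covariates
    bootstrap=True,        # each tree sees a random draw of the data ...
    max_samples=0.5,       # ... here, a draw of half the observations
    random_state=0,
).fit(X, y)

print(forest.predict(np.zeros((1, 20))))   # smoother than any single tree
```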
Random forests have become very popular methods. A key attraction is that they require relatively little tuning and have great performance out of the box compared to more complex methods such as deep learning neural networks. Random forests and regression trees are particularly effective in settings with a large number of features that are not related to the outcome, that is, settings with sparsity. The splits will generally ignore those covariates, and as a result, the performance will remain strong even in settings with a large number of features. Indeed, when comparing forests to kernel regression, a reliable way to improve the relative performance of random forests is to add irrelevant covariates that have no predictive power. These will rapidly degrade the performance of kernel regression but will not affect a random forest nearly as severely because it will largely ignore them (Wager & Athey 2017).
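The following toy simulation (ours, not from the cited paper) illustrates the point: adding irrelevant covariates degrades a nearest-neighbor regression far more than it degrades a random forest.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)

def test_mse(n_irrelevant):
    n = 2000
    X = rng.normal(size=(n, 2 + n_irrelevant))    # 2 relevant covariates plus noise covariates
    y = X[:, 0] + np.sin(3 * X[:, 1]) + rng.normal(scale=0.5, size=n)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)
    knn = KNeighborsRegressor(n_neighbors=10).fit(X_tr, y_tr)
    rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    return (mean_squared_error(y_te, knn.predict(X_te)),
            mean_squared_error(y_te, rf.predict(X_te)))

for p in (0, 20):
    knn_mse, rf_mse = test_mse(p)
    print(f"{p} irrelevant covariates: KNN MSE = {knn_mse:.2f}, forest MSE = {rf_mse:.2f}")
```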
Although the statistical analysis of forests has proved elusive since Breiman's original work, Wager & Athey (2017) show that a particular variant of random forests can produce estimates $\hat{\mu}(x)$ with an asymptotically normal distribution centered on the true value $\mu(x)$; furthermore, they provide an estimate of the variance of the estimator so that centered confidence intervals can be constructed. The variant that they study uses subsampling rather than bagging.

Figure 2
(a) Different trees in a random forest generating weights for test point $x$. (b) The kernel based on the share of trees in the same leaf as test point $x$.
Furthermore, each tree is built using two disjoint subsamples, one used to define the tree and the second used to estimate the sample means for each leaf. This honest estimation is crucial for the asymptotic analysis.
Random forests can be connected to traditional econometric methods in several ways. Returning to the kernel regression comparison, since each tree is a form of matching estimator, the forest is an average of matching estimators. As Figure 2 illustrates, by averaging over trees, the prediction for each point will be centered on the test point (except near boundaries of the covariate space). However, the forest prioritizes more important covariates for selecting matches in a data-driven way. Another way to interpret random forests (e.g., Athey et al. 2016b) is that they generate weighting functions analogous to kernel weighting functions. For example, a kernel regression makes a prediction at a point $x$ by averaging nearby points but weighting closer points more heavily. A random forest, by averaging over many trees, will include nearby points more often than distant points. We can formally derive a weighting function for a given test point by counting the share of trees in which a particular observation is in the same leaf as the test point. Then, random forest predictions can be written as
$$\hat{\mu}(x)=\sum_{i=1}^{N}\alpha_i(x)\,Y_i,$$
where the weights $\alpha_i(x)$ encode the weight given by the forest to the $i$th training example when predicting at $x$. The difference between typical kernel weighting functions and forest-based weighting functions is that the forest weights are adaptive; if a covariate has little effect on the outcome, it will not be used in splitting leaves, and thus the weighting function will not be very sensitive to distance along that covariate.
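To make the weighting interpretation concrete, the weights $\alpha_i(x)$ can be recovered from a fitted forest by counting, for each training observation, the share of trees in which it falls in the same leaf as the test point. The sketch below does this with scikit-learn; dedicated implementations (e.g., the R package GRF mentioned below) handle this internally, and the reconstructed prediction matches the forest's own prediction only up to bootstrap resampling.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
X = rng.normal(size=(1000, 5))
y = X[:, 0] + rng.normal(size=1000)
forest = RandomForestRegressor(n_estimators=300, min_samples_leaf=20,
                               random_state=0).fit(X, y)

x0 = np.zeros((1, 5))                 # the test point x
leaf_train = forest.apply(X)          # (n, n_trees) leaf ids for the training data
leaf_x0 = forest.apply(x0)            # (1, n_trees) leaf ids for the test point

# alpha_i(x): within each tree, spread weight 1 over the training points sharing
# the test point's leaf; then average over trees.
alpha = np.zeros(len(X))
for b in range(forest.n_estimators):
    in_leaf = leaf_train[:, b] == leaf_x0[0, b]
    alpha[in_leaf] += 1.0 / in_leaf.sum()
alpha /= forest.n_estimators

print(alpha @ y)                      # the weighted-average representation
print(forest.predict(x0)[0])          # the forest's own prediction
```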
Recently, random forests have been extended to settings where the interest is in causal effects, either average or unit level (Wager & Athey 2017), as well as to estimating parameters in general economic models that can be estimated with maximum likelihood or the generalized method of moments (GMM) (Athey et al. 2016b). In the latter case, the interpretation of the forest as creating a weighting function is operationalized; the new generalized random forest algorithm operates in two steps. First, a forest is constructed, and second, a GMM model is estimated for each test point, where points that are nearby, in the sense of frequently occurring in the same leaf as the test point, are weighted more heavily in estimation. With an appropriate version of honest estimation, these forests produce parameter estimates with an asymptotically normal distribution. Generalized random forests can be thought of as a generalization of local maximum likelihood, introduced by Tibshirani & Hastie (1987), in which kernel weighting functions are used to weight nearby observations more heavily than observations distant from a particular test point.
A weakness of forests is that they are not very efficient at capturing linear or quadratic effects or at exploiting smoothness of the underlying data-generating process. In addition, near the boundaries of the covariate space, they are likely to have bias because the leaves of the component trees of the random forest cannot be centered on points near the boundary. Traditional econometrics encounters this boundary bias problem in analyses of regression discontinuity designs where, for example, geographical boundaries of school districts or test score cutoffs determine eligibility for schools or programs (Imbens & Lemieux 2008). The solution proposed in the econometrics literature, for example, in the matching literature (Abadie & Imbens 2011), is to use local linear regression, which is a regression with nearby points weighted more heavily. Suppose that the conditional mean function is increasing as it approaches the boundary. Then, the local linear regression corrects for the fact that, at a test point near the boundary, most sample points lie in a region with lower conditional mean than the conditional mean at the boundary. Friedberg et al. (2018) extend the generalized random forest framework to local linear forests, which are constructed by running a regression weighted by the weighting function derived from a forest. In their simplest form, local linear forests just take the forest weights $\alpha_i(x)$ and use them for local regression:
$$\bigl(\hat{\mu}(x),\hat{\theta}(x)\bigr)=\arg\min_{\mu,\theta}\;\sum_{i=1}^{N}\alpha_i(x)\bigl(Y_i-\mu-(X_i-x)^{\top}\theta\bigr)^2.$$
Performance can be improved by modifying the tree construction to incorporate a regression correction; in essence, splits are optimized for predicting residuals from a local regression. This algorithm performs better than traditional forests in settings where a regression can capture broad patterns in the conditional mean function, such as monotonicity or a quadratic structure, and, again, asymptotic normality is established. Figure 3 illustrates how local linear forests can improve on regular random forests: By fitting local linear regressions with a random forest-estimated kernel, the resulting predictions can match a simple polynomial function even in relatively small data sets. In contrast, a forest tends to have bias, particularly near boundaries, and in small data sets will have more of a step function shape. Although Figure 3 shows the impact in a single dimension, an advantage of the forest over a kernel is that these corrections can occur in multiple dimensions while still allowing the traditional advantages of a forest of uncovering more complex interactions among covariates.
Figure 3
Predictions from random forests and local linear forests on 600 test points. Training and test data were simulated from , with having dimension (19 covariates are irrelevant) and errors . Forests were trained on training points using the R package GRF and tuned via cross-validation. The true conditional mean signal is shown in black, and predictions are shown in red. Figure adapted with permission from Friedberg et al. (2018).
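In its simplest form, then, one can compute forest weights as in the sketch above and run a weighted linear regression centered at the test point; the intercept is the local linear forest prediction. The code below is schematic (it omits the modified splitting rule and the regularization used in full implementations), with data and tuning choices that are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 5))
y = np.log(1 + np.exp(3 * X[:, 0])) + rng.normal(size=1000)   # smooth, monotone signal
forest = RandomForestRegressor(n_estimators=300, min_samples_leaf=20,
                               random_state=0).fit(X, y)

x0 = np.zeros((1, 5))
leaf_train, leaf_x0 = forest.apply(X), forest.apply(x0)
alpha = np.zeros(len(X))
for b in range(forest.n_estimators):
    in_leaf = leaf_train[:, b] == leaf_x0[0, b]
    alpha[in_leaf] += 1.0 / in_leaf.sum()
alpha /= forest.n_estimators

# Local linear step: regress Y on (X - x0), weighting by the forest kernel.
llf = LinearRegression().fit(X - x0, y, sample_weight=alpha)
print("local linear forest prediction:", llf.intercept_)
print("plain random forest prediction:", forest.predict(x0)[0])
```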

3.3. Deep Learning and Neural Nets

Using neural networks and related deep learning methods is another general and flexible approach to estimating regression functions. They have been found to be very successful in complex settings with extremely large numbers of features. However, in practice, these methods require a substantial amount of tuning to work well for a given application relative to methods such as random forests. Neural networks were studied in the econometric literature in the 1990s but did not catch on at the time (see Hornik et al. 1989, White 1992).
Let us consider a simple example. Given $K$ covariates (features) $X_{i1},\ldots,X_{iK}$, we model $M_1$ latent or unobserved variables $Z_{i1}^{(1)},\ldots,Z_{iM_1}^{(1)}$ (hidden nodes) that are linear in the original covariates:
$$Z_{im}^{(1)}=\sum_{k=1}^{K}\beta_{km}^{(1)}X_{ik},\qquad m=1,\ldots,M_1.$$
We then modify these linear combinations using a simple nonlinear transformation, e.g., a sigmoid function,
$$g(z)=\frac{1}{1+\exp(-z)},$$
or a rectified linear function,
$$g(z)=\max(0,z),$$
and then model the outcome as a linear function of this nonlinear transformation of the hidden nodes plus noise:
$$Y_i=\sum_{m=1}^{M_1}\gamma_m\,g\bigl(Z_{im}^{(1)}\bigr)+\varepsilon_i.$$
This is a neural network with a single hidden layer with $M_1$ hidden nodes. The transformation $g(\cdot)$ introduces nonlinearities in the model. Even with this single layer, with many nodes, one can approximate arbitrarily well a rich set of smooth functions.
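This single-hidden-layer specification corresponds to a standard multilayer perceptron regressor; a minimal sketch with scikit-learn (simulated data, and layer size, activation, and penalty choices that are ours):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(6)
X = rng.normal(size=(2000, 10))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.3, size=2000)

# One hidden layer with 32 nodes; activation="logistic" gives the sigmoid g(z),
# activation="relu" the rectified linear g(z) = max(0, z).
net = MLPRegressor(hidden_layer_sizes=(32,), activation="relu",
                   alpha=1e-3,              # penalty on the squared coefficients
                   max_iter=2000, random_state=0).fit(X, y)
print(net.predict(np.zeros((1, 10))))
```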
It may be tempting to fit this into a standard framework and interpret this model simply as a complex, but fully parametric, specification for the potentially nonlinear conditional expectation of $Y_i$ given $X_i$:
$$\mathbb{E}[Y_i\mid X_i=x]=\sum_{m=1}^{M_1}\gamma_m\,g\Bigl(\sum_{k=1}^{K}\beta_{km}^{(1)}x_k\Bigr).$$
Given this interpretation, we can estimate the unknown parameters using nonlinear least squares. We could then derive the properties of the least squares estimators, and functions thereof, under standard regularity conditions. However, this interpretation of a neural net as a standard nonlinear model would be missing the point, for four reasons. First, it is likely that the asymptotic distributions for the parameter estimates would be poor approximations to the actual sampling distributions. Second, the estimators for the parameters would be poorly behaved, with likely substantial collinearity in the absence of careful regularization. Third, and more important, these properties are not of intrinsic interest. We are interested in the properties of the predictions from these specifications, and these can be quite attractive even if the properties of the parameter estimates are not. Fourth, we can make these models much more flexible, and at the same time make the properties of the corresponding least squares estimators of the parameters substantially less tractable and attractive, by adding layers to the neural network. A second layer of hidden nodes would have representations that are linear in the same transformation of linear combinations of the first layer of hidden nodes:
$$Z_{im}^{(2)}=\sum_{m'=1}^{M_1}\beta_{m'm}^{(2)}\,g\bigl(Z_{im'}^{(1)}\bigr),\qquad m=1,\ldots,M_2,$$
with the outcome now a function of the second layer of hidden nodes,
$$Y_i=\sum_{m=1}^{M_2}\gamma_m\,g\bigl(Z_{im}^{(2)}\bigr)+\varepsilon_i.$$
The depth of the network substantially increases the flexibility in practice, even if, with a single layer and many nodes, we can already approximate a very rich set of functions. Asymptotic properties for multilayer networks have recently been established by Farrell et al. (2018). In applications, researchers have used models with many layers, e.g., ten or more, and millions of parameters:
We observe that shallow models [models with few layers] in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn. (LeCun et al. 2015, p. 289)
In cases with multiple hidden layers and many hidden nodes, one needs to carefully regularize the parameter estimation, possibly through a penalty term that is proportional to the sum of the squared coefficients in the linear parts of the model. The architecture of the networks is also important. It is possible, as in the specification above, to have the hidden nodes at a particular layer be a linear function of all the hidden nodes of the previous layer, or to restrict them to a subset based on substantive considerations (e.g., proximity of covariates in some metric, such as location of pixels in a picture). Such convolutional networks have been very successful but require even more careful tuning (Krizhevsky et al. 2012).
Estimation of the parameters of the network is based on approximately minimizing the sum of the squared residuals, plus a penalty term that depends on the complexity of the model. This minimization problem is challenging, especially in settings with multiple hidden layers. The algorithms of choice use the back-propagation algorithm and variations thereon (Rumelhart et al. 1986) to calculate the exact derivatives with respect to the parameters of the unit-level terms in the objective function. These algorithms exploit in a clever way the hierarchical structure of the layers and the fact that each parameter enters only into a single layer. The algorithms then use stochastic gradient descent (Bottou 1998, 2012; Friedman 2002), described in Section 2.6, as a computationally efficient method for finding the approximate optimum.
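A stripped-down illustration of these ideas, exact gradients via the chain rule (back-propagation) for the single-hidden-layer model above combined with stochastic gradient descent on small random batches, might look as follows; this is a toy sketch with our own simulated data and step-size choices, not a production training loop.

```python
import numpy as np

rng = np.random.default_rng(7)
n, K, M = 2000, 5, 16
X = rng.normal(size=(n, K))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

beta = rng.normal(scale=0.5, size=(K, M))     # first-layer coefficients
gamma = rng.normal(scale=0.5, size=M)         # output-layer coefficients
lr, batch = 0.01, 32

for step in range(5000):
    idx = rng.integers(0, n, size=batch)      # a random minibatch
    Xb, yb = X[idx], y[idx]
    Z = Xb @ beta                             # hidden-node linear indices
    H = np.maximum(Z, 0.0)                    # rectified linear transformation g(Z)
    resid = H @ gamma - yb                    # prediction error
    # Back-propagation: exact derivatives of the squared-error objective.
    grad_gamma = H.T @ resid / batch
    grad_beta = Xb.T @ ((resid[:, None] * gamma) * (Z > 0)) / batch
    gamma -= lr * grad_gamma                  # stochastic gradient descent steps
    beta -= lr * grad_beta

print("training MSE:", np.mean((np.maximum(X @ beta, 0) @ gamma - y) ** 2))
```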

3.4. Boosting

Boosting is a general-purpose technique to improve the performance of simple supervised learning methods (for a detailed discussion, see Schapire & Freund 2012). Let us say that we are interested in prediction of an outcome given a substantial number of features. Suppose that we have a very simple algorithm for prediction, a simple base learner. For example, we could have a regression tree with three leaves, that is, a regression tree based on two splits, where we estimate the regression function as the average outcome in the corresponding leaf. Such an algorithm on its own would not lead to a very attractive predictor in terms of predictive performance because it uses at most two of the many possible features. Boosting improves this base learner in the following way. Take for all units in the training sample the residual from the prediction based on the simple three-leaf tree model, $\varepsilon_{i1}=Y_i-\hat{g}_1(X_i)$. Now we apply the same base learner (in this case, the two-split regression tree) with the residuals as the outcome of interest (and with the same set of original features). Let $\hat{g}_2(\cdot)$ denote the prediction from combining the first and second steps. Given this new tree, we can calculate the new residual, $\varepsilon_{i2}=Y_i-\hat{g}_2(X_i)$. We can then repeat this step, using the new residual as the outcome and again constructing a two-split regression tree. We can do this many times and get a prediction based on reestimating the basic model many times on the updated residuals.
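The recursion just described can be written in a few lines: repeatedly fit a small tree to the current residuals and add its predictions to the running fit. The sketch below (scikit-learn, simulated data) also shrinks each update by a small step size, a standard stabilization device that is our addition rather than part of the description above.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(8)
X = rng.normal(size=(2000, 10))
y = X[:, 0] + np.sin(2 * X[:, 1]) + X[:, 2] * X[:, 3] + rng.normal(size=2000)

fit = np.zeros(len(y))               # the running prediction
for _ in range(200):
    resid = y - fit                  # current residuals
    # The base learner: a regression tree with two splits (three leaves).
    base = DecisionTreeRegressor(max_leaf_nodes=3).fit(X, resid)
    fit += 0.1 * base.predict(X)     # small step ("shrinkage") for stability

print("training MSE:", np.mean((y - fit) ** 2))
```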
If we base our boosting algorithm on a regression tree with $d$ splits, then it turns out that the resulting predictor can approximate any regression function that can be written as the sum of functions of $d$ of the original features at a time. So, with $d=1$, we can approximate any function that is additive in the features, and with $d=2$, we can approximate any function that is additive in functions of two of the original features at a time, allowing for general second-order effects.
Boosting can also be applied using base learners other than regression trees. The key is to choose a base learner that is easy to apply many times without running into computational problems.

4. SUPERVISED LEARNING FOR CLASSIFICATION PROBLEMS

Classification problems are the focus of the other main branch of the supervised learning literature. The problem is, given a set of observations on a vector of features $X_i$ and a label $Y_i$ (an unordered discrete outcome), to find a function that assigns new units, on the basis of their features, to one of the labels. This is very closely related to discrete choice analysis in econometrics, where researchers specify statistical models that imply a probability that the outcome takes on a particular value, conditional on the covariates (features). Given such a probability, it is, of course, straightforward to predict a unique label, namely the one with the highest probability. However, there are differences between the two approaches. An important one is that, in the classification literature, the focus is often solely on the classification, the choice of a single label. One can classify given a probability for each label, but one does not need such a probability to do the classification. Many of the classification methods do not, in fact, first estimate a probability for each label, and so are not directly relevant in settings where such a probability is required. A practical difference is that the classification literature has often focused on settings where, ultimately, the covariates allow one to assign the label with almost complete certainty, as opposed to settings where even the best methods have high error rates.
The classic example is that of digit recognition. Based on a picture, coded as a set of, say, 16 or 256 black and white pixels, the challenge is to classify the image as corresponding to one of the ten digits from 0 to 9. In this case, ML methods have been spectacularly successful. Support vector machines (SVMs) (Cortes & Vapnik 1995) greatly outperformed other methods in the 1990s. More recently, deep convolutional neural networks (Krizhevsky et al. 2012) have reduced error rates even further.
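A small-scale version of this task can be run on the 8 × 8 digit images that ship with scikit-learn, using a support vector machine classifier (the kernel and tuning parameters below are illustrative choices of ours):

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()                      # 8 x 8 pixel images of the digits 0-9
X_tr, X_te, y_tr, y_te = train_test_split(digits.data, digits.target,
                                           test_size=0.5, random_state=0)
clf = SVC(kernel="rbf", gamma=0.001, C=10).fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```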

4.1. Classification Trees and Forests

Trees and random forests are easily modified from a focus on estimation of regression functions to classification tasks (for a general discussion, see Breiman et al. 1984). Again, we start by splitting the sample into two leaves, based on a single covariate exceeding or not exceeding a threshold. We optimize the split over the choice of covariate and the threshold. The difference between the regression case and the classification case is in the objective function that measures the improvement from a particular split. In classification problems, this is called the impurity function. It measures, as a function of the shares of units in a given leaf with a particular label, how impure that particular leaf is. If there are only two labels, then we could simply assign the labels the numbers zero and one, interpret the problem as one of estimating the conditional mean, and use the average squared residual as the impurity function. That does not generalize naturally to the multilabel case. Instead, a more common impurity function, as a function of the shares $p_1,\ldots,p_J$ of units in the leaf with each of the $J$ labels, is the Gini impurity,
$$\sum_{j=1}^{J}p_j\,(1-p_j).$$
This impurity function is minimized if the leaf is pure, meaning that all units in that leaf have the same label, and is maximized if the shares are all equal to $1/J$. The regularization typically works, again, through a penalty term on the number of leaves in the tree. The same extension from a single tree to a random forest that is discussed above for the regression case works for the classification case.
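As a concrete instance, the Gini impurity of a vector of label shares, and a classification tree and forest grown with it as the splitting criterion (scikit-learn uses the Gini criterion by default; the data below are simulated for illustration):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def gini(shares):
    """Gini impurity: sum_j p_j (1 - p_j); zero for a pure leaf, maximal at equal shares."""
    shares = np.asarray(shares, dtype=float)
    return float(np.sum(shares * (1.0 - shares)))

print(gini([1.0, 0.0, 0.0]))          # pure leaf -> 0
print(gini([1 / 3, 1 / 3, 1 / 3]))    # equal shares -> maximal impurity

rng = np.random.default_rng(9)
X = rng.normal(size=(1000, 5))
labels = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
tree = DecisionTreeClassifier(criterion="gini", max_leaf_nodes=8).fit(X, labels)
forest = RandomForestClassifier(n_estimators=200, criterion="gini",
                                random_state=0).fit(X, labels)
print(forest.predict(np.zeros((1, 5))))
```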

4.2. Support Vector Machines and Kernels

SVMs (Vapnik 2013, Scholkopf & Smola 2001) make up another flexible set of methods for classification analyses. SVMs can also be extended to regression settings but are more naturally introduced in a classification context, and, for simplicity, we focus on the case with two possible labels.