

Annual Review of Economics

Machine Learning Methods That Economists Should Know About

Susan Athey and Guido W. Imbens

Graduate School of Business, Stanford University, Stanford, California 94305, USA; email: athey@stanford.edu, imbens@stanford.edu

Stanford Institute for Economic Policy Research, Stanford University, Stanford, California 94305, USA

National Bureau of Economic Research, Cambridge, Massachusetts 02138, USA

Department of Economics, Stanford University, Stanford, California 94305, USA

Annu. Rev. Econ. 2019. 11:685-725

Keywords

machine learning, causal inference, econometrics

First published as a Review in Advance on June 10, 2019

The Annual Review of Economics is online at economics.annualreviews.org

Copyright (c) 2019 by Annual Reviews. All rights reserved

JEL code: C30

Abstract

We discuss the relevance of the recent machine learning (ML) literature for economics and econometrics. First we discuss the differences in goals, methods, and settings between the ML literature and the traditional econometrics and statistics literatures. Then we discuss some specific methods from the ML literature that we view as important for empirical researchers in economics. These include supervised learning methods for regression and classification, unsupervised learning methods, and matrix completion methods. Finally, we highlight newly developed methods at the intersection of ML and econometrics that typically perform better than either off-the-shelf ML or more traditional econometric methods when applied to particular classes of problems, including causal inference for average treatment effects, optimal policy estimation, and estimation of the counterfactual effect of price changes in consumer choice models.

1. INTRODUCTION

In the abstract of his provocative 2001 paper in Statistical Science, the Berkeley statistician Leo Breiman (2001b, p. 199) writes about the difference between model-based and algorithmic approaches to statistics:
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.
Breiman (2001b, p. 199) goes on to claim that,
The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.
Breiman's (2001b) characterization no longer applies to the field of statistics. The statistics community has by and large accepted the machine learning (ML) revolution that Breiman refers to as the algorithm modeling culture, and many textbooks discuss ML methods alongside more traditional statistical methods (e.g., Hastie et al. 2009, Efron & Hastie 2016). Although the adoption of these methods in economics has been slower, they are now beginning to be widely used in empirical work and are the topic of a rapidly increasing methodological literature. In this review, we want to make the case that economists and econometricians also, as Breiman writes about the statistics community, "need to move away from exclusive dependence on data models and adopt a more diverse set of tools." We discuss some of the specific tools that empirical researchers would benefit from, and that we feel should be part of the standard graduate curriculum in econometrics if, as Breiman writes and we agree with, "our goal as a field is to use data to solve problems;" if, in other words, we view econometrics as, in essence, decision making under uncertainty (e.g., Chamberlain 2000); and if we wish to enable students to communicate effectively with researchers in other fields where these methods are routinely being adopted. Although relevant more generally, the methods developed in the ML literature have been particularly successful in big data settings, where we observe information on a large number of units, many pieces of information on each unit, or both, and often outside the simple setting with a single cross-section of units. For such settings, ML tools are becoming the standard across disciplines, so the economist's toolkit needs to adapt accordingly while preserving the traditional strengths of applied econometrics.
Why has the acceptance of ML methods been so much slower in economics compared to the broader statistics community? A large part of it may be the culture as Breiman refers to it. Economics journals emphasize the use of methods with formal properties of a type that many of the ML methods do not naturally deliver. This includes large sample properties of estimators and tests, including consistency, normality, and efficiency. In contrast, the focus in the ML literature is often on working properties of algorithms in specific settings, with the formal results being of a different type, e.g., guarantees of error rates. There are typically fewer theoretical results of the type traditionally reported in econometrics papers, although recently there have been some major advances in this area (Wager & Athey 2017, Farrell et al. 2018). There are no formal results that show that, for supervised learning problems, deep learning or neural net methods are uniformly superior to regression trees or random forests, and it appears unlikely that general results for such comparisons will soon be available, if ever.
Although the ability to construct valid large-sample confidence intervals is important in many cases, one should not out-of-hand dismiss methods that cannot deliver them (or, possibly, that cannot yet deliver them) if these methods have other advantages. The demonstrated ability to outperform alternative methods on specific data sets in terms of out-of-sample predictive power is valuable in practice, even though such performance is rarely explicitly acknowledged as a goal or assessed in econometrics. As Mullainathan & Spiess (2017) highlight, some substantive problems are naturally cast as prediction problems, and assessing their goodness of fit on a test set may be sufficient for the purposes of the analysis in such cases. In other cases, the output of a prediction problem is an input to the primary analysis of interest, and statistical analysis of the prediction component beyond convergence rates is not needed. However, there are also many settings where it is important to provide valid confidence intervals for a parameter of interest, such as an average treatment effect. The degree of uncertainty captured by standard errors or confidence intervals may be a component in decisions about whether to implement the treatment. We argue that, in the future, as ML tools are more widely adopted, researchers should articulate clearly the goals of their analysis and why certain properties of algorithms and estimators may or may not be important.
A major theme of this review is that, even though there are cases where using simple off-the-shelf algorithms from the ML literature can be effective (for examples, see Mullainathan & Spiess 2017), there are also many cases where this is not the case. The ML techniques often require careful tuning and adaptation to effectively address the specific problems that economists are interested in. Perhaps the most important type of adaptation is to exploit the structure of the problems, e.g., the causal nature of many estimands; the endogeneity of variables; the configuration of data such as panel data; the nature of discrete choice among a set of substitutable products; or the presence of credible restrictions motivated by economic theory, such as monotonicity of demand in prices or other shape restrictions (Matzkin 1994, 2007). Statistics and econometrics have traditionally put much emphasis on these structures and developed insights to exploit them, whereas ML has often put little emphasis on them. Exploitation of these insights, both substantive and statistical, which, in a different form, is also seen in the careful tuning of ML techniques for specific problems such as image recognition, can greatly improve their performance. Another type of adaptation involves changing the optimization criteria of ML algorithms to prioritize considerations from causal inference, such as controlling for confounders or discovering treatment effect heterogeneity. Finally, techniques such as sample splitting [using different data to select models than to estimate parameters (e.g., Athey & Imbens 2016, Wager & Athey 2017)] and orthogonalization (e.g., Chernozhukov et al. 2016a) can be used to improve the performance of ML estimators, in some cases leading to desirable properties such as asymptotic normality of ML estimators (e.g., Athey et al. 2016b, Farrell et al. 2018).
In this review, we discuss a list of tools that we feel should be part of the empirical economist's toolkit and should be covered in the core econometrics graduate courses. Of course, this is a subjective list, and given the speed with which this literature is developing, the list will rapidly evolve. Moreover, we do not give a comprehensive discussion of these topics; rather, we aim to provide an introduction to these methods that conveys the main ideas and insights, with references to more comprehensive treatments. First on our list is nonparametric regression, or in the terminology of the ML literature, supervised learning for regression problems. Second, we discuss supervised learning for classification problems or, closely related but not quite the same, nonparametric regression for discrete response models. This is the area where ML methods have had perhaps their biggest successes. Third, we discuss unsupervised learning, or clustering analysis and density estimation. Fourth, we analyze estimates of heterogeneous treatment effects and optimal policies mapping from individuals' observed characteristics to treatments. Fifth, we discuss ML approaches to experimental design, where bandit approaches are starting to revolutionize effective experimentation, especially in online settings. Sixth, we discuss the matrix completion problem, including its application to causal panel data models and problems of consumer choice among a discrete set of products. Finally, we discuss the analysis of text data.
We note that there are a few other recent reviews of ML methods aimed at economists, often with more empirical examples and references to applications than we discuss in this review. Varian (2014) provides an early high-level discussion of a selection of important ML methods. Mullainathan & Spiess (2017) focus on the benefits of supervised learning methods for regression and discuss the prevalence of problems in economics where prediction methods are appropriate. Athey (2017) and Athey et al. (2017c) provide a broader perspective with more emphasis on recent developments in adapting ML methods for causal questions and general implications for economics. Gentzkow et al. (2017) provide an excellent recent discussion of methods for text analyses with a focus on economics applications. In the computer science and statistics literatures, there are also several excellent textbooks, with different levels of accessibility to researchers with a social science background, including the work of Efron & Hastie (2016); Hastie et al. (2009), who provide a more comprehensive text from a statistics perspective; Burkov (2019), who provides a very accessible introduction; Alpaydin (2009); and Knox (2018); all of these works take more of a computer science perspective.

2. ECONOMETRICS AND MACHINE LEARNING: GOALS, METHODS, AND SETTINGS

In this section, we introduce some of the general themes of this review. What are the differences in the goals and concerns of traditional econometrics and the ML literature, and how do these goals and concerns affect the choices among specific methods?

2.1. Goals

The traditional approach in econometrics, as exemplified in leading texts such as those of Greene (2000), Angrist & Pischke (2008), and Wooldridge (2010), is to specify a target, an estimand, that is a functional of a joint distribution of the data. The target is often a parameter of a statistical model that describes the distribution of a set of variables (typically conditional on some other variables) in terms of a set of parameters, which can be a finite or infinite set. Given a random sample from the population of interest, the parameter of interest and the nuisance parameters are estimated by finding the parameter values that best fit the full sample, using an objective function such as the sum of squared errors or the likelihood function. The focus is on the quality of the estimators of the target, traditionally measured through large sample efficiency. There is often also interest in constructing confidence intervals. Researchers typically report point estimates and standard errors.
In contrast, in the ML literature, the focus is typically on developing algorithms [a widely cited paper by Wu et al. (2008) has the title "Top 10 Algorithms in Data Mining"]. The goal for the algorithms is typically to make predictions about some variables given others or to classify units on the basis of limited information, for example, to classify handwritten digits on the basis of pixel values.
In a very simple example, suppose that we model the conditional distribution of some outcome $Y_i$ given a vector-valued regressor or feature $X_i$. Suppose that we are confident that

$$Y_i \mid X_i = x \sim \mathcal{N}(x^\top \beta, \sigma^2).$$
We could estimate $\beta$ by least squares, that is, as

$$\hat{\beta}_{\mathrm{ls}} = \arg\min_{\beta} \sum_{i=1}^{N} \left(Y_i - X_i^\top \beta\right)^2.$$
Most introductory econometrics texts would focus on the least squares estimator without much discussion. If the model is correct, then the least squares estimator has well-known attractive properties: It is unbiased, it is the best linear unbiased estimator, it is the maximum likelihood estimator, and thus it has large sample efficiency properties.
In ML settings, the goal may be to make a prediction for the outcome for new units on the basis of their regressor values. Suppose that we are interested in predicting the value of $Y_{N+1}$ for a new unit $N+1$, on the basis of the regressor values for this new unit, $X_{N+1}$. Suppose that we restrict ourselves to linear predictors, so that the prediction is

$$\hat{Y}_{N+1} = X_{N+1}^\top \hat{\beta}$$

for some estimator $\hat{\beta}$. The loss associated with this decision may be the squared error

$$\left(Y_{N+1} - \hat{Y}_{N+1}\right)^2 = \left(Y_{N+1} - X_{N+1}^\top \hat{\beta}\right)^2.$$
The question now is how to come up with estimators $\hat{\beta}$ that have good properties associated with this loss function. This need not be the least squares estimator. In fact, when the dimension of the features exceeds two, we know from decision theory that we can do better in terms of expected squared error than the least squares estimator. The latter is not admissible; that is, there are other estimators that dominate the least squares estimator.
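To make this concrete, the following is a minimal simulation sketch (not from the paper; the dimensions, the shrinkage parameter, and all names are illustrative) in which a ridge-type estimator that shrinks toward zero achieves lower expected squared prediction error than least squares:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, sigma = 50, 10, 1.0
beta = rng.normal(0, 0.2, size=K)  # true coefficients, modest in magnitude

def expected_squared_error(lam, reps=2000):
    """Monte Carlo estimate of E[(Y_{N+1} - X_{N+1}'beta_hat)^2];
    lam = 0 reproduces ordinary least squares."""
    errs = []
    for _ in range(reps):
        X = rng.normal(size=(N, K))
        Y = X @ beta + sigma * rng.normal(size=N)
        # ridge estimator (X'X + lam I)^{-1} X'Y, shrinking toward zero
        bhat = np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ Y)
        x_new = rng.normal(size=K)              # features of the new unit
        y_new = x_new @ beta + sigma * rng.normal()
        errs.append((y_new - x_new @ bhat) ** 2)
    return float(np.mean(errs))

print("least squares :", expected_squared_error(lam=0.0))
print("ridge (lam=25):", expected_squared_error(lam=25.0))
```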

2.2. Terminology

One source of confusion is the use of new terminology in ML for concepts that have well-established labels in the older literatures. In the context of a regression model, the sample used to estimate the parameters is often referred to as the training sample. Instead of the model being estimated, it is being trained. Regressors, covariates, or predictors are referred to as features. Regression parameters are sometimes referred to as weights. Prediction problems are divided into supervised learning problems, where we observe both the predictors (features) $X_i$ and the outcome $Y_i$, and unsupervised learning problems, where we only observe the $X_i$ and try to group them into clusters or otherwise estimate their joint distribution. Unordered discrete response problems are generally referred to as classification problems.

2.3. Validation and Cross-Validation

In most discussions of linear regression in econometric textbooks, there is little emphasis on model validation. The form of the regression model, be it parametric or nonparametric, and the set of regressors are assumed to be given from the outside, e.g., economic theory. Given this specification, the task of the researcher is to estimate the unknown parameters of this model. Much emphasis is placed on doing this estimation step efficiently, typically operationalized through definitions of large sample efficiency. If there is discussion of model selection, it is often in the form of testing null hypotheses concerning the validity of a particular model, with the implication that there is a true model that should be selected and used for subsequent tasks.
Consider the regression example in the previous section. Let us assume that we are interested in predicting the outcome for a new unit, randomly drawn from the same population as our sample was drawn from. As an alternative to estimating the linear model with an intercept and a scalar $X_i$, we could estimate the model with only an intercept. Certainly, if $\beta_1 = 0$, then that model would lead to better predictions. By the same argument, if the true value of $\beta_1$ were close but not exactly equal to zero, then we would still do better leaving $X_i$ out of the regression. Out-of-sample cross-validation can help guide such decisions. There are two components of the problem that are important for this ability. First, the goal is predictive power, rather than estimation of a particular structural or causal parameter. Second, the method uses out-of-sample comparisons, rather than in-sample goodness-of-fit measures. This ensures that we obtain unbiased comparisons of the fit.
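As a concrete illustration, here is a minimal sketch on simulated data (the slope value and sample split are illustrative) of using a held-out sample to choose between the intercept-only model and the model that includes the regressor:

```python
import numpy as np

rng = np.random.default_rng(1)
N, beta1 = 100, 0.05           # true slope close to, but not exactly, zero
x = rng.normal(size=N)
y = 1.0 + beta1 * x + rng.normal(size=N)

train, test = slice(0, N // 2), slice(N // 2, None)

# Model A: intercept only, i.e., predict the training-sample mean
pred_a = np.full(N - N // 2, y[train].mean())
# Model B: intercept and slope, fit by least squares on the training half
b1, b0 = np.polyfit(x[train], y[train], deg=1)
pred_b = b0 + b1 * x[test]

print("out-of-sample MSE, intercept only:", np.mean((y[test] - pred_a) ** 2))
print("out-of-sample MSE, with slope    :", np.mean((y[test] - pred_b) ** 2))
```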

2.4. Overfitting, Regularization, and Tuning Parameters

The ML literature is much more concerned with overfitting than the standard statistics or econometrics literatures. Researchers attempt to select flexible models that fit well, but not so well that out-of-sample prediction is compromised. There is much less emphasis on formal results that particular methods are superior in large samples (asymptotically); instead, methods are compared on specific data sets to see what works well. A key concept is that of regularization. As Vapnik (2013, p. 9) writes, "Regularization theory was one of the first signs of the existence of intelligent inference."
Consider a setting with a large set of models that differ in their complexity, measured, for example, as the number of unknown parameters in the model or, more subtly, through the Vapnik-Chervonenkis (VC) dimension that measures the capacity or complexity of a space of models. Instead of directly optimizing an objective function, say, minimizing the sum of squared residuals in a least squares regression setting or maximizing the logarithm of the likelihood function, a term is added to the objective function to penalize the complexity of the model. There are antecedents of this practice in the traditional econometrics and statistics literatures. One is that, in likelihood settings, researchers sometimes add a term to the logarithm of the likelihood function equal to minus the logarithm of the sample size times the number of free parameters divided by two, leading to the Bayesian information criterion, or simply minus the number of free parameters, leading to the Akaike information criterion. In Bayesian analyses of regression models, the use of a prior distribution on the regression parameters, centered at zero, independent across parameters with a constant prior variance, is another way of regularizing estimation that has a long tradition. The modern approaches to regularization are different in that they are more data driven, with the amount of regularization determined explicitly by the out-of-sample predictive performance rather than by, for example, a subjectively chosen prior distribution.
Consider a linear regression model with $K$ regressors,

$$Y_i = \sum_{k=1}^{K} \beta_k X_{ik} + \varepsilon_i.$$
Suppose that we also have a prior distribution for the slope coefficients $\beta_k$, with the prior $\beta_k \sim \mathcal{N}(0, \tau^2)$, and $\beta_k$ independent of $\beta_l$ for any $l \neq k$. (This may be more plausible if we first normalize the features and outcome to have mean zero and unit variance. We assume that this has been done.) Given the value for the variance of the prior distribution, $\tau^2$, the posterior mean for $\beta$ is the solution to

$$\min_{\beta} \sum_{i=1}^{N} \left(Y_i - \sum_{k=1}^{K} \beta_k X_{ik}\right)^2 + \lambda \sum_{k=1}^{K} \beta_k^2,$$
where $\lambda = \sigma^2 / \tau^2$. One version of an ML approach to this problem is to estimate $\beta$ by minimizing the same penalized objective,

$$\sum_{i=1}^{N} \left(Y_i - \sum_{k=1}^{K} \beta_k X_{ik}\right)^2 + \lambda \sum_{k=1}^{K} \beta_k^2.$$
The only difference is in the way the penalty parameter $\lambda$ is chosen. In a formal Bayesian approach, this reflects the (subjective) prior distribution on the parameters, and it would be chosen a priori. In an ML approach, $\lambda$ would be chosen through out-of-sample cross-validation to optimize the out-of-sample predictive performance. This is closer to an empirical Bayes approach, where the data are used to estimate the prior distribution (e.g., Morris 1983).
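A minimal sketch of this contrast, on simulated data where the prior is in fact correct so that the prior-implied penalty $\lambda = \sigma^2/\tau^2$ is known (scikit-learn's RidgeCV is used here for convenience; all values are illustrative):

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
N, K, sigma, tau = 200, 30, 1.0, 0.3
X = rng.normal(size=(N, K))
beta = rng.normal(0, tau, size=K)   # slopes drawn from the N(0, tau^2) prior
Y = X @ beta + sigma * rng.normal(size=N)

# data-driven penalty: lambda chosen by (leave-one-out) cross-validation
cv_fit = RidgeCV(alphas=np.logspace(-2, 3, 50)).fit(X, Y)
print("prior-implied lambda  :", sigma**2 / tau**2)
print("cross-validated lambda:", cv_fit.alpha_)
```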

2.5. Sparsity

In many settings in the ML literature, the number of features is substantial, both in absolute terms and relative to the number of units in the sample. However, there is often a sense that many of the features are of minor importance, if not completely irrelevant. The problem is that we may not know ex ante which of the features matter and which can be dropped from the analysis without substantially hurting the predictive power.
Hastie et al. (2015) discuss what they call the sparsity principle:
Assume that the underlying true signal is sparse and we use an $\ell_1$ penalty to try to recover it. If our assumption is correct, we can do a good job in recovering the true signal.... But if we are wrong (the underlying truth is not sparse in the chosen bases), then the $\ell_1$ penalty will not work well. However, in that instance, no method can do well, relative to the Bayes error. (Hastie et al. 2015, page 24)
Exact sparsity is in fact stronger than is necessary; in many cases it is sufficient to have approximate sparsity, where most of the explanatory variables have very limited explanatory power, even if not zero, and only a few of the features are of substantial importance (see, for example, Belloni et al. 2014).
Traditionally, in the empirical literature in social sciences, researchers limited the number of explanatory variables by hand, rather than choosing them in a data-dependent manner. Allowing the data to play a bigger role in the variable selection process appears to be a clear improvement, even if the assumption that the underlying process is at least approximately sparse is still a very strong one, and even if inference in the presence of data-dependent model selection can be challenging.
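A minimal sketch under the exact-sparsity assumption (simulated data; the dimensions are illustrative), showing an $\ell_1$-penalized regression with a cross-validated penalty recovering the few relevant features:

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(3)
N, K = 200, 100
beta = np.zeros(K)
beta[:5] = 1.0                      # only 5 of the 100 features matter
X = rng.normal(size=(N, K))
Y = X @ beta + rng.normal(size=N)

fit = LassoCV(cv=5).fit(X, Y)       # penalty chosen by cross-validation
print("nonzero coefficients:", int(np.sum(fit.coef_ != 0)))
print("five largest (should be features 0-4):",
      np.argsort(-np.abs(fit.coef_))[:5])
```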

2.6. Computational Issues and Scalability

Compared to the traditional statistics and econometrics literatures, the ML literature is much more concerned with computational issues and the ability to implement estimation methods with large data sets. Solutions that may have attractive theoretical properties in terms of statistical efficiency but that do not scale well to large data sets are often discarded in favor of methods that can be implemented easily in very large data sets. This can be seen in the discussion of the relative merits of the least absolute shrinkage and selection operator (LASSO) versus subset selection in linear regression settings. In a setting with a large number of features that might be included in the analysis, subset selection methods focus on selecting a subset of the regressors and then estimating the parameters of the regression function by least squares. However, LASSO has computational advantages. It can be implemented by adding a penalty term that is proportional to the sum of the absolute values of the parameters. A major attraction of LASSO is that there are effective methods for calculating the LASSO estimates with the number of regressors in the millions. Best subset selection regression, in contrast, is an NP-hard problem. Until recently, it was thought that this was only feasible in settings with the number of regressors in the 30s, although current research (Bertsimas et al. 2016) suggests that it may be feasible with the number of regressors in the 1,000s. This has reopened a new, still unresolved debate on the relative merits of LASSO versus best subset selection (see Hastie et al. 2017) in settings where both are feasible. There are some indications that, in settings with a low signal-to-noise ratio, as is common in many social science applications, LASSO may have better performance, although there remain many open questions. In many social science applications, the scale of the problems is such that best subset selection is also feasible, and the computational issues may be less important than these substantive aspects of the problems.
A key computational optimization tool used in many ML methods is stochastic gradient descent (SGD) (Bottou 1998, 2012; Friedman 2002). It is used in a wide variety of settings, including in optimizing neural networks and estimating models with many latent variables (e.g., Ruiz et al. 2017). The idea is very simple. Suppose that the goal is to estimate a parameter $\theta$, and that the estimation approach entails finding the value $\hat{\theta}$ that minimizes an empirical loss function, where $\ell_i(\theta)$ is the loss for observation $i$, and the overall loss is the sum $L(\theta) = \sum_{i=1}^{N} \ell_i(\theta)$, with derivative $L'(\theta) = \sum_{i=1}^{N} \ell_i'(\theta)$. Classic gradient descent methods involve an iterative approach, where $\theta_{k+1}$ is updated from $\theta_k$ as follows:

$$\theta_{k+1} = \theta_k - \eta_k L'(\theta_k),$$
where $\eta_k$ is the learning rate, often chosen optimally through line search. More sophisticated optimization methods multiply the first derivative by the inverse of the matrix of second derivatives or estimates thereof.
The challenge with this approach is that it can be computationally expensive. The computational cost is in evaluating the full derivative $L'(\theta)$ and even more in optimizing the learning rate $\eta_k$. The idea behind SGD is that it is better to take many small steps that are noisy but, on average, in the right direction than it is to spend equivalent computational cost in very accurately figuring out in what direction to take a single small step. More specifically, SGD uses the fact that the average of $\ell_i'(\theta)$ for a random subset of the sample is an unbiased (but noisy) estimate of the gradient. For example, dividing the data randomly into 10 subsets or batches, with $B_i \in \{1, \ldots, 10\}$ denoting the subset unit $i$ belongs to, one could do 10 steps of the type

$$\theta_{k+1} = \theta_k - \eta \sum_{i : B_i = k} \ell_i'(\theta_k), \qquad k = 1, \ldots, 10,$$
with a deterministic learning rate $\eta$. After the 10 iterations, one could reshuffle the data set and then repeat. If the learning rate $\eta$ decreases at an appropriate rate, then under relatively mild assumptions, SGD converges almost surely to a global minimum when the objective function is convex or pseudoconvex and otherwise converges almost surely to a local minimum. Bottou (2012) provides an overview and practical tips for implementation.
The idea can be pushed even further in the case where $\ell_i(\theta)$ is itself an expectation. We can consider evaluating $\ell_i'(\theta)$ using Monte Carlo integration. However, rather than taking many Monte Carlo draws to get an accurate approximation to the integral, we can instead take a small number of draws or even a single draw. This type of approximation is used in economic applications by Ruiz et al. (2017) and Hartford et al. (2016).
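A minimal SGD sketch for the squared-error loss $\ell_i(\theta) = (Y_i - X_i^\top \theta)^2 / 2$, following the batched updates described above (the batch count matches the example in the text; the learning-rate schedule is illustrative):

```python
import numpy as np

rng = np.random.default_rng(5)
N, K = 1000, 5
X = rng.normal(size=(N, K))
theta_true = rng.normal(size=K)
Y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(K)
for epoch in range(20):
    eta = 0.005 / (1 + epoch)                  # decreasing learning rate
    # reshuffle, split into 10 batches, take one noisy step per batch
    for b in np.array_split(rng.permutation(N), 10):
        grad = X[b].T @ (X[b] @ theta - Y[b])  # sum of l_i'(theta) over batch
        theta -= eta * grad
print("max |theta - theta_true| =", np.max(np.abs(theta - theta_true)))
```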

2.7. Ensemble Methods and Model Averaging

Another key feature of the ML literature is the use of model averaging and ensemble methods (e.g., Dietterich 2000). In many cases, a single model or algorithm does not perform as well as a combination of possibly quite different models, averaged using weights (sometimes called votes) obtained by optimizing out-of-sample performance. A striking example is the Netflix Prize competition (Bennett & Lanning 2007), where all of the top contenders use combinations of models and often averages of many models (Bell & Koren 2007). There are two related ideas in the traditional econometrics literature. Obviously, Bayesian analysis implicitly averages over the posterior distribution of the parameters. Mixture models are also used to combine different parameter values in a single prediction. However, in both cases, this model averaging involves averaging over similar models, typically with the same specification, that are only different in terms of parameter values. In the modern literature, and in the top entries in the Netflix Prize competition, the models that are averaged over can be quite different, and the weights are obtained by optimizing out-of-sample predictive power, rather than in-sample fit.
For example, one may have three predictive models, one based on a random forest, leading to predictions $\hat{Y}_i^{\mathrm{rf}}$; one based on a neural net, with predictions $\hat{Y}_i^{\mathrm{nn}}$; and one based on a linear model estimated by LASSO, leading to $\hat{Y}_i^{\mathrm{lasso}}$. Then, using a test sample, one can choose weights $p^{\mathrm{rf}}$, $p^{\mathrm{nn}}$, and $p^{\mathrm{lasso}}$ by minimizing the sum of squared residuals in the test sample:

$$\min_{p^{\mathrm{rf}}, p^{\mathrm{nn}}, p^{\mathrm{lasso}}} \sum_{i \in \text{test sample}} \left(Y_i - p^{\mathrm{rf}} \hat{Y}_i^{\mathrm{rf}} - p^{\mathrm{nn}} \hat{Y}_i^{\mathrm{nn}} - p^{\mathrm{lasso}} \hat{Y}_i^{\mathrm{lasso}}\right)^2, \quad \text{subject to } p^{\mathrm{rf}} + p^{\mathrm{nn}} + p^{\mathrm{lasso}} = 1, \; p \geq 0.$$
One may also estimate weights based on regression of the outcomes in the test sample on the predictors from the different models without imposing that the weights sum to one and are nonnegative because random forests, neural nets, and LASSO have distinct strengths and weaknesses in terms of how well they deal with the presence of irrelevant features, nonlinearities, and interactions. As a result, averaging over these models may lead to out-of-sample predictions that are strictly better than predictions based on a single model.
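A minimal sketch of this step, with simulated stand-in predictions for the three models (the noise levels are illustrative), choosing the weights on the test sample by constrained least squares:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
n_test = 500
y = rng.normal(size=n_test)
# stand-ins for test-sample predictions from a random forest, a neural
# net, and a LASSO fit: here, the truth plus different amounts of noise
preds = np.column_stack([y + rng.normal(0, s, n_test) for s in (0.5, 0.7, 0.9)])

def sse(p):
    """Sum of squared residuals of the weighted ensemble prediction."""
    return np.sum((y - preds @ p) ** 2)

res = minimize(sse, x0=np.ones(3) / 3, method="SLSQP",
               bounds=[(0.0, 1.0)] * 3,
               constraints={"type": "eq", "fun": lambda p: p.sum() - 1.0})
print("ensemble weights (rf, nn, lasso):", res.x.round(3))
```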
In a panel data context (Athey et al. 2019), one can use ensemble methods combining various forms of synthetic control and matrix completion methods and find that the combinations outperform the individual methods.

2.8. Inference

The ML literature has focused heavily on out-of-sample performance as the criterion of interest. This has come at the expense of one of the concerns that the statistics and econometrics literatures have traditionally focused on, namely, the ability to do inference, e.g., construct confidence intervals that are valid, at least in large samples. Efron & Hastie (2016, p. 209) write:
Prediction, perhaps because of its model-free nature, is an area where algorithmic developments have run far ahead of their inferential justification.
Although there has recently been substantial progress in the development of methods for inference for low-dimensional functionals in specific settings [e.g., the work of Wager & Athey (2017) in the context of random forests and of Farrell et al. (2018) in the context of neural networks], it remains the case that, for many methods, it is currently impossible to construct confidence intervals that are valid, even if only asymptotically. One question is whether this ability to construct confidence intervals is as important as the traditional emphasis on it in the econometric literature suggests. For many decision problems, it may be that prediction is of primary importance, and inference is at best of secondary importance. Even in cases where it is possible to do inference, it is important to keep in mind that the requirements that ensure this ability often come at the expense of predictive performance. One can see this tradeoff in traditional kernel regression, where the bandwidth that optimizes expected squared error balances the tradeoff between the square of the bias and the variance, so that the optimal estimators have an asymptotic bias that invalidates the use of standard confidence intervals. This can be fixed by using a bandwidth that is smaller than the optimal one, so that the asymptotic bias vanishes, but it does so explicitly at the expense of increasing the variance.

3. SUPERVISED LEARNING FOR REGRESSION PROBLEMS

One of the canonical problems in both the ML and econometric literatures is that of estimating the conditional mean of a scalar outcome given a set of covariates or features. Let $Y_i$ denote the outcome for unit $i$, and let $X_i$ denote the $K$-component vector of covariates or features. The conditional expectation is

$$g(x) = \mathbb{E}\left[Y_i \mid X_i = x\right].$$
Compared to the traditional econometric textbooks (e.g., Greene 2000, Angrist & Pischke 2008, Wooldridge 2010), there are some conceptual differences in the ML literature (for discussion, see Mullainathan & Spiess 2017). In the settings considered in the ML literature, there are often many covariates, sometimes more than there are units in the sample. There is no presumption in the ML literature that the conditional distribution of the outcomes given the covariates follows a particular parametric model. The derivatives of the conditional expectation for each of the covariates, which in the linear regression model correspond to the parameters, are not of intrinsic interest. Instead, the focus is on out-of-sample predictions and their accuracy. Furthermore, there is less of a sense that the conditional expectation is monotone in each of the covariates compared to many economic applications. There is often concern that the conditional expectation may be an extremely nonmonotone function with some higher-order interactions of substantial importance.
The econometric literature on estimating the conditional expectation is also huge. Parametric methods for estimating $g(x)$ often use least squares. Since the work of Bierens (1987), kernel regression methods have become a popular alternative when more flexibility is required, and series or sieve methods have subsequently gained interest (for a survey, see Chen 2007). These methods have well-established large sample properties, allowing for the construction of confidence intervals. Simple nonnegative kernel methods are viewed as performing very poorly in settings with high-dimensional covariates, with the error of order $N^{-2/(K+4)}$. This rate can be improved by using higher-order kernels and assuming the existence of many derivatives of $g(\cdot)$, but practical experience with high-dimensional covariates has not been satisfactory for these methods, and applications of kernel methods in econometrics are generally limited to low-dimensional settings.
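To see how quickly this rate deteriorates, the following back-of-the-envelope sketch uses the standard $N^{-2/(K+4)}$ rate (for a second-order kernel and a twice-differentiable regression function) to compute the sample size needed to match the accuracy of $N = 1{,}000$ observations with a single covariate:

```python
# Error of order N^(-2/(K+4)): solve N^(-2/(K+4)) = target for N,
# where target is the error at K = 1, N = 1,000.
target = 1000 ** (-2 / (1 + 4))
for K in (1, 2, 5, 10, 20):
    N_needed = target ** (-(K + 4) / 2)
    print(f"K = {K:2d}: N needed ~ {N_needed:,.0f}")
```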
The differences in performance between some of the traditional methods such as kernel regression and the modern methods such as random forests are particularly pronounced in sparse settings with a large number of more or less irrelevant covariates. Random forests are effective at picking up on the sparsity and ignoring the irrelevant features, even if there are many of them, while the traditional implementations of kernel methods essentially waste degrees of freedom on accounting for these covariates. Although it may be possible to adapt kernel methods for the presence of irrelevant covariates by allowing for covariate-specific bandwidths, in practice there has been little effort in this direction. A second issue is that the modern methods are particularly good at detecting severe nonlinearities and high-order interactions. The presence of such high-order interactions in some of the success stories of these methods should not blind us to the fact that, with many economic data, we expect high-order interactions to be of limited importance. If we try to predict earnings for individuals, then we expect the regression function to be monotone in many of the important predictors such as education and prior earnings variables, even for homogeneous subgroups. This means that models based on linearizations may do well in such cases relative to other methods, compared to settings where monotonicity is fundamentally less plausible, as, for example, in an image recognition problem. This is also a reason for the superior performance of locally linear random forests (Friedberg et al. 2018) relative to standard random forests.
We discuss four specific sets of methods, although there are many more, including variations on the basic methods. First, we discuss methods where the class of models considered is linear in the covariates, and the question is solely about regularization. Second, we discuss methods based on partitioning the covariate space using regression trees and random forests. Third, we discuss neural nets, which were the focus of a small econometrics literature in the 1990s (Hornik et al. 1989, White 1992) but more recently have become a very prominent part of the literature on ML in various subtle reincarnations. Fourth, we discuss boosting as a general principle.

3.1. Regularized Linear Regression: LASSO, Ridge, and Elastic Nets

Suppose that we consider approximations to the conditional expectation that have a linear form

$$g(x) = \beta^\top x,$$
after the covariates and the outcome are demeaned, and the covariates are normalized to have unit variance. The traditional method for estimating the regression function in this case is least squares, with

$$\hat{\beta}_{\mathrm{ls}} = \arg\min_{\beta} \sum_{i=1}^{N} \left(Y_i - \beta^\top X_i\right)^2.$$
However, if the number of covariates $K$ is large relative to the number of observations $N$, then the least squares estimator $\hat{\beta}_{\mathrm{ls}}$ does not even have particularly good repeated sampling properties as an estimator for $\beta$, let alone good predictive properties. In fact, with $K \geq 3$, the least squares estimator is not even admissible and is dominated by estimators that shrink toward zero. With $K$ very large, possibly even exceeding the sample size $N$, the least squares estimator has particularly poor properties, even if the conditional mean of the outcome given the covariates is in fact linear.
Even with $K$ modest in magnitude, the predictive properties of the least squares estimator may be inferior to those of estimators that use some amount of regularization. One common form of regularization is to add a penalty term that shrinks the $\beta_k$ toward zero and minimize

$$\sum_{i=1}^{N} \left(Y_i - \beta^\top X_i\right)^2 + \lambda \, \|\beta\|_q^q,$$
where $\|\beta\|_q = \left(\sum_{k=1}^{K} |\beta_k|^q\right)^{1/q}$. For $q = 1$, this corresponds to LASSO (Tibshirani 1996). For $q = 2$, this corresponds to ridge regression (Hoerl & Kennard 1970). As $q \to 0$, the solution penalizes the number of nonzero covariates, leading to best subset regression (Miller 2002, Bertsimas et al. 2016). In addition, there are many hybrid methods and modifications, including elastic nets, which combine penalty terms from LASSO and ridge (Zou & Hastie 2005); the relaxed LASSO, which combines least squares estimates from the subset selected by LASSO and the LASSO estimates themselves (Meinshausen 2007); least angle regression (Efron et al. 2004); the Dantzig selector (Candès & Tao 2007); and the non-negative garrotte (Breiman 1993).
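A minimal sketch (simulated, demeaned, unit-variance data; the penalty levels are illustrative) fitting the $q = 1$, $q = 2$, and elastic-net penalties side by side and counting the nonzero coefficients each one retains:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
N, K = 100, 50
X = StandardScaler().fit_transform(rng.normal(size=(N, K)))
beta = np.r_[np.ones(3), np.zeros(K - 3)]       # 3 relevant features
Y = X @ beta + rng.normal(size=N)
Y -= Y.mean()                                   # demean the outcome

for name, model in [("LASSO (q=1)", Lasso(alpha=0.1)),
                    ("ridge (q=2)", Ridge(alpha=10.0)),
                    ("elastic net", ElasticNet(alpha=0.1, l1_ratio=0.5))]:
    coef = model.fit(X, Y).coef_
    print(name, "nonzero coefficients:", int(np.sum(np.abs(coef) > 1e-8)))
```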