
Best Subset Selection via a Modern Optimization Lens

Dimitris Bertsimas, MIT Sloan School of Management and Operations Research Center, Massachusetts Institute of Technology: dbertsim@mit.edu

Angela King, Operations Research Center, Massachusetts Institute of Technology: aking10@mit.edu

Rahul Mazumder, MIT Sloan School of Management and Operations Research Center, Massachusetts Institute of Technology: rahulmaz@mit.edu
(This is a Revised Version dated May 2015. First Version Submitted for Publication in June 2014.)
Abstract

In the last twenty-five years (1990-2014), algorithmic advances in integer optimization combined with hardware improvements have resulted in an astonishing 200 billion factor speedup in solving Mixed Integer Optimization (MIO) problems. We present a MIO approach for solving the classical best subset selection problem of choosing $k$ out of $p$ features in linear regression given $n$ observations. We develop a discrete extension of modern first order continuous optimization methods to find high quality feasible solutions that we use as warm starts to a MIO solver that finds provably optimal solutions. The resulting algorithm (a) provides a solution with a guarantee on its suboptimality even if we terminate the algorithm early, (b) can accommodate side constraints on the coefficients of the linear regression and (c) extends to finding best subset solutions for the least absolute deviation loss function. Using a wide variety of synthetic and real datasets, we demonstrate that our approach solves problems with $n$ in the 1000s and $p$ in the 100s in minutes to provable optimality, and finds near optimal solutions for $n$ in the 100s and $p$ in the 1000s in minutes. We also establish via numerical experiments that the MIO approach performs better than Lasso and other popularly used sparse learning procedures, in terms of achieving sparse solutions with good predictive power.

1 Introduction

We consider the linear regression model with response vector $\mathbf{y}_{n\times 1}$, model matrix $\mathbf{X}=[\mathbf{x}_{1},\ldots,\mathbf{x}_{p}]\in\mathbb{R}^{n\times p}$, regression coefficients $\boldsymbol{\beta}\in\mathbb{R}^{p\times 1}$ and errors $\boldsymbol{\epsilon}\in\mathbb{R}^{n\times 1}$:

\[ \mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}. \]

We will assume that the columns of $\mathbf{X}$ have been standardized to have zero means and unit $\ell_{2}$-norm. In many important classical and modern statistical applications, it is desirable to obtain a parsimonious fit to the data by finding the best $k$-feature fit to the response $\mathbf{y}$. Especially in the high-dimensional regime with $p\gg n$, in order to conduct statistically meaningful inference, it is desirable to assume that the true regression coefficient $\boldsymbol{\beta}$ is sparse or may be well approximated by a sparse vector. Quite naturally, the last few decades have seen a flurry of activity in estimating sparse linear models with good explanatory power. Central to this statistical task lies the best subset problem [40] with subset size $k$, which is given by the following optimization problem:

\[ \min_{\boldsymbol{\beta}}\;\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\;\;\;\mathrm{subject\;to}\;\;\;\|\boldsymbol{\beta}\|_{0}\leq k, \tag{1} \]

where the $\ell_{0}$ (pseudo)norm of a vector $\boldsymbol{\beta}$ counts the number of nonzeros in $\boldsymbol{\beta}$ and is given by $\|\boldsymbol{\beta}\|_{0}=\sum_{i=1}^{p}1(\beta_{i}\neq 0)$, where $1(\cdot)$ denotes the indicator function. The cardinality constraint makes Problem (1) NP-hard [41]. Indeed, state-of-the-art algorithms to solve Problem (1), as implemented in popular statistical packages, like leaps in R, do not scale to problem sizes larger than $p=30$. Due to this reason, it is not surprising that the best subset problem has been widely dismissed as being intractable by the greater statistical community.
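To make the combinatorial barrier concrete, the following minimal sketch (our illustration, not code from the paper; `best_subset_enum` is a hypothetical helper) solves Problem (1) exactly by exhaustive enumeration. At $p=30$ and $k=15$ it would have to visit $\binom{30}{15}\approx 1.55\times 10^{8}$ subsets, which is why enumeration-based tools stall around $p=30$.

```python
from itertools import combinations
import numpy as np

def best_subset_enum(X, y, k):
    """Exact solution of Problem (1): one least squares fit per size-k support."""
    best_rss, best_support = np.inf, None
    for I in combinations(range(X.shape[1]), k):
        cols = list(I)
        coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        rss = np.sum((y - X[:, cols] @ coef) ** 2)
        if rss < best_rss:
            best_rss, best_support = rss, cols
    return best_support, 0.5 * best_rss  # optimal support and objective value of (1)
```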

In this paper we address Problem (1) using modern optimization methods, specifically mixed integer optimization (MIO) and a discrete extension of first order continuous optimization methods. Using a wide variety of synthetic and real datasets, we demonstrate that our approach solves problems with $n$ in the 1000s and $p$ in the 100s in minutes to provable optimality, and finds near optimal solutions for $n$ in the 100s and $p$ in the 1000s in minutes. To the best of our knowledge, this is the first time that MIO has been demonstrated to be a tractable solution method for Problem (1). We note that we use the term tractability not to mean the usual polynomial solvability for problems, but rather the ability to solve problems of realistic size in times that are appropriate for the applications we consider.

As there is a vast literature on the best subset problem, we next give a brief and selective overview of related approaches for the problem.

Brief Context and Background

To overcome the computational difficulties of the best subset problem, computationally tractable convex optimization based methods like Lasso [49, 17] have been proposed as a convex surrogate for Problem (1). For the linear regression problem, the Lagrangian form of Lasso solves

\[ \min_{\boldsymbol{\beta}}\;\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\lambda\|\boldsymbol{\beta}\|_{1}, \tag{2} \]

where the $\ell_{1}$ penalty on $\boldsymbol{\beta}$, i.e., $\|\boldsymbol{\beta}\|_{1}=\sum_{i}|\beta_{i}|$, shrinks the coefficients towards zero and naturally produces a sparse solution by setting many coefficients to be exactly zero. There has been a substantial amount of impressive work on Lasso [23, 15, 5, 55, 32, 59, 19, 35, 39, 53, 50] in terms of algorithms and understanding of its theoretical properties; see for example the excellent books or surveys [11, 34, 50] and the references therein.
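As a point of reference, Problem (2) is readily solved with off-the-shelf software. Below is a minimal sketch using scikit-learn on simulated data (our illustration; `lam` plays the role of $\lambda$ in (2), and scikit-learn scales the squared loss by $1/(2n)$, so `alpha = lam / n` matches the objective in (2) up to that rescaling).

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, lam = 100, 50, 5.0
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                     # standardize: zero column means ...
X /= np.linalg.norm(X, axis=0)          # ... and unit l2-norm columns
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(n)

fit = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y)
support = np.flatnonzero(fit.coef_)     # many coefficients are exactly zero
```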

Indeed, Lasso enjoys several attractive statistical properties and has drawn a significant amount of attention from the statistics community as well as other closely related fields. Under various conditions on the model matrix $\mathbf{X}$ and $n, p, \boldsymbol{\beta}$, it can be shown that Lasso delivers a sparse model with good predictive performance [11, 34]. In order to perform exact variable selection, much stronger assumptions are required [11]. Sufficient conditions under which Lasso gives a sparse model with good predictive performance are the restricted eigenvalue conditions and compatibility conditions [11]. These involve statements about the range of the spectrum of sub-matrices of $\mathbf{X}$ and are difficult to verify for a given data-matrix $\mathbf{X}$.

An important reason behind the popularity of Lasso is its computational feasibility and scalability to practical sized problems. Problem (2) is a convex quadratic optimization problem and there are several efficient solvers for it, see for example [44, 23, 29].

In spite of its favorable statistical properties, Lasso has several shortcomings. In the presence of noise and correlated variables, in order to deliver a model with good predictive accuracy, Lasso brings in a large number of nonzero coefficients (all of which are shrunk towards zero), including noise variables. Lasso leads to biased regression coefficient estimates, since the $\ell_{1}$-norm penalizes the large coefficients more severely than the smaller coefficients. In contrast, if the best subset selection procedure decides to include a variable in the model, it brings it in without any shrinkage, thereby draining the effect of its correlated surrogates. Upon increasing the degree of regularization, Lasso sets more coefficients to zero, but in the process ends up leaving out true predictors from the active set. Thus, as soon as certain sufficient regularity conditions on the data are violated, Lasso becomes suboptimal as (a) a variable selector and (b) in terms of delivering a model with good predictive performance.

The shortcomings of Lasso are also known in the statistics literature. In fact, there is a significant gap between what can be achieved via best subset selection and Lasso: this is supported by empirical (for small problem sizes, i.e., $p\leq 30$) and theoretical evidence, see for example, [46, 58, 38, 31, 56, 48] and the references therein. Some discussion is also presented herein, in Section 4.

To address the shortcomings, non-convex penalized regression is often used to “bridge” the gap between the convex $\ell_{1}$ penalty and the combinatorial $\ell_{0}$ penalty [38, 27, 24, 54, 55, 28, 61, 62, 57, 13]. Written in Lagrangian form, this gives rise to continuous non-convex optimization problems of the form:

\[ \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\sum_{i}p(|\beta_{i}|;\gamma;\lambda), \tag{3} \]

where $p(|\beta|;\gamma;\lambda)$ is a non-convex function in $\beta$, with $\lambda$ and $\gamma$ denoting the degree of regularization and non-convexity, respectively. Typical examples of non-convex penalties include the minimax concave penalty (MCP), the smoothly clipped absolute deviation (SCAD), and $\ell_{\gamma}$ penalties (see for example, [27, 38, 62, 24]). There is strong statistical evidence indicating the usefulness of estimators obtained as minimizers of non-convex penalized problems (3) over Lasso; see for example [56, 36, 54, 25, 52, 37, 60, 26]. In a recent paper, [60] discuss the usefulness of non-convex penalties over convex penalties (like Lasso) in identifying important covariates, leading to efficient estimation strategies in high dimensions. They describe interesting connections between $\ell_{0}$ regularized least squares and least squares with the hard thresholding penalty, and in the process develop comprehensive global properties of hard thresholding regularization in terms of various metrics. [26] establish asymptotic equivalence of a wide class of regularization methods in high dimensions with comprehensive sampling properties on both global and computable solutions.
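For concreteness, one common parameterization of the MCP penalty (an assumption for illustration; the text above leaves $p(|\beta|;\gamma;\lambda)$ abstract) can be evaluated as follows; it interpolates between the $\ell_{1}$ penalty ($\gamma\to\infty$) and a hard-thresholding-type penalty ($\gamma\to 1^{+}$).

```python
import numpy as np

def mcp_penalty(beta, lam, gamma):
    """MCP: lam*|t| - t^2/(2*gamma) for |t| <= gamma*lam; constant gamma*lam^2/2 beyond."""
    t = np.abs(beta)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2.0 * gamma),
                    0.5 * gamma * lam ** 2)
```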

Problem (3) mainly leads to a family of continuous and non-convex optimization problems. Various effective nonlinear optimization based methods (see for example [62, 24, 13, 36, 54, 38] and the references therein) have been proposed in the literature to obtain good local minimizers to Problem (3). In particular [38] proposes Sparsenet, a coordinate-descent procedure to trace out a surface of local minimizers for Problem (3) for the MCP penalty using effective warm start procedures. None of the existing approaches for solving Problem (3), however, come with guarantees of how close the solutions are to the global minimum of Problem (3).

The Lagrangian version of (1) given by

\[ \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\lambda\sum_{i=1}^{p}1(\beta_{i}\neq 0), \tag{4} \]

may be seen as a special case of (3). Note that, due to non-convexity, problems (4) and (1) are not equivalent. Problem (1) allows one to control the exact level of sparsity via the choice of $k$, unlike (4), where there is no clear correspondence between $\lambda$ and $k$. Problem (4) is a discrete optimization problem, unlike the continuous optimization problems (3) arising from continuous non-convex penalties.

Insightful statistical properties of Problem (4) have been explored from a theoretical viewpoint in [56, 31, 32, 48]. [48] points out that (1) is preferable over (4) in terms of superior statistical properties of the resulting estimator. The aforementioned papers, however, do not discuss methods to obtain provably optimal solutions to problems (4) or (1), and to the best of our knowledge, computing optimal solutions to problems (4) and (1) is deemed as intractable.

Our Approach

In this paper, we propose a novel framework via which the best subset selection problem can be solved to optimality or near optimality in problems of practical interest within a reasonable time frame. At the core of our proposal is a computationally tractable framework that brings to bear the power of modern discrete optimization methods: discrete first order methods motivated by first order methods in convex optimization [45] and mixed integer optimization (MIO), see [4]. We do not guarantee polynomial time solution times as these do not exist for the best subset problem unless P=NP. Rather, our view of computational tractability is the ability of a method to solve problems of practical interest in times that are appropriate for the application addressed. An advantage of our approach is that it adapts to variants of the best subset regression problem of the form:

\[
\begin{array}{ll}
\min\limits_{\boldsymbol{\beta}} & \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{q}^{q}\\
\mathrm{s.t.} & \|\boldsymbol{\beta}\|_{0}\leq k\\
& \mathbf{A}\boldsymbol{\beta}\leq\mathbf{b},
\end{array}
\]

where $\mathbf{A}\boldsymbol{\beta}\leq\mathbf{b}$ represents polyhedral constraints and $q\in\{1,2\}$ refers to a least absolute deviation or the least squares loss function on the residuals $\mathbf{r}:=\mathbf{y}-\mathbf{X}\boldsymbol{\beta}$.

Existing approaches in the Mathematical Optimization Literature

In a seminal paper [30], the authors describe a leaps and bounds procedure for computing global solutions to Problem (1) (for the classical $n>p$ case) which can be achieved with computational effort significantly less than complete enumeration. leaps, a state-of-the-art R package, uses this principle to perform best subset selection for problems with $n>p$ and $p\leq 30$. [3] proposed a tailored branch-and-bound scheme that can be applied to Problem (1) using ideas from [30] and techniques in quadratic optimization, extending and enhancing the proposal of [6]. The proposal of [3] concentrates on obtaining high quality upper bounds for Problem (1) and is less scalable than the methods presented in this paper.

Contributions

We summarize our contributions in this paper below:

  1. We use MIO to find a provably optimal solution for the best subset problem. Our approach has the appealing characteristic that if we terminate the algorithm early, we obtain a solution with a guarantee on its suboptimality. Furthermore, our framework can accommodate side constraints on $\boldsymbol{\beta}$ and also extends to finding best subset solutions for the least absolute deviation loss function.

  2. We introduce a general algorithmic framework based on a discrete extension of modern first order continuous optimization methods that provides near-optimal solutions for the best subset problem. The MIO algorithm significantly benefits from solutions obtained by the first order methods and from problem specific information that can be computed in a data-driven fashion.

  3. We report computational results with both synthetic and real-world datasets that show that our proposed framework can deliver provably optimal solutions for problems of size $n$ in the 1000s and $p$ in the 100s in minutes. For high-dimensional problems with $n\in\{50,100\}$ and $p\in\{1000,2000\}$, with the aid of warm starts and further problem-specific information, our approach finds near optimal solutions in minutes but takes hours to prove optimality.

  4. We investigate the statistical properties of best subset selection procedures for practical problem sizes, which, to the best of our knowledge, have remained largely unexplored to date. We demonstrate the favorable predictive performance and sparsity-inducing properties of the best subset selection procedure over its competitors in a wide variety of real and synthetic examples for the least squares and absolute deviation loss functions.

The structure of the paper is as follows. In Section 2, we present a brief overview of MIO, including a summary of the computational advances it has enjoyed in the last twenty-five years. We present the proposed MIO formulations for the best subset problem as well as some connections with the compressed sensing literature for estimating parameters and providing lower bounds for the MIO formulations that improve their computational performance. In Section 3, we develop a discrete extension of first order methods in convex optimization to obtain near optimal solutions for the best subset problem and establish its convergence properties; the proposed algorithm and its properties may be of independent interest. Section 4 briefly reviews some of the statistical properties of the best-subset solution, highlighting the performance gaps in prediction error over regular Lasso-type estimators. In Section 5, we perform a variety of computational tests on synthetic and real datasets to assess the algorithmic and statistical performances of our approach for the least squares loss function for both the classical overdetermined case $n>p$ and the high-dimensional case $p\gg n$. In Section 6, we report computational results for the least absolute deviation loss function. In Section 7, we include our concluding remarks. Due to space limitations, some of the material has been relegated to the Appendix.

2 Mixed Integer Optimization Formulations

In this section, we present a brief overview of MIO, including the simply astonishing advances it has enjoyed in the last twenty-five years. We then present the proposed MIO formulations for the best subset problem as well as some connections with the compressed sensing literature for estimating parameters. We also present completely data driven methods to estimate parameters in the MIO formulations that improve their computational performance.

2.1 Brief Background on MIO

The general form of a Mixed Integer Quadratic Optimization (MIQO) problem is as follows:

\[
\begin{array}{ll}
\min & \boldsymbol{\alpha}^{T}\mathbf{Q}\boldsymbol{\alpha}+\boldsymbol{\alpha}^{T}\mathbf{a}\\
\mathrm{s.t.} & \mathbf{A}\boldsymbol{\alpha}\leq\mathbf{b}\\
& \alpha_{i}\in\{0,1\},\;\forall i\in{\mathcal{I}}\\
& \alpha_{j}\in\mathbb{R}_{+},\;\forall j\notin{\mathcal{I}},
\end{array}
\]

where $\mathbf{a}\in\mathbb{R}^{m}$, $\mathbf{A}\in\mathbb{R}^{k\times m}$, $\mathbf{b}\in\mathbb{R}^{k}$ and $\mathbf{Q}\in\mathbb{R}^{m\times m}$ (positive semidefinite) are the given parameters of the problem; $\mathbb{R}_{+}$ denotes the non-negative reals, the symbol $\leq$ denotes element-wise inequalities, and we optimize over $\boldsymbol{\alpha}\in\mathbb{R}^{m}$ containing both discrete ($\alpha_{i}$, $i\in{\mathcal{I}}$) and continuous ($\alpha_{i}$, $i\notin{\mathcal{I}}$) variables, with ${\mathcal{I}}\subset\{1,\ldots,m\}$. For background on MIO see [4]. Subclasses of MIQO problems include convex quadratic optimization problems (${\mathcal{I}}=\emptyset$), mixed integer linear optimization problems ($\mathbf{Q}=\mathbf{0}_{m\times m}$) and linear optimization problems (${\mathcal{I}}=\emptyset$, $\mathbf{Q}=\mathbf{0}_{m\times m}$). Modern integer optimization solvers such as Gurobi and Cplex are able to tackle MIQO problems.

In the last twenty-five years (1991-2014) the computational power of MIO solvers has increased at an astonishing rate. In [7], to measure the speedup of MIO solvers, the same set of MIO problems were tested on the same computers using twelve consecutive versions of Cplex and version-on-version speedups were reported. The versions tested ranged from Cplex 1.2, released in 1991 to Cplex 11, released in 2007. Each version released in these years produced a speed improvement on the previous version, leading to a total speedup factor of more than 29,000 between the first and last version tested (see [7], [42] for details). Gurobi 1.0, a MIO solver which was first released in 2009, was measured to have similar performance to Cplex 11. Version-on-version speed comparisons of successive Gurobi releases have shown a speedup factor of more than 20 between Gurobi 5.5, released in 2013, and Gurobi 1.0 ([7], [42]). The combined machine-independent speedup factor in MIO solvers between 1991 and 2013 is 580,000. This impressive speedup factor is due to incorporating both theoretical and practical advances into MIO solvers. Cutting plane theory, disjunctive programming for branching rules, improved heuristic methods, techniques for preprocessing MIOs, using linear optimization as a black box to be called by MIO solvers, and improved linear optimization methods have all contributed greatly to the speed improvements in MIO solvers [7].

In addition, the past twenty years have also brought dramatic improvements in hardware. Figure 1 shows the exponentially increasing speed of supercomputers over the past twenty years, measured in billion floating point operations per second [1]. The hardware speedup from 1993 to 2013 is approximately $10^{5.5}\sim 320{,}000$. When both hardware and software improvements are considered, the overall speedup is approximately 200 billion! Note that the speedup factors cited here refer to mixed integer linear optimization problems, not MIQO problems. The speedup factors for MIQO problems are similar. MIO solvers provide both feasible solutions as well as lower bounds to the optimal value. As the MIO solver progresses towards the optimal solution, the lower bounds improve and provide an increasingly better guarantee of suboptimality, which is especially useful if the MIO solver is stopped before reaching the global optimum. In contrast, heuristic methods do not provide such a certificate of suboptimality.

Figure 1: Log of Peak Supercomputer Speed from 1993–2013.

The belief that MIO approaches to problems in statistics are not practically relevant was formed in the 1970s and 1980s and it was at the time justified. Given the astonishing speedup of MIO solvers and computer hardware in the last twenty-five years, the mindset of MIO as theoretically elegant but practically irrelevant is no longer justified. In this paper, we provide empirical evidence of this fact in the context of the best subset selection problem.

2.2 MIO Formulations for the Best Subset Selection Problem

We first present a simple reformulation to Problem (1) as a MIO (in fact a MIQO) problem:

\[
\begin{array}{ll}
Z_{1}=\min\limits_{\boldsymbol{\beta},\mathbf{z}} & \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\\
\qquad\mathrm{s.t.} & -{\mathcal{M}}_{U}z_{i}\leq\beta_{i}\leq{\mathcal{M}}_{U}z_{i},\;i=1,\ldots,p\\
& z_{i}\in\{0,1\},\;i=1,\ldots,p\\
& \sum\limits_{i=1}^{p}z_{i}\leq k,
\end{array}
\tag{5}
\]

where $\mathbf{z}\in\{0,1\}^{p}$ is a binary variable and ${\mathcal{M}}_{U}$ is a constant such that if $\widehat{\boldsymbol{\beta}}$ is a minimizer of Problem (5), then ${\mathcal{M}}_{U}\geq\|\widehat{\boldsymbol{\beta}}\|_{\infty}$. If $z_{i}=1$, then $|\beta_{i}|\leq{\mathcal{M}}_{U}$ and if $z_{i}=0$, then $\beta_{i}=0$. Thus, $\sum_{i=1}^{p}z_{i}$ is an indicator of the number of non-zeros in $\boldsymbol{\beta}$.

Provided that ${\mathcal{M}}_{U}$ is chosen to be sufficiently large with ${\mathcal{M}}_{U}\geq\|\widehat{\boldsymbol{\beta}}\|_{\infty}$, a solution to Problem (5) will be a solution to Problem (1). Of course, ${\mathcal{M}}_{U}$ is not known a priori, and a small value of ${\mathcal{M}}_{U}$ may lead to a solution different from (1). The choice of ${\mathcal{M}}_{U}$ affects the strength of the formulation and is critical for obtaining good lower bounds in practice. In Section 2.3 we describe how to find appropriate values for ${\mathcal{M}}_{U}$. Note that there are other MIO formulations, presented herein (see Problem (8)), that do not rely on a-priori specifications of ${\mathcal{M}}_{U}$. However, we will stick to formulation (5) for the time being, since it provides some interesting connections to the Lasso.
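A compact sketch of formulation (5) in Gurobi's Python interface is shown below. This is our illustration rather than the authors' implementation, and it assumes a value of ${\mathcal{M}}_{U}$ is supplied (Section 2.3 discusses data-driven estimates).

```python
import gurobipy as gp
from gurobipy import GRB

def best_subset_mio(X, y, k, M_U, time_limit=600.0):
    """Formulation (5): min 0.5||y - X b||^2 s.t. -M_U z_i <= b_i <= M_U z_i, sum z_i <= k."""
    n, p = X.shape
    m = gp.Model("best-subset")
    beta = m.addMVar(p, lb=-M_U, ub=M_U, name="beta")
    z = m.addMVar(p, vtype=GRB.BINARY, name="z")
    m.addConstr(beta <= M_U * z)        # beta_i <=  M_U z_i
    m.addConstr(beta >= -M_U * z)       # beta_i >= -M_U z_i
    m.addConstr(z.sum() <= k)           # cardinality constraint
    Q, c = X.T @ X, X.T @ y
    m.setObjective(0.5 * (beta @ Q @ beta) - c @ beta + 0.5 * (y @ y), GRB.MINIMIZE)
    m.Params.TimeLimit = time_limit     # early termination still returns a certified gap
    m.optimize()
    return beta.X, m.MIPGap             # incumbent solution and suboptimality gap
```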

Formulation (5) leads to interesting insights, especially via the structure of the convex hull of its constraints, as illustrated next:

\[
\begin{array}{ll}
& \text{Conv}\left(\left\{\boldsymbol{\beta}:|\beta_{i}|\leq{\mathcal{M}}_{U}z_{i},\;z_{i}\in\{0,1\},\;i=1,\ldots,p,\;\sum\limits_{i=1}^{p}z_{i}\leq k\right\}\right)\\
=& \{\boldsymbol{\beta}:\|\boldsymbol{\beta}\|_{\infty}\leq{\mathcal{M}}_{U},\;\|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{U}k\}\subseteq\{\boldsymbol{\beta}:\|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{U}k\}.
\end{array}
\]

Thus, the minimum of Problem (5) is lower-bounded by the optimum objective value of both the following convex optimization problems:

\[
Z_{2}:=\min_{\boldsymbol{\beta}}\;\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\;\;\mathrm{subject\;to}\;\;\|\boldsymbol{\beta}\|_{\infty}\leq{\mathcal{M}}_{U},\;\|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{U}k \tag{6}
\]
\[
Z_{3}:=\min_{\boldsymbol{\beta}}\;\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\;\;\mathrm{subject\;to}\;\;\|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{U}k, \tag{7}
\]

where (7) is the familiar Lasso in constrained form. This is a weaker relaxation than formulation (6), which, in addition to the $\ell_{1}$ constraint on $\boldsymbol{\beta}$, has box constraints controlling the values of the $\beta_{i}$'s. It is easy to see that the following ordering exists: $Z_{3}\leq Z_{2}\leq Z_{1}$, with the inequalities being strict in most instances.
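Both bounds are cheap to compute; a sketch using CVXPY under the same notation (our illustration, not part of the paper) is:

```python
import cvxpy as cp

def relaxation_bounds(X, y, k, M_U):
    """Optimal values of (6) and (7); both lower-bound the MIO optimum Z_1."""
    beta = cp.Variable(X.shape[1])
    loss = 0.5 * cp.sum_squares(y - X @ beta)
    Z2 = cp.Problem(cp.Minimize(loss),
                    [cp.norm(beta, "inf") <= M_U,
                     cp.norm(beta, 1) <= M_U * k]).solve()
    Z3 = cp.Problem(cp.Minimize(loss),
                    [cp.norm(beta, 1) <= M_U * k]).solve()  # Lasso in constrained form
    return Z2, Z3   # Z3 <= Z2 <= Z1
```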

In terms of approximating the optimal solution to Problem (5), the MIO solver begins by first solving a continuous relaxation of Problem (5). The Lasso formulation (7) is weaker than this root node relaxation. Additionally, MIO is typically able to significantly improve the quality of the root node solution as the MIO solver progresses toward the optimal solution.

To motivate the reader we provide an example of the evolution (see Figure 2) of the MIO formulation (8) for the Diabetes dataset [23], with $n=350$, $p=64$ (for further details on the dataset see Section 5).

Since formulation (5) is sensitive to the choice of ${\mathcal{M}}_{U}$, we consider an alternative MIO formulation based on Specially Ordered Sets [4], as described next.

Formulations via Specially Ordered Sets

Any feasible solution to formulation (5) will have $(1-z_{i})\beta_{i}=0$ for every $i\in\{1,\ldots,p\}$. This constraint can be modeled via integer optimization using Specially Ordered Sets of Type 1 (SOS-1) [4]. In an SOS-1 constraint, at most one variable in the set can take a nonzero value, that is

\[ (1-z_{i})\beta_{i}=0\;\;\iff\;\;(\beta_{i},1-z_{i}):\text{SOS-1}, \]

for every $i=1,\ldots,p$. This leads to the following formulation of (1):

\[
\begin{array}{ll}
\min\limits_{\boldsymbol{\beta},\mathbf{z}} & \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\\
\mathrm{s.t.} & (\beta_{i},1-z_{i}):\text{SOS-1},\;i=1,\ldots,p\\
& z_{i}\in\{0,1\},\;i=1,\ldots,p\\
& \sum\limits_{i=1}^{p}z_{i}\leq k.
\end{array}
\tag{8}
\]

We note that Problem (8) can in principle be used to obtain global solutions to Problem (1); unlike Problem (5), Problem (8) does not require any specification of the parameter ${\mathcal{M}}_{U}$.
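In Gurobi's Python interface, for example, the SOS-1 link can replace the big-M constraints of the earlier sketch. The snippet below is our illustration (with `p` and `k` assumed given, and an auxiliary variable `w[i]` standing in for $1-z_{i}$):

```python
import gurobipy as gp
from gurobipy import GRB

m = gp.Model("best-subset-sos1")
beta = m.addVars(p, lb=-GRB.INFINITY, name="beta")  # no a-priori bound M_U required
z = m.addVars(p, vtype=GRB.BINARY, name="z")
w = m.addVars(p, lb=0.0, ub=1.0, name="w")          # w_i stands in for 1 - z_i
for i in range(p):
    m.addConstr(w[i] == 1 - z[i])
    m.addSOS(GRB.SOS_TYPE1, [beta[i], w[i]])        # at most one of (beta_i, 1 - z_i) nonzero
m.addConstr(gp.quicksum(z[i] for i in range(p)) <= k)
# the quadratic objective is set exactly as in the formulation-(5) sketch above
```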

Figure 2: The typical evolution of the MIO formulation (8) for the diabetes dataset with $n=350$, $p=64$ with $k=6$ (left panel) and $k=7$ (right panel). The top panel shows the evolution of upper and lower bounds with time; the lower panel shows the evolution of the corresponding MIO gap with time (x-axis: time in seconds). Optimal solutions for both problems are found in a few seconds in both examples, but it takes 10-20 minutes to certify optimality via the lower bounds. Note that the time taken for the MIO to certify convergence to the global optimum increases with increasing $k$.

We now proceed to present a more structured representation of Problem (8). Note that the objective in this problem is a convex quadratic function in the continuous variable $\boldsymbol{\beta}$, which can be formulated explicitly as:

\[
\begin{array}{ll}
\min\limits_{\boldsymbol{\beta},\mathbf{z}} & \frac{1}{2}\boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{X}\boldsymbol{\beta}-\langle\mathbf{X}^{\prime}\mathbf{y},\boldsymbol{\beta}\rangle+\frac{1}{2}\|\mathbf{y}\|_{2}^{2}\\
\mathrm{s.t.} & (\beta_{i},1-z_{i}):\text{SOS-1},\;i=1,\ldots,p\\
& z_{i}\in\{0,1\},\;i=1,\ldots,p\\
& \sum\limits_{i=1}^{p}z_{i}\leq k\\
& -{\mathcal{M}}_{U}\leq\beta_{i}\leq{\mathcal{M}}_{U},\;i=1,\ldots,p\\
& \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{\ell}.
\end{array}
\tag{9}
\]

We also provide problem-dependent constants ${\mathcal{M}}_{U}$ and ${\mathcal{M}}_{\ell}\in[0,\infty]$. ${\mathcal{M}}_{U}$ provides an upper bound on the absolute value of the regression coefficients and ${\mathcal{M}}_{\ell}$ provides an upper bound on the $\ell_{1}$-norm of $\boldsymbol{\beta}$. Adding these bounds typically leads to improved performance of the MIO, especially in delivering lower bound certificates. In Section 2.3, we describe several approaches to compute these parameters from the data.

We also consider another formulation for (9):

\[
\begin{array}{ll}
\min\limits_{\boldsymbol{\beta},\mathbf{z},\boldsymbol{\zeta}} & \frac{1}{2}\boldsymbol{\zeta}^{T}\boldsymbol{\zeta}-\langle\mathbf{X}^{\prime}\mathbf{y},\boldsymbol{\beta}\rangle+\frac{1}{2}\|\mathbf{y}\|_{2}^{2}\\
\mathrm{s.t.} & \boldsymbol{\zeta}=\mathbf{X}\boldsymbol{\beta}\\
& (\beta_{i},1-z_{i}):\text{SOS-1},\;i=1,\ldots,p\\
& z_{i}\in\{0,1\},\;i=1,\ldots,p\\
& \sum\limits_{i=1}^{p}z_{i}\leq k\\
& -{\mathcal{M}}_{U}\leq\beta_{i}\leq{\mathcal{M}}_{U},\;i=1,\ldots,p\\
& \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{\ell}\\
& -{\mathcal{M}}^{\zeta}_{U}\leq\zeta_{i}\leq{\mathcal{M}}^{\zeta}_{U},\;i=1,\ldots,n\\
& \|\boldsymbol{\zeta}\|_{1}\leq{\mathcal{M}}^{\zeta}_{\ell},
\end{array}
\tag{10}
\]

where the optimization variables are $\boldsymbol{\beta}\in\mathbb{R}^{p}$, $\boldsymbol{\zeta}\in\mathbb{R}^{n}$, $\mathbf{z}\in\{0,1\}^{p}$, and ${\mathcal{M}}_{U},{\mathcal{M}}_{\ell},{\mathcal{M}}^{\zeta}_{U},{\mathcal{M}}^{\zeta}_{\ell}\in[0,\infty]$ are problem specific parameters. Note that the objective function in formulation (10) involves a quadratic form in $n$ variables and a linear function in $p$ variables. Problem (10) is equivalent to the following variant of the best subset problem:

\[
\begin{array}{ll}
\min\limits_{\boldsymbol{\beta}} & \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\\
\mathrm{s.t.} & \|\boldsymbol{\beta}\|_{0}\leq k\\
& \|\boldsymbol{\beta}\|_{\infty}\leq{\mathcal{M}}_{U},\;\|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{\ell}\\
& \|\mathbf{X}\boldsymbol{\beta}\|_{\infty}\leq{\mathcal{M}}^{\zeta}_{U},\;\|\mathbf{X}\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}^{\zeta}_{\ell}.
\end{array}
\tag{11}
\]

Formulations (9) and (10) differ in the size of the quadratic forms that are involved. The current state-of-the-art MIO solvers are better-equipped to handle mixed integer linear optimization problems than MIQO problems. Formulation (9) has fewer variables but a quadratic form in $p$ variables; we find this formulation more useful in the $n>p$ regime, with $p$ in the 100s. Formulation (10), on the other hand, has more variables, but involves a quadratic form in $n$ variables; this formulation is more useful for high-dimensional problems with $p\gg n$, where $n$ is in the 100s and $p$ is in the 1000s.

As we said earlier, the bounds on $\boldsymbol{\beta}$ and $\boldsymbol{\zeta}$ are not required, but if these constraints are provided, they improve the strength of the MIO formulation. In other words, formulations with tightly specified bounds provide better lower bounds to the global optimization problem in a specified amount of time, when compared to a MIO formulation with loose bound specifications. We next show how these bounds can be computed from given data.

2.3 Specification of Parameters

In this section, we obtain estimates for the quantities ${\mathcal{M}}_{U},{\mathcal{M}}_{\ell},{\mathcal{M}}^{\zeta}_{U},{\mathcal{M}}^{\zeta}_{\ell}$ such that an optimal solution to Problem (11) is also an optimal solution to Problem (1), and vice-versa.

Coherence and Restricted Eigenvalues of a Model Matrix

Given a model matrix $\mathbf{X}$, [51] introduced the cumulative coherence function

\[ \mu[k]:=\max_{|I|=k}\;\max_{j\notin I}\sum_{i\in I}|\langle\mathbf{X}_{j},\mathbf{X}_{i}\rangle|, \]

where $\mathbf{X}_{j}$, $j=1,\ldots,p$, represent the columns of $\mathbf{X}$, i.e., features.

For $k=1$, we obtain the notion of coherence introduced in [22, 21] as a measure of the maximal pairwise correlation in absolute value of the columns of $\mathbf{X}$: $\mu:=\mu[1]=\max_{i\neq j}|\langle\mathbf{X}_{i},\mathbf{X}_{j}\rangle|$.
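Both $\mu$ and $\mu[k]$ follow directly from the Gram matrix when the columns of $\mathbf{X}$ have unit $\ell_{2}$-norm; a short sketch (our illustration) is below. For a fixed column $j$, the maximizing set $I$ simply picks the $k$ largest $|\langle\mathbf{X}_{j},\mathbf{X}_{i}\rangle|$ over $i\neq j$.

```python
import numpy as np

def cumulative_coherence(X, k):
    """mu[k]: max over columns j of the sum of the k largest |<X_j, X_i>|, i != j."""
    G = np.abs(X.T @ X)
    np.fill_diagonal(G, 0.0)                          # exclude i = j
    return np.sort(G, axis=0)[-k:, :].sum(axis=0).max()

# mu = mu[1], the maximal pairwise absolute inner product:
# mu = cumulative_coherence(X, 1)
```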

[16, 14] (see also [11] and references therein) introduced the notion that a matrix $\mathbf{X}$ satisfies a restricted eigenvalue condition if

\[ \lambda_{\min}(\mathbf{X}_{I}^{\prime}\mathbf{X}_{I})\geq\eta_{k}\quad\text{for every}\;I\subset\{1,\ldots,p\}:\;|I|\leq k, \tag{12} \]

where $\lambda_{\min}(\mathbf{X}_{I}^{\prime}\mathbf{X}_{I})$ denotes the smallest eigenvalue of the matrix $\mathbf{X}_{I}^{\prime}\mathbf{X}_{I}$. An inequality linking $\mu[k]$ and $\eta_{k}$ is as follows.

Proposition 1.

The following bounds hold:

  (a) [51]: $\mu[k]\leq\mu\cdot k$.

  (b) [21]: $\eta_{k}\geq 1-\mu[k-1]\geq 1-\mu\cdot(k-1)$.

The computations of $\mu[k]$ and $\eta_{k}$ for general $k$ are difficult, while $\mu$ is simple to compute. Proposition 1 provides bounds for $\mu[k]$ and $\eta_{k}$ in terms of the coherence $\mu$.
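For small $p$, however, $\eta_{k}$ can be computed by direct enumeration, which is useful for sanity-checking the bound in Proposition 1; a sketch (our illustration, with cost growing combinatorially in $p$ and $k$) is:

```python
from itertools import combinations
import numpy as np

def restricted_eigenvalue(X, k):
    """eta_k: smallest eigenvalue of X_I' X_I over all |I| = k (by eigenvalue
    interlacing, subsets with |I| < k cannot give a smaller value)."""
    eta = np.inf
    for I in combinations(range(X.shape[1]), k):
        XI = X[:, list(I)]
        eta = min(eta, np.linalg.eigvalsh(XI.T @ XI)[0])  # eigvalsh sorts ascending
    return eta
```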

Operator Norms of Submatrices

The $(p,q)$ operator norm of a matrix $\mathbf{A}$ is

\[ \|\mathbf{A}\|_{p,q}:=\max_{\|\mathbf{u}\|_{q}=1}\|\mathbf{A}\mathbf{u}\|_{p}. \]

We will use extensively here the $(1,1)$ operator norm. We assume that each column vector of $\mathbf{X}$ has unit $\ell_{2}$-norm. The results derived in the next proposition borrow and enhance techniques developed by [51] in the context of analyzing the $\ell_{1}$-$\ell_{0}$ equivalence in compressed sensing.

Proposition 2.

For any $I\subset\{1,\ldots,p\}$ with $|I|=k$ we have:

  (a) $\|\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}-\mathbf{I}\|_{1,1}\leq\mu[k-1]$.

  (b) If the matrix $\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}$ is invertible and $\|\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}-\mathbf{I}\|_{1,1}<1$, then $\|(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\|_{1,1}\leq\frac{1}{1-\mu[k-1]}$.
Proof.

See Section A.3. ∎



We note that Part (b) also appears in [51] for the operator norm $\|(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\|_{\infty,\infty}$.

Given a set $I\subset\{1,\ldots,p\}$ with $|I|=k$, we let $\widehat{\boldsymbol{\beta}}_{I}$ denote the least squares regression coefficients obtained by regressing $\mathbf{y}$ on $\mathbf{X}_{I}$, i.e., $\widehat{\boldsymbol{\beta}}_{I}=(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}\mathbf{y}$. If we append $\widehat{\boldsymbol{\beta}}_{I}$ with zeros in the remaining coordinates, we obtain $\widehat{\boldsymbol{\beta}}$ as follows: $\widehat{\boldsymbol{\beta}}\in\operatorname*{arg\,min}_{\boldsymbol{\beta}:\beta_{i}=0,i\notin I}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}$. Note that $\widehat{\boldsymbol{\beta}}$ depends on $I$, but we will suppress the dependence on $I$ for notational convenience.
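In code, $\widehat{\boldsymbol{\beta}}$ for a given support $I$ is a single restricted least squares fit padded with zeros (a minimal sketch, our illustration):

```python
import numpy as np

def restricted_ls(X, y, I):
    """beta_hat: least squares of y on X_I, zeros outside I."""
    cols = list(I)
    coef, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    beta_hat = np.zeros(X.shape[1])
    beta_hat[cols] = coef
    return beta_hat
```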

2.3.1 Specification of Parameters in terms of Coherence and Restricted Strong Convexity

Recall that $\mathbf{X}_{j}$, $j=1,\ldots,p$, represent the columns of $\mathbf{X}$; and we will use $\mathbf{x}_{i}$, $i=1,\ldots,n$, to denote the rows of $\mathbf{X}$. As discussed above, $\|\mathbf{X}_{j}\|=1$. We order the correlations $|\langle\mathbf{X}_{j},\mathbf{y}\rangle|$:

\[ |\langle\mathbf{X}_{(1)},\mathbf{y}\rangle|\geq|\langle\mathbf{X}_{(2)},\mathbf{y}\rangle|\geq\ldots\geq|\langle\mathbf{X}_{(p)},\mathbf{y}\rangle|. \tag{13} \]

We finally denote by $\|\mathbf{x}_{i}\|_{1:k}$ the sum of the top $k$ absolute values of the entries of $x_{ij}$, $j\in\{1,2,\ldots,p\}$.
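Both the ordering in (13) and the quantity $\|\mathbf{x}_{i}\|_{1:k}$ are straightforward to compute (a sketch, our illustration, with `X`, `y` and `k` assumed given):

```python
import numpy as np

def top_k_abs_sum(x, k):
    """||x||_{1:k}: sum of the k largest absolute entries of x."""
    return np.sort(np.abs(x))[-k:].sum()

order = np.argsort(-np.abs(X.T @ y))    # column ordering of |<X_j, y>| as in (13)
row_norms = np.array([top_k_abs_sum(X[i, :], k) for i in range(X.shape[0])])
```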