
Best Subset Selection via a Modern Optimization Lens

Dimitris Bertsimas
MIT Sloan School of Management and Operations Research Center, Massachusetts Institute of Technology: dbertsim@mit.edu

Angela King
Operations Research Center, Massachusetts Institute of Technology: aking10@mit.edu

Rahul Mazumder
MIT Sloan School of Management and Operations Research Center, Massachusetts Institute of Technology: rahulmaz@mit.edu
(This is a Revised Version dated May, 2015. First Version Submitted for Publication on June, 2014.)
Abstract

In the last twenty-five years (1990-2014), algorithmic advances in integer optimization combined with hardware improvements have resulted in an astonishing 200 billion factor speedup in solving Mixed Integer Optimization (MIO) problems. We present a MIO approach for solving the classical best subset selection problem of choosing $k$ out of $p$ features in linear regression given $n$ observations. We develop a discrete extension of modern first order continuous optimization methods to find high quality feasible solutions that we use as warm starts to a MIO solver that finds provably optimal solutions. The resulting algorithm (a) provides a solution with a guarantee on its suboptimality even if we terminate the algorithm early, (b) can accommodate side constraints on the coefficients of the linear regression and (c) extends to finding best subset solutions for the least absolute deviation loss function. Using a wide variety of synthetic and real datasets, we demonstrate that our approach solves problems with $n$ in the 1000s and $p$ in the 100s in minutes to provable optimality, and finds near optimal solutions for $n$ in the 100s and $p$ in the 1000s in minutes. We also establish via numerical experiments that the MIO approach performs better than Lasso and other popularly used sparse learning procedures, in terms of achieving sparse solutions with good predictive power.

1 Introduction

We consider the linear regression model with response vector $\mathbf{y}\in\mathbb{R}^{n\times 1}$, model matrix $\mathbf{X}=[\mathbf{x}_{1},\ldots,\mathbf{x}_{p}]\in\mathbb{R}^{n\times p}$, regression coefficients $\boldsymbol{\beta}\in\mathbb{R}^{p\times 1}$ and errors $\boldsymbol{\epsilon}\in\mathbb{R}^{n\times 1}$:

$$\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}.$$

We will assume that the columns of $\mathbf{X}$ have been standardized to have zero means and unit $\ell_{2}$-norm. In many important classical and modern statistical applications, it is desirable to obtain a parsimonious fit to the data by finding the best $k$-feature fit to the response $\mathbf{y}$. Especially in the high-dimensional regime with $p\gg n$, in order to conduct statistically meaningful inference, it is desirable to assume that the true regression coefficient $\boldsymbol{\beta}$ is sparse or may be well approximated by a sparse vector. Quite naturally, the last few decades have seen a flurry of activity in estimating sparse linear models with good explanatory power. Central to this statistical task lies the best subset problem [40] with subset size $k$, which is given by the following optimization problem:

$$\min_{\boldsymbol{\beta}}\;\;\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\;\;\;\mathrm{subject\;to}\;\;\;\|\boldsymbol{\beta}\|_{0}\leq k, \qquad (1)$$

where the $\ell_{0}$ (pseudo)norm of a vector $\boldsymbol{\beta}$ counts the number of nonzeros in $\boldsymbol{\beta}$ and is given by $\|\boldsymbol{\beta}\|_{0}=\sum_{i=1}^{p}1(\beta_{i}\neq 0)$, where $1(\cdot)$ denotes the indicator function. The cardinality constraint makes Problem (1) NP-hard [41]. Indeed, state-of-the-art algorithms to solve Problem (1), as implemented in popular statistical packages, like leaps in R, do not scale to problem sizes larger than $p=30$. For this reason, it is not surprising that the best subset problem has been widely dismissed as intractable by the greater statistical community.

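To make the combinatorial difficulty of Problem (1) concrete, the illustrative Python sketch below (our code, not part of the paper) solves it exactly by enumerating all $\binom{p}{k}$ candidate supports and fitting least squares on each; since the number of subsets grows exponentially in $p$, such exhaustive strategies stall around $p=30$.

import itertools
import numpy as np

def best_subset_by_enumeration(X, y, k):
    """Solve Problem (1) exactly by trying every size-k support (feasible only for tiny p)."""
    n, p = X.shape
    best_obj, best_beta = np.inf, None
    for support in itertools.combinations(range(p), k):
        cols = list(support)
        # least squares restricted to the chosen columns
        beta_I, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        obj = 0.5 * np.linalg.norm(y - X[:, cols] @ beta_I) ** 2
        if obj < best_obj:
            best_obj = obj
            best_beta = np.zeros(p)
            best_beta[cols] = beta_I
    return best_obj, best_beta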
In this paper we address Problem (1) using modern optimization methods, specifically mixed integer optimization (MIO) and a discrete extension of first order continuous optimization methods. Using a wide variety of synthetic and real datasets, we demonstrate that our approach solves problems with $n$ in the 1000s and $p$ in the 100s in minutes to provable optimality, and finds near optimal solutions for $n$ in the 100s and $p$ in the 1000s in minutes. To the best of our knowledge, this is the first time that MIO has been demonstrated to be a tractable solution method for Problem (1). We note that we use the term tractability not to mean the usual polynomial solvability for problems, but rather the ability to solve problems of realistic size in times that are appropriate for the applications we consider.

As there is a vast literature on the best subset problem, we next give a brief and selective overview of related approaches for the problem.

Brief Context and Background

To overcome the computational difficulties of the best subset problem, computationally tractable convex optimization based methods like Lasso [49, 17] have been proposed as a convex surrogate for Problem (1). For the linear regression problem, the Lagrangian form of Lasso solves

$$\min_{\boldsymbol{\beta}}\;\;\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\lambda\|\boldsymbol{\beta}\|_{1}, \qquad (2)$$

where the $\ell_{1}$ penalty on $\boldsymbol{\beta}$, i.e., $\|\boldsymbol{\beta}\|_{1}=\sum_{i}|\beta_{i}|$, shrinks the coefficients towards zero and naturally produces a sparse solution by setting many coefficients to be exactly zero. There has been a substantial amount of impressive work on Lasso [23, 15, 5, 55, 32, 59, 19, 35, 39, 53, 50] in terms of algorithms and understanding of its theoretical properties—see for example the excellent books or surveys [11, 34, 50] and the references therein.

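As a point of reference, Problem (2) can be solved with standard software. The snippet below is a minimal illustration using scikit-learn's Lasso (an assumption about tooling on our part, not code from the paper); note that scikit-learn scales the quadratic loss by $1/(2n)$, so its alpha corresponds to $\lambda/n$ in (2).

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)              # standardized columns, as assumed in the paper
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(n)

lam = 0.5                                   # lambda in (2)
fit = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y)   # sklearn's alpha = lambda / n
print(np.flatnonzero(fit.coef_))            # indices of the selected (nonzero) coefficients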
Indeed, Lasso enjoys several attractive statistical properties and has drawn a significant amount of attention from the statistics community as well as other closely related fields. Under various conditions on the model matrix $\mathbf{X}$ and $n,p,\boldsymbol{\beta}$, it can be shown that Lasso delivers a sparse model with good predictive performance [11, 34]. In order to perform exact variable selection, much stronger assumptions are required [11]. Sufficient conditions under which Lasso gives a sparse model with good predictive performance are the restricted eigenvalue conditions and compatibility conditions [11]. These involve statements about the range of the spectrum of sub-matrices of $\mathbf{X}$ and are difficult to verify for a given data-matrix $\mathbf{X}$.

An important reason behind the popularity of Lasso is its computational feasibility and scalability to practical sized problems. Problem (2) is a convex quadratic optimization problem and there are several efficient solvers for it, see for example [44, 23, 29].

In spite of its favorable statistical properties, Lasso has several shortcomings. In the presence of noise and correlated variables, in order to deliver a model with good predictive accuracy, Lasso brings in a large number of nonzero coefficients (all of which are shrunk towards zero), including noise variables. Lasso leads to biased regression coefficient estimates, since the $\ell_{1}$-norm penalizes the large coefficients more severely than the smaller coefficients. In contrast, if the best subset selection procedure decides to include a variable in the model, it brings it in without any shrinkage, thereby draining the effect of its correlated surrogates. Upon increasing the degree of regularization, Lasso sets more coefficients to zero, but in the process ends up leaving out true predictors from the active set. Thus, as soon as certain sufficient regularity conditions on the data are violated, Lasso becomes suboptimal as (a) a variable selector and (b) in terms of delivering a model with good predictive performance.

The shortcomings of Lasso are also known in the statistics literature. In fact, there is a significant gap between what can be achieved via best subset selection and Lasso: this is supported by empirical (for small problem sizes, i.e., $p\leq 30$) and theoretical evidence, see for example [46, 58, 38, 31, 56, 48] and the references therein. Some discussion is also presented herein, in Section 4.

To address the shortcomings, non-convex penalized regression is often used to “bridge” the gap between the convex $\ell_{1}$ penalty and the combinatorial $\ell_{0}$ penalty [38, 27, 24, 54, 55, 28, 61, 62, 57, 13]. Written in Lagrangian form, this gives rise to continuous non-convex optimization problems of the form:

$$\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\sum_{i}p(|\beta_{i}|;\gamma;\lambda), \qquad (3)$$

where $p(|\beta|;\gamma;\lambda)$ is a non-convex function in $\beta$, with $\lambda$ and $\gamma$ denoting the degree of regularization and non-convexity, respectively. Typical examples of non-convex penalties include the minimax concave penalty (MCP), the smoothly clipped absolute deviation (SCAD), and $\ell_{\gamma}$ penalties (see for example [27, 38, 62, 24]). There is strong statistical evidence indicating the usefulness of estimators obtained as minimizers of non-convex penalized problems (3) over Lasso, see for example [56, 36, 54, 25, 52, 37, 60, 26]. In a recent paper, [60] discuss the usefulness of non-convex penalties over convex penalties (like Lasso) in identifying important covariates, leading to efficient estimation strategies in high dimensions. They describe interesting connections between $\ell_{0}$ regularized least squares and least squares with the hard thresholding penalty; and in the process develop comprehensive global properties of hard thresholding regularization in terms of various metrics. [26] establish asymptotic equivalence of a wide class of regularization methods in high dimensions with comprehensive sampling properties on both global and computable solutions.

Problem (3) mainly leads to a family of continuous and non-convex optimization problems. Various effective nonlinear optimization based methods (see for example [62, 24, 13, 36, 54, 38] and the references therein) have been proposed in the literature to obtain good local minimizers to Problem (3). In particular [38] proposes Sparsenet, a coordinate-descent procedure to trace out a surface of local minimizers for Problem (3) for the MCP penalty using effective warm start procedures. None of the existing approaches for solving Problem (3), however, come with guarantees of how close the solutions are to the global minimum of Problem (3).

The Lagrangian version of (1) given by

$$\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\lambda\sum_{i=1}^{p}1(\beta_{i}\neq 0), \qquad (4)$$

may be seen as a special case of (3). Note that, due to non-convexity, problems (4) and (1) are not equivalent. Problem (1) allows one to control the exact level of sparsity via the choice of $k$, unlike (4), where there is no clear correspondence between $\lambda$ and $k$. Problem (4) is a discrete optimization problem, unlike the continuous optimization problems (3) arising from continuous non-convex penalties.

Insightful statistical properties of Problem (4) have been explored from a theoretical viewpoint in [56, 31, 32, 48]. [48] points out that (1) is preferable over (4) in terms of superior statistical properties of the resulting estimator. The aforementioned papers, however, do not discuss methods to obtain provably optimal solutions to problems (4) or (1), and to the best of our knowledge, computing optimal solutions to problems (4) and (1) is deemed intractable.

Our Approach

In this paper, we propose a novel framework via which the best subset selection problem can be solved to optimality or near optimality in problems of practical interest within a reasonable time frame. At the core of our proposal is a computationally tractable framework that brings to bear the power of modern discrete optimization methods: discrete first order methods motivated by first order methods in convex optimization [45] and mixed integer optimization (MIO), see [4]. We do not guarantee polynomial time solution times as these do not exist for the best subset problem unless P=NP. Rather, our view of computational tractability is the ability of a method to solve problems of practical interest in times that are appropriate for the application addressed. An advantage of our approach is that it adapts to variants of the best subset regression problem of the form:

$$\begin{array}{ll}\min\limits_{\boldsymbol{\beta}} & \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{q}^{q}\\ \mathrm{s.t.} & \|\boldsymbol{\beta}\|_{0}\leq k\\ & \mathbf{A}\boldsymbol{\beta}\leq\mathbf{b},\end{array}$$

where $\mathbf{A}\boldsymbol{\beta}\leq\mathbf{b}$ represents polyhedral constraints and $q\in\{1,2\}$ refers to a least absolute deviation or the least squares loss function on the residuals $\mathbf{r}:=\mathbf{y}-\mathbf{X}\boldsymbol{\beta}$.

Existing approaches in the Mathematical Optimization Literature

In a seminal paper [30], the authors describe a leaps and bounds procedure for computing global solutions to Problem (1) (for the classical $n>p$ case), which can be achieved with computational effort significantly less than complete enumeration. leaps, a state-of-the-art R package, uses this principle to perform best subset selection for problems with $n>p$ and $p\leq 30$. [3] proposed a tailored branch-and-bound scheme that can be applied to Problem (1) using ideas from [30] and techniques in quadratic optimization, extending and enhancing the proposal of [6]. The proposal of [3] concentrates on obtaining high quality upper bounds for Problem (1) and is less scalable than the methods presented in this paper.

Contributions

We summarize our contributions in this paper below:

  1. We use MIO to find a provably optimal solution for the best subset problem. Our approach has the appealing characteristic that if we terminate the algorithm early, we obtain a solution with a guarantee on its suboptimality. Furthermore, our framework can accommodate side constraints on $\boldsymbol{\beta}$ and also extends to finding best subset solutions for the least absolute deviation loss function.

  2. We introduce a general algorithmic framework based on a discrete extension of modern first order continuous optimization methods that provide near-optimal solutions for the best subset problem. The MIO algorithm significantly benefits from solutions obtained by the first order methods and problem specific information that can be computed in a data-driven fashion.

  3. We report computational results with both synthetic and real-world datasets that show that our proposed framework can deliver provably optimal solutions for problems of size $n$ in the 1000s and $p$ in the 100s in minutes. For high-dimensional problems with $n\in\{50,100\}$ and $p\in\{1000,2000\}$, with the aid of warm starts and further problem-specific information, our approach finds near optimal solutions in minutes but takes hours to prove optimality.

  4. We investigate the statistical properties of best subset selection procedures for practical problem sizes, which to the best of our knowledge, have remained largely unexplored to date. We demonstrate the favorable predictive performance and sparsity-inducing properties of the best subset selection procedure over its competitors in a wide variety of real and synthetic examples for the least squares and absolute deviation loss functions.

The structure of the paper is as follows. In Section 2, we present a brief overview of MIO, including a summary of the computational advances it has enjoyed in the last twenty-five years. We present the proposed MIO formulations for the best subset problem as well as some connections with the compressed sensing literature for estimating parameters and providing lower bounds for the MIO formulations that improve their computational performance. In Section 3, we develop a discrete extension of first order methods in convex optimization to obtain near optimal solutions for the best subset problem and establish its convergence properties—the proposed algorithm and its properties may be of independent interest. Section 4 briefly reviews some of the statistical properties of the best-subset solution, highlighting the performance gaps in prediction error over regular Lasso-type estimators. In Section 5, we perform a variety of computational tests on synthetic and real datasets to assess the algorithmic and statistical performances of our approach for the least squares loss function for both the classical overdetermined case $n>p$ and the high-dimensional case $p\gg n$. In Section 6, we report computational results for the least absolute deviation loss function. In Section 7, we include our concluding remarks. Due to space limitations, some of the material has been relegated to the Appendix.

2 Mixed Integer Optimization Formulations

In this section, we present a brief overview of MIO, including the simply astonishing advances it has enjoyed in the last twenty-five years. We then present the proposed MIO formulations for the best subset problem as well as some connections with the compressed sensing literature for estimating parameters. We also present completely data driven methods to estimate parameters in the MIO formulations that improve their computational performance.

2.1 Brief Background on MIO

The general form of a Mixed Integer Quadratic Optimization (MIQO) problem is as follows:

$$\begin{array}{ll}\min & \boldsymbol{\alpha}^{T}\mathbf{Q}\boldsymbol{\alpha}+\boldsymbol{\alpha}^{T}\mathbf{a}\\ \mathrm{s.t.} & \mathbf{A}\boldsymbol{\alpha}\leq\mathbf{b}\\ & \alpha_{i}\in\{0,1\},\quad\forall i\in{\mathcal{I}}\\ & \alpha_{j}\in\mathbb{R}_{+},\quad\forall j\notin{\mathcal{I}},\end{array}$$

where $\mathbf{a}\in\mathbb{R}^{m},\mathbf{A}\in\mathbb{R}^{k\times m},\mathbf{b}\in\mathbb{R}^{k}$ and $\mathbf{Q}\in\mathbb{R}^{m\times m}$ (positive semidefinite) are the given parameters of the problem; $\mathbb{R}_{+}$ denotes the non-negative reals, the symbol $\leq$ denotes element-wise inequalities, and we optimize over $\boldsymbol{\alpha}\in\mathbb{R}^{m}$ containing both discrete ($\alpha_{i},i\in{\mathcal{I}}$) and continuous ($\alpha_{i},i\notin{\mathcal{I}}$) variables, with ${\mathcal{I}}\subset\{1,\ldots,m\}$. For background on MIO see [4]. Subclasses of MIQO problems include convex quadratic optimization problems (${\mathcal{I}}=\emptyset$), mixed integer ($\mathbf{Q}=\mathbf{0}_{m\times m}$) and linear optimization problems (${\mathcal{I}}=\emptyset,\ \mathbf{Q}=\mathbf{0}_{m\times m}$). Modern integer optimization solvers such as Gurobi and Cplex are able to tackle MIQO problems.

In the last twenty-five years (1991-2014) the computational power of MIO solvers has increased at an astonishing rate. In [7], to measure the speedup of MIO solvers, the same set of MIO problems were tested on the same computers using twelve consecutive versions of Cplex and version-on-version speedups were reported. The versions tested ranged from Cplex 1.2, released in 1991 to Cplex 11, released in 2007. Each version released in these years produced a speed improvement on the previous version, leading to a total speedup factor of more than 29,000 between the first and last version tested (see [7], [42] for details). Gurobi 1.0, a MIO solver which was first released in 2009, was measured to have similar performance to Cplex 11. Version-on-version speed comparisons of successive Gurobi releases have shown a speedup factor of more than 20 between Gurobi 5.5, released in 2013, and Gurobi 1.0 ([7], [42]). The combined machine-independent speedup factor in MIO solvers between 1991 and 2013 is 580,000. This impressive speedup factor is due to incorporating both theoretical and practical advances into MIO solvers. Cutting plane theory, disjunctive programming for branching rules, improved heuristic methods, techniques for preprocessing MIOs, using linear optimization as a black box to be called by MIO solvers, and improved linear optimization methods have all contributed greatly to the speed improvements in MIO solvers [7].

In addition, the past twenty years have also brought dramatic improvements in hardware. Figure 1 shows the exponentially increasing speed of supercomputers over the past twenty years, measured in billion floating point operations per second [1]. The hardware speedup from 1993 to 2013 is approximately $10^{5.5}\sim 320{,}000$. When both hardware and software improvements are considered, the overall speedup is approximately 200 billion! Note that the speedup factors cited here refer to mixed integer linear optimization problems, not MIQO problems. The speedup factors for MIQO problems are similar. MIO solvers provide both feasible solutions as well as lower bounds to the optimal value. As the MIO solver progresses towards the optimal solution, the lower bounds improve and provide an increasingly better guarantee of suboptimality, which is especially useful if the MIO solver is stopped before reaching the global optimum. In contrast, heuristic methods do not provide such a certificate of suboptimality.

Figure 1: Log of Peak Supercomputer Speed from 1993–2013.

The belief that MIO approaches to problems in statistics are not practically relevant was formed in the 1970s and 1980s, and at the time it was justified. Given the astonishing speedup of MIO solvers and computer hardware in the last twenty-five years, the mindset of MIO as theoretically elegant but practically irrelevant is no longer justified. In this paper, we provide empirical evidence of this fact in the context of the best subset selection problem.

2.2 MIO Formulations for the Best Subset Selection Problem

We first present a simple reformulation of Problem (1) as a MIO (in fact a MIQO) problem:

$$\begin{array}{ll}Z_{1}=\min\limits_{\boldsymbol{\beta},\mathbf{z}} & \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\\ \mathrm{s.t.} & -{\mathcal{M}}_{U}z_{i}\leq\beta_{i}\leq{\mathcal{M}}_{U}z_{i},\ i=1,\ldots,p\\ & z_{i}\in\{0,1\},\ i=1,\ldots,p\\ & \sum\limits_{i=1}^{p}z_{i}\leq k,\end{array} \qquad (5)$$

where $\mathbf{z}\in\{0,1\}^{p}$ is a binary variable and ${\mathcal{M}}_{U}$ is a constant such that if $\widehat{\boldsymbol{\beta}}$ is a minimizer of Problem (5), then ${\mathcal{M}}_{U}\geq\|\widehat{\boldsymbol{\beta}}\|_{\infty}$. If $z_{i}=1$, then $|\beta_{i}|\leq{\mathcal{M}}_{U}$, and if $z_{i}=0$, then $\beta_{i}=0$. Thus, $\sum_{i=1}^{p}z_{i}$ is an indicator of the number of non-zeros in $\boldsymbol{\beta}$.

Provided that ${\mathcal{M}}_{U}$ is chosen to be sufficiently large with ${\mathcal{M}}_{U}\geq\|\widehat{\boldsymbol{\beta}}\|_{\infty}$, a solution to Problem (5) will be a solution to Problem (1). Of course, ${\mathcal{M}}_{U}$ is not known a priori, and a small value of ${\mathcal{M}}_{U}$ may lead to a solution different from (1). The choice of ${\mathcal{M}}_{U}$ affects the strength of the formulation and is critical for obtaining good lower bounds in practice. In Section 2.3 we describe how to find appropriate values for ${\mathcal{M}}_{U}$. Note that there are other MIO formulations presented herein (see Problem (8)) that do not rely on an a priori specification of ${\mathcal{M}}_{U}$. However, we will stick to formulation (5) for the time being, since it provides some interesting connections to the Lasso.

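To illustrate how formulation (5) might be handed to an off-the-shelf solver, the sketch below builds the big-$M$ model in Gurobi's Python interface (gurobipy); the function and variable names are ours, and in practice ${\mathcal{M}}_{U}$ would be set using the data-driven bounds of Section 2.3. The sketch assumes a working Gurobi installation.

import gurobipy as gp
from gurobipy import GRB

def best_subset_big_M(X, y, k, M_U):
    """Formulation (5): big-M MIO model for best subset selection."""
    n, p = X.shape
    m = gp.Model("best-subset-(5)")
    beta = m.addVars(p, lb=-M_U, ub=M_U, name="beta")
    z = m.addVars(p, vtype=GRB.BINARY, name="z")
    m.addConstrs(beta[i] <= M_U * z[i] for i in range(p))    # z_i = 0  =>  beta_i = 0
    m.addConstrs(beta[i] >= -M_U * z[i] for i in range(p))
    m.addConstr(gp.quicksum(z[i] for i in range(p)) <= k)    # cardinality constraint
    # objective 0.5 * ||y - X beta||^2 written out as a quadratic expression
    resid = [y[j] - gp.quicksum(float(X[j, i]) * beta[i] for i in range(p)) for j in range(n)]
    m.setObjective(0.5 * gp.quicksum(r * r for r in resid), GRB.MINIMIZE)
    m.optimize()
    return [beta[i].X for i in range(p)]

Stopping the solver early (for instance with a time limit) still returns the incumbent solution together with a lower bound, i.e., a guarantee on its suboptimality.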
Formulation (5) leads to interesting insights, especially via the structure of the convex hull of its constraints, as illustrated next:

$$\text{Conv}\left(\left\{\boldsymbol{\beta}:|\beta_{i}|\leq{\mathcal{M}}_{U}z_{i},\ z_{i}\in\{0,1\},\ i=1,\ldots,p,\ \sum_{i=1}^{p}z_{i}\leq k\right\}\right)=\left\{\boldsymbol{\beta}:\|\boldsymbol{\beta}\|_{\infty}\leq{\mathcal{M}}_{U},\ \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{U}k\right\}\subseteq\left\{\boldsymbol{\beta}:\|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{U}k\right\}.$$

Thus, the minimum of Problem (5) is lower-bounded by the optimum objective value of both the following convex optimization problems:

$$Z_{2}:=\min_{\boldsymbol{\beta}}\;\;\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\;\;\;\mathrm{subject\;to}\;\;\;\|\boldsymbol{\beta}\|_{\infty}\leq{\mathcal{M}}_{U},\ \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{U}k \qquad (6)$$
$$Z_{3}:=\min_{\boldsymbol{\beta}}\;\;\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\;\;\;\mathrm{subject\;to}\;\;\;\|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{U}k, \qquad (7)$$

where (7) is the familiar Lasso in constrained form. This is a weaker relaxation than formulation (6), which in addition to the $\ell_{1}$ constraint on $\boldsymbol{\beta}$, has box constraints controlling the values of the $\beta_{i}$'s. It is easy to see that the following ordering exists: $Z_{3}\leq Z_{2}\leq Z_{1}$, with the inequalities being strict in most instances.

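To see the ordering $Z_{3}\leq Z_{2}\leq Z_{1}$ numerically, the two relaxations (6) and (7) can be solved with any convex solver; the following is an illustrative sketch of ours using CVXPY, under the assumption that CVXPY and a QP-capable backend are installed.

import cvxpy as cp

def relaxation_bounds(X, y, k, M_U):
    """Solve the convex relaxations (6) and (7); both lower-bound the optimum Z_1 of (5)."""
    p = X.shape[1]
    beta = cp.Variable(p)
    loss = 0.5 * cp.sum_squares(y - X @ beta)
    # (6): l1-ball intersected with a box
    Z2 = cp.Problem(cp.Minimize(loss),
                    [cp.norm(beta, 1) <= M_U * k, cp.norm(beta, "inf") <= M_U]).solve()
    # (7): constrained-form Lasso, a weaker relaxation
    Z3 = cp.Problem(cp.Minimize(loss), [cp.norm(beta, 1) <= M_U * k]).solve()
    return Z2, Z3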
In terms of approximating the optimal solution to Problem (5), the MIO solver begins by first solving a continuous relaxation of Problem (5). The Lasso formulation (7) is weaker than this root node relaxation. Additionally, MIO is typically able to significantly improve the quality of the root node solution as the MIO solver progresses toward the optimal solution.

To motivate the reader we provide an example of the evolution (see Figure 2) of the MIO formulation (8) for the Diabetes dataset [23], with $n=350, p=64$ (for further details on the dataset see Section 5).

Since formulation (5) is sensitive to the choice of ${\mathcal{M}}_{U}$, we consider an alternative MIO formulation based on Specially Ordered Sets [4], as described next.

Formulations via Specially Ordered Sets

Any feasible solution to formulation (5) will have $(1-z_{i})\beta_{i}=0$ for every $i\in\{1,\ldots,p\}$. This constraint can be modeled via integer optimization using Specially Ordered Sets of Type 1 [4] (SOS-1). In an SOS-1 constraint, at most one variable in the set can take a nonzero value, that is

$$(1-z_{i})\beta_{i}=0\;\;\iff\;\;(\beta_{i},1-z_{i}):\text{SOS-1},$$

for every $i=1,\ldots,p$. This leads to the following formulation of (1):

$$\begin{array}{ll}\min\limits_{\boldsymbol{\beta},\mathbf{z}} & \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\\ \mathrm{s.t.} & (\beta_{i},1-z_{i}):\text{SOS-1},\ \ i=1,\ldots,p\\ & z_{i}\in\{0,1\},\ i=1,\ldots,p\\ & \sum\limits_{i=1}^{p}z_{i}\leq k.\end{array} \qquad (8)$$

We note that Problem (8) can in principle be used to obtain global solutions to Problem (1) — Problem (8), unlike Problem (5), does not require any specification of the parameter ${\mathcal{M}}_{U}$.

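Solvers such as Gurobi expose SOS-1 constraints directly. The fragment below sketches how the constraints of Problem (8) could be declared in gurobipy; it is our illustration, and it introduces an auxiliary variable $w_{i}=1-z_{i}$ because SOS members must be model variables rather than expressions.

import gurobipy as gp
from gurobipy import GRB

def add_sos1_cardinality(m, p, k):
    """Declare the constraints of Problem (8) on an existing Gurobi model m; returns (beta, z)."""
    beta = m.addVars(p, lb=-GRB.INFINITY, name="beta")
    z = m.addVars(p, vtype=GRB.BINARY, name="z")
    w = m.addVars(p, lb=0.0, ub=1.0, name="w")            # w_i plays the role of 1 - z_i
    m.addConstrs(w[i] == 1 - z[i] for i in range(p))
    for i in range(p):
        # SOS-1: at most one of (beta_i, 1 - z_i) can be nonzero, so z_i = 0 forces beta_i = 0
        m.addSOS(GRB.SOS_TYPE1, [beta[i], w[i]], [1, 2])
    m.addConstr(gp.quicksum(z[i] for i in range(p)) <= k)
    return beta, z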
Figure 2: The typical evolution of the MIO formulation (8) for the diabetes dataset with $n=350, p=64$ and $k=6$ (left panel) and $k=7$ (right panel). The top panel shows the evolution of upper and lower bounds with time (in seconds); the lower panel shows the evolution of the corresponding MIO gap with time. Optimal solutions for both problems are found in a few seconds in both examples, but it takes 10-20 minutes to certify optimality via the lower bounds. Note that the time taken for the MIO to certify convergence to the global optimum increases with increasing $k$.

We now proceed to present a more structured representation of Problem (8). Note that the objective in this problem is a convex quadratic function in the continuous variable $\boldsymbol{\beta}$, which can be formulated explicitly as:

$$\begin{array}{ll}\min\limits_{\boldsymbol{\beta},\mathbf{z}} & \frac{1}{2}\boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{X}\boldsymbol{\beta}-\langle\mathbf{X}^{\prime}\mathbf{y},\boldsymbol{\beta}\rangle+\frac{1}{2}\|\mathbf{y}\|_{2}^{2}\\ \mathrm{s.t.} & (\beta_{i},1-z_{i}):\text{SOS-1},\ \ i=1,\ldots,p\\ & z_{i}\in\{0,1\},\ i=1,\ldots,p\\ & \sum\limits_{i=1}^{p}z_{i}\leq k\\ & -{\mathcal{M}}_{U}\leq\beta_{i}\leq{\mathcal{M}}_{U},\ i=1,\ldots,p\\ & \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{\ell}.\end{array} \qquad (9)$$

We also provide problem-dependent constants ${\mathcal{M}}_{U}$ and ${\mathcal{M}}_{\ell}\in[0,\infty]$. ${\mathcal{M}}_{U}$ provides an upper bound on the absolute value of the regression coefficients and ${\mathcal{M}}_{\ell}$ provides an upper bound on the $\ell_{1}$-norm of $\boldsymbol{\beta}$. Adding these bounds typically leads to improved performance of the MIO, especially in delivering lower bound certificates. In Section 2.3, we describe several approaches to compute these parameters from the data.

We also consider another formulation for (9):

$$\begin{array}{ll}\min\limits_{\boldsymbol{\beta},\mathbf{z},\boldsymbol{\zeta}} & \frac{1}{2}\boldsymbol{\zeta}^{T}\boldsymbol{\zeta}-\langle\mathbf{X}^{\prime}\mathbf{y},\boldsymbol{\beta}\rangle+\frac{1}{2}\|\mathbf{y}\|_{2}^{2}\\ \mathrm{s.t.} & \boldsymbol{\zeta}=\mathbf{X}\boldsymbol{\beta}\\ & (\beta_{i},1-z_{i}):\text{SOS-1},\ \ i=1,\ldots,p\\ & z_{i}\in\{0,1\},\ i=1,\ldots,p\\ & \sum\limits_{i=1}^{p}z_{i}\leq k\\ & -{\mathcal{M}}_{U}\leq\beta_{i}\leq{\mathcal{M}}_{U},\ i=1,\ldots,p\\ & \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{\ell}\\ & -{\mathcal{M}}^{\zeta}_{U}\leq\zeta_{i}\leq{\mathcal{M}}^{\zeta}_{U},\ i=1,\ldots,n\\ & \|\boldsymbol{\zeta}\|_{1}\leq{\mathcal{M}}^{\zeta}_{\ell},\end{array} \qquad (10)$$

where the optimization variables are $\boldsymbol{\beta}\in\mathbb{R}^{p},\boldsymbol{\zeta}\in\mathbb{R}^{n},\mathbf{z}\in\{0,1\}^{p}$ and ${\mathcal{M}}_{U},{\mathcal{M}}_{\ell},{\mathcal{M}}^{\zeta}_{U},{\mathcal{M}}^{\zeta}_{\ell}\in[0,\infty]$ are problem specific parameters. Note that the objective function in formulation (10) involves a quadratic form in $n$ variables and a linear function in $p$ variables. Problem (10) is equivalent to the following variant of the best subset problem:

$$\begin{array}{ll}\min\limits_{\boldsymbol{\beta}} & \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\\ \mathrm{s.t.} & \|\boldsymbol{\beta}\|_{0}\leq k\\ & \|\boldsymbol{\beta}\|_{\infty}\leq{\mathcal{M}}_{U},\ \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{\ell}\\ & \|\mathbf{X}\boldsymbol{\beta}\|_{\infty}\leq{\mathcal{M}}^{\zeta}_{U},\ \|\mathbf{X}\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}^{\zeta}_{\ell}.\end{array} \qquad (11)$$

Formulations (9) and (10) differ in the size of the quadratic forms that are involved. The current state-of-the-art MIO solvers are better-equipped to handle mixed integer linear optimization problems than MIQO problems. Formulation (9) has fewer variables but a quadratic form in $p$ variables—we find this formulation more useful in the $n>p$ regime, with $p$ in the 100s. Formulation (10) on the other hand has more variables, but involves a quadratic form in $n$ variables—this formulation is more useful for high-dimensional problems $p\gg n$, with $n$ in the 100s and $p$ in the 1000s.

As we said earlier, the bounds on $\boldsymbol{\beta}$ and $\boldsymbol{\zeta}$ are not required, but if these constraints are provided, they improve the strength of the MIO formulation. In other words, formulations with tightly specified bounds provide better lower bounds to the global optimization problem in a specified amount of time, when compared to a MIO formulation with loose bound specifications. We next show how these bounds can be computed from given data.

2.3 Specification of Parameters

In this section, we obtain estimates for the quantities ${\mathcal{M}}_{U},{\mathcal{M}}_{\ell},{\mathcal{M}}^{\zeta}_{U},{\mathcal{M}}^{\zeta}_{\ell}$ such that an optimal solution to Problem (11) is also an optimal solution to Problem (1), and vice-versa.

Coherence and Restricted Eigenvalues of a Model Matrix

Given a model matrix $\mathbf{X}$, [51] introduced the cumulative coherence function

$$\mu[k]:=\max_{|I|=k}\ \max_{j\notin I}\ \sum_{i\in I}|\langle\mathbf{X}_{j},\mathbf{X}_{i}\rangle|,$$

where $\mathbf{X}_{j}$, $j=1,\ldots,p$ represent the columns of $\mathbf{X}$, i.e., features.

For $k=1$, we obtain the notion of coherence introduced in [22, 21] as a measure of the maximal pairwise correlation in absolute value of the columns of $\mathbf{X}$: $\mu:=\mu[1]=\max_{i\neq j}|\langle\mathbf{X}_{i},\mathbf{X}_{j}\rangle|.$

[16, 14] (see also [11] and references therein) introduced the notion that a matrix $\mathbf{X}$ satisfies a restricted eigenvalue condition if

$$\lambda_{\min}(\mathbf{X}_{I}^{\prime}\mathbf{X}_{I})\geq\eta_{k}\quad\text{for every }I\subset\{1,\ldots,p\}:\ |I|\leq k, \qquad (12)$$

where $\lambda_{\min}(\mathbf{X}_{I}^{\prime}\mathbf{X}_{I})$ denotes the smallest eigenvalue of the matrix $\mathbf{X}_{I}^{\prime}\mathbf{X}_{I}$. An inequality linking $\mu[k]$ and $\eta_{k}$ is as follows.

Proposition 1.

The following bounds hold:

  (a) [51]: $\mu[k]\leq\mu\cdot k$.

  (b) [21]: $\eta_{k}\geq 1-\mu[k-1]\geq 1-\mu\cdot(k-1)$.

The computations of $\mu[k]$ and $\eta_{k}$ for general $k$ are difficult, while $\mu$ is simple to compute. Proposition 1 provides bounds for $\mu[k]$ and $\eta_{k}$ in terms of the coherence $\mu$.

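For intuition, these quantities can be computed or bounded directly from a standardized $\mathbf{X}$. The sketch below (our code, not from the paper) computes the coherence $\mu$, the cumulative coherence $\mu[k]$, the Proposition 1 bounds, and, for small $p$ only, the restricted eigenvalue $\eta_{k}$ by enumerating size-$k$ subsets.

import itertools
import numpy as np

def coherence_quantities(X, k):
    """Coherence mu, cumulative coherence mu[k], restricted eigenvalue eta_k, and their bounds."""
    G = X.T @ X                                      # Gram matrix; columns assumed unit l2-norm
    p = G.shape[0]
    off_diag = np.abs(G - np.diag(np.diag(G)))
    mu = off_diag.max()                              # mu = mu[1]
    mu_k = max(np.sort(off_diag[:, j])[-k:].sum() for j in range(p))   # exact mu[k]
    mu_k_bound = mu * k                              # Proposition 1(a)
    eta_k_bound = 1.0 - mu * (k - 1)                 # Proposition 1(b); may be non-positive
    # exact eta_k by enumeration of size-k subsets -- only feasible for small p
    eta_k = min(np.linalg.eigvalsh(G[np.ix_(I, I)])[0]
                for I in itertools.combinations(range(p), k))
    return mu, mu_k, mu_k_bound, eta_k, eta_k_bound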
Operator Norms of Submatrices

The $(p,q)$ operator norm of matrix $\mathbf{A}$ is

$$\|\mathbf{A}\|_{p,q}:=\max_{\|\mathbf{u}\|_{q}=1}\|\mathbf{A}\mathbf{u}\|_{p}.$$

We will use extensively here the $(1,1)$ operator norm. We assume that each column vector of $\mathbf{X}$ has unit $\ell_{2}$-norm. The results derived in the next proposition borrow and enhance techniques developed by [51] in the context of analyzing the $\ell_{1}$–$\ell_{0}$ equivalence in compressed sensing.

Proposition 2.

For any $I\subset\{1,\ldots,p\}$ with $|I|=k$ we have:

  (a) $\|\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}-\mathbf{I}\|_{1,1}\leq\mu[k-1]$.

  (b) If the matrix $\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}$ is invertible and $\|\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}-\mathbf{I}\|_{1,1}<1$, then $\|(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\|_{1,1}\leq\frac{1}{1-\mu[k-1]}$.
Proof.

See Section A.3. ∎



We note that Part (b) also appears in [51] for the operator norm $\|(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\|_{\infty,\infty}$.

Given a set $I\subset\{1,\ldots,p\}$ with $|I|=k$, we let $\widehat{\boldsymbol{\beta}}_{I}$ denote the least squares regression coefficients obtained by regressing $\mathbf{y}$ on $\mathbf{X}_{I}$, i.e., $\widehat{\boldsymbol{\beta}}_{I}=(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}\mathbf{y}$. If we append $\widehat{\boldsymbol{\beta}}_{I}$ with zeros in the remaining coordinates we obtain $\widehat{\boldsymbol{\beta}}$ as follows: $\widehat{\boldsymbol{\beta}}\in\operatorname*{arg\,min}_{\boldsymbol{\beta}:\,\beta_{i}=0,\,i\notin I}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}$. Note that $\widehat{\boldsymbol{\beta}}$ depends on $I$, but we will suppress the dependence on $I$ for notational convenience.

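In code, this zero-padded restricted least squares solution takes a single least squares solve; the small helper below is our notation for it, used purely for illustration.

import numpy as np

def restricted_least_squares(X, y, support):
    """Regress y on the columns of X indexed by `support` and pad the rest with zeros."""
    beta = np.zeros(X.shape[1])
    cols = np.asarray(support)
    beta_I, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)   # (X_I' X_I)^{-1} X_I' y
    beta[cols] = beta_I
    return beta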
2.3.1 Specification of Parameters in terms of Coherence and Restricted Strong Convexity

Recall that $\mathbf{X}_{j}$, $j=1,\ldots,p$ represent the columns of $\mathbf{X}$; and we will use $\mathbf{x}_{i}$, $i=1,\ldots,n$ to denote the rows of $\mathbf{X}$. As discussed above, $\|\mathbf{X}_{j}\|=1$. We order the correlations $|\langle\mathbf{X}_{j},\mathbf{y}\rangle|$:

$$|\langle\mathbf{X}_{(1)},\mathbf{y}\rangle|\geq|\langle\mathbf{X}_{(2)},\mathbf{y}\rangle|\geq\ldots\geq|\langle\mathbf{X}_{(p)},\mathbf{y}\rangle|. \qquad (13)$$

We finally denote by 𝐱i1:ksubscriptnormsubscript𝐱𝑖:1𝑘\|\mathbf{x}_{i}\|_{1:k} the sum of the top k𝑘k absolute values of the entries of xij,j{1,2,,p}subscript𝑥𝑖𝑗𝑗12𝑝x_{ij},j\in\{1,2,\ldots,p\}.

Theorem 2.1.

For any k1𝑘1k\geq 1 such that μ[k1]<1𝜇delimited-[]𝑘11\mu[k-1]<1 any optimal solution 𝛃^^𝛃\widehat{\boldsymbol{\beta}} to (1) satisfies:

(a)   \|\widehat{\boldsymbol{\beta}}\|_{1} \;\leq\; \frac{1}{1-\mu[k-1]} \sum_{j=1}^{k} |\langle\mathbf{X}_{(j)},\mathbf{y}\rangle|.    (14)
(b)   \|\widehat{\boldsymbol{\beta}}\|_{\infty} \;\leq\; \min\left\{ \frac{1}{\eta_{k}} \sqrt{\sum_{j=1}^{k} |\langle\mathbf{X}_{(j)},\mathbf{y}\rangle|^{2}}, \; \frac{1}{\sqrt{\eta_{k}}} \|\mathbf{y}\|_{2} \right\}.    (15)
(c)   \|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{1} \;\leq\; \min\left\{ \sum_{i=1}^{n} \|\mathbf{x}_{i}\|_{\infty} \|\widehat{\boldsymbol{\beta}}\|_{1}, \; \sqrt{k}\,\|\mathbf{y}\|_{2} \right\}.    (16)
(d)   \|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{\infty} \;\leq\; \left( \max_{i=1,\ldots,n} \|\mathbf{x}_{i}\|_{1:k} \right) \|\widehat{\boldsymbol{\beta}}\|_{\infty}.    (17)

Proof.

For proof see Section A.4. ∎



We note that in the above theorem, the upper bound in Part (a) becomes infinite as soon as μ[k1]1𝜇delimited-[]𝑘11\mu[k-1]\geq 1. In such a case, we can use purely data-driven bounds by using convex optimization techniques, as described in Section 2.3.2.

The interesting message conveyed by Theorem 2.1 is that the upper bounds on 𝜷^1,subscriptnorm^𝜷1\|\widehat{\boldsymbol{\beta}}\|_{1}, 𝜷^subscriptnorm^𝜷\|\widehat{\boldsymbol{\beta}}\|_{\infty}, 𝐗𝜷^1subscriptnorm𝐗^𝜷1\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{1} and 𝐗𝜷^subscriptnorm𝐗^𝜷\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{\infty}, corresponding to the Problem (11) can all be obtained in terms of ηksubscript𝜂𝑘\eta_{k} and μ[k1]𝜇delimited-[]𝑘1\mu[k-1], quantities of fundamental interest appearing in the analysis of 1subscript1\ell_{1} regularization methods and understanding how close they are to 0subscript0\ell_{0} solutions [51, 22, 21, 16, 14]. On a different note, Theorem 2.1 arises from a purely computational motivation and quite curiously, involves the same quantities: cumulative coherence and restricted eigenvalues.

Note that the quantities μ[k1],ηk𝜇delimited-[]𝑘1subscript𝜂𝑘\mu[k-1],\eta_{k} are difficult to compute exactly, but they can be approximated by Proposition 1 which provides bounds commonly used in the compressed sensing literature. Of course, approximations to these quantities can also be obtained by using subsampling schemes.
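To make this concrete, the sketch below (Python/numpy, assuming the columns of X have unit ℓ2 norm as in the text) computes the cumulative coherence μ[k−1] exactly and then evaluates the bounds (14)–(15). Since the restricted eigenvalue η_k is combinatorial to compute, the sketch substitutes the coherence-based lower bound 1−μ[k−1] for it, in the spirit of Proposition 1; the function names are illustrative.

```python
import numpy as np

def cumulative_coherence(X, k):
    """mu[k]: largest sum, over columns i, of the k biggest |<X_i, X_j>| with j != i.
    Assumes the columns of X have unit l2 norm."""
    G = np.abs(X.T @ X)
    np.fill_diagonal(G, 0.0)              # exclude the j = i term
    topk = -np.sort(-G, axis=1)[:, :k]    # k largest off-diagonal entries per row
    return topk.sum(axis=1).max()

def theorem_2_1_bounds(X, y, k):
    """Evaluate the coherence-based bounds (14)-(15). The restricted eigenvalue
    eta_k is replaced by the lower bound 1 - mu[k-1] (an assumption)."""
    mu_km1 = cumulative_coherence(X, k - 1)
    if mu_km1 >= 1.0:
        return np.inf, np.inf             # bound (14) becomes vacuous
    eta_k = 1.0 - mu_km1
    corr = np.sort(np.abs(X.T @ y))[::-1][:k]              # top-k |<X_(j), y>|
    bound_l1 = corr.sum() / (1.0 - mu_km1)                  # eq. (14)
    bound_linf = min(np.sqrt((corr ** 2).sum()) / eta_k,
                     np.linalg.norm(y) / np.sqrt(eta_k))    # eq. (15)
    return bound_l1, bound_linf
```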

2.3.2 Specification of Parameters via Convex Quadratic Optimization

We provide an alternative purely data-driven way to compute the upper bounds to the parameters by solving several simple convex quadratic optimization problems.

Bounds on β̂_i's

For the case n>p𝑛𝑝n>p, upper and lower bounds on β^isubscript^𝛽𝑖\hat{\beta}_{i} can be obtained by solving the following pair of convex optimization problems:

u^{+}_{i} := \max_{\boldsymbol{\beta}} \; \beta_{i} \;\; \text{s.t.} \;\; \tfrac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^{2}_{2} \leq \text{UB}, \qquad\qquad u^{-}_{i} := \min_{\boldsymbol{\beta}} \; \beta_{i} \;\; \text{s.t.} \;\; \tfrac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^{2}_{2} \leq \text{UB},    (18)

for i=1,,p𝑖1𝑝i=1,\ldots,p. Above, UB is an upper bound to the minimum of the k𝑘k-subset least squares problem (1). ui+superscriptsubscript𝑢𝑖u_{i}^{+} is an upper bound to β^isubscript^𝛽𝑖\hat{\beta}_{i}, since the cardinality constraint 𝜷0ksubscriptnorm𝜷0𝑘\|\boldsymbol{\beta}\|_{0}\leq k does not appear in the optimization problem. Similarly, uisubscriptsuperscript𝑢𝑖u^{-}_{i} is a lower bound to β^isubscript^𝛽𝑖\hat{\beta}_{i}. The quantity Ui=max{|ui+|,|ui|}subscriptsuperscript𝑖𝑈subscriptsuperscript𝑢𝑖subscriptsuperscript𝑢𝑖{\mathcal{M}}^{i}_{U}=\max\{|u^{+}_{i}|,|u^{-}_{i}|\} serves as an upper bound to |β^i|subscript^𝛽𝑖|\hat{\beta}_{i}|. A reasonable choice for UB is obtained by using the discrete first order methods (Algorithms 1 and 2 as described in Section 3) in combination with the MIO formulation (8) (for a predefined amount of time). Having obtained Uisubscriptsuperscript𝑖𝑈{\mathcal{M}}^{i}_{U} as described above, we can obtain an upper bound to 𝜷^subscriptnorm^𝜷\|\widehat{\boldsymbol{\beta}}\|_{\infty} and 𝜷^1subscriptnorm^𝜷1\|\widehat{\boldsymbol{\beta}}\|_{1} as follows: U=maxiUisubscript𝑈subscript𝑖subscriptsuperscript𝑖𝑈{\mathcal{M}}_{U}=\max_{i}{\mathcal{M}}^{i}_{U} and 𝜷^1i=1kU(i)subscriptnorm^𝜷1superscriptsubscript𝑖1𝑘subscriptsuperscript𝑖𝑈\|\widehat{\boldsymbol{\beta}}\|_{1}\leq\sum_{i=1}^{k}{\mathcal{M}}^{(i)}_{U} where, U(1)U(2)U(p)subscriptsuperscript1𝑈subscriptsuperscript2𝑈subscriptsuperscript𝑝𝑈{\mathcal{M}}^{(1)}_{U}\geq{\mathcal{M}}^{(2)}_{U}\geq\ldots\geq{\mathcal{M}}^{(p)}_{U}.

Similarly, bounds corresponding to Parts (c) and (d) in Theorem 2.1 can be obtained by using the upper bounds on 𝜷^,𝜷^1subscriptnorm^𝜷subscriptnorm^𝜷1\|\widehat{\boldsymbol{\beta}}\|_{\infty},\|\widehat{\boldsymbol{\beta}}\|_{1} as described above.

Note that the quantities ui+superscriptsubscript𝑢𝑖u_{i}^{+} and uisuperscriptsubscript𝑢𝑖u_{i}^{-} are finite when the level sets of the least squares loss function are finite. In particular, the bounds are loose when p>n𝑝𝑛p>n. In the following we describe methods to obtain non-trivial bounds on 𝐱i,𝜷subscript𝐱𝑖𝜷\langle\mathbf{x}_{i},\boldsymbol{\beta}\rangle, for i=1,,n𝑖1𝑛i=1,\ldots,n that apply for arbitrary n,p𝑛𝑝n,p.

Bounds on ⟨x_i, β̂⟩'s

We now provide a generic method to obtain upper and lower bounds on the quantities 𝐱i,𝜷^subscript𝐱𝑖^𝜷\langle\mathbf{x}_{i},\widehat{\boldsymbol{\beta}}\rangle:

v^{+}_{i} := \max_{\boldsymbol{\beta}} \; \langle\mathbf{x}_{i},\boldsymbol{\beta}\rangle \;\; \text{s.t.} \;\; \tfrac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^{2}_{2} \leq \text{UB}, \qquad\qquad v^{-}_{i} := \min_{\boldsymbol{\beta}} \; \langle\mathbf{x}_{i},\boldsymbol{\beta}\rangle \;\; \text{s.t.} \;\; \tfrac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^{2}_{2} \leq \text{UB},    (19)

for i=1,,n𝑖1𝑛i=1,\ldots,n. Note that the bounds obtained from (19) are non-trivial bounds for both the under-determined n<p𝑛𝑝n<p and overdetermined cases. The bounds obtained from (19) are upper and lower bounds since we drop the cardinality constraint on 𝜷𝜷\boldsymbol{\beta}. The bounds are finite since for every i{1,,n}𝑖1𝑛i\in\{1,\ldots,n\} the quantity 𝐱i,𝜷subscript𝐱𝑖𝜷\langle\mathbf{x}_{i},\boldsymbol{\beta}\rangle remains bounded in the feasible set for Problems (19).

The quantity  vi=max{|vi+|,|vi|}subscript v𝑖subscriptsuperscript𝑣𝑖subscriptsuperscript𝑣𝑖{\mbox{ v}}_{i}=\max\{|v^{+}_{i}|,|v^{-}_{i}|\} serves as an upper bound to |𝐱i,𝜷|subscript𝐱𝑖𝜷|\langle\mathbf{x}_{i},\boldsymbol{\beta}\rangle|. In particular, this leads to simple upper bounds on 𝐗𝜷^maxi visubscriptnorm𝐗^𝜷subscript𝑖subscript v𝑖\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{\infty}\leq\max_{i}{\mbox{ v}}_{i} and 𝐗𝜷^1i visubscriptnorm𝐗^𝜷1subscript𝑖subscript v𝑖\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{1}\leq\sum_{i}{\mbox{ v}}_{i} and can be thought of completely data-driven methods to estimate bounds appearing in (16) and (17).

We note that Problems (18) and (19) have nice structure amenable to efficient computation as we discuss in Section A.1.
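As an illustration, the following sketch sets up Problems (18) and (19) with the cvxpy modeling package (an assumption; any convex QP solver would do). Here UB is an upper bound on the optimal value of Problem (1), e.g. obtained from the discrete first order methods of Section 3. In practice one would exploit the structure discussed in Section A.1 rather than solve 2(n+p) generic problems in a loop.

```python
import numpy as np
import cvxpy as cp

def coefficient_bounds(X, y, UB):
    """Problems (18): for each i, maximize/minimize beta_i subject to
    0.5*||y - X beta||_2^2 <= UB. Returns M_U^i = max(|u_i^+|, |u_i^-|)."""
    p = X.shape[1]
    beta = cp.Variable(p)
    constraints = [0.5 * cp.sum_squares(y - X @ beta) <= UB]
    M = np.zeros(p)
    for i in range(p):
        u_plus = cp.Problem(cp.Maximize(beta[i]), constraints).solve()
        u_minus = cp.Problem(cp.Minimize(beta[i]), constraints).solve()
        M[i] = max(abs(u_plus), abs(u_minus))
    return M

def fitted_value_bounds(X, y, UB):
    """Problems (19): bound |<x_i, beta>| over the same level set; these bounds
    remain finite and non-trivial even when p > n."""
    n, p = X.shape
    beta = cp.Variable(p)
    constraints = [0.5 * cp.sum_squares(y - X @ beta) <= UB]
    v = np.zeros(n)
    for i in range(n):
        v_plus = cp.Problem(cp.Maximize(X[i] @ beta), constraints).solve()
        v_minus = cp.Problem(cp.Minimize(X[i] @ beta), constraints).solve()
        v[i] = max(abs(v_plus), abs(v_minus))
    return v
```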

2.3.3 Parameter Specifications from Advanced Warm-Starts

The methods described above in Sections 2.3.1 and 2.3.2 lead to provable bounds on the parameters: with these bounds, Problem (11) provides an optimal solution to Problem (1), and vice-versa. We now describe some other alternatives that lead to excellent parameter specifications in practice.

The discrete first order methods described in the following section 3 provide good upper bounds to Problem (1). These solutions when supplied as a warm-start to the MIO formulation (8) are often improved by MIO, thereby leading to high quality solutions to Problem (1) within several minutes. If 𝜷^hybsubscript^𝜷hyb\hat{\boldsymbol{\beta}}_{\text{hyb}} denotes an estimate obtained from this hybrid approach, then U:=τ𝜷^hybassignsubscript𝑈𝜏subscriptnormsubscript^𝜷hyb{\mathcal{M}}_{U}:=\tau\|\hat{\boldsymbol{\beta}}_{\text{hyb}}\|_{\infty} with τ𝜏\tau a multiplier greater than one (e.g., τ{1.1,1.5,2}𝜏1.11.52\tau\in\{1.1,1.5,2\}) provides a good estimate for the parameter Usubscript𝑈{\mathcal{M}}_{U}. A reasonable upper bound to 𝜷^1subscriptnorm^𝜷1\|\widehat{\boldsymbol{\beta}}\|_{1} is kU𝑘subscript𝑈k{\mathcal{M}}_{U}. Bounds on the other quantities: 𝐗𝜷^1,𝐗𝜷^subscriptnorm𝐗^𝜷1subscriptnorm𝐗^𝜷\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{1},\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{\infty} can be derived by using expressions appearing in Theorem 2.1, with aforementioned bounds on 𝜷^1subscriptnorm^𝜷1\|\widehat{\boldsymbol{\beta}}\|_{1} and 𝜷^subscriptnorm^𝜷\|\widehat{\boldsymbol{\beta}}\|_{\infty}.
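A minimal sketch of this warm-start based specification (Python/numpy; the default τ and the dictionary keys are illustrative, and beta_hyb denotes a solution from the hybrid discrete first order / MIO approach):

```python
import numpy as np

def warm_start_parameters(beta_hyb, X, y, k, tau=1.5):
    """Heuristic parameter specification of Section 2.3.3 from a warm start."""
    M_U = tau * np.max(np.abs(beta_hyb))                 # estimate of ||beta_hat||_inf
    bound_l1 = k * M_U                                   # estimate of ||beta_hat||_1
    row_inf = np.max(np.abs(X), axis=1)                  # ||x_i||_inf
    row_1k = np.sort(np.abs(X), axis=1)[:, -k:].sum(axis=1)   # ||x_i||_{1:k}
    bound_Xb_1 = min(row_inf.sum() * bound_l1,
                     np.sqrt(k) * np.linalg.norm(y))     # via eq. (16)
    bound_Xb_inf = row_1k.max() * M_U                    # via eq. (17)
    return {"M_U": M_U, "l1": bound_l1, "Xb_1": bound_Xb_1, "Xb_inf": bound_Xb_inf}
```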

2.3.4 Some Generalizations and Variants

Some variations and improvements of the procedures described above are presented in Section A.2 (appendix).

3 Discrete First Order Algorithms

In this section, we develop a discrete extension of first order methods in convex optimization [45, 44] to obtain near optimal solutions for Problem (1) and its variant for the least absolute deviation (LAD) loss function. Our approach applies to the problem of minimizing any smooth convex function subject to cardinality constraints.

We will use these discrete first order methods to obtain solutions to warm start the MIO formulation. In Section 5, we will demonstrate how these methods greatly enhance the performance of the MIO.

3.1 Finding stationary solutions for minimizing smooth convex functions with cardinality constraints

Related work and contributions

In the signal processing literature [8, 9] proposed iterative hard-thresholding algorithms, in the context of 0subscript0\ell_{0}-regularized least squares problems, i.e., Problem (4). The authors establish convergence properties of the algorithm under the assumption that 𝐗𝐗\mathbf{X} satisfies coherence [8] or Restricted Isometry Property [9]. The method we propose here applies to a larger class of cardinality constrained optimization problems of the form (20), in particular, in the context of Problem (1) our algorithm and its convergence analysis do not require any form of restricted isometry property on the model matrix 𝐗𝐗\mathbf{X}.

Our proposed algorithm borrows ideas from projected gradient descent methods in first order convex optimization problems [45] and generalizes them to the discrete optimization Problem (20). We also derive new global convergence results for our proposed algorithms, as presented in Theorem 3.1. Our proposal, with some novel modifications, also applies to the non-smooth least absolute deviation loss with cardinality constraints, as discussed in Section 3.3.

Consider the following optimization problem:

\min_{\boldsymbol{\beta}} \; g(\boldsymbol{\beta}) \;\;\; \mathrm{subject\ to} \;\;\; \|\boldsymbol{\beta}\|_{0} \leq k,    (20)

where g(𝜷)0𝑔𝜷0g(\boldsymbol{\beta})\geq 0 is convex and has Lipschitz continuous gradient:

\|\nabla g(\boldsymbol{\beta}) - \nabla g(\widetilde{\boldsymbol{\beta}})\| \leq \ell\,\|\boldsymbol{\beta}-\widetilde{\boldsymbol{\beta}}\|.    (21)

The first ingredient of our approach is the observation that when g(𝜷)=𝜷𝐜22𝑔𝜷superscriptsubscriptnorm𝜷𝐜22g(\boldsymbol{\beta})=\left\|\boldsymbol{\beta}-\mathbf{c}\right\|_{2}^{2} for a given 𝐜𝐜\mathbf{c}, Problem (20) admits a closed form solution.

Proposition 3.

If 𝛃^^𝛃\hat{\boldsymbol{\beta}} is an optimal solution to the following problem:

\hat{\boldsymbol{\beta}} \in \operatorname*{arg\,min}_{\|\boldsymbol{\beta}\|_{0}\leq k} \; \left\|\boldsymbol{\beta}-\mathbf{c}\right\|_{2}^{2},    (22)

then it can be computed as follows: 𝛃^^𝛃\hat{\boldsymbol{\beta}} retains the k𝑘k largest (in absolute value) elements of 𝐜p𝐜superscript𝑝\mathbf{c}\in\mathbb{R}^{p} and sets the rest to zero, i.e., if |c(1)||c(2)||c(p)|,subscript𝑐1subscript𝑐2subscript𝑐𝑝|c_{(1)}|\geq|c_{(2)}|\geq\ldots\geq|c_{(p)}|, denote the ordered values of the absolute values of the vector 𝐜𝐜\mathbf{c}, then:

\hat{\beta}_{i} = \begin{cases} c_{i}, & \text{if } i\in\{(1),\ldots,(k)\}, \\ 0, & \text{otherwise}, \end{cases}    (23)

where, β^isubscript^𝛽𝑖\hat{\beta}_{i} is the i𝑖ith coordinate of 𝛃^^𝛃\hat{\boldsymbol{\beta}}. We will denote the set of solutions to Problem (22) by the notation 𝐇k(𝐜)subscript𝐇𝑘𝐜\mathbf{H}_{{k}}(\mathbf{c}).


Proof.

We provide a proof of this in Section B.2, for the sake of completeness. ∎



Note that, we use the notation “argmin” (Problem (22) and in other places that follow) to denote the set of minimizers of the optimization Problem.

The operator (23) is also known as the hard-thresholding operator [20]—a notion that arises in the context of the following related optimization problem:

\hat{\boldsymbol{\beta}} \in \operatorname*{arg\,min}_{\boldsymbol{\beta}} \; \frac{1}{2}\|\boldsymbol{\beta}-\mathbf{c}\|_{2}^{2} + \lambda\|\boldsymbol{\beta}\|_{0},    (24)

where 𝜷^^𝜷\hat{\boldsymbol{\beta}} admits a simple closed form expression given by β^i=cisubscript^𝛽𝑖subscript𝑐𝑖\hat{\beta}_{i}=c_{i} if |ci|>λsubscript𝑐𝑖𝜆|c_{i}|>\sqrt{\lambda} and β^i=0subscript^𝛽𝑖0\hat{\beta}_{i}=0 otherwise, for i=1,,p𝑖1𝑝i=1,\ldots,p.
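Both thresholding operators have one-line implementations. The sketch below (plain numpy, with hypothetical function names) returns one element of H_k(c) for Problem (22), breaking ties arbitrarily, together with the closed-form solution of Problem (24):

```python
import numpy as np

def hard_threshold_k(c, k):
    """One element of H_k(c) (Problem (22)): keep the k largest entries of c in
    absolute value and set the rest to zero; ties are broken arbitrarily."""
    beta = np.zeros_like(c, dtype=float)
    keep = np.argsort(-np.abs(c))[:k]
    beta[keep] = c[keep]
    return beta

def hard_threshold_penalized(c, lam):
    """Closed-form solution of Problem (24): keep c_i exactly when |c_i| > sqrt(lam)."""
    return np.where(np.abs(c) > np.sqrt(lam), c, 0.0)
```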

Remark 1.

There is an important difference between the minimizers of Problems (22) and (24). For Problem (24), the smallest (in absolute value) non-zero element in 𝛃^^𝛃\hat{\boldsymbol{\beta}} is greater than λ𝜆\lambda in absolute value. On the other hand, in Problem (22) there is no lower bound to the minimum (in absolute value) non-zero element of a minimizer. This needs to be taken care of while analyzing the convergence properties of Algorithm 1 (Section 3.2).



Given a current solution 𝜷𝜷\boldsymbol{\beta}, the second ingredient of our approach is to upper bound the function g(𝜼)𝑔𝜼g(\boldsymbol{\eta}) around g(𝜷)𝑔𝜷g(\boldsymbol{\beta}). To do so, we use ideas from projected gradient descent methods in first order convex optimization problems [45, 44].

Proposition 4.

([45, 44]) For a convex function g(𝛃)𝑔𝛃g(\boldsymbol{\beta}) satisfying condition (21) and for any L𝐿L\geq\ell we have :

g(\boldsymbol{\eta}) \leq Q_{L}(\boldsymbol{\eta},\boldsymbol{\beta}) := g(\boldsymbol{\beta}) + \frac{L}{2}\|\boldsymbol{\eta}-\boldsymbol{\beta}\|_{2}^{2} + \langle\nabla g(\boldsymbol{\beta}),\boldsymbol{\eta}-\boldsymbol{\beta}\rangle    (25)

for all 𝛃,𝛈𝛃𝛈\boldsymbol{\beta},\boldsymbol{\eta} with equality holding at 𝛃=𝛈𝛃𝛈\boldsymbol{\beta}=\boldsymbol{\eta}.



Applying Proposition 3 to the upper bound QL(𝜼,𝜷)subscript𝑄𝐿𝜼𝜷Q_{L}(\boldsymbol{\eta},\boldsymbol{\beta}) in Proposition 4 we obtain

\operatorname*{arg\,min}_{\|\boldsymbol{\eta}\|_{0}\leq k} Q_{L}(\boldsymbol{\eta},\boldsymbol{\beta})
  = \operatorname*{arg\,min}_{\|\boldsymbol{\eta}\|_{0}\leq k} \left( \frac{L}{2}\left\|\boldsymbol{\eta}-\left(\boldsymbol{\beta}-\frac{1}{L}\nabla g(\boldsymbol{\beta})\right)\right\|_{2}^{2} - \frac{1}{2L}\left\|\nabla g(\boldsymbol{\beta})\right\|_{2}^{2} + g(\boldsymbol{\beta}) \right)
  = \operatorname*{arg\,min}_{\|\boldsymbol{\eta}\|_{0}\leq k} \left\|\boldsymbol{\eta}-\left(\boldsymbol{\beta}-\frac{1}{L}\nabla g(\boldsymbol{\beta})\right)\right\|_{2}^{2}
  = \mathbf{H}_{k}\left(\boldsymbol{\beta}-\frac{1}{L}\nabla g(\boldsymbol{\beta})\right),    (26)

where 𝐇k()subscript𝐇𝑘\mathbf{H}_{{k}}(\cdot) is defined in (23). In light of (26) we are now ready to present Algorithm 1 to find a stationary point (see Definition 1) of Problem (20).

Algorithm 1

Input: g(𝜷)𝑔𝜷g(\boldsymbol{\beta}), L𝐿L, ϵitalic-ϵ\epsilon.

Output: A first order stationary solution 𝜷superscript𝜷\boldsymbol{\beta}^{*}.

Algorithm:

  1.

    Initialize with 𝜷1psubscript𝜷1superscript𝑝{\boldsymbol{\beta}}_{1}\in\mathbb{R}^{p} such that 𝜷10ksubscriptnormsubscript𝜷10𝑘\|{\boldsymbol{\beta}}_{1}\|_{0}\leq k.


  2.

    For m1𝑚1m\geq 1, apply (26) with 𝜷=𝜷m𝜷subscript𝜷𝑚\boldsymbol{\beta}={\boldsymbol{\beta}}_{m} to obtain 𝜷m+1subscript𝜷𝑚1{\boldsymbol{\beta}}_{m+1} as:

    \boldsymbol{\beta}_{m+1} \in \mathbf{H}_{k}\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)    (27)

  3.

    Repeat Step 2, until 𝜷m+1𝜷m2ϵsubscriptnormsubscript𝜷𝑚1subscript𝜷𝑚2italic-ϵ\|{\boldsymbol{\beta}}_{m+1}-{\boldsymbol{\beta}}_{m}\|_{2}\leq\epsilon.


  4.

    Let 𝜷m:=(βm1,,βmp)assignsubscript𝜷𝑚subscript𝛽𝑚1subscript𝛽𝑚𝑝\boldsymbol{\beta}_{m}:=(\beta_{m1},\ldots,\beta_{mp}) denote the current estimate and let I=Supp(𝜷m):={i:βmi0}𝐼Suppsubscript𝜷𝑚assignconditional-set𝑖subscript𝛽𝑚𝑖0I=\text{Supp}({\boldsymbol{\beta}}_{m}):=\{i:~{}\beta_{mi}\neq 0\}. Solve the continuous optimization problem:

    \min_{\boldsymbol{\beta}:\,\beta_{i}=0,\;i\notin I} \; g(\boldsymbol{\beta}),    (28)

    and let 𝜷superscript𝜷\boldsymbol{\beta}^{*} be a minimizer.



The convergence properties of Algorithm 1 are presented in Section 3.2. We also present Algorithm 2, a variant of Algorithm 1 with better empirical performance. Algorithm 2 modifies Step 2 of Algorithm 1 by using a line search. It obtains 𝜼m𝐇k(𝜷m1Lg(𝜷m))subscript𝜼𝑚subscript𝐇𝑘subscript𝜷𝑚1𝐿𝑔subscript𝜷𝑚\boldsymbol{\eta}_{m}\in\mathbf{H}_{{k}}\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right) and 𝜷m+1=λm𝜼m+(1λm)𝜷m,subscript𝜷𝑚1subscript𝜆𝑚subscript𝜼𝑚1subscript𝜆𝑚subscript𝜷𝑚\boldsymbol{\beta}_{m+1}=\lambda_{m}\boldsymbol{\eta}_{m}+(1-\lambda_{m})\boldsymbol{\beta}_{m}, where λmargminλg(λ𝜼m+(1λ)𝜷m)subscript𝜆𝑚subscriptargmin𝜆𝑔𝜆subscript𝜼𝑚1𝜆subscript𝜷𝑚\lambda_{m}\in\operatorname*{arg\,min}_{\lambda}g\left(\lambda\boldsymbol{\eta}_{m}+(1-\lambda)\boldsymbol{\beta}_{m}\right).

Note that the iterate 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m} in Algorithm 2 need not be k𝑘k-sparse (i.e., need not satisfy: 𝜷m0ksubscriptnormsubscript𝜷𝑚0𝑘\|\boldsymbol{\beta}_{m}\|_{0}{\leq}k), however, 𝜼msubscript𝜼𝑚\boldsymbol{\eta}_{m} is k𝑘k-sparse (𝜼m0ksubscriptnormsubscript𝜼𝑚0𝑘\|\boldsymbol{\eta}_{m}\|_{0}\leq k). Moreover, the sequence may not lead to a decreasing set of objective values, but it satisfies: g(𝜷m+1)g(𝜼m)g(𝜷m).𝑔subscript𝜷𝑚1𝑔subscript𝜼𝑚not-less-than-nor-greater-than𝑔subscript𝜷𝑚g(\boldsymbol{\beta}_{m+1})\leq g(\boldsymbol{\eta}_{m})\nleq g(\boldsymbol{\beta}_{m}).
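A compact sketch of Algorithms 1 and 2 for a generic smooth convex g is given below (Python/numpy; it reuses hard_threshold_k from the earlier sketch). The polishing Step 4 is left to a separate refit, and the line search of Algorithm 2 is approximated here by a grid over [0, 1] — for the least squares loss the exact minimizer along the segment has a closed form, so the grid is only an illustrative stand-in.

```python
import numpy as np

def algorithm_1(grad_g, beta_1, k, L, eps=1e-4, max_iter=1000):
    """Steps 1-3 of Algorithm 1 for min g(beta) s.t. ||beta||_0 <= k.
    grad_g is the gradient of g and L >= ell its Lipschitz constant."""
    beta = beta_1.copy()
    for _ in range(max_iter):
        beta_next = hard_threshold_k(beta - grad_g(beta) / L, k)   # update (27)
        if np.linalg.norm(beta_next - beta) <= eps:
            return beta_next
        beta = beta_next
    return beta

def algorithm_2_step(grad_g, g, beta, k, L, grid=np.linspace(0.0, 1.0, 21)):
    """One step of the line-search variant (Algorithm 2): eta_m is k-sparse and
    lambda_m minimizes g along the segment joining eta_m and beta_m; the grid
    search is an implementation choice, not part of the algorithm's statement."""
    eta = hard_threshold_k(beta - grad_g(beta) / L, k)
    lam = min(grid, key=lambda t: g(t * eta + (1.0 - t) * beta))
    return lam * eta + (1.0 - lam) * beta, eta
```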

3.2 Convergence Analysis of Algorithm 1

In this section, we study convergence properties for Algorithm 1. Before we embark on the analysis, we need to define the notion of first order optimality for Problem (20).

Definition 1.

Given an L𝐿L\geq\ell, the vector 𝛈p𝛈superscript𝑝\boldsymbol{\eta}\in\mathbb{R}^{p} is said to be a first order stationary point of Problem (20) if 𝛈0ksubscriptnorm𝛈0𝑘\|\boldsymbol{\eta}\|_{0}\leq k and it satisfies the following fixed point equation:

\boldsymbol{\eta} \in \mathbf{H}_{k}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right).    (29)


Let us give some intuition associated with the above definition.

Consider 𝜼𝜼\boldsymbol{\eta} as in Definition 1. Since 𝜼0k,subscriptnorm𝜼0𝑘\|\boldsymbol{\eta}\|_{0}\leq k, it follows that there is a set I{1,,p}𝐼1𝑝I\subset\{1,\ldots,p\} such that ηi=0subscript𝜂𝑖0\eta_{i}=0 for all iI𝑖𝐼i\in I and the size of Icsuperscript𝐼𝑐I^{c} (complement of I𝐼I) is k𝑘k. Since 𝜼𝐇k(𝜼1Lg(𝜼)),𝜼subscript𝐇𝑘𝜼1𝐿𝑔𝜼\boldsymbol{\eta}\in\mathbf{H}_{{k}}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right), it follows that for all iI𝑖𝐼i\notin I, we have: ηi=ηi1Lig(𝜼),subscript𝜂𝑖subscript𝜂𝑖1𝐿subscript𝑖𝑔𝜼\eta_{i}=\eta_{i}-\frac{1}{L}\nabla_{i}g(\boldsymbol{\eta}), where, ig(𝜼)subscript𝑖𝑔𝜼\nabla_{i}g(\boldsymbol{\eta}) is the i𝑖ith coordinate of g(𝜼)𝑔𝜼\nabla g(\boldsymbol{\eta}). It thus follows that: ig(𝜼)=0subscript𝑖𝑔𝜼0\nabla_{i}g(\boldsymbol{\eta})=0 for all iI𝑖𝐼i\notin I. Since g(𝜼)𝑔𝜼g(\boldsymbol{\eta}) is convex in 𝜼𝜼\boldsymbol{\eta}, this means that 𝜼𝜼\boldsymbol{\eta} solves the following convex optimization problem:

\min_{\boldsymbol{\eta}} \; g(\boldsymbol{\eta}) \;\;\; \text{s.t.} \;\;\; \eta_{i}=0,\; i\in I.    (30)

Note however, that the converse of the above statement is not true. That is, if I~{1,,p}~𝐼1𝑝\tilde{{I}}\subset\{1,\ldots,p\} is an arbitrary subset with |I~c|=ksuperscript~𝐼𝑐𝑘|\tilde{{I}}^{c}|=k then a solution 𝜼^I~subscript^𝜼~𝐼\hat{\boldsymbol{\eta}}_{\tilde{I}} to the restricted convex problem (30) with I=I~𝐼~𝐼I=\tilde{{I}} need not correspond to a first order stationary point.

Note that any global minimizer to Problem (20) is also a first order stationary point, as defined above (see Proposition 7).

We present the following proposition (for its proof see Section B.6), which sheds light on a first order stationary point 𝜼𝜼\boldsymbol{\eta} for which 𝜼0<ksubscriptnorm𝜼0𝑘\|\boldsymbol{\eta}\|_{0}<k.

Proposition 5.

Suppose 𝛈𝛈\boldsymbol{\eta} satisfies the first order stationary condition (29) and 𝛈0<ksubscriptnorm𝛈0𝑘\|\boldsymbol{\eta}\|_{0}<k. Then 𝛈argmin𝛃g(𝛃)𝛈subscriptargmin𝛃𝑔𝛃\boldsymbol{\eta}\in\operatorname*{arg\,min}\limits_{\boldsymbol{\beta}}\;\;g(\boldsymbol{\beta}).



We next define the notion of an ϵitalic-ϵ\epsilon-approximate first order stationary point of Problem (20):

Definition 2.

Given an ϵ>0italic-ϵ0\epsilon>0 and L𝐿L\geq\ell we say that 𝛈𝛈\boldsymbol{\eta} satisfies an ϵitalic-ϵ\epsilon-approximate first order optimality condition of Problem (20) if 𝛈0ksubscriptnorm𝛈0𝑘\|\boldsymbol{\eta}\|_{0}\leq k and for some 𝛈^𝐇k(𝛈1Lg(𝛈))^𝛈subscript𝐇𝑘𝛈1𝐿𝑔𝛈\hat{\boldsymbol{\eta}}\in\mathbf{H}_{{k}}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right), we have 𝛈𝛈^2ϵsubscriptnorm𝛈^𝛈2italic-ϵ\|\boldsymbol{\eta}-\hat{\boldsymbol{\eta}}\|_{2}\leq\epsilon.



Before we dive into the convergence properties of Algorithm 1, we need to introduce some notation. Let 𝜷m=(βm1,,βmp)subscript𝜷𝑚subscript𝛽𝑚1subscript𝛽𝑚𝑝\boldsymbol{\beta}_{m}=(\beta_{m1},\ldots,\beta_{mp}) and 𝟏m=(e1,,ep)subscript1𝑚subscript𝑒1subscript𝑒𝑝\mathbf{1}_{m}=(e_{1},\ldots,e_{p}) with ej=1subscript𝑒𝑗1e_{j}=1, if βmj0subscript𝛽𝑚𝑗0\beta_{mj}\neq 0, and ej=0subscript𝑒𝑗0e_{j}=0, if βmj=0subscript𝛽𝑚𝑗0\beta_{mj}=0, j=1,,p𝑗1𝑝j=1,\ldots,p, i.e., 𝟏msubscript1𝑚\mathbf{1}_{m} represents the sparsity pattern of the support of 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m}.

Suppose, we order the coordinates of 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m} by their absolute values: |β(1),m||β(2),m||β(p),m|subscript𝛽1𝑚subscript𝛽2𝑚subscript𝛽𝑝𝑚|\beta_{(1),m}|\geq|\beta_{(2),m}|\geq\ldots\geq|\beta_{(p),m}|. Note that by definition (27), β(i),m=0subscript𝛽𝑖𝑚0\beta_{(i),m}=0 for all i>k𝑖𝑘i>k and m2𝑚2m\geq 2. We denote αk,m=|β(k),m|subscript𝛼𝑘𝑚subscript𝛽𝑘𝑚\alpha_{k,m}=|\beta_{(k),m}| to be the k𝑘kth largest (in absolute value) entry in 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m} for all m2𝑚2m\geq 2. Clearly if αk,m>0subscript𝛼𝑘𝑚0\alpha_{k,m}>0 then 𝜷m0=ksubscriptnormsubscript𝜷𝑚0𝑘\|\boldsymbol{\beta}_{m}\|_{0}=k and if αk,m=0subscript𝛼𝑘𝑚0\alpha_{k,m}=0 then 𝜷m0<ksubscriptnormsubscript𝜷𝑚0𝑘\|\boldsymbol{\beta}_{m}\|_{0}<k. Let α¯k:=lim supmαk,massignsubscript¯𝛼𝑘subscriptlimit-supremum𝑚subscript𝛼𝑘𝑚\overline{\alpha}_{k}:=\limsup\limits_{m\rightarrow\infty}\;\alpha_{k,m} and α¯k:=lim infmαk,massignsubscript¯𝛼𝑘subscriptlimit-infimum𝑚subscript𝛼𝑘𝑚\underline{\alpha}_{k}:=\liminf\limits_{m\rightarrow\infty}\;\alpha_{k,m}.

Proposition 6.

Consider g(𝛃)𝑔𝛃g(\boldsymbol{\beta}) and \ell as defined in (20) and (21). Let 𝛃m,m1subscript𝛃𝑚𝑚1\boldsymbol{\beta}_{m},m\geq 1 be the sequence generated by Algorithm 1. Then we have :

  (a)

    For any L𝐿L\geq\ell, the sequence g(𝜷m)𝑔subscript𝜷𝑚g(\boldsymbol{\beta}_{m}) satisfies

    g(\boldsymbol{\beta}_{m}) - g(\boldsymbol{\beta}_{m+1}) \geq \frac{L-\ell}{2}\left\|\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\right\|_{2}^{2},    (31)

    is decreasing and converges.


  (b)

    If L>𝐿L>\ell, then 𝜷m+1𝜷m𝟎subscript𝜷𝑚1subscript𝜷𝑚0\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\rightarrow\mathbf{0} as m𝑚m\rightarrow\infty.


  (c)

    If L>𝐿L>\ell and α¯k>0subscript¯𝛼𝑘0\underline{\alpha}_{k}>0 then the sequence 𝟏msubscript1𝑚\mathbf{1}_{m} converges after finitely many iterations, i.e., there exists an iteration index Msuperscript𝑀M^{*} such that 𝟏m=𝟏m+1subscript1𝑚subscript1𝑚1\mathbf{1}_{m}=\mathbf{1}_{m+1} for all mM𝑚superscript𝑀m\geq M^{*}. Furthermore, the sequence 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m} is bounded and converges to a first order stationary point.


  (d)

    If L>𝐿L>\ell and α¯k=0subscript¯𝛼𝑘0\underline{\alpha}_{k}=0 then lim infmg(𝜷m)=0subscriptlimit-infimum𝑚subscriptnorm𝑔subscript𝜷𝑚0\liminf\limits_{m\rightarrow\infty}\|\nabla g(\boldsymbol{\beta}_{m})\|_{\infty}=0.


  (e)

    Let L>𝐿L>\ell, α¯k=0subscript¯𝛼𝑘0\overline{\alpha}_{k}=0 and suppose that the sequence 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m} has a limit point. Then g(𝜷m)min𝜷g(𝜷)𝑔subscript𝜷𝑚subscript𝜷𝑔𝜷g(\boldsymbol{\beta}_{m})\rightarrow\min\limits_{\boldsymbol{\beta}}\;\;g(\boldsymbol{\beta}).



Proof.

See Section B.1. ∎


Remark 2.

Note that the existence of a limit point in Proposition 6, Part (e) is guaranteed under fairly weak conditions. One such condition is that sup({𝛃:𝛃0k,f(𝛃)f0})<,supremumconditional-set𝛃formulae-sequencesubscriptnorm𝛃0𝑘𝑓𝛃subscript𝑓0\sup\left(\left\{\boldsymbol{\beta}:\|\boldsymbol{\beta}\|_{0}\leq k,f(\boldsymbol{\beta})\leq f_{0}\right\}\right)<\infty, for any finite value f0subscript𝑓0f_{0}. In words this means that the k𝑘k-sparse level sets of the function g(𝛃)𝑔𝛃g(\boldsymbol{\beta}) is bounded.



In the special case where g(𝛃)𝑔𝛃g(\boldsymbol{\beta}) is the least squares loss function, the above condition is equivalent to every k𝑘k-submatrix (𝐗Jsubscript𝐗𝐽\mathbf{X}_{J}) of 𝐗𝐗\mathbf{X} comprising of k𝑘k columns being full rank. In particular, this holds with probability one when the entries of 𝐗𝐗\mathbf{X} are drawn from a continuous distribution and k<n𝑘𝑛k<n.
g(𝛃)𝑔𝛃g(\boldsymbol{\beta}) 是最小平方法損失函數的特殊情況下,上述條件相當於 𝐗𝐗\mathbf{X} ) b3 > 包含滿等級的 k𝑘k 欄位。特別是,當 𝐗𝐗\mathbf{X} 的條目從連續分佈和 k<n𝑘𝑛k<n 中提取時,這種情況的機率為 1。

Remark 3.

Parts (d) and (e) of Proposition 6 are probably not statistically interesting cases, since they correspond to un-regularized solutions of the problem ming(𝛃)𝑔𝛃\min g(\boldsymbol{\beta}). However, we include them since they shed light on the properties of Algorithm 1.



The conditions assumed in Part (c) imply that the support of 𝛃msubscript𝛃𝑚\boldsymbol{\beta}_{m} stabilizes and Algorithm 1 behaves like vanilla gradient descent thereafter. The support of 𝛃msubscript𝛃𝑚\boldsymbol{\beta}_{m} need not stabilize for Parts (d), (e) and thus Algorithm 1 may not behave like vanilla gradient descent after finitely many iterations. However, the objective values (under minor regularity assumptions) converge to ming(𝛃)𝑔𝛃\min\;g(\boldsymbol{\beta}).

We present the following Proposition (for proof see Section B.3) about a uniqueness property of the fixed point equation (1).

Proposition 7.

Suppose L>𝐿L>\ell and let 𝛈𝛈\boldsymbol{\eta} satisfy a first order stationary point as in Definition 1. Then the set 𝐇k(𝛈1Lg(𝛈))subscript𝐇𝑘𝛈1𝐿𝑔𝛈\mathbf{H}_{{k}}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right) has exactly one element: 𝛈𝛈\boldsymbol{\eta}.



The following proposition (for a proof see Section B.4) shows that a global minimizer of the Problem (20) is also a first order stationary point.

Proposition 8.

Suppose L>𝐿L>\ell and let 𝛃^^𝛃\widehat{\boldsymbol{\beta}} be a global minimizer of Problem (20). Then 𝛃^^𝛃\widehat{\boldsymbol{\beta}} is a first order stationary point.



Proposition 6 establishes that Algorithm 1 either converges to a first order stationary point (Part (c)) or it converges (under minor technical assumptions) to a global optimal solution (Parts (d), (e)), but does not quantify the rate of convergence. We next characterize the rate of convergence of the algorithm to an ϵ-approximate first order stationary point.

Theorem 3.1.

Let L>𝐿L>\ell and 𝛃superscript𝛃\boldsymbol{\beta}^{*} denote a first order stationary point of Algorithm 1. After M𝑀M iterations Algorithm 1 satisfies

\min_{m=1,\ldots,M} \|\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\|_{2}^{2} \leq \frac{2\left(g(\boldsymbol{\beta}_{1})-g(\boldsymbol{\beta}^{*})\right)}{M(L-\ell)},    (32)

where g(𝛃m)g(𝛃)𝑔subscript𝛃𝑚𝑔superscript𝛃g(\boldsymbol{\beta}_{m})\downarrow g(\boldsymbol{\beta}^{*}) as m𝑚m\rightarrow\infty.


Proof.

See Section B.5. ∎



Theorem 3.1 implies that for any ϵ>0italic-ϵ0\epsilon>0 there exists M=O(1ϵ)𝑀𝑂1italic-ϵM=O(\frac{1}{\epsilon}) such that for some 1mM1superscript𝑚𝑀1\leq m^{*}\leq M, we have: 𝜷m+1𝜷m22ϵ.superscriptsubscriptnormsubscript𝜷superscript𝑚1subscript𝜷superscript𝑚22italic-ϵ\|\boldsymbol{\beta}_{m^{*}+1}-\boldsymbol{\beta}_{m^{*}}\|_{2}^{2}\leq\epsilon. Note that the convergence rates derived above apply for a large class of problems (20), where, the function g(𝜷)0𝑔𝜷0g(\boldsymbol{\beta})\geq 0 is convex with Lipschitz continuous gradient (21). Tighter rates may be obtained under additional structural assumptions on g()𝑔g(\cdot). For example, the adaptation of Algorithm 1 for Problem (4) was analyzed in [8, 9] with 𝐗𝐗\mathbf{X} satisfying coherence [8] or Restricted Isometry Property (RIP) [9]. In these cases, the algorithm can be shown to have a linear convergence rate [8, 9], where the rate depends upon the RIP constants.

Note that by Proposition 6 the support of 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m} stabilizes after finitely many iterations, after which Algorithm 1 behaves like gradient descent on the stabilized support. If g(𝜷)𝑔𝜷g(\boldsymbol{\beta}) restricted to this support is strongly convex, then Algorithm 1 will enjoy a linear rate of convergence [45], as soon as the support stabilizes. This behavior is adaptive, i.e., Algorithm 1 does not need to be modified after the support stabilizes.

The next section describes practical post-processing schemes via which first order stationary points of Algorithm 1 can be obtained by solving a low dimensional convex optimization problem, as soon as the support is found, numerically, to stabilize. In our numerical experiments, this version of Algorithm 1 (with multiple starting points) took at most a few minutes for p=2000 and a few seconds for smaller values of p.

Polishing coefficients on the active set

Algorithm 1 detects the active set after a few iterations. Once the active set stabilizes, the algorithm may take a number of iterations to estimate the values of the regression coefficients on the active set to a high accuracy level.

In this context, we found the following simple polishing of coefficients to be useful. When the algorithm has converged to a tolerance of ϵitalic-ϵ\epsilon (104absentsuperscript104\approx 10^{-4}), we fix the current active set, {\mathcal{I}}, and solve the following lower-dimensional convex optimization problem:

\min_{\boldsymbol{\beta}:\,\beta_{i}=0,\;i\notin\mathcal{I}} \; g(\boldsymbol{\beta}).    (33)

In the context of the least squares and the least absolute deviation problems, Problem (33) reduces to a smaller dimensional least squares problem and a linear optimization problem, respectively, which can be solved very efficiently up to a very high level of accuracy.

3.3 Application to Least Squares

For the support constrained problem with squared error loss, we have g(𝜷)=12𝐲𝐗𝜷22𝑔𝜷12superscriptsubscriptnorm𝐲𝐗𝜷22g(\boldsymbol{\beta})=\mbox{$\frac{1}{2}$}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2} and g(𝜷)=𝐗(𝐲𝐗𝜷)𝑔𝜷superscript𝐗𝐲𝐗𝜷\nabla g(\boldsymbol{\beta})=-\mathbf{X}^{\prime}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}). The general algorithmic framework developed above applies in a straightforward fashion for this special case. Note that for this case =λmax(𝐗𝐗)subscript𝜆superscript𝐗𝐗\ell=\lambda_{\max}(\mathbf{X}^{\prime}\mathbf{X}).

The polishing of the regression coefficients in the active set can be performed via a least squares problem on 𝐲,𝐗I𝐲subscript𝐗𝐼\mathbf{y},\mathbf{X}_{I}, where I𝐼I denotes the support of the regression coefficients.
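Putting the pieces together for the least squares loss, here is a minimal end-to-end sketch (Python/numpy, again reusing hard_threshold_k), with the active-set polishing of (33) carried out by an ordinary least squares refit; the 1.01 safety factor ensuring L > ℓ = λ_max(X'X) and the tolerances are illustrative choices.

```python
import numpy as np

def best_subset_least_squares(X, y, k, eps=1e-4, max_iter=1000):
    """Algorithm 1 specialized to g(beta) = 0.5*||y - X beta||_2^2 (Section 3.3),
    followed by polishing on the stabilized support."""
    n, p = X.shape
    L = 1.01 * np.linalg.eigvalsh(X.T @ X).max()       # L > ell = lambda_max(X'X)
    beta = np.zeros(p)
    for _ in range(max_iter):
        grad = -X.T @ (y - X @ beta)
        beta_next = hard_threshold_k(beta - grad / L, k)
        if np.linalg.norm(beta_next - beta) <= eps:
            beta = beta_next
            break
        beta = beta_next
    active = np.flatnonzero(beta)                      # support I of the iterate
    polished = np.zeros(p)
    if active.size > 0:                                # polish: least squares on X_I
        polished[active] = np.linalg.lstsq(X[:, active], y, rcond=None)[0]
    return polished
```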

3.4 Application to Least Absolute Deviation

We will now show how the method proposed in the previous section applies to the least absolute deviation problem with support constraints in 𝜷𝜷\boldsymbol{\beta}:

\min_{\boldsymbol{\beta}} \; g_{1}(\boldsymbol{\beta}) := \|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_{1} \;\;\; \text{s.t.} \;\;\; \|\boldsymbol{\beta}\|_{0}\leq k.    (34)

Since g1(𝜷)subscript𝑔1𝜷g_{1}(\boldsymbol{\beta}) is non-smooth, our framework does not apply directly. We smooth the non-differentiable g1(𝜷)subscript𝑔1𝜷g_{1}(\boldsymbol{\beta}) so that we can apply Algorithms 1 and 2. Observing that g1(𝜷)=sup𝐰1𝐘X𝜷,𝐰subscript𝑔1𝜷subscriptsupremumsubscriptnorm𝐰1𝐘𝑋𝜷𝐰g_{1}(\boldsymbol{\beta})=\sup_{\|\mathbf{w}\|_{\infty}\leq 1}\langle\mathbf{Y}-X\boldsymbol{\beta},\mathbf{w}\rangle we make use of the smoothing technique of [43] to obtain g1(𝜷;τ)=sup𝐰1(𝐘X𝜷,𝐰τ2𝐰22)subscript𝑔1𝜷𝜏subscriptsupremumsubscriptnorm𝐰1𝐘𝑋𝜷𝐰𝜏2superscriptsubscriptnorm𝐰22g_{1}(\boldsymbol{\beta};\tau)=\sup_{\|\mathbf{w}\|_{\infty}\leq 1}(\langle\mathbf{Y}-X\boldsymbol{\beta},\mathbf{w}\rangle-\frac{\tau}{2}\|\mathbf{w}\|_{2}^{2}); which is a smooth approximation of g1(β)subscript𝑔1𝛽g_{1}(\beta), with =λmax(𝐗𝐗)τsubscript𝜆superscript𝐗𝐗𝜏\ell=\frac{\lambda_{\max}(\mathbf{X}^{\prime}\mathbf{X})}{\tau} for which Algorithms 1 and 2 apply.

In order to obtain a good approximation to Problem (34), we found the following strategy to be useful in practice (a code sketch follows these steps):

  1.

    Fix τ>0𝜏0\tau>0, initialize with 𝜷0psubscript𝜷0superscript𝑝\boldsymbol{\beta}_{0}\in\mathbb{R}^{p} and repeat the following steps [2]—[3] till convergence:


  2.

    Apply Algorithm 1 (or Algorithm 2) to the smooth function g1(𝜷;τ)subscript𝑔1𝜷𝜏g_{1}(\boldsymbol{\beta};\tau). Let 𝜷τsuperscriptsubscript𝜷𝜏\boldsymbol{\beta}_{\tau}^{*} be the limiting solution.


  3.

    Decrease ττγ𝜏𝜏𝛾\tau\leftarrow\tau\gamma for some pre-defined constant γ=0.8𝛾0.8\gamma=0.8 (say), and go back to step [1] with 𝜷0=𝜷τsubscript𝜷0superscriptsubscript𝜷𝜏\boldsymbol{\beta}_{0}=\boldsymbol{\beta}_{\tau}^{*}. Exit if τ<TOL,𝜏TOL\tau<\text{TOL}, for some pre-defined tolerance.


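A sketch of this continuation scheme is given below (Python/numpy, reusing hard_threshold_k). The smoothed gradient follows from the dual representation of g_1(β;τ): the maximizing w is clip((y − Xβ)/τ, −1, 1). The constants τ_0, γ and the tolerances are illustrative.

```python
import numpy as np

def smoothed_lad_gradient(X, y, beta, tau):
    """Gradient of g_1(beta; tau): grad = -X' w with w = clip((y - X beta)/tau, -1, 1)."""
    w = np.clip((y - X @ beta) / tau, -1.0, 1.0)
    return -X.T @ w

def lad_best_subset(X, y, k, tau0=1.0, gamma=0.8, tol=1e-3, inner_eps=1e-4):
    """Continuation scheme of Section 3.4: run the discrete first order updates on
    g_1(beta; tau) and shrink tau by gamma until tau < TOL."""
    p = X.shape[1]
    lam_max = np.linalg.eigvalsh(X.T @ X).max()
    beta = np.zeros(p)
    tau = tau0
    while tau >= tol:
        L = 1.01 * lam_max / tau                       # ell = lambda_max(X'X) / tau
        for _ in range(500):
            step = beta - smoothed_lad_gradient(X, y, beta, tau) / L
            beta_next = hard_threshold_k(step, k)
            if np.linalg.norm(beta_next - beta) <= inner_eps:
                beta = beta_next
                break
            beta = beta_next
        tau *= gamma
    return beta
```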

4 A Brief Tour of the Statistical Properties of Problem (1)

As already alluded to in the introduction, there is a substantial body of impressive work characterizing the theoretical properties of best subset solutions in terms of various metrics: predictive performance, estimation of regression coefficients, and variable selection properties. For the sake of completeness, we present a brief review of some of the properties of solutions to Problem (1) in Section C.

5 Computational Experiments for Subset Selection with Least Squares Loss

In this section, we present a variety of computational experiments to assess the algorithmic and statistical performances of our approach. We consider both the classical overdetermined case with n>p𝑛𝑝n>p (Section 5.2) and the high dimensional pnmuch-greater-than𝑝𝑛p\gg n case (Section 5.3) for the least squares loss function with support constraints.

5.1 Description of Experimental Data

We demonstrate the performance of our proposal via a series of experiments on both synthetic and real data.

Synthetic Datasets.

We consider a collection of problems where 𝐱iN(𝟎,𝚺),i=1,,nformulae-sequencesimilar-tosubscript𝐱𝑖N0𝚺𝑖1𝑛\mathbf{x}_{i}\sim\text{N}(\mathbf{0},\mathbf{\Sigma}),i=1,\ldots,n are independent realizations from a p𝑝p-dimensional multivariate normal distribution with mean zero and covariance matrix 𝚺:=(σij)assign𝚺subscript𝜎𝑖𝑗\mathbf{\Sigma}:=(\sigma_{ij}). The columns of the 𝐗𝐗\mathbf{X} matrix were subsequently standardized to have unit 2subscript2\ell_{2} norm. For a fixed 𝐗n×p,subscript𝐗𝑛𝑝\mathbf{X}_{n\times p}, we generated the response 𝐲𝐲\mathbf{y} as follows: 𝐲=𝐗𝜷0+ϵ𝐲𝐗superscript𝜷0bold-italic-ϵ\mathbf{y}=\mathbf{X}\boldsymbol{\beta}^{0}+\boldsymbol{\epsilon}, where ϵiiidN(0,σ2)superscriptsimilar-toiidsubscriptitalic-ϵ𝑖𝑁0superscript𝜎2\epsilon_{i}\stackrel{{\scriptstyle\text{iid}}}{{\sim}}N(0,\sigma^{2}). We denote the number of nonzeros in 𝜷0superscript𝜷0\boldsymbol{\beta}^{0} by k0subscript𝑘0k_{0}. The choice of 𝐗,𝜷0,σ𝐗superscript𝜷0𝜎\mathbf{X},\boldsymbol{\beta}^{0},\sigma determines the Signal-to-Noise Ratio (SNR) of the problem, which is defined as: SNR=var(𝐱𝜷0)σ2.SNRvarsuperscript𝐱superscript𝜷0superscript𝜎2\text{SNR}=\frac{\text{var}(\mathbf{x}^{\prime}\boldsymbol{\beta}^{0})}{\sigma^{2}}.

We considered the following four different examples:

Example 1: We took $\sigma_{ij}=\rho^{|i-j|}$ for $i,j\in\{1,\ldots,p\}\times\{1,\ldots,p\}$. We consider different values of $k_0\in\{5,10\}$ and set $\beta^{0}_{i}=1$ for $k_0$ equi-spaced values of $i$ in the range $\{1,2,\ldots,p\}$. (In the case where exactly equi-spaced values are not possible we rounded the indices to the nearest larger integer value.)

Example 2: We took $\boldsymbol{\Sigma}=\mathbf{I}_{p\times p}$, $k_0=5$ and $\boldsymbol{\beta}^{0}=(\mathbf{1}'_{5\times 1},\mathbf{0}'_{p-5\times 1})'\in\mathbb{R}^{p}$.

Example 3: We took $\boldsymbol{\Sigma}=\mathbf{I}_{p\times p}$, $k_0=10$ and $\beta_{i}^{0}=\frac{1}{2}+(10-\frac{1}{2})\frac{(i-1)}{k_{0}}, i=1,\ldots,10$ and $\beta^{0}_{i}=0, \forall i>10$ — i.e., a vector with ten nonzero entries, with the nonzero values being equally spaced in the interval $[\frac{1}{2},10]$.

Example 4: We took $\boldsymbol{\Sigma}=\mathbf{I}_{p\times p}$, $k_0=6$ and $\boldsymbol{\beta}^{0}=(-10,-6,-2,2,6,10,\mathbf{0}_{p-6})$, i.e., a vector with six nonzero entries, equally spaced in the interval $[-10,10]$.
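For concreteness, the following sketch (ours, in NumPy) constructs the coefficient vectors $\boldsymbol{\beta}^0$ of Examples 1 through 4; the rounding rule in `beta0_example1` is one reasonable reading of the convention stated in Example 1:

```python
import numpy as np

def beta0_example1(p, k0):
    """Example 1: k0 equi-spaced indices in {1,...,p} set to one (indices rounded up)."""
    beta = np.zeros(p)
    idx = np.ceil(np.linspace(0, p - 1, k0)).astype(int)
    beta[idx] = 1.0
    return beta

def beta0_example2(p):
    """Example 2: first five coefficients equal to one."""
    beta = np.zeros(p); beta[:5] = 1.0
    return beta

def beta0_example3(p, k0=10):
    """Example 3: beta_i = 1/2 + (10 - 1/2)(i - 1)/k0 for i = 1,...,10, zero otherwise."""
    beta = np.zeros(p)
    i = np.arange(1, 11)
    beta[:10] = 0.5 + (10 - 0.5) * (i - 1) / k0
    return beta

def beta0_example4(p):
    """Example 4: six nonzero entries equally spaced in [-10, 10]."""
    beta = np.zeros(p); beta[:6] = [-10, -6, -2, 2, 6, 10]
    return beta
```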

Real Datasets

We considered the Diabetes dataset analyzed in [23]. We used the dataset with all the second order interactions included in the model, which resulted in 64 predictors. We reduced the sample size to $n=350$ by taking a random sample and standardized the response and the columns of the model matrix to have zero means and unit $\ell_2$-norm.

In addition to the above, we also considered a real microarray dataset: the Leukemia data [18]. We downloaded the processed dataset from http://stat.ethz.ch/~dettling/bagboost.html, which had $n=72$ binary responses and more than 3000 predictors. We standardized the response and columns of features to have zero means and unit $\ell_2$-norm. We reduced the set of features to 1000 by retaining the features maximally correlated (in absolute value) to the response. We call the resulting feature matrix $\mathbf{X}_{n\times p}$ with $n=72, p=1000$. We then generated a semi-synthetic dataset with continuous response as $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}^{0}+\boldsymbol{\epsilon}$, where the first five coefficients of $\boldsymbol{\beta}^{0}$ were taken as one and the rest as zero. The noise was distributed as $\epsilon_i\stackrel{\text{iid}}{\sim}N(0,\sigma^{2})$, with $\sigma^{2}$ chosen to get SNR = 7.
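A sketch of the preprocessing pipeline just described (our own NumPy illustration; the names `X_raw` and `y_raw` are ours and assume the raw Leukemia data have already been loaded):

```python
import numpy as np

def preprocess_leukemia(X_raw, y_raw, n_keep=1000, k0=5, snr=7.0, seed=0):
    rng = np.random.default_rng(seed)
    # standardize response and features: zero mean, unit l2-norm
    y = y_raw - y_raw.mean(); y /= np.linalg.norm(y)
    X = X_raw - X_raw.mean(axis=0); X /= np.linalg.norm(X, axis=0)
    # keep the n_keep features most correlated (in absolute value) with the response
    corr = np.abs(X.T @ y)                 # proportional to correlation after standardization
    X = X[:, np.argsort(-corr)[:n_keep]]
    # semi-synthetic response: first k0 coefficients equal to one, rest zero
    beta0 = np.zeros(n_keep); beta0[:k0] = 1.0
    sigma2 = np.var(X @ beta0) / snr       # sigma^2 chosen to hit the target SNR
    y_semi = X @ beta0 + rng.normal(0.0, np.sqrt(sigma2), size=X.shape[0])
    return X, y_semi, beta0
```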

Computer Specifications and Software

Computations were carried out on a Linux 64-bit server (Intel(R) Xeon(R) eight-core processor @ 1.80GHz, 16 GB of RAM) for the overdetermined $n>p$ case, and on a Dell Precision T7600 computer with an Intel Xeon E52687 sixteen-core processor @ 3.1GHz and 128GB of RAM for the high-dimensional $p\gg n$ case. The discrete first order methods were implemented in Matlab 2012b. We used Gurobi [33] version 5.5 and the Matlab interface to Gurobi for all of our experiments, apart from the computations for synthetic data for $n>p$, which were done in Gurobi via its Python 2.7 interface.

5.2 The Overdetermined Regime: $n>p$

Using the Diabetes dataset and synthetic datasets, we demonstrate the combined effect of using the discrete first order methods with the MIO approach. Together, these methods show improvements in obtaining good upper bounds and in closing the MIO gap to certify global optimality. Using synthetic datasets where we know the true linear regression model, we perform side-by-side comparisons of this method with several other state-of-the-art algorithms designed to estimate sparse linear models.

5.2.1 Obtaining Good Upper Bounds

We conducted experiments to evaluate the performance of our methods in terms of obtaining high quality solutions for Problem (1).

We considered the following three algorithms:

  1. (a) Algorithm 2 with fifty random initializations. We took fifty random starting values around $\mathbf{0}$ of the form $\min(i-1,1)\boldsymbol{\epsilon}, i=1,\ldots,50$, where $\boldsymbol{\epsilon}\sim N(\mathbf{0}_{p\times 1},4\mathbf{I})$; we found empirically that Algorithm 2 provided better upper bounds than Algorithm 1. We took the solution corresponding to the best objective value.

  2. (b) MIO with cold start, i.e., formulation (9) with a time limit of 500 seconds.

  3. (c) MIO with warm start. This was the MIO formulation initialized with the discrete first order optimization solution obtained from (a). This was run for a total of 500 seconds.

To compare the different algorithms in terms of the quality of upper bounds, we run all the algorithms for every instance and obtain the best solution among them, say, $f_*$. If $f_{\text{alg}}$ denotes the value of the best subset objective function for method “alg”, then we define the relative accuracy of the solution obtained by “alg” as:

$$\text{Relative Accuracy}=(f_{\text{alg}}-f_{*})/f_{*}, \qquad (35)$$

where $\text{alg}\in\{\text{(a)},\text{(b)},\text{(c)}\}$ as described above.

We did experiments for the Diabetes dataset for different values of $k$ (see Table 1). For each of the algorithms we report the amount of time taken by the algorithm to reach the best objective value during the time of 500 seconds.

$k$   | Discrete First Order | MIO Cold Start    | MIO Warm Start
      | Accuracy    Time     | Accuracy    Time  | Accuracy    Time
9     | 0.1306      1        | 0.0036      500   | 0           346
20    | 0.1541      1        | 0.0042      500   | 0           77
49    | 0.1915      1        | 0.0015      500   | 0           87
57    | 0.1933      1        | 0           500   | 0           2

Table 1: Quality of upper bounds for Problem (1) for the Diabetes dataset, for different values of $k$. We see that the MIO equipped with warm starts delivers the best upper bounds in the shortest overall times. The run time for the MIO with warm start includes the time taken by the discrete first order method (which was in all cases less than a second).

Using the discrete first order methods in combination with the MIO algorithm resulted in finding the best possible relative accuracy in a matter of a few minutes.

5.2.2 Improving MIO Performance via Warm Starts

We performed a series of experiments on the Diabetes dataset to obtain a globally optimal solution to Problem (1) via our approach and to understand the implications of using advanced warm starts to the MIO formulation in terms of certifying optimality. For each choice of $k$ we ran Algorithm 2 with fifty random initializations. They took less than a few seconds to run. We used the best solution as an advanced warm start to the MIO formulation (9). For each of these examples, we also ran the MIO formulation without any warm start information and also without the parameter specifications in Section 2.3 (we refer to this as “cold start”). Figure 3 summarizes the results. The figure shows that in the presence of warm starts and problem specific side information, the MIO closes the optimality gap significantly faster.
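To make the notion of an advanced warm start concrete at the solver level, the sketch below (our own illustration, assuming the Gurobi Python interface and that the variables of formulation (9) have already been created) passes a first order solution as a MIP start:

```python
def set_warm_start(model, beta_vars, z_vars, beta_fo):
    """Pass a k-sparse first order solution beta_fo as a MIP start to a gurobipy model."""
    for i, b in enumerate(beta_fo):
        beta_vars[i].Start = float(b)                 # coefficient values
        z_vars[i].Start = 1.0 if b != 0 else 0.0      # binary support indicators
    model.update()
```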

[Figure 3 panels: k=9, k=20, k=31, k=42; horizontal axis: Time (secs).]

Figure 3: The evolution of the MIO optimality gap (in $\log_{10}(\cdot)$ scale) for Problem (1), for the Diabetes dataset with $n=350, p=64$, with and without warm starts (and parameter specifications as in Section 2.3) for different values of $k$. The MIO significantly benefits from advanced warm starts delivered by Algorithm 2. In all of these examples, the global optimum was found within a very small fraction of the total time, but the proof of global optimality came later.

5.2.3 Statistical Performance

We considered datasets as described in Example 1, Section 5.1—we took different values of $n, p$ with $n>p$, and $\rho$, with $k_0=10$.

Competing Methods and Performance Measures

For every example, we considered the following learning procedures for comparison purposes: (a) the MIO approach equipped warm starts from Algorithm 2 (annotated as “MIO” in the figure), (b) the Lasso, (c) Sparsenet and (d) stepwise regression (annotated as “Step” in the figure).

We used R to compute Lasso, Sparsenet and stepwise regression using the glmnet 1.7.3, Sparsenet and Stats 3.0.2 packages respectively, which were all downloaded from CRAN at http://cran.us.r-project.org/.

In addition to the above, we have also performed comparisons with a debiased version of the Lasso: i.e., performing unrestricted least squares on the Lasso support to mitigate the bias imparted by Lasso shrinkage.

We note that Sparsenet [38] considers a penalized likelihood formulation of the form (3), where the penalty is given by the generalized MCP penalty family (indexed by $\lambda,\gamma$) for a family of values of $\gamma\geq 1$ and $\lambda\geq 0$. The family of penalties used by Sparsenet is thus given by: $p(t;\gamma;\lambda)=\lambda(|t|-\frac{t^{2}}{2\lambda\gamma})\mathbf{I}(|t|<\lambda\gamma)+\frac{\lambda^{2}\gamma}{2}\mathbf{I}(|t|\geq\lambda\gamma)$ for $\gamma,\lambda$ as described above. As $\gamma=\infty$ with $\lambda$ fixed, we get the penalty $p(t;\gamma;\lambda)=\lambda|t|$. The family above includes as a special case ($\gamma=1$) the hard thresholding penalty, a penalty recommended in the paper [60] for its useful statistical properties.
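For reference, the MCP penalty family above is straightforward to evaluate; a minimal sketch (ours, in NumPy):

```python
import numpy as np

def mcp_penalty(t, lam, gamma):
    """Generalized MCP penalty used by Sparsenet, for lam >= 0 and gamma >= 1."""
    t = np.abs(np.asarray(t, dtype=float))
    inner = lam * (t - t**2 / (2.0 * lam * gamma))   # region |t| < lam * gamma
    flat = lam**2 * gamma / 2.0                       # region |t| >= lam * gamma
    return np.where(t < lam * gamma, inner, flat)
```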

For each procedure, we obtained the “optimal” tuning parameter by selecting the model that achieved the best predictive performance on a held out validation set. Once the model $\widehat{\boldsymbol{\beta}}$ was selected, we obtained the prediction error as:

$$\text{Prediction Error}=\|\mathbf{X}\widehat{\boldsymbol{\beta}}-\mathbf{X}\boldsymbol{\beta}^{0}\|_{2}^{2}/\|\mathbf{X}\boldsymbol{\beta}^{0}\|_{2}^{2}. \qquad (36)$$

We report “prediction error” and number of non-zeros in the optimal model in our results. The results were averaged over ten random instances, for different realizations of $\mathbf{X}, \epsilon$. For every run: the training and validation data had a fixed $\mathbf{X}$ but random noise $\epsilon$.
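The selection and evaluation protocol can be summarized in a few lines; in the sketch below (our own illustration) `fit` stands for any of the learning procedures above, returning a coefficient vector for a given tuning parameter:

```python
import numpy as np

def select_and_evaluate(fit, params, X_tr, y_tr, X_val, y_val, X, beta0):
    """Pick the tuning parameter by validation error, then compute Prediction Error (36)."""
    best_beta, best_val = None, np.inf
    for par in params:
        beta_hat = fit(X_tr, y_tr, par)
        val_err = np.sum((y_val - X_val @ beta_hat) ** 2)
        if val_err < best_val:
            best_val, best_beta = val_err, beta_hat
    pred_err = np.sum((X @ (best_beta - beta0)) ** 2) / np.sum((X @ beta0) ** 2)
    return best_beta, pred_err
```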

Figure 4 presents results for data generated as per Example 1 with $n=500$ and $p=100$. We see that the MIO procedure performs very well across all the examples. Among the methods, MIO performs the best, followed by Sparsenet and Lasso, with Step(wise) exhibiting the worst performance. In terms of prediction error, the MIO performs the best, only to be marginally outperformed by Sparsenet in a few instances. This further illustrates the importance of using non-convex methods in sparse learning. Note that the MIO approach, unlike Sparsenet, certifies global optimality in terms of solving Problem (1). However, based on the plots in the upper panel, Sparsenet selects a few redundant variables, unlike MIO. Lasso delivers quite dense models and pays the price in predictive performance too, by selecting wrong variables. As the value of SNR increases, the predictive power of the methods improves, as expected. The differences in predictive errors between the methods diminish with increasing SNR values. With increasing values of $\rho$ (from left panel to right panel in the figure), the number of non-zeros selected by the Lasso in the optimal model increases.

We also performed experiments with the debiased version of the Lasso. The unrestricted least squares solution on the optimal model selected by Lasso (as shown in Figure 4) had worse predictive performance than the Lasso, with the same sparsity pattern. This is probably due to overfitting since the model selected by the Lasso is quite dense compared to $n, p$. We also tried some variants of debiased Lasso which led to models with better performances than the Lasso but the results were inferior compared to MIO — we provide a detailed description in Section D.2.

Figure 4: Figure showing the sparsity (upper panel) and predictive performances (bottom panel) for different subset selection procedures for the least squares loss. Here, we consider data generated as per Example 1, with $n=500, p=100$, $k_0=10$, for three different SNR values with [Left Panel] $\rho=0.5$, [Middle Panel] $\rho=0.8$, and [Right Panel] $\rho=0.9$. The dashed line in the top panel represents the true number of nonzero values. For each of the procedures, the optimal model was selected as the one which produced the best prediction accuracy on a separate validation set, as described in Section 5.2.3.

We also performed experiments with $n=1000, p=50$ for data generated as per Example 1. We solved the problems to provable optimality and found that the MIO performed very well when compared to other competing methods. We do not report the experiments for brevity.

5.2.4 MIO model training

We trained a sequence of best subset models (indexed by $k$) by applying the MIO approach with warm starts. Instead of running the MIO solvers from scratch for different values of $k$, we used callbacks, a feature of integer optimization solvers. Callbacks allow the user to solve an initial model, and then add additional constraints to the model one at a time. These “cuts” reduce the size of the feasible region without having to rebuild the entire optimization model. Thus, in our case, we can save time by building the initial optimization model for $k=p$. Once the solution for $k=p$ is obtained, a cut can be added to the model: $\sum_{i=1}^{p}z_{i}\leq k$ for $k=p-1$, and the model can be re-solved from this point. We apply this procedure until we arrive at a model with $k=1$.
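A sketch of this sequential scheme is given below (our own Gurobi Python illustration; it simply re-optimizes after adding a tightening cardinality cut and does not reproduce the authors' exact callback mechanism). The constructor `build_mio` is hypothetical and is assumed to return the model and the binary support variables $z$ of formulation (9):

```python
import gurobipy as gp

def best_subset_path(build_mio, X, y, p, k_min=1):
    """Solve the best subset MIO for k = p, p-1, ..., k_min by adding tightening cuts."""
    model, z = build_mio(X, y)            # hypothetical constructor of formulation (9)
    model.Params.MIPGap = 0.01            # stop at a 1% optimality gap ...
    model.Params.TimeLimit = 900          # ... or after 15 minutes, whichever comes first
    supports = {}
    for k in range(p, k_min - 1, -1):
        model.addConstr(gp.quicksum(z[i] for i in range(p)) <= k)  # cut: at most k nonzeros
        model.optimize()                  # re-solves from the current state
        supports[k] = [z[i].X for i in range(p)]
    return supports
```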

For each value of $k$ tested, the MIO best subset algorithm was set to stop the first time either an optimality gap of 1% was reached or a time limit of 15 minutes was reached. Additionally, we only tested values of $k$ from 5 through 25, and used Algorithm 2 to warm start the MIO algorithm. We observed that it was possible to obtain speedups of a factor of 2-4 by carefully tuning the optimization solver for a particular problem, but chose to maintain generality by solving with default parameters. Thus, we do not report times with the intention of accurately benchmarking the best possible time but rather to show that it is computationally tractable to solve problems to optimality using modern MIO solvers.

5.3 The High-Dimensional Regime: $p\gg n$

In this section, we investigate (a) the evolution of upper bounds in the high-dimensional regime, (b) the effect of a bounding box formulation on the speed of closing the optimality gap and (c) the statistical performance of the MIO approach in comparison to other state-of-the-art methods.

5.3.1 Obtaining Good Upper Bounds

We performed tests similar to those in Section 5.2.1 for the $p\gg n$ regime. We tested a synthetic dataset corresponding to Example 2 with $n=30, p=2000$ for varying SNR values (see Table 2) over a time of 500s. As before, using the discrete first order methods in combination with the MIO algorithm resulted in finding the best possible upper bounds in the shortest possible times.

          $k$   | Discrete First Order | MIO Cold Start    | MIO Warm Start
                | Accuracy    Time     | Accuracy    Time  | Accuracy    Time
SNR = 3    5    | 0.1647      37.2     | 1.0510      500   | 0           72.2
           6    | 0.6152      41.1     | 0.2769      500   | 0           77.1
           7    | 0.7843      40.7     | 0.8715      500   | 0           160.7
           8    | 0.5515      38.8     | 2.1797      500   | 0           295.8
           9    | 0.7131      45.0     | 0.4204      500   | 0           96.0
SNR = 7    5    | 0.5072      45.6     | 0.7737      500   | 0           65.6
           6    | 1.3221      40.3     | 0.5121      500   | 0           82.3
           7    | 0.9745      40.9     | 0.7578      500   | 0           210.9
           8    | 0.8293      40.5     | 1.8972      500   | 0           262.5
           9    | 1.1879      44.2     | 0.4515      500   | 0           254.2

Table 2: The quality of upper bounds for Problem (1) obtained by Algorithm 2, MIO with cold start and MIO warm-started with Algorithm 2. We consider the synthetic dataset of Example 2 with $n=30, p=2000$ and different values of SNR. The MIO method, when warm-started with the first order solution, performs the best in terms of getting a good upper bound in the shortest time. The metric “Accuracy” is defined in (35). The first order methods are fast but need not lead to highest quality solutions on their own. MIO improves the quality of upper bounds delivered by the first order methods and their combined effect leads to the best performance.

We also did experiments on the Leukemia dataset. In Figure 5 we demonstrate the evolution of the objective value of the best subset problem for different values of $k$. For each value of $k$, we warm-started the MIO with the solution obtained by Algorithm 2 and allowed the MIO solver to run for 4000 seconds. The best objective value obtained at the end of 4000 seconds is denoted by $f_*$. We plot the Relative Accuracy, i.e., $(f_t-f_*)/f_*$, where $f_t$ is the objective value obtained after $t$ seconds. The figure shows that the solution obtained by Algorithm 2 is improved by the MIO on various instances and the time taken to improve the upper bounds depends upon $k$. In general, for smaller values of $k$ the upper bounds obtained by the MIO algorithm stabilize earlier, i.e., the MIO finds improved solutions faster than for larger values of $k$.

Figure 5: Behavior of MIO aided with warm start in obtaining good upper bounds over time for the Leukemia dataset ($n=72, p=1000$). The vertical axis shows relative accuracy, i.e., $(f_t-f_*)/f_*$, where $f_t$ is the objective value obtained after $t$ seconds and $f_*$ denotes the best objective value obtained by the method after 4000 seconds. The colored diamonds correspond to the locations where the MIO (with warm start) attains the best solution. The figure shows that MIO improves the solution obtained by the first order method in all the instances. The time at which the best possible upper bound is obtained depends upon the choice of $k$. Typically larger $k$ values make the problem harder—hence the best solutions are obtained after a longer wait.

5.3.2 Bounding Box Formulation

With the aid of advanced warm starts as provided by Algorithm 2, the MIO obtains a very high quality solution very quickly—in most of the examples the solution thus obtained turns out to be the global minimum. However, in the typical “high-dimensional” regime, with $p\gg n$, we observe that the certificate of global optimality comes later as the lower bounds of the problem “evolve” slowly. This is observed even in the presence of warm starts and using the implied bounds as developed in Section 2.2, and is aggravated for the cold-started MIO formulation (10).

To address this, we consider the MIO formulation (37) obtained by adding bounding boxes around a local solution. These restrictions guide the MIO in restricting its search space and enable the MIO to certify global optimality inside that bounding box. We consider the following additional bounding box constraints to the MIO formulation (10):

$$\left\{\boldsymbol{\beta}:\|\mathbf{X}\boldsymbol{\beta}-\mathbf{X}\boldsymbol{\beta}_{0}\|_{1}\leq{\mathcal{L}}^{\zeta}_{\ell,\text{loc}}\right\}\;\cap\;\left\{\boldsymbol{\beta}:\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|_{1}\leq{\mathcal{L}}^{\beta}_{\ell,\text{loc}}\right\},$$

where $\boldsymbol{\beta}_{0}$ is a candidate sparse solution. The radii of the two $\ell_1$-balls above, namely, ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}$ and ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}$, are user-defined parameters and control the size of the feasible set.

Using the notation $\boldsymbol{\zeta}=\mathbf{X}\boldsymbol{\beta}$ we have the following MIO formulation (equipped with the additional bounding boxes):

$$\begin{array}{ll}
\min\limits_{\boldsymbol{\beta},\mathbf{z},\boldsymbol{\zeta}} & \frac{1}{2}\,\boldsymbol{\zeta}^{T}\boldsymbol{\zeta}-\langle\mathbf{X}'\mathbf{y},\boldsymbol{\beta}\rangle+\frac{1}{2}\,\|\mathbf{y}\|_{2}^{2}\\[4pt]
\text{s.t.} & \boldsymbol{\zeta}=\mathbf{X}\boldsymbol{\beta}\\
& (\beta_{i},1-z_{i}):\ \text{SOS type-1},\;\; i=1,\ldots,p\\
& z_{i}\in\{0,1\},\; i=1,\ldots,p\\
& \sum\limits_{i=1}^{p}z_{i}\leq k\\
& -{\mathcal{M}}_{U}\leq\beta_{i}\leq{\mathcal{M}}_{U},\; i=1,\ldots,p\\
& \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{\ell}\\
& -{\mathcal{M}}^{\zeta}_{U}\leq\zeta_{i}\leq{\mathcal{M}}^{\zeta}_{U},\; i=1,\ldots,n\\
& \|\boldsymbol{\zeta}\|_{1}\leq{\mathcal{M}}^{\zeta}_{\ell}\\
& \|\boldsymbol{\zeta}-\boldsymbol{\zeta}_{0}\|_{1}\leq{\mathcal{L}}^{\zeta}_{\ell,\text{loc}}\\
& \|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|_{1}\leq{\mathcal{L}}^{\beta}_{\ell,\text{loc}}.
\end{array} \qquad (37)$$

For large values of ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}$ (respectively, ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}$) the constraints on $\mathbf{X}\boldsymbol{\beta}$ (respectively, $\boldsymbol{\beta}$) become ineffective and one gets back formulation (10). To see the impact of these additional cutting planes in the MIO formulation, we consider a few examples as illustrated in Figures 6, 7 and 12.
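To illustrate how the two $\ell_1$ bounding boxes translate into linear constraints inside a solver, the sketch below (our own Gurobi Python illustration; `model`, `beta`, `zeta`, `beta0`, `zeta0` and the radii are assumed to exist already) adds an $\ell_1$-ball around a candidate solution via auxiliary absolute-value variables:

```python
import gurobipy as gp

def add_l1_ball(model, vars_, center, radius, name):
    """Add ||v - center||_1 <= radius as linear constraints using auxiliary variables."""
    m = len(vars_)
    t = model.addVars(m, lb=0.0, name=f"abs_{name}")   # t_i >= |v_i - center_i|
    for i in range(m):
        model.addConstr(vars_[i] - center[i] <= t[i])
        model.addConstr(center[i] - vars_[i] <= t[i])
    model.addConstr(gp.quicksum(t[i] for i in range(m)) <= radius)

# usage (radii and reference points come from a warm-start solution):
# add_l1_ball(model, beta, beta0, L_beta_loc, "beta")   # O(p) extra variables/constraints
# add_l1_ball(model, zeta, zeta0, L_zeta_loc, "zeta")   # O(n) extra variables/constraints
```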

Interpretation of the bounding boxes

A local bounding box in the variable $\boldsymbol{\zeta}=\mathbf{X}\boldsymbol{\beta}$ directs the MIO solver to seek candidate solutions that deliver models with predictive accuracy “similar” (controlled by the radius of the ball) to a reference predictive model, given by $\boldsymbol{\zeta}_{0}$. In our experiments, we typically chose $\boldsymbol{\zeta}_{0}$ as the solution delivered by running MIO (warm-started with a first order solution) for a few hundred to a few thousand seconds. More generally, $\boldsymbol{\zeta}_{0}$ may be selected by any other sparse learning method. In our experiments, we found that the run-time behavior of the MIO depends upon how correlated the columns of $\mathbf{X}$ are — more correlation leading to longer run-times.

Similarly, a bounding box around $\boldsymbol{\beta}$ directs the MIO to look for solutions in the neighborhood of a reference point $\boldsymbol{\beta}_{0}$. In our experiments, we chose the reference $\boldsymbol{\beta}_{0}$ as the solution obtained by MIO (warm-started with a first order solution) after allowing it to run for a few hundred to a few thousand seconds. We observed that, in the presence of bounding boxes in the $\boldsymbol{\beta}$-space, the MIO solver certified optimality (finding better solutions in the process) much faster than with the $\boldsymbol{\zeta}$-bounding box method.

Note that the $\boldsymbol{\beta}$-bounding box constraint leads to $O(p)$ constraints and the $\boldsymbol{\zeta}$-box leads to $O(n)$ constraints. Thus, when $p\gg n$ the additional $\boldsymbol{\zeta}$ constraints add fewer extra variables when compared to the $\boldsymbol{\beta}$ constraints.

Experiments

In the first set of experiments, we consider the Leukemia dataset with $n=72, p=1000$. We took two different values of $k\in\{5,10\}$ and for each case we ran Algorithm 2 with several random restarts. The best solution thus obtained was used to warm start the MIO formulation (10), which we ran for an additional 3600 seconds. The solution thus obtained is denoted by $\boldsymbol{\beta}_{0}$. We then consider formulation (37) with ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=\infty$ and different values of ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\text{Frac}$ (as annotated in Figure 6) — the results are displayed in Figure 6.

Leukemia dataset: Effect of a Bounding Box for MIO formulation (37). [Two panels: $k=5$ (left) and $k=10$ (right).]

Figure 6: The effect of the MIO formulation (37) for the Leukemia dataset, for different values of $k$. Here ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=\infty$ and ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\text{Frac}$. For each value of $k$, the global minimum obtained was the same for the different choices of ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}$.

We consider another set of experiments to demonstrate the performance of the MIO in certifying global optimality for different synthetic datasets with varying $n, p, k$ as well as with different structures on the bounding box. In the first case, we generated data as per Example 1 with $\rho=0.9$, $k_0=5$. We consider the case with $\boldsymbol{\zeta}_{0}=\mathbf{X}\boldsymbol{\beta}_{0}$, ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\infty$ and ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=0.5\|\mathbf{X}\boldsymbol{\beta}_{0}\|_{1}$, where $\boldsymbol{\beta}_{0}$ is a $k$-sparse solution obtained from the MIO formulation (10) run with a time limit of 1000 seconds, after being warm-started with Algorithm 2. The results are displayed in Figure 7 [Left Panel]. In the second case (with data same as before) we obtained $\boldsymbol{\beta}_{0}$ in the same fashion as described before—we took a bounding box around $\boldsymbol{\beta}_{0}$, and left the box constraint around $\mathbf{X}\boldsymbol{\beta}_{0}$ inactive, i.e., we set ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=\infty$ and ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\|\boldsymbol{\beta}_{0}\|_{1}/k$. We performed two sets of experiments, where the data were generated based on different SNR values—the results are displayed in Figure 7 with SNR = 1 [Middle Panel] and SNR = 3 [Right Panel].

In the same vein, we have Figure 12 studying the effect of formulations (37) for synthetic datasets generated as per Example 1 with $n=50, p=1000, \rho=0.9$ and $k_0=5$.

Evolution of the MIO gap for (37), effect of type of bounding box ($n=50, p=500$). [Three panels.]

Figure 7: The effect of the MIO formulation (37) for a synthetic dataset as in Example 1 with $\rho=0.9$, $k_0=5$, $n=50, p=500$, for different values of $k$. [Left Panel] ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=0.5\|\mathbf{X}\boldsymbol{\beta}_{0}\|_{1}$ and ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\infty$ for a data-set with SNR = 3. [Middle Panel] ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=\infty$, ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\|\boldsymbol{\beta}_{0}\|_{1}/k$ and SNR = 1. [Right Panel] ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=\infty$, ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\|\boldsymbol{\beta}_{0}\|_{1}/k$ and SNR = 3. The figure shows that the bounding boxes in terms of $\mathbf{X}\boldsymbol{\beta}$ (left panel) make the problem harder to solve, when compared to bounding boxes around $\boldsymbol{\beta}$ (middle and right panels). A possible reason is the strong correlations among the columns of $\mathbf{X}$. The SNR values do not seem to have a big impact on the run-times of the algorithms (middle and right panels).

5.3.3 Statistical Performance

To understand the statistical behavior of MIO when compared to other approaches for learning sparse models, we considered synthetic datasets for values of $n$ ranging from 30-50 and values of $p$ ranging from 1000-2000. The following methods were used for comparison purposes: (a) Algorithm 2. Here we used fifty different random initializations around $\mathbf{0}$, of the form $\min(i-1,1)N(\mathbf{0}_{p\times 1},4\mathbf{I}), i=1,\ldots,50$, and took the solution corresponding to the best objective value; (b) The MIO approach with warm starts from part (a); (c) The Lasso solution and (d) The Sparsenet solution.

For methods (a), (b) we considered ten equi-spaced values of $k$ in the range $[3, 2k_0]$ (including the optimal value of $k_0$). For each of the methods, the best model was selected in the same fashion as described in Section 5.2.3 using separate validation sets.

In addition, for some examples, we also study the performance of the debiased version of the Lasso, as described in Section 5.2.3.

In Figure 8 and Figure 9 we present selected representative results from four different examples described in Section 5.1.

Figure 8: The sparsity and predictive performance for different procedures: [Left Panel] shows Example 1 with $n=50, p=1000, \rho=0.8, k_0=5$ and [Right Panel] shows Example 2 with $n=30, p=1000$—for each instance several SNR values have been shown.
Figure 9: [Left Panel] shows performance for data generated according to Example 3 with $n=30, p=1000$ and [Right Panel] shows Example 4 with $n=50, p=2000$.

In Figure 8 the left panel shows the performance of different methods for Example 1 with $n=50, p=1000, \rho=0.8, k_0=5$. In this example, there are five non-zero coefficients: the features corresponding to the non-zero coefficients are weakly correlated and a feature having a non-zero coefficient is highly correlated with a feature having a zero coefficient. In this situation, the Lasso selects a very dense model since it fails to distinguish between a zero and a non-zero coefficient when the variables are correlated—it brings both the coefficients in the model (with shrinkage). MIO (with warm-start) performs the best—both in terms of predictive accuracy and in selecting a sparse set of coefficients. MIO obtains the sparsest model among the four methods and seems to find better solutions in terms of statistical properties than the models obtained by the first order methods alone. Interestingly, the “optimal model” selected by the first order methods is more dense than that selected by the MIO. The number of non-zero coefficients selected by MIO remains fairly stable across different SNR values, unlike the other three methods. For this example, we also experimented with the different versions of debiased Lasso. In summary: the best debiased Lasso models had performance marginally better than Lasso but quite inferior to MIO. See the results in Appendix, Section D.2 for further details.

In Figure 8 the right panel shows Example 2, with $n=30, p=1000, k_0=5$ and all non-zero coefficients equal to one. In this example, all the methods perform similarly in terms of predictive accuracy. This is because all non-zero coefficients in $\boldsymbol{\beta}^{0}$ have the same value. In fact for the smallest value of SNR, the Lasso achieves the best predictive model. In all the cases however, the MIO achieves the sparsest model with favorable predictive accuracy.

In Figure 9, for both the examples, the model matrix is an iid Gaussian ensemble. The underlying regression coefficient $\boldsymbol{\beta}^{0}$, however, is structurally different than in Example 2 (as in Figure 8, right panel). The structure in $\boldsymbol{\beta}^{0}$ is responsible for the different statistical behaviors of the four methods across Figure 8 (right panel) and Figure 9 (both panels). The alternating signs and varying amplitudes of $\boldsymbol{\beta}^{0}$ are responsible for the poor behavior of Lasso. The MIO (with warm-starts) seems to be the best among all the methods. For Example 3 (Figure 9, left panel) the predictive performances of Lasso and MIO are comparable—the MIO however delivers much sparser models than the Lasso.

The key conclusions are as follows:

  1. The MIO best subset algorithm has a significant edge in detecting the correct sparsity structure for all examples compared to Lasso, Sparsenet and the stand-alone discrete first order method.

  2. For data generated as per Example 1 with large values of $\rho$, the MIO best subset algorithm gives better predictive performance compared to its competitors.

  3. For data generated as per Examples 2 and 3, MIO delivers predictive models similar to the Lasso, but produces much sparser models. In fact, Lasso seems to perform marginally better than MIO, as a predictive model, for small values of SNR.

  4. For Example 4, MIO performs the best both in terms of predictive accuracy and delivering sparse models.

6 Computational Results for Subset Selection with Least Absolute Deviation Loss

In this section, we demonstrate how our method can be used for the best subset selection problem with LAD objective (34).

Since the main focus of this paper is the least squares loss function, we consider only a few representative examples for the LAD case. The LAD loss is appropriate when the error follows a heavy tailed distribution. The datasets used for the experiments parallel those described in Section 5.1, the difference being in the distribution of $\epsilon$. We took $\epsilon_i$ iid from a double exponential distribution with variance $\sigma^2$. The value of $\sigma^2$ was adjusted to get different values of SNR.
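A sketch of the noise generation (ours, in NumPy): a double exponential (Laplace) distribution with scale $b$ has variance $2b^2$, so a target $\sigma^2$, and hence a target SNR, can be hit by taking $b=\sigma/\sqrt{2}$.

```python
import numpy as np

def double_exponential_noise(X, beta0, snr, seed=0):
    """Laplace errors with variance sigma^2 chosen so that var(X beta0)/sigma^2 = snr."""
    rng = np.random.default_rng(seed)
    sigma2 = np.var(X @ beta0) / snr
    b = np.sqrt(sigma2 / 2.0)            # Laplace(0, b) has variance 2 b^2
    return rng.laplace(0.0, b, size=X.shape[0])
```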

Datasets analysed

We consider a set-up similar to Example 1 (Section 5.1) with $k_0=5$ and $\rho=0.9$. Different choices of $(n,p)$ were taken to cover both the overdetermined ($n=500, p=100$) and high-dimensional cases ($n=50, p=1000$ and $n=500, p=1000$).

The other competing methods used for comparison were (a) the discrete first order method (Section 3.4), (b) MIO warm-started with the first order solutions and (c) the LAD loss with $\ell_1$ regularization:

$$\min\;\;\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{1}+\lambda\|\boldsymbol{\beta}\|_{1},$$

which we denote by LAD-Lasso. The training, validation and testing were done in the same fashion as in the least squares case. For each method, we report the number of non-zeros in the optimal model and associated prediction accuracy (36).
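The LAD-Lasso benchmark is itself a convex problem; a minimal sketch of solving it for a single value of $\lambda$ (ours, using the cvxpy modeling package, which the paper does not use):

```python
import cvxpy as cp

def lad_lasso(X, y, lam):
    """Solve min ||y - X beta||_1 + lam * ||beta||_1 (an LP after standard reformulation)."""
    n, p = X.shape
    beta = cp.Variable(p)
    objective = cp.Minimize(cp.norm1(y - X @ beta) + lam * cp.norm1(beta))
    cp.Problem(objective).solve()
    return beta.value
```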

Figure 10: The sparsity and predictive performance for different procedures for $n=500, p=100$ for Problem (34). The data is generated as per Example 1 with $\rho=0.9, k_0=5$ and double exponential errors—further details are available in the text. The acronym “Lasso” refers to LAD-Lasso (6). The MIO is seen to deliver sparser models with better predictive accuracy when compared to the LAD-Lasso.

Figure 10 compares the MIO approach with others for LAD in the overdetermined case ($n>p$). Figure 11 does the same for the high-dimensional case ($p\gg n$). The conclusions parallel those for the least squares case. Since, in the example considered, the features corresponding to the non-zero coefficients are weakly correlated and a feature having a non-zero coefficient is highly correlated with a feature having a zero coefficient—the LAD-Lasso selects an overly dense model and misses out in terms of prediction error. Both the MIO (with warm-starts) and the discrete first order methods behave similarly—much better than $\ell_1$ regularization schemes. As expected, we observed that subset selection with least squares loss leads to inferior models for these examples, due to a heavy-tailed distribution of the errors.

The results in this section are similar to the least squares case. The MIO approach provides an edge both in terms of sparsity and predictive accuracy compared to Lasso both for the overdetermined and the high-dimensional case.

Figure 11: Figure showing the number of nonzero values and predictive performance for different values of $n$ and $p$ for Problem (34) (as in Figure 10). [Left panel] has $n=50, p=1000$ and [Right panel] has $n=500, p=1000$.

7 Conclusions

In this paper, we have revisited the classical best subset selection problem of choosing $k$ out of $p$ features in linear regression given $n$ observations using a modern optimization lens, i.e., MIO and a discrete extension of first order methods from continuous optimization. Exploiting the astonishing progress of MIO solvers in the last twenty-five years, we have shown that this approach solves problems with $n$ in the 1000s and $p$ in the 100s in minutes to provable optimality, and finds near optimal solutions for $n$ in the 100s and $p$ in the 1000s in minutes. Importantly, the solutions provided by the MIO approach significantly outperform other state of the art methods like Lasso in achieving sparse models with good predictive power. Unlike all other methods, the MIO approach always provides a guarantee on its sub-optimality even if the algorithm is terminated early. Moreover, it can accommodate side constraints on the coefficients of the linear regression and also extends to finding best subset solutions for the least absolute deviation loss function.

While continuous optimization methods have played and continue to play an important role in statistics over the years, discrete optimization methods have not. The evidence in this paper as well as in [2] suggests that MIO methods are tractable and lead to desirable properties (improved accuracy and sparsity among others) at the expense of higher, but still reasonable, computational times.

Acknowledgements

We would like to thank the Associate editor and two reviewers for their comments that helped us improve the paper. A major part of the work was performed when R.M. was at Columbia University.

References

  • [1] Top500 Supercomputer Sites. Directory page for Top500 lists; result for each list since June 1993. http://www.top500.org/statistics/sublist/. Accessed: 2013-12-04.
  • [2] D. Bertsimas and R. Mazumder. Least quantile regression via modern optimization. The Annals of Statistics, 42(6):2494–2525, 2014.
  • [3] D. Bertsimas and R. Shioda. Algorithm for cardinality-constrained quadratic optimization. Computational Optimization and Applications, 43(1):1–22, 2009.
  • [4] D. Bertsimas and R. Weismantel. Optimization over Integers. Dynamic Ideas, Belmont, 2005.
  • [5] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
  • [6] D. Bienstock. Computational study of a family of mixed-integer quadratic programming problems. Mathematical Programming, 74(2):121–140, 1996.
  • [7] R. E. Bixby. A brief history of linear and mixed-integer programming computation. Documenta Mathematica, Extra Volume: Optimization Stories, pages 107–121, 2012.
  • [8] T. Blumensath and M. Davies. Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications, 14(5-6):629–654, 2008.
  • [9] T. Blumensath and M. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
  • [10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
  • [11] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, 2011.
  • [12] F. Bunea, A. B. Tsybakov, M. H. Wegkamp, et al. Aggregation for Gaussian regression. The Annals of Statistics, 35(4):1674–1697, 2007.
  • [13] E. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted $\ell_1$ minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.
  • [14] E. Candes. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008.
  • [15] E. Candès and Y. Plan. Near-ideal model selection by $\ell_1$ minimization. The Annals of Statistics, 37(5A):2145–2177, 2009.
  • [16] E. Candes and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
  • [17] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
  • [18] M. Dettling. BagBoosting for tumor classification with gene expression data. Bioinformatics, 20(18):3583–3593, 2004.
  • [19] D. Donoho. For most large underdetermined systems of equations, the minimal $\ell^1$-norm solution is the sparsest solution. Communications on Pure and Applied Mathematics, 59:797–829, 2006.
  • [20] D. Donoho and I. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425–455, 1993.
  • [21] D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via $\ell_1$ minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003.
  • [22] D. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862, 2001.
  • [23] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression (with discussion). Annals of Statistics, 32(2):407–499, 2004.
  • [24] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
  • [25] J. Fan and J. Lv. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory, 57(8):5467–5484, 2011.
  • [26] Y. Fan and J. Lv. Asymptotic equivalence of regularization methods in thresholded parameter space. Journal of the American Statistical Association, 108(503):1044–1061, 2013.
  • [27] I. Frank and J. Friedman. A statistical view of some chemometrics regression tools (with discussion). Technometrics, 35(2):109–148, 1993.
  • [28] J. Friedman. Fast sparse regression and classification. Technical report, Department of Statistics, Stanford University, 2008.
  • [29] J. Friedman, T. Hastie, H. Hoefling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 2(1):302–332, 2007.
  • [30] G. Furnival and R. Wilson. Regression by leaps and bounds. Technometrics, 16:499–511, 1974.
  • [31] E. Greenshtein. Best subset selection, persistence in high-dimensional statistical learning and optimization under $\ell_1$ constraint. The Annals of Statistics, 34(5):2367–2386, 2006.
  • [32] E. Greenshtein and Y. Ritov. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10(6):971–988, 2004.
  • [33] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2013. URL http://www.gurobi.com.
  • [34] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, Second Edition: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York, 2 edition, 2009.
  • [35] K. Knight and W. Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28(5):1356–1378, 2000.
  • [36] P.-L. Loh and M. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems, pages 476–484, 2013.
  • [37] J. Lv and Y. Fan. A unified approach to model selection and sparse recovery using regularized least squares. The Annals of Statistics, pages 3498–3528, 2009.
  • [38] R. Mazumder, J. Friedman, and T. Hastie. SparseNet: Coordinate descent with non-convex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.
  • [39] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34:1436–1462, 2006.
  • [40] A. Miller. Subset Selection in Regression. CRC Press, Washington, 2002.
  • [41] B. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
  • [42] G. Nemhauser. Integer programming: the global impact. Presented at EURO/INFORMS, Rome, Italy, 2013. http://euro2013.org/wp-content/uploads/Nemhauser_EuroXXVI.pdf. Accessed: 2013-12-04.
  • [43] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, Series A, 103:127–152, 2005.
  • [44] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.
  • [45] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Norwell, 2004.
  • [46] G. Raskutti, M. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
  • [47] R. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1996.
  • [48] X. Shen, W. Pan, Y. Zhu, and H. Zhou. On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65(5):807–832, 2013.
  • [49] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
  • [50] R. Tibshirani. Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):273–282, 2011.
  • [51] J. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.
  • [52] S. van de Geer, P. Bühlmann, and S. Zhou. The adaptive and the thresholded lasso for potentially misspecified models (and a lower bound for the lasso). Electronic Journal of Statistics, 5:688–749, 2011.
  • [53] M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
  • [54] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010a.
  • [55] C.-H. Zhang and J. Huang. The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4):1567–1594, 2008.
  • [56] C.-H. Zhang and T. Zhang. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576–593, 2012.
  • [57] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. The Journal of Machine Learning Research, 11:1081–1107, 2010b.
  • [58] Y. Zhang, M. Wainwright, and M. I. Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. arXiv preprint arXiv:1402.1918, 2014.
  • [59] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
  • [60] Z. Zheng, Y. Fan, and J. Lv. High dimensional thresholded regression and shrinkage effect. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(3):627–649, 2014.
  • [61] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.
  • [62] H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood problems. The Annals of Statistics, 36(4):1509–1533, 2008.

Appendix and Supplementary Material

Appendix A Additional Details for Section 2

A.1 Solving the convex quadratic optimization Problems in Section 2.3.2

We show here that the convex quadratic optimization problems appearing in Section 2.3.2 are indeed quite simple and can be solved with small computational cost.

We first consider Problem (18), namely the computation of $u_i^{-}$, which is a minimization problem. We assume without loss of generality that the feasible set of Problem (18) is non-empty. Thus, by standard results in quadratic optimization [10], there exists a $\tau$ such that:

\[
\nabla\left(\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\tau\beta_{i}\right)=0,
\]

where $\nabla$ denotes the derivative with respect to $\boldsymbol{\beta}$, and a $\boldsymbol{\beta}$ that satisfies the above gradient condition must also be feasible for Problem (18). Simplifying the above equation, we get:

\[
\mathbf{X}^{\prime}\mathbf{X}\boldsymbol{\beta}=\mathbf{X}^{\prime}\mathbf{y}-\tau e_{i},
\]

where $e_{i}$ is the vector in $\Re^{p}$ whose $i$th coordinate is one and whose remaining coordinates are zero. Simplifying the above expression, we have

\[
\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^{2}_{2}=\|(\mathbb{I}-P_{X})\mathbf{y}+\tau q_{i}\|_{2}^{2}.
\]

Above, $\mathbb{I}$ is the identity matrix of size $n\times n$, $P_{X}$ is the familiar projection matrix given by $\mathbf{X}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}$ (note that we assume here that $n>p$, which typically guarantees that $\mathbf{X}^{\prime}\mathbf{X}$ is invertible with probability one, provided the entries of $\mathbf{X}$ are drawn from a continuous distribution), and $q_{i}=\mathbf{X}(\mathbf{X}^{\prime}\mathbf{X})^{-1}e_{i}$. Observing that the constraint in Problem (18) is active at the optimum, one can readily solve the resulting simple quadratic equation for $\tau$; this value of $\tau$ in turn yields the optimal solution $\widetilde{\boldsymbol{\beta}}$, and hence the optimum, of Problem (18).

The above argument readily applies to Problem (18) for the computation of $u_i^{+}$, by writing it as an equivalent minimization problem and observing that:

\[
-u_{i}^{+}=\min_{\boldsymbol{\beta}}\;-\beta_{i}\quad\text{s.t.}\quad\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^{2}_{2}\leq\text{UB}.
\]

The above derivation can also be adapted to the case of Problem (19). Towards this end, notice that for estimating $v_i^{-}$ the above steps (for computing $u_i^{-}$) are modified as follows: $e_{i}$ gets replaced by $\mathbf{x}_{i}\in\Re^{p}$ (the $i$th row of $\mathbf{X}$); and $P_{X}$ denotes the projection matrix onto the column space of $\mathbf{X}$, even if the matrix $\mathbf{X}^{\prime}\mathbf{X}$ is not invertible (since here, we consider arbitrary $n,p$).

In addition, Problems (18) for the different variables and Problems (19) for the different samples can be solved completely independently, in parallel.
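To make the above concrete, the following small sketch (our own illustrative Python/numpy code, not the authors' implementation; the helper name coefficient_bounds is ours) computes $u_i^{-}$ and $u_i^{+}$ for all variables by solving the scalar quadratic equation in $\tau$ described above, under the convention $\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\leq\text{UB}$ of (18) and assuming $n>p$ with $\mathbf{X}^{\prime}\mathbf{X}$ invertible.

import numpy as np

def coefficient_bounds(X, y, UB):
    # Sketch of Section A.1: for each coordinate i solve
    #   u_i^- = min beta_i  s.t.  0.5*||y - X beta||_2^2 <= UB,
    #   u_i^+ = max beta_i  s.t.  0.5*||y - X beta||_2^2 <= UB.
    # Stationarity gives beta(tau) = (X'X)^{-1}(X'y - tau*e_i); since (I - P_X)y is
    # orthogonal to q_i = X(X'X)^{-1}e_i, the active constraint reduces to the quadratic
    #   ||(I - P_X)y||^2 + tau^2 * ||q_i||^2 = 2*UB.
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ls = XtX_inv @ (X.T @ y)               # least squares solution
    r2 = np.sum((y - X @ beta_ls) ** 2)         # ||(I - P_X)y||^2
    lower, upper = np.empty(p), np.empty(p)
    for i in range(p):
        q_norm2 = XtX_inv[i, i]                 # ||q_i||^2 = [(X'X)^{-1}]_{ii}
        tau = np.sqrt(max(2.0 * UB - r2, 0.0) / q_norm2)
        lower[i] = beta_ls[i] - tau * XtX_inv[i, i]   # u_i^-
        upper[i] = beta_ls[i] + tau * XtX_inv[i, i]   # u_i^+
    return lower, upper

The $p$ bound computations in the loop above are independent of one another, so they can be parallelized directly, as noted in the text.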

A.2 Details for Section 2.3.4

Note that in Problem (10) we consider a uniform bound on the $\beta_{i}$'s: $-{\mathcal{M}}_{U}\leq\beta_{i}\leq{\mathcal{M}}_{U}$ for all $i=1,\ldots,p$. Some of the variables $\beta_{i}$ may have larger amplitude than others, so it may be reasonable to let the bounds depend upon the variable index $i$. Thus motivated, for added flexibility, one can consider the following (adaptive) bounds on the $\beta_{i}$'s: $-{\mathcal{M}}^{i}_{U}\leq\beta_{i}\leq{\mathcal{M}}^{i}_{U}$ for $i=1,\ldots,p$. The parameters ${\mathcal{M}}^{i}_{U}$ can be taken as $\max\{|u^{+}_{i}|,|u^{-}_{i}|\}$, as defined in (18).

More generally, one can also consider asymmetric bounds on $\beta_{i}$: $u_{i}^{-}\leq\beta_{i}\leq u_{i}^{+}$ for all $i$.

Note that the above ideas for bounding the $\beta_{i}$'s can also be extended to obtain sample-specific bounds on $\langle\mathbf{x}_{i},\boldsymbol{\beta}\rangle$ for $i=1,\ldots,n$.

The bounds on $\|\widehat{\boldsymbol{\beta}}\|_{1}$ and on $\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{1},\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{\infty}$ can also be adapted to the above variable-dependent bounds on the $\beta_{i}$'s.

While the above modifications may lead to marginally improved performance, we do not dwell much on them, mainly for the sake of a clear exposition.
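For completeness, a small illustrative continuation of the hypothetical coefficient_bounds sketch from Section A.1 (names and code are ours, not the paper's) shows how the adaptive and asymmetric bounds above could be assembled:

import numpy as np
# lower, upper hold the vectors (u_i^-) and (u_i^+) from the A.1 sketch.
# M_U_i = max{|u_i^+|, |u_i^-|} gives the adaptive symmetric bounds, while
# (lower, upper) can be used directly as asymmetric box constraints
# u_i^- <= beta_i <= u_i^+ in the MIO formulation.
def adaptive_bounds(lower, upper):
    return np.maximum(np.abs(lower), np.abs(upper))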

A.3 Proof of Proposition 2

Proof

(a) Given a set $I$, we define $\mathbf{G}:=\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}-\mathbf{I}$, and let $g_{ij}$ denote the $(i,j)$th entry of $\mathbf{G}$. For any $\mathbf{u}\in\mathbb{R}^{k}$ we have
\begin{align*}
\max_{\|\mathbf{u}\|_{1}=1}\|\mathbf{G}\mathbf{u}\|_{1} &= \max_{\|\mathbf{u}\|_{1}=1}\left(\sum_{i=1}^{k}\Big|\sum_{j=1}^{k}g_{ij}u_{j}\Big|\right)\\
&\leq \max_{\|\mathbf{u}\|_{1}=1}\left(\sum_{i=1}^{k}\sum_{j=1}^{k}|u_{j}||g_{ij}|\right)\\
&= \max_{\|\mathbf{u}\|_{1}=1}\left(\sum_{j=1}^{k}|u_{j}|\sum_{i\neq j}|g_{ij}|\right) && (g_{jj}=0)\\
&\leq \max_{\|\mathbf{u}\|_{1}=1}\left(\mu[k-1]\|\mathbf{u}\|_{1}\right) && \left(\textstyle\sum_{i\neq j}|g_{ij}|\leq\mu[k-1]\right)\\
&= \mu[k-1].
\end{align*}
(b) Using $\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}=\mathbf{I}+\mathbf{G}$ and standard power-series convergence (which is valid since $\|\mathbf{G}\|_{1,1}<1$) we obtain
\[
\|(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\|_{1,1}=\|\left(\mathbf{I}+\mathbf{G}\right)^{-1}\|_{1,1}\leq\sum_{i=0}^{\infty}\|\mathbf{G}\|^{i}_{1,1}=\frac{1}{1-\|\mathbf{G}\|_{1,1}}\leq\frac{1}{1-\mu[k-1]}. \qquad\Box
\]
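For a quick numerical sanity check of this bound (an illustrative sketch of our own, not part of the paper; it assumes the columns of $\mathbf{X}$ have unit $\ell_2$-norm, so that g_norm below equals $\|\mathbf{G}\|_{1,1}$, which is at most the cumulative coherence $\mu[k-1]$ used above):

import numpy as np

def prop2_check(X, I):
    # Verifies ||(X_I' X_I)^{-1}||_{1,1} <= 1/(1 - ||G||_{1,1}) <= 1/(1 - mu[k-1]).
    XI = X[:, list(I)]
    G = XI.T @ XI - np.eye(len(I))
    lhs = np.abs(np.linalg.inv(XI.T @ XI)).sum(axis=0).max()   # induced (1,1)-norm
    g_norm = np.abs(G).sum(axis=0).max()                       # ||G||_{1,1}
    rhs = 1.0 / (1.0 - g_norm) if g_norm < 1 else np.inf
    return lhs, rhs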

A.4 Proof of Theorem 2.1

Proof

(a) Since $\widehat{\boldsymbol{\beta}}_{I}=(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}\mathbf{y}$ we have
\[
\|\widehat{\boldsymbol{\beta}}\|_{1}=\|\widehat{\boldsymbol{\beta}}_{I}\|_{1}\leq\|(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\|_{1,1}\|\mathbf{X}^{\prime}_{I}\mathbf{y}\|_{1}. \tag{38}
\]
Note that
\[
\|\mathbf{X}^{\prime}_{I}\mathbf{y}\|_{1}=\sum_{j\in I}|\langle\mathbf{X}_{j},\mathbf{y}\rangle|\leq\max_{I,|I|=k}\sum_{j\in I}|\langle\mathbf{X}_{j},\mathbf{y}\rangle|\leq\sum_{j=1}^{k}|\langle\mathbf{X}_{(j)},\mathbf{y}\rangle|. \tag{39}
\]
Applying Part (b) of Proposition 2 and (39) to (38), we obtain (14).
(b) We write $\widehat{\boldsymbol{\beta}}_{I}=\mathbf{A}\mathbf{y}$ for $\mathbf{A}=(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}$. If $\mathbf{a}_{i},i=1,\ldots,k$ denote the rows of $\mathbf{A}$ we have:
\[
\|\widehat{\boldsymbol{\beta}}_{I}\|_{\infty}=\max_{i=1,\ldots,k}|\langle\mathbf{a}_{i},\mathbf{y}\rangle|\leq\left(\max_{i=1,\ldots,k}\|\mathbf{a}_{i}\|_{2}\right)\|\mathbf{y}\|_{2}. \tag{40}
\]
For every $i=1,\ldots,k$ we have
\begin{align}
\|\mathbf{a}_{i}\|_{2} &\leq \max_{\|\mathbf{u}\|_{2}=1}\|\mathbf{A}\mathbf{u}\|_{2} \nonumber\\
&= \max_{\|\mathbf{u}\|_{2}=1}\|(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}\mathbf{u}\|_{2} \nonumber\\
&\leq \lambda_{\max}\left((\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}\right) \nonumber\\
&= \max\left\{\frac{1}{d_{1}},\ldots,\frac{1}{d_{k}}\right\}, \tag{41}
\end{align}
where $d_{1},\ldots,d_{k}$ are the (nonzero) singular values of the matrix $\mathbf{X}_{I}$. To see how one arrives at (41), let us denote the singular value decomposition of $\mathbf{X}_{I}=\mathbf{UDV}^{\prime}$ with $\mathbf{D}=\mathrm{diag}\left(d_{1},d_{2},\ldots,d_{k}\right)$. We then have
\[
(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}=(\mathbf{VD}^{-2}\mathbf{V}^{\prime})(\mathbf{UDV}^{\prime})^{\prime}=\mathbf{VD}^{-1}\mathbf{U}^{\prime}
\]
and the singular values of $(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}$ are thus $1/d_{i}$, $i=1,\ldots,k$.

The eigenvalues of $\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}$ are $d_{i}^{2}$ and from (12) we obtain that $d_{i}^{2}\geq\eta_{k}$. Using (41) we thus obtain
\[
\max_{i=1,\ldots,k}\|\mathbf{a}_{i}\|_{2}\leq\frac{1}{\sqrt{\eta_{k}}}. \tag{42}
\]
Substituting the bound (42) into (40) we obtain
\[
\|\widehat{\boldsymbol{\beta}}_{I}\|_{\infty}\leq\frac{1}{\sqrt{\eta_{k}}}\|\mathbf{y}\|_{2}. \tag{43}
\]
Using the notation $\tilde{\mathbf{A}}=(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}$, with rows $\tilde{\mathbf{a}}_{i}$, we have
\begin{align}
\|\widehat{\boldsymbol{\beta}}_{I}\|_{\infty} &= \max_{i=1,\ldots,k}|\langle\tilde{\mathbf{a}}_{i},\mathbf{X}^{\prime}_{I}\mathbf{y}\rangle| \nonumber\\
&\leq \left(\max_{i=1,\ldots,k}\|\tilde{\mathbf{a}}_{i}\|_{2}\right)\|\mathbf{X}^{\prime}_{I}\mathbf{y}\|_{2} \nonumber\\
&\leq \lambda_{\max}\left((\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\right)\|\mathbf{X}^{\prime}_{I}\mathbf{y}\|_{2} \nonumber\\
&= \left(\max_{i=1,\ldots,k}\frac{1}{d_{i}^{2}}\right)\sqrt{\sum_{j\in I}|\langle\mathbf{X}_{j},\mathbf{y}\rangle|^{2}} \nonumber\\
&\leq \frac{1}{\eta_{k}}\sqrt{\sum_{j=1}^{k}|\langle\mathbf{X}_{(j)},\mathbf{y}\rangle|^{2}}. \tag{44}
\end{align}
Combining (43) and (44) we obtain (15).

(c) We have
\[
\|\mathbf{X}_{I}\widehat{\boldsymbol{\beta}}_{I}\|_{1}\leq\sum_{i=1}^{n}|\langle\mathbf{x}_{i},\widehat{\boldsymbol{\beta}}_{I}\rangle|\leq\sum_{i=1}^{n}\|\mathbf{x}_{i}\|_{\infty}\|\widehat{\boldsymbol{\beta}}_{I}\|_{1}=\left(\sum_{i=1}^{n}\|\mathbf{x}_{i}\|_{\infty}\right)\|\widehat{\boldsymbol{\beta}}_{I}\|_{1}. \tag{45}
\]
Let $\mathbf{P}_{I}:=\mathbf{X}_{I}(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}$ denote the projection onto the columns of $\mathbf{X}_{I}$. We have $\|\mathbf{P}_{I}\mathbf{y}\|_{2}\leq\|\mathbf{y}\|_{2}$, leading to:
\[
\|\mathbf{X}_{I}\widehat{\boldsymbol{\beta}}_{I}\|_{1}=\|\mathbf{P}_{I}\mathbf{y}\|_{1}\leq\sqrt{k}\|\mathbf{P}_{I}\mathbf{y}\|_{2}\leq\sqrt{k}\|\mathbf{y}\|_{2}, \tag{46}
\]
where we used that for any $\mathbf{a}\in\mathbb{R}^{m}$, we have $\sqrt{m}\|\mathbf{a}\|_{2}\geq\|\mathbf{a}\|_{1}$. Combining (45) and (46) we obtain (16).
(d) For any vector $\boldsymbol{\beta}_{I}$ which has zero entries in the coordinates outside $I$, we have:
\[
\|\mathbf{X}\boldsymbol{\beta}_{I}\|_{\infty}\leq\max_{i=1,\ldots,n}|\langle\mathbf{x}_{i},\boldsymbol{\beta}_{I}\rangle|\leq\max_{i=1,\ldots,n}\|\mathbf{x}_{i}\|_{1:k}\|\boldsymbol{\beta}_{I}\|_{\infty},
\]
leading to (17). $\Box$
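To make the quantities in Theorem 2.1 concrete, the sketch below (our own code, not the authors') evaluates the data-driven bounds (14) and (15) from $(\mathbf{X},\mathbf{y},k)$, assuming the columns of $\mathbf{X}$ have unit $\ell_2$-norm and that $\mu[k-1]<1$; here $\eta_{k}$ is replaced by the computable coherence-based lower bound $1-\mu[k-1]$, which is one standard choice rather than the exact minimum eigenvalue in (12).

import numpy as np

def theorem21_bounds(X, y, k):
    # Data-driven bounds (14) and (15), using 1 - mu[k-1] as a lower bound on eta_k.
    n, p = X.shape
    if k > 1:
        col_sorted = -np.sort(-np.abs(X.T @ X - np.eye(p)), axis=0)   # |<X_i, X_j>|, i != j
        mu = col_sorted[:k - 1, :].sum(axis=0).max()                  # cumulative coherence mu[k-1]
    else:
        mu = 0.0
    top_k = (-np.sort(-np.abs(X.T @ y)))[:k]       # k largest |<X_(j), y>| in decreasing order
    eta_k = 1.0 - mu                                # coherence-based lower bound on eta_k
    bound_l1 = top_k.sum() / (1.0 - mu)             # bound (14) on ||beta_hat||_1
    bound_linf = min(np.linalg.norm(y) / np.sqrt(eta_k),
                     np.sqrt(np.sum(top_k ** 2)) / eta_k)   # bound (15) on ||beta_hat||_inf
    return bound_l1, bound_linf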

Appendix B Proofs and Technical Details for Section 3

B.1 Proof of Proposition 6

Proof

(a) Let $\boldsymbol{\beta}$ be a vector satisfying $\|\boldsymbol{\beta}\|_{0}\leq k$. Using the notation $\widehat{\boldsymbol{\eta}}\in\mathbf{H}_{k}\left(\boldsymbol{\beta}-\frac{1}{L}\nabla g(\boldsymbol{\beta})\right)$ we have the following chain of inequalities:
\begin{align*}
g(\boldsymbol{\beta}) &= Q_{L}(\boldsymbol{\beta},\boldsymbol{\beta})\\
&\geq \inf_{\|\boldsymbol{\eta}\|_{0}\leq k} Q_{L}(\boldsymbol{\eta},\boldsymbol{\beta})\\
&= \inf_{\|\boldsymbol{\eta}\|_{0}\leq k}\left(\frac{L}{2}\|\boldsymbol{\eta}-\boldsymbol{\beta}\|_{2}^{2}+\langle\nabla g(\boldsymbol{\beta}),\boldsymbol{\eta}-\boldsymbol{\beta}\rangle+g(\boldsymbol{\beta})\right)\\
&= \inf_{\|\boldsymbol{\eta}\|_{0}\leq k}\left(\frac{L}{2}\left\|\boldsymbol{\eta}-\left(\boldsymbol{\beta}-\frac{1}{L}\nabla g(\boldsymbol{\beta})\right)\right\|_{2}^{2}-\frac{1}{2L}\|\nabla g(\boldsymbol{\beta})\|_{2}^{2}+g(\boldsymbol{\beta})\right)\\
&= \frac{L}{2}\left\|\widehat{\boldsymbol{\eta}}-\left(\boldsymbol{\beta}-\frac{1}{L}\nabla g(\boldsymbol{\beta})\right)\right\|_{2}^{2}-\frac{1}{2L}\|\nabla g(\boldsymbol{\beta})\|_{2}^{2}+g(\boldsymbol{\beta}) && \text{(from (26))}\\
&= \frac{L}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}+\langle\nabla g(\boldsymbol{\beta}),\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\rangle+g(\boldsymbol{\beta})\\
&= \frac{L-\ell}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}+\frac{\ell}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}+\langle\nabla g(\boldsymbol{\beta}),\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\rangle+g(\boldsymbol{\beta})\\
&\geq \frac{L-\ell}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}+\underbrace{\left(\frac{\ell}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}+\langle\nabla g(\boldsymbol{\beta}),\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\rangle+g(\boldsymbol{\beta})\right)}_{Q_{\ell}(\widehat{\boldsymbol{\eta}},\boldsymbol{\beta})}\\
&\geq \frac{L-\ell}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}+g(\widehat{\boldsymbol{\eta}}). && \text{(from (25))}
\end{align*}
This chain of inequalities leads to:
\[
g(\boldsymbol{\beta})-g(\widehat{\boldsymbol{\eta}})\geq\frac{L-\ell}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}. \tag{47}
\]
Applying (47) with $\boldsymbol{\beta}=\boldsymbol{\beta}_{m}$ and $\widehat{\boldsymbol{\eta}}=\boldsymbol{\beta}_{m+1}$, the vectors generated by Algorithm 1, we obtain (31). This implies that the objective values $g(\boldsymbol{\beta}_{m})$ are decreasing, and since the sequence is bounded below ($g(\boldsymbol{\beta})\geq 0$), we obtain that $g(\boldsymbol{\beta}_{m})$ converges as $m\rightarrow\infty$.
(b) Since $L>\ell$, the result follows from part (a).
(c) The condition $\underline{\alpha}_{k}>0$ means that for all $m$ sufficiently large, the entry $|\beta_{(k),m}|$ will remain (uniformly) bounded away from zero. We will use this to prove that the support of $\boldsymbol{\beta}_{m}$ converges. For the purpose of establishing a contradiction, suppose that the support does not converge. Then there are infinitely many values of $m^{\prime}$ such that $\mathbf{1}_{m^{\prime}}\neq\mathbf{1}_{m^{\prime}+1}$. Using the fact that $\|\boldsymbol{\beta}_{m}\|_{0}=k$ for all large $m$ we have
\[
\|\boldsymbol{\beta}_{m^{\prime}}-\boldsymbol{\beta}_{m^{\prime}+1}\|_{2}\geq\sqrt{\beta_{m^{\prime},i}^{2}+\beta_{m^{\prime}+1,j}^{2}}\geq\frac{|\beta_{m^{\prime},i}|+|\beta_{m^{\prime}+1,j}|}{\sqrt{2}}, \tag{48}
\]
where $i,j$ are such that $\beta_{m^{\prime}+1,i}=\beta_{m^{\prime},j}=0$. As $m^{\prime}\rightarrow\infty$, the quantity on the right hand side of (48) remains bounded away from zero since $\underline{\alpha}_{k}>0$. This contradicts the fact that $\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\rightarrow\mathbf{0}$, as established in part (b). Thus $\mathbf{1}_{m}$ converges, and since $\mathbf{1}_{m}$ is a discrete sequence, it converges after finitely many iterations, that is, $\mathbf{1}_{m}=\mathbf{1}_{m+1}$ for all $m\geq M^{*}$. Algorithm 1 then becomes a vanilla gradient descent algorithm, restricted to the support $\mathbf{1}_{m}$ for $m\geq M^{*}$. Since a gradient descent algorithm for minimizing a convex function over a closed convex set leads to a sequence of iterates that converge [47, 45], we conclude that Algorithm 1 converges. Therefore, the sequence $\boldsymbol{\beta}_{m}$ converges to $\boldsymbol{\beta}^{*}$, a first order stationary point:
\[
\boldsymbol{\beta}^{*}\in\mathbf{H}_{k}\left(\boldsymbol{\beta}^{*}-\frac{1}{L}\nabla g(\boldsymbol{\beta}^{*})\right).
\]
(d) Let ${\mathcal{I}}_{m}\subset\{1,\ldots,p\}$ denote the set of indices corresponding to the $k$ largest entries, in absolute value, of the vector $\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)$. By the definition of $\mathbf{H}_{k}\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)$, we have
\[
\left|\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)_{i}\right|\geq\left|\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)_{j}\right|
\]
for all $i,j$ with $i\in{\mathcal{I}}_{m}$ and $j\notin{\mathcal{I}}_{m}$. Thus,
\[
\liminf_{m\rightarrow\infty}\;\min_{i\in{\mathcal{I}}_{m}}\;\left|\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)_{i}\right|\geq\liminf_{m\rightarrow\infty}\;\max_{j\notin{\mathcal{I}}_{m}}\;\left|\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)_{j}\right|. \tag{49}
\]
Moreover,
\[
\left(\boldsymbol{\beta}_{m}-\mathbf{H}_{k}\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)\right)_{i}=\begin{cases}\frac{1}{L}(\nabla g(\boldsymbol{\beta}_{m}))_{i},&i\in{\mathcal{I}}_{m},\\ \beta_{m,i},&\text{otherwise}.\end{cases}
\]
Using the fact that $\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\rightarrow\mathbf{0}$ we have
\[
(\nabla g(\boldsymbol{\beta}_{m}))_{i}\rightarrow 0,\;i\in{\mathcal{I}}_{m}\quad\text{and}\quad\beta_{m,j}\rightarrow 0,\;j\notin{\mathcal{I}}_{m}
\]
as $m\rightarrow\infty$. Combining with (49) we have that:
\[
\liminf_{m\rightarrow\infty}\;\min_{i\in{\mathcal{I}}_{m}}\;|\beta_{mi}|\geq\liminf_{m\rightarrow\infty}\;\max_{j\notin{\mathcal{I}}_{m}}\;\frac{1}{L}\left|\left(\nabla g(\boldsymbol{\beta}_{m})\right)_{j}\right|=\frac{1}{L}\liminf_{m\rightarrow\infty}\;\|\nabla g(\boldsymbol{\beta}_{m})\|_{\infty}.
\]
Since $\liminf_{m\rightarrow\infty}\;\min_{i\in{\mathcal{I}}_{m}}\;|\beta_{mi}|=\underline{\alpha}_{k}=0$ (by hypothesis), the left hand side of the above inequality equals zero, which leads to $\liminf_{m\rightarrow\infty}\;\|\nabla g(\boldsymbol{\beta}_{m})\|_{\infty}=0$.
(e) We build on the proof of Part (d). It follows from equation (49) (by suitably changing '$\liminf$' to '$\limsup$') that:
\[
\underbrace{\limsup_{m\rightarrow\infty}\;\min_{i\in{\mathcal{I}}_{m}}\;|\beta_{mi}|}_{\overline{\alpha}_{k}}\geq\limsup_{m\rightarrow\infty}\;\max_{j\notin{\mathcal{I}}_{m}}\;\frac{1}{L}\left|\left(\nabla g(\boldsymbol{\beta}_{m})\right)_{j}\right|=\frac{1}{L}\limsup_{m\rightarrow\infty}\;\|\nabla g(\boldsymbol{\beta}_{m})\|_{\infty}.
\]
The left hand side of the above inequality is $\overline{\alpha}_{k}$, which is zero (by hypothesis); thus $\|\nabla g(\boldsymbol{\beta}_{m})\|_{\infty}\rightarrow 0$ as $m\rightarrow\infty$.

Suppose $\boldsymbol{\beta}_{\infty}$ is a limit point of the sequence $\boldsymbol{\beta}_{m}$. Then there is a subsequence $\{m^{\prime}\}\subset\{1,2,\ldots\}$ such that $\boldsymbol{\beta}_{m^{\prime}}\rightarrow\boldsymbol{\beta}_{\infty}$ and $g(\boldsymbol{\beta}_{m^{\prime}})\rightarrow g(\boldsymbol{\beta}_{\infty})$. Using the continuity of the gradient, and hence of the map $\boldsymbol{\beta}\mapsto\|\nabla g(\boldsymbol{\beta})\|_{\infty}$, we have $\|\nabla g(\boldsymbol{\beta}_{m^{\prime}})\|_{\infty}\rightarrow\|\nabla g(\boldsymbol{\beta}_{\infty})\|_{\infty}=0$ as $m^{\prime}\rightarrow\infty$. Thus $\boldsymbol{\beta}_{\infty}$ is a solution to the unconstrained (without cardinality constraints) optimization problem $\min\;g(\boldsymbol{\beta})$. Since $g(\boldsymbol{\beta}_{m})$ is a decreasing sequence, $g(\boldsymbol{\beta}_{m})$ converges to the minimum of $g(\boldsymbol{\beta})$. $\Box$
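The iterates analyzed above are those of Algorithm 1, $\boldsymbol{\beta}_{m+1}\in\mathbf{H}_{k}\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)$. A minimal illustrative sketch (our own code, not the authors' implementation) of this discrete first order scheme for the least squares loss $g(\boldsymbol{\beta})=\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}$ is given below; the Lipschitz constant is taken as the largest eigenvalue of $\mathbf{X}^{\prime}\mathbf{X}$ and a simple change-based stopping rule is used.

import numpy as np

def discrete_first_order(X, y, k, max_iter=1000, tol=1e-6):
    # Discrete first order method for g(beta) = 0.5*||y - X beta||^2 with ||beta||_0 <= k.
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X).max()     # gradient of g is Lipschitz with parameter L
    beta = np.zeros(p)
    for _ in range(max_iter):
        grad = -X.T @ (y - X @ beta)          # gradient of g at the current iterate
        c = beta - grad / L                   # gradient step
        beta_new = np.zeros(p)
        keep = np.argsort(-np.abs(c))[:k]     # H_k: keep the k largest entries in absolute value
        beta_new[keep] = c[keep]
        if np.linalg.norm(beta_new - beta) <= tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

In the paper such iterates are used as warm starts for the MIO solver; the sketch above only illustrates the update rule and the convergence behaviour established in Proposition 6 and Theorem 3.1.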

B.2 Proof of Proposition 3

Proof:
We provide a proof of Proposition 3 for the sake of completeness.

It suffices to consider $|c_{i}|>0$ for all $i$. Let $\boldsymbol{\beta}$ be an optimal solution to Problem (22) and let $S:=\{i:\beta_{i}\neq 0\}$. The objective function is given by $\sum_{i\notin S}|c_{i}|^{2}+\sum_{i\in S}(\beta_{i}-c_{i})^{2}$. Note that by selecting $\beta_{i}=c_{i}$ for $i\in S$, we can make the objective function equal to $\sum_{i\notin S}|c_{i}|^{2}$. Thus, to minimize the objective function, $S$ must correspond to the indices of the largest $k$ values of $|c_{i}|,i\geq 1$. $\Box$
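In code, an element of $\mathbf{H}_{k}(\mathbf{c})$ as characterized above is obtained by keeping the $k$ largest entries of $\mathbf{c}$ in absolute value and setting the rest to zero (ties broken arbitrarily, consistent with the set-valued definition); a short illustrative sketch of our own:

import numpy as np

def hard_threshold(c, k):
    # Return an element of H_k(c): keep the k largest |c_i| and zero out the rest.
    out = np.zeros_like(c)
    keep = np.argsort(-np.abs(c))[:k]
    out[keep] = c[keep]
    return out

# e.g. hard_threshold(np.array([3.0, -1.0, 0.5, -2.0]), 2) returns array([ 3., 0., 0., -2.])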

B.3 Proof of Proposition 7

Proof
This follows from Proposition 6, Part (a), which implies that:
\[
g({\boldsymbol{\eta}})-g(\hat{\boldsymbol{\eta}})\geq\frac{L-\ell}{2}\left\|\hat{\boldsymbol{\eta}}-{\boldsymbol{\eta}}\right\|_{2}^{2}
\]
for any $\hat{\boldsymbol{\eta}}\in\mathbf{H}_{k}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right)$. Now, by the definition of $\mathbf{H}_{k}(\cdot)$, we have $g(\boldsymbol{\eta})=g(\hat{\boldsymbol{\eta}})$, which along with $L>\ell$ implies that the right hand side of the above inequality is zero: thus $\left\|\hat{\boldsymbol{\eta}}-{\boldsymbol{\eta}}\right\|_{2}=0$, i.e., $\boldsymbol{\eta}=\hat{\boldsymbol{\eta}}$. Since the choice of $\hat{\boldsymbol{\eta}}$ was arbitrary, it follows that $\boldsymbol{\eta}$ is the only element in the set $\mathbf{H}_{k}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right)$. $\Box$

B.4 Proof of Proposition 8

Proof
The proof follows by noting that $\widehat{\boldsymbol{\beta}}$ is $k$-sparse, along with Proposition 6, Part (a), which implies that:

\[
g(\widehat{\boldsymbol{\beta}})-g(\hat{\boldsymbol{\eta}})\geq\frac{L-\ell}{2}\left\|\widehat{\boldsymbol{\beta}}-\hat{\boldsymbol{\eta}}\right\|_{2}^{2},
\]

for any $\hat{\boldsymbol{\eta}}\in\mathbf{H}_{k}\left(\widehat{\boldsymbol{\beta}}-\frac{1}{L}\nabla g(\widehat{\boldsymbol{\beta}})\right)$. Now, by the definition of $\widehat{\boldsymbol{\beta}}$ we have $g(\widehat{\boldsymbol{\beta}})=g(\hat{\boldsymbol{\eta}})$, which along with $L>\ell$ implies that the rhs of the above inequality is zero: thus $\widehat{\boldsymbol{\beta}}$ is a first order stationary point. $\hfill\Box$

B.5 Proof of Theorem 3.1

Proof
Summing inequalities (31) for $1\leq m\leq M$, we obtain
1mM.1𝑚𝑀1\leq m\leq M. 的不等式 (31) 求和,我們得到

\[
\sum_{m=1}^{M}\left(g(\boldsymbol{\beta}_{m})-g(\boldsymbol{\beta}_{m+1})\right)\geq\frac{L-\ell}{2}\sum_{m=1}^{M}\|\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\|_{2}^{2}, \tag{50}
\]

leading to

\[
g(\boldsymbol{\beta}_{1})-g(\boldsymbol{\beta}_{M+1})\geq\frac{M(L-\ell)}{2}\min_{m=1,\ldots,M}\|\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\|_{2}^{2}.
\]

Since the decreasing sequence $g(\boldsymbol{\beta}_{m+1})$ converges to $g(\boldsymbol{\beta}^{*})$ by Proposition 6, we have:

\[
\frac{g(\boldsymbol{\beta}_{1})-g(\boldsymbol{\beta}^{*})}{M}\geq\frac{g(\boldsymbol{\beta}_{1})-g(\boldsymbol{\beta}_{M+1})}{M}\geq\frac{(L-\ell)}{2}\min_{m=1,\ldots,M}\|\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\|_{2}^{2}. \qquad\Box
\]
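
The sequence analyzed above is generated by the discrete first order iteration $\boldsymbol{\beta}_{m+1}\in\mathbf{H}_{k}\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)$. To make the result concrete, the numpy sketch below runs this iteration for the least squares loss $g(\boldsymbol{\beta})=\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}$ and records the non-increasing objective values; the choice $L=1.1\,\lambda_{\max}(\mathbf{X}'\mathbf{X})$, the stopping rule, and the function names are our own assumptions, not the authors' implementation.

```python
import numpy as np

def hard_threshold(c, k):
    # top-k (in absolute value) projection, as in the sketch for Proposition 3
    beta = np.zeros_like(c)
    idx = np.argsort(-np.abs(c))[:k]
    beta[idx] = c[idx]
    return beta

def discrete_first_order(X, y, k, max_iter=500, tol=1e-8):
    """Sketch of the iteration beta_{m+1} in H_k(beta_m - grad g(beta_m)/L)
    for g(beta) = 0.5*||y - X beta||_2^2.  The recorded objective values are
    non-increasing, as guaranteed by inequality (31)."""
    n, p = X.shape
    ell = np.linalg.eigvalsh(X.T @ X).max()   # Lipschitz constant of grad g
    L = 1.1 * ell                             # the analysis above assumes L > ell
    beta = np.zeros(p)
    obj = [0.5 * np.sum((y - X @ beta) ** 2)]
    for _ in range(max_iter):
        grad = -X.T @ (y - X @ beta)
        beta_next = hard_threshold(beta - grad / L, k)
        obj.append(0.5 * np.sum((y - X @ beta_next) ** 2))
        if np.linalg.norm(beta_next - beta) <= tol:
            beta = beta_next
            break
        beta = beta_next
    return beta, obj
```

From the recorded values one can check the rate directly: after $M$ iterations, $\min_{m\leq M}\|\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\|_{2}^{2}\leq\frac{2\left(g(\boldsymbol{\beta}_{1})-g(\boldsymbol{\beta}_{M+1})\right)}{M(L-\ell)}$, which is a rearrangement of the second inequality displayed above.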

B.6 Proof of Proposition 5

Proof
If $\boldsymbol{\eta}$ is a first order stationary point with $\|\boldsymbol{\eta}\|_{0}\leq k$, it follows from the argument following Definition 1 that there is a set $I\subset\{1,\ldots,p\}$ with $|I^{c}|=k$ such that $\nabla_{i}g(\boldsymbol{\eta})=0$ for all $i\notin I$ and $\eta_{i}=0$ for all $i\in I$. Let $\mu_{i}:=\eta_{i}-\frac{1}{L}\nabla_{i}g(\boldsymbol{\eta})$ for $i=1,\ldots,p$. Suppose $I_{k}$ denotes the set of indices corresponding to the top $k$ ordered values of $|\mu_{i}|$. Note that:

\[
\mu_{i}=\eta_{i},\;\; i\in I_{k}\quad\text{and}\quad|\mu_{j}|=\left|\tfrac{1}{L}\nabla_{j}g(\boldsymbol{\eta})\right|,\;\; j\notin I_{k}. \tag{51}
\]

For $i\in I_{k}$ and $j\notin I_{k}$ we have $|\mu_{i}|\geq|\mu_{j}|$. This implies that $|\eta_{i}|\geq\left|\tfrac{1}{L}\nabla_{j}g(\boldsymbol{\eta})\right|$. Since $\boldsymbol{\eta}\in\mathbf{H}_{k}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right)$ and $\|\boldsymbol{\eta}\|_{0}<k$, it follows that $0=\min_{i\in I_{k}}|\eta_{i}|=\min_{i\in I_{k}}|\mu_{i}|$. We thus have that $\nabla_{j}g(\boldsymbol{\eta})=0$ for all $j\notin I_{k}$. In addition, note that $\nabla_{i}g(\boldsymbol{\eta})=0$ for all $i\in I_{k}$. Thus it follows that $\nabla g(\boldsymbol{\eta})=\mathbf{0}$ and hence $\boldsymbol{\eta}\in\operatorname*{arg\,min}_{\boldsymbol{\eta}}\,g(\boldsymbol{\eta})$. $\hfill\Box$
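
The fixed point condition used in this proof is straightforward to test numerically. The sketch below checks whether a given vector is (up to a tolerance) first order stationary in the sense of Definition 1 for the least squares loss; by Proposition 5, if the check passes and the vector has fewer than $k$ nonzeros, its gradient must vanish. The function name, the tolerance, and the decision to ignore exact ties in the top-$k$ selection are our own simplifications.

```python
import numpy as np

def is_first_order_stationary(eta, X, y, k, L, tol=1e-8):
    """Check (numerically) whether eta lies in H_k(eta - grad g(eta)/L)
    for g(beta) = 0.5*||y - X beta||_2^2, i.e. Definition 1.
    Ties among the |mu_i| values are ignored for simplicity."""
    grad = -X.T @ (y - X @ eta)
    mu = eta - grad / L
    candidate = np.zeros_like(mu)
    idx = np.argsort(-np.abs(mu))[:k]   # one admissible top-k set I_k
    candidate[idx] = mu[idx]
    return np.linalg.norm(candidate - eta) <= tol
```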

Appendix C Brief Review of Statistical Properties for the subset selection problem

In this section, for the sake of completeness we briefly review some of the properties of solutions to Problem (1).

Suppose the linear model assumption is true, i.e., $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}^{0}+\boldsymbol{\epsilon}$, with $\epsilon_{i}\stackrel{\text{iid}}{\sim}\mathrm{N}(0,\sigma^{2})$. Let $\widehat{\boldsymbol{\beta}}$ denote a solution to (1). [46] showed that, with probability greater than $1-\exp(-c_{1}k\log(p/k))$, the worst case (over $\boldsymbol{\beta}^{0}$) predictive performance has the following upper bound:

\[
\max_{\boldsymbol{\beta}^{0}:\|\boldsymbol{\beta}^{0}\|_{0}\leq k}\;\frac{1}{n}\left\|\mathbf{X}\boldsymbol{\beta}^{0}-\mathbf{X}\widehat{\boldsymbol{\beta}}\right\|_{2}^{2}\leq c_{2}\,\sigma^{2}\,\frac{k\log(p/k)}{n}, \tag{52}
\]

where $c_{1},c_{2}$ are universal constants. Similar results also appear in [12, 58]. Interestingly, the upper bound (52) does not depend upon $\mathbf{X}$. Unless $p/k=O(1)$, the upper bound appearing in (52) is of the order $O\left(\sigma^{2}\frac{k\log(p)}{n}\right)$, where the constants are universal. In terms of the expected (worst case) predictive risk, an upper bound is given by [58]:

\[
\max_{\boldsymbol{\beta}^{0}:\|\boldsymbol{\beta}^{0}\|_{0}\leq k}\;\frac{1}{n}\mathbb{E}\left(\left\|\mathbf{X}\boldsymbol{\beta}^{0}-\mathbf{X}\widehat{\boldsymbol{\beta}}\right\|_{2}^{2}\right)\lesssim\sigma^{2}\,\frac{k\log(p)}{n}, \tag{53}
\]

where the symbol “$\lesssim$” means “$\leq$” up to some universal constants.

A natural question is how the bounds for Lasso-based solutions compare with (53). In a recent paper [58], the authors derive upper and lower bounds on the prediction performance of a thresholded version of the Lasso solution, which we present briefly. Suppose

\[
\hat{\boldsymbol{\beta}}_{\ell_{1}}\in\operatorname*{arg\,min}_{\boldsymbol{\beta}}\;\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\lambda_{n}\|\boldsymbol{\beta}\|_{1}
\]

denotes a Lasso solution for $\lambda_{n}=4\sigma\sqrt{\frac{\log p}{n}}$. Let $\hat{\boldsymbol{\beta}}_{\text{TL}}$ denote the thresholded version of the Lasso solution, which retains the top $k$ entries of $\hat{\boldsymbol{\beta}}_{\ell_{1}}$ in absolute value and sets the remaining entries to zero. The bounds on the predictive performance of Lasso-based solutions depend upon a restricted eigenvalue type condition. Following [58], we define, for any subset $S\subset\{1,2,\ldots,p\}$, the set $C(S):=\{\boldsymbol{\beta}\,:\,\|\boldsymbol{\beta}_{S^{c}}\|_{1}\leq 2\|\boldsymbol{\beta}_{S}\|_{1}\}$, where $\|\boldsymbol{\beta}_{S}\|_{1}=\sum_{j\in S}|\beta_{j}|$ and $\|\boldsymbol{\beta}_{S^{c}}\|_{1}=\sum_{j\in S^{c}}|\beta_{j}|$. We say that the matrix $\mathbf{X}$ satisfies a restricted eigenvalue type condition with parameter $\gamma(\mathbf{X})$ if it satisfies the following:

\[
\frac{1}{n}\|\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\geq\gamma(\mathbf{X})\,\|\boldsymbol{\beta}\|_{2}^{2}\quad\text{for all}\quad\boldsymbol{\beta}\in\bigcup_{S:|S|=k}C(S).
\]

Note that $\gamma(\mathbf{X})\leq 1$ and that $\gamma(\mathbf{X})$ is also related to the so-called compatibility condition [11]. In an insightful paper, [58] show that under such restricted eigenvalue type conditions the following holds:

\[
\frac{\sigma^{2}}{\gamma(\mathbf{X}_{\text{bad}})^{2}}\,\frac{k^{1-\delta}\log(p)}{n}\;\lesssim\;\max_{\boldsymbol{\beta}^{0}:\|\boldsymbol{\beta}^{0}\|_{0}\leq k}\frac{1}{n}\mathbb{E}\left(\left\|\mathbf{X}\boldsymbol{\beta}^{0}-\mathbf{X}\widehat{\boldsymbol{\beta}}_{\text{TL}}\right\|_{2}^{2}\right)\;\lesssim\;\frac{\sigma^{2}}{\gamma(\mathbf{X})^{2}}\,\frac{k\log(p)}{n} \tag{54}
\]

In particular, the lower bounds apply to bad design matrices $\mathbf{X}_{\text{bad}}$ for some arbitrarily small scalar $\delta>0$. In fact, [58] establish a result stronger than (54), in which $\hat{\boldsymbol{\beta}}_{\text{TL}}$ can be replaced by a $k$-sparse estimate delivered by a polynomial time method. The bounds displayed in (54) show that there is a significant gap between the predictive performance of subset selection procedures (see bound (53)) and Lasso-based $k$-sparse solutions; the magnitude of the gap depends upon how small $\gamma(\mathbf{X})$ is. $\gamma(\mathbf{X})$ can be small if the pairwise correlations between the features of the model matrix are quite high. These results complement our experimental findings in Section 5.
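
For readers who wish to experiment with this comparison, a minimal sketch of the thresholded Lasso estimator $\hat{\boldsymbol{\beta}}_{\text{TL}}$ described above is given below. It relies on scikit-learn's Lasso solver (our tooling choice, not prescribed by the paper); note that scikit-learn's objective $\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\alpha\|\boldsymbol{\beta}\|_{1}$ matches the penalized criterion above with $\alpha=\lambda_{n}$.

```python
import numpy as np
from sklearn.linear_model import Lasso

def thresholded_lasso(X, y, k, sigma):
    """Fit the Lasso at lambda_n = 4*sigma*sqrt(log(p)/n), then keep the
    top-k coefficients in absolute value and set the rest to zero."""
    n, p = X.shape
    lam = 4.0 * sigma * np.sqrt(np.log(p) / n)
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y)
    beta_l1 = fit.coef_.copy()
    keep = np.argsort(-np.abs(beta_l1))[:k]
    beta_tl = np.zeros(p)
    beta_tl[keep] = beta_l1[keep]
    return beta_tl
```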

An in-depth analysis of the properties of solutions to the Lagrangian version of Problem (1), namely Problem (4), is presented in [56]. [46, 56] also analyze the errors in the regression coefficients, $\|\boldsymbol{\beta}^{0}-\widehat{\boldsymbol{\beta}}\|_{2}$, under further minor assumptions on the model matrix $\mathbf{X}$. [56, 48] provide interesting theoretical analyses of the variable selection properties of (1) and (4), showing that subset selection procedures have superior variable selection properties over Lasso-based methods.

In passing, we remark that [56] develop statistical properties of inexact solutions to Problem (4). This may serve as interesting theoretical support for near-global solutions to Problem (1), where the certificates of sub-optimality are delivered by our MIO framework in terms of global lower bounds. A precise and thorough understanding of the statistical properties of sub-optimal solutions to Problem (1) is left as an interesting direction for future work.

Appendix D Additional Details on Experiments and Computations

D.1 Some additional figures related to the radii of bounding boxes

Some figures illustrating the effect of the bounding box radii are presented in Figure 12.

Figure 12: The evolution of the MIO gap with varying radii of bounding boxes for MIO formulation (37) ($n=50$, $p=1000$). The two panel settings are $\mathcal{L}^{\zeta}_{\ell,\text{loc}}=\infty$ with $\mathcal{L}^{\beta}_{\ell,\text{loc}}=2\|\boldsymbol{\beta}_{0}\|_{1}/k$ (top) and $\mathcal{L}^{\zeta}_{\ell,\text{loc}}=\infty$ with $\mathcal{L}^{\beta}_{\ell,\text{loc}}=\|\boldsymbol{\beta}_{0}\|_{1}/k$ (bottom), so the top panel has radii twice those of the bottom panel. The dataset considered is generated as per Example 1 with $n=50$, $p=1000$, $\rho=0.9$ and $k_{0}=5$ for different values of SNR: [Left Panel] SNR = 1, [Right Panel] SNR = 3. For each case, different values of $k$ have been considered. As expected, the times for the MIO gaps to close depend upon the radii of the boxes. The optimal solutions obtained were found to be insensitive to the choice of the bounding box radius.

D.2 Lasso, Debiased Lasso and MIO

We present here comparisons of the debiased Lasso with MIO and Lasso.

Debiasing is often used to mitigate the shrinkage imparted by the Lasso regularization parameter. This is done by performing an unrestricted least squares fit on the support selected by the Lasso. Of course, the results depend upon the tuning parameter used for the problem. We use two methods towards this end. In the first method we find the best Lasso solution (by obtaining an optimal tuning parameter based on minimizing predictive error on a held out validation set); we then obtain the unregularized least squares solution restricted to the support of that Lasso solution. This typically performed worse than the Lasso in all the experiments we tried; see Tables 3 and 4. The unrestricted least squares solution on the optimal model selected by the Lasso (as shown in Figure 4) had worse predictive performance than the Lasso with the same sparsity pattern, as shown in Table 3. This is probably due to overfitting, since the model selected by the Lasso is quite dense relative to $n,p$. Table 4 presents the results for $50=n\ll p=1000$. We consider the same example presented in Figure 9, Example 1. First of all, Table 4 presents the prediction performance of the Lasso after debiasing, using the same tuning parameter considered optimal for the Lasso problem. We see that, as in the case of Table 3, debiasing does not lead to improved performance in terms of prediction error.

We thus experimented with another variant of the debiased Lasso, where for every $\lambda$ we computed the Lasso solution (2) and obtained $\hat{\boldsymbol{\beta}}_{\text{Deb},\lambda}$ by performing an unrestricted least squares fit on the support selected by the Lasso solution at $\lambda$. This method can be thought of as delivering feasible solutions for Problem (1), for a value of $k:=k(\lambda)$ determined by the Lasso solution at $\lambda$. The success of this method makes a case in support of using criterion (1). The tuning parameter was then selected by minimizing prediction error on a held out validation set. This method in general performed better than the Lasso, delivering a sparser model with better predictive accuracy. The performance of the debiased Lasso was similar to Sparsenet and was in general inferior to MIO, often by orders of magnitude, especially for problems where the pairwise correlations between the variables were large, the SNR was low and $n\ll p$. The results are presented in Tables 5 and 6 (for the case $n>p$) and Tables 7 and 8 (for the case $n\ll p$).
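
A minimal sketch of this second debiasing variant (refit unrestricted least squares on the support selected by the Lasso at each regularization level, then pick the refit with the best validation error) is shown below; the grid of regularization values, the scikit-learn Lasso solver, and the function name are our own assumptions rather than the authors' exact protocol.

```python
import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso(X_tr, y_tr, X_val, y_val, alphas):
    """For each alpha: fit the Lasso, refit unrestricted least squares on
    its support, and keep the refit with the smallest validation error."""
    p = X_tr.shape[1]
    best_err, best_beta = np.inf, None
    for alpha in alphas:
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(X_tr, y_tr)
        support = np.flatnonzero(lasso.coef_)
        if support.size == 0:
            continue  # empty model; skip
        coef_S, *_ = np.linalg.lstsq(X_tr[:, support], y_tr, rcond=None)
        beta = np.zeros(p)
        beta[support] = coef_S
        err = np.mean((y_val - X_val @ beta) ** 2)
        if err < best_err:
            best_err, best_beta = err, beta
    return best_beta
```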

Debiasing at optimal Lasso model, $n>p$

SNR   $\rho$   Ratio: Lasso / Debiased Lasso
6.33 0.5 0.33
3.17 0.5 0.54
1.58 0.5 0.53
6.97 0.8 0.67
3.48 0.8 0.64
1.74 0.8 0.63
8.73 0.9 1
4.37 0.9 0.58
2.18 0.9 0.61
Table 3: Lasso and Debiased Lasso corresponding to the numerical experiments of Figure 4, for Example 1 with $n=500$, $p=100$, $\rho\in\{0.5,0.8,0.9\}$ and $k_{0}=10$. Here, “Ratio” equals the ratio of the prediction error of the Lasso and the debiased Lasso at the optimal tuning parameter selected by the Lasso.

Debiasing at optimal Lasso model, $n\ll p$

SNR   $\rho$   Ratio: Lasso / Debiased Lasso
10 0.8 0.90
7 0.8 1.0
3 0.8 0.91
Table 4: Lasso and Debiased Lasso corresponding to the numerical experiments of Figure 9, for Example 1 with $n=50$, $p=1000$, $\rho=0.8$ and $k_{0}=5$. Here, “Ratio” equals the ratio of the prediction error of the Lasso and the debiased Lasso at the optimal tuning parameter selected by the Lasso.

The performance of this debiased variant was comparable with Sparsenet: it was better than the Lasso in terms of obtaining a sparser model with better predictive accuracy. However, the performance of MIO was significantly better than the debiased version of the Lasso, especially for larger values of $\rho$ and smaller SNR values.

Sparsity of Selected Models, $n>p$

SNR   $\rho$   Lasso   Debiased Lasso   MIO
6.33 0.5 27.6 (2.122) 10.9 (0.65) 10.8 (0.51)
3.17 0.5 27.7 (2.045) 10.9 (0.65) 10.1 (0.1)
1.58 0.5 28.0 (2.276) 10.9 (0.65) 10.2 (0.2)
6.97 0.8 34.1 (3.60) 10.4 (0.15) 10 (0.0)
3.48 0.8 34.0 (3.54) 10.9 (0.55) 10.2 (0.2)
1.74 0.8 33.7 (3.49) 13.7 (1.50) 10 (0.0)
8.73 0.9 25.9 (0.94) 13.9 (0.68) 10.5 (0.17)
4.37 0.9 34.6 (3.23) 18.1 (1.30) 10.2 (0.25)
2.18 0.9 34.7 (3.28) 20.5 (1.85) 10.1 (0.10)
Table 5: Number of non-zeros in the model selected by Lasso, Debiased Lasso, and MIO corresponding to the numerical experiments of Figure 4, for Example 1 with $n=500$, $p=100$, $\rho\in\{0.5,0.8,0.9\}$ and $k_{0}=10$. The tuning parameters for all three models were selected separately based on the best predictive model on a held out validation set. Numbers within brackets denote standard errors. Debiased Lasso leads to less dense models than Lasso. When $\rho$ is small and the SNR is large, the model size selected by the debiased Lasso is similar to that of MIO. However, for larger values of $\rho$ and smaller values of SNR, subset selection leads to considerably sparser solutions than the debiased Lasso.

Predictive Performance of Selected Models, $n>p$

SNR   $\rho$   Lasso   Debiased Lasso   MIO   Ratio: Debiased Lasso / MIO
6.33 0.5 0.0384 (0.001) 0.0255 (0.002) 0.0266 (0.001) 1.0
3.17 0.5 0.0768 (0.003) 0.0511 (0.004) 0.0478 (0.002) 1.0
1.58 0.5 0.1540 (0.007) 0.1021 (0.009) 0.0901 (0.009) 1.1
6.97 0.8 0.0389 (0.002) 0.0223 (0.001) 0.0231 (0.002) 1.0
3.48 0.8 0.0778 (0.004) 0.0464 (0.003) 0.0484 (0.004) 1.0
1.74 0.8 0.1557 (0.007) 0.1156 (0.008) 0.0795 (0.008) 1.5
8.73 0.9 0.0325 (0.001) 0.0220 (0.002) 0.0197 (0.002) 1.2
4.37 0.9 0.0632 (0.002) 0.0532 (0.003) 0.0427 (0.008) 1.3
2.18 0.9 0.1265 (0.005) 0.1254 (0.006) 0.0703 (0.011) 1.8
Table 6: Predictive performance of Lasso, Debiased Lasso, and MIO corresponding to the numerical experiments of Figure 4, for Example 1 with $n=500$, $p=100$, $\rho\in\{0.5,0.8,0.9\}$ and $k_{0}=10$. Numbers within brackets denote standard errors. The tuning parameters for all three models were selected separately based on the best predictive model on a held out validation set. When $\rho$ is small and the SNR is large, the debiased Lasso performs similarly to MIO. However, for larger values of $\rho$ and smaller values of SNR, subset selection performs better than the debiased Lasso based solutions.

We then follow the method described above (for the $n>p$ case), where we consider a sequence of models $\hat{\boldsymbol{\beta}}_{\text{Deb},\lambda}$ and find the $\lambda$ that delivers the best predictive model on a held out validation set.

Sparsity of Selected Models, $n\ll p$

SNR   $\rho$   Lasso   Debiased Lasso   MIO
10 0.8 25.7 (1.73) 7.9 (0.43) 5 (0.12)
7 0.8 27.8 (2.69) 8.1 (0.43) 5 (0.16)
3 0.8 28.0 (2.72) 10.0 (0.88) 6 (1.18)
Table 7: Number of non-zeros in the model selected by Lasso, Debiased Lasso, and MIO corresponding to the numerical experiments of Figure 9, for Example 1 with $n=50$, $p=1000$, $\rho=0.8$ and $k_{0}=5$. Numbers within brackets denote standard errors. The tuning parameters for all three models were selected separately based on the best predictive model on a held out validation set. Debiased Lasso leads to less dense models than Lasso but more dense models than MIO. The performance gap between MIO and the debiased Lasso becomes larger for lower values of SNR.

Predictive Performance of Selected Models, $n\ll p$

SNR   $\rho$   Lasso   Debiased Lasso   MIO   Ratio: Debiased Lasso / MIO
10 0.8 0.084 (0.004) 0.046 (0.003) 0.014 (0.005) 3.3
7 0.8 0.122 (0.005) 0.070 (0.004) 0.020 (0.007) 3.5
3 0.8 0.257 (0.012) 0.185 (0.016) 0.151 (0.027) 1.2
Table 8: Predictive performance of Lasso, Debiased Lasso, and MIO corresponding to the numerical experiments of Figure 9, for Example 1 with $n=50$, $p=1000$, $\rho=0.8$ and $k_{0}=5$. Numbers within brackets denote standard errors. The tuning parameters for all three models were selected separately based on the best predictive model on a held out validation set. MIO consistently leads to better predictive models than the Debiased Lasso and the ordinary Lasso. The Debiased Lasso performs better than the ordinary Lasso.