
Best Subset Selection via a Modern Optimization Lens

Dimitris Bertsimas
MIT Sloan School of Management and Operations Research Center, Massachusetts Institute of Technology: dbertsim@mit.edu

Angela King
Operations Research Center, Massachusetts Institute of Technology: aking10@mit.edu

Rahul Mazumder
MIT Sloan School of Management and Operations Research Center, Massachusetts Institute of Technology: rahulmaz@mit.edu
(This is a Revised Version dated May, 2015. First Version Submitted for Publication on June, 2014.)
Abstract

In the last twenty-five years (1990-2014), algorithmic advances in integer optimization combined with hardware improvements have resulted in an astonishing 200 billion factor speedup in solving Mixed Integer Optimization (MIO) problems. We present a MIO approach for solving the classical best subset selection problem of choosing $k$ out of $p$ features in linear regression given $n$ observations. We develop a discrete extension of modern first order continuous optimization methods to find high quality feasible solutions that we use as warm starts to a MIO solver that finds provably optimal solutions. The resulting algorithm (a) provides a solution with a guarantee on its suboptimality even if we terminate the algorithm early, (b) can accommodate side constraints on the coefficients of the linear regression and (c) extends to finding best subset solutions for the least absolute deviation loss function. Using a wide variety of synthetic and real datasets, we demonstrate that our approach solves problems with $n$ in the 1000s and $p$ in the 100s in minutes to provable optimality, and finds near optimal solutions for $n$ in the 100s and $p$ in the 1000s in minutes. We also establish via numerical experiments that the MIO approach performs better than Lasso and other popularly used sparse learning procedures, in terms of achieving sparse solutions with good predictive power.

1 Introduction

We consider the linear regression model with response vector $\mathbf{y}\in\mathbb{R}^{n\times 1}$, model matrix $\mathbf{X}=[\mathbf{x}_{1},\ldots,\mathbf{x}_{p}]\in\mathbb{R}^{n\times p}$, regression coefficients $\boldsymbol{\beta}\in\mathbb{R}^{p\times 1}$ and errors $\boldsymbol{\epsilon}\in\mathbb{R}^{n\times 1}$:

$$\mathbf{y}=\mathbf{X}\boldsymbol{\beta}+\boldsymbol{\epsilon}.$$

We will assume that the columns of $\mathbf{X}$ have been standardized to have zero means and unit $\ell_{2}$-norm. In many important classical and modern statistical applications, it is desirable to obtain a parsimonious fit to the data by finding the best $k$-feature fit to the response $\mathbf{y}$. Especially in the high-dimensional regime with $p\gg n$, in order to conduct statistically meaningful inference, it is desirable to assume that the true regression coefficient $\boldsymbol{\beta}$ is sparse or may be well approximated by a sparse vector. Quite naturally, the last few decades have seen a flurry of activity in estimating sparse linear models with good explanatory power. Central to this statistical task lies the best subset problem [40] with subset size $k$, which is given by the following optimization problem:

$$\min_{\boldsymbol{\beta}}\;\;\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\;\;\;\mathrm{subject\;to}\;\;\;\|\boldsymbol{\beta}\|_{0}\leq k, \qquad (1)$$

where the $\ell_{0}$ (pseudo)norm of a vector $\boldsymbol{\beta}$ counts the number of nonzeros in $\boldsymbol{\beta}$ and is given by $\|\boldsymbol{\beta}\|_{0}=\sum_{i=1}^{p}1(\beta_{i}\neq 0)$, where $1(\cdot)$ denotes the indicator function. The cardinality constraint makes Problem (1) NP-hard [41]. Indeed, state-of-the-art algorithms to solve Problem (1), as implemented in popular statistical packages, like leaps in R, do not scale to problem sizes larger than $p=30$. For this reason, it is not surprising that the best subset problem has been widely dismissed as intractable by the greater statistical community.

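To make the combinatorial difficulty of Problem (1) concrete, the illustrative Python sketch below (our code, not part of the paper) solves it exactly by enumerating all $\binom{p}{k}$ candidate supports and fitting least squares on each; since the number of subsets grows exponentially in $p$, such exhaustive strategies stall around $p=30$.

import itertools
import numpy as np

def best_subset_by_enumeration(X, y, k):
    """Solve Problem (1) exactly by trying every size-k support (feasible only for tiny p)."""
    n, p = X.shape
    best_obj, best_beta = np.inf, None
    for support in itertools.combinations(range(p), k):
        cols = list(support)
        # least squares restricted to the chosen columns
        beta_I, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
        obj = 0.5 * np.linalg.norm(y - X[:, cols] @ beta_I) ** 2
        if obj < best_obj:
            best_obj = obj
            best_beta = np.zeros(p)
            best_beta[cols] = beta_I
    return best_obj, best_beta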
In this paper we address Problem (1) using modern optimization methods, specifically mixed integer optimization (MIO) and a discrete extension of first order continuous optimization methods. Using a wide variety of synthetic and real datasets, we demonstrate that our approach solves problems with $n$ in the 1000s and $p$ in the 100s in minutes to provable optimality, and finds near optimal solutions for $n$ in the 100s and $p$ in the 1000s in minutes. To the best of our knowledge, this is the first time that MIO has been demonstrated to be a tractable solution method for Problem (1). We note that we use the term tractability not to mean the usual polynomial solvability for problems, but rather the ability to solve problems of realistic size in times that are appropriate for the applications we consider.

As there is a vast literature on the best subset problem, we next give a brief and selective overview of related approaches for the problem.

Brief Context and Background

To overcome the computational difficulties of the best subset problem, computationally tractable convex optimization based methods like Lasso [49, 17] have been proposed as a convex surrogate for Problem (1). For the linear regression problem, the Lagrangian form of Lasso solves

$$\min_{\boldsymbol{\beta}}\;\;\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\lambda\|\boldsymbol{\beta}\|_{1}, \qquad (2)$$

where the $\ell_{1}$ penalty on $\boldsymbol{\beta}$, i.e., $\|\boldsymbol{\beta}\|_{1}=\sum_{i}|\beta_{i}|$, shrinks the coefficients towards zero and naturally produces a sparse solution by setting many coefficients to be exactly zero. There has been a substantial amount of impressive work on Lasso [23, 15, 5, 55, 32, 59, 19, 35, 39, 53, 50] in terms of algorithms and understanding of its theoretical properties—see for example the excellent books or surveys [11, 34, 50] and the references therein.

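As a point of reference, Problem (2) can be solved with standard software. The snippet below is a minimal illustration using scikit-learn's Lasso (an assumption about tooling on our part, not code from the paper); note that scikit-learn scales the quadratic loss by $1/(2n)$, so its alpha corresponds to $\lambda/n$ in (2).

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
X /= np.linalg.norm(X, axis=0)              # standardized columns, as assumed in the paper
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(n)

lam = 0.5                                   # lambda in (2)
fit = Lasso(alpha=lam / n, fit_intercept=False).fit(X, y)   # sklearn's alpha = lambda / n
print(np.flatnonzero(fit.coef_))            # indices of the selected (nonzero) coefficients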
Indeed, Lasso enjoys several attractive statistical properties and has drawn a significant amount of attention from the statistics community as well as other closely related fields. Under various conditions on the model matrix $\mathbf{X}$ and $n,p,\boldsymbol{\beta}$, it can be shown that Lasso delivers a sparse model with good predictive performance [11, 34]. In order to perform exact variable selection, much stronger assumptions are required [11]. Sufficient conditions under which Lasso gives a sparse model with good predictive performance are the restricted eigenvalue conditions and compatibility conditions [11]. These involve statements about the range of the spectrum of sub-matrices of $\mathbf{X}$ and are difficult to verify for a given data-matrix $\mathbf{X}$.

An important reason behind the popularity of Lasso is its computational feasibility and scalability to practical sized problems. Problem (2) is a convex quadratic optimization problem and there are several efficient solvers for it, see for example [44, 23, 29].

In spite of its favorable statistical properties, Lasso has several shortcomings. In the presence of noise and correlated variables, in order to deliver a model with good predictive accuracy, Lasso brings in a large number of nonzero coefficients (all of which are shrunk towards zero), including noise variables. Lasso leads to biased regression coefficient estimates, since the $\ell_{1}$-norm penalizes the large coefficients more severely than the smaller coefficients. In contrast, if the best subset selection procedure decides to include a variable in the model, it brings it in without any shrinkage, thereby draining the effect of its correlated surrogates. Upon increasing the degree of regularization, Lasso sets more coefficients to zero, but in the process ends up leaving out true predictors from the active set. Thus, as soon as certain sufficient regularity conditions on the data are violated, Lasso becomes suboptimal as (a) a variable selector and (b) in terms of delivering a model with good predictive performance.

The shortcomings of Lasso are also known in the statistics literature. In fact, there is a significant gap between what can be achieved via best subset selection and Lasso: this is supported by empirical (for small problem sizes, i.e., $p\leq 30$) and theoretical evidence, see for example [46, 58, 38, 31, 56, 48] and the references therein. Some discussion is also presented herein, in Section 4.

To address the shortcomings, non-convex penalized regression is often used to “bridge” the gap between the convex $\ell_{1}$ penalty and the combinatorial $\ell_{0}$ penalty [38, 27, 24, 54, 55, 28, 61, 62, 57, 13]. Written in Lagrangian form, this gives rise to continuous non-convex optimization problems of the form:

$$\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\sum_{i}p(|\beta_{i}|;\gamma;\lambda), \qquad (3)$$

where $p(|\beta|;\gamma;\lambda)$ is a non-convex function in $\beta$, with $\lambda$ and $\gamma$ denoting the degree of regularization and non-convexity, respectively. Typical examples of non-convex penalties include the minimax concave penalty (MCP), the smoothly clipped absolute deviation (SCAD), and $\ell_{\gamma}$ penalties (see for example [27, 38, 62, 24]). There is strong statistical evidence indicating the usefulness of estimators obtained as minimizers of non-convex penalized problems (3) over Lasso, see for example [56, 36, 54, 25, 52, 37, 60, 26]. In a recent paper, [60] discuss the usefulness of non-convex penalties over convex penalties (like Lasso) in identifying important covariates, leading to efficient estimation strategies in high dimensions. They describe interesting connections between $\ell_{0}$ regularized least squares and least squares with the hard thresholding penalty; and in the process develop comprehensive global properties of hard thresholding regularization in terms of various metrics. [26] establish asymptotic equivalence of a wide class of regularization methods in high dimensions with comprehensive sampling properties on both global and computable solutions.

Problem (3) mainly leads to a family of continuous and non-convex optimization problems. Various effective nonlinear optimization based methods (see for example [62, 24, 13, 36, 54, 38] and the references therein) have been proposed in the literature to obtain good local minimizers to Problem (3). In particular [38] proposes Sparsenet, a coordinate-descent procedure to trace out a surface of local minimizers for Problem (3) for the MCP penalty using effective warm start procedures. None of the existing approaches for solving Problem (3), however, come with guarantees of how close the solutions are to the global minimum of Problem (3).

The Lagrangian version of (1) given by

$$\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\lambda\sum_{i=1}^{p}1(\beta_{i}\neq 0), \qquad (4)$$

may be seen as a special case of (3). Note that, due to non-convexity, problems (4) and (1) are not equivalent. Problem (1) allows one to control the exact level of sparsity via the choice of $k$, unlike (4), where there is no clear correspondence between $\lambda$ and $k$. Problem (4) is a discrete optimization problem, unlike the continuous optimization problems (3) arising from continuous non-convex penalties.

Insightful statistical properties of Problem (4) have been explored from a theoretical viewpoint in [56, 31, 32, 48]. [48] points out that (1) is preferable over (4) in terms of superior statistical properties of the resulting estimator. The aforementioned papers, however, do not discuss methods to obtain provably optimal solutions to problems (4) or (1), and to the best of our knowledge, computing optimal solutions to problems (4) and (1) is deemed intractable.

Our Approach

In this paper, we propose a novel framework via which the best subset selection problem can be solved to optimality or near optimality in problems of practical interest within a reasonable time frame. At the core of our proposal is a computationally tractable framework that brings to bear the power of modern discrete optimization methods: discrete first order methods motivated by first order methods in convex optimization [45] and mixed integer optimization (MIO), see [4]. We do not guarantee polynomial time solution times as these do not exist for the best subset problem unless P=NP. Rather, our view of computational tractability is the ability of a method to solve problems of practical interest in times that are appropriate for the application addressed. An advantage of our approach is that it adapts to variants of the best subset regression problem of the form:

$$\begin{array}{ll}\min\limits_{\boldsymbol{\beta}} & \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{q}^{q}\\ \mathrm{s.t.} & \|\boldsymbol{\beta}\|_{0}\leq k\\ & \mathbf{A}\boldsymbol{\beta}\leq\mathbf{b},\end{array}$$

where $\mathbf{A}\boldsymbol{\beta}\leq\mathbf{b}$ represents polyhedral constraints and $q\in\{1,2\}$ refers to a least absolute deviation or the least squares loss function on the residuals $\mathbf{r}:=\mathbf{y}-\mathbf{X}\boldsymbol{\beta}$.

Existing approaches in the Mathematical Optimization Literature

In a seminal paper [30], the authors describe a leaps and bounds procedure for computing global solutions to Problem (1) (for the classical $n>p$ case), which can be achieved with computational effort significantly less than complete enumeration. leaps, a state-of-the-art R package, uses this principle to perform best subset selection for problems with $n>p$ and $p\leq 30$. [3] proposed a tailored branch-and-bound scheme that can be applied to Problem (1) using ideas from [30] and techniques in quadratic optimization, extending and enhancing the proposal of [6]. The proposal of [3] concentrates on obtaining high quality upper bounds for Problem (1) and is less scalable than the methods presented in this paper.

Contributions

We summarize our contributions in this paper below:

  1. We use MIO to find a provably optimal solution for the best subset problem. Our approach has the appealing characteristic that if we terminate the algorithm early, we obtain a solution with a guarantee on its suboptimality. Furthermore, our framework can accommodate side constraints on $\boldsymbol{\beta}$ and also extends to finding best subset solutions for the least absolute deviation loss function.

  2. We introduce a general algorithmic framework based on a discrete extension of modern first order continuous optimization methods that provide near-optimal solutions for the best subset problem. The MIO algorithm significantly benefits from solutions obtained by the first order methods and problem specific information that can be computed in a data-driven fashion.

  3. We report computational results with both synthetic and real-world datasets that show that our proposed framework can deliver provably optimal solutions for problems of size $n$ in the 1000s and $p$ in the 100s in minutes. For high-dimensional problems with $n\in\{50,100\}$ and $p\in\{1000,2000\}$, with the aid of warm starts and further problem-specific information, our approach finds near optimal solutions in minutes but takes hours to prove optimality.

  4. We investigate the statistical properties of best subset selection procedures for practical problem sizes, which to the best of our knowledge, have remained largely unexplored to date. We demonstrate the favorable predictive performance and sparsity-inducing properties of the best subset selection procedure over its competitors in a wide variety of real and synthetic examples for the least squares and absolute deviation loss functions.

The structure of the paper is as follows. In Section 2, we present a brief overview of MIO, including a summary of the computational advances it has enjoyed in the last twenty-five years. We present the proposed MIO formulations for the best subset problem as well as some connections with the compressed sensing literature for estimating parameters and providing lower bounds for the MIO formulations that improve their computational performance. In Section 3, we develop a discrete extension of first order methods in convex optimization to obtain near optimal solutions for the best subset problem and establish its convergence properties—the proposed algorithm and its properties may be of independent interest. Section 4 briefly reviews some of the statistical properties of the best-subset solution, highlighting the performance gaps in prediction error over regular Lasso-type estimators. In Section 5, we perform a variety of computational tests on synthetic and real datasets to assess the algorithmic and statistical performances of our approach for the least squares loss function for both the classical overdetermined case $n>p$ and the high-dimensional case $p\gg n$. In Section 6, we report computational results for the least absolute deviation loss function. In Section 7, we include our concluding remarks. Due to space limitations, some of the material has been relegated to the Appendix.

2 Mixed Integer Optimization Formulations

In this section, we present a brief overview of MIO, including the simply astonishing advances it has enjoyed in the last twenty-five years. We then present the proposed MIO formulations for the best subset problem as well as some connections with the compressed sensing literature for estimating parameters. We also present completely data driven methods to estimate parameters in the MIO formulations that improve their computational performance.

2.1 Brief Background on MIO

The general form of a Mixed Integer Quadratic Optimization (MIQO) problem is as follows:

$$\begin{array}{ll}\min & \boldsymbol{\alpha}^{T}\mathbf{Q}\boldsymbol{\alpha}+\boldsymbol{\alpha}^{T}\mathbf{a}\\ \mathrm{s.t.} & \mathbf{A}\boldsymbol{\alpha}\leq\mathbf{b}\\ & \alpha_{i}\in\{0,1\},\quad\forall i\in{\mathcal{I}}\\ & \alpha_{j}\in\mathbb{R}_{+},\quad\forall j\notin{\mathcal{I}},\end{array}$$

where $\mathbf{a}\in\mathbb{R}^{m},\mathbf{A}\in\mathbb{R}^{k\times m},\mathbf{b}\in\mathbb{R}^{k}$ and $\mathbf{Q}\in\mathbb{R}^{m\times m}$ (positive semidefinite) are the given parameters of the problem; $\mathbb{R}_{+}$ denotes the non-negative reals, the symbol $\leq$ denotes element-wise inequalities, and we optimize over $\boldsymbol{\alpha}\in\mathbb{R}^{m}$ containing both discrete ($\alpha_{i},i\in{\mathcal{I}}$) and continuous ($\alpha_{i},i\notin{\mathcal{I}}$) variables, with ${\mathcal{I}}\subset\{1,\ldots,m\}$. For background on MIO see [4]. Subclasses of MIQO problems include convex quadratic optimization problems (${\mathcal{I}}=\emptyset$), mixed integer ($\mathbf{Q}=\mathbf{0}_{m\times m}$) and linear optimization problems (${\mathcal{I}}=\emptyset,\ \mathbf{Q}=\mathbf{0}_{m\times m}$). Modern integer optimization solvers such as Gurobi and Cplex are able to tackle MIQO problems.

In the last twenty-five years (1991-2014) the computational power of MIO solvers has increased at an astonishing rate. In [7], to measure the speedup of MIO solvers, the same set of MIO problems were tested on the same computers using twelve consecutive versions of Cplex and version-on-version speedups were reported. The versions tested ranged from Cplex 1.2, released in 1991 to Cplex 11, released in 2007. Each version released in these years produced a speed improvement on the previous version, leading to a total speedup factor of more than 29,000 between the first and last version tested (see [7], [42] for details). Gurobi 1.0, a MIO solver which was first released in 2009, was measured to have similar performance to Cplex 11. Version-on-version speed comparisons of successive Gurobi releases have shown a speedup factor of more than 20 between Gurobi 5.5, released in 2013, and Gurobi 1.0 ([7], [42]). The combined machine-independent speedup factor in MIO solvers between 1991 and 2013 is 580,000. This impressive speedup factor is due to incorporating both theoretical and practical advances into MIO solvers. Cutting plane theory, disjunctive programming for branching rules, improved heuristic methods, techniques for preprocessing MIOs, using linear optimization as a black box to be called by MIO solvers, and improved linear optimization methods have all contributed greatly to the speed improvements in MIO solvers [7].

In addition, the past twenty years have also brought dramatic improvements in hardware. Figure 1 shows the exponentially increasing speed of supercomputers over the past twenty years, measured in billion floating point operations per second [1]. The hardware speedup from 1993 to 2013 is approximately $10^{5.5}\sim 320{,}000$. When both hardware and software improvements are considered, the overall speedup is approximately 200 billion! Note that the speedup factors cited here refer to mixed integer linear optimization problems, not MIQO problems. The speedup factors for MIQO problems are similar. MIO solvers provide both feasible solutions as well as lower bounds to the optimal value. As the MIO solver progresses towards the optimal solution, the lower bounds improve and provide an increasingly better guarantee of suboptimality, which is especially useful if the MIO solver is stopped before reaching the global optimum. In contrast, heuristic methods do not provide such a certificate of suboptimality.

Figure 1: Log of Peak Supercomputer Speed from 1993–2013.

The belief that MIO approaches to problems in statistics are not practically relevant was formed in the 1970s and 1980s, and at the time it was justified. Given the astonishing speedup of MIO solvers and computer hardware in the last twenty-five years, the mindset of MIO as theoretically elegant but practically irrelevant is no longer justified. In this paper, we provide empirical evidence of this fact in the context of the best subset selection problem.

2.2 MIO Formulations for the Best Subset Selection Problem

We first present a simple reformulation of Problem (1) as a MIO (in fact a MIQO) problem:

$$\begin{array}{ll}Z_{1}=\min\limits_{\boldsymbol{\beta},\mathbf{z}} & \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\\ \mathrm{s.t.} & -{\mathcal{M}}_{U}z_{i}\leq\beta_{i}\leq{\mathcal{M}}_{U}z_{i},\ i=1,\ldots,p\\ & z_{i}\in\{0,1\},\ i=1,\ldots,p\\ & \sum\limits_{i=1}^{p}z_{i}\leq k,\end{array} \qquad (5)$$

where $\mathbf{z}\in\{0,1\}^{p}$ is a binary variable and ${\mathcal{M}}_{U}$ is a constant such that if $\widehat{\boldsymbol{\beta}}$ is a minimizer of Problem (5), then ${\mathcal{M}}_{U}\geq\|\widehat{\boldsymbol{\beta}}\|_{\infty}$. If $z_{i}=1$, then $|\beta_{i}|\leq{\mathcal{M}}_{U}$, and if $z_{i}=0$, then $\beta_{i}=0$. Thus, $\sum_{i=1}^{p}z_{i}$ is an indicator of the number of non-zeros in $\boldsymbol{\beta}$.

Provided that ${\mathcal{M}}_{U}$ is chosen to be sufficiently large with ${\mathcal{M}}_{U}\geq\|\widehat{\boldsymbol{\beta}}\|_{\infty}$, a solution to Problem (5) will be a solution to Problem (1). Of course, ${\mathcal{M}}_{U}$ is not known a priori, and a small value of ${\mathcal{M}}_{U}$ may lead to a solution different from (1). The choice of ${\mathcal{M}}_{U}$ affects the strength of the formulation and is critical for obtaining good lower bounds in practice. In Section 2.3 we describe how to find appropriate values for ${\mathcal{M}}_{U}$. Note that there are other MIO formulations presented herein (see Problem (8)) that do not rely on an a priori specification of ${\mathcal{M}}_{U}$. However, we will stick to formulation (5) for the time being, since it provides some interesting connections to the Lasso.

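To illustrate how formulation (5) might be handed to an off-the-shelf solver, the sketch below builds the big-$M$ model in Gurobi's Python interface (gurobipy); the function and variable names are ours, and in practice ${\mathcal{M}}_{U}$ would be set using the data-driven bounds of Section 2.3. The sketch assumes a working Gurobi installation.

import gurobipy as gp
from gurobipy import GRB

def best_subset_big_M(X, y, k, M_U):
    """Formulation (5): big-M MIO model for best subset selection."""
    n, p = X.shape
    m = gp.Model("best-subset-(5)")
    beta = m.addVars(p, lb=-M_U, ub=M_U, name="beta")
    z = m.addVars(p, vtype=GRB.BINARY, name="z")
    m.addConstrs(beta[i] <= M_U * z[i] for i in range(p))    # z_i = 0  =>  beta_i = 0
    m.addConstrs(beta[i] >= -M_U * z[i] for i in range(p))
    m.addConstr(gp.quicksum(z[i] for i in range(p)) <= k)    # cardinality constraint
    # objective 0.5 * ||y - X beta||^2 written out as a quadratic expression
    resid = [y[j] - gp.quicksum(float(X[j, i]) * beta[i] for i in range(p)) for j in range(n)]
    m.setObjective(0.5 * gp.quicksum(r * r for r in resid), GRB.MINIMIZE)
    m.optimize()
    return [beta[i].X for i in range(p)]

Stopping the solver early (for instance with a time limit) still returns the incumbent solution together with a lower bound, i.e., a guarantee on its suboptimality.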
Formulation (5) leads to interesting insights, especially via the structure of the convex hull of its constraints, as illustrated next:

$$\text{Conv}\left(\left\{\boldsymbol{\beta}:|\beta_{i}|\leq{\mathcal{M}}_{U}z_{i},\ z_{i}\in\{0,1\},\ i=1,\ldots,p,\ \sum_{i=1}^{p}z_{i}\leq k\right\}\right)=\left\{\boldsymbol{\beta}:\|\boldsymbol{\beta}\|_{\infty}\leq{\mathcal{M}}_{U},\ \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{U}k\right\}\subseteq\left\{\boldsymbol{\beta}:\|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{U}k\right\}.$$

Thus, the minimum of Problem (5) is lower-bounded by the optimum objective value of both the following convex optimization problems:

$$Z_{2}:=\min_{\boldsymbol{\beta}}\;\;\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\;\;\;\mathrm{subject\;to}\;\;\;\|\boldsymbol{\beta}\|_{\infty}\leq{\mathcal{M}}_{U},\ \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{U}k \qquad (6)$$
$$Z_{3}:=\min_{\boldsymbol{\beta}}\;\;\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\;\;\;\mathrm{subject\;to}\;\;\;\|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{U}k, \qquad (7)$$

where (7) is the familiar Lasso in constrained form. This is a weaker relaxation than formulation (6), which in addition to the $\ell_{1}$ constraint on $\boldsymbol{\beta}$, has box constraints controlling the values of the $\beta_{i}$'s. It is easy to see that the following ordering exists: $Z_{3}\leq Z_{2}\leq Z_{1}$, with the inequalities being strict in most instances.

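To see the ordering $Z_{3}\leq Z_{2}\leq Z_{1}$ numerically, the two relaxations (6) and (7) can be solved with any convex solver; the following is an illustrative sketch of ours using CVXPY, under the assumption that CVXPY and a QP-capable backend are installed.

import cvxpy as cp

def relaxation_bounds(X, y, k, M_U):
    """Solve the convex relaxations (6) and (7); both lower-bound the optimum Z_1 of (5)."""
    p = X.shape[1]
    beta = cp.Variable(p)
    loss = 0.5 * cp.sum_squares(y - X @ beta)
    # (6): l1-ball intersected with a box
    Z2 = cp.Problem(cp.Minimize(loss),
                    [cp.norm(beta, 1) <= M_U * k, cp.norm(beta, "inf") <= M_U]).solve()
    # (7): constrained-form Lasso, a weaker relaxation
    Z3 = cp.Problem(cp.Minimize(loss), [cp.norm(beta, 1) <= M_U * k]).solve()
    return Z2, Z3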
In terms of approximating the optimal solution to Problem (5), the MIO solver begins by first solving a continuous relaxation of Problem (5). The Lasso formulation (7) is weaker than this root node relaxation. Additionally, MIO is typically able to significantly improve the quality of the root node solution as the MIO solver progresses toward the optimal solution.

To motivate the reader we provide an example of the evolution (see Figure 2) of the MIO formulation (8) for the Diabetes dataset [23], with $n=350, p=64$ (for further details on the dataset see Section 5).

Since formulation (5) is sensitive to the choice of ${\mathcal{M}}_{U}$, we consider an alternative MIO formulation based on Specially Ordered Sets [4], as described next.

Formulations via Specially Ordered Sets

Any feasible solution to formulation (5) will have $(1-z_{i})\beta_{i}=0$ for every $i\in\{1,\ldots,p\}$. This constraint can be modeled via integer optimization using Specially Ordered Sets of Type 1 [4] (SOS-1). In an SOS-1 constraint, at most one variable in the set can take a nonzero value, that is

$$(1-z_{i})\beta_{i}=0\;\;\iff\;\;(\beta_{i},1-z_{i}):\text{SOS-1},$$

for every $i=1,\ldots,p$. This leads to the following formulation of (1):

$$\begin{array}{ll}\min\limits_{\boldsymbol{\beta},\mathbf{z}} & \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\\ \mathrm{s.t.} & (\beta_{i},1-z_{i}):\text{SOS-1},\ \ i=1,\ldots,p\\ & z_{i}\in\{0,1\},\ i=1,\ldots,p\\ & \sum\limits_{i=1}^{p}z_{i}\leq k.\end{array} \qquad (8)$$

We note that Problem (8) can in principle be used to obtain global solutions to Problem (1) — Problem (8), unlike Problem (5), does not require any specification of the parameter ${\mathcal{M}}_{U}$.

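Solvers such as Gurobi expose SOS-1 constraints directly. The fragment below sketches how the constraints of Problem (8) could be declared in gurobipy; it is our illustration, and it introduces an auxiliary variable $w_{i}=1-z_{i}$ because SOS members must be model variables rather than expressions.

import gurobipy as gp
from gurobipy import GRB

def add_sos1_cardinality(m, p, k):
    """Declare the constraints of Problem (8) on an existing Gurobi model m; returns (beta, z)."""
    beta = m.addVars(p, lb=-GRB.INFINITY, name="beta")
    z = m.addVars(p, vtype=GRB.BINARY, name="z")
    w = m.addVars(p, lb=0.0, ub=1.0, name="w")            # w_i plays the role of 1 - z_i
    m.addConstrs(w[i] == 1 - z[i] for i in range(p))
    for i in range(p):
        # SOS-1: at most one of (beta_i, 1 - z_i) can be nonzero, so z_i = 0 forces beta_i = 0
        m.addSOS(GRB.SOS_TYPE1, [beta[i], w[i]], [1, 2])
    m.addConstr(gp.quicksum(z[i] for i in range(p)) <= k)
    return beta, z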
Figure 2: The typical evolution of the MIO formulation (8) for the diabetes dataset with $n=350, p=64$ and $k=6$ (left panel) and $k=7$ (right panel). The top panel shows the evolution of upper and lower bounds with time (in seconds); the lower panel shows the evolution of the corresponding MIO gap with time. Optimal solutions for both problems are found in a few seconds in both examples, but it takes 10-20 minutes to certify optimality via the lower bounds. Note that the time taken for the MIO to certify convergence to the global optimum increases with increasing $k$.

We now proceed to present a more structured representation of Problem (8). Note that the objective in this problem is a convex quadratic function in the continuous variable $\boldsymbol{\beta}$, which can be formulated explicitly as:

$$\begin{array}{ll}\min\limits_{\boldsymbol{\beta},\mathbf{z}} & \frac{1}{2}\boldsymbol{\beta}^{T}\mathbf{X}^{T}\mathbf{X}\boldsymbol{\beta}-\langle\mathbf{X}^{\prime}\mathbf{y},\boldsymbol{\beta}\rangle+\frac{1}{2}\|\mathbf{y}\|_{2}^{2}\\ \mathrm{s.t.} & (\beta_{i},1-z_{i}):\text{SOS-1},\ \ i=1,\ldots,p\\ & z_{i}\in\{0,1\},\ i=1,\ldots,p\\ & \sum\limits_{i=1}^{p}z_{i}\leq k\\ & -{\mathcal{M}}_{U}\leq\beta_{i}\leq{\mathcal{M}}_{U},\ i=1,\ldots,p\\ & \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{\ell}.\end{array} \qquad (9)$$

We also provide problem-dependent constants ${\mathcal{M}}_{U}$ and ${\mathcal{M}}_{\ell}\in[0,\infty]$. ${\mathcal{M}}_{U}$ provides an upper bound on the absolute value of the regression coefficients and ${\mathcal{M}}_{\ell}$ provides an upper bound on the $\ell_{1}$-norm of $\boldsymbol{\beta}$. Adding these bounds typically leads to improved performance of the MIO, especially in delivering lower bound certificates. In Section 2.3, we describe several approaches to compute these parameters from the data.

We also consider another formulation for (9):

$$\begin{array}{ll}\min\limits_{\boldsymbol{\beta},\mathbf{z},\boldsymbol{\zeta}} & \frac{1}{2}\boldsymbol{\zeta}^{T}\boldsymbol{\zeta}-\langle\mathbf{X}^{\prime}\mathbf{y},\boldsymbol{\beta}\rangle+\frac{1}{2}\|\mathbf{y}\|_{2}^{2}\\ \mathrm{s.t.} & \boldsymbol{\zeta}=\mathbf{X}\boldsymbol{\beta}\\ & (\beta_{i},1-z_{i}):\text{SOS-1},\ \ i=1,\ldots,p\\ & z_{i}\in\{0,1\},\ i=1,\ldots,p\\ & \sum\limits_{i=1}^{p}z_{i}\leq k\\ & -{\mathcal{M}}_{U}\leq\beta_{i}\leq{\mathcal{M}}_{U},\ i=1,\ldots,p\\ & \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{\ell}\\ & -{\mathcal{M}}^{\zeta}_{U}\leq\zeta_{i}\leq{\mathcal{M}}^{\zeta}_{U},\ i=1,\ldots,n\\ & \|\boldsymbol{\zeta}\|_{1}\leq{\mathcal{M}}^{\zeta}_{\ell},\end{array} \qquad (10)$$

where the optimization variables are $\boldsymbol{\beta}\in\mathbb{R}^{p},\boldsymbol{\zeta}\in\mathbb{R}^{n},\mathbf{z}\in\{0,1\}^{p}$ and ${\mathcal{M}}_{U},{\mathcal{M}}_{\ell},{\mathcal{M}}^{\zeta}_{U},{\mathcal{M}}^{\zeta}_{\ell}\in[0,\infty]$ are problem specific parameters. Note that the objective function in formulation (10) involves a quadratic form in $n$ variables and a linear function in $p$ variables. Problem (10) is equivalent to the following variant of the best subset problem:

$$\begin{array}{ll}\min\limits_{\boldsymbol{\beta}} & \frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\\ \mathrm{s.t.} & \|\boldsymbol{\beta}\|_{0}\leq k\\ & \|\boldsymbol{\beta}\|_{\infty}\leq{\mathcal{M}}_{U},\ \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{\ell}\\ & \|\mathbf{X}\boldsymbol{\beta}\|_{\infty}\leq{\mathcal{M}}^{\zeta}_{U},\ \|\mathbf{X}\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}^{\zeta}_{\ell}.\end{array} \qquad (11)$$

Formulations (9) and (10) differ in the size of the quadratic forms that are involved. The current state-of-the-art MIO solvers are better-equipped to handle mixed integer linear optimization problems than MIQO problems. Formulation (9) has fewer variables but a quadratic form in $p$ variables—we find this formulation more useful in the $n>p$ regime, with $p$ in the 100s. Formulation (10) on the other hand has more variables, but involves a quadratic form in $n$ variables—this formulation is more useful for high-dimensional problems $p\gg n$, with $n$ in the 100s and $p$ in the 1000s.

As we said earlier, the bounds on $\boldsymbol{\beta}$ and $\boldsymbol{\zeta}$ are not required, but if these constraints are provided, they improve the strength of the MIO formulation. In other words, formulations with tightly specified bounds provide better lower bounds to the global optimization problem in a specified amount of time, when compared to a MIO formulation with loose bound specifications. We next show how these bounds can be computed from given data.

2.3 Specification of Parameters

In this section, we obtain estimates for the quantities ${\mathcal{M}}_{U},{\mathcal{M}}_{\ell},{\mathcal{M}}^{\zeta}_{U},{\mathcal{M}}^{\zeta}_{\ell}$ such that an optimal solution to Problem (11) is also an optimal solution to Problem (1), and vice-versa.

Coherence and Restricted Eigenvalues of a Model Matrix

Given a model matrix $\mathbf{X}$, [51] introduced the cumulative coherence function

$$\mu[k]:=\max_{|I|=k}\ \max_{j\notin I}\ \sum_{i\in I}|\langle\mathbf{X}_{j},\mathbf{X}_{i}\rangle|,$$

where $\mathbf{X}_{j}$, $j=1,\ldots,p$ represent the columns of $\mathbf{X}$, i.e., features.

For $k=1$, we obtain the notion of coherence introduced in [22, 21] as a measure of the maximal pairwise correlation in absolute value of the columns of $\mathbf{X}$: $\mu:=\mu[1]=\max_{i\neq j}|\langle\mathbf{X}_{i},\mathbf{X}_{j}\rangle|.$

[16, 14] (see also [11] and references therein) introduced the notion that a matrix $\mathbf{X}$ satisfies a restricted eigenvalue condition if

$$\lambda_{\min}(\mathbf{X}_{I}^{\prime}\mathbf{X}_{I})\geq\eta_{k}\quad\text{for every }I\subset\{1,\ldots,p\}:\ |I|\leq k, \qquad (12)$$

where $\lambda_{\min}(\mathbf{X}_{I}^{\prime}\mathbf{X}_{I})$ denotes the smallest eigenvalue of the matrix $\mathbf{X}_{I}^{\prime}\mathbf{X}_{I}$. An inequality linking $\mu[k]$ and $\eta_{k}$ is as follows.

Proposition 1.

The following bounds hold:

  (a) [51]: $\mu[k]\leq\mu\cdot k$.

  (b) [21]: $\eta_{k}\geq 1-\mu[k-1]\geq 1-\mu\cdot(k-1)$.

The computations of $\mu[k]$ and $\eta_{k}$ for general $k$ are difficult, while $\mu$ is simple to compute. Proposition 1 provides bounds for $\mu[k]$ and $\eta_{k}$ in terms of the coherence $\mu$.

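For intuition, these quantities can be computed or bounded directly from a standardized $\mathbf{X}$. The sketch below (our code, not from the paper) computes the coherence $\mu$, the cumulative coherence $\mu[k]$, the Proposition 1 bounds, and, for small $p$ only, the restricted eigenvalue $\eta_{k}$ by enumerating size-$k$ subsets.

import itertools
import numpy as np

def coherence_quantities(X, k):
    """Coherence mu, cumulative coherence mu[k], restricted eigenvalue eta_k, and their bounds."""
    G = X.T @ X                                      # Gram matrix; columns assumed unit l2-norm
    p = G.shape[0]
    off_diag = np.abs(G - np.diag(np.diag(G)))
    mu = off_diag.max()                              # mu = mu[1]
    mu_k = max(np.sort(off_diag[:, j])[-k:].sum() for j in range(p))   # exact mu[k]
    mu_k_bound = mu * k                              # Proposition 1(a)
    eta_k_bound = 1.0 - mu * (k - 1)                 # Proposition 1(b); may be non-positive
    # exact eta_k by enumeration of size-k subsets -- only feasible for small p
    eta_k = min(np.linalg.eigvalsh(G[np.ix_(I, I)])[0]
                for I in itertools.combinations(range(p), k))
    return mu, mu_k, mu_k_bound, eta_k, eta_k_bound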
Operator Norms of Submatrices

The $(p,q)$ operator norm of matrix $\mathbf{A}$ is

$$\|\mathbf{A}\|_{p,q}:=\max_{\|\mathbf{u}\|_{q}=1}\|\mathbf{A}\mathbf{u}\|_{p}.$$

We will use extensively here the $(1,1)$ operator norm. We assume that each column vector of $\mathbf{X}$ has unit $\ell_{2}$-norm. The results derived in the next proposition borrow and enhance techniques developed by [51] in the context of analyzing the $\ell_{1}$–$\ell_{0}$ equivalence in compressed sensing.

Proposition 2.

For any $I\subset\{1,\ldots,p\}$ with $|I|=k$ we have:

  (a) $\|\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}-\mathbf{I}\|_{1,1}\leq\mu[k-1]$.

  (b) If the matrix $\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}$ is invertible and $\|\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}-\mathbf{I}\|_{1,1}<1$, then $\|(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\|_{1,1}\leq\frac{1}{1-\mu[k-1]}$.
Proof.

See Section A.3. ∎



We note that Part (b) also appears in [51] for the operator norm $\|(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\|_{\infty,\infty}$.

Given a set $I\subset\{1,\ldots,p\}$ with $|I|=k$, we let $\widehat{\boldsymbol{\beta}}_{I}$ denote the least squares regression coefficients obtained by regressing $\mathbf{y}$ on $\mathbf{X}_{I}$, i.e., $\widehat{\boldsymbol{\beta}}_{I}=(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}\mathbf{y}$. If we append $\widehat{\boldsymbol{\beta}}_{I}$ with zeros in the remaining coordinates we obtain $\widehat{\boldsymbol{\beta}}$ as follows: $\widehat{\boldsymbol{\beta}}\in\operatorname*{arg\,min}_{\boldsymbol{\beta}:\,\beta_{i}=0,\,i\notin I}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}$. Note that $\widehat{\boldsymbol{\beta}}$ depends on $I$, but we will suppress the dependence on $I$ for notational convenience.

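In code, this zero-padded restricted least squares solution takes a single least squares solve; the small helper below is our notation for it, used purely for illustration.

import numpy as np

def restricted_least_squares(X, y, support):
    """Regress y on the columns of X indexed by `support` and pad the rest with zeros."""
    beta = np.zeros(X.shape[1])
    cols = np.asarray(support)
    beta_I, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)   # (X_I' X_I)^{-1} X_I' y
    beta[cols] = beta_I
    return beta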
2.3.1 Specification of Parameters in terms of Coherence and Restricted Strong Convexity

Recall that $\mathbf{X}_{j}$, $j=1,\ldots,p$ represent the columns of $\mathbf{X}$; and we will use $\mathbf{x}_{i}$, $i=1,\ldots,n$ to denote the rows of $\mathbf{X}$. As discussed above, $\|\mathbf{X}_{j}\|=1$. We order the correlations $|\langle\mathbf{X}_{j},\mathbf{y}\rangle|$:

$$|\langle\mathbf{X}_{(1)},\mathbf{y}\rangle|\geq|\langle\mathbf{X}_{(2)},\mathbf{y}\rangle|\geq\ldots\geq|\langle\mathbf{X}_{(p)},\mathbf{y}\rangle|. \qquad (13)$$

We finally denote by 𝐱i1:ksubscriptnormsubscript𝐱𝑖:1𝑘\|\mathbf{x}_{i}\|_{1:k} the sum of the top k𝑘k absolute values of the entries of xij,j{1,2,,p}subscript𝑥𝑖𝑗𝑗12𝑝x_{ij},j\in\{1,2,\ldots,p\}.

Theorem 2.1.

For any k1𝑘1k\geq 1 such that μ[k1]<1𝜇delimited-[]𝑘11\mu[k-1]<1 any optimal solution 𝛃^^𝛃\widehat{\boldsymbol{\beta}} to (1) satisfies:

(a)   \|\widehat{\boldsymbol{\beta}}\|_{1} \;\leq\; \frac{1}{1-\mu[k-1]} \sum_{j=1}^{k} |\langle\mathbf{X}_{(j)},\mathbf{y}\rangle|.    (14)
(b)   \|\widehat{\boldsymbol{\beta}}\|_{\infty} \;\leq\; \min\left\{ \frac{1}{\eta_{k}} \sqrt{\sum_{j=1}^{k} |\langle\mathbf{X}_{(j)},\mathbf{y}\rangle|^{2}}, \; \frac{1}{\sqrt{\eta_{k}}} \|\mathbf{y}\|_{2} \right\}.    (15)
(c)   \|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{1} \;\leq\; \min\left\{ \sum_{i=1}^{n} \|\mathbf{x}_{i}\|_{\infty} \|\widehat{\boldsymbol{\beta}}\|_{1}, \; \sqrt{k}\,\|\mathbf{y}\|_{2} \right\}.    (16)
(d)   \|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{\infty} \;\leq\; \left( \max_{i=1,\ldots,n} \|\mathbf{x}_{i}\|_{1:k} \right) \|\widehat{\boldsymbol{\beta}}\|_{\infty}.    (17)

Proof.

For proof see Section A.4. ∎



We note that in the above theorem, the upper bound in Part (a) becomes infinite as soon as μ[k1]1𝜇delimited-[]𝑘11\mu[k-1]\geq 1. In such a case, we can use purely data-driven bounds by using convex optimization techniques, as described in Section 2.3.2.

The interesting message conveyed by Theorem 2.1 is that the upper bounds on 𝜷^1,subscriptnorm^𝜷1\|\widehat{\boldsymbol{\beta}}\|_{1}, 𝜷^subscriptnorm^𝜷\|\widehat{\boldsymbol{\beta}}\|_{\infty}, 𝐗𝜷^1subscriptnorm𝐗^𝜷1\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{1} and 𝐗𝜷^subscriptnorm𝐗^𝜷\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{\infty}, corresponding to the Problem (11) can all be obtained in terms of ηksubscript𝜂𝑘\eta_{k} and μ[k1]𝜇delimited-[]𝑘1\mu[k-1], quantities of fundamental interest appearing in the analysis of 1subscript1\ell_{1} regularization methods and understanding how close they are to 0subscript0\ell_{0} solutions [51, 22, 21, 16, 14]. On a different note, Theorem 2.1 arises from a purely computational motivation and quite curiously, involves the same quantities: cumulative coherence and restricted eigenvalues.

Note that the quantities μ[k1],ηk𝜇delimited-[]𝑘1subscript𝜂𝑘\mu[k-1],\eta_{k} are difficult to compute exactly, but they can be approximated by Proposition 1 which provides bounds commonly used in the compressed sensing literature. Of course, approximations to these quantities can also be obtained by using subsampling schemes.
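To make this concrete, the sketch below (Python/numpy, assuming the columns of X have unit ℓ2 norm as in the text) computes the cumulative coherence μ[k−1] exactly and then evaluates the bounds (14)–(15). Since the restricted eigenvalue η_k is combinatorial to compute, the sketch substitutes the coherence-based lower bound 1−μ[k−1] for it, in the spirit of Proposition 1; the function names are illustrative.

```python
import numpy as np

def cumulative_coherence(X, k):
    """mu[k]: largest sum, over columns i, of the k biggest |<X_i, X_j>| with j != i.
    Assumes the columns of X have unit l2 norm."""
    G = np.abs(X.T @ X)
    np.fill_diagonal(G, 0.0)              # exclude the j = i term
    topk = -np.sort(-G, axis=1)[:, :k]    # k largest off-diagonal entries per row
    return topk.sum(axis=1).max()

def theorem_2_1_bounds(X, y, k):
    """Evaluate the coherence-based bounds (14)-(15). The restricted eigenvalue
    eta_k is replaced by the lower bound 1 - mu[k-1] (an assumption)."""
    mu_km1 = cumulative_coherence(X, k - 1)
    if mu_km1 >= 1.0:
        return np.inf, np.inf             # bound (14) becomes vacuous
    eta_k = 1.0 - mu_km1
    corr = np.sort(np.abs(X.T @ y))[::-1][:k]              # top-k |<X_(j), y>|
    bound_l1 = corr.sum() / (1.0 - mu_km1)                  # eq. (14)
    bound_linf = min(np.sqrt((corr ** 2).sum()) / eta_k,
                     np.linalg.norm(y) / np.sqrt(eta_k))    # eq. (15)
    return bound_l1, bound_linf
```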

2.3.2 Specification of Parameters via Convex Quadratic Optimization

We provide an alternative purely data-driven way to compute the upper bounds to the parameters by solving several simple convex quadratic optimization problems.

Bounds on β̂_i's

For the case n>p𝑛𝑝n>p, upper and lower bounds on β^isubscript^𝛽𝑖\hat{\beta}_{i} can be obtained by solving the following pair of convex optimization problems:

u^{+}_{i} := \max_{\boldsymbol{\beta}} \; \beta_{i} \;\; \text{s.t.} \;\; \tfrac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^{2}_{2} \leq \text{UB}, \qquad\qquad u^{-}_{i} := \min_{\boldsymbol{\beta}} \; \beta_{i} \;\; \text{s.t.} \;\; \tfrac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^{2}_{2} \leq \text{UB},    (18)

for i=1,,p𝑖1𝑝i=1,\ldots,p. Above, UB is an upper bound to the minimum of the k𝑘k-subset least squares problem (1). ui+superscriptsubscript𝑢𝑖u_{i}^{+} is an upper bound to β^isubscript^𝛽𝑖\hat{\beta}_{i}, since the cardinality constraint 𝜷0ksubscriptnorm𝜷0𝑘\|\boldsymbol{\beta}\|_{0}\leq k does not appear in the optimization problem. Similarly, uisubscriptsuperscript𝑢𝑖u^{-}_{i} is a lower bound to β^isubscript^𝛽𝑖\hat{\beta}_{i}. The quantity Ui=max{|ui+|,|ui|}subscriptsuperscript𝑖𝑈subscriptsuperscript𝑢𝑖subscriptsuperscript𝑢𝑖{\mathcal{M}}^{i}_{U}=\max\{|u^{+}_{i}|,|u^{-}_{i}|\} serves as an upper bound to |β^i|subscript^𝛽𝑖|\hat{\beta}_{i}|. A reasonable choice for UB is obtained by using the discrete first order methods (Algorithms 1 and 2 as described in Section 3) in combination with the MIO formulation (8) (for a predefined amount of time). Having obtained Uisubscriptsuperscript𝑖𝑈{\mathcal{M}}^{i}_{U} as described above, we can obtain an upper bound to 𝜷^subscriptnorm^𝜷\|\widehat{\boldsymbol{\beta}}\|_{\infty} and 𝜷^1subscriptnorm^𝜷1\|\widehat{\boldsymbol{\beta}}\|_{1} as follows: U=maxiUisubscript𝑈subscript𝑖subscriptsuperscript𝑖𝑈{\mathcal{M}}_{U}=\max_{i}{\mathcal{M}}^{i}_{U} and 𝜷^1i=1kU(i)subscriptnorm^𝜷1superscriptsubscript𝑖1𝑘subscriptsuperscript𝑖𝑈\|\widehat{\boldsymbol{\beta}}\|_{1}\leq\sum_{i=1}^{k}{\mathcal{M}}^{(i)}_{U} where, U(1)U(2)U(p)subscriptsuperscript1𝑈subscriptsuperscript2𝑈subscriptsuperscript𝑝𝑈{\mathcal{M}}^{(1)}_{U}\geq{\mathcal{M}}^{(2)}_{U}\geq\ldots\geq{\mathcal{M}}^{(p)}_{U}.

Similarly, bounds corresponding to Parts (c) and (d) in Theorem 2.1 can be obtained by using the upper bounds on 𝜷^,𝜷^1subscriptnorm^𝜷subscriptnorm^𝜷1\|\widehat{\boldsymbol{\beta}}\|_{\infty},\|\widehat{\boldsymbol{\beta}}\|_{1} as described above.

Note that the quantities ui+superscriptsubscript𝑢𝑖u_{i}^{+} and uisuperscriptsubscript𝑢𝑖u_{i}^{-} are finite when the level sets of the least squares loss function are finite. In particular, the bounds are loose when p>n𝑝𝑛p>n. In the following we describe methods to obtain non-trivial bounds on 𝐱i,𝜷subscript𝐱𝑖𝜷\langle\mathbf{x}_{i},\boldsymbol{\beta}\rangle, for i=1,,n𝑖1𝑛i=1,\ldots,n that apply for arbitrary n,p𝑛𝑝n,p.

Bounds on ⟨x_i, β̂⟩'s

We now provide a generic method to obtain upper and lower bounds on the quantities 𝐱i,𝜷^subscript𝐱𝑖^𝜷\langle\mathbf{x}_{i},\widehat{\boldsymbol{\beta}}\rangle:

v^{+}_{i} := \max_{\boldsymbol{\beta}} \; \langle\mathbf{x}_{i},\boldsymbol{\beta}\rangle \;\; \text{s.t.} \;\; \tfrac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^{2}_{2} \leq \text{UB}, \qquad\qquad v^{-}_{i} := \min_{\boldsymbol{\beta}} \; \langle\mathbf{x}_{i},\boldsymbol{\beta}\rangle \;\; \text{s.t.} \;\; \tfrac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^{2}_{2} \leq \text{UB},    (19)

for i=1,,n𝑖1𝑛i=1,\ldots,n. Note that the bounds obtained from (19) are non-trivial bounds for both the under-determined n<p𝑛𝑝n<p and overdetermined cases. The bounds obtained from (19) are upper and lower bounds since we drop the cardinality constraint on 𝜷𝜷\boldsymbol{\beta}. The bounds are finite since for every i{1,,n}𝑖1𝑛i\in\{1,\ldots,n\} the quantity 𝐱i,𝜷subscript𝐱𝑖𝜷\langle\mathbf{x}_{i},\boldsymbol{\beta}\rangle remains bounded in the feasible set for Problems (19).

The quantity  vi=max{|vi+|,|vi|}subscript v𝑖subscriptsuperscript𝑣𝑖subscriptsuperscript𝑣𝑖{\mbox{ v}}_{i}=\max\{|v^{+}_{i}|,|v^{-}_{i}|\} serves as an upper bound to |𝐱i,𝜷|subscript𝐱𝑖𝜷|\langle\mathbf{x}_{i},\boldsymbol{\beta}\rangle|. In particular, this leads to simple upper bounds on 𝐗𝜷^maxi visubscriptnorm𝐗^𝜷subscript𝑖subscript v𝑖\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{\infty}\leq\max_{i}{\mbox{ v}}_{i} and 𝐗𝜷^1i visubscriptnorm𝐗^𝜷1subscript𝑖subscript v𝑖\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{1}\leq\sum_{i}{\mbox{ v}}_{i} and can be thought of completely data-driven methods to estimate bounds appearing in (16) and (17).

We note that Problems (18) and (19) have nice structure amenable to efficient computation as we discuss in Section A.1.
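As an illustration, the following sketch sets up Problems (18) and (19) with the cvxpy modeling package (an assumption; any convex QP solver would do). Here UB is an upper bound on the optimal value of Problem (1), e.g. obtained from the discrete first order methods of Section 3. In practice one would exploit the structure discussed in Section A.1 rather than solve 2(n+p) generic problems in a loop.

```python
import numpy as np
import cvxpy as cp

def coefficient_bounds(X, y, UB):
    """Problems (18): for each i, maximize/minimize beta_i subject to
    0.5*||y - X beta||_2^2 <= UB. Returns M_U^i = max(|u_i^+|, |u_i^-|)."""
    p = X.shape[1]
    beta = cp.Variable(p)
    constraints = [0.5 * cp.sum_squares(y - X @ beta) <= UB]
    M = np.zeros(p)
    for i in range(p):
        u_plus = cp.Problem(cp.Maximize(beta[i]), constraints).solve()
        u_minus = cp.Problem(cp.Minimize(beta[i]), constraints).solve()
        M[i] = max(abs(u_plus), abs(u_minus))
    return M

def fitted_value_bounds(X, y, UB):
    """Problems (19): bound |<x_i, beta>| over the same level set; these bounds
    remain finite and non-trivial even when p > n."""
    n, p = X.shape
    beta = cp.Variable(p)
    constraints = [0.5 * cp.sum_squares(y - X @ beta) <= UB]
    v = np.zeros(n)
    for i in range(n):
        v_plus = cp.Problem(cp.Maximize(X[i] @ beta), constraints).solve()
        v_minus = cp.Problem(cp.Minimize(X[i] @ beta), constraints).solve()
        v[i] = max(abs(v_plus), abs(v_minus))
    return v
```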

2.3.3 Parameter Specifications from Advanced Warm-Starts

The methods described above in Sections 2.3.1 and 2.3.2 lead to provable bounds on the parameters: with these bounds, Problem (11) provides an optimal solution to Problem (1), and vice-versa. We now describe some other alternatives that lead to excellent parameter specifications in practice.

The discrete first order methods described in the following section 3 provide good upper bounds to Problem (1). These solutions when supplied as a warm-start to the MIO formulation (8) are often improved by MIO, thereby leading to high quality solutions to Problem (1) within several minutes. If 𝜷^hybsubscript^𝜷hyb\hat{\boldsymbol{\beta}}_{\text{hyb}} denotes an estimate obtained from this hybrid approach, then U:=τ𝜷^hybassignsubscript𝑈𝜏subscriptnormsubscript^𝜷hyb{\mathcal{M}}_{U}:=\tau\|\hat{\boldsymbol{\beta}}_{\text{hyb}}\|_{\infty} with τ𝜏\tau a multiplier greater than one (e.g., τ{1.1,1.5,2}𝜏1.11.52\tau\in\{1.1,1.5,2\}) provides a good estimate for the parameter Usubscript𝑈{\mathcal{M}}_{U}. A reasonable upper bound to 𝜷^1subscriptnorm^𝜷1\|\widehat{\boldsymbol{\beta}}\|_{1} is kU𝑘subscript𝑈k{\mathcal{M}}_{U}. Bounds on the other quantities: 𝐗𝜷^1,𝐗𝜷^subscriptnorm𝐗^𝜷1subscriptnorm𝐗^𝜷\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{1},\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{\infty} can be derived by using expressions appearing in Theorem 2.1, with aforementioned bounds on 𝜷^1subscriptnorm^𝜷1\|\widehat{\boldsymbol{\beta}}\|_{1} and 𝜷^subscriptnorm^𝜷\|\widehat{\boldsymbol{\beta}}\|_{\infty}.
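A minimal sketch of this warm-start based specification (Python/numpy; the default τ and the dictionary keys are illustrative, and beta_hyb denotes a solution from the hybrid discrete first order / MIO approach):

```python
import numpy as np

def warm_start_parameters(beta_hyb, X, y, k, tau=1.5):
    """Heuristic parameter specification of Section 2.3.3 from a warm start."""
    M_U = tau * np.max(np.abs(beta_hyb))                 # estimate of ||beta_hat||_inf
    bound_l1 = k * M_U                                   # estimate of ||beta_hat||_1
    row_inf = np.max(np.abs(X), axis=1)                  # ||x_i||_inf
    row_1k = np.sort(np.abs(X), axis=1)[:, -k:].sum(axis=1)   # ||x_i||_{1:k}
    bound_Xb_1 = min(row_inf.sum() * bound_l1,
                     np.sqrt(k) * np.linalg.norm(y))     # via eq. (16)
    bound_Xb_inf = row_1k.max() * M_U                    # via eq. (17)
    return {"M_U": M_U, "l1": bound_l1, "Xb_1": bound_Xb_1, "Xb_inf": bound_Xb_inf}
```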

2.3.4 Some Generalizations and Variants

Some variations and improvements of the procedures described above are presented in Section A.2 (appendix).

3 Discrete First Order Algorithms

In this section, we develop a discrete extension of first order methods in convex optimization [45, 44] to obtain near optimal solutions for Problem (1) and its variant for the least absolute deviation (LAD) loss function. Our approach applies to the problem of minimizing any smooth convex function subject to cardinality constraints.

We will use these discrete first order methods to obtain solutions to warm start the MIO formulation. In Section 5, we will demonstrate how these methods greatly enhance the performance of the MIO.

3.1 Finding stationary solutions for minimizing smooth convex functions with cardinality constraints

Related work and contributions

In the signal processing literature [8, 9] proposed iterative hard-thresholding algorithms, in the context of 0subscript0\ell_{0}-regularized least squares problems, i.e., Problem (4). The authors establish convergence properties of the algorithm under the assumption that 𝐗𝐗\mathbf{X} satisfies coherence [8] or Restricted Isometry Property [9]. The method we propose here applies to a larger class of cardinality constrained optimization problems of the form (20), in particular, in the context of Problem (1) our algorithm and its convergence analysis do not require any form of restricted isometry property on the model matrix 𝐗𝐗\mathbf{X}.

Our proposed algorithm borrows ideas from projected gradient descent methods in first order convex optimization problems [45] and generalizes them to the discrete optimization Problem (20). We also derive new global convergence results for our proposed algorithms, as presented in Theorem 3.1. Our proposal, with some novel modifications, also applies to the non-smooth least absolute deviation loss with cardinality constraints, as discussed in Section 3.3.

Consider the following optimization problem:

\min_{\boldsymbol{\beta}} \; g(\boldsymbol{\beta}) \;\;\; \mathrm{subject\ to} \;\;\; \|\boldsymbol{\beta}\|_{0} \leq k,    (20)

where g(𝜷)0𝑔𝜷0g(\boldsymbol{\beta})\geq 0 is convex and has Lipschitz continuous gradient:

\|\nabla g(\boldsymbol{\beta}) - \nabla g(\widetilde{\boldsymbol{\beta}})\| \leq \ell\,\|\boldsymbol{\beta}-\widetilde{\boldsymbol{\beta}}\|.    (21)

The first ingredient of our approach is the observation that when g(𝜷)=𝜷𝐜22𝑔𝜷superscriptsubscriptnorm𝜷𝐜22g(\boldsymbol{\beta})=\left\|\boldsymbol{\beta}-\mathbf{c}\right\|_{2}^{2} for a given 𝐜𝐜\mathbf{c}, Problem (20) admits a closed form solution.

Proposition 3.

If 𝛃^^𝛃\hat{\boldsymbol{\beta}} is an optimal solution to the following problem:

\hat{\boldsymbol{\beta}} \in \operatorname*{arg\,min}_{\|\boldsymbol{\beta}\|_{0}\leq k} \; \left\|\boldsymbol{\beta}-\mathbf{c}\right\|_{2}^{2},    (22)

then it can be computed as follows: 𝛃^^𝛃\hat{\boldsymbol{\beta}} retains the k𝑘k largest (in absolute value) elements of 𝐜p𝐜superscript𝑝\mathbf{c}\in\mathbb{R}^{p} and sets the rest to zero, i.e., if |c(1)||c(2)||c(p)|,subscript𝑐1subscript𝑐2subscript𝑐𝑝|c_{(1)}|\geq|c_{(2)}|\geq\ldots\geq|c_{(p)}|, denote the ordered values of the absolute values of the vector 𝐜𝐜\mathbf{c}, then:

\hat{\beta}_{i} = \begin{cases} c_{i}, & \text{if } i\in\{(1),\ldots,(k)\}, \\ 0, & \text{otherwise}, \end{cases}    (23)

where, β^isubscript^𝛽𝑖\hat{\beta}_{i} is the i𝑖ith coordinate of 𝛃^^𝛃\hat{\boldsymbol{\beta}}. We will denote the set of solutions to Problem (22) by the notation 𝐇k(𝐜)subscript𝐇𝑘𝐜\mathbf{H}_{{k}}(\mathbf{c}).


Proof.

We provide a proof of this in Section B.2, for the sake of completeness. ∎



Note that, we use the notation “argmin” (Problem (22) and in other places that follow) to denote the set of minimizers of the optimization Problem.

The operator (23) is also known as the hard-thresholding operator [20]—a notion that arises in the context of the following related optimization problem:

\hat{\boldsymbol{\beta}} \in \operatorname*{arg\,min}_{\boldsymbol{\beta}} \; \frac{1}{2}\|\boldsymbol{\beta}-\mathbf{c}\|_{2}^{2} + \lambda\|\boldsymbol{\beta}\|_{0},    (24)

where 𝜷^^𝜷\hat{\boldsymbol{\beta}} admits a simple closed form expression given by β^i=cisubscript^𝛽𝑖subscript𝑐𝑖\hat{\beta}_{i}=c_{i} if |ci|>λsubscript𝑐𝑖𝜆|c_{i}|>\sqrt{\lambda} and β^i=0subscript^𝛽𝑖0\hat{\beta}_{i}=0 otherwise, for i=1,,p𝑖1𝑝i=1,\ldots,p.
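Both thresholding operators have one-line implementations. The sketch below (plain numpy, with hypothetical function names) returns one element of H_k(c) for Problem (22), breaking ties arbitrarily, together with the closed-form solution of Problem (24):

```python
import numpy as np

def hard_threshold_k(c, k):
    """One element of H_k(c) (Problem (22)): keep the k largest entries of c in
    absolute value and set the rest to zero; ties are broken arbitrarily."""
    beta = np.zeros_like(c, dtype=float)
    keep = np.argsort(-np.abs(c))[:k]
    beta[keep] = c[keep]
    return beta

def hard_threshold_penalized(c, lam):
    """Closed-form solution of Problem (24): keep c_i exactly when |c_i| > sqrt(lam)."""
    return np.where(np.abs(c) > np.sqrt(lam), c, 0.0)
```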

Remark 1.

There is an important difference between the minimizers of Problems (22) and (24). For Problem (24), the smallest (in absolute value) non-zero element in 𝛃^^𝛃\hat{\boldsymbol{\beta}} is greater than λ𝜆\lambda in absolute value. On the other hand, in Problem (22) there is no lower bound to the minimum (in absolute value) non-zero element of a minimizer. This needs to be taken care of while analyzing the convergence properties of Algorithm 1 (Section 3.2).



Given a current solution 𝜷𝜷\boldsymbol{\beta}, the second ingredient of our approach is to upper bound the function g(𝜼)𝑔𝜼g(\boldsymbol{\eta}) around g(𝜷)𝑔𝜷g(\boldsymbol{\beta}). To do so, we use ideas from projected gradient descent methods in first order convex optimization problems [45, 44].

Proposition 4.

([45, 44]) For a convex function g(𝛃)𝑔𝛃g(\boldsymbol{\beta}) satisfying condition (21) and for any L𝐿L\geq\ell we have :

g(\boldsymbol{\eta}) \leq Q_{L}(\boldsymbol{\eta},\boldsymbol{\beta}) := g(\boldsymbol{\beta}) + \frac{L}{2}\|\boldsymbol{\eta}-\boldsymbol{\beta}\|_{2}^{2} + \langle\nabla g(\boldsymbol{\beta}),\boldsymbol{\eta}-\boldsymbol{\beta}\rangle    (25)

for all 𝛃,𝛈𝛃𝛈\boldsymbol{\beta},\boldsymbol{\eta} with equality holding at 𝛃=𝛈𝛃𝛈\boldsymbol{\beta}=\boldsymbol{\eta}.



Applying Proposition 3 to the upper bound QL(𝜼,𝜷)subscript𝑄𝐿𝜼𝜷Q_{L}(\boldsymbol{\eta},\boldsymbol{\beta}) in Proposition 4 we obtain

\operatorname*{arg\,min}_{\|\boldsymbol{\eta}\|_{0}\leq k} Q_{L}(\boldsymbol{\eta},\boldsymbol{\beta})
  = \operatorname*{arg\,min}_{\|\boldsymbol{\eta}\|_{0}\leq k} \left( \frac{L}{2}\left\|\boldsymbol{\eta}-\left(\boldsymbol{\beta}-\frac{1}{L}\nabla g(\boldsymbol{\beta})\right)\right\|_{2}^{2} - \frac{1}{2L}\left\|\nabla g(\boldsymbol{\beta})\right\|_{2}^{2} + g(\boldsymbol{\beta}) \right)
  = \operatorname*{arg\,min}_{\|\boldsymbol{\eta}\|_{0}\leq k} \left\|\boldsymbol{\eta}-\left(\boldsymbol{\beta}-\frac{1}{L}\nabla g(\boldsymbol{\beta})\right)\right\|_{2}^{2}
  = \mathbf{H}_{k}\left(\boldsymbol{\beta}-\frac{1}{L}\nabla g(\boldsymbol{\beta})\right),    (26)

where 𝐇k()subscript𝐇𝑘\mathbf{H}_{{k}}(\cdot) is defined in (23). In light of (26) we are now ready to present Algorithm 1 to find a stationary point (see Definition 1) of Problem (20).

Algorithm 1

Input: g(𝜷)𝑔𝜷g(\boldsymbol{\beta}), L𝐿L, ϵitalic-ϵ\epsilon.

Output: A first order stationary solution 𝜷superscript𝜷\boldsymbol{\beta}^{*}.

Algorithm:

  1.

    Initialize with 𝜷1psubscript𝜷1superscript𝑝{\boldsymbol{\beta}}_{1}\in\mathbb{R}^{p} such that 𝜷10ksubscriptnormsubscript𝜷10𝑘\|{\boldsymbol{\beta}}_{1}\|_{0}\leq k.


  2.

    For m1𝑚1m\geq 1, apply (26) with 𝜷=𝜷m𝜷subscript𝜷𝑚\boldsymbol{\beta}={\boldsymbol{\beta}}_{m} to obtain 𝜷m+1subscript𝜷𝑚1{\boldsymbol{\beta}}_{m+1} as:

    \boldsymbol{\beta}_{m+1} \in \mathbf{H}_{k}\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)    (27)

  3.

    Repeat Step 2, until 𝜷m+1𝜷m2ϵsubscriptnormsubscript𝜷𝑚1subscript𝜷𝑚2italic-ϵ\|{\boldsymbol{\beta}}_{m+1}-{\boldsymbol{\beta}}_{m}\|_{2}\leq\epsilon.


  4.

    Let 𝜷m:=(βm1,,βmp)assignsubscript𝜷𝑚subscript𝛽𝑚1subscript𝛽𝑚𝑝\boldsymbol{\beta}_{m}:=(\beta_{m1},\ldots,\beta_{mp}) denote the current estimate and let I=Supp(𝜷m):={i:βmi0}𝐼Suppsubscript𝜷𝑚assignconditional-set𝑖subscript𝛽𝑚𝑖0I=\text{Supp}({\boldsymbol{\beta}}_{m}):=\{i:~{}\beta_{mi}\neq 0\}. Solve the continuous optimization problem:

    \min_{\boldsymbol{\beta}:\,\beta_{i}=0,\;i\notin I} \; g(\boldsymbol{\beta}),    (28)

    and let 𝜷superscript𝜷\boldsymbol{\beta}^{*} be a minimizer.



The convergence properties of Algorithm 1 are presented in Section 3.2. We also present Algorithm 2, a variant of Algorithm 1 with better empirical performance. Algorithm 2 modifies Step 2 of Algorithm 1 by using a line search. It obtains 𝜼m𝐇k(𝜷m1Lg(𝜷m))subscript𝜼𝑚subscript𝐇𝑘subscript𝜷𝑚1𝐿𝑔subscript𝜷𝑚\boldsymbol{\eta}_{m}\in\mathbf{H}_{{k}}\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right) and 𝜷m+1=λm𝜼m+(1λm)𝜷m,subscript𝜷𝑚1subscript𝜆𝑚subscript𝜼𝑚1subscript𝜆𝑚subscript𝜷𝑚\boldsymbol{\beta}_{m+1}=\lambda_{m}\boldsymbol{\eta}_{m}+(1-\lambda_{m})\boldsymbol{\beta}_{m}, where λmargminλg(λ𝜼m+(1λ)𝜷m)subscript𝜆𝑚subscriptargmin𝜆𝑔𝜆subscript𝜼𝑚1𝜆subscript𝜷𝑚\lambda_{m}\in\operatorname*{arg\,min}_{\lambda}g\left(\lambda\boldsymbol{\eta}_{m}+(1-\lambda)\boldsymbol{\beta}_{m}\right).

Note that the iterate 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m} in Algorithm 2 need not be k𝑘k-sparse (i.e., need not satisfy: 𝜷m0ksubscriptnormsubscript𝜷𝑚0𝑘\|\boldsymbol{\beta}_{m}\|_{0}{\leq}k), however, 𝜼msubscript𝜼𝑚\boldsymbol{\eta}_{m} is k𝑘k-sparse (𝜼m0ksubscriptnormsubscript𝜼𝑚0𝑘\|\boldsymbol{\eta}_{m}\|_{0}\leq k). Moreover, the sequence may not lead to a decreasing set of objective values, but it satisfies: g(𝜷m+1)g(𝜼m)g(𝜷m).𝑔subscript𝜷𝑚1𝑔subscript𝜼𝑚not-less-than-nor-greater-than𝑔subscript𝜷𝑚g(\boldsymbol{\beta}_{m+1})\leq g(\boldsymbol{\eta}_{m})\nleq g(\boldsymbol{\beta}_{m}).
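A compact sketch of Algorithms 1 and 2 for a generic smooth convex g is given below (Python/numpy; it reuses hard_threshold_k from the earlier sketch). The polishing Step 4 is left to a separate refit, and the line search of Algorithm 2 is approximated here by a grid over [0, 1] — for the least squares loss the exact minimizer along the segment has a closed form, so the grid is only an illustrative stand-in.

```python
import numpy as np

def algorithm_1(grad_g, beta_1, k, L, eps=1e-4, max_iter=1000):
    """Steps 1-3 of Algorithm 1 for min g(beta) s.t. ||beta||_0 <= k.
    grad_g is the gradient of g and L >= ell its Lipschitz constant."""
    beta = beta_1.copy()
    for _ in range(max_iter):
        beta_next = hard_threshold_k(beta - grad_g(beta) / L, k)   # update (27)
        if np.linalg.norm(beta_next - beta) <= eps:
            return beta_next
        beta = beta_next
    return beta

def algorithm_2_step(grad_g, g, beta, k, L, grid=np.linspace(0.0, 1.0, 21)):
    """One step of the line-search variant (Algorithm 2): eta_m is k-sparse and
    lambda_m minimizes g along the segment joining eta_m and beta_m; the grid
    search is an implementation choice, not part of the algorithm's statement."""
    eta = hard_threshold_k(beta - grad_g(beta) / L, k)
    lam = min(grid, key=lambda t: g(t * eta + (1.0 - t) * beta))
    return lam * eta + (1.0 - lam) * beta, eta
```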

3.2 Convergence Analysis of Algorithm 1

In this section, we study convergence properties for Algorithm 1. Before we embark on the analysis, we need to define the notion of first order optimality for Problem (20).

Definition 1.

Given an L𝐿L\geq\ell, the vector 𝛈p𝛈superscript𝑝\boldsymbol{\eta}\in\mathbb{R}^{p} is said to be a first order stationary point of Problem (20) if 𝛈0ksubscriptnorm𝛈0𝑘\|\boldsymbol{\eta}\|_{0}\leq k and it satisfies the following fixed point equation:

\boldsymbol{\eta} \in \mathbf{H}_{k}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right).    (29)


Let us give some intuition associated with the above definition.

Consider 𝜼𝜼\boldsymbol{\eta} as in Definition 1. Since 𝜼0k,subscriptnorm𝜼0𝑘\|\boldsymbol{\eta}\|_{0}\leq k, it follows that there is a set I{1,,p}𝐼1𝑝I\subset\{1,\ldots,p\} such that ηi=0subscript𝜂𝑖0\eta_{i}=0 for all iI𝑖𝐼i\in I and the size of Icsuperscript𝐼𝑐I^{c} (complement of I𝐼I) is k𝑘k. Since 𝜼𝐇k(𝜼1Lg(𝜼)),𝜼subscript𝐇𝑘𝜼1𝐿𝑔𝜼\boldsymbol{\eta}\in\mathbf{H}_{{k}}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right), it follows that for all iI𝑖𝐼i\notin I, we have: ηi=ηi1Lig(𝜼),subscript𝜂𝑖subscript𝜂𝑖1𝐿subscript𝑖𝑔𝜼\eta_{i}=\eta_{i}-\frac{1}{L}\nabla_{i}g(\boldsymbol{\eta}), where, ig(𝜼)subscript𝑖𝑔𝜼\nabla_{i}g(\boldsymbol{\eta}) is the i𝑖ith coordinate of g(𝜼)𝑔𝜼\nabla g(\boldsymbol{\eta}). It thus follows that: ig(𝜼)=0subscript𝑖𝑔𝜼0\nabla_{i}g(\boldsymbol{\eta})=0 for all iI𝑖𝐼i\notin I. Since g(𝜼)𝑔𝜼g(\boldsymbol{\eta}) is convex in 𝜼𝜼\boldsymbol{\eta}, this means that 𝜼𝜼\boldsymbol{\eta} solves the following convex optimization problem:

\min_{\boldsymbol{\eta}} \; g(\boldsymbol{\eta}) \;\;\; \text{s.t.} \;\;\; \eta_{i}=0,\; i\in I.    (30)

Note however, that the converse of the above statement is not true. That is, if I~{1,,p}~𝐼1𝑝\tilde{{I}}\subset\{1,\ldots,p\} is an arbitrary subset with |I~c|=ksuperscript~𝐼𝑐𝑘|\tilde{{I}}^{c}|=k then a solution 𝜼^I~subscript^𝜼~𝐼\hat{\boldsymbol{\eta}}_{\tilde{I}} to the restricted convex problem (30) with I=I~𝐼~𝐼I=\tilde{{I}} need not correspond to a first order stationary point.

Note that any global minimizer to Problem (20) is also a first order stationary point, as defined above (see Proposition 7).

We present the following proposition (for its proof see Section B.6), which sheds light on a first order stationary point 𝜼𝜼\boldsymbol{\eta} for which 𝜼0<ksubscriptnorm𝜼0𝑘\|\boldsymbol{\eta}\|_{0}<k.

Proposition 5.

Suppose 𝛈𝛈\boldsymbol{\eta} satisfies the first order stationary condition (29) and 𝛈0<ksubscriptnorm𝛈0𝑘\|\boldsymbol{\eta}\|_{0}<k. Then 𝛈argmin𝛃g(𝛃)𝛈subscriptargmin𝛃𝑔𝛃\boldsymbol{\eta}\in\operatorname*{arg\,min}\limits_{\boldsymbol{\beta}}\;\;g(\boldsymbol{\beta}).



We next define the notion of an ϵitalic-ϵ\epsilon-approximate first order stationary point of Problem (20):

Definition 2.

Given an ϵ>0italic-ϵ0\epsilon>0 and L𝐿L\geq\ell we say that 𝛈𝛈\boldsymbol{\eta} satisfies an ϵitalic-ϵ\epsilon-approximate first order optimality condition of Problem (20) if 𝛈0ksubscriptnorm𝛈0𝑘\|\boldsymbol{\eta}\|_{0}\leq k and for some 𝛈^𝐇k(𝛈1Lg(𝛈))^𝛈subscript𝐇𝑘𝛈1𝐿𝑔𝛈\hat{\boldsymbol{\eta}}\in\mathbf{H}_{{k}}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right), we have 𝛈𝛈^2ϵsubscriptnorm𝛈^𝛈2italic-ϵ\|\boldsymbol{\eta}-\hat{\boldsymbol{\eta}}\|_{2}\leq\epsilon.



Before we dive into the convergence properties of Algorithm 1, we need to introduce some notation. Let 𝜷m=(βm1,,βmp)subscript𝜷𝑚subscript𝛽𝑚1subscript𝛽𝑚𝑝\boldsymbol{\beta}_{m}=(\beta_{m1},\ldots,\beta_{mp}) and 𝟏m=(e1,,ep)subscript1𝑚subscript𝑒1subscript𝑒𝑝\mathbf{1}_{m}=(e_{1},\ldots,e_{p}) with ej=1subscript𝑒𝑗1e_{j}=1, if βmj0subscript𝛽𝑚𝑗0\beta_{mj}\neq 0, and ej=0subscript𝑒𝑗0e_{j}=0, if βmj=0subscript𝛽𝑚𝑗0\beta_{mj}=0, j=1,,p𝑗1𝑝j=1,\ldots,p, i.e., 𝟏msubscript1𝑚\mathbf{1}_{m} represents the sparsity pattern of the support of 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m}.

Suppose, we order the coordinates of 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m} by their absolute values: |β(1),m||β(2),m||β(p),m|subscript𝛽1𝑚subscript𝛽2𝑚subscript𝛽𝑝𝑚|\beta_{(1),m}|\geq|\beta_{(2),m}|\geq\ldots\geq|\beta_{(p),m}|. Note that by definition (27), β(i),m=0subscript𝛽𝑖𝑚0\beta_{(i),m}=0 for all i>k𝑖𝑘i>k and m2𝑚2m\geq 2. We denote αk,m=|β(k),m|subscript𝛼𝑘𝑚subscript𝛽𝑘𝑚\alpha_{k,m}=|\beta_{(k),m}| to be the k𝑘kth largest (in absolute value) entry in 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m} for all m2𝑚2m\geq 2. Clearly if αk,m>0subscript𝛼𝑘𝑚0\alpha_{k,m}>0 then 𝜷m0=ksubscriptnormsubscript𝜷𝑚0𝑘\|\boldsymbol{\beta}_{m}\|_{0}=k and if αk,m=0subscript𝛼𝑘𝑚0\alpha_{k,m}=0 then 𝜷m0<ksubscriptnormsubscript𝜷𝑚0𝑘\|\boldsymbol{\beta}_{m}\|_{0}<k. Let α¯k:=lim supmαk,massignsubscript¯𝛼𝑘subscriptlimit-supremum𝑚subscript𝛼𝑘𝑚\overline{\alpha}_{k}:=\limsup\limits_{m\rightarrow\infty}\;\alpha_{k,m} and α¯k:=lim infmαk,massignsubscript¯𝛼𝑘subscriptlimit-infimum𝑚subscript𝛼𝑘𝑚\underline{\alpha}_{k}:=\liminf\limits_{m\rightarrow\infty}\;\alpha_{k,m}.

Proposition 6.

Consider g(𝛃)𝑔𝛃g(\boldsymbol{\beta}) and \ell as defined in (20) and (21). Let 𝛃m,m1subscript𝛃𝑚𝑚1\boldsymbol{\beta}_{m},m\geq 1 be the sequence generated by Algorithm 1. Then we have :

  (a)

    For any L𝐿L\geq\ell, the sequence g(𝜷m)𝑔subscript𝜷𝑚g(\boldsymbol{\beta}_{m}) satisfies

    g(\boldsymbol{\beta}_{m}) - g(\boldsymbol{\beta}_{m+1}) \geq \frac{L-\ell}{2}\left\|\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\right\|_{2}^{2},    (31)

    is decreasing and converges.


  (b)

    If L>𝐿L>\ell, then 𝜷m+1𝜷m𝟎subscript𝜷𝑚1subscript𝜷𝑚0\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\rightarrow\mathbf{0} as m𝑚m\rightarrow\infty.


  (c)

    If L>𝐿L>\ell and α¯k>0subscript¯𝛼𝑘0\underline{\alpha}_{k}>0 then the sequence 𝟏msubscript1𝑚\mathbf{1}_{m} converges after finitely many iterations, i.e., there exists an iteration index Msuperscript𝑀M^{*} such that 𝟏m=𝟏m+1subscript1𝑚subscript1𝑚1\mathbf{1}_{m}=\mathbf{1}_{m+1} for all mM𝑚superscript𝑀m\geq M^{*}. Furthermore, the sequence 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m} is bounded and converges to a first order stationary point.


  (d)

    If L>𝐿L>\ell and α¯k=0subscript¯𝛼𝑘0\underline{\alpha}_{k}=0 then lim infmg(𝜷m)=0subscriptlimit-infimum𝑚subscriptnorm𝑔subscript𝜷𝑚0\liminf\limits_{m\rightarrow\infty}\|\nabla g(\boldsymbol{\beta}_{m})\|_{\infty}=0.


  (e)

    Let L>𝐿L>\ell, α¯k=0subscript¯𝛼𝑘0\overline{\alpha}_{k}=0 and suppose that the sequence 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m} has a limit point. Then g(𝜷m)min𝜷g(𝜷)𝑔subscript𝜷𝑚subscript𝜷𝑔𝜷g(\boldsymbol{\beta}_{m})\rightarrow\min\limits_{\boldsymbol{\beta}}\;\;g(\boldsymbol{\beta}).



Proof.

See Section B.1. ∎


Remark 2.

Note that the existence of a limit point in Proposition 6, Part (e) is guaranteed under fairly weak conditions. One such condition is that sup({𝛃:𝛃0k,f(𝛃)f0})<,supremumconditional-set𝛃formulae-sequencesubscriptnorm𝛃0𝑘𝑓𝛃subscript𝑓0\sup\left(\left\{\boldsymbol{\beta}:\|\boldsymbol{\beta}\|_{0}\leq k,f(\boldsymbol{\beta})\leq f_{0}\right\}\right)<\infty, for any finite value f0subscript𝑓0f_{0}. In words this means that the k𝑘k-sparse level sets of the function g(𝛃)𝑔𝛃g(\boldsymbol{\beta}) is bounded.



In the special case where g(𝛃)𝑔𝛃g(\boldsymbol{\beta}) is the least squares loss function, the above condition is equivalent to every k𝑘k-submatrix (𝐗Jsubscript𝐗𝐽\mathbf{X}_{J}) of 𝐗𝐗\mathbf{X} comprising of k𝑘k columns being full rank. In particular, this holds with probability one when the entries of 𝐗𝐗\mathbf{X} are drawn from a continuous distribution and k<n𝑘𝑛k<n.
g(𝛃)𝑔𝛃g(\boldsymbol{\beta}) 是最小平方法損失函數的特殊情況下,上述條件相當於 𝐗𝐗\mathbf{X} ) b3 > 包含滿等級的 k𝑘k 欄位。特別是,當 𝐗𝐗\mathbf{X} 的條目從連續分佈和 k<n𝑘𝑛k<n 中提取時,這種情況的機率為 1。

Remark 3.

Parts (d) and (e) of Proposition 6 are probably not statistically interesting cases, since they correspond to un-regularized solutions of the problem ming(𝛃)𝑔𝛃\min g(\boldsymbol{\beta}). However, we include them since they shed light on the properties of Algorithm 1.



The conditions assumed in Part (c) imply that the support of 𝛃msubscript𝛃𝑚\boldsymbol{\beta}_{m} stabilizes and Algorithm 1 behaves like vanilla gradient descent thereafter. The support of 𝛃msubscript𝛃𝑚\boldsymbol{\beta}_{m} need not stabilize for Parts (d), (e) and thus Algorithm 1 may not behave like vanilla gradient descent after finitely many iterations. However, the objective values (under minor regularity assumptions) converge to ming(𝛃)𝑔𝛃\min\;g(\boldsymbol{\beta}).

We present the following Proposition (for proof see Section B.3) about a uniqueness property of the fixed point equation (1).

Proposition 7.

Suppose L>𝐿L>\ell and let 𝛈𝛈\boldsymbol{\eta} satisfy a first order stationary point as in Definition 1. Then the set 𝐇k(𝛈1Lg(𝛈))subscript𝐇𝑘𝛈1𝐿𝑔𝛈\mathbf{H}_{{k}}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right) has exactly one element: 𝛈𝛈\boldsymbol{\eta}.



The following proposition (for a proof see Section B.4) shows that a global minimizer of the Problem (20) is also a first order stationary point.

Proposition 8.

Suppose L>𝐿L>\ell and let 𝛃^^𝛃\widehat{\boldsymbol{\beta}} be a global minimizer of Problem (20). Then 𝛃^^𝛃\widehat{\boldsymbol{\beta}} is a first order stationary point.



Proposition 6 establishes that Algorithm 1 either converges to a first order stationary point (Part (c)) or it converges (under minor technical assumptions) to a global optimal solution (Parts (d), (e)), but does not quantify the rate of convergence. We next characterize the rate of convergence of the algorithm to an ϵ-approximate first order stationary point.

Theorem 3.1.

Let L>𝐿L>\ell and 𝛃superscript𝛃\boldsymbol{\beta}^{*} denote a first order stationary point of Algorithm 1. After M𝑀M iterations Algorithm 1 satisfies

\min_{m=1,\ldots,M} \|\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\|_{2}^{2} \leq \frac{2\left(g(\boldsymbol{\beta}_{1})-g(\boldsymbol{\beta}^{*})\right)}{M(L-\ell)},    (32)

where g(𝛃m)g(𝛃)𝑔subscript𝛃𝑚𝑔superscript𝛃g(\boldsymbol{\beta}_{m})\downarrow g(\boldsymbol{\beta}^{*}) as m𝑚m\rightarrow\infty.


Proof.

See Section B.5. ∎



Theorem 3.1 implies that for any ϵ>0italic-ϵ0\epsilon>0 there exists M=O(1ϵ)𝑀𝑂1italic-ϵM=O(\frac{1}{\epsilon}) such that for some 1mM1superscript𝑚𝑀1\leq m^{*}\leq M, we have: 𝜷m+1𝜷m22ϵ.superscriptsubscriptnormsubscript𝜷superscript𝑚1subscript𝜷superscript𝑚22italic-ϵ\|\boldsymbol{\beta}_{m^{*}+1}-\boldsymbol{\beta}_{m^{*}}\|_{2}^{2}\leq\epsilon. Note that the convergence rates derived above apply for a large class of problems (20), where, the function g(𝜷)0𝑔𝜷0g(\boldsymbol{\beta})\geq 0 is convex with Lipschitz continuous gradient (21). Tighter rates may be obtained under additional structural assumptions on g()𝑔g(\cdot). For example, the adaptation of Algorithm 1 for Problem (4) was analyzed in [8, 9] with 𝐗𝐗\mathbf{X} satisfying coherence [8] or Restricted Isometry Property (RIP) [9]. In these cases, the algorithm can be shown to have a linear convergence rate [8, 9], where the rate depends upon the RIP constants.

Note that by Proposition 6 the support of 𝜷msubscript𝜷𝑚\boldsymbol{\beta}_{m} stabilizes after finitely many iterations, after which Algorithm 1 behaves like gradient descent on the stabilized support. If g(𝜷)𝑔𝜷g(\boldsymbol{\beta}) restricted to this support is strongly convex, then Algorithm 1 will enjoy a linear rate of convergence [45], as soon as the support stabilizes. This behavior is adaptive, i.e., Algorithm 1 does not need to be modified after the support stabilizes.

The next section describes practical post-processing schemes via which first order stationary points of Algorithm 1 can be obtained by solving a low dimensional convex optimization problem, as soon as the support is found, numerically, to stabilize. In our numerical experiments, this version of Algorithm 1 (with multiple starting points) took at most a few minutes for p=2000 and a few seconds for smaller values of p.

Polishing coefficients on the active set

Algorithm 1 detects the active set after a few iterations. Once the active set stabilizes, the algorithm may take a number of iterations to estimate the values of the regression coefficients on the active set to a high accuracy level.

In this context, we found the following simple polishing of coefficients to be useful. When the algorithm has converged to a tolerance of ϵitalic-ϵ\epsilon (104absentsuperscript104\approx 10^{-4}), we fix the current active set, {\mathcal{I}}, and solve the following lower-dimensional convex optimization problem:

\min_{\boldsymbol{\beta}:\,\beta_{i}=0,\;i\notin\mathcal{I}} \; g(\boldsymbol{\beta}).    (33)

In the context of the least squares and the least absolute deviation problems, Problem (33) reduces to a smaller dimensional least squares problem and a linear optimization problem, respectively, which can be solved very efficiently up to a very high level of accuracy.

3.3 Application to Least Squares

For the support constrained problem with squared error loss, we have g(𝜷)=12𝐲𝐗𝜷22𝑔𝜷12superscriptsubscriptnorm𝐲𝐗𝜷22g(\boldsymbol{\beta})=\mbox{$\frac{1}{2}$}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2} and g(𝜷)=𝐗(𝐲𝐗𝜷)𝑔𝜷superscript𝐗𝐲𝐗𝜷\nabla g(\boldsymbol{\beta})=-\mathbf{X}^{\prime}(\mathbf{y}-\mathbf{X}\boldsymbol{\beta}). The general algorithmic framework developed above applies in a straightforward fashion for this special case. Note that for this case =λmax(𝐗𝐗)subscript𝜆superscript𝐗𝐗\ell=\lambda_{\max}(\mathbf{X}^{\prime}\mathbf{X}).

The polishing of the regression coefficients in the active set can be performed via a least squares problem on 𝐲,𝐗I𝐲subscript𝐗𝐼\mathbf{y},\mathbf{X}_{I}, where I𝐼I denotes the support of the regression coefficients.
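Putting the pieces together for the least squares loss, here is a minimal end-to-end sketch (Python/numpy, again reusing hard_threshold_k), with the active-set polishing of (33) carried out by an ordinary least squares refit; the 1.01 safety factor ensuring L > ℓ = λ_max(X'X) and the tolerances are illustrative choices.

```python
import numpy as np

def best_subset_least_squares(X, y, k, eps=1e-4, max_iter=1000):
    """Algorithm 1 specialized to g(beta) = 0.5*||y - X beta||_2^2 (Section 3.3),
    followed by polishing on the stabilized support."""
    n, p = X.shape
    L = 1.01 * np.linalg.eigvalsh(X.T @ X).max()       # L > ell = lambda_max(X'X)
    beta = np.zeros(p)
    for _ in range(max_iter):
        grad = -X.T @ (y - X @ beta)
        beta_next = hard_threshold_k(beta - grad / L, k)
        if np.linalg.norm(beta_next - beta) <= eps:
            beta = beta_next
            break
        beta = beta_next
    active = np.flatnonzero(beta)                      # support I of the iterate
    polished = np.zeros(p)
    if active.size > 0:                                # polish: least squares on X_I
        polished[active] = np.linalg.lstsq(X[:, active], y, rcond=None)[0]
    return polished
```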

3.4 Application to Least Absolute Deviation

We will now show how the method proposed in the previous section applies to the least absolute deviation problem with support constraints in 𝜷𝜷\boldsymbol{\beta}:

\min_{\boldsymbol{\beta}} \; g_{1}(\boldsymbol{\beta}) := \|\mathbf{Y}-\mathbf{X}\boldsymbol{\beta}\|_{1} \;\;\; \text{s.t.} \;\;\; \|\boldsymbol{\beta}\|_{0}\leq k.    (34)

Since g1(𝜷)subscript𝑔1𝜷g_{1}(\boldsymbol{\beta}) is non-smooth, our framework does not apply directly. We smooth the non-differentiable g1(𝜷)subscript𝑔1𝜷g_{1}(\boldsymbol{\beta}) so that we can apply Algorithms 1 and 2. Observing that g1(𝜷)=sup𝐰1𝐘X𝜷,𝐰subscript𝑔1𝜷subscriptsupremumsubscriptnorm𝐰1𝐘𝑋𝜷𝐰g_{1}(\boldsymbol{\beta})=\sup_{\|\mathbf{w}\|_{\infty}\leq 1}\langle\mathbf{Y}-X\boldsymbol{\beta},\mathbf{w}\rangle we make use of the smoothing technique of [43] to obtain g1(𝜷;τ)=sup𝐰1(𝐘X𝜷,𝐰τ2𝐰22)subscript𝑔1𝜷𝜏subscriptsupremumsubscriptnorm𝐰1𝐘𝑋𝜷𝐰𝜏2superscriptsubscriptnorm𝐰22g_{1}(\boldsymbol{\beta};\tau)=\sup_{\|\mathbf{w}\|_{\infty}\leq 1}(\langle\mathbf{Y}-X\boldsymbol{\beta},\mathbf{w}\rangle-\frac{\tau}{2}\|\mathbf{w}\|_{2}^{2}); which is a smooth approximation of g1(β)subscript𝑔1𝛽g_{1}(\beta), with =λmax(𝐗𝐗)τsubscript𝜆superscript𝐗𝐗𝜏\ell=\frac{\lambda_{\max}(\mathbf{X}^{\prime}\mathbf{X})}{\tau} for which Algorithms 1 and 2 apply.

In order to obtain a good approximation to Problem (34), we found the following strategy to be useful in practice (a code sketch follows these steps):

  1.

    Fix τ>0𝜏0\tau>0, initialize with 𝜷0psubscript𝜷0superscript𝑝\boldsymbol{\beta}_{0}\in\mathbb{R}^{p} and repeat the following steps [2]—[3] till convergence:


  2.

    Apply Algorithm 1 (or Algorithm 2) to the smooth function g1(𝜷;τ)subscript𝑔1𝜷𝜏g_{1}(\boldsymbol{\beta};\tau). Let 𝜷τsuperscriptsubscript𝜷𝜏\boldsymbol{\beta}_{\tau}^{*} be the limiting solution.


  3.

    Decrease ττγ𝜏𝜏𝛾\tau\leftarrow\tau\gamma for some pre-defined constant γ=0.8𝛾0.8\gamma=0.8 (say), and go back to step [1] with 𝜷0=𝜷τsubscript𝜷0superscriptsubscript𝜷𝜏\boldsymbol{\beta}_{0}=\boldsymbol{\beta}_{\tau}^{*}. Exit if τ<TOL,𝜏TOL\tau<\text{TOL}, for some pre-defined tolerance.


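A sketch of this continuation scheme is given below (Python/numpy, reusing hard_threshold_k). The smoothed gradient follows from the dual representation of g_1(β;τ): the maximizing w is clip((y − Xβ)/τ, −1, 1). The constants τ_0, γ and the tolerances are illustrative.

```python
import numpy as np

def smoothed_lad_gradient(X, y, beta, tau):
    """Gradient of g_1(beta; tau): grad = -X' w with w = clip((y - X beta)/tau, -1, 1)."""
    w = np.clip((y - X @ beta) / tau, -1.0, 1.0)
    return -X.T @ w

def lad_best_subset(X, y, k, tau0=1.0, gamma=0.8, tol=1e-3, inner_eps=1e-4):
    """Continuation scheme of Section 3.4: run the discrete first order updates on
    g_1(beta; tau) and shrink tau by gamma until tau < TOL."""
    p = X.shape[1]
    lam_max = np.linalg.eigvalsh(X.T @ X).max()
    beta = np.zeros(p)
    tau = tau0
    while tau >= tol:
        L = 1.01 * lam_max / tau                       # ell = lambda_max(X'X) / tau
        for _ in range(500):
            step = beta - smoothed_lad_gradient(X, y, beta, tau) / L
            beta_next = hard_threshold_k(step, k)
            if np.linalg.norm(beta_next - beta) <= inner_eps:
                beta = beta_next
                break
            beta = beta_next
        tau *= gamma
    return beta
```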

4 A Brief Tour of the Statistical Properties of Problem (1)

As already alluded to in the introduction, there is a substantial body of impressive work characterizing the theoretical properties of best subset solutions in terms of various metrics: predictive performance, estimation of regression coefficients, and variable selection properties. For the sake of completeness, we present a brief review of some of the properties of solutions to Problem (1) in Section C.

5 Computational Experiments for Subset Selection with Least Squares Loss

In this section, we present a variety of computational experiments to assess the algorithmic and statistical performances of our approach. We consider both the classical overdetermined case with n>p𝑛𝑝n>p (Section 5.2) and the high dimensional pnmuch-greater-than𝑝𝑛p\gg n case (Section 5.3) for the least squares loss function with support constraints.

5.1 Description of Experimental Data

We demonstrate the performance of our proposal via a series of experiments on both synthetic and real data.

Synthetic Datasets.

We consider a collection of problems where 𝐱iN(𝟎,𝚺),i=1,,nformulae-sequencesimilar-tosubscript𝐱𝑖N0𝚺𝑖1𝑛\mathbf{x}_{i}\sim\text{N}(\mathbf{0},\mathbf{\Sigma}),i=1,\ldots,n are independent realizations from a p𝑝p-dimensional multivariate normal distribution with mean zero and covariance matrix 𝚺:=(σij)assign𝚺subscript𝜎𝑖𝑗\mathbf{\Sigma}:=(\sigma_{ij}). The columns of the 𝐗𝐗\mathbf{X} matrix were subsequently standardized to have unit 2subscript2\ell_{2} norm. For a fixed 𝐗n×p,subscript𝐗𝑛𝑝\mathbf{X}_{n\times p}, we generated the response 𝐲𝐲\mathbf{y} as follows: 𝐲=𝐗𝜷0+ϵ𝐲𝐗superscript𝜷0bold-italic-ϵ\mathbf{y}=\mathbf{X}\boldsymbol{\beta}^{0}+\boldsymbol{\epsilon}, where ϵiiidN(0,σ2)superscriptsimilar-toiidsubscriptitalic-ϵ𝑖𝑁0superscript𝜎2\epsilon_{i}\stackrel{{\scriptstyle\text{iid}}}{{\sim}}N(0,\sigma^{2}). We denote the number of nonzeros in 𝜷0superscript𝜷0\boldsymbol{\beta}^{0} by k0subscript𝑘0k_{0}. The choice of 𝐗,𝜷0,σ𝐗superscript𝜷0𝜎\mathbf{X},\boldsymbol{\beta}^{0},\sigma determines the Signal-to-Noise Ratio (SNR) of the problem, which is defined as: SNR=var(𝐱𝜷0)σ2.SNRvarsuperscript𝐱superscript𝜷0superscript𝜎2\text{SNR}=\frac{\text{var}(\mathbf{x}^{\prime}\boldsymbol{\beta}^{0})}{\sigma^{2}}.

We considered the following four different examples:

Example 1: We took $\sigma_{ij}=\rho^{|i-j|}$ for $i,j\in\{1,\ldots,p\}\times\{1,\ldots,p\}$. We consider different values of $k_0\in\{5,10\}$ and set $\beta^{0}_{i}=1$ for $k_0$ equi-spaced values of $i$ in the range $\{1,2,\ldots,p\}$. (In the case where exactly equi-spaced values are not possible we rounded the indices to the nearest larger integer value.)

Example 2: We took $\boldsymbol{\Sigma}=\mathbf{I}_{p\times p}$, $k_0=5$ and $\boldsymbol{\beta}^{0}=(\mathbf{1}'_{5\times 1},\mathbf{0}'_{p-5\times 1})'\in\mathbb{R}^{p}$.

Example 3: We took $\boldsymbol{\Sigma}=\mathbf{I}_{p\times p}$, $k_0=10$ and $\beta_{i}^{0}=\frac{1}{2}+(10-\frac{1}{2})\frac{(i-1)}{k_{0}}, i=1,\ldots,10$ and $\beta^{0}_{i}=0, \forall i>10$ — i.e., a vector with ten nonzero entries, with the nonzero values being equally spaced in the interval $[\frac{1}{2},10]$.

Example 4: We took $\boldsymbol{\Sigma}=\mathbf{I}_{p\times p}$, $k_0=6$ and $\boldsymbol{\beta}^{0}=(-10,-6,-2,2,6,10,\mathbf{0}_{p-6})$, i.e., a vector with six nonzero entries, equally spaced in the interval $[-10,10]$.
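For concreteness, the following sketch (ours, in NumPy) constructs the coefficient vectors $\boldsymbol{\beta}^0$ of Examples 1 through 4; the rounding rule in `beta0_example1` is one reasonable reading of the convention stated in Example 1:

```python
import numpy as np

def beta0_example1(p, k0):
    """Example 1: k0 equi-spaced indices in {1,...,p} set to one (indices rounded up)."""
    beta = np.zeros(p)
    idx = np.ceil(np.linspace(0, p - 1, k0)).astype(int)
    beta[idx] = 1.0
    return beta

def beta0_example2(p):
    """Example 2: first five coefficients equal to one."""
    beta = np.zeros(p); beta[:5] = 1.0
    return beta

def beta0_example3(p, k0=10):
    """Example 3: beta_i = 1/2 + (10 - 1/2)(i - 1)/k0 for i = 1,...,10, zero otherwise."""
    beta = np.zeros(p)
    i = np.arange(1, 11)
    beta[:10] = 0.5 + (10 - 0.5) * (i - 1) / k0
    return beta

def beta0_example4(p):
    """Example 4: six nonzero entries equally spaced in [-10, 10]."""
    beta = np.zeros(p); beta[:6] = [-10, -6, -2, 2, 6, 10]
    return beta
```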

Real Datasets

We considered the Diabetes dataset analyzed in [23]. We used the dataset with all the second order interactions included in the model, which resulted in 64 predictors. We reduced the sample size to $n=350$ by taking a random sample and standardized the response and the columns of the model matrix to have zero means and unit $\ell_2$-norm.

In addition to the above, we also considered a real microarray dataset: the Leukemia data [18]. We downloaded the processed dataset from http://stat.ethz.ch/~dettling/bagboost.html, which had $n=72$ binary responses and more than 3000 predictors. We standardized the response and columns of features to have zero means and unit $\ell_2$-norm. We reduced the set of features to 1000 by retaining the features maximally correlated (in absolute value) to the response. We call the resulting feature matrix $\mathbf{X}_{n\times p}$ with $n=72, p=1000$. We then generated a semi-synthetic dataset with continuous response as $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}^{0}+\boldsymbol{\epsilon}$, where the first five coefficients of $\boldsymbol{\beta}^{0}$ were taken as one and the rest as zero. The noise was distributed as $\epsilon_i\stackrel{\text{iid}}{\sim}N(0,\sigma^{2})$, with $\sigma^{2}$ chosen to get SNR = 7.
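A sketch of the preprocessing pipeline just described (our own NumPy illustration; the names `X_raw` and `y_raw` are ours and assume the raw Leukemia data have already been loaded):

```python
import numpy as np

def preprocess_leukemia(X_raw, y_raw, n_keep=1000, k0=5, snr=7.0, seed=0):
    rng = np.random.default_rng(seed)
    # standardize response and features: zero mean, unit l2-norm
    y = y_raw - y_raw.mean(); y /= np.linalg.norm(y)
    X = X_raw - X_raw.mean(axis=0); X /= np.linalg.norm(X, axis=0)
    # keep the n_keep features most correlated (in absolute value) with the response
    corr = np.abs(X.T @ y)                 # proportional to correlation after standardization
    X = X[:, np.argsort(-corr)[:n_keep]]
    # semi-synthetic response: first k0 coefficients equal to one, rest zero
    beta0 = np.zeros(n_keep); beta0[:k0] = 1.0
    sigma2 = np.var(X @ beta0) / snr       # sigma^2 chosen to hit the target SNR
    y_semi = X @ beta0 + rng.normal(0.0, np.sqrt(sigma2), size=X.shape[0])
    return X, y_semi, beta0
```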

Computer Specifications and Software

Computations were carried out on a Linux 64-bit server (Intel(R) Xeon(R) eight-core processor @ 1.80GHz, 16 GB of RAM) for the overdetermined $n>p$ case, and on a Dell Precision T7600 computer with an Intel Xeon E52687 sixteen-core processor @ 3.1GHz and 128GB of RAM for the high-dimensional $p\gg n$ case. The discrete first order methods were implemented in Matlab 2012b. We used Gurobi [33] version 5.5 and the Matlab interface to Gurobi for all of our experiments, apart from the computations for synthetic data for $n>p$, which were done in Gurobi via its Python 2.7 interface.

5.2 The Overdetermined Regime: $n>p$

Using the Diabetes dataset and synthetic datasets, we demonstrate the combined effect of using the discrete first order methods with the MIO approach. Together, these methods show improvements in obtaining good upper bounds and in closing the MIO gap to certify global optimality. Using synthetic datasets where we know the true linear regression model, we perform side-by-side comparisons of this method with several other state-of-the-art algorithms designed to estimate sparse linear models.

5.2.1 Obtaining Good Upper Bounds

We conducted experiments to evaluate the performance of our methods in terms of obtaining high quality solutions for Problem (1).

We considered the following three algorithms:

  1. (a) Algorithm 2 with fifty random initializations. We took fifty random starting values around $\mathbf{0}$ of the form $\min(i-1,1)\boldsymbol{\epsilon}, i=1,\ldots,50$, where $\boldsymbol{\epsilon}\sim N(\mathbf{0}_{p\times 1},4\mathbf{I})$; we found empirically that Algorithm 2 provided better upper bounds than Algorithm 1. We took the solution corresponding to the best objective value.

  2. (b) MIO with cold start, i.e., formulation (9) with a time limit of 500 seconds.

  3. (c) MIO with warm start. This was the MIO formulation initialized with the discrete first order optimization solution obtained from (a). This was run for a total of 500 seconds.

To compare the different algorithms in terms of the quality of upper bounds, we run all the algorithms for every instance and obtain the best solution among them, say, $f_*$. If $f_{\text{alg}}$ denotes the value of the best subset objective function for method “alg”, then we define the relative accuracy of the solution obtained by “alg” as:

$$\text{Relative Accuracy}=(f_{\text{alg}}-f_{*})/f_{*}, \qquad (35)$$

where $\text{alg}\in\{\text{(a)},\text{(b)},\text{(c)}\}$ as described above.

We did experiments for the Diabetes dataset for different values of $k$ (see Table 1). For each of the algorithms we report the amount of time taken by the algorithm to reach the best objective value during the time of 500 seconds.

$k$   | Discrete First Order | MIO Cold Start    | MIO Warm Start
      | Accuracy    Time     | Accuracy    Time  | Accuracy    Time
9     | 0.1306      1        | 0.0036      500   | 0           346
20    | 0.1541      1        | 0.0042      500   | 0           77
49    | 0.1915      1        | 0.0015      500   | 0           87
57    | 0.1933      1        | 0           500   | 0           2

Table 1: Quality of upper bounds for Problem (1) for the Diabetes dataset, for different values of $k$. We see that the MIO equipped with warm starts delivers the best upper bounds in the shortest overall times. The run time for the MIO with warm start includes the time taken by the discrete first order method (which was in all cases less than a second).

Using the discrete first order methods in combination with the MIO algorithm resulted in finding the best possible relative accuracy in a matter of a few minutes.

5.2.2 Improving MIO Performance via Warm Starts

We performed a series of experiments on the Diabetes dataset to obtain a globally optimal solution to Problem (1) via our approach and to understand the implications of using advanced warm starts to the MIO formulation in terms of certifying optimality. For each choice of $k$ we ran Algorithm 2 with fifty random initializations. They took less than a few seconds to run. We used the best solution as an advanced warm start to the MIO formulation (9). For each of these examples, we also ran the MIO formulation without any warm start information and also without the parameter specifications in Section 2.3 (we refer to this as “cold start”). Figure 3 summarizes the results. The figure shows that in the presence of warm starts and problem specific side information, the MIO closes the optimality gap significantly faster.
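To make the notion of an advanced warm start concrete at the solver level, the sketch below (our own illustration, assuming the Gurobi Python interface and that the variables of formulation (9) have already been created) passes a first order solution as a MIP start:

```python
def set_warm_start(model, beta_vars, z_vars, beta_fo):
    """Pass a k-sparse first order solution beta_fo as a MIP start to a gurobipy model."""
    for i, b in enumerate(beta_fo):
        beta_vars[i].Start = float(b)                 # coefficient values
        z_vars[i].Start = 1.0 if b != 0 else 0.0      # binary support indicators
    model.update()
```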

[Figure 3 panels: k=9, k=20, k=31, k=42; horizontal axis: Time (secs).]

Figure 3: The evolution of the MIO optimality gap (in $\log_{10}(\cdot)$ scale) for Problem (1), for the Diabetes dataset with $n=350, p=64$, with and without warm starts (and parameter specifications as in Section 2.3) for different values of $k$. The MIO significantly benefits from advanced warm starts delivered by Algorithm 2. In all of these examples, the global optimum was found within a very small fraction of the total time, but the proof of global optimality came later.

5.2.3 Statistical Performance

We considered datasets as described in Example 1, Section 5.1—we took different values of $n, p$ with $n>p$, and $\rho$, with $k_0=10$.

Competing Methods and Performance Measures

For every example, we considered the following learning procedures for comparison purposes: (a) the MIO approach equipped warm starts from Algorithm 2 (annotated as “MIO” in the figure), (b) the Lasso, (c) Sparsenet and (d) stepwise regression (annotated as “Step” in the figure).

We used R to compute Lasso, Sparsenet and stepwise regression using the glmnet 1.7.3, Sparsenet and Stats 3.0.2 packages respectively, which were all downloaded from CRAN at http://cran.us.r-project.org/.

In addition to the above, we have also performed comparisons with a debiased version of the Lasso: i.e., performing unrestricted least squares on the Lasso support to mitigate the bias imparted by Lasso shrinkage.

We note that Sparsenet [38] considers a penalized likelihood formulation of the form (3), where the penalty is given by the generalized MCP penalty family (indexed by $\lambda,\gamma$) for a family of values of $\gamma\geq 1$ and $\lambda\geq 0$. The family of penalties used by Sparsenet is thus given by: $p(t;\gamma;\lambda)=\lambda(|t|-\frac{t^{2}}{2\lambda\gamma})\mathbf{I}(|t|<\lambda\gamma)+\frac{\lambda^{2}\gamma}{2}\mathbf{I}(|t|\geq\lambda\gamma)$ for $\gamma,\lambda$ as described above. As $\gamma=\infty$ with $\lambda$ fixed, we get the penalty $p(t;\gamma;\lambda)=\lambda|t|$. The family above includes as a special case ($\gamma=1$) the hard thresholding penalty, a penalty recommended in the paper [60] for its useful statistical properties.
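For reference, the MCP penalty family above is straightforward to evaluate; a minimal sketch (ours, in NumPy):

```python
import numpy as np

def mcp_penalty(t, lam, gamma):
    """Generalized MCP penalty used by Sparsenet, for lam >= 0 and gamma >= 1."""
    t = np.abs(np.asarray(t, dtype=float))
    inner = lam * (t - t**2 / (2.0 * lam * gamma))   # region |t| < lam * gamma
    flat = lam**2 * gamma / 2.0                       # region |t| >= lam * gamma
    return np.where(t < lam * gamma, inner, flat)
```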

For each procedure, we obtained the “optimal” tuning parameter by selecting the model that achieved the best predictive performance on a held out validation set. Once the model $\widehat{\boldsymbol{\beta}}$ was selected, we obtained the prediction error as:

$$\text{Prediction Error}=\|\mathbf{X}\widehat{\boldsymbol{\beta}}-\mathbf{X}\boldsymbol{\beta}^{0}\|_{2}^{2}/\|\mathbf{X}\boldsymbol{\beta}^{0}\|_{2}^{2}. \qquad (36)$$

We report “prediction error” and number of non-zeros in the optimal model in our results. The results were averaged over ten random instances, for different realizations of $\mathbf{X}, \epsilon$. For every run: the training and validation data had a fixed $\mathbf{X}$ but random noise $\epsilon$.
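The selection and evaluation protocol can be summarized in a few lines; in the sketch below (our own illustration) `fit` stands for any of the learning procedures above, returning a coefficient vector for a given tuning parameter:

```python
import numpy as np

def select_and_evaluate(fit, params, X_tr, y_tr, X_val, y_val, X, beta0):
    """Pick the tuning parameter by validation error, then compute Prediction Error (36)."""
    best_beta, best_val = None, np.inf
    for par in params:
        beta_hat = fit(X_tr, y_tr, par)
        val_err = np.sum((y_val - X_val @ beta_hat) ** 2)
        if val_err < best_val:
            best_val, best_beta = val_err, beta_hat
    pred_err = np.sum((X @ (best_beta - beta0)) ** 2) / np.sum((X @ beta0) ** 2)
    return best_beta, pred_err
```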

Figure 4 presents results for data generated as per Example 1 with $n=500$ and $p=100$. We see that the MIO procedure performs very well across all the examples. Among the methods, MIO performs the best, followed by Sparsenet and Lasso, with Step(wise) exhibiting the worst performance. In terms of prediction error, the MIO performs the best, only to be marginally outperformed by Sparsenet in a few instances. This further illustrates the importance of using non-convex methods in sparse learning. Note that the MIO approach, unlike Sparsenet, certifies global optimality in terms of solving Problem (1). However, based on the plots in the upper panel, Sparsenet selects a few redundant variables, unlike MIO. Lasso delivers quite dense models and pays the price in predictive performance too, by selecting wrong variables. As the value of SNR increases, the predictive power of the methods improves, as expected. The differences in predictive errors between the methods diminish with increasing SNR values. With increasing values of $\rho$ (from left panel to right panel in the figure), the number of non-zeros selected by the Lasso in the optimal model increases.

We also performed experiments with the debiased version of the Lasso. The unrestricted least squares solution on the optimal model selected by Lasso (as shown in Figure 4) had worse predictive performance than the Lasso, with the same sparsity pattern. This is probably due to overfitting since the model selected by the Lasso is quite dense compared to $n, p$. We also tried some variants of debiased Lasso which led to models with better performances than the Lasso but the results were inferior compared to MIO — we provide a detailed description in Section D.2.

Figure 4: Figure showing the sparsity (upper panel) and predictive performances (bottom panel) for different subset selection procedures for the least squares loss. Here, we consider data generated as per Example 1, with $n=500, p=100$, $k_0=10$, for three different SNR values with [Left Panel] $\rho=0.5$, [Middle Panel] $\rho=0.8$, and [Right Panel] $\rho=0.9$. The dashed line in the top panel represents the true number of nonzero values. For each of the procedures, the optimal model was selected as the one which produced the best prediction accuracy on a separate validation set, as described in Section 5.2.3.

We also performed experiments with $n=1000, p=50$ for data generated as per Example 1. We solved the problems to provable optimality and found that the MIO performed very well when compared to other competing methods. We do not report the experiments for brevity.

5.2.4 MIO model training

We trained a sequence of best subset models (indexed by $k$) by applying the MIO approach with warm starts. Instead of running the MIO solvers from scratch for different values of $k$, we used callbacks, a feature of integer optimization solvers. Callbacks allow the user to solve an initial model, and then add additional constraints to the model one at a time. These “cuts” reduce the size of the feasible region without having to rebuild the entire optimization model. Thus, in our case, we can save time by building the initial optimization model for $k=p$. Once the solution for $k=p$ is obtained, a cut can be added to the model: $\sum_{i=1}^{p}z_{i}\leq k$ for $k=p-1$, and the model can be re-solved from this point. We apply this procedure until we arrive at a model with $k=1$.
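A sketch of this sequential scheme is given below (our own Gurobi Python illustration; it simply re-optimizes after adding a tightening cardinality cut and does not reproduce the authors' exact callback mechanism). The constructor `build_mio` is hypothetical and is assumed to return the model and the binary support variables $z$ of formulation (9):

```python
import gurobipy as gp

def best_subset_path(build_mio, X, y, p, k_min=1):
    """Solve the best subset MIO for k = p, p-1, ..., k_min by adding tightening cuts."""
    model, z = build_mio(X, y)            # hypothetical constructor of formulation (9)
    model.Params.MIPGap = 0.01            # stop at a 1% optimality gap ...
    model.Params.TimeLimit = 900          # ... or after 15 minutes, whichever comes first
    supports = {}
    for k in range(p, k_min - 1, -1):
        model.addConstr(gp.quicksum(z[i] for i in range(p)) <= k)  # cut: at most k nonzeros
        model.optimize()                  # re-solves from the current state
        supports[k] = [z[i].X for i in range(p)]
    return supports
```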

For each value of $k$ tested, the MIO best subset algorithm was set to stop the first time either an optimality gap of 1% was reached or a time limit of 15 minutes was reached. Additionally, we only tested values of $k$ from 5 through 25, and used Algorithm 2 to warm start the MIO algorithm. We observed that it was possible to obtain speedups of a factor of 2-4 by carefully tuning the optimization solver for a particular problem, but chose to maintain generality by solving with default parameters. Thus, we do not report times with the intention of accurately benchmarking the best possible time but rather to show that it is computationally tractable to solve problems to optimality using modern MIO solvers.

5.3 The High-Dimensional Regime: $p\gg n$

In this section, we investigate (a) the evolution of upper bounds in the high-dimensional regime, (b) the effect of a bounding box formulation on the speed of closing the optimality gap and (c) the statistical performance of the MIO approach in comparison to other state-of-the-art methods.

5.3.1 Obtaining Good Upper Bounds

We performed tests similar to those in Section 5.2.1 for the $p\gg n$ regime. We tested a synthetic dataset corresponding to Example 2 with $n=30, p=2000$ for varying SNR values (see Table 2) over a time of 500s. As before, using the discrete first order methods in combination with the MIO algorithm resulted in finding the best possible upper bounds in the shortest possible times.

          $k$   | Discrete First Order | MIO Cold Start    | MIO Warm Start
                | Accuracy    Time     | Accuracy    Time  | Accuracy    Time
SNR = 3    5    | 0.1647      37.2     | 1.0510      500   | 0           72.2
           6    | 0.6152      41.1     | 0.2769      500   | 0           77.1
           7    | 0.7843      40.7     | 0.8715      500   | 0           160.7
           8    | 0.5515      38.8     | 2.1797      500   | 0           295.8
           9    | 0.7131      45.0     | 0.4204      500   | 0           96.0
SNR = 7    5    | 0.5072      45.6     | 0.7737      500   | 0           65.6
           6    | 1.3221      40.3     | 0.5121      500   | 0           82.3
           7    | 0.9745      40.9     | 0.7578      500   | 0           210.9
           8    | 0.8293      40.5     | 1.8972      500   | 0           262.5
           9    | 1.1879      44.2     | 0.4515      500   | 0           254.2

Table 2: The quality of upper bounds for Problem (1) obtained by Algorithm 2, MIO with cold start and MIO warm-started with Algorithm 2. We consider the synthetic dataset of Example 2 with $n=30, p=2000$ and different values of SNR. The MIO method, when warm-started with the first order solution, performs the best in terms of getting a good upper bound in the shortest time. The metric “Accuracy” is defined in (35). The first order methods are fast but need not lead to highest quality solutions on their own. MIO improves the quality of upper bounds delivered by the first order methods and their combined effect leads to the best performance.

We also did experiments on the Leukemia dataset. In Figure 5 we demonstrate the evolution of the objective value of the best subset problem for different values of $k$. For each value of $k$, we warm-started the MIO with the solution obtained by Algorithm 2 and allowed the MIO solver to run for 4000 seconds. The best objective value obtained at the end of 4000 seconds is denoted by $f_*$. We plot the Relative Accuracy, i.e., $(f_t-f_*)/f_*$, where $f_t$ is the objective value obtained after $t$ seconds. The figure shows that the solution obtained by Algorithm 2 is improved by the MIO on various instances and the time taken to improve the upper bounds depends upon $k$. In general, for smaller values of $k$ the upper bounds obtained by the MIO algorithm stabilize earlier, i.e., the MIO finds improved solutions faster than for larger values of $k$.

Figure 5: Behavior of MIO aided with warm start in obtaining good upper bounds over time for the Leukemia dataset ($n=72, p=1000$). The vertical axis shows relative accuracy, i.e., $(f_t-f_*)/f_*$, where $f_t$ is the objective value obtained after $t$ seconds and $f_*$ denotes the best objective value obtained by the method after 4000 seconds. The colored diamonds correspond to the locations where the MIO (with warm start) attains the best solution. The figure shows that MIO improves the solution obtained by the first order method in all the instances. The time at which the best possible upper bound is obtained depends upon the choice of $k$. Typically larger $k$ values make the problem harder—hence the best solutions are obtained after a longer wait.

5.3.2 Bounding Box Formulation

With the aid of advanced warm starts as provided by Algorithm 2, the MIO obtains a very high quality solution very quickly—in most of the examples the solution thus obtained turns out to be the global minimum. However, in the typical “high-dimensional” regime, with $p\gg n$, we observe that the certificate of global optimality comes later as the lower bounds of the problem “evolve” slowly. This is observed even in the presence of warm starts and using the implied bounds as developed in Section 2.2, and is aggravated for the cold-started MIO formulation (10).

To address this, we consider the MIO formulation (37) obtained by adding bounding boxes around a local solution. These restrictions guide the MIO in restricting its search space and enable the MIO to certify global optimality inside that bounding box. We consider the following additional bounding box constraints to the MIO formulation (10):

$$\left\{\boldsymbol{\beta}:\|\mathbf{X}\boldsymbol{\beta}-\mathbf{X}\boldsymbol{\beta}_{0}\|_{1}\leq{\mathcal{L}}^{\zeta}_{\ell,\text{loc}}\right\}\;\cap\;\left\{\boldsymbol{\beta}:\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|_{1}\leq{\mathcal{L}}^{\beta}_{\ell,\text{loc}}\right\},$$

where $\boldsymbol{\beta}_{0}$ is a candidate sparse solution. The radii of the two $\ell_1$-balls above, namely, ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}$ and ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}$, are user-defined parameters and control the size of the feasible set.

Using the notation $\boldsymbol{\zeta}=\mathbf{X}\boldsymbol{\beta}$ we have the following MIO formulation (equipped with the additional bounding boxes):

$$\begin{array}{ll}
\min\limits_{\boldsymbol{\beta},\mathbf{z},\boldsymbol{\zeta}} & \frac{1}{2}\,\boldsymbol{\zeta}^{T}\boldsymbol{\zeta}-\langle\mathbf{X}'\mathbf{y},\boldsymbol{\beta}\rangle+\frac{1}{2}\,\|\mathbf{y}\|_{2}^{2}\\[4pt]
\text{s.t.} & \boldsymbol{\zeta}=\mathbf{X}\boldsymbol{\beta}\\
& (\beta_{i},1-z_{i}):\ \text{SOS type-1},\;\; i=1,\ldots,p\\
& z_{i}\in\{0,1\},\; i=1,\ldots,p\\
& \sum\limits_{i=1}^{p}z_{i}\leq k\\
& -{\mathcal{M}}_{U}\leq\beta_{i}\leq{\mathcal{M}}_{U},\; i=1,\ldots,p\\
& \|\boldsymbol{\beta}\|_{1}\leq{\mathcal{M}}_{\ell}\\
& -{\mathcal{M}}^{\zeta}_{U}\leq\zeta_{i}\leq{\mathcal{M}}^{\zeta}_{U},\; i=1,\ldots,n\\
& \|\boldsymbol{\zeta}\|_{1}\leq{\mathcal{M}}^{\zeta}_{\ell}\\
& \|\boldsymbol{\zeta}-\boldsymbol{\zeta}_{0}\|_{1}\leq{\mathcal{L}}^{\zeta}_{\ell,\text{loc}}\\
& \|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|_{1}\leq{\mathcal{L}}^{\beta}_{\ell,\text{loc}}.
\end{array} \qquad (37)$$

For large values of ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}$ (respectively, ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}$) the constraints on $\mathbf{X}\boldsymbol{\beta}$ (respectively, $\boldsymbol{\beta}$) become ineffective and one gets back formulation (10). To see the impact of these additional cutting planes in the MIO formulation, we consider a few examples as illustrated in Figures 6, 7 and 12.
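To illustrate how the two $\ell_1$ bounding boxes translate into linear constraints inside a solver, the sketch below (our own Gurobi Python illustration; `model`, `beta`, `zeta`, `beta0`, `zeta0` and the radii are assumed to exist already) adds an $\ell_1$-ball around a candidate solution via auxiliary absolute-value variables:

```python
import gurobipy as gp

def add_l1_ball(model, vars_, center, radius, name):
    """Add ||v - center||_1 <= radius as linear constraints using auxiliary variables."""
    m = len(vars_)
    t = model.addVars(m, lb=0.0, name=f"abs_{name}")   # t_i >= |v_i - center_i|
    for i in range(m):
        model.addConstr(vars_[i] - center[i] <= t[i])
        model.addConstr(center[i] - vars_[i] <= t[i])
    model.addConstr(gp.quicksum(t[i] for i in range(m)) <= radius)

# usage (radii and reference points come from a warm-start solution):
# add_l1_ball(model, beta, beta0, L_beta_loc, "beta")   # O(p) extra variables/constraints
# add_l1_ball(model, zeta, zeta0, L_zeta_loc, "zeta")   # O(n) extra variables/constraints
```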

Interpretation of the bounding boxes

A local bounding box in the variable $\boldsymbol{\zeta}=\mathbf{X}\boldsymbol{\beta}$ directs the MIO solver to seek candidate solutions that deliver models with predictive accuracy “similar” (controlled by the radius of the ball) to a reference predictive model, given by $\boldsymbol{\zeta}_{0}$. In our experiments, we typically chose $\boldsymbol{\zeta}_{0}$ as the solution delivered by running MIO (warm-started with a first order solution) for a few hundred to a few thousand seconds. More generally, $\boldsymbol{\zeta}_{0}$ may be selected by any other sparse learning method. In our experiments, we found that the run-time behavior of the MIO depends upon how correlated the columns of $\mathbf{X}$ are — more correlation leading to longer run-times.

Similarly, a bounding box around $\boldsymbol{\beta}$ directs the MIO to look for solutions in the neighborhood of a reference point $\boldsymbol{\beta}_{0}$. In our experiments, we chose the reference $\boldsymbol{\beta}_{0}$ as the solution obtained by MIO (warm-started with a first order solution) after allowing it to run for a few hundred to a few thousand seconds. We observed that, in the presence of bounding boxes in the $\boldsymbol{\beta}$-space, the MIO solver certified optimality (finding better solutions in the process) much faster than with the $\boldsymbol{\zeta}$-bounding box method.

Note that the $\boldsymbol{\beta}$-bounding box constraint leads to $O(p)$ constraints and the $\boldsymbol{\zeta}$-box leads to $O(n)$ constraints. Thus, when $p\gg n$ the additional $\boldsymbol{\zeta}$ constraints add fewer extra variables when compared to the $\boldsymbol{\beta}$ constraints.

Experiments

In the first set of experiments, we consider the Leukemia dataset with $n=72, p=1000$. We took two different values of $k\in\{5,10\}$ and for each case we ran Algorithm 2 with several random restarts. The best solution thus obtained was used to warm start the MIO formulation (10), which we ran for an additional 3600 seconds. The solution thus obtained is denoted by $\boldsymbol{\beta}_{0}$. We then consider formulation (37) with ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=\infty$ and different values of ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\text{Frac}$ (as annotated in Figure 6) — the results are displayed in Figure 6.

Leukemia dataset: Effect of a Bounding Box for MIO formulation (37). [Two panels: $k=5$ (left) and $k=10$ (right).]

Figure 6: The effect of the MIO formulation (37) for the Leukemia dataset, for different values of $k$. Here ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=\infty$ and ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\text{Frac}$. For each value of $k$, the global minimum obtained was the same for the different choices of ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}$.

We consider another set of experiments to demonstrate the performance of the MIO in certifying global optimality for different synthetic datasets with varying $n, p, k$ as well as with different structures on the bounding box. In the first case, we generated data as per Example 1 with $\rho=0.9$, $k_0=5$. We consider the case with $\boldsymbol{\zeta}_{0}=\mathbf{X}\boldsymbol{\beta}_{0}$, ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\infty$ and ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=0.5\|\mathbf{X}\boldsymbol{\beta}_{0}\|_{1}$, where $\boldsymbol{\beta}_{0}$ is a $k$-sparse solution obtained from the MIO formulation (10) run with a time limit of 1000 seconds, after being warm-started with Algorithm 2. The results are displayed in Figure 7 [Left Panel]. In the second case (with data same as before) we obtained $\boldsymbol{\beta}_{0}$ in the same fashion as described before—we took a bounding box around $\boldsymbol{\beta}_{0}$, and left the box constraint around $\mathbf{X}\boldsymbol{\beta}_{0}$ inactive, i.e., we set ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=\infty$ and ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\|\boldsymbol{\beta}_{0}\|_{1}/k$. We performed two sets of experiments, where the data were generated based on different SNR values—the results are displayed in Figure 7 with SNR = 1 [Middle Panel] and SNR = 3 [Right Panel].

In the same vein, we have Figure 12 studying the effect of formulations (37) for synthetic datasets generated as per Example 1 with $n=50, p=1000, \rho=0.9$ and $k_0=5$.

Evolution of the MIO gap for (37), effect of type of bounding box ($n=50, p=500$). [Three panels.]

Figure 7: The effect of the MIO formulation (37) for a synthetic dataset as in Example 1 with $\rho=0.9$, $k_0=5$, $n=50, p=500$, for different values of $k$. [Left Panel] ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=0.5\|\mathbf{X}\boldsymbol{\beta}_{0}\|_{1}$ and ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\infty$ for a data-set with SNR = 3. [Middle Panel] ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=\infty$, ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\|\boldsymbol{\beta}_{0}\|_{1}/k$ and SNR = 1. [Right Panel] ${\mathcal{L}}^{\zeta}_{\ell,\text{loc}}=\infty$, ${\mathcal{L}}^{\beta}_{\ell,\text{loc}}=\|\boldsymbol{\beta}_{0}\|_{1}/k$ and SNR = 3. The figure shows that the bounding boxes in terms of $\mathbf{X}\boldsymbol{\beta}$ (left panel) make the problem harder to solve, when compared to bounding boxes around $\boldsymbol{\beta}$ (middle and right panels). A possible reason is the strong correlations among the columns of $\mathbf{X}$. The SNR values do not seem to have a big impact on the run-times of the algorithms (middle and right panels).

5.3.3 Statistical Performance

To understand the statistical behavior of MIO when compared to other approaches for learning sparse models, we considered synthetic datasets for values of $n$ ranging from 30-50 and values of $p$ ranging from 1000-2000. The following methods were used for comparison purposes: (a) Algorithm 2. Here we used fifty different random initializations around $\mathbf{0}$, of the form $\min(i-1,1)N(\mathbf{0}_{p\times 1},4\mathbf{I}), i=1,\ldots,50$, and took the solution corresponding to the best objective value; (b) The MIO approach with warm starts from part (a); (c) The Lasso solution and (d) The Sparsenet solution.

For methods (a), (b) we considered ten equi-spaced values of $k$ in the range $[3, 2k_0]$ (including the optimal value of $k_0$). For each of the methods, the best model was selected in the same fashion as described in Section 5.2.3 using separate validation sets.

In addition, for some examples, we also study the performance of the debiased version of the Lasso, as described in Section 5.2.3.

In Figure 8 and Figure 9 we present selected representative results from four different examples described in Section 5.1.

Figure 8: The sparsity and predictive performance for different procedures: [Left Panel] shows Example 1 with $n=50, p=1000, \rho=0.8, k_0=5$ and [Right Panel] shows Example 2 with $n=30, p=1000$—for each instance several SNR values have been shown.
Figure 9: [Left Panel] shows performance for data generated according to Example 3 with $n=30, p=1000$ and [Right Panel] shows Example 4 with $n=50, p=2000$.

In Figure 8 the left panel shows the performance of different methods for Example 1 with $n=50, p=1000, \rho=0.8, k_0=5$. In this example, there are five non-zero coefficients: the features corresponding to the non-zero coefficients are weakly correlated and a feature having a non-zero coefficient is highly correlated with a feature having a zero coefficient. In this situation, the Lasso selects a very dense model since it fails to distinguish between a zero and a non-zero coefficient when the variables are correlated—it brings both the coefficients in the model (with shrinkage). MIO (with warm-start) performs the best—both in terms of predictive accuracy and in selecting a sparse set of coefficients. MIO obtains the sparsest model among the four methods and seems to find better solutions in terms of statistical properties than the models obtained by the first order methods alone. Interestingly, the “optimal model” selected by the first order methods is more dense than that selected by the MIO. The number of non-zero coefficients selected by MIO remains fairly stable across different SNR values, unlike the other three methods. For this example, we also experimented with the different versions of debiased Lasso. In summary: the best debiased Lasso models had performance marginally better than Lasso but quite inferior to MIO. See the results in Appendix, Section D.2 for further details.

In Figure 8 the right panel shows Example 2, with $n=30, p=1000, k_0=5$ and all non-zero coefficients equal to one. In this example, all the methods perform similarly in terms of predictive accuracy. This is because all non-zero coefficients in $\boldsymbol{\beta}^{0}$ have the same value. In fact for the smallest value of SNR, the Lasso achieves the best predictive model. In all the cases however, the MIO achieves the sparsest model with favorable predictive accuracy.

In Figure 9, for both the examples, the model matrix is an iid Gaussian ensemble. The underlying regression coefficient $\boldsymbol{\beta}^{0}$, however, is structurally different than in Example 2 (as in Figure 8, right panel). The structure in $\boldsymbol{\beta}^{0}$ is responsible for the different statistical behaviors of the four methods across Figure 8 (right panel) and Figure 9 (both panels). The alternating signs and varying amplitudes of $\boldsymbol{\beta}^{0}$ are responsible for the poor behavior of Lasso. The MIO (with warm-starts) seems to be the best among all the methods. For Example 3 (Figure 9, left panel) the predictive performances of Lasso and MIO are comparable—the MIO however delivers much sparser models than the Lasso.

The key conclusions are as follows:

  1. The MIO best subset algorithm has a significant edge in detecting the correct sparsity structure for all examples compared to Lasso, Sparsenet and the stand-alone discrete first order method.

  2. For data generated as per Example 1 with large values of $\rho$, the MIO best subset algorithm gives better predictive performance compared to its competitors.

  3. For data generated as per Examples 2 and 3, MIO delivers predictive models similar to the Lasso, but produces much sparser models. In fact, Lasso seems to perform marginally better than MIO, as a predictive model, for small values of SNR.

  4. For Example 4, MIO performs the best both in terms of predictive accuracy and delivering sparse models.

6 Computational Results for Subset Selection with Least Absolute Deviation Loss

In this section, we demonstrate how our method can be used for the best subset selection problem with LAD objective (34).

Since the main focus of this paper is the least squares loss function, we consider only a few representative examples for the LAD case. The LAD loss is appropriate when the error follows a heavy tailed distribution. The datasets used for the experiments parallel those described in Section 5.1, the difference being in the distribution of $\epsilon$. We took $\epsilon_i$ iid from a double exponential distribution with variance $\sigma^2$. The value of $\sigma^2$ was adjusted to get different values of SNR.
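A sketch of the noise generation (ours, in NumPy): a double exponential (Laplace) distribution with scale $b$ has variance $2b^2$, so a target $\sigma^2$, and hence a target SNR, can be hit by taking $b=\sigma/\sqrt{2}$.

```python
import numpy as np

def double_exponential_noise(X, beta0, snr, seed=0):
    """Laplace errors with variance sigma^2 chosen so that var(X beta0)/sigma^2 = snr."""
    rng = np.random.default_rng(seed)
    sigma2 = np.var(X @ beta0) / snr
    b = np.sqrt(sigma2 / 2.0)            # Laplace(0, b) has variance 2 b^2
    return rng.laplace(0.0, b, size=X.shape[0])
```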

Datasets analysed

We consider a set-up similar to Example 1 (Section 5.1) with $k_0=5$ and $\rho=0.9$. Different choices of $(n,p)$ were taken to cover both the overdetermined ($n=500, p=100$) and high-dimensional cases ($n=50, p=1000$ and $n=500, p=1000$).

The other competing methods used for comparison were (a) the discrete first order method (Section 3.4), (b) MIO warm-started with the first order solutions and (c) the LAD loss with $\ell_1$ regularization:

$$\min\;\;\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{1}+\lambda\|\boldsymbol{\beta}\|_{1},$$

which we denote by LAD-Lasso. The training, validation and testing were done in the same fashion as in the least squares case. For each method, we report the number of non-zeros in the optimal model and associated prediction accuracy (36).
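The LAD-Lasso benchmark is itself a convex problem; a minimal sketch of solving it for a single value of $\lambda$ (ours, using the cvxpy modeling package, which the paper does not use):

```python
import cvxpy as cp

def lad_lasso(X, y, lam):
    """Solve min ||y - X beta||_1 + lam * ||beta||_1 (an LP after standard reformulation)."""
    n, p = X.shape
    beta = cp.Variable(p)
    objective = cp.Minimize(cp.norm1(y - X @ beta) + lam * cp.norm1(beta))
    cp.Problem(objective).solve()
    return beta.value
```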

Figure 10: The sparsity and predictive performance for different procedures for $n=500, p=100$ for Problem (34). The data is generated as per Example 1 with $\rho=0.9, k_0=5$ and double exponential errors—further details are available in the text. The acronym “Lasso” refers to LAD-Lasso (6). The MIO is seen to deliver sparser models with better predictive accuracy when compared to the LAD-Lasso.

Figure 10 compares the MIO approach with others for LAD in the overdetermined case ($n>p$). Figure 11 does the same for the high-dimensional case ($p\gg n$). The conclusions parallel those for the least squares case. Since, in the example considered, the features corresponding to the non-zero coefficients are weakly correlated and a feature having a non-zero coefficient is highly correlated with a feature having a zero coefficient—the LAD-Lasso selects an overly dense model and misses out in terms of prediction error. Both the MIO (with warm-starts) and the discrete first order methods behave similarly—much better than $\ell_1$ regularization schemes. As expected, we observed that subset selection with least squares loss leads to inferior models for these examples, due to a heavy-tailed distribution of the errors.

The results in this section are similar to the least squares case. The MIO approach provides an edge both in terms of sparsity and predictive accuracy compared to Lasso both for the overdetermined and the high-dimensional case.

Figure 11: Figure showing the number of nonzero values and predictive performance for different values of $n$ and $p$ for Problem (34) (as in Figure 10). [Left panel] has $n=50, p=1000$ and [Right panel] has $n=500, p=1000$.

7 Conclusions

In this paper, we have revisited the classical best subset selection problem of choosing $k$ out of $p$ features in linear regression given $n$ observations using a modern optimization lens, i.e., MIO and a discrete extension of first order methods from continuous optimization. Exploiting the astonishing progress of MIO solvers in the last twenty-five years, we have shown that this approach solves problems with $n$ in the 1000s and $p$ in the 100s in minutes to provable optimality, and finds near optimal solutions for $n$ in the 100s and $p$ in the 1000s in minutes. Importantly, the solutions provided by the MIO approach significantly outperform other state of the art methods like Lasso in achieving sparse models with good predictive power. Unlike all other methods, the MIO approach always provides a guarantee on its sub-optimality even if the algorithm is terminated early. Moreover, it can accommodate side constraints on the coefficients of the linear regression and also extends to finding best subset solutions for the least absolute deviation loss function.

While continuous optimization methods have played and continue to play an important role in statistics over the years, discrete optimization methods have not. The evidence in this paper as well as in [2] suggests that MIO methods are tractable and lead to desirable properties (improved accuracy and sparsity among others) at the expense of higher, but still reasonable, computational times.

Acknowledgements

We would like to thank the Associate editor and two reviewers for their comments that helped us improve the paper. A major part of the work was performed when R.M. was at Columbia University.

References

  • [1] Top500 Supercomputer Sites. Directory page for Top500 lists; result for each list since June 1993. http://www.top500.org/statistics/sublist/. Accessed: 2013-12-04.
  • [2] D. Bertsimas and R. Mazumder. Least quantile regression via modern optimization. The Annals of Statistics, 42(6):2494–2525, 2014.
  • [3] D. Bertsimas and R. Shioda. Algorithm for cardinality-constrained quadratic optimization. Computational Optimization and Applications, 43(1):1–22, 2009.
  • [4] D. Bertsimas and R. Weismantel. Optimization over Integers. Dynamic Ideas, Belmont, 2005.
  • [5] P. Bickel, Y. Ritov, and A. Tsybakov. Simultaneous analysis of lasso and Dantzig selector. The Annals of Statistics, pages 1705–1732, 2009.
  • [6] D. Bienstock. Computational study of a family of mixed-integer quadratic programming problems. Mathematical Programming, 74(2):121–140, 1996.
  • [7] R. E. Bixby. A brief history of linear and mixed-integer programming computation. Documenta Mathematica, Extra Volume: Optimization Stories, pages 107–121, 2012.
  • [8] T. Blumensath and M. Davies. Iterative thresholding for sparse approximations. Journal of Fourier Analysis and Applications, 14(5-6):629–654, 2008.
  • [9] T. Blumensath and M. Davies. Iterative hard thresholding for compressed sensing. Applied and Computational Harmonic Analysis, 27(3):265–274, 2009.
  • [10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, Cambridge, 2004.
  • [11] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data. Springer, 2011.
  • [12] F. Bunea, A. B. Tsybakov, M. H. Wegkamp, et al. Aggregation for Gaussian regression. The Annals of Statistics, 35(4):1674–1697, 2007.
  • [13] E. Candes, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted $\ell_1$ minimization. Journal of Fourier Analysis and Applications, 14(5):877–905, 2008.
  • [14] E. Candes. The restricted isometry property and its implications for compressed sensing. Comptes Rendus Mathematique, 346(9):589–592, 2008.
  • [15] E. Candès and Y. Plan. Near-ideal model selection by $\ell_1$ minimization. The Annals of Statistics, 37(5A):2145–2177, 2009.
  • [16] E. Candes and T. Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Transactions on Information Theory, 52(12):5406–5425, 2006.
  • [17] S. Chen, D. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing, 20(1):33–61, 1998.
  • [18] M. Dettling. BagBoosting for tumor classification with gene expression data. Bioinformatics, 20(18):3583–3593, 2004.
  • [19] D. Donoho. For most large underdetermined systems of equations, the minimal $\ell^1$-norm solution is the sparsest solution. Communications on Pure and Applied Mathematics, 59:797–829, 2006.
  • [20] D. Donoho and I. Johnstone. Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425–455, 1993.
  • [21] D. Donoho and M. Elad. Optimally sparse representation in general (nonorthogonal) dictionaries via $\ell_1$ minimization. Proceedings of the National Academy of Sciences, 100(5):2197–2202, 2003.
  • [22] D. Donoho and X. Huo. Uncertainty principles and ideal atomic decomposition. IEEE Transactions on Information Theory, 47(7):2845–2862, 2001.
  • [23] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression (with discussion). Annals of Statistics, 32(2):407–499, 2004.
  • [24] J. Fan and R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.
  • [25] J. Fan and J. Lv. Nonconcave penalized likelihood with NP-dimensionality. IEEE Transactions on Information Theory, 57(8):5467–5484, 2011.
  • [26] Y. Fan and J. Lv. Asymptotic equivalence of regularization methods in thresholded parameter space. Journal of the American Statistical Association, 108(503):1044–1061, 2013.
  • [27] I. Frank and J. Friedman. A statistical view of some chemometrics regression tools (with discussion). Technometrics, 35(2):109–148, 1993.
  • [28] J. Friedman. Fast sparse regression and classification. Technical report, Department of Statistics, Stanford University, 2008.
  • [29] J. Friedman, T. Hastie, H. Hoefling, and R. Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 2(1):302–332, 2007.
  • [30] G. Furnival and R. Wilson. Regression by leaps and bounds. Technometrics, 16:499–511, 1974.
  • [31] E. Greenshtein. Best subset selection, persistence in high-dimensional statistical learning and optimization under $\ell_1$ constraint. The Annals of Statistics, 34(5):2367–2386, 2006.
  • [32] E. Greenshtein and Y. Ritov. Persistence in high-dimensional linear predictor selection and the virtue of overparametrization. Bernoulli, 10(6):971–988, 2004.
  • [33] Gurobi Optimization, Inc. Gurobi optimizer reference manual, 2013. URL http://www.gurobi.com.
  • [34] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, Second Edition: Data Mining, Inference, and Prediction. Springer Series in Statistics. Springer, New York, 2 edition, 2009.
  • [35] K. Knight and W. Fu. Asymptotics for lasso-type estimators. Annals of Statistics, 28(5):1356–1378, 2000.
  • [36] P.-L. Loh and M. Wainwright. Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems, pages 476–484, 2013.
  • [37] J. Lv and Y. Fan. A unified approach to model selection and sparse recovery using regularized least squares. The Annals of Statistics, pages 3498–3528, 2009.
  • [38] R. Mazumder, J. Friedman, and T. Hastie. SparseNet: Coordinate descent with non-convex penalties. Journal of the American Statistical Association, 106(495):1125–1138, 2011.
  • [39] N. Meinshausen and P. Bühlmann. High-dimensional graphs and variable selection with the lasso. Annals of Statistics, 34:1436–1462, 2006.
  • [40] A. Miller. Subset Selection in Regression. CRC Press, Washington, 2002.
  • [41] B. Natarajan. Sparse approximate solutions to linear systems. SIAM Journal on Computing, 24(2):227–234, 1995.
  • [42] G. Nemhauser. Integer programming: the global impact. Presented at EURO/INFORMS, Rome, Italy, 2013. http://euro2013.org/wp-content/uploads/Nemhauser_EuroXXVI.pdf. Accessed: 2013-12-04.
  • [43] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, Series A, 103:127–152, 2005.
  • [44] Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain, 2007.
  • [45] Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer, Norwell, 2004.
  • [46] G. Raskutti, M. Wainwright, and B. Yu. Minimax rates of estimation for high-dimensional linear regression over $\ell_q$-balls. IEEE Transactions on Information Theory, 57(10):6976–6994, 2011.
  • [47] R. Rockafellar. Convex Analysis. Princeton University Press, Princeton, 1996.
  • [48] X. Shen, W. Pan, Y. Zhu, and H. Zhou. On constrained and regularized high-dimensional regression. Annals of the Institute of Statistical Mathematics, 65(5):807–832, 2013.
  • [49] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
  • [50] R. Tibshirani. Regression shrinkage and selection via the lasso: a retrospective. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(3):273–282, 2011.
  • [51] J. Tropp. Just relax: Convex programming methods for identifying sparse signals in noise. IEEE Transactions on Information Theory, 52(3):1030–1051, 2006.
  • [52] S. van de Geer, P. Bühlmann, and S. Zhou. The adaptive and the thresholded lasso for potentially misspecified models (and a lower bound for the lasso). Electronic Journal of Statistics, 5:688–749, 2011.
  • [53] M. Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using $\ell_1$-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55(5):2183–2202, 2009.
  • [54] C.-H. Zhang. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010a.
  • [55] C.-H. Zhang and J. Huang. The sparsity and bias of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36(4):1567–1594, 2008.
  • [56] C.-H. Zhang and T. Zhang. A general theory of concave regularization for high-dimensional sparse estimation problems. Statistical Science, 27(4):576–593, 2012.
  • [57] T. Zhang. Analysis of multi-stage convex relaxation for sparse regularization. The Journal of Machine Learning Research, 11:1081–1107, 2010b.
  • [58] Y. Zhang, M. Wainwright, and M. I. Jordan. Lower bounds on the performance of polynomial-time algorithms for sparse linear regression. arXiv preprint arXiv:1402.1918, 2014.
  • [59] P. Zhao and B. Yu. On model selection consistency of lasso. Journal of Machine Learning Research, 7:2541–2563, 2006.
  • [60] Z. Zheng, Y. Fan, and J. Lv. High dimensional thresholded regression and shrinkage effect. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(3):627–649, 2014.
  • [61] H. Zou. The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429, 2006.
  • [62] H. Zou and R. Li. One-step sparse estimates in nonconcave penalized likelihood problems. The Annals of Statistics, 36(4):1509–1533, 2008.

Appendix and Supplementary Material

Appendix A Additional Details for Section 2

A.1 Solving the convex quadratic optimization Problems in Section 2.3.2

We show here that the convex quadratic optimization problems appearing in Section 2.3.2 are indeed quite simple and can be solved with small computational cost.

We first consider Problem (18), namely the computation of $u_i^{-}$, which is a minimization problem. We assume without loss of generality that the feasible set of Problem (18) is non-empty. Thus, by standard results in quadratic optimization [10], there exists a $\tau$ such that:

\[
\nabla\left(\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\tau\beta_{i}\right)=0,
\]

where $\nabla$ denotes the derivative with respect to $\boldsymbol{\beta}$, and a $\boldsymbol{\beta}$ that satisfies the above gradient condition must also be feasible for Problem (18). Simplifying the above equation, we get:

\[
\mathbf{X}^{\prime}\mathbf{X}\boldsymbol{\beta}=\mathbf{X}^{\prime}\mathbf{y}-\tau e_{i},
\]

where $e_{i}$ is the vector in $\Re^{p}$ whose $i$th coordinate is one and whose remaining coordinates are zero. Simplifying the above expression, we have

\[
\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^{2}_{2}=\|(\mathbb{I}-P_{X})\mathbf{y}+\tau q_{i}\|_{2}^{2}.
\]

Above, $\mathbb{I}$ is the identity matrix of size $n\times n$, $P_{X}$ is the familiar projection matrix given by $\mathbf{X}(\mathbf{X}^{\prime}\mathbf{X})^{-1}\mathbf{X}^{\prime}$ (note that we assume here that $n>p$, which typically guarantees that $\mathbf{X}^{\prime}\mathbf{X}$ is invertible with probability one, provided the entries of $\mathbf{X}$ are drawn from a continuous distribution), and $q_{i}=\mathbf{X}(\mathbf{X}^{\prime}\mathbf{X})^{-1}e_{i}$. Observing that the constraint in Problem (18) is active at the optimum, one can readily solve the resulting simple quadratic equation for $\tau$; this value of $\tau$ in turn yields the optimal solution $\widetilde{\boldsymbol{\beta}}$, and hence the optimum, of Problem (18).

The above argument readily applies to Problem (18) for the computation of $u_i^{+}$, by writing it as an equivalent minimization problem and observing that:

\[
-u_{i}^{+}=\min_{\boldsymbol{\beta}}\;-\beta_{i}\quad\text{s.t.}\quad\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|^{2}_{2}\leq\text{UB}.
\]

The above derivation can also be adapted to the case of Problem (19). Towards this end, notice that for estimating $v_i^{-}$ the above steps (for computing $u_i^{-}$) are modified as follows: $e_{i}$ gets replaced by $\mathbf{x}_{i}\in\Re^{p}$ (the $i$th row of $\mathbf{X}$); and $P_{X}$ denotes the projection matrix onto the column space of $\mathbf{X}$, even if the matrix $\mathbf{X}^{\prime}\mathbf{X}$ is not invertible (since here, we consider arbitrary $n,p$).

In addition, Problems (18) for the different variables and Problems (19) for the different samples can be solved completely independently, in parallel.
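To make the above concrete, the following small sketch (our own illustrative Python/numpy code, not the authors' implementation; the helper name coefficient_bounds is ours) computes $u_i^{-}$ and $u_i^{+}$ for all variables by solving the scalar quadratic equation in $\tau$ described above, under the convention $\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\leq\text{UB}$ of (18) and assuming $n>p$ with $\mathbf{X}^{\prime}\mathbf{X}$ invertible.

import numpy as np

def coefficient_bounds(X, y, UB):
    # Sketch of Section A.1: for each coordinate i solve
    #   u_i^- = min beta_i  s.t.  0.5*||y - X beta||_2^2 <= UB,
    #   u_i^+ = max beta_i  s.t.  0.5*||y - X beta||_2^2 <= UB.
    # Stationarity gives beta(tau) = (X'X)^{-1}(X'y - tau*e_i); since (I - P_X)y is
    # orthogonal to q_i = X(X'X)^{-1}e_i, the active constraint reduces to the quadratic
    #   ||(I - P_X)y||^2 + tau^2 * ||q_i||^2 = 2*UB.
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ls = XtX_inv @ (X.T @ y)               # least squares solution
    r2 = np.sum((y - X @ beta_ls) ** 2)         # ||(I - P_X)y||^2
    lower, upper = np.empty(p), np.empty(p)
    for i in range(p):
        q_norm2 = XtX_inv[i, i]                 # ||q_i||^2 = [(X'X)^{-1}]_{ii}
        tau = np.sqrt(max(2.0 * UB - r2, 0.0) / q_norm2)
        lower[i] = beta_ls[i] - tau * XtX_inv[i, i]   # u_i^-
        upper[i] = beta_ls[i] + tau * XtX_inv[i, i]   # u_i^+
    return lower, upper

The $p$ bound computations in the loop above are independent of one another, so they can be parallelized directly, as noted in the text.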

A.2 Details for Section 2.3.4

Note that in Problem (10) we consider a uniform bound on the $\beta_{i}$'s: $-{\mathcal{M}}_{U}\leq\beta_{i}\leq{\mathcal{M}}_{U}$ for all $i=1,\ldots,p$. Some of the variables $\beta_{i}$ may have larger amplitude than others, so it may be reasonable to let the bounds depend upon the variable index $i$. Thus motivated, for added flexibility, one can consider the following (adaptive) bounds on the $\beta_{i}$'s: $-{\mathcal{M}}^{i}_{U}\leq\beta_{i}\leq{\mathcal{M}}^{i}_{U}$ for $i=1,\ldots,p$. The parameters ${\mathcal{M}}^{i}_{U}$ can be taken as $\max\{|u^{+}_{i}|,|u^{-}_{i}|\}$, as defined in (18).

More generally, one can also consider asymmetric bounds on $\beta_{i}$: $u_{i}^{-}\leq\beta_{i}\leq u_{i}^{+}$ for all $i$.

Note that the above ideas for bounding the $\beta_{i}$'s can also be extended to obtain sample-specific bounds on $\langle\mathbf{x}_{i},\boldsymbol{\beta}\rangle$ for $i=1,\ldots,n$.

The bounds on $\|\widehat{\boldsymbol{\beta}}\|_{1}$ and on $\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{1},\|\mathbf{X}\widehat{\boldsymbol{\beta}}\|_{\infty}$ can also be adapted to the above variable-dependent bounds on the $\beta_{i}$'s.

While the above modifications may lead to marginally improved performance, we do not dwell much on them, mainly for the sake of a clear exposition.
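For completeness, a small illustrative continuation of the hypothetical coefficient_bounds sketch from Section A.1 (names and code are ours, not the paper's) shows how the adaptive and asymmetric bounds above could be assembled:

import numpy as np
# lower, upper hold the vectors (u_i^-) and (u_i^+) from the A.1 sketch.
# M_U_i = max{|u_i^+|, |u_i^-|} gives the adaptive symmetric bounds, while
# (lower, upper) can be used directly as asymmetric box constraints
# u_i^- <= beta_i <= u_i^+ in the MIO formulation.
def adaptive_bounds(lower, upper):
    return np.maximum(np.abs(lower), np.abs(upper))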

A.3 Proof of Proposition 2

Proof

(a) Given a set $I$, we define $\mathbf{G}:=\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}-\mathbf{I}$, and let $g_{ij}$ denote the $(i,j)$th entry of $\mathbf{G}$. For any $\mathbf{u}\in\mathbb{R}^{k}$ we have
\begin{align*}
\max_{\|\mathbf{u}\|_{1}=1}\|\mathbf{G}\mathbf{u}\|_{1} &= \max_{\|\mathbf{u}\|_{1}=1}\left(\sum_{i=1}^{k}\Big|\sum_{j=1}^{k}g_{ij}u_{j}\Big|\right)\\
&\leq \max_{\|\mathbf{u}\|_{1}=1}\left(\sum_{i=1}^{k}\sum_{j=1}^{k}|u_{j}||g_{ij}|\right)\\
&= \max_{\|\mathbf{u}\|_{1}=1}\left(\sum_{j=1}^{k}|u_{j}|\sum_{i\neq j}|g_{ij}|\right) && (g_{jj}=0)\\
&\leq \max_{\|\mathbf{u}\|_{1}=1}\left(\mu[k-1]\|\mathbf{u}\|_{1}\right) && \left(\textstyle\sum_{i\neq j}|g_{ij}|\leq\mu[k-1]\right)\\
&= \mu[k-1].
\end{align*}
(b) Using $\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}=\mathbf{I}+\mathbf{G}$ and standard power-series convergence (which is valid since $\|\mathbf{G}\|_{1,1}<1$) we obtain
\[
\|(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\|_{1,1}=\|\left(\mathbf{I}+\mathbf{G}\right)^{-1}\|_{1,1}\leq\sum_{i=0}^{\infty}\|\mathbf{G}\|^{i}_{1,1}=\frac{1}{1-\|\mathbf{G}\|_{1,1}}\leq\frac{1}{1-\mu[k-1]}. \qquad\Box
\]
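For a quick numerical sanity check of this bound (an illustrative sketch of our own, not part of the paper; it assumes the columns of $\mathbf{X}$ have unit $\ell_2$-norm, so that g_norm below equals $\|\mathbf{G}\|_{1,1}$, which is at most the cumulative coherence $\mu[k-1]$ used above):

import numpy as np

def prop2_check(X, I):
    # Verifies ||(X_I' X_I)^{-1}||_{1,1} <= 1/(1 - ||G||_{1,1}) <= 1/(1 - mu[k-1]).
    XI = X[:, list(I)]
    G = XI.T @ XI - np.eye(len(I))
    lhs = np.abs(np.linalg.inv(XI.T @ XI)).sum(axis=0).max()   # induced (1,1)-norm
    g_norm = np.abs(G).sum(axis=0).max()                       # ||G||_{1,1}
    rhs = 1.0 / (1.0 - g_norm) if g_norm < 1 else np.inf
    return lhs, rhs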

A.4 Proof of Theorem 2.1

Proof

(a) Since $\widehat{\boldsymbol{\beta}}_{I}=(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}\mathbf{y}$ we have
\[
\|\widehat{\boldsymbol{\beta}}\|_{1}=\|\widehat{\boldsymbol{\beta}}_{I}\|_{1}\leq\|(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\|_{1,1}\|\mathbf{X}^{\prime}_{I}\mathbf{y}\|_{1}. \tag{38}
\]
Note that
\[
\|\mathbf{X}^{\prime}_{I}\mathbf{y}\|_{1}=\sum_{j\in I}|\langle\mathbf{X}_{j},\mathbf{y}\rangle|\leq\max_{I,|I|=k}\sum_{j\in I}|\langle\mathbf{X}_{j},\mathbf{y}\rangle|\leq\sum_{j=1}^{k}|\langle\mathbf{X}_{(j)},\mathbf{y}\rangle|. \tag{39}
\]
Applying Part (b) of Proposition 2 and (39) to (38), we obtain (14).
(b) We write $\widehat{\boldsymbol{\beta}}_{I}=\mathbf{A}\mathbf{y}$ for $\mathbf{A}=(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}$. If $\mathbf{a}_{i},i=1,\ldots,k$ denote the rows of $\mathbf{A}$ we have:
\[
\|\widehat{\boldsymbol{\beta}}_{I}\|_{\infty}=\max_{i=1,\ldots,k}|\langle\mathbf{a}_{i},\mathbf{y}\rangle|\leq\left(\max_{i=1,\ldots,k}\|\mathbf{a}_{i}\|_{2}\right)\|\mathbf{y}\|_{2}. \tag{40}
\]
For every $i=1,\ldots,k$ we have
\begin{align}
\|\mathbf{a}_{i}\|_{2} &\leq \max_{\|\mathbf{u}\|_{2}=1}\|\mathbf{A}\mathbf{u}\|_{2} \nonumber\\
&= \max_{\|\mathbf{u}\|_{2}=1}\|(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}\mathbf{u}\|_{2} \nonumber\\
&\leq \lambda_{\max}\left((\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}\right) \nonumber\\
&= \max\left\{\frac{1}{d_{1}},\ldots,\frac{1}{d_{k}}\right\}, \tag{41}
\end{align}
where $d_{1},\ldots,d_{k}$ are the (nonzero) singular values of the matrix $\mathbf{X}_{I}$. To see how one arrives at (41), let us denote the singular value decomposition of $\mathbf{X}_{I}=\mathbf{UDV}^{\prime}$ with $\mathbf{D}=\mathrm{diag}\left(d_{1},d_{2},\ldots,d_{k}\right)$. We then have
\[
(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}=(\mathbf{VD}^{-2}\mathbf{V}^{\prime})(\mathbf{UDV}^{\prime})^{\prime}=\mathbf{VD}^{-1}\mathbf{U}^{\prime}
\]
and the singular values of $(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}$ are thus $1/d_{i}$, $i=1,\ldots,k$.

The eigenvalues of $\mathbf{X}^{\prime}_{I}\mathbf{X}_{I}$ are $d_{i}^{2}$ and from (12) we obtain that $d_{i}^{2}\geq\eta_{k}$. Using (41) we thus obtain
\[
\max_{i=1,\ldots,k}\|\mathbf{a}_{i}\|_{2}\leq\frac{1}{\sqrt{\eta_{k}}}. \tag{42}
\]
Substituting the bound (42) into (40) we obtain
\[
\|\widehat{\boldsymbol{\beta}}_{I}\|_{\infty}\leq\frac{1}{\sqrt{\eta_{k}}}\|\mathbf{y}\|_{2}. \tag{43}
\]
Using the notation $\tilde{\mathbf{A}}=(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}$, with rows $\tilde{\mathbf{a}}_{i}$, we have
\begin{align}
\|\widehat{\boldsymbol{\beta}}_{I}\|_{\infty} &= \max_{i=1,\ldots,k}|\langle\tilde{\mathbf{a}}_{i},\mathbf{X}^{\prime}_{I}\mathbf{y}\rangle| \nonumber\\
&\leq \left(\max_{i=1,\ldots,k}\|\tilde{\mathbf{a}}_{i}\|_{2}\right)\|\mathbf{X}^{\prime}_{I}\mathbf{y}\|_{2} \nonumber\\
&\leq \lambda_{\max}\left((\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\right)\|\mathbf{X}^{\prime}_{I}\mathbf{y}\|_{2} \nonumber\\
&= \left(\max_{i=1,\ldots,k}\frac{1}{d_{i}^{2}}\right)\sqrt{\sum_{j\in I}|\langle\mathbf{X}_{j},\mathbf{y}\rangle|^{2}} \nonumber\\
&\leq \frac{1}{\eta_{k}}\sqrt{\sum_{j=1}^{k}|\langle\mathbf{X}_{(j)},\mathbf{y}\rangle|^{2}}. \tag{44}
\end{align}
Combining (43) and (44) we obtain (15).

(c) We have
\[
\|\mathbf{X}_{I}\widehat{\boldsymbol{\beta}}_{I}\|_{1}\leq\sum_{i=1}^{n}|\langle\mathbf{x}_{i},\widehat{\boldsymbol{\beta}}_{I}\rangle|\leq\sum_{i=1}^{n}\|\mathbf{x}_{i}\|_{\infty}\|\widehat{\boldsymbol{\beta}}_{I}\|_{1}=\left(\sum_{i=1}^{n}\|\mathbf{x}_{i}\|_{\infty}\right)\|\widehat{\boldsymbol{\beta}}_{I}\|_{1}. \tag{45}
\]
Let $\mathbf{P}_{I}:=\mathbf{X}_{I}(\mathbf{X}^{\prime}_{I}\mathbf{X}_{I})^{-1}\mathbf{X}^{\prime}_{I}$ denote the projection onto the columns of $\mathbf{X}_{I}$. We have $\|\mathbf{P}_{I}\mathbf{y}\|_{2}\leq\|\mathbf{y}\|_{2}$, leading to:
\[
\|\mathbf{X}_{I}\widehat{\boldsymbol{\beta}}_{I}\|_{1}=\|\mathbf{P}_{I}\mathbf{y}\|_{1}\leq\sqrt{k}\|\mathbf{P}_{I}\mathbf{y}\|_{2}\leq\sqrt{k}\|\mathbf{y}\|_{2}, \tag{46}
\]
where we used that for any $\mathbf{a}\in\mathbb{R}^{m}$, we have $\sqrt{m}\|\mathbf{a}\|_{2}\geq\|\mathbf{a}\|_{1}$. Combining (45) and (46) we obtain (16).
(d) For any vector $\boldsymbol{\beta}_{I}$ which has zero entries in the coordinates outside $I$, we have:
\[
\|\mathbf{X}\boldsymbol{\beta}_{I}\|_{\infty}\leq\max_{i=1,\ldots,n}|\langle\mathbf{x}_{i},\boldsymbol{\beta}_{I}\rangle|\leq\max_{i=1,\ldots,n}\|\mathbf{x}_{i}\|_{1:k}\|\boldsymbol{\beta}_{I}\|_{\infty},
\]
leading to (17). $\Box$
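To make the quantities in Theorem 2.1 concrete, the sketch below (our own code, not the authors') evaluates the data-driven bounds (14) and (15) from $(\mathbf{X},\mathbf{y},k)$, assuming the columns of $\mathbf{X}$ have unit $\ell_2$-norm and that $\mu[k-1]<1$; here $\eta_{k}$ is replaced by the computable coherence-based lower bound $1-\mu[k-1]$, which is one standard choice rather than the exact minimum eigenvalue in (12).

import numpy as np

def theorem21_bounds(X, y, k):
    # Data-driven bounds (14) and (15), using 1 - mu[k-1] as a lower bound on eta_k.
    n, p = X.shape
    if k > 1:
        col_sorted = -np.sort(-np.abs(X.T @ X - np.eye(p)), axis=0)   # |<X_i, X_j>|, i != j
        mu = col_sorted[:k - 1, :].sum(axis=0).max()                  # cumulative coherence mu[k-1]
    else:
        mu = 0.0
    top_k = (-np.sort(-np.abs(X.T @ y)))[:k]       # k largest |<X_(j), y>| in decreasing order
    eta_k = 1.0 - mu                                # coherence-based lower bound on eta_k
    bound_l1 = top_k.sum() / (1.0 - mu)             # bound (14) on ||beta_hat||_1
    bound_linf = min(np.linalg.norm(y) / np.sqrt(eta_k),
                     np.sqrt(np.sum(top_k ** 2)) / eta_k)   # bound (15) on ||beta_hat||_inf
    return bound_l1, bound_linf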

Appendix B Proofs and Technical Details for Section 3

B.1 Proof of Proposition 6

Proof

(a) Let $\boldsymbol{\beta}$ be a vector satisfying $\|\boldsymbol{\beta}\|_{0}\leq k$. Using the notation $\widehat{\boldsymbol{\eta}}\in\mathbf{H}_{k}\left(\boldsymbol{\beta}-\frac{1}{L}\nabla g(\boldsymbol{\beta})\right)$ we have the following chain of inequalities:
\begin{align*}
g(\boldsymbol{\beta}) &= Q_{L}(\boldsymbol{\beta},\boldsymbol{\beta})\\
&\geq \inf_{\|\boldsymbol{\eta}\|_{0}\leq k} Q_{L}(\boldsymbol{\eta},\boldsymbol{\beta})\\
&= \inf_{\|\boldsymbol{\eta}\|_{0}\leq k}\left(\frac{L}{2}\|\boldsymbol{\eta}-\boldsymbol{\beta}\|_{2}^{2}+\langle\nabla g(\boldsymbol{\beta}),\boldsymbol{\eta}-\boldsymbol{\beta}\rangle+g(\boldsymbol{\beta})\right)\\
&= \inf_{\|\boldsymbol{\eta}\|_{0}\leq k}\left(\frac{L}{2}\left\|\boldsymbol{\eta}-\left(\boldsymbol{\beta}-\frac{1}{L}\nabla g(\boldsymbol{\beta})\right)\right\|_{2}^{2}-\frac{1}{2L}\|\nabla g(\boldsymbol{\beta})\|_{2}^{2}+g(\boldsymbol{\beta})\right)\\
&= \frac{L}{2}\left\|\widehat{\boldsymbol{\eta}}-\left(\boldsymbol{\beta}-\frac{1}{L}\nabla g(\boldsymbol{\beta})\right)\right\|_{2}^{2}-\frac{1}{2L}\|\nabla g(\boldsymbol{\beta})\|_{2}^{2}+g(\boldsymbol{\beta}) && \text{(from (26))}\\
&= \frac{L}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}+\langle\nabla g(\boldsymbol{\beta}),\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\rangle+g(\boldsymbol{\beta})\\
&= \frac{L-\ell}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}+\frac{\ell}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}+\langle\nabla g(\boldsymbol{\beta}),\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\rangle+g(\boldsymbol{\beta})\\
&\geq \frac{L-\ell}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}+\underbrace{\left(\frac{\ell}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}+\langle\nabla g(\boldsymbol{\beta}),\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\rangle+g(\boldsymbol{\beta})\right)}_{Q_{\ell}(\widehat{\boldsymbol{\eta}},\boldsymbol{\beta})}\\
&\geq \frac{L-\ell}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}+g(\widehat{\boldsymbol{\eta}}). && \text{(from (25))}
\end{align*}
This chain of inequalities leads to:
\[
g(\boldsymbol{\beta})-g(\widehat{\boldsymbol{\eta}})\geq\frac{L-\ell}{2}\left\|\widehat{\boldsymbol{\eta}}-\boldsymbol{\beta}\right\|_{2}^{2}. \tag{47}
\]
Applying (47) with $\boldsymbol{\beta}=\boldsymbol{\beta}_{m}$ and $\widehat{\boldsymbol{\eta}}=\boldsymbol{\beta}_{m+1}$, the vectors generated by Algorithm 1, we obtain (31). This implies that the objective values $g(\boldsymbol{\beta}_{m})$ are decreasing, and since the sequence is bounded below ($g(\boldsymbol{\beta})\geq 0$), we obtain that $g(\boldsymbol{\beta}_{m})$ converges as $m\rightarrow\infty$.
(b) Since $L>\ell$, the result follows from part (a).
(c) The condition $\underline{\alpha}_{k}>0$ means that for all $m$ sufficiently large, the entry $|\beta_{(k),m}|$ will remain (uniformly) bounded away from zero. We will use this to prove that the support of $\boldsymbol{\beta}_{m}$ converges. For the purpose of establishing a contradiction, suppose that the support does not converge. Then there are infinitely many values of $m^{\prime}$ such that $\mathbf{1}_{m^{\prime}}\neq\mathbf{1}_{m^{\prime}+1}$. Using the fact that $\|\boldsymbol{\beta}_{m}\|_{0}=k$ for all large $m$ we have
\[
\|\boldsymbol{\beta}_{m^{\prime}}-\boldsymbol{\beta}_{m^{\prime}+1}\|_{2}\geq\sqrt{\beta_{m^{\prime},i}^{2}+\beta_{m^{\prime}+1,j}^{2}}\geq\frac{|\beta_{m^{\prime},i}|+|\beta_{m^{\prime}+1,j}|}{\sqrt{2}}, \tag{48}
\]
where $i,j$ are such that $\beta_{m^{\prime}+1,i}=\beta_{m^{\prime},j}=0$. As $m^{\prime}\rightarrow\infty$, the quantity on the right hand side of (48) remains bounded away from zero since $\underline{\alpha}_{k}>0$. This contradicts the fact that $\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\rightarrow\mathbf{0}$, as established in part (b). Thus $\mathbf{1}_{m}$ converges, and since $\mathbf{1}_{m}$ is a discrete sequence, it converges after finitely many iterations, that is, $\mathbf{1}_{m}=\mathbf{1}_{m+1}$ for all $m\geq M^{*}$. Algorithm 1 then becomes a vanilla gradient descent algorithm, restricted to the support $\mathbf{1}_{m}$ for $m\geq M^{*}$. Since a gradient descent algorithm for minimizing a convex function over a closed convex set leads to a sequence of iterates that converge [47, 45], we conclude that Algorithm 1 converges. Therefore, the sequence $\boldsymbol{\beta}_{m}$ converges to $\boldsymbol{\beta}^{*}$, a first order stationary point:
\[
\boldsymbol{\beta}^{*}\in\mathbf{H}_{k}\left(\boldsymbol{\beta}^{*}-\frac{1}{L}\nabla g(\boldsymbol{\beta}^{*})\right).
\]
(d) Let ${\mathcal{I}}_{m}\subset\{1,\ldots,p\}$ denote the set of indices corresponding to the $k$ largest entries, in absolute value, of the vector $\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)$. By the definition of $\mathbf{H}_{k}\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)$, we have
\[
\left|\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)_{i}\right|\geq\left|\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)_{j}\right|
\]
for all $i,j$ with $i\in{\mathcal{I}}_{m}$ and $j\notin{\mathcal{I}}_{m}$. Thus,
\[
\liminf_{m\rightarrow\infty}\;\min_{i\in{\mathcal{I}}_{m}}\;\left|\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)_{i}\right|\geq\liminf_{m\rightarrow\infty}\;\max_{j\notin{\mathcal{I}}_{m}}\;\left|\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)_{j}\right|. \tag{49}
\]
Moreover,
\[
\left(\boldsymbol{\beta}_{m}-\mathbf{H}_{k}\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)\right)_{i}=\begin{cases}\frac{1}{L}(\nabla g(\boldsymbol{\beta}_{m}))_{i},&i\in{\mathcal{I}}_{m},\\ \beta_{m,i},&\text{otherwise}.\end{cases}
\]
Using the fact that $\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\rightarrow\mathbf{0}$ we have
\[
(\nabla g(\boldsymbol{\beta}_{m}))_{i}\rightarrow 0,\;i\in{\mathcal{I}}_{m}\quad\text{and}\quad\beta_{m,j}\rightarrow 0,\;j\notin{\mathcal{I}}_{m}
\]
as $m\rightarrow\infty$. Combining with (49) we have that:
\[
\liminf_{m\rightarrow\infty}\;\min_{i\in{\mathcal{I}}_{m}}\;|\beta_{mi}|\geq\liminf_{m\rightarrow\infty}\;\max_{j\notin{\mathcal{I}}_{m}}\;\frac{1}{L}\left|\left(\nabla g(\boldsymbol{\beta}_{m})\right)_{j}\right|=\frac{1}{L}\liminf_{m\rightarrow\infty}\;\|\nabla g(\boldsymbol{\beta}_{m})\|_{\infty}.
\]
Since $\liminf_{m\rightarrow\infty}\;\min_{i\in{\mathcal{I}}_{m}}\;|\beta_{mi}|=\underline{\alpha}_{k}=0$ (by hypothesis), the left hand side of the above inequality equals zero, which leads to $\liminf_{m\rightarrow\infty}\;\|\nabla g(\boldsymbol{\beta}_{m})\|_{\infty}=0$.
(e) We build on the proof of Part (d). It follows from equation (49) (by suitably changing '$\liminf$' to '$\limsup$') that:
\[
\underbrace{\limsup_{m\rightarrow\infty}\;\min_{i\in{\mathcal{I}}_{m}}\;|\beta_{mi}|}_{\overline{\alpha}_{k}}\geq\limsup_{m\rightarrow\infty}\;\max_{j\notin{\mathcal{I}}_{m}}\;\frac{1}{L}\left|\left(\nabla g(\boldsymbol{\beta}_{m})\right)_{j}\right|=\frac{1}{L}\limsup_{m\rightarrow\infty}\;\|\nabla g(\boldsymbol{\beta}_{m})\|_{\infty}.
\]
The left hand side of the above inequality is $\overline{\alpha}_{k}$, which is zero (by hypothesis); thus $\|\nabla g(\boldsymbol{\beta}_{m})\|_{\infty}\rightarrow 0$ as $m\rightarrow\infty$.

Suppose $\boldsymbol{\beta}_{\infty}$ is a limit point of the sequence $\boldsymbol{\beta}_{m}$. Then there is a subsequence $\{m^{\prime}\}\subset\{1,2,\ldots\}$ such that $\boldsymbol{\beta}_{m^{\prime}}\rightarrow\boldsymbol{\beta}_{\infty}$ and $g(\boldsymbol{\beta}_{m^{\prime}})\rightarrow g(\boldsymbol{\beta}_{\infty})$. Using the continuity of the gradient, and hence of the map $\boldsymbol{\beta}\mapsto\|\nabla g(\boldsymbol{\beta})\|_{\infty}$, we have $\|\nabla g(\boldsymbol{\beta}_{m^{\prime}})\|_{\infty}\rightarrow\|\nabla g(\boldsymbol{\beta}_{\infty})\|_{\infty}=0$ as $m^{\prime}\rightarrow\infty$. Thus $\boldsymbol{\beta}_{\infty}$ is a solution to the unconstrained (without cardinality constraints) optimization problem $\min\;g(\boldsymbol{\beta})$. Since $g(\boldsymbol{\beta}_{m})$ is a decreasing sequence, $g(\boldsymbol{\beta}_{m})$ converges to the minimum of $g(\boldsymbol{\beta})$. $\Box$
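The iterates analyzed above are those of Algorithm 1, $\boldsymbol{\beta}_{m+1}\in\mathbf{H}_{k}\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)$. A minimal illustrative sketch (our own code, not the authors' implementation) of this discrete first order scheme for the least squares loss $g(\boldsymbol{\beta})=\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}$ is given below; the Lipschitz constant is taken as the largest eigenvalue of $\mathbf{X}^{\prime}\mathbf{X}$ and a simple change-based stopping rule is used.

import numpy as np

def discrete_first_order(X, y, k, max_iter=1000, tol=1e-6):
    # Discrete first order method for g(beta) = 0.5*||y - X beta||^2 with ||beta||_0 <= k.
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X).max()     # gradient of g is Lipschitz with parameter L
    beta = np.zeros(p)
    for _ in range(max_iter):
        grad = -X.T @ (y - X @ beta)          # gradient of g at the current iterate
        c = beta - grad / L                   # gradient step
        beta_new = np.zeros(p)
        keep = np.argsort(-np.abs(c))[:k]     # H_k: keep the k largest entries in absolute value
        beta_new[keep] = c[keep]
        if np.linalg.norm(beta_new - beta) <= tol:
            beta = beta_new
            break
        beta = beta_new
    return beta

In the paper such iterates are used as warm starts for the MIO solver; the sketch above only illustrates the update rule and the convergence behaviour established in Proposition 6 and Theorem 3.1.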

B.2 Proof of Proposition 3

Proof:
We provide a proof of Proposition 3 for the sake of completeness.

It suffices to consider $|c_{i}|>0$ for all $i$. Let $\boldsymbol{\beta}$ be an optimal solution to Problem (22) and let $S:=\{i:\beta_{i}\neq 0\}$. The objective function is given by $\sum_{i\notin S}|c_{i}|^{2}+\sum_{i\in S}(\beta_{i}-c_{i})^{2}$. Note that by selecting $\beta_{i}=c_{i}$ for $i\in S$, we can make the objective function equal to $\sum_{i\notin S}|c_{i}|^{2}$. Thus, to minimize the objective function, $S$ must correspond to the indices of the largest $k$ values of $|c_{i}|,i\geq 1$. $\Box$
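In code, an element of $\mathbf{H}_{k}(\mathbf{c})$ as characterized above is obtained by keeping the $k$ largest entries of $\mathbf{c}$ in absolute value and setting the rest to zero (ties broken arbitrarily, consistent with the set-valued definition); a short illustrative sketch of our own:

import numpy as np

def hard_threshold(c, k):
    # Return an element of H_k(c): keep the k largest |c_i| and zero out the rest.
    out = np.zeros_like(c)
    keep = np.argsort(-np.abs(c))[:k]
    out[keep] = c[keep]
    return out

# e.g. hard_threshold(np.array([3.0, -1.0, 0.5, -2.0]), 2) returns array([ 3., 0., 0., -2.])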

B.3 Proof of Proposition 7

Proof
This follows from Proposition 6, Part (a), which implies that:
\[
g({\boldsymbol{\eta}})-g(\hat{\boldsymbol{\eta}})\geq\frac{L-\ell}{2}\left\|\hat{\boldsymbol{\eta}}-{\boldsymbol{\eta}}\right\|_{2}^{2}
\]
for any $\hat{\boldsymbol{\eta}}\in\mathbf{H}_{k}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right)$. Now, by the definition of $\mathbf{H}_{k}(\cdot)$, we have $g(\boldsymbol{\eta})=g(\hat{\boldsymbol{\eta}})$, which along with $L>\ell$ implies that the right hand side of the above inequality is zero: thus $\left\|\hat{\boldsymbol{\eta}}-{\boldsymbol{\eta}}\right\|_{2}=0$, i.e., $\boldsymbol{\eta}=\hat{\boldsymbol{\eta}}$. Since the choice of $\hat{\boldsymbol{\eta}}$ was arbitrary, it follows that $\boldsymbol{\eta}$ is the only element in the set $\mathbf{H}_{k}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right)$. $\Box$

B.4 Proof of Proposition 8

Proof
The proof follows by noting that $\widehat{\boldsymbol{\beta}}$ is $k$-sparse, along with Proposition 6, Part (a), which implies that:

\[
g(\widehat{\boldsymbol{\beta}})-g(\hat{\boldsymbol{\eta}})\geq\frac{L-\ell}{2}\left\|\widehat{\boldsymbol{\beta}}-\hat{\boldsymbol{\eta}}\right\|_{2}^{2},
\]

for any $\hat{\boldsymbol{\eta}}\in\mathbf{H}_{k}\left(\widehat{\boldsymbol{\beta}}-\frac{1}{L}\nabla g(\widehat{\boldsymbol{\beta}})\right)$. Now, by the definition of $\widehat{\boldsymbol{\beta}}$ we have $g(\widehat{\boldsymbol{\beta}})=g(\hat{\boldsymbol{\eta}})$, which along with $L>\ell$ implies that the rhs of the above inequality is zero: thus $\widehat{\boldsymbol{\beta}}$ is a first order stationary point. $\hfill\Box$

B.5 Proof of Theorem 3.1

Proof
Summing inequalities (31) for $1\leq m\leq M$, we obtain
1mM.1𝑚𝑀1\leq m\leq M. 的不等式 (31) 求和,我們得到

\[
\sum_{m=1}^{M}\left(g(\boldsymbol{\beta}_{m})-g(\boldsymbol{\beta}_{m+1})\right)\geq\frac{L-\ell}{2}\sum_{m=1}^{M}\|\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\|_{2}^{2}, \tag{50}
\]

leading to

\[
g(\boldsymbol{\beta}_{1})-g(\boldsymbol{\beta}_{M+1})\geq\frac{M(L-\ell)}{2}\min_{m=1,\ldots,M}\|\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\|_{2}^{2}.
\]

Since the decreasing sequence $g(\boldsymbol{\beta}_{m+1})$ converges to $g(\boldsymbol{\beta}^{*})$ by Proposition 6, we have:

\[
\frac{g(\boldsymbol{\beta}_{1})-g(\boldsymbol{\beta}^{*})}{M}\geq\frac{g(\boldsymbol{\beta}_{1})-g(\boldsymbol{\beta}_{M+1})}{M}\geq\frac{(L-\ell)}{2}\min_{m=1,\ldots,M}\|\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\|_{2}^{2}. \qquad\Box
\]
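
The sequence analyzed above is generated by the discrete first order iteration $\boldsymbol{\beta}_{m+1}\in\mathbf{H}_{k}\left(\boldsymbol{\beta}_{m}-\frac{1}{L}\nabla g(\boldsymbol{\beta}_{m})\right)$. To make the result concrete, the numpy sketch below runs this iteration for the least squares loss $g(\boldsymbol{\beta})=\frac{1}{2}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}$ and records the non-increasing objective values; the choice $L=1.1\,\lambda_{\max}(\mathbf{X}'\mathbf{X})$, the stopping rule, and the function names are our own assumptions, not the authors' implementation.

```python
import numpy as np

def hard_threshold(c, k):
    # top-k (in absolute value) projection, as in the sketch for Proposition 3
    beta = np.zeros_like(c)
    idx = np.argsort(-np.abs(c))[:k]
    beta[idx] = c[idx]
    return beta

def discrete_first_order(X, y, k, max_iter=500, tol=1e-8):
    """Sketch of the iteration beta_{m+1} in H_k(beta_m - grad g(beta_m)/L)
    for g(beta) = 0.5*||y - X beta||_2^2.  The recorded objective values are
    non-increasing, as guaranteed by inequality (31)."""
    n, p = X.shape
    ell = np.linalg.eigvalsh(X.T @ X).max()   # Lipschitz constant of grad g
    L = 1.1 * ell                             # the analysis above assumes L > ell
    beta = np.zeros(p)
    obj = [0.5 * np.sum((y - X @ beta) ** 2)]
    for _ in range(max_iter):
        grad = -X.T @ (y - X @ beta)
        beta_next = hard_threshold(beta - grad / L, k)
        obj.append(0.5 * np.sum((y - X @ beta_next) ** 2))
        if np.linalg.norm(beta_next - beta) <= tol:
            beta = beta_next
            break
        beta = beta_next
    return beta, obj
```

From the recorded values one can check the rate directly: after $M$ iterations, $\min_{m\leq M}\|\boldsymbol{\beta}_{m+1}-\boldsymbol{\beta}_{m}\|_{2}^{2}\leq\frac{2\left(g(\boldsymbol{\beta}_{1})-g(\boldsymbol{\beta}_{M+1})\right)}{M(L-\ell)}$, which is a rearrangement of the second inequality displayed above.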

B.6 Proof of Proposition 5

Proof
If $\boldsymbol{\eta}$ is a first order stationary point with $\|\boldsymbol{\eta}\|_{0}\leq k$, it follows from the argument following Definition 1 that there is a set $I\subset\{1,\ldots,p\}$ with $|I^{c}|=k$ such that $\nabla_{i}g(\boldsymbol{\eta})=0$ for all $i\notin I$ and $\eta_{i}=0$ for all $i\in I$. Let $\mu_{i}:=\eta_{i}-\frac{1}{L}\nabla_{i}g(\boldsymbol{\eta})$ for $i=1,\ldots,p$. Suppose $I_{k}$ denotes the set of indices corresponding to the top $k$ ordered values of $|\mu_{i}|$. Note that:

\[
\mu_{i}=\eta_{i},\;\; i\in I_{k}\quad\text{and}\quad|\mu_{j}|=\left|\tfrac{1}{L}\nabla_{j}g(\boldsymbol{\eta})\right|,\;\; j\notin I_{k}. \tag{51}
\]

For $i\in I_{k}$ and $j\notin I_{k}$ we have $|\mu_{i}|\geq|\mu_{j}|$. This implies that $|\eta_{i}|\geq\left|\tfrac{1}{L}\nabla_{j}g(\boldsymbol{\eta})\right|$. Since $\boldsymbol{\eta}\in\mathbf{H}_{k}\left(\boldsymbol{\eta}-\frac{1}{L}\nabla g(\boldsymbol{\eta})\right)$ and $\|\boldsymbol{\eta}\|_{0}<k$, it follows that $0=\min_{i\in I_{k}}|\eta_{i}|=\min_{i\in I_{k}}|\mu_{i}|$. We thus have that $\nabla_{j}g(\boldsymbol{\eta})=0$ for all $j\notin I_{k}$. In addition, note that $\nabla_{i}g(\boldsymbol{\eta})=0$ for all $i\in I_{k}$. Thus it follows that $\nabla g(\boldsymbol{\eta})=\mathbf{0}$ and hence $\boldsymbol{\eta}\in\operatorname*{arg\,min}_{\boldsymbol{\eta}}\,g(\boldsymbol{\eta})$. $\hfill\Box$
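
The fixed point condition used in this proof is straightforward to test numerically. The sketch below checks whether a given vector is (up to a tolerance) first order stationary in the sense of Definition 1 for the least squares loss; by Proposition 5, if the check passes and the vector has fewer than $k$ nonzeros, its gradient must vanish. The function name, the tolerance, and the decision to ignore exact ties in the top-$k$ selection are our own simplifications.

```python
import numpy as np

def is_first_order_stationary(eta, X, y, k, L, tol=1e-8):
    """Check (numerically) whether eta lies in H_k(eta - grad g(eta)/L)
    for g(beta) = 0.5*||y - X beta||_2^2, i.e. Definition 1.
    Ties among the |mu_i| values are ignored for simplicity."""
    grad = -X.T @ (y - X @ eta)
    mu = eta - grad / L
    candidate = np.zeros_like(mu)
    idx = np.argsort(-np.abs(mu))[:k]   # one admissible top-k set I_k
    candidate[idx] = mu[idx]
    return np.linalg.norm(candidate - eta) <= tol
```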

Appendix C Brief Review of Statistical Properties for the subset selection problem

In this section, for the sake of completeness we briefly review some of the properties of solutions to Problem (1).

Suppose the linear model assumption is true, i.e., $\mathbf{y}=\mathbf{X}\boldsymbol{\beta}^{0}+\boldsymbol{\epsilon}$, with $\epsilon_{i}\stackrel{\text{iid}}{\sim}\mathrm{N}(0,\sigma^{2})$. Let $\widehat{\boldsymbol{\beta}}$ denote a solution to (1). [46] showed that, with probability greater than $1-\exp(-c_{1}k\log(p/k))$, the worst case (over $\boldsymbol{\beta}^{0}$) predictive performance has the following upper bound:

\[
\max_{\boldsymbol{\beta}^{0}:\|\boldsymbol{\beta}^{0}\|_{0}\leq k}\;\frac{1}{n}\left\|\mathbf{X}\boldsymbol{\beta}^{0}-\mathbf{X}\widehat{\boldsymbol{\beta}}\right\|_{2}^{2}\leq c_{2}\,\sigma^{2}\,\frac{k\log(p/k)}{n}, \tag{52}
\]

where $c_{1},c_{2}$ are universal constants. Similar results also appear in [12, 58]. Interestingly, the upper bound (52) does not depend upon $\mathbf{X}$. Unless $p/k=O(1)$, the upper bound appearing in (52) is of the order $O\left(\sigma^{2}\frac{k\log(p)}{n}\right)$, where the constants are universal. In terms of the expected (worst case) predictive risk, an upper bound is given by [58]:

\[
\max_{\boldsymbol{\beta}^{0}:\|\boldsymbol{\beta}^{0}\|_{0}\leq k}\;\frac{1}{n}\mathbb{E}\left(\left\|\mathbf{X}\boldsymbol{\beta}^{0}-\mathbf{X}\widehat{\boldsymbol{\beta}}\right\|_{2}^{2}\right)\lesssim\sigma^{2}\,\frac{k\log(p)}{n}, \tag{53}
\]

where the symbol “$\lesssim$” means “$\leq$” up to some universal constants.

A natural question is how the bounds for Lasso-based solutions compare with (53). In a recent paper [58], the authors derive upper and lower bounds on the prediction performance of a thresholded version of the Lasso solution, which we present briefly. Suppose

\[
\hat{\boldsymbol{\beta}}_{\ell_{1}}\in\operatorname*{arg\,min}_{\boldsymbol{\beta}}\;\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\lambda_{n}\|\boldsymbol{\beta}\|_{1}
\]

denotes a Lasso solution for $\lambda_{n}=4\sigma\sqrt{\frac{\log p}{n}}$. Let $\hat{\boldsymbol{\beta}}_{\text{TL}}$ denote the thresholded version of the Lasso solution, which retains the top $k$ entries of $\hat{\boldsymbol{\beta}}_{\ell_{1}}$ in absolute value and sets the remaining entries to zero. The bounds on the predictive performance of Lasso-based solutions depend upon a restricted eigenvalue type condition. Following [58], we define, for any subset $S\subset\{1,2,\ldots,p\}$, the set $C(S):=\{\boldsymbol{\beta}\,:\,\|\boldsymbol{\beta}_{S^{c}}\|_{1}\leq 2\|\boldsymbol{\beta}_{S}\|_{1}\}$, where $\|\boldsymbol{\beta}_{S}\|_{1}=\sum_{j\in S}|\beta_{j}|$ and $\|\boldsymbol{\beta}_{S^{c}}\|_{1}=\sum_{j\in S^{c}}|\beta_{j}|$. We say that the matrix $\mathbf{X}$ satisfies a restricted eigenvalue type condition with parameter $\gamma(\mathbf{X})$ if it satisfies the following:

\[
\frac{1}{n}\|\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}\geq\gamma(\mathbf{X})\,\|\boldsymbol{\beta}\|_{2}^{2}\quad\text{for all}\quad\boldsymbol{\beta}\in\bigcup_{S:|S|=k}C(S).
\]

Note that $\gamma(\mathbf{X})\leq 1$ and that $\gamma(\mathbf{X})$ is also related to the so-called compatibility condition [11]. In an insightful paper, [58] show that under such restricted eigenvalue type conditions the following holds:

\[
\frac{\sigma^{2}}{\gamma(\mathbf{X}_{\text{bad}})^{2}}\,\frac{k^{1-\delta}\log(p)}{n}\;\lesssim\;\max_{\boldsymbol{\beta}^{0}:\|\boldsymbol{\beta}^{0}\|_{0}\leq k}\frac{1}{n}\mathbb{E}\left(\left\|\mathbf{X}\boldsymbol{\beta}^{0}-\mathbf{X}\widehat{\boldsymbol{\beta}}_{\text{TL}}\right\|_{2}^{2}\right)\;\lesssim\;\frac{\sigma^{2}}{\gamma(\mathbf{X})^{2}}\,\frac{k\log(p)}{n} \tag{54}
\]

In particular, the lower bounds apply to bad design matrices $\mathbf{X}_{\text{bad}}$ for some arbitrarily small scalar $\delta>0$. In fact, [58] establish a result stronger than (54), in which $\hat{\boldsymbol{\beta}}_{\text{TL}}$ can be replaced by a $k$-sparse estimate delivered by a polynomial time method. The bounds displayed in (54) show that there is a significant gap between the predictive performance of subset selection procedures (see bound (53)) and Lasso-based $k$-sparse solutions; the magnitude of the gap depends upon how small $\gamma(\mathbf{X})$ is. $\gamma(\mathbf{X})$ can be small if the pairwise correlations between the features of the model matrix are quite high. These results complement our experimental findings in Section 5.
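
For readers who wish to experiment with this comparison, a minimal sketch of the thresholded Lasso estimator $\hat{\boldsymbol{\beta}}_{\text{TL}}$ described above is given below. It relies on scikit-learn's Lasso solver (our tooling choice, not prescribed by the paper); note that scikit-learn's objective $\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\boldsymbol{\beta}\|_{2}^{2}+\alpha\|\boldsymbol{\beta}\|_{1}$ matches the penalized criterion above with $\alpha=\lambda_{n}$.

```python
import numpy as np
from sklearn.linear_model import Lasso

def thresholded_lasso(X, y, k, sigma):
    """Fit the Lasso at lambda_n = 4*sigma*sqrt(log(p)/n), then keep the
    top-k coefficients in absolute value and set the rest to zero."""
    n, p = X.shape
    lam = 4.0 * sigma * np.sqrt(np.log(p) / n)
    fit = Lasso(alpha=lam, fit_intercept=False, max_iter=10000).fit(X, y)
    beta_l1 = fit.coef_.copy()
    keep = np.argsort(-np.abs(beta_l1))[:k]
    beta_tl = np.zeros(p)
    beta_tl[keep] = beta_l1[keep]
    return beta_tl
```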

An in-depth analysis of the properties of solutions to the Lagrangian version of Problem (1), namely Problem (4), is presented in [56]. [46, 56] also analyze the errors in the regression coefficients, $\|\boldsymbol{\beta}^{0}-\widehat{\boldsymbol{\beta}}\|_{2}$, under further minor assumptions on the model matrix $\mathbf{X}$. [56, 48] provide interesting theoretical analyses of the variable selection properties of (1) and (4), showing that subset selection procedures have superior variable selection properties over Lasso-based methods.

In passing, we remark that [56] develop statistical properties of inexact solutions to Problem (4). This may serve as interesting theoretical support for near-global solutions to Problem (1), where the certificates of sub-optimality are delivered by our MIO framework in terms of global lower bounds. A precise and thorough understanding of the statistical properties of sub-optimal solutions to Problem (1) is left as an interesting direction for future work.

Appendix D Additional Details on Experiments and Computations

D.1 Some additional figures related to the radii of bounding boxes

Some figures illustrating the effect of the bounding box radii are presented in Figure 12.

Figure 12: The evolution of the MIO gap with varying radii of bounding boxes for MIO formulation (37) ($n=50$, $p=1000$). The two panel settings are $\mathcal{L}^{\zeta}_{\ell,\text{loc}}=\infty$ with $\mathcal{L}^{\beta}_{\ell,\text{loc}}=2\|\boldsymbol{\beta}_{0}\|_{1}/k$ (top) and $\mathcal{L}^{\zeta}_{\ell,\text{loc}}=\infty$ with $\mathcal{L}^{\beta}_{\ell,\text{loc}}=\|\boldsymbol{\beta}_{0}\|_{1}/k$ (bottom), so the top panel has radii twice those of the bottom panel. The dataset considered is generated as per Example 1 with $n=50$, $p=1000$, $\rho=0.9$ and $k_{0}=5$ for different values of SNR: [Left Panel] SNR = 1, [Right Panel] SNR = 3. For each case, different values of $k$ have been considered. As expected, the times for the MIO gaps to close depend upon the radii of the boxes. The optimal solutions obtained were found to be insensitive to the choice of the bounding box radius.

D.2 Lasso, Debiased Lasso and MIO

We present here comparisons of the debiased Lasso with MIO and Lasso.

Debiasing is often used to mitigate the shrinkage imparted by the Lasso regularization parameter. This is done by performing an unrestricted least squares fit on the support selected by the Lasso. Of course, the results depend upon the tuning parameter used for the problem. We use two methods towards this end. In the first method we find the best Lasso solution (by obtaining an optimal tuning parameter based on minimizing predictive error on a held out validation set); we then obtain the unregularized least squares solution restricted to the support of that Lasso solution. This typically performed worse than the Lasso in all the experiments we tried; see Tables 3 and 4. The unrestricted least squares solution on the optimal model selected by the Lasso (as shown in Figure 4) had worse predictive performance than the Lasso with the same sparsity pattern, as shown in Table 3. This is probably due to overfitting, since the model selected by the Lasso is quite dense relative to $n,p$. Table 4 presents the results for $50=n\ll p=1000$. We consider the same example presented in Figure 9, Example 1. First of all, Table 4 presents the prediction performance of the Lasso after debiasing, using the same tuning parameter considered optimal for the Lasso problem. We see that, as in the case of Table 3, debiasing does not lead to improved performance in terms of prediction error.

We thus experimented with another variant of the debiased Lasso, where for every $\lambda$ we computed the Lasso solution (2) and obtained $\hat{\boldsymbol{\beta}}_{\text{Deb},\lambda}$ by performing an unrestricted least squares fit on the support selected by the Lasso solution at $\lambda$. This method can be thought of as delivering feasible solutions for Problem (1), for a value of $k:=k(\lambda)$ determined by the Lasso solution at $\lambda$. The success of this method makes a case in support of using criterion (1). The tuning parameter was then selected by minimizing prediction error on a held out validation set. This method in general performed better than the Lasso, delivering a sparser model with better predictive accuracy. The performance of the debiased Lasso was similar to Sparsenet and was in general inferior to MIO, often by orders of magnitude, especially for problems where the pairwise correlations between the variables were large, the SNR was low and $n\ll p$. The results are presented in Tables 5 and 6 (for the case $n>p$) and Tables 7 and 8 (for the case $n\ll p$).
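
A minimal sketch of this second debiasing variant (refit unrestricted least squares on the support selected by the Lasso at each regularization level, then pick the refit with the best validation error) is shown below; the grid of regularization values, the scikit-learn Lasso solver, and the function name are our own assumptions rather than the authors' exact protocol.

```python
import numpy as np
from sklearn.linear_model import Lasso

def debiased_lasso(X_tr, y_tr, X_val, y_val, alphas):
    """For each alpha: fit the Lasso, refit unrestricted least squares on
    its support, and keep the refit with the smallest validation error."""
    p = X_tr.shape[1]
    best_err, best_beta = np.inf, None
    for alpha in alphas:
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000).fit(X_tr, y_tr)
        support = np.flatnonzero(lasso.coef_)
        if support.size == 0:
            continue  # empty model; skip
        coef_S, *_ = np.linalg.lstsq(X_tr[:, support], y_tr, rcond=None)
        beta = np.zeros(p)
        beta[support] = coef_S
        err = np.mean((y_val - X_val @ beta) ** 2)
        if err < best_err:
            best_err, best_beta = err, beta
    return best_beta
```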

Debiasing at optimal Lasso model, $n>p$

SNR   $\rho$   Ratio: Lasso / Debiased Lasso
6.33 0.5 0.33
3.17 0.5 0.54
1.58 0.5 0.53
6.97 0.8 0.67
3.48 0.8 0.64
1.74 0.8 0.63
8.73 0.9 1
4.37 0.9 0.58
2.18 0.9 0.61
Table 3: Lasso and Debiased Lasso corresponding to the numerical experiments of Figure 4, for Example 1 with $n=500$, $p=100$, $\rho\in\{0.5,0.8,0.9\}$ and $k_{0}=10$. Here, “Ratio” equals the ratio of the prediction error of the Lasso and the debiased Lasso at the optimal tuning parameter selected by the Lasso.

Debiasing at optimal Lasso model, $n\ll p$

SNR   $\rho$   Ratio: Lasso / Debiased Lasso
10 0.8 0.90
7 0.8 1.0
3 0.8 0.91
Table 4: Lasso and Debiased Lasso corresponding to the numerical experiments of Figure 9, for Example 1 with $n=50$, $p=1000$, $\rho=0.8$ and $k_{0}=5$. Here, “Ratio” equals the ratio of the prediction error of the Lasso and the debiased Lasso at the optimal tuning parameter selected by the Lasso.

The performance of this debiased variant was comparable with Sparsenet: it was better than the Lasso in terms of obtaining a sparser model with better predictive accuracy. However, the performance of MIO was significantly better than the debiased version of the Lasso, especially for larger values of $\rho$ and smaller SNR values.

Sparsity of Selected Models, $n>p$

SNR   $\rho$   Lasso   Debiased Lasso   MIO
6.33 0.5 27.6 (2.122) 10.9 (0.65) 10.8 (0.51)
3.17 0.5 27.7 (2.045) 10.9 (0.65) 10.1 (0.1)
1.58 0.5 28.0 (2.276) 10.9 (0.65) 10.2 (0.2)
6.97 0.8 34.1 (3.60) 10.4 (0.15) 10 (0.0)
3.48 0.8 34.0 (3.54) 10.9 (0.55) 10.2 (0.2)
1.74 0.8 33.7 (3.49) 13.7 (1.50) 10 (0.0)
8.73 0.9 25.9 (0.94) 13.9 (0.68) 10.5 (0.17)
4.37 0.9 34.6 (3.23) 18.1 (1.30) 10.2 (0.25)
2.18 0.9 34.7 (3.28) 20.5 (1.85) 10.1 (0.10)
Table 5: Number of non-zeros in the model selected by Lasso, Debiased Lasso, and MIO corresponding to the numerical experiments of Figure 4, for Example 1 with $n=500$, $p=100$, $\rho\in\{0.5,0.8,0.9\}$ and $k_{0}=10$. The tuning parameters for all three models were selected separately based on the best predictive model on a held out validation set. Numbers within brackets denote standard errors. Debiased Lasso leads to less dense models than Lasso. When $\rho$ is small and the SNR is large, the model size selected by the debiased Lasso is similar to that of MIO. However, for larger values of $\rho$ and smaller values of SNR, subset selection leads to considerably sparser solutions than the debiased Lasso.

Predictive Performance of Selected Models, $n>p$

SNR   $\rho$   Lasso   Debiased Lasso   MIO   Ratio: Debiased Lasso / MIO
6.33 0.5 0.0384 (0.001) 0.0255 (0.002) 0.0266 (0.001) 1.0
3.17 0.5 0.0768 (0.003) 0.0511 (0.004) 0.0478 (0.002) 1.0
1.58 0.5 0.1540 (0.007) 0.1021 (0.009) 0.0901 (0.009) 1.1
6.97 0.8 0.0389 (0.002) 0.0223 (0.001) 0.0231 (0.002) 1.0
3.48 0.8 0.0778 (0.004) 0.0464 (0.003) 0.0484 (0.004) 1.0
1.74 0.8 0.1557 (0.007) 0.1156 (0.008) 0.0795 (0.008) 1.5
8.73 0.9 0.0325 (0.001) 0.0220 (0.002) 0.0197 (0.002) 1.2
4.37 0.9 0.0632 (0.002) 0.0532 (0.003) 0.0427 (0.008) 1.3
2.18 0.9 0.1265 (0.005) 0.1254 (0.006) 0.0703 (0.011) 1.8
Table 6: Predictive performance of Lasso, Debiased Lasso, and MIO corresponding to the numerical experiments of Figure 4, for Example 1 with $n=500$, $p=100$, $\rho\in\{0.5,0.8,0.9\}$ and $k_{0}=10$. Numbers within brackets denote standard errors. The tuning parameters for all three models were selected separately based on the best predictive model on a held out validation set. When $\rho$ is small and the SNR is large, the debiased Lasso performs similarly to MIO. However, for larger values of $\rho$ and smaller values of SNR, subset selection performs better than the debiased Lasso based solutions.

We then follow the method described above (for the $n>p$ case), where we consider a sequence of models $\hat{\boldsymbol{\beta}}_{\text{Deb},\lambda}$ and find the $\lambda$ that delivers the best predictive model on a held out validation set.

Sparsity of Selected Models, $n\ll p$

SNR   $\rho$   Lasso   Debiased Lasso   MIO
10 0.8 25.7 (1.73) 7.9 (0.43) 5 (0.12)
7 0.8 27.8 (2.69) 8.1 (0.43) 5 (0.16)
3 0.8 28.0 (2.72) 10.0 (0.88) 6 (1.18)
Table 7: Number of non-zeros in the model selected by Lasso, Debiased Lasso, and MIO corresponding to the numerical experiments of Figure 9, for Example 1 with $n=50$, $p=1000$, $\rho=0.8$ and $k_{0}=5$. Numbers within brackets denote standard errors. The tuning parameters for all three models were selected separately based on the best predictive model on a held out validation set. Debiased Lasso leads to less dense models than Lasso but more dense models than MIO. The performance gap between MIO and the debiased Lasso becomes larger for lower values of SNR.

Predictive Performance of Selected Models, $n\ll p$

SNR   $\rho$   Lasso   Debiased Lasso   MIO   Ratio: Debiased Lasso / MIO
10 0.8 0.084 (0.004) 0.046 (0.003) 0.014 (0.005) 3.3
7 0.8 0.122 (0.005) 0.070 (0.004) 0.020 (0.007) 3.5
3 0.8 0.257 (0.012) 0.185 (0.016) 0.151 (0.027) 1.2
Table 8: Predictive performance of Lasso, Debiased Lasso, and MIO corresponding to the numerical experiments of Figure 9, for Example 1 with $n=50$, $p=1000$, $\rho=0.8$ and $k_{0}=5$. Numbers within brackets denote standard errors. The tuning parameters for all three models were selected separately based on the best predictive model on a held out validation set. MIO consistently leads to better predictive models than the Debiased Lasso and the ordinary Lasso. The Debiased Lasso performs better than the ordinary Lasso.