A surrogate ℓ0 sparse Cox's regression with applications to sparse high-dimensional massive sample size time-to-event data
First published: 08 December 2019 https://doi.org/10.1002/sim.8438 Citations: 6
Abstract
Sparse high-dimensional massive sample size (sHDMSS) time-to-event data present multiple challenges to quantitative researchers as most current sparse survival regression methods and software will grind to a halt and become practically inoperable. This paper develops a scalable ℓ0-based sparse Cox regression tool for right-censored time-to-event data that easily takes advantage of existing high-performance implementations of ℓ2-penalized regression methods for sHDMSS time-to-event data. Specifically, we extend the ℓ0-based broken adaptive ridge (BAR) methodology to the Cox model, which involves repeatedly performing reweighted ℓ2-penalized regression. We rigorously show that the resulting estimator for the Cox model is selection consistent, oracle for parameter estimation, and has a grouping property for highly correlated covariates. Furthermore, we implement our BAR method in an R package for sHDMSS time-to-event data by leveraging existing efficient algorithms for massive ℓ2-penalized Cox regression. We evaluate the BAR Cox regression method by extensive simulations and illustrate its application on an sHDMSS time-to-event data set from the National Trauma Data Bank with hundreds of thousands of observations and tens of thousands of sparsely represented covariates.
1 INTRODUCTION
Advancements in medical informatics tools and high-throughput biological experimentation are making large-scale data routinely accessible to researchers, administrators, and policymakers. This data deluge poses new challenges and critical barriers for quantitative researchers, as existing statistical methods and software grind to a halt when analyzing these large-scale data sets, and it calls for appropriate methods that can readily fit large-scale data. This paper primarily concerns survival analysis of sparse high-dimensional massive sample size (sHDMSS) data, a particular type of large-scale data with the following characteristics: (1) high-dimensional with a large number of covariates (pn in thousands or tens of thousands), (2) massive in sample size (n in thousands to hundreds of millions), (3) sparse in covariates with only a very small portion of covariates being nonzero for each subject, and (4) rare in event rate. An example of sHDMSS data is the pediatric trauma mortality data from the National Trauma Data Bank (NTDB) maintained by the American College of Surgeons.1 This data set includes 210 555 patient records of injured children under 15 collected over 5 years from 2006 to 2010. Each patient record includes 125 952 binary covariates that indicate the presence or absence of an attribute (ICD9 Codes, AIS codes, etc) as well as their two-way interactions. The data matrix is extremely sparse with less than 1% of the covariates being nonzero. The event (mortality) rate is also very low at 2%. Another application domain where sHDMSS data are common is drug safety studies that use massive patient-level databases such as the US FDA's Sentinel Initiative (https://www.fda.gov/safety/fdassentinelinitiative/ucm2007250.htm) and the Observational Health Data Sciences and Informatics program (https://ohdsi.org/) to study rare adverse events with hundreds of millions of patient records and tens of thousands of patient attributes that are sparse in the covariates.
The sHDMSS survival data present multiple challenges to quantitative researchers. First, not all of the thousands of covariates are expected to be relevant to an outcome of interest. It would also be practically undesirable to predict a patient outcome using thousands of covariates. Traditionally, researchers hand-pick subject characteristics to include in an analysis. However, hand-picking can introduce not only bias, but also a source of variability between researchers and studies. Moreover, it would become impractical in large-scale evidence generation when hundreds or thousands of analyses are to be performed.2 Hence, automated sparse regression methods are desired. Second, the commonly used “divide and conquer” strategy for massive size data is deemed inappropriate for sHDMSS time-to-event data since each of the divided data sets would have too few events for a meaningful analysis. Third, sHDMSS data present a critical barrier to the application of existing sparse survival regression methods, since most current methods and standard software become inoperable for large data sets due to high computational costs and large memory requirements. Although many sparse survival regression methods are available,3-10 to the best of our knowledge, only LASSO, Elastic Net11 and ridge regression have been adapted to fit sHDMSS time-to-event data. In particular, Mittal et al12 developed a tool, named CYCLOPS, for fitting LASSO and ridge Cox regression with sHDMSS time-to-event data by storing data in a sparse format, exploiting sparsity in the data and partial likelihood, and using multicore threading and vector processing, along with other high-performance computing techniques, which delivers a more than 10-fold speedup12 over its competitors. However, ridge Cox regression does not yield a sparse model and LASSO tends to select too many noise features and is biased for estimation.13, 14 Improved sparse Cox regression tools for sHDMSS time-to-event data are desired.
The purpose of this paper is to develop a surrogate ℓ0-based sparse Cox regression method and adapt it to sHDMSS time-to-event data. It is well known that ℓ0-penalized regression is natural for variable selection and parameter estimation with some optimal properties.15-18 On the other hand, it is also known to have some pitfalls such as instability19 and being unscalable to even moderate dimensional covariates. The broken adaptive ridge (BAR) estimator, defined as the limit of an iteratively reweighted ℓ2-penalization algorithm, was introduced to approximate the ℓ0-penalization problem and has been recently shown to possess some desirable selection, estimation, and clustering properties under the linear model and several other model settings.10, 20-22 It is also computationally scalable to high-dimensional covariates and stable for variable selection as discussed later in Remark 2 of Section 2. However, the BAR method has yet to be rigorously studied for the Cox model. Moreover, current BAR algorithms have only been implemented for densely represented covariates and are unsuitable for sHDMSS data due to high computational costs, high memory requirements, and numerical instability. Computation of the Cox partial likelihood and its derivatives is particularly demanding for massive sample size data since the required number of operations grows at the rate of O(n²). The key contributions of this paper are twofold. First, we rigorously extend the BAR methodology to the Cox model. Specifically, we establish selection consistency, an oracle property for parameter estimation, and a grouping property of highly correlated covariates for the Cox model.
It is worth noting that the theoretical extension of the BAR methodology to the Cox model is nontrivial and notably different from other models because the log-partial likelihood for the Cox model is not the sum of independent terms and the standard martingale central limit theorem used to derive the asymptotic theory for Cox's model with a fixed number of covariates is no longer applicable when the number of parameters diverges. Furthermore, because BAR involves performing an infinite number of penalized regressions, the derivations of its selection consistency and oracle property for estimation are substantially different from those for a single-step oracle estimator in the literature. The second key contribution of this paper is to develop an efficient implementation of BAR for Cox regression with sHDMSS time-to-event data by leveraging existing efficient massive ℓ2-penalized Cox regression techniques,12 which include employing a column relaxation with logistic loss (CLG) algorithm using one-dimensional updates and a one-step Newton-Raphson approximation as well as exploiting the sparsity in the covariate structure and the Cox partial likelihood to reduce the number of operations from O(n²) to O(n).
In Section 2, we formally define the BAR estimator, state its theoretical properties for variable selection, parameter estimation, and grouping highly correlated covariates for the Cox model, and describe an efficient implementation for sHDMSS survival data. We also discuss how to adapt BAR as a postscreening sparse regression method for ultrahigh-dimensional Cox regression with relatively small sample size. In Section 3, we present simulation studies to demonstrate the performance of the CoxBAR estimator with both moderate and massive sample sizes in various low- and high-dimensional settings. We provide a real data example using the pediatric trauma mortality data12 in Section 4. Lastly, we give closing remarks in Section 5. The appendix collects proofs of the theoretical results and regularity conditions needed for the derivations. An R package has been developed for BAR and made available at https://github.com/OHDSI/BrokenAdaptiveRidge.
2 METHODOLOGY
2.1 Cox's BAR regression and its large sample properties
2.1.1 The data structure, model, and estimator
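Suppose that one observes a random sample of right-censored time-to-event data consisting of n independent and identically distributed triplets $\{(X_i, \delta_i, z_i(\cdot))\}_{i=1}^{n}$, where, for subject $i$, $X_i$ is the observed (possibly right-censored) time, $\delta_i$ is the event indicator, and $z_i(\cdot)$ is a $p_n$-dimensional vector of possibly time-dependent covariates. Assume the Cox23 proportional hazards model

$$h\{t \mid z(t)\} = h_0(t)\exp\{z(t)^\top \beta\}, \qquad (1)$$

where $h\{t \mid z(t)\}$ is the conditional hazard function given $\{z(u), 0 \le u \le t\}$, $h_0(t)$ is an unspecified baseline hazard function, and $\beta = (\beta_1, \ldots, \beta_{p_n})^\top$ is the vector of regression coefficients. In counting process notation, the log-partial likelihood is

$$l_n(\beta) = \sum_{i=1}^{n} \int_0^{\infty} \Big[ \beta^\top z_i(s) - \ln\Big\{ \sum_{j=1}^{n} Y_j(s)\exp\{\beta^\top z_j(s)\} \Big\} \Big]\, dN_i(s), \qquad (2)$$

where, for subject $i$, $Y_i(s) = I(X_i \ge s)$ is the at-risk process and $N_i(s) = I(X_i \le s, \delta_i = 1)$ is the counting process with intensity process $h_0(s) Y_i(s) \exp\{\beta^\top z_i(s)\}$.

Our Cox BAR estimator of $\beta$ starts with an initial Cox ridge regression estimator25

$$\hat\beta^{(0)} = \arg\min_{\beta} \Big\{ -2 l_n(\beta) + \xi_n \sum_{j=1}^{p_n} \beta_j^2 \Big\}, \qquad (3)$$

which is updated iteratively by the reweighted ℓ2-penalized Cox regression estimator

$$\hat\beta^{(k)} = \arg\min_{\beta} \Big\{ -2 l_n(\beta) + \lambda_n \sum_{j=1}^{p_n} \frac{\beta_j^2}{\big(\hat\beta_j^{(k-1)}\big)^2} \Big\}, \quad k \ge 1, \qquad (4)$$

where $\xi_n$ and $\lambda_n$ are nonnegative penalization tuning parameters. The BAR estimator is defined as

$$\hat\beta = \lim_{k \to \infty} \hat\beta^{(k)}. \qquad (5)$$

Since ℓ2-penalization yields a nonsparse solution, defining the BAR estimator as the limit is necessary for producing sparsity. Although λn is fixed at each iteration, it is weighted inversely by the square of the ridge regression estimates from the previous iteration. Consequently, coefficients whose true values are zero receive larger penalties in the next iteration, whereas the penalties for truly nonzero coefficients converge to a constant. We will show later in Theorem 1 that, under certain regularity conditions, the estimates of the truly zero coefficients shrink to zero, whereas the estimates of the truly nonzero coefficients converge to their oracle estimates with probability tending to 1. As illustrated by a small simulation in Section S2 of the Supplementary Material (Figure S1), signals (nonzero coefficients) and noise (zero coefficients) can be separated fairly quickly within a few BAR iterations, although more iterations may be needed in some cases to improve the estimates of the nonzero coefficients.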
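The iteration described above (an initial ridge fit followed by repeated reweighted ridge fits) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' R implementation: the objective is taken to be −2 × log partial likelihood plus the weighted ridge penalty, times are assumed untied, risk-set sums are computed naively in O(n²), each weighted ridge problem is solved by cyclic one-dimensional Newton steps with step halving, and the names `cox_bar`, `weighted_ridge_cox`, the threshold `eps`, and the toy data are all hypothetical.

```python
import math

def linear_predictors(beta, Z):
    return [sum(b * z for b, z in zip(beta, row)) for row in Z]

def log_partial_likelihood(beta, X, delta, Z):
    """Cox log partial likelihood with naive O(n^2) risk-set sums (no tied times assumed)."""
    eta = linear_predictors(beta, Z)
    n = len(X)
    ll = 0.0
    for i in range(n):
        if delta[i] == 1:
            s0 = sum(math.exp(eta[l]) for l in range(n) if X[l] >= X[i])
            ll += eta[i] - math.log(s0)
    return ll

def score_and_information(beta, X, delta, Z, j):
    """First derivative and negative second derivative of the log partial likelihood in beta_j."""
    eta = linear_predictors(beta, Z)
    n = len(X)
    g = h = 0.0
    for i in range(n):
        if delta[i] != 1:
            continue
        s0 = s1 = s2 = 0.0
        for l in range(n):
            if X[l] >= X[i]:
                e = math.exp(eta[l])
                s0 += e
                s1 += Z[l][j] * e
                s2 += Z[l][j] ** 2 * e
        mean = s1 / s0
        g += Z[i][j] - mean
        h += s2 / s0 - mean * mean
    return g, h

def weighted_ridge_cox(beta, w, X, delta, Z, sweeps=50, tol=1e-10):
    """Minimize -2*l_n(beta) + sum_j w[j]*beta_j^2 by cyclic one-dimensional Newton steps.
    w[j] = None marks a coefficient that is held at zero."""
    beta = list(beta)

    def objective(b):
        penalty = sum(wj * bj * bj for wj, bj in zip(w, b) if wj is not None)
        return -2.0 * log_partial_likelihood(b, X, delta, Z) + penalty

    for _ in range(sweeps):
        largest_move = 0.0
        for j in range(len(beta)):
            if w[j] is None:
                continue
            g, h = score_and_information(beta, X, delta, Z, j)
            step = (g - w[j] * beta[j]) / (h + w[j])
            current = objective(beta)
            for _halving in range(20):  # halve the step until the objective does not increase
                trial = beta[:j] + [beta[j] + step] + beta[j + 1:]
                if objective(trial) <= current:
                    beta = trial
                    largest_move = max(largest_move, abs(step))
                    break
                step /= 2.0
        if largest_move < tol:
            break
    return beta

def cox_bar(X, delta, Z, xi=1.0, lam=1.0, iters=30, eps=1e-8):
    """Initial ridge fit, then repeated reweighted ridge fits with weights lam / beta_prev^2."""
    p = len(Z[0])
    beta = weighted_ridge_cox([0.0] * p, [xi] * p, X, delta, Z)
    for _ in range(iters):
        # Coefficients that have collapsed below eps are fixed at zero (assumed cutoff).
        w = [lam / (b * b) if abs(b) > eps else None for b in beta]
        beta = [b if wj is not None else 0.0 for b, wj in zip(beta, w)]
        beta = weighted_ridge_cox(beta, w, X, delta, Z)
    return [b if abs(b) > eps else 0.0 for b in beta]

# Toy data (hypothetical): 8 subjects, 2 binary covariates, distinct observed times.
X = [1, 2, 3, 4, 5, 6, 7, 8]
delta = [1, 1, 0, 1, 1, 0, 1, 1]
Z = [[1, 0], [1, 1], [0, 0], [1, 0], [0, 1], [1, 1], [0, 0], [0, 1]]
estimate = cox_bar(X, delta, Z, xi=1.0, lam=1.0)
```

The finite iteration count and the `eps` cutoff stand in for the limit in the definition of the estimator; a production implementation would use a convergence criterion and the sparse, linear-time computations discussed in Section 2.2.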
Remark 1. (Computational aspects of BAR) For moderate-size data, one may calculate the reweighted ridge estimate in (4) at each iteration using the Newton-Raphson method.
For massive-size data with large n and pn, the Newton-Raphson procedure, which calls for the calculation of both the gradient and the Hessian at each iteration, can become practically infeasible due to high computational costs, memory requirements, and numerical instability. In Section 2.2, we will discuss how to adapt an efficient algorithm for massive ℓ2-penalized Cox regression via cyclic coordinate descent and exploit the sparsity of the covariate structure to make BAR scalable to sHDMSS data.
Remark 2. (Broken adaptive ridge versus best subset selection) The BAR method can be viewed as performing a sequence of surrogate ℓ0-penalizations, where each reweighted ℓ2 penalty serves as a surrogate ℓ0-penalty and the approximation of ℓ0-penalization improves with each iteration. Consequently, BAR enjoys the best of ℓ0- and ℓ2-penalized regressions. For example, we establish in the next two sections that BAR possesses the oracle properties for estimation and selection consistency (an ℓ0 property) as well as a grouping property (an ℓ2 property). Numerically, for a fixed tuning parameter value, BAR, being a surrogate to ℓ0-penalization, is not expected to be identical to the exact global ℓ0-penalized solution, but can be similar to it, where the latter must be solved using the best subset search (BSS). We illustrate this in Section S3 of the Supplementary Material (Figures S2 and S3) using a small simulation study. It is worth emphasizing that BAR overcomes some shortcomings of BSS; for example, BSS is computationally NP-hard and can be unstable for variable selection,19 whereas BAR is scalable to high-dimensional covariates and is more stable for variable selection as demonstrated in Figures S2 and S3 in Section S3 of the Supplementary Material.
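To make the "similar but not identical" relationship concrete, the sketch below works in a deliberately simplified setting that is not the Cox model: a single coefficient with squared-error loss (b − β)², as arises for an orthonormal linear design. In that toy case the reweighted ridge step has the closed form β ← bβ²/(β² + λ), and the exact ℓ0 solution is hard thresholding. The function names and this scalar reduction are illustrative assumptions.

```python
import math

def bar_scalar(b, lam, xi=1.0, iters=200):
    """BAR limit for the scalar problem with loss (b - beta)^2."""
    beta = b / (1.0 + xi)  # ridge start: argmin (b - beta)^2 + xi * beta^2
    for _ in range(iters):
        if beta == 0.0:
            break
        # argmin (b - beta)^2 + lam * beta^2 / beta_prev^2, in closed form
        beta = b * beta * beta / (beta * beta + lam)
    return beta if abs(beta) > 1e-8 else 0.0

def l0_scalar(b, lam):
    """Exact minimizer of (b - beta)^2 + lam * 1{beta != 0} (best subset in one dimension)."""
    return b if b * b > lam else 0.0

strong = bar_scalar(3.0, 1.0)    # settles at the larger root of beta^2 - 3*beta + 1 = 0
weak = bar_scalar(1.5, 1.0)      # no nonzero fixed point exists, so the iterate collapses to 0
```

In this toy setting both procedures drop small signals and keep large ones, but their cutoffs differ (|b| > √λ for ℓ0, |b| ≥ 2√λ for a nonzero BAR fixed point), and BAR slightly shrinks the retained coefficient.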
2.1.2 Oracle properties
We establish the oracle properties for the BAR estimator for simultaneous variable selection and parameter estimation where we allow both qn and pn to diverge to infinity.
Theorem 1. (Oracle properties) Assume the regularity conditions (C1) to (C6) in Section S4.1 of the Supplementary Material hold. Let […]
- (a) the BAR estimator […] exists and is unique, where […];
- (b) […], for any qn-dimensional vector bn such that ‖bn‖2 ≤ 1, where Σ(β0)11 is the first qn×qn submatrix of Σ(β0) and Σ(β0) is defined in Condition (C4).
Theorem 1(a) establishes selection consistency of the BAR estimator. Part (b) of the theorem essentially states that the nonzero component of the BAR estimator is asymptotically normal and equivalent to the weighted ridge estimator of the oracle model, as shown in the proof provided in Section S4.2 of the Supplementary Material.
Remark 3. (Ultrahigh-dimensional covariates setting) Although we allow pn to diverge, the asymptotic properties of the BAR estimator in Section 2.1.2 are derived for pn<n. In an ultrahigh-dimensional setting where the number of covariates far exceeds the number of observations (pn≫n), one may couple a sure screening27 method with the BAR estimator to obtain a two-step estimator with desirable selection and estimation properties. The orders of qn, pn, and n and their relationships depend on the employed screening procedure. For example, coupling the BAR estimator with the sure joint screening procedure28 has been explored in the work of Kawaguchi.29
2.1.3 A grouping property
When the true model has a group structure, it is desirable for a variable selection method to either retain or drop all variables that are clustered within the same group. It is well known that ridge regression possesses the grouping property for highly correlated covariates.11 Because the BAR estimator is based on an iterative ridge regression, we show that BAR also possesses a grouping property for highly correlated covariates, as stated in the following theorem.
Theorem 2. Let λn, […] with probability tending to 1, where […]
The proof is provided in Section S4.3 of the Supplementary Material. It is seen from (6) that, as rij→1, the absolute difference between […]
2.1.4 Selection of tuning parameters
Model complexity depends critically on the choice of the tuning parameters. The BAR estimator depends on two tuning parameters, ie, ξn for the initial ridge estimator in (3) and λn for the iterative ridge step in (4). Our simulations in Section 3.1 illustrate that, for fixed λn, the BAR estimator is insensitive to the choice of ξn over a wide interval (Figure 1), and thus, in practice, only optimization with respect to λn is needed.
We optimize with respect to λn in a similar manner to currently used penalization methods. A popular strategy for tuning parameter selection is to perform optimization with respect to a data-driven selection criterion such as cross-validation (CV),30, 31 Akaike information criterion,15 and Bayesian information criterion (BIC).16, 17, 32 Although CV has been used extensively in the literature, it has been known to asymptotically overfit models with a positive probability.33, 34 Recent theoretical work has shown that, for penalized Cox models that possess the oracle property, BIC-based tuning parameter selection identifies the true model with probability tending to one.32 Further discussion on selecting λn for BAR is provided in the last paragraph of Section 3.2.
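As a rough illustration of grid-based tuning by BIC minimization, the sketch below scores a set of already fitted models and returns the one with the smallest BIC. The BIC form used here, −2 × log-likelihood + (number of nonzero coefficients) × ln(n_eff), is an assumption, as is the choice of effective sample size n_eff (for Cox models, either the number of subjects or the number of events is used in the literature); `log_grid`, `bic_score`, `select_by_bic`, and the example numbers are hypothetical.

```python
import math

def log_grid(lo, hi, k):
    """k log-spaced candidate tuning values from lo to hi inclusive."""
    return [lo * (hi / lo) ** (i / (k - 1)) for i in range(k)]

def bic_score(loglik, coefs, n_eff):
    """Assumed BIC form: -2 * loglik + (number of nonzero coefficients) * ln(n_eff)."""
    df = sum(1 for b in coefs if b != 0.0)
    return -2.0 * loglik + df * math.log(n_eff)

def select_by_bic(fits, n_eff):
    """fits: iterable of (lam, loglik, coefs) tuples; returns the tuple with the smallest BIC."""
    return min(fits, key=lambda f: bic_score(f[1], f[2], n_eff))

# Made-up fitted models at three tuning values, for illustration only.
fits = [
    (0.1, -100.0, [0.5, 0.2, 0.1]),
    (1.0, -101.0, [0.5, 0.0, 0.0]),
    (10.0, -120.0, [0.0, 0.0, 0.0]),
]
best = select_by_bic(fits, n_eff=300)
grid = log_grid(0.01, 100.0, 25)
```

In practice each tuple would come from fitting the penalized model at one value on the grid; the middle model wins here because dropping two weak coefficients costs little likelihood relative to the ln(n_eff) penalty per parameter.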
2.2 Implementation of BAR for sHDMSS data
As mentioned in Remark 1, the Newton-Raphson algorithm used for each iteration of the BAR algorithm will become infeasible in large-scale settings with large n and pn due to high computational costs, high memory requirements, and numerical instability. Furthermore, recently proposed BAR algorithms, as with most popularly available procedures, cannot directly handle sHDMSS data due to the computational burden imposed by the design matrix. Because BAR only involves fitting a reweighted Cox's ridge regression at each iteration step, it allows us to adapt an efficient algorithm developed by Mittal et al12 for massive Cox ridge regression.
2.2.1 Adaptation of existing efficient algorithms for fitting massive ℓ2-penalized Cox's regression
Mittal et al12 developed an efficient implementation of the massive Cox's ridge regression for sHDMSS data. For parameter estimation, the authors adopted the CLG algorithm of Zhang and Oles,35 which is a type of cyclic coordinate descent algorithm that estimates the coefficients using one-dimensional updates. The CLG easily scales to high-dimensional data7, 36, 37 and has been recently implemented for fitting ℓ2- and ℓ1-penalized generalized linear models,38 parametric time-to-event models,39 and Cox's model.12 Readers are encouraged to refer to Section S3 of the Supplementary Material for a detailed explanation of the algorithm.
The design matrix Z for sHDMSS data has few nonzero entries for each subject. Storing such a sparse matrix as a dense matrix is inefficient and may increase computation time and/or cause standard software to crash due to insufficient memory allocation. To the best of our knowledge, popular penalization packages such as glmnet40 and ncvreg41 do not support a sparse data format as an input for right-censored time-to-event models, although the former supports the input for other generalized linear models. For sHDMSS data, we propose to use specialized column-data structures as in the works of Mittal et al12 and Suchard et al.38 The advantage of this structure is two-fold, ie, it significantly reduces the memory requirement needed to store the covariate information, and performance is enhanced when employing cyclic coordinate descent. For example, when updating βj, efficiency is gained when computing and storing the inner product $r_i = z_i^\top \beta$ using the low-rank update $r_i^{(\mathrm{new})} = r_i + z_{ij}\,\Delta\beta_j$ for all i.12, 35, 36, 38, 42
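The column-wise storage and the incremental update of the linear predictors can be sketched as follows. The layout (one list of `(row, value)` pairs per covariate) and the function names are illustrative assumptions, not the data structures used in CYCLOPS.

```python
def to_columns(rows, p):
    """Column-wise sparse format: cols[j] lists (i, z_ij) for the nonzero entries of column j."""
    cols = [[] for _ in range(p)]
    for i, row in enumerate(rows):
        for j, z in enumerate(row):
            if z != 0:
                cols[j].append((i, z))
    return cols

def update_coordinate(cols, beta, r, j, delta_beta):
    """Apply beta_j += delta_beta and refresh r_i = z_i' beta,
    touching only the subjects whose covariate j is nonzero."""
    beta[j] += delta_beta
    for i, z in cols[j]:
        r[i] += z * delta_beta

# Small hypothetical binary design: 4 subjects, 3 covariates.
rows = [[1, 0, 0], [0, 0, 1], [1, 1, 0], [0, 0, 0]]
cols = to_columns(rows, 3)
beta = [0.0, 0.0, 0.0]
r = [0.0, 0.0, 0.0, 0.0]
update_coordinate(cols, beta, r, 0, 0.5)
update_coordinate(cols, beta, r, 2, -1.0)
dense_r = [sum(b * z for b, z in zip(beta, row)) for row in rows]
```

Each coordinate update costs time proportional to the number of nonzero entries in that column rather than to n, which is where the gain comes from when fewer than 1% of entries are nonzero.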
Furthermore, one requires a series of cumulative sums introduced through the risk set Ri={j:Xj>Xi} for each subject i to calculate the gradient and Hessian diagonal. These cumulative sums would need to be calculated when updating each parameter estimate in the optimization routine. This can prove to be computationally costly, especially when both n and pn are large. By taking advantage of the sparsity of the design matrix, one can reduce the computational time needed to calculate these cumulative sums by entering into this operation only if at least one observation in the risk set has a nonzero covariate value along dimension j and embarking on the scan at the first nonzero entry rather than from the beginning. Suchard et al38 and Mittal et al12 have implemented these efficiency techniques for conditional Poisson regression and Cox's regression, respectively. Our BAR implementation naturally exploits the sparsity in the design matrix and the partial likelihood by embedding an adaptive version of Mittal et al's12 massive Cox's ridge regression within each iteration of the iteratively reweighted Cox's ridge regression.
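The saving from cumulative sums can be seen in a small sketch: once subjects are ordered by observed time, every risk-set sum is a running total, so one pass replaces n separate scans. The code assumes untied times and takes the risk set at X_i to include subject i itself (matching the at-risk process Y(s) = I(X ≥ s)); the function names are hypothetical, and the additional sparsity shortcuts described above are omitted.

```python
import math

def risk_set_sums_naive(X, eta):
    """For each subject i, sum exp(eta_l) over l with X_l >= X_i; O(n^2) overall."""
    n = len(X)
    return [sum(math.exp(eta[l]) for l in range(n) if X[l] >= X[i]) for i in range(n)]

def risk_set_sums_cumulative(X, eta):
    """Same quantities from one sort plus one running sum (assumes distinct times)."""
    order = sorted(range(len(X)), key=lambda i: X[i], reverse=True)  # latest time first
    out = [0.0] * len(X)
    running = 0.0
    for i in order:
        running += math.exp(eta[i])
        out[i] = running
    return out

X = [5.0, 1.0, 3.0, 2.0, 4.0]
eta = [0.1, -0.2, 0.0, 0.3, -0.1]
naive = risk_set_sums_naive(X, eta)
fast = risk_set_sums_cumulative(X, eta)
```

The sort is done once; afterward each evaluation of the sums costs O(n) instead of O(n²).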
3 SIMULATIONS
This section presents three simulation studies. First, we demonstrate in Section 3.1 that, for fixed λn, the BAR estimator is insensitive to the tuning parameter ξn of its initial ridge estimator and does well in terms of performing variable selection and correcting possible bias of the initial ridge estimator. Then, in Section 3.2, we evaluate and compare the operating characteristics of BAR with some popular penalized Cox regression methods, where we only consider settings with moderate sample sizes because most of the competing methods are inoperable for massive sample size data. Finally, in Section 3.3, we use an sHDMSS setting to illustrate the performance of BAR relative to its closest competitor.
Sections 3.1 and 3.2 employ the same simulation structure. Event times are drawn from an exponential proportional hazards model with baseline hazard h0(t)=1,
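Drawing event times from an exponential proportional hazards model with h0(t)=1 amounts to sampling T from an exponential distribution with rate exp(zᵀβ). The sketch below does this with the standard library; the independent exponential censoring mechanism, the `censor_rate` parameter, and the example covariates and coefficients are assumptions made for illustration, since the censoring design is not recoverable from the text here.

```python
import math
import random

def simulate_exponential_ph(Z, beta, censor_rate, seed=0):
    """Return (observed_time, event_indicator) pairs under h(t | z) = exp(z' beta)."""
    rng = random.Random(seed)
    data = []
    for row in Z:
        rate = math.exp(sum(b * z for b, z in zip(beta, row)))
        t = rng.expovariate(rate)         # event time, baseline hazard h0(t) = 1
        c = rng.expovariate(censor_rate)  # assumed independent exponential censoring time
        data.append((min(t, c), 1 if t <= c else 0))
    return data

Z = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1], [0, 0, 0]]
beta = [0.8, -0.5, 0.0]
sample = simulate_exponential_ph(Z, beta, censor_rate=0.3, seed=1)
```

Raising `censor_rate` increases the expected share of censored observations, so it can be tuned to approximate a target censoring percentage.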
3.1 Broken adaptive ridge estimator for varying values of ξn
We illustrate how the BAR estimator behaves by fixing λn and varying the tuning parameter ξn of the initial Cox ridge regression in the following. Figures 1B to 1D depict the solution path plots, averaged over 100 Monte Carlo simulations, of the BAR estimator with respect to ξn over a wide interval [10⁻², 10²] for n=300, pn=100, ≈25% censoring, and […]
As a reference, we also display the solution path plots of the corresponding initial ridge estimator in Figure 1A. The initial ridge estimator starts to introduce overshrinkage and, consequently, estimation bias when ξn exceeds 10. However, this bias is effectively corrected by BAR. Therefore, by iteratively refitting reweighted Cox ridge regression, the BAR estimator not only performs variable selection by shrinking estimates of the true zero parameters to zero, but also effectively corrects the estimation bias from the initial Cox ridge estimator. Similar results are obtained for several different simulation scenarios and can be found in Section S4 of the Supplementary Material.
3.2 Model selection and parameter estimation
In this simulation, we evaluate and compare the variable selection and parameter estimation performance of BAR with four popular penalized Cox regression methods, ie, LASSO,3 SCAD,4 adaptive LASSO (ALASSO),5 and MCP.6 We fix ξn=1 for the BAR methods since Section 3.1 yields evidence that the BAR estimator is insensitive to the selection of ξn. For all methods, a 25-value grid was used to find the optimal value of the tuning parameter via BIC minimization.32
Estimation bias is summarized through the mean squared bias,
Table 1. (Moderate dimension and sample size) Simulated estimation and variable selection performance of broken adaptive ridge (BAR) (BIC), LASSO (BIC), SCAD (BIC), adaptive LASSO (ALASSO) (BIC), and MCP (BIC), where BIC in parentheses indicates that Bayesian information criterion (BIC) minimization was used to select the tuning parameter via a grid search. (MSB = mean squared bias; FN = mean number of false negatives; FP = mean number of false positives; SM = average similarity measure; BIC = average BIC score; each entry is based on 100 Monte Carlo samples of size n=300, 1000, pn=100, censoring rate = 25%)
n | Method | MSB | FN | FP | SM | BIC |
---|---|---|---|---|---|---|
n=300 | BAR (λn = […]) | 0.06 | 0.02 | 0.23 | 0.98 | 1930.97 |
n=300 | BAR (λn = […]) | 0.10 | 0.17 | 0.02 | 0.98 | 1938.43 |
n=300 | BAR (BIC) | 0.11 | 0.01 | 1.79 | 0.89 | 1919.26 |
n=300 | LASSO (BIC) | 0.27 | 0.01 | 3.32 | 0.82 | 1958.40 |
n=300 | SCAD (BIC) | 0.12 | 0.01 | 2.23 | 0.87 | 1933.43 |
n=300 | ALASSO (BIC) | 0.11 | 0.04 | 1.48 | 0.90 | 1935.60 |
n=300 | MCP (BIC) | 0.09 | 0.02 | 1.21 | 0.92 | 1929.33 |
n=1000 | BAR (λn = […]) | 0.01 | 0.00 | 0.19 | 0.99 | 8200.97 |
n=1000 | BAR (λn = […]) | 0.01 | 0.00 | 0.00 | 1.00 | 8203.52 |
n=1000 | BAR (BIC) | 0.02 | 0.00 | 0.73 | 0.95 | 8196.51 |
n=1000 | LASSO (BIC) | 0.10 | 0.00 | 2.77 | 0.84 | 8236.76 |
n=1000 | SCAD (BIC) | 0.01 | 0.00 | 0.23 | 0.98 | 8203.00 |
n=1000 | ALASSO (BIC) | 0.02 | 0.00 | 0.26 | 0.98 | 8204.58 |
n=1000 | MCP (BIC) | 0.01 | 0.00 | 0.08 | 0.99 | 8202.04 |
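For a single simulated data set, the selection and estimation summaries reported in the table can be computed along the following lines. The function name is hypothetical, and the squared-error quantity is only an assumed building block: averaging it over Monte Carlo replicates gives a mean-squared-error type summary, which may differ in detail from the paper's MSB definition.

```python
import math

def selection_metrics(beta_hat, beta_true):
    """Return (false positives, false negatives, squared estimation error) for one fit."""
    pairs = list(zip(beta_hat, beta_true))
    fp = sum(1 for bh, bt in pairs if bh != 0 and bt == 0)  # noise variables selected
    fn = sum(1 for bh, bt in pairs if bh == 0 and bt != 0)  # true signals missed
    sq_err = sum((bh - bt) ** 2 for bh, bt in pairs)
    return fp, fn, sq_err

fp, fn, sq_err = selection_metrics([0.9, 0.0, 0.3, 0.0], [1.0, 0.5, 0.0, 0.0])
```

Averaging `fp`, `fn`, and `sq_err` over the 100 replicates would produce table entries of the kind shown above.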
From Table 1, we see that, when the tuning parameter λn is selected by minimizing the BIC score, as for the other methods, the performance of BAR (BIC) is generally comparable to that of the other methods with respect to all measures across both scenarios. We have conducted more extensive simulations with different combinations of model dimension, censoring rates, sample sizes, and model sparsity, which yielded consistent findings and are reported in Section S5 of the Supplementary Material.
Since BAR aims to approximate ℓ0-penalized regression, it directly provides a surrogate optimum for some popular information criteria with some prefixed λn. For example, performing BAR with […]
3.3 Sparse high-dimensional massive sample size data
In this simulation, we simulate an sHDMSS time-to-event data set with n=200 000, pn=20 000, and qn=80. Event times are generated from an exponential hazards model with baseline hazard h0(t)=1, regression coefficients […]
Table 2. (Sparse high-dimensional and massive sample size) Estimation and variable selection results of massive Cox regression with broken adaptive ridge (mBAR) and LASSO penalization (mCox-LASSO12) for a simulated sHDMSS data set with n=200 000, pn=20 000, and qn=80. (Bias = […]
Method | Bias | FP | FN | BIC score |
---|---|---|---|---|
mBAR (λn = […]) | 1.19 | 0 | 3 | 83 313.02 |
mBAR (λn = […]) | 2.02 | 0 | 10 | 83 573.96 |
mBAR (BIC) | 0.97 | 5 | 0 | 83 266.47 |
mCox-LASSO (BIC) | 2.93 | 12 | 0 | 84 479.47 |
mCox-LASSO (CV) | 2.12 | 963 | 0 | 93 770.58 |
- Abbreviations: BIC, Bayesian information criterion; CV, cross-validation.
We observe that both mCox-LASSO methods have retained all 80 true nonzero coefficients together with a moderate to large number of noise variables (12 for BIC and 967 for CV). In contrast, mBAR (BIC) chooses a sparser model selecting all 80 nonzero coefficients and 5 noise variables. As expected, mBAR (BIC) is less biased (0.82) than mCox-LASSO (2.49 for BIC and 2.02 for CV) and has a much lower BIC score when compared to both mCox-LASSO methods. We also notice that mBAR with the two prefixed λn tends to underestimate the true model, ie, fixing
We further examined the solution paths of mCox-LASSO and mBAR in Figure 2. The vertical solid and dashed lines in the mCox-LASSO solution path plot (Figure 2A) represent the estimates at the optimal tuning parameter obtained via CV and BIC minimization, respectively. We can see that the mCox-LASSO solution path changes rapidly as its tuning parameter varies and shows severe bias. In contrast, the mBAR solution path (Figure 2B) changes very slowly with respect to λn. The vertical line in Figure 2B marks the estimates at the tuning parameter selected by BIC minimization, which yields a model whose estimates are less biased than those of mCox-LASSO (see Table 2). Furthermore, the optimal value of λn that minimizes the BIC score for mBAR roughly corresponds to
For the mBAR method, we also made a solution path plot with respect to ξn, while fixing
4 PEDIATRIC TRAUMA MORTALITY
For an application of mBAR regression in the sHDMSS setting, we consider a subset of the NTDB, a trauma database maintained by the American College of Surgeons.1 This data set was previously analyzed by Mittal et al12 as an example of efficient massive Cox regression with mCox-LASSO and ridge penalties for sHDMSS data. The data set includes 210 555 patient records of injured children under 15 that were collected over 5 years (2006 to 2010). Each patient record includes 125 952 binary covariates, which indicate the presence or absence of an attribute (ICD9 codes, AIS codes, etc) as well as their two-way interactions. The outcome of interest is mortality after time of injury. The data are extremely sparse, with less than 1% of the covariates being nonzero, and have a censoring rate of 98%. We randomly split the data into training and test sets of 168 000 and 42 555 records, respectively. The mortality rates of both sets were approximately equal to the combined rate. Similar to Section 3.3, we were unable to load the training set (n=168 000, pn=125 000) into other popular oracle procedures because of the memory needed to support a dense design matrix of that size, so we compare mBAR only to mCox-LASSO. BIC-score minimization over a penalization path of 10 tuning parameters was used to select the final model for both mBAR (fixing
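The BIC-based selection over a penalization path can be sketched as below. This is a minimal illustration, not the implementation used in the analysis: it assumes the common form BIC = −2ℓ(β̂) + log(n) × (number of nonzero coefficients), where ℓ is the Cox partial log-likelihood, because the exact BIC definition is not shown in this excerpt. The function names are hypothetical, the partial likelihood assumes no tied event times, and the fitted coefficient vectors along the path are taken as given.

```python
import numpy as np

def cox_partial_loglik(beta, X, time, status):
    """Cox partial log-likelihood (assumes no tied event times)."""
    eta = X @ beta
    order = np.argsort(-time)          # descending time, so risk sets grow along the array
    eta_sorted = eta[order]
    status_sorted = status[order]
    # log sum_{l: t_l >= t_i} exp(eta_l), accumulated stably in log space.
    log_risk = np.logaddexp.accumulate(eta_sorted)
    return float(np.sum(status_sorted * (eta_sorted - log_risk)))

def select_by_bic(path, X, time, status):
    """Pick the coefficient vector on the path with the smallest assumed BIC."""
    n = X.shape[0]
    scores = []
    for beta in path:
        df = np.count_nonzero(beta)    # model size = number of selected covariates
        loglik = cox_partial_loglik(beta, X, time, status)
        scores.append(-2.0 * loglik + np.log(n) * df)
    best = int(np.argmin(scores))
    return best, scores
```

Here `path` is a list of coefficient vectors, one per tuning parameter value (10 values in the analysis above); the function returns the index of the selected fit together with all scores.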
Table 3 summarizes the findings for our example, which mirror what we observed in Section 3.3. mBAR with BIC minimization selects fewer covariates than both mCox-LASSO methods. In both model selection and discriminatory performance, mBAR (BIC) is similar to or slightly better than both mCox-LASSO methods. Again, mBAR with prefixed λn selects far fewer covariates than mBAR (BIC); however, the overall high c-index for both methods suggests that the strong predictors for pediatric trauma are still retained in the model. In terms of runtime, mBAR (BIC) is more time consuming than mCox-LASSO (BIC), as expected, but mBAR with a prefixed tuning parameter value reduces the runtime while giving comparable prediction performance.
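For readers unfamiliar with the c-index used to measure discriminatory performance, the sketch below computes Harrell's concordance index from test-set times, event indicators, and predicted risk scores. It assumes Harrell's definition (this excerpt does not say which variant was used) and uses a simple loop over events that is quadratic in the worst case, so it is meant for illustration rather than for a test set of 42 555 records. The function name is hypothetical.

```python
import numpy as np

def harrell_c_index(time, status, risk):
    """Harrell's c-index: fraction of comparable pairs ordered correctly.

    A pair (i, j) is comparable when subject i has an observed event and
    subject j has a strictly longer follow-up time. The pair is concordant
    when the subject with the earlier event has the higher risk score;
    tied risk scores count as one half.
    """
    time = np.asarray(time, dtype=float)
    status = np.asarray(status, dtype=int)
    risk = np.asarray(risk, dtype=float)
    concordant = 0.0
    comparable = 0
    for i in np.flatnonzero(status == 1):
        later = time > time[i]                     # subjects outliving subject i
        comparable += int(later.sum())
        concordant += float(np.sum(risk[i] > risk[later]))
        concordant += 0.5 * float(np.sum(risk[i] == risk[later]))
    return concordant / comparable

print(harrell_c_index([1, 2, 3, 4], [1, 1, 0, 1], [4.0, 3.0, 2.0, 1.0]))  # 1.0
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the values in Table 3 should be read.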
Table 3. (Pediatric National Trauma Data Bank (NTDB) data) Comparison of mCox-LASSO and broken adaptive ridge regression for massive Cox regression (mBAR) on the pediatric NTDB data. (mCox-LASSO (CV) and mCox-LASSO (BIC) correspond to mCox-LASSO using cross-validation (CV) and the Bayesian information criterion (BIC) as the selection criterion, respectively. mBAR (BIC) denotes mBAR using the BIC selection criterion while fixing
Method | # Selected | BIC score | c-index | Runtime (hours) |
---|---|---|---|---|
mBAR (…) | 45 | 51 613.52 | 0.91 | 8 |
mBAR (…) | 21 | 52 182.90 | 0.89 | 8 |
mBAR (BIC) | 83 | 51 269.43 | 0.93 | 97 |
mCox-LASSO (BIC) | 100 | 52 544.90 | 0.91 | 25 |
mCox-LASSO (CV) | 253 | 53 165.44 | 0.92 | 41 |
5 DISCUSSION
We have extended the BAR methodology to Cox's model as a new sparse Cox regression method and rigorously established that it is selection consistent, oracle for parameter estimation, stable, and has a grouping property for highly correlated covariates. We illustrate through empirical studies that the BAR estimator has satisfactory performance for variable selection and parameter estimation. We have also extended the application of BAR to the sHDMSS domain by taking advantage of the fact that the BAR algorithm allows us to easily adapt existing high performance algorithms and software for massive ℓ2-penalized Cox regression.12
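To make concrete why BAR can reuse an existing ℓ2-penalized solver, the following sketch runs the iteratively reweighted ridge idea on ordinary least squares. Two points are assumptions rather than statements from this excerpt: the update is taken to be an initial ridge fit with penalty ξn followed by repeated ridge fits with penalty λn and weights 1/β_j² from the previous iterate, and squared-error loss is swapped in for the Cox partial likelihood to keep the example short. The function name `bar_linear` and all default values are hypothetical.

```python
import numpy as np

def bar_linear(X, y, xi=1.0, lam=1.0, max_iter=100, tol=1e-8, zero_tol=1e-6):
    """Broken adaptive ridge sketch for least squares (not the Cox version)."""
    p = X.shape[1]
    XtX = X.T @ X
    Xty = X.T @ y
    # Step 0: ordinary ridge estimate with penalty xi.
    beta = np.linalg.solve(XtX + xi * np.eye(p), Xty)
    for _ in range(max_iter):
        # Reweighted ridge: minimize ||y - X b||^2 + lam * sum_j b_j^2 / beta_j^2.
        # Substituting b = gamma * theta with gamma = |beta| turns this into a
        # standard ridge problem in theta and avoids dividing by tiny coefficients.
        gamma = np.abs(beta)
        A = (gamma[:, None] * XtX) * gamma[None, :] + lam * np.eye(p)
        theta = np.linalg.solve(A, gamma * Xty)
        new_beta = gamma * theta
        converged = np.max(np.abs(new_beta - beta)) < tol
        beta = new_beta
        if converged:
            break
    beta[np.abs(beta) < zero_tol] = 0.0   # coefficients shrunk to numerical zero
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))
truth = np.zeros(20)
truth[:3] = [2.0, -1.5, 1.0]
y = X @ truth + rng.standard_normal(500)
print(np.flatnonzero(bar_linear(X, y, xi=1.0, lam=np.log(500))))
```

Each pass through the loop is a single ridge solve, so in the Cox setting the same loop could call a massive ℓ2-penalized Cox routine in place of `np.linalg.solve`; the price is one such solve per iteration.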
Our surrogate ℓ0-based BAR method and theory can be easily extended to a surrogate ℓd-based BAR method for any d∈[0,1], by replacing
Our theoretical and empirical results have established the BAR method as a valid and viable tool for variable selection and parameter estimation under the pn<n setting although pn is allowed to diverge with n. Theoretical properties of the BAR estimator for the high-dimensional setting (pn≫n) remain to be investigated. Furthermore, as pointed out by a referee, although BAR is selection consistent and oracle, it is subject to the same postselection inference issues as other variable selection methods.50, 51 Lastly, although iteratively performing reweighted ℓ2-penalizations allows BAR to enjoy the best of ℓ0- and ℓ2-penalized regressions and to readily adopt an existing efficient implementation of ℓ2-penalization for sHDMSS data, its iterative nature does present another layer of computational complexity. While this added layer of computational complexity is not a practical concern for moderate size data, it can considerably increase the runtime in a large data setting when both n and p are large. As illustrated in our real data example, trying a prefixed tuning parameter value based on the extended BIC
ACKNOWLEDGEMENTS
The authors are grateful to the referees for their insightful comments and suggestions that have greatly improved the paper. The authors are also grateful to Drs Randall Burd and Sushil Mittal for providing us access to the NTDB data. Gang Li's research was supported in part by National Institutes of Health grants P30 CA-16042, UL1TR000124-02, and P01AT003960.
CONFLICT OF INTEREST
The authors declare no potential conflict of interest.
DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the American College of Surgeons.1 Restrictions apply to the availability of these data.