这是用户在 2024-6-16 18:46 为 https://onlinelibrary.wiley.com/doi/10.1002/sim.8438 保存的双语快照页面,由 沉浸式翻译 提供双语支持。了解如何保存?
一个代理 l 0 稀疏 Cox 回归,用于稀疏高维大规模样本量事件发生时间数据
RESEARCH ARTICLE 研究论文
Full Access 完全访问权限

A surrogate 0 sparse Cox's regression with applications to sparse high-dimensional massive sample size time-to-event data
一个代理 l 0 稀疏 Cox 回归,用于稀疏高维大规模样本量事件发生时间数据

Eric S. Kawaguchi

Eric S. Kawaguchi

Department of Preventive Medicine, University of Southern California, Los Angeles, California

Search for more papers by this author
Marc A. Suchard

Marc A. Suchard

Department of Preventive Medicine, University of Southern California, Los Angeles, California

Department of Biomathematics, University of California, Los Angeles, California

Department of Human Genetics, University of California, Los Angeles, California

Search for more papers by this author
Zhenqiu Liu

Zhenqiu Liu

Department of Public Health Sciences, Penn State Cancer Institute, Hershey, Pennsylvania

Search for more papers by this author
Gang Li

Corresponding Author

Gang Li

Department of Preventive Medicine, University of Southern California, Los Angeles, California

Department of Biomathematics, University of California, Los Angeles, California

Gang Li, Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA 90095.

Email: vli@ucla.edu

Search for more papers by this author
First published: 08 December 2019
Citations: 6

首次发表时间: 08十二月2019 https://doi.org/10.1002/sim.8438Citations:6

Abstract 抽象

Sparse high-dimensional massive sample size (sHDMSS) time-to-event data present multiple challenges to quantitative researchers as most current sparse survival regression methods and software will grind to a halt and become practically inoperable. This paper develops a scalable 0-based sparse Cox regression tool for right-censored time-to-event data that easily takes advantage of existing high performance implementation of 2-penalized regression method for sHDMSS time-to-event data. Specifically, we extend the 0-based broken adaptive ridge (BAR) methodology to the Cox model, which involves repeatedly performing reweighted 2-penalized regression. We rigorously show that the resulting estimator for the Cox model is selection consistent, oracle for parameter estimation, and has a grouping property for highly correlated covariates. Furthermore, we implement our BAR method in an R package for sHDMSS time-to-event data by leveraging existing efficient algorithms for massive 2-penalized Cox regression. We evaluate the BAR Cox regression method by extensive simulations and illustrate its application on an sHDMSS time-to-event data from the National Trauma Data Bank with hundreds of thousands of observations and tens of thousands sparsely represented covariates.
稀疏高维大规模样本量 (sHDMSS) 事件发生时间数据给定量研究人员带来了多重挑战,因为当前大多数稀疏生存回归方法和软件将停滞不前并变得几乎无法操作。本文开发了一种可扩展的基于 l 0 的稀疏 Cox 回归工具,用于右删失事件时间数据,该工具可轻松利用 sHDMSS 事件发生时间数据的现有 l 2 惩罚回归方法的高性能实现。具体来说,我们将基于 l 0 的破碎自适应脊 (BAR) 方法扩展到 Cox 模型,该模型涉及重复执行重新加权的 l 2 惩罚回归。我们严格地表明,Cox模型的结果估计器是选择一致的,用于参数估计的预言机,并且具有高度相关的协变量的分组属性。此外,我们通过利用现有的高效算法进行大规模 l 2 惩罚 Cox 回归,在 R 包中为 sHDMSS 事件时间数据实现我们的 BAR 方法。我们通过广泛的模拟评估了 BAR Cox 回归方法,并说明了其在来自国家创伤数据库的 sHDMSS 事件发生时间数据上的应用,该数据具有数十万个观测值和数万个稀疏表示的协变量。

1 INTRODUCTION 1 引言

Advancements in medical informatics tools and high-throughput biological experimentation are making large-scale data routinely accessible to researchers, administrators, and policymakers. This data deluge poses new challenges and critical barriers for quantitative researchers as existing statistical methods and software grind to a halt when analyzing these large-scale data sets, and calls for appropriate methods that can readily fit large-scale data. This paper primarily concerns survival analysis of sparse high-dimensional massive sample size (sHDMSS) data, a particular type of large-scale data with the following characteristics: (1) high-dimensional with a large number of covariates (pn in thousands or tens of thousands), (2) massive in sample-size (n in thousands to hundreds of millions), (3) sparse in covariates with only a very small portion of covariates being nonzero for each subject, and (4) rare in event rate. An example of sHDMSS data is the pediatric trauma mortality data from the National Trauma Data Bank (NTDB) maintained by the American College of Surgeons.1 This data set includes 210 555 patient records of injured children under 15 collected over 5 years from 2006 to 2010. Each patient record includes 125 952 binary covariates that indicate the presence or absence of an attribute (ICD9 Codes, AIS codes, etc) as well as their two-way interactions. The data matrix is extremely sparse with less than 1% of the covariates being non zero. The event (mortality) rate is also very low at 2%. Another application domain where sHDMSS data are common is drug safety studies that use massive patient-level databases such as the US FDA's Sentinel Initiative (https://www.fda.gov/safety/fdassentinelinitiative/ucm2007250.htm) and the Observational Health Data Sciences and Informatics program (https://ohdsi.org/) to study rare adverse events with hundreds of millions of patient records and tens of thousands of patient attributes that are sparse in the covariates.
医学信息学工具和高通量生物实验的进步使研究人员、管理人员和政策制定者能够经常访问大规模数据。这种数据洪流给定量研究人员带来了新的挑战和关键障碍,因为现有的统计方法和软件在分析这些大规模数据集时停滞不前,并需要适当的方法来适应大规模数据。本文主要涉及稀疏高维大规模样本量 (sHDMSS) 数据的生存分析,这是一种具有以下特征的特殊类型的大规模数据:(1) 具有大量协变量的高维数据(p n 以数千或数万计),(2) 样本量很大(n 以千万计到数亿),(3) 协变量稀疏,每个受试者只有很小一部分协变量为非零, (4)罕见事件发生率。sHDMSS 数据的一个例子是来自美国外科医师学会维护的国家创伤数据库 (NTDB) 的儿科创伤死亡率数据。1 本数据集包括2006年至2010年5年间收集的210 555份15岁以下受伤儿童的病历。每个患者记录包括 125 952 个二元协变量,这些协变量表示属性(ICD9 代码、AIS 代码等)的存在与否以及它们的双向交互。数据矩阵非常稀疏,只有不到 1% 的协变量不为零。事件(死亡率)也非常低,为2%。sHDMSS数据常见的另一个应用领域是使用大量患者级数据库的药物安全性研究,例如美国FDA的Sentinel Initiative(https://www.fda.gov/safety/fdassentinelinitiative/ucm2007250。htm)和观察性健康数据科学和信息学计划(https://ohdsi.org/),以研究具有数亿份患者记录和数以万计的患者属性的罕见不良事件,这些属性在协变量中是稀疏的。

The sHDMSS survival data present multiple challenges to quantitative researchers. First, not all of the thousands of covariates are expected to be relevant to an outcome of interest. It would also be practically undesirable to predict a patient outcome using thousands of covariates. Traditionally, researchers hand-pick subject characteristics to include in an analysis. However, hand picking can introduce not only bias, but also a source of variability between researchers and studies. Moreover, it would become impractical in large-scale evidence generation when hundreds or thousands of analyses are to be performed.2 Hence, automated sparse regression methods are desired. Secondly, the commonly used “divide and conquer” strategy for massive size data is deemed inappropriate for sHDMSS time-to-event data since each of the divided data would have too few events for a meaningful analysis. Third, sHDMSS data presents a critical barrier to the application of existing sparse survival regression methods, since most current methods and standard software become inoperable for large data sets due to high computational costs and large memory requirements. Although many sparse survival regression methods are available,3-10 to the best of our knowledge, only LASSO, Elastic Net11 and ridge regression have been adapted to fit sHDMSS time-to-event data. In particular, Mittal et al12 developed a tool, named CYCLOPS, for fitting LASSO and ridge Cox regression with sHDMSS time-to-event data by storing data in a sparse format, exploiting sparsity in the data and partial likelihood, and using multicore threading and vector processing, along with other high-performance computing techniques, which delivers > 10-fold speedup12 over its competitors. However, ridge Cox regression does not yield a sparse model and LASSO tends to select too many noise features and is biased for estimation.13, 14 Improved sparse Cox regression tools for sHDMSS time-to-event data are desired.
sHDMSS生存数据给定量研究人员带来了多重挑战。首先,并非所有数千个协变量都与感兴趣的结果相关。使用数千个协变量预测患者结果实际上也是不可取的。传统上,研究人员会亲自挑选要包含在分析中的受试者特征。然而,手工挑选不仅会带来偏见,还会成为研究人员和研究之间差异的来源。此外,当要进行数百或数千次分析时,在大规模证据生成中将变得不切实际。2 因此,需要自动稀疏回归方法。其次,对于大规模数据,常用的“分而治之”策略被认为不适用于 sHDMSS 事件时间数据,因为每个划分的数据都具有太少的事件,无法进行有意义的分析。第三,sHDMSS数据对现有稀疏生存回归方法的应用构成了一个关键的障碍,因为由于计算成本高和内存需求大,大多数当前方法和标准软件都无法用于大型数据集。尽管有许多稀疏生存回归方法可用,但据我们所知,只有 3-10 种,只有 LASSO、Elastic Net 11 和 ridge 回归经过调整以拟合 sHDMSS 事件发生时间数据。特别是,Mittal 等人 12 开发了一种名为 CYCLOPS 的工具,通过以稀疏格式存储数据、利用数据的稀疏性和部分似然性、使用多核线程和矢量处理以及其他高性能计算技术,将 LASSO 和 ridge Cox 回归与 sHDMSS 事件发生时间数据拟合,>竞争对手的加速速度提高了 10 倍 12。 然而,ridge Cox回归不会产生稀疏模型,并且LASSO倾向于选择过多的噪声特征,并且在估计时存在偏差。13, 14 需要改进的稀疏 Cox 回归工具,用于 sHDMSS 事件发生时间数据。

The purpose of this paper is to develop a surrogate 0-based sparse Cox regression method and adapt it to sHDMSS time-to-event data. It is well known that 0-penalized regression is natural for variable selection and parameter estimation with some optimal properties.15-18 On the other hand, it is also known to have some pitfalls such as instability19 and being unscalable to even moderate dimensional covariates. The broken adaptive ridge (BAR) estimator, defined as the limit of an iteratively reweighted 2-penalization algorithm, was introduced to approximate the 0-penalization problem and has been recently shown to possess some desirable selection, estimation, and clustering properties under the linear model and several other model settings.10, 20-22 It is also computationally scalable to high-dimensional covariates and stable for variable selection as discussed later in Remark 2 of Section 2. However, the BAR method has yet to be rigorously studied for the Cox model. Moreover, current BAR algorithms have only been implemented for densely-represented covariates and are unsuitable for sHDMSS data due to high computational costs, high memory requirements, and numerical instability. Computation of the Cox partial likelihood and its derivatives is particularly demanding for massive sample size data since the required number of operations grows at the rate of O(n2). The key contributions of this paper are twofold. First, we rigorously extend the BAR methodology to the Cox model. Specifically, we establish the selection consistency, an oracle property for parameter estimation, and a grouping property of highly correlated covariates for the Cox model. It is worth noting that the theoretical extension of the BAR methodology to Cox model is nontrivial and notably different from other models because the log-partial likelihood for the Cox model is not the sum of independent terms and the standard martingale central limit theorem used to derive the asymptotic theory for Cox's model with a fixed number of covariates is no longer applicable when the number of parameters diverges. Furthermore, because BAR involves performing an infinite number of penalized regressions, the derivations of its selection consistency and oracle property for estimation are substantially different from those for a single-step oracle estimator in the literature. The second key contribution of this paper is to develop an efficient implementation of BAR for Cox regression with sHDMSS time-to-event data by leveraging existing efficient massive 2-penalized Cox regression techniques,12 which include employing a column relaxation with logistic loss (CLG) algorithm using one-dimensional updates and a one-step Newton-Raphson approximation as well as exploiting the sparsity in the covariate structure and the Cox partial likelihood to reduce the number of operations from O(n2) to O(n).
本文旨在开发一种基于代理 l 0 的稀疏 Cox 回归方法,并使其适应 sHDMSS 事件发生时间数据。众所周知,l 0 惩罚回归对于具有某些最优属性的变量选择和参数估计是很自然的。15- 18 另一方面,众所周知,它也存在一些缺陷,例如不稳定性 19 并且无法扩展到甚至中等维协变量。断开自适应岭 (BAR) 估计器定义为迭代重加权 l 2 惩罚算法的极限,用于近似 l 0 惩罚化问题,最近已被证明具有一些理想的选择、估计和聚类属性在线性模型和其他几个模型设置下。10, 20- 22 它还可以在计算上扩展到高维协变量,并且对于变量选择是稳定的,如第 2 节后面的注释 2 所述。然而,BAR方法尚未针对Cox模型进行严格的研究。此外,目前的BAR算法只针对密集表示的协变量实现,由于计算成本高、内存要求高、数值不稳定,不适合sHDMSS数据。Cox 偏似然及其导数的计算对于大量样本量数据的要求特别高,因为所需的运算数量以 O(n 2 ) 的速率增长。本文的主要贡献是双重的。首先,我们将BAR方法严格扩展到Cox模型。具体来说,我们为 Cox 模型建立了选择一致性、用于参数估计的预言机属性以及高度相关协变量的分组属性。 值得注意的是,BAR方法对Cox模型的理论扩展是非平凡的,并且与其他模型明显不同,因为Cox模型的对数偏似然不是独立项的总和,并且当参数数量发散时,用于推导具有固定数量协变量的Cox模型的渐近理论的标准马丁格尔中心极限定理不再适用。此外,由于 BAR 涉及执行无限数量的惩罚回归,因此其用于估计的选择一致性和预言机属性的推导与文献中单步预言机估计器的推导有很大不同。本文的第二个主要贡献是通过利用现有的高效大规模 l 2 惩罚 Cox 回归技术 12,开发一种使用 sHDMSS 事件发生时间数据的 BAR 的高效实现,其中包括使用一维更新和一步牛顿-拉夫森近似的逻辑损失列松弛 (CLG) 算法,以及利用协变量结构中的稀疏性和 Cox 偏似然来降低从 O(n 2 ) 到 O(n) 的操作次数。

In Section 2, we formally define the BAR estimator, state its theoretical properties for variable selection, parameter estimation, and grouping highly correlated covariates for the Cox model, and describe an efficient implementation for sHDMSS survival data. We also discuss how to adapt BAR as a postscreening sparse regression method for ultrahigh dimensional Cox regression with relatively small sample size. In Section 3, we present simulation studies to demonstrate the performance of the CoxBAR estimator with both moderate and massive sample size in various low and high-dimensional settings. We provide a real data example using the pediatric trauma mortality data12 in Section 4. Lastly, we give closing remarks in Section 5. The appendix collects proofs of the theoretical results and regularity conditions needed for the derivations. An R package has been developed for BAR and made available at https:github.com/OHDSI/BrokenAdaptiveRidge.
在第 2 节中,我们正式定义了 BAR 估计器,陈述了其对变量选择、参数估计和 Cox 模型高度相关协变量分组的理论属性,并描述了 sHDMSS 生存数据的有效实现。我们还讨论了如何将 BAR 作为筛选后稀疏回归方法,用于样本量相对较小的超高维 Cox 回归。在第 3 节中,我们提出了仿真研究,以证明 CoxBAR 估计器在各种低维和高维设置下具有中等和大样本量的性能。我们在第 4 节中使用儿科创伤死亡率数据 12 提供了一个真实的数据示例。最后,我们在第 5 节中作结束语。附录收集了推导所需的理论结果和正则性条件的证明。已为 BAR 开发了一个 R 包,可在 https:github.com/OHDSI/BrokenAdaptiveRidge 上获得。

2 METHODOLOGY 2 方法论

2.1 Cox's BAR regression and its large sample properties
2.1 Cox's BAR回归及其大样本性质

2.1.1 The data structure, model, and estimator
2.1.1 数据结构、模型和估计器

Suppose that one observes a random sample of right-censored time-to-event data consisting of n independent and identically distributed triplets {(Xi,δi,zi(·))}i=1n, where for subject i, Xi=min(Ti,Ci) is the observed event time, δi=I(Ti ≤ Ci) is the censoring indicator, Ti is the event time of interest, and Ci is a censoring time that is conditionally independent of Ti given a pn-dimensional, possibly time-dependent, covariate vector zi(·)=(zi1(·),,zipn(·)).
假设观察到一个由 n 个独立且分布相同的三元 {(Xi,δi,zi(·))}i=1n 组组成的右删失事件时间数据的随机样本,其中对于受试者 i, Xi=min(Ti,Ci) 是观察到的事件时间,δ=I(T ≤ C) 是删失指标,T 是感兴趣的事件时间,C 是条件独立于 T 的删失时间,给定 p n 维, 可能与时间相关的协变量向量 zi(·)=(zi1(·),,zipn(·))

Assume the Cox23 proportional hazard model
假设 Cox 23 比例风险模型
h{t|z(t)}=h0(t)exp{z(t)β}, (1)
where h{t | z(t)} is the conditional hazard function of Ti given {z(u), 0 ≤ u ≤ t,}, h0(t) is an unspecified baseline hazard function, and β=(β1,,βpn) is a vector of time-independent regression coefficients. Denote by β1 and β2 the first qn and remaining pnqn components of β, respectively, and define β0=(β01,β02) as the true values of β, where, without loss of generality, β01=(β01,β0qn) is a vector of qn nonzero values and β02=0 is a pnqn dimensional vector of zeros. Further technical assumptions for β0 and pn are given later in condition (C6) of Section S4 of the Supplementary Material. For simplicity, we work on the time interval s∈[0, 1] as in the work of Andersen and Gill,24 which can be extended to any time interval [0, τ] for 0<τ<. Using the standard counting process notation, the log-partial likelihood for the Cox model is defined as
其中 h{t | z(t)} 是给定 {z(u), 0 ≤ u ≤ t,} 的条件风险函数,h 0 (t) 是未指定的基线风险函数, β=(β1,,βpn) 是与时间无关的回归系数的向量。分别用 β 1 和 β 2 表示 β 的前 q n 和剩余的 p n −q n 分量,并定义为 β0=(β01,β02) β 的真值,其中在不损失普遍性的情况下, β01=(β01,β0qn) 是 q n 非零值的向量,β 02 =0 是零的 p n −q n 维向量。β 0 和 p n 的进一步技术假设将在补充材料第 S4 节的条件 (C6) 中给出。为简单起见,我们按照 Andersen 和 Gill 的工作 24 处理时间间隔 s∈[0, 1],它可以扩展到任何时间间隔 [0, τ] 0<τ<∞。使用标准计数过程表示法,Cox 模型的对数部分似然定义为
ln(β)=i=1n01βzi(s)dNi(s)01log[j=1nYj(s)exp{βzj(s)}]dN¯(s), (2)
where, for subject i, Yi(s)=I(Xi ≥ s) is the at-risk process and Ni(s)=I(Xi ≤ s,δi=1) is the counting process of the uncensored event with intensity process hi(t|β)=h0(t)Yi(t)exp{zi(t)β} and N¯=i=1nNi.
其中,对于受试者 i,Y(s)=I(X ≥ s) 是风险过程,N(s)=I(X ≤ s,δ=1) 是具有强度过程 hi(t|β)=h0(t)Yi(t)exp{zi(t)β} 的未经审查事件的计数过程,并且 N¯=i=1nNi
Our Cox's BAR estimation of β starts with an initial Cox ridge regression estimator25
我们对 β 的 Cox BAR 估计从初始 Cox 岭回归估计器 25 开始
β^(0)=argminβ{2ln(β)+ξnj=1pnβj2}, (3)
which is updated iteratively by a reweighed 2-penalized Cox regression estimator
它由重新加权的 l 2 惩罚 Cox 回归估计器迭代更新
β^(k)=argminβ{2ln(β)+λnj=1pnβj2(β^j(k1))2},k1, (4)
where ξn and λn are nonnegative penalization tuning parameters. The BAR estimator is defined as
其中 ξ n 和 λ n 是非负惩罚调整参数。BAR估计器定义为
β^=limkβ^(k). (5)
Since 2-penalization yields a nonsparse solution, defining the BAR estimator as the limit is necessary to produce sparsity. Although λn is fixed at each iteration, it is weighted inversely by the square of the ridge regression estimates from the previous iteration. Consequently, coefficients whose true values are zero will have larger penalties in the next iteration, whereas penalties for truly nonzero coefficients will converge to a constant. We will show later in Theorem 1 that, under certain regularity conditions, the estimates of the truly zero coefficients shrink toward zero while the estimates of the truly nonzero coefficients converge to their oracle estimates with probability tending to 1. As illustrated by a small simulation in Section S2 (Figure S1) of the Supplementary Material, the signal (nonzero coefficients) and noise (zero coefficients) can be quickly separated within a few BAR iterations, although more iterations may be necessary in some scenarios to improve estimation of the nonzero coefficients.
由于 l 2 -惩罚化产生非稀疏解,因此将 BAR 估计器定义为极限对于生成稀疏性是必要的。尽管 λ 在每次迭代中 n 都是固定的,但它的权重与上一次迭代的岭回归估计值的平方成反比。因此,真实值为零的系数在下一次迭代中将受到更大的惩罚,而真正非零系数的惩罚将收敛到一个常数。我们将在定理 1 的后面说明,在某些正则性条件下,真正零系数的估计值向零收缩,而真正非零系数的估计值收敛到它们的预言机估计值,概率趋于 1。如补充材料第S2节(图S1)中的小型仿真所示,信号(非零系数)和噪声(零系数)可以在几次BAR迭代中快速分离,尽管在某些情况下可能需要更多迭代来改进对非零系数的估计。

Remark 1.(Computational aspects of BAR) For moderate size data, one may calculate β^(k) in 4 using the Newton-Raphson method as in the work of Frommlet and Nuel,26 who outlined an iterative reweighted ridge regression for generalized linear models. It appears at first sight that 4 will encounter numerical overflow as some of the coefficients β^j(k1) will go to zero as k increases. However, it can be shown that after some simple algebraic manipulation the Newton-Raphson updating formula will only involve multiplications, instead of divisions, by β^j(k1)s, and thus numerical overflow can be avoided. Further details are provided in Section S1 (Equation (3)) of the Supplementary Material. We also note that because the limit of the BAR algorithm cannot be numerically achieved at any finite iteration step, an extra thresholding rule for small coefficients will be required to numerically obtain a sparse solution. However, this thresholding level can be set arbitrarily small (by default, we set the threshold value to 10−6 in our implementation) since it is simply used for numerical convergence to zero and has minimal impact on the resulting BAR estimator. Furthermore, Equation (3) of Section S1 of the Supplementary Material implies that, once a β^j(k1) becomes zero, it will remain as zero in subsequent iterations. Thus, one only needs to update β^(k) within the reduced nonzero parameter space, an appealing computational advantage for high-dimensional settings.
备注 1.(BAR的计算方面)对于中等大小的数据,可以使用 Newton-Raphson 方法在 4 中进行计算 β^(k) ,就像 Frommlet 和 Nuel 的工作一样,26 他们概述了广义线性模型的迭代重加权岭回归。乍一看,4 似乎会遇到数值溢出,因为随着 k 的增加,一些系数 β^j(k1) 将变为零。然而,可以证明,经过一些简单的代数运算后,牛顿-拉夫森更新公式将只涉及乘法,而不是除 β^j(k1) 法,乘以 s,因此可以避免数值溢出。补充材料的S1部分(等式(3))提供了更多详细信息。我们还注意到,由于BAR算法的极限在任何有限迭代步骤中都无法数值实现,因此需要对小系数的额外阈值规则进行数值化以获得稀疏解。但是,这个阈值水平可以任意设置(默认情况下,我们在实现 −6 中将阈值设置为 10),因为它只是用于数值收敛到零,并且对生成的 BAR 估计器的影响最小。此外,补充材料 S1 部分的等式 (3) 意味着,一旦 a β^j(k1) 变为零,它将在随后的迭代中保持为零。因此,只需要在减少的非零参数空间内进行更新 β^(k) ,这对于高维设置来说是一个有吸引力的计算优势。

For massive-size data with large n and pn, the Newton-Raphson procedure, which at each iteration, calls for the calculation of both the gradient and Hessian can become practically infeasible due to high computational costs, memory requirements, and numerical instability. In Section 2.2, we will discuss how to adapt an efficient algorithm for massive 2-penalized Cox regression via cyclic coordinate descent and exploit the sparsity of the covariate structure to make BAR scalable to sHDMSS data.
对于具有大 n 和 p n 的海量数据,Newton-Raphson 过程在每次迭代中都要求计算梯度和 Hessian 由于计算成本高、内存要求和数值不稳定性而变得几乎不可行。在第 2.2 节中,我们将讨论如何通过循环坐标下降调整一种有效的算法,以实现大规模 l 2 惩罚的 Cox 回归,并利用协变量结构的稀疏性使 BAR 可扩展到 sHDMSS 数据。

Remark 2.(Broken adaptive ridge versus best subset selection) The BAR method can be viewed as a performing a sequence of surrogate 0-penalizations, where each reweighted 2 penalty serves as a surrogate 0-penalty and the approximation of 0-penalization improves with each iteration. Consequently, BAR enjoys the best of 0- and 2-penalized regressions. For example, we establish in the next two sections that BAR possesses the oracle properties for estimation and selection consistency (an 0 property) as well as a grouping property (an 2 property). Numerically, for a fixed tuning parameter value, BAR is a surrogate to 0-penalization is not expected to be identical, but can be similar to the exact global 0-penalization solution where the latter must be solved using the best subset search (BSS). We illustrate this in Section S3 of the Supplementary Material (Figures S2 and S3) using a small simulation study. It is worth emphasizing that BAR overcomes some shortcomings of BSS; for example, BSS is computationally NP-hard and can be unstable for variable selection,19 whereas BAR is scalable to high-dimensional covariates and is more stable for variable selection as demonstrated in Figures S2 and S3 in Section S3 of the Supplementary Material.
备注 2.(破碎的自适应脊与最佳子集选择)BAR 方法可以看作是执行一系列代理 l 0 惩罚,其中每个重新加权的 l 2 惩罚都充当代理 l 0 惩罚,并且 l 0 惩罚的近似值随着每次迭代而提高。因此,BAR 享有 l 0 和 l 2 惩罚回归的最佳效果。例如,在接下来的两节中,我们将确定 BAR 具有用于估计和选择一致性的预言机属性(l 0 属性)以及分组属性(l 2 属性)。从数值上讲,对于固定的调优参数值,BAR 是 l 0 的替代项,预计不会完全相同,但可以类似于精确的全局 l 0 惩罚化解决方案,后者必须使用最佳子集搜索 (BSS) 求解。我们在补充材料的S3部分(图S2和S3)中使用一个小型模拟研究来说明这一点。值得强调的是,BAR克服了BSS的一些缺点;例如,BSS 在计算上是 NP 困难的,对于变量选择可能不稳定 19,而 BAR 可扩展到高维协变量,并且对于变量选择更稳定,如补充材料第 S2 节中的图 S3 和 S3 所示。

2.1.2 Oracle properties 2.1.2 Oracle属性

We establish the oracle properties for the BAR estimator for simultaneous variable selection and parameter estimation where we allow both qn and pn to diverge to infinity.
我们为 BAR 估计器建立了预言机属性,用于同时进行变量选择和参数估计,其中我们允许 q n 和 p n 发散到无穷大。

Theorem 1. (Oracle properties)Assume the regularity conditions (C1) to (C6) in Section S4.1 of the Supplementary Material hold. Let β^1 and β^2 be the first qn and the remaining pnqn components of the BAR estimator β^, respectively. Then, as n, with probability tending to one,

  • (a) the BAR estimator β^=(β^1,β^2) exists and is unique, where β^2=0;
  • (b)  (二) nbn(β0)111/2(β^1β01)DN(0,1), for any qn-dimensional vector bn such that ‖bn2 ≤ 1 and where ∑(β0)11 is the first qn×qn submatrix of ∑(β0), where ∑(β0) is defined in Condition (C4).
    ,对于任何 q n 维向量 b n ,使得 ‖b n 2 ‖ ≤ 1,并且 ∑(β 011 是 ∑(β 0 的第一个 q n ×q n 子矩阵),其中 ∑(β 0 ) 在条件 (C4) 中定义。

Theorem 1(a) establishes selection consistency of the BAR estimator. Part (b) of the theorem essentially states that the nonzero component of the BAR estimator is asymptotically normal and equivalent to the weighted ridge estimator of the oracle model, as shown in the proof provided in Section S4.2 of the Supplementary Material.
定理1(a)建立了BAR估计器的选择一致性。该定理的 (b) 部分基本上指出,BAR 估计器的非零分量是渐近正态的,等效于预言机模型的加权脊估计器,如补充材料第 S4.2 节中提供的证明所示。

Remark 3. (Ultrahigh-dimensional covariates setting)Although we allow pn to diverge, the asymptotic properties of the BAR estimator in the Section 28 are derived for pn<n. In an ultrahigh-dimensional setting where the number of covariates far exceeds the number of observations (pnn), one may couple a sure screening27 method with the BAR estimator to obtain a two-step estimator with desirable selection and estimation properties. The orders of qn, pn, and n and their relationships depend on the employed screening procedure. For example, coupling the BAR estimator with the sure joint screening procedure28 has been explored in the work of Kawaguchi.29
备注 3.(超高维协变量设置)尽管我们允许 p n 发散,但第 28 节中 BAR 估计器的渐近性质是针对 p n ≫n) 的超高维环境中,可以将确定筛选 27 方法与 BAR 估计器耦合,以获得具有理想选择和估计特性的两步估计器。q n 、p n 和 n 的顺序及其关系取决于所采用的筛选程序。例如,在Kawaguchi的工作中已经探索了BAR估计器与确定联合筛选程序28的耦合。29

2.1.3 A grouping property
2.1.3 分组属性

When the true model has a group structure, it is desirable for a variable selection method to either retain or drop all variables that are clustered within the same group. It is well known that ridge regression possesses the grouping property for highly correlated covariates.11 Because the BAR estimator is based on an iterative ridge regression, we show that BAR also possesses a grouping property for highly correlated covariates as stated in following theorem.
当真实模型具有组结构时,变量选择方法最好保留或删除聚类在同一组中的所有变量。众所周知,岭回归具有高度相关协变量的分组特性。11 由于 BAR 估计器基于迭代岭回归,因此我们表明 BAR 还具有高度相关的协变量的分组属性,如以下定理所述。

Theorem 2.Let λn, {(Xi,δi,zi)}i=1n be given and assume that Z=(zi,zn) is standardized. That is, for all j=1,…,pn, i=1nzij=0,z[,j]z[,j]=n1, where z[, j] is the jth column of Z. Suppose the regularity conditions (C1) to (C6) in Section S4.1 of the Supplementary Material hold and let β^ be the BAR estimator. Then, for any β^i0 and β^j0,
定理 2.设 λ n{(Xi,δi,zi)}i=1n 并假设 Z=(zi,zn) 它是标准化的。也就是说,对于所有 j=1,...,p ni=1nzij=0,z[,j]z[,j]=n1, 其中 z [, j] 是 Z 的 j th 列。 假设补充材料第 S4.1 节中的正则性条件 (C1) 至 (C6) 成立,并设 β^ 为 BAR 估计器。然后,对于任何 β^i0β^j0

|β^i1±β^j1|1λn2{(n1)(1±rij)}n(1+dn)2, (6)
with probability tending to one, where dn=i=1nδi, and rij=1n1z[,i]z[,j] is the sample correlation of z[,i] and z[, j].
概率趋于 1,其中 dn=i=1nδirij=1n1z[,i]z[,j] 是 z [,i] 和 z [, j] 的样本相关性。

The proof is provided in Section S4.3 of the Supplementary Material. It is seen from 6 that, as rij→1, the absolute difference between β^i and β^j approaches 0, implying that the estimated coefficients of two highly positively correlated variables will be similar in magnitude. Similarly, the estimated coefficients of two highly negatively correlated variables are also similar in magnitude with a sign change.
该证明在补充材料的第 S4.3 节中提供。从图 6 可以看出,当 r ij →1 时,和 之间的 β^i β^j 绝对差值接近 0,这意味着两个高度正相关变量的估计系数在大小上相似。同样,两个高度负相关变量的估计系数在量级上也相似,并具有符号变化。

2.1.4 Selection of tuning parameters
2.1.4 调优参数的选择

Model complexity depends critically on the choice of the tuning parameters. The BAR estimator depends on two tuning parameters, ie, ξn for the initial ridge estimator in 3 and λn for the iterative ridge step in 4. Our simulations in Section 3.1 illustrate that, while fixing λn, the BAR estimator is insensitive to the choice of ξn over a wide interval (Figure 1) and thus practically only optimization with respect to λn is needed.
模型的复杂性很大程度上取决于调优参数的选择。BAR 估计器依赖于两个调谐参数,即 3 中的初始岭估计器为 ξ,4 n 中的迭代岭步长为 λ n 。我们在 3.1 节中的模拟表明,在固定 λ 时 n ,BAR 估计器对宽区间内 ξ n 的选择不敏感(图 1),因此实际上只需要对 λ n 进行优化。

Details are in the caption following the image
Path plot for broken adaptive ridge (BAR) regression with varying (A) ξn and (B) λn=log(pn), (C)  , (三)λn=0.5log(pn), and (D)  和 (D)λn=0.75log(pn) with estimates averaged over 100 Monte Carlo simulations of size n=300, pn=100, and censoring rate ≈25%. Path plot for ridge regression (D) with varying ξn is also included as a comparison
具有变化 (A) ξ n 和 (B) λn=log(pn) 、 (C) λn=0.5log(pn) 和 (D) λn=0.75log(pn) 的断裂自适应脊 (BAR) 回归的路径图,估计值平均超过 100 个大小 n=300、p n =100 和删失率 ≈25% 的蒙特卡罗模拟。还包括具有不同 ξ n 的岭回归 (D) 的路径图作为比较

We optimize with respect to λn in a similar manner to currently used penalization methods. A popular strategy for tuning parameter selection is to perform optimization with respect to a data-driven selection criterion such as cross-validation (CV),30, 31 Akaike information criterion,15 and Bayesian information criterion (BIC).16, 17, 32 Although CV has been used extensively in the literature, it has been known to asymptotically overfit models with a positive probability.33, 34 Recent theoretical work has shown that, for penalized Cox models that possess the oracle property, BIC-based tuning parameter selection identifies the true model with probability tending to one.32 Further discussion on selecting λn for BAR is provided in the last paragraph of Section 3.2.
我们 n 以与当前使用的惩罚方法类似的方式对 λ 进行优化。调整参数选择的常用策略是针对数据驱动的选择标准(例如交叉验证 (CV)、30、31 赤池信息标准、15 和贝叶斯信息标准 (BIC))执行优化。16, 17, 32 尽管 CV 在文献中被广泛使用,但已知它会以正概率渐近过拟合模型。33, 34 最近的理论研究表明,对于具有预言机属性的惩罚 Cox 模型,基于 BIC 的调参数选择可以识别概率趋向于 1 的真实模型。32 关于 n 为 BAR 选择 λ 的进一步讨论在第 3.2 节的最后一段中提供。

2.2 Implementation of BAR for sHDMSS data
2.2 sHDMSS 数据的 BAR 实现

As mentioned in Remark 1, the Newton-Raphson algorithm used for each iteration of the BAR algorithm will become infeasible in large-scale settings with large n and pn due to high computational costs, high memory requirements, and numerical instability. Furthermore, recently proposed BAR algorithms, as with most popularly available procedures, cannot directly handle sHDMSS data due to the computational burden imposed by the design matrix. Because BAR only involves fitting a reweighted Cox's ridge regression at each iteration step, it allows us to adapt an efficient algorithm developed by Mittal et al12 for massive Cox ridge regression.
如注1所述,由于计算成本高、内存要求高、数值不稳定,BAR算法每次迭代时使用的Newton-Raphson算法在n和p n 较大的大规模设置中都变得不可行。此外,最近提出的BAR算法与大多数常用的程序一样,由于设计矩阵施加的计算负担,无法直接处理sHDMSS数据。由于 BAR 仅涉及在每个迭代步骤中拟合重新加权的 Cox's 岭回归,因此它允许我们采用 Mittal 等人 12 开发的高效算法来进行大规模 Cox 岭回归。

2.2.1 Adaptation of existing efficient algorithms for fitting massive 2-penalized Cox's regression
2.2.1 对现有高效算法的拟合 massive l 2 -惩罚 Cox 回归的调整

Mittal et al12 developed an efficient implementation of the massive Cox's ridge regression for sHDMSS data. For parameter estimation, the authors adopted the CLG algorithm of Zhang and Oles,35 which is a type of cyclic coordinate descent algorithm that estimates the coefficients using one-dimensional updates. The CLG easily scales to high-dimensional data7, 36, 37 and has been recently implemented for fitting 2- and 1-penalized generalized linear models,38 parametric time-to-event models,39 and Cox's model.12 Readers are encouraged to refer to Section S3 of the Supplementary Material for a detailed explanation of the algorithm.
Mittal 等人 12 开发了用于 sHDMSS 数据的大规模 Cox's ridge 回归的高效实现。对于参数估计,作者采用了Zhang和Oles, 35的CLG算法,这是一种使用一维更新估计系数的循环坐标下降算法。CLG 可轻松扩展到高维数据 7、36、37,并且最近已实现用于拟合 l 2 和 l 1 惩罚广义线性模型、38 参数化事件发生时间模型、39 和 Cox 模型。12 鼓励读者参阅补充材料第S3节,了解该算法的详细说明。

The design matrix Z for sHDMSS data has few nonzero entries for each subject. Storing such a sparse matrix as a dense matrix is inefficient and may increase computation time and/or cause standard software to crash due to insufficient memory allocation. To the best of our knowledge, popular penalization packages such as glmnet40 and ncvreg41 do not support a sparse data format as an input for right-censored time-to-event models, although the former supports the input for other generalized linear models. For sHDMSS data, we propose to use specialized column-data structures as in the works of Mittal et al12 and Suchard et al.38 The advantage of this structure is two-fold, ie, it significantly reduces the memory requirement needed to store the covariate information, and performance is enhanced when employing cyclic coordinate descent. For example, when updating βj, efficiency is gained when computing and storing the inner product ri=ziβ using a low-rank update ri(new)=ri+zij+Δβj for all i.12, 35, 36, 38, 42
sHDMSS 数据的设计矩阵 Z 对于每个主题都有几个非零条目。将这种稀疏矩阵存储为密集矩阵效率低下,并且可能会增加计算时间和/或由于内存分配不足而导致标准软件崩溃。据我们所知,流行的惩罚包(如 glmnet 40 和 ncvreg 41)不支持稀疏数据格式作为右删失事件时间模型的输入,尽管前者支持其他广义线性模型的输入。对于 sHDMSS 数据,我们建议使用专门的列数据结构,如 Mittal 等人 12 和 Suchard 等人的工作 38 这种结构的优点是双重的,即它显着减少了存储协变量信息所需的内存需求,并且在采用循环坐标下降时性能得到增强。例如,在更新β时,使用所有 i. 12、35、36、38、42 的低秩更新 ri(new)=ri+zij+Δβj 计算和存储内部产品 ri=ziβ 时,效率会提高

Furthermore, one requires a series of cumulative sums introduced through the risk set Ri={j:Xj>Xi} for each subject i to calculate the gradient and Hessian diagonal. These cumulative sums would need to be calculated when updating each parameter estimate in the optimization routine. This can prove to be computationally costly, especially when both n and pn are large. By taking advantage of the sparsity of the design matrix, one can reduce the computational time needed to calculate these cumulative sums by entering into this operation only if at least one observation in the risk set has a nonzero covariate value along dimension j and embarking on the scan at the first nonzero entry rather than from the beginning. Mittal et al12 and Suchard et al38 have implemented these efficiency techniques for conditional Poisson regression and Cox's regression, respectively. Our BAR implementation naturally exploits the sparsity in the design matrix and the partial likelihood by embedding an adaptive version of Mittal et al's12 massive Cox's ridge regression within each iteration of the iteratively reweighted Cox's ridge regression.
此外,需要通过风险集 R={j:X>X} 为每个受试者 i 引入一系列累积总和来计算梯度和黑森对角线。在优化例程中更新每个参数估计值时,需要计算这些累积总和。这可能被证明是计算成本高昂的,尤其是当 n 和 p n 都很大时。通过利用设计矩阵的稀疏性,只有当风险集中至少有一个观测值沿维度 j 具有非零协变量值并在第一个非零条目而不是从头开始进行扫描时,才可以进入此操作,从而减少计算这些累积总和所需的计算时间。Mittal 等人 12 和 Suchard 等人 38 分别将这些效率技术用于条件泊松回归和 Cox 回归。我们的 BAR 实现通过在迭代重新加权的 Cox 脊回归的每次迭代中嵌入 Mittal 等人的 12 个大规模 Cox's 岭回归的自适应版本,自然而然地利用了设计矩阵中的稀疏性和部分似然性。

3 SIMULATIONS 3 模拟

This section presents three simulation studies. First, we demonstrate in Section 3.1 that, for fixed λn, the BAR estimator is insensitive to the tuning parameter ξn of its initial ridge estimator and does well in terms of performing variable selection and correcting possible bias of the initial ridge estimator. Then, in Section 3.2, we evaluate and compare the operating characteristics of BAR with some popular penalized Cox regression methods, where we only consider settings with moderate sample sizes due to the fact that most of the competing methods are inoperable for massive sample size data. Finally, in Section 3.3, we use a sHDMSS setting to illustrate the performance of BAR over its closest competitor.
本节介绍三个仿真研究。首先,我们在第 3.1 节中证明,对于固定 λ n ,BAR 估计器对其初始岭估计器的调谐参数 ξ n 不敏感,并且在执行变量选择和校正初始岭估计器的可能偏差方面表现良好。然后,在第 3.2 节中,我们评估并比较了 BAR 的操作特性与一些流行的惩罚 Cox 回归方法,其中我们只考虑具有中等样本量的设置,因为大多数竞争方法都不适用于大量样本量数据。最后,在第 3.3 节中,我们使用 sHDMSS 设置来说明 BAR 相对于其最接近的竞争对手的性能。

Sections 3.1 and 3.2 employ the same simulation structure. Event times are drawn from an exponential proportional hazards model with baseline hazard h0(t)=1, β0=(0.40,0,0.45,0,0.50,0.55,0,0,0.70,0.80,0pn10), representing qn=6 small to moderate effect sizes; the design matrix Z=(z1,,zn) is generated from a pn-dimensional normal distribution with mean zero and covariance matrix ∑=(σij) with an autoregressive structure such that σij=0.5|ij| and independent censoring times are generated from uniform distribution U(0,umax), where umax is chosen to achieve different percentages of censoring. We describe how we simulate sHDMSS time-to-event data in Section 3.3.
第 3.1 节和第 3.2 节采用相同的模拟结构。事件时间是从基线风险 h 0 (t)=1 的指数比例风险模型中得出的, β0=(0.40,0,0.45,0,0.50,0.55,0,0,0.70,0.80,0pn10) 表示 q n =6 个小到中等效应量;设计矩阵 Z=(z1,,zn) 由均值为零的 p n 维正态分布和协方差矩阵 ∑=(σ ij ) 生成,具有自回归结构,使得 σ ij =0.5 |ij| 和独立删失时间由均匀分布 U(0,u max ) 生成,其中选择 U max 以实现不同的删失百分比。我们将在第 3.3 节中描述如何模拟 sHDMSS 事件发生时间数据。

3.1 Broken adaptive ridge estimator for varying values of ξn
3.1 不同ξ n 值的自适应岭估计器损坏

We illustrate how the BAR estimator behaves by fixing λn and varying the tuning parameter ξn of the initial Cox ridge regression in the following. Figures 1B to 1D depict the solution path plots average over 100 Monte Carlo simulations of the BAR estimator with respect to ξn over a wide interval [10−2, 102] for n=300, pn=100, ≈25% censoring, and λn=log(pn),0.5log(pn),0.75log(pn), respectively. It is seen that the resulting BAR estimator is essentially unchanged, regardless of the choice of λn, over a large interval of ξn, suggesting that the BAR estimator is relatively insensitive to original ridge estimator.
我们通过固定 λ n 并改变初始 Cox 岭回归的调谐参数 ξ n 来说明 BAR 估计器的行为,如下所示。图 1B 至 1D 分别描绘了 n=300、p n =100、≈25% 删失和 λn=log(pn),0.5log(pn),0.75log(pn) ,在宽区间 [10 −2 , 10 2 ] 内,BAR 估计器相对于 ξ n 的 100 次蒙特卡罗模拟的平均解路径图。可以看出,无论选择λ n ,得到的BAR估计器在ξ的大区间内基本没有变化 n ,这表明BAR估计器对原始岭估计器相对不敏感。

As a reference, we also display the solution path plots of the corresponding initial ridge estimator in panel (a). The initial ridge estimator starts to introduce over shrinkage and, consequently, estimation bias when ξn exceeds 10. However, its bias has been effectively corrected by BAR. Therefore, by iteratively refitting reweighted Cox ridge regression, the BAR estimator not only performs variable selection by shrinking estimates of the true zero parameters to zero, but also effectively corrects the estimation bias from the initial Cox ridge estimator. Similar results are obtained for several different simulation scenarios and can be found in Section S4 of the Supplementary Material.
作为参考,我们还在图(a)中显示了相应初始岭估计器的求解路径图。当 ξ n 超过 10 时,初始脊估计器开始引入过度收缩,从而引入估计偏差。然而,BAR已经有效地纠正了它的偏差。因此,通过迭代重新拟合重新加权的 Cox 岭回归,BAR 估计器不仅通过将真零参数的估计值缩小到零来执行变量选择,而且还有效地校正了初始 Cox 岭估计器的估计偏差。在几种不同的仿真场景中也获得了类似的结果,可以在补充材料的 S4 部分中找到。

3.2 Model selection and parameter estimation
3.2 模型选择与参数估计

In this simulation, we evaluate and compare the variable selection and parameter estimation performance of BAR with four popular penalized Cox regression methods, ie, LASSO,3 SCAD,4 adaptive LASSO (ALASSO),5 and MCP.6 We fix ξn=1 for the BAR methods since Section 3.1 yields evidence that the BAR estimator is insensitive to the selection of ξn. For all methods, a 25-value grid was used to find the optimal value of the tuning parameter via BIC minimization.32
在本次仿真中,我们评估并比较了BAR与四种流行的惩罚Cox回归方法(即LASSO、3 SCAD、4自适应LASSO(ALASSO)、5和MCP)的变量选择和参数估计性能。6 我们为 BAR 方法固定 ξ n =1,因为第 3.1 节提供了 BAR 估计器对 ξ 的选择不敏感 n 的证据。对于所有方法,都使用 25 个值的网格通过 BIC 最小化找到调谐参数的最佳值。32

Estimation bias is summarized through the mean squared bias, E(β^β02). Variable selection performance is measured by a number of indices, ie, the mean number of false positives (FP), the mean number of false negatives (FN), and average similarity measure for support recovery where SM=S^S00/S^|0·S00 and S0 and S^ are the set of indices for the nonzero components of β0 and β^, respectively.43 The similarity measure can be viewed as a continuous measure for true model recovery, ie, it is close to 1 when the estimated model is similar to the true model and close to 0 when the estimated model is highly dissimilar to the true model. We use the R package ncvreg to perform LASSO, ALASSO, SCAD, and MCP penalizations in our simulations. For ALASSO, we let the initial weight be the maximum partial likelihood estimator since pn<n. Partial simulation results are summarized in Table 1 where we fix n=300,1000, pn=100, a censoring rate of ≈25%, and average results over 100 replications.
估计偏差通过均方偏差 E(β^β02) .变量选择性能由许多指数来衡量,即平均误报数 (FP)、平均假阴性数 (FN) 和支持恢复的平均相似性度量,其中 SM=S^S00/S^|0·S00S0 分别 S^ 是 β 0β^ 的非零分量的索引集。43 相似性度量可以看作是真实模型恢复的连续度量,即当估计模型与真实模型相似时,相似性度量接近 1,当估计模型与真实模型高度不同时,相似度度量接近 0。我们在模拟中使用 R 包 ncvreg 来执行 LASSO、ALASSO、SCOD 和 MCP 惩罚。对于 ALASSO,我们让初始权重是自 p n =100,删失率为 ≈25%,以及超过 100 次重复的平均结果。

Table 1. (Moderate dimension and sample size) Simulated estimation and variable selection performance of broken adaptive ridge (BAR) Bayesian information criterion (BIC), LASSO (BIC), SCAD (BIC), adaptive lasso (ALASSO) (BIC), and MCP (BIC) where BIC in parenthesis indicates that the BIC minimization was used to select the tuning parameters via a grid search. (MSB = mean squared bias; FN = mean number of false positives; FP = mean number of false negatives; SM = average similarity measure; BIC = average BIC score; Each entry is based on 100 Monte Carlo samples of size n=300,1000, pn=100, censoring rate = 25%)
表 1.(中等维度和样本量)破损自适应脊 (BAR) 贝叶斯信息准则 (BIC)、LASSO (BIC)、SCAD (BIC)、自适应套索 (ALASSO) (BIC) 和 MCP (BIC) 的模拟估计和变量选择性能,其中括号中的 BIC 表示 BIC 最小化用于通过网格搜索选择调谐参数。(MSB = 均方偏差;FN = 平均误报数;FP = 假阴性的平均数;SM = 平均相似度度量;BIC = 平均 BIC 分数;每个条目基于 100 个蒙特卡罗样本,大小为 n=300,1000,p n =100,删失率 = 25%)
MSB FN FP SM BIC
BAR ( λn=0.5log(pn)) 酒吧 ( λn=0.5log(pn) 0.06 0.02 0.23 0.98 1930.97
BAR ( λn=log(pn)) 酒吧 ( λn=log(pn) 0.10 0.17 0.02 0.98 1938.43
BAR (BIC) 酒吧 (BIC) 0.11 0.01 1.79 0.89 1919.26
n=300 LASSO (BIC) 拉索 (BIC) 0.27 0.01 3.32 0.82 1958.40
SCAD (BIC) SCAD (BIC) (英语) 0.12 0.01 2.23 0.87 1933.43
ALASSO (BIC) 阿拉索 (BIC) 0.11 0.04 1.48 0.90 1935.60
MCP (BIC) MCP (BIC) (英语) 0.09 0.02 1.21 0.92 1929.33
BAR ( λn=0.5log(pn)) 酒吧 ( λn=0.5log(pn) 0.01 0.00 0.19 0.99 8200.97
BAR ( λn=log(pn)) 酒吧 ( λn=log(pn) 0.01 0.00 0.00 1.00 8203.52
BAR (BIC) 酒吧 (BIC) 0.02 0.00 0.73 0.95 8196.51
n=1000 LASSO (BIC) 拉索 (BIC) 0.10 0.00 2.77 0.84 8236.76
SCAD (BIC) SCAD (BIC) (英语) 0.01 0.00 0.23 0.98 8203.00
ALASSO (BIC) 阿拉索 (BIC) 0.02 0.00 0.26 0.98 8204.58
MCP (BIC) MCP (BIC) (英语) 0.01 0.00 0.08 0.99 8202.04

From Table 1, we have that, when the tuning parameter λn is selected by minimizing the BIC score as the other methods, the performance of BAR (BIC) is generally comparable to other methods with respect to all measures across both scenarios. We have conducted more extensive simulations with different combinations of model dimension, censoring rates, sample sizes, and model sparsity, which yielded consistent findings and are reported in Section S5 of the Supplementary Material.
从表 1 中可以看出,当通过最小化 BIC 分数来选择调优参数 λ n 时,BAR (BIC) 在两种情况下的所有测量方面通常与其他方法相当。我们使用模型维度、删失率、样本量和模型稀疏度的不同组合进行了更广泛的模拟,得出了一致的结果,并在补充材料的 S5 部分中进行了报告。

Since BAR aims to approximate 0-penalized regression, it directly provides a surrogate optima to some popular information criteria with some prefixed λn. For example, performing BAR with λn=clog(pn) for some c>0 leads to a surrogate optima for the directly optimizing the extended BIC.44-46 For thoroughness, in addition to using a 25-value grid for c, we also include simulation results in Table 1 for BAR with some prefixed values λn=0.5log(pn) and λn=log(pn). Not surprisingly, BAR with these prefixed values produced sometimes slightly suboptimal, but generally comparable estimation and selection performance. We also conducted further simulations using a 10-value coarse grid for λn. The results are presented in Tables S1 to S3 of the Supplementary Material, which showed that the 10-value grid worked as well as the 25-value grid across almost all of our simulation scenarios. This suggests that potential computational savings could be gained for BAR by using either prefixed or a coarse grid of values for λn for massive data, which is also illustrated in Section 4 (Table 3).
由于 BAR 旨在近似 l 0 -惩罚回归,因此它直接为一些带有前缀 λ 的流行信息标准提供代理最优 n 。例如,对 λn=clog(pn) 某些 c>0 执行 BAR 会导致直接优化扩展 BIC 的代理最优。44- 46 为了彻底起见,除了对 c 使用 25 个值的网格外,我们还在表 1 中包括 BAR 的仿真结果,其中包含一些前缀值 λn=0.5log(pn)λn=log(pn) 。毫不奇怪,具有这些前缀值的BAR有时产生的估计和选择性能有时略逊一筹,但通常具有可比性。我们还使用 λ 的 10 值粗网格进行了进一步的模拟 n 。结果显示在补充材料的表 S1 到 S3 中,该表表明,在几乎所有的仿真场景中,10 值网格和 25 值网格都有效。这表明, n 通过对海量数据使用前缀或粗略的 λ 值网格,可以为 BAR 节省潜在的计算成本,这也在第 4 节(表 3)中进行了说明。

3.3 Sparse high-dimensional massive sample size data
3.3 稀疏高维海量样本量数据

In this simulation, we simulate a sHDMSS time-to-event data set with n=200,000, pn=20,000, and qn=80. Event times are generated from an exponential hazards model with baseline hazard h0(t)=1, regression coefficients β0=(0.710,0.510,0.810,110,0.710,0.510,0.810,110,0pn80), and a censoring rate of 95%. The covariates for each subject are simulated such, on average, 2% are assigned a nonzero value. The amount of memory used to store this dense design matrix would require over 16 GB, which exceeds the functional capacity of most statistical software packages on standard hardware. To overcome this difficulty, we efficiently store the information in a coordinate list fashion and compare our massive Cox's regression for BAR (mBAR) with the massive sparse Cox's regression for LASSO (mCox-LASSO) using the Cyclops package,12, 38 which, to the best of our knowledge, is the fastest software available today that exploits the sparsity of sHDMSS time-to-event data for efficient computing and offers > 10-fold speedup12 over its competitors such as CoxNet7 and FastCox.47 For LASSO, CV (mCox-LASSO (CV)), combined with a nonconvex optimization technique which is more efficient than the classical grid search approach, and BIC score minimization (mCox-LASSO (BIC)), implemented with the classical grid search approach, were used to find the optimal value for the tuning parameter. For the mBAR method, we implement BIC score minimization using a grid search and two prefixed tuning parameters λn=0.5log(pn) and log(pn) for comparative purposes. We report the bias ( β^β02), number of FP, FN, and BIC score ( 2ln(β^)+log(n)jI(β^j0)) in Table 2.
在此仿真中,我们模拟了 n=200,000、p n =20,000 和 q n =80 的 sHDMSS 事件发生时间数据集。事件时间由指数风险模型生成,基线风险 h 0 (t)=1,回归系数 β0=(0.710,0.510,0.810,110,0.710,0.510,0.810,110,0pn80) ,删失率为 95%。模拟每个主题的协变量,平均分配 2% 的非零值。用于存储此密集设计矩阵的内存量将需要超过 16 GB,这超出了标准硬件上大多数统计软件包的功能容量。为了克服这一困难,我们以坐标列表的方式有效地存储信息,并使用 Cyclops 软件包 12, 38 将 BAR 的大量 Cox 回归 (mBAR) 与 LASSO 的大量稀疏 Cox 回归 (mCox-LASSO) 进行比较,据我们所知,它是当今最快的软件,它利用 sHDMSS 事件时间数据的稀疏性进行高效计算,并提供> 10 倍的加速 12 比其竞争对手(如 CoxNet 7) 快 7 倍和 FastCox。47 对于LASSO来说,使用CV(mCox-LASSO(CV))结合比经典网格搜索方法更有效的非凸优化技术,以及使用经典网格搜索方法实现的BIC分数最小化(mCox-LASSO(BIC))来找到调谐参数的最优值。对于mBAR方法,我们使用网格搜索和两个前缀调优参数 λn=0.5log(pn) 来实现BIC分数最小化,并 log(pn) 用于比较目的。我们在表 2 中报告了偏差 ( β^β02 )、FP、FN 和 BIC 评分 ( 2ln(β^)+log(n)jI(β^j0) ) 的数量。

Table 2. (Sparse high-dimensional and massive sample size) Estimation and variable selection results for massive Cox regression with broken adaptive ridge (mBAR) and LASSO penalty (mCox-LASSO12) for a simulated sHDMSS data set with n=200 000, pn=20 000, and qn=80. (Bias = β^β02; FP= number of false positives; FN = number of false negatives)
表 2.(稀疏高维和海量样本量)对于n=200 000、p n =20 000和q n =80的模拟sHDMSS数据集,具有破碎自适应脊(mBAR)和LASSO惩罚(mCox-LASSO 12)的大规模Cox回归的估计和变量选择结果。(偏差 = β^β02 ;FP=误报数;FN = 假阴性数)
Method 方法 Bias 偏见 FP FN BIC score BIC 评分
mBAR ( λn=0.5log(pn)) mBAR ( λn=0.5log(pn) 1.19 0 3 83 313.02
mBAR ( λn=log(pn)) mBAR ( λn=log(pn) 2.02 0 10 83 573.96
mBAR (BIC) mBAR (BIC) 0.97 5 0 83 266.47
mCox-LASSO (BIC) mCox-LASSO (BIC) 2.93 12 0 84 479.47
mCox-LASSO (CV) mCox-LASSO (CV) 2.12 963 0 93 770.58
  • Abbreviations: BIC, Bayesian information criterion; CV, cross-validation.
    缩写:BIC,贝叶斯信息准则;简历,交叉验证。

We observe that both mCox-LASSO methods have retained all 80 true nonzero coefficients together with a moderate to large number of noise variables (12 for BIC and 967 for CV). In contrast, mBAR (BIC) chooses a sparser model selecting all 80 nonzero coefficients and 5 noise variables. As expected, mBAR (BIC) is less biased (0.82) than mCox-LASSO (2.49 for BIC and 2.02 for CV) and has a much lower BIC score when compared to both mCox-LASSO methods. We also notice that mBAR with the two prefixed λn tends to underestimate the true model, ie, fixing λn=log(pn) results in estimating a model that is too sparse, whereas λn=0.5log(pn) produces a model that is closer to the oracle model.
我们观察到,两种 mCox-LASSO方法都保留了所有 80 个真正的非零系数以及中等到大量的噪声变量(BIC 为 12 个,CV 为 967 个)。相比之下,mBAR (BIC) 选择稀疏模型,选择所有 80 个非零系数和 5 个噪声变量。正如预期的那样,mBAR (BIC) 的偏差 (0.82) 低于 mCox-LASSO(BIC 为 2.49,CV 为 2.02),并且与两种 mCox-LASSO 方法相比,BIC 分数要低得多。我们还注意到,带有两个前缀 λ 的 mBAR n 往往会低估真实模型,即修复 λn=log(pn) 导致估计模型过于稀疏,而 λn=0.5log(pn) 生成更接近预言机模型的模型。

We further examined the solution paths of mCox-LASSO and mBAR in Figure 2. The vertical solid and dashed lines in the mCox-LASSO solution path plot (Figure 2A) represent the estimates at the optimal tuning parameter obtained via CV and BIC minimization, respectively. We can see that the mCox-LASSO solution path changes rapidly as its tuning parameter varies and shows severe bias. In contrast, the mBAR solution path plot (Figure 2B) with respect to λn changes very slowly where the vertical line represents the estimates at the optimal tuning parameter selected by BIC minimization and selects a model with estimates that are less biased than mCox-LASSO (see Table 2). Furthermore, the optimal value of λn that minimizes the BIC score for mBAR roughly corresponds to 0.3log(pn). Since our empirical results suggest that the optimal value for λn generally lies within some constant of log(pn), we recommend that a coarse grid search within clog(pn) where c∈(0,1] can be used. This is further corroborated by additional simulations in the Supplementary Material (Tables S1 to S3).
我们进一步检查了图2中mCox-LASSO和mBAR的求解路径。mCox-LASSO解决方案路径图(图2A)中的垂直实线和虚线分别表示通过CV和BIC最小化获得的最佳调谐参数的估计值。我们可以看到,mCox-LASSO求解路径随着其调谐参数的变化而迅速变化,并显示出严重的偏差。相比之下,mBAR解路径图(图2B)相对于λ n 的变化非常缓慢,其中垂直线表示通过BIC最小化选择的最佳调谐参数的估计值,并选择估计值比mCox-LASSO偏差小的模型(见表2)。此外,最小化 mBAR 的 BIC 分数的 λ n 的最优值大致对应于 0.3log(pn) 。由于我们的实证结果表明,λ 的最优值 n 通常位于 的 log(pn) 某个常数内,我们建议在 的范围内进行粗略网格搜索 clog(pn) ,其中 c∈(0,1)。补充材料中的其他模拟进一步证实了这一点(表S1至S3)。

Details are in the caption following the image
Path plots for massive sparse Cox's regression for LASSO (mCox-LASSO) and massive Cox's regression for broken adaptive ridge (mBAR) regression. A, Path plot for mCox-LASSO regression, where the black solid and dashed lines represents the estimates when BIC minimization and cross-validation where used to find the optimal value of the tuning parameter, respectively; B, Path plot for mBAR regression with ξn=log(pn) and varying λn, where the black solid, dashed, and dotted lines represent estimates where λn was found using Bayesian information criterion minimization, fixed at log(pn) and 0.5log(pn), respectively; C, Path plot for mBAR regression with λn=log(pn) and varying ξn, where the black solid line represent the estimates for mBAR when ξn=log(pn)
LASSO 的大规模稀疏 Cox 回归 (mCox-LASSO) 和破碎自适应脊的大量 Cox 回归 (mBAR) 回归的路径图。A,mCox-LASSO回归的路径图,其中黑色实线和虚线分别表示BIC最小化和交叉验证时的估计值,分别用于查找调谐参数的最优值;B,具有 ξn=log(pn) 和变化 λ 的 mBAR 回归的路径图 n ,其中黑色实线、虚线和虚线表示使用贝叶斯信息准则最小化找到 λ n 的估计值,分别固定在 log(pn)0.5log(pn) ;C,具有 λn=log(pn) 和变化 ξ 的 mBAR 回归的路径图 n ,其中黑色实线表示 mBAR 的估计值 ξn=log(pn)

For the mBAR method, we also made a solution path plot with respect to ξn, while fixing λn=log(pn) in Figure 2C. It shows that the mBAR estimates are very stable over a large range of ξn, affirming our observation in Section 3.1 with small scale data that mBAR is generally insensitive ξn.
对于mBAR方法,我们还绘制了关于ξ的求解路径图 n ,同时 λn=log(pn) 固定在图2C中。它表明mBAR估计值在ξ的大范围内非常稳定 n ,证实了我们在第3.1节中用小尺度数据观察到的mBAR通常是不敏感的ξ n

4 PEDIATRIC TRAUMA MORTALITY
4 小儿创伤死亡率

For an application of mBAR regression in the sHDMSS setting, we consider a subset of the NTDB, a trauma database maintained by the American College of Surgeons.1 This data set was previously analyzed by Mittal et al12 as an example for efficient massive Cox regression with mCox-LASSO and ridge regression to sHDMSS data. The data set includes 210 555 patient records of injured children under 15 that were collected over 5 years (2006 to 2010). Each patient record includes 125 952 binary covariates, which indicate the presence or absence of an attribute (ICD9 Codes, AIS codes, etc) as well as the two-way interactions. The outcome of interest is mortality after time of injury. The data is extremely sparse, with less than 1% of the covariates being nonzero and has a censoring rate of 98%. We randomly split the data into training and test sets of 168 000 and 42 555, respectively. The mortality rate of both sets were approximately equal to the combined rate. Similar to Section 3.3, we were unable to load the training set (n=168 000, pn=125 000) into other popular oracle procedures due to the memory requirements needed to support a dense design matrix of that size and compare mBAR to mCox-LASSO. The BIC-score minimization over a penalization path of 10 tuning parameters was used to select the final model for both mBAR (fixing ξn=log(pn)) and mCox-LASSO. In addition, we perform mCox-LASSO using CV and mBAR with fixed tuning parameters λn=0.5log(pn) and log(pn). The BIC score based on the training data is used to compare selection performance between models and discriminatory performance is measured using Harrell's c-statistic48, 49 based on the test data.
对于mBAR回归在sHDMSS环境中的应用,我们考虑NTDB的一个子集,NTDB是由美国外科医师学会维护的创伤数据库。1 Mittal 等人 12 之前分析了该数据集,作为使用 mCox-LASSO 进行有效大规模 Cox 回归和 sHDMSS 数据的岭回归的示例。该数据集包括210 555份15岁以下受伤儿童的病历,这些病历是在5年内(2006年至2010年)收集的。每个患者记录包括 125 952 个二元协变量,这些协变量表示属性(ICD9 代码、AIS 代码等)的存在与否以及双向交互。感兴趣的结果是受伤后的死亡。数据非常稀疏,只有不到 1% 的协变量为非零,删失率为 98%。我们将数据随机分为训练集和测试集,分别为168 000和42 555。两组死亡率大致等于合并死亡率。与第 3.3 节类似,我们无法将训练集 (n=168 000, p n =125 000) 加载到其他流行的预言机程序中,因为支持该大小的密集设计矩阵并将 mBAR 与 mCox-LASSO 进行比较所需的内存要求。在10个调谐参数的惩罚路径上,BIC分数最小化用于选择mBAR(固定 ξn=log(pn) )和mCox-LASSO的最终模型。此外,我们使用 CV 和 mBAR 执行 mCox-LASSO,并具有固定的调谐参数 λn=0.5log(pn)log(pn) .基于训练数据的BIC分数用于比较模型之间的选择性能,并使用基于测试数据的Harrell's c统计量48,49来测量判别性能。

Table 3 summarizes the findings for our example, which reflect what we observe in Section 3.3. Massive Cox's regression for BAR, using BIC minimization, selects fewer covariates than both mCox-LASSO methods. Both model selection and discriminatory performance are similar to slightly superior for mBAR (BIC) over both mCox-LASSO methods. Again, mBAR with prefixed λn selects far fewer covariates than mBAR (BIC); however, the overall high c-index for both methods suggest that the strong predictors for pediatric trauma are still retained in the model. In terms of runtime, mBAR (BIC) is more time consuming than LASSO (BIC) as expected, but BAR with a prefixed tuning parameter value can help to reduce the runtime with a comparable prediction performance.
表 3 总结了我们示例的结果,这些结果反映了我们在第 3.3 节中观察到的内容。使用 BIC 最小化的 BAR 的 Massive Cox 回归选择的协变量比两种 mCox-LASSO方法都少。mBAR (BIC) 的模型选择和判别性能都略优于 mCox-LASSO 方法。同样,前缀为 λ 的 mBAR n 选择的协变量比 mBAR (BIC) 少得多;然而,两种方法的总体高 C 指数表明,模型中仍保留了儿科创伤的强预测因子。在运行时方面,mBAR (BIC) 比 LASSO (BIC) 更耗时,但具有前缀调优参数值的 BAR 可以帮助减少运行时间,并具有相当的预测性能。

Table 3. (Pediatric National Trauma Data Bank (NTDB) data) Comparison of mCox-LASSO and massive Cox's regression for broken adaptive ridge (mBAR) regression for the pediatric NTDB data. (mCox-LASSO cross-validation (CV) and mCox-LASSO Bayesian information criterion (BIC) correspond to mCox-LASSO using cross validation and BIC selection criterion, respectively. mBAR (BIC) denotes mBAR using the BIC selection criterion while fixing ξn=log(pn). The training set has a sample size of 168 000, while the test set used for the c-index has a sample size of 45 555)
表 3.(儿科国家创伤数据库 (NTDB) 数据)小儿 NTDB 数据的 mCox-LASSO和大规模 Cox 回归的破坏性适应性脊 (mBAR) 回归的比较。(mCox-LASSO 交叉验证 (CV) 和 mCox-LASSO 贝叶斯信息准则 (BIC) 分别对应于使用交叉验证和 BIC 选择准则的 mCox-LASSO。 mBAR (BIC) 表示在固定时使用 BIC 选择准则的 ξn=log(pn) mBAR。训练集的样本量为 168 000,而用于 c 指数的测试集的样本量为 45 555)
Method 方法 # Selected # 已选 BIC score BIC 评分 c-index c指数 Runtime (hours) 运行时间(小时)
mBAR ( λn=0.5log(pn)) mBAR ( λn=0.5log(pn) 45 51 613.52 0.91 8
mBAR ( λn=log(pn)) mBAR ( λn=log(pn) 21 52 182.90 0.89 8
mBAR (BIC) mBAR (BIC) 83 51 269.43 0.93 97
mCox-LASSO (BIC) mCox-LASSO (BIC) 100 52 544.90 0.91 25
mCox-LASSO (CV) mCox-LASSO (CV) 253 53 165.44 0.92 41

5 DISCUSSION 5 讨论

We have extended the BAR methodology to Cox's model as a new sparse Cox regression method and rigorously established that it is selection consistent, oracle for parameter estimation, stable, and has a grouping property for highly correlated covariates. We illustrate through empirical studies that the BAR estimator has satisfactory performance for variable selection and parameter estimation. We have also extended the application of BAR to the sHDMSS domain by taking advantage of the fact that the BAR algorithm allows us to easily adapt existing high performance algorithms and software for massive 2-penalized Cox regression.12
我们已将 BAR 方法扩展到 Cox 模型,作为一种新的稀疏 Cox 回归方法,并严格确定它是选择一致的、用于参数估计的预言机、稳定的,并且具有高度相关的协变量的分组属性。通过实证研究证明,BAR估计器在变量选择和参数估计方面具有令人满意的性能。我们还利用BAR算法允许我们轻松调整现有的高性能算法和软件,以实现大规模l 2 惩罚Cox回归这一事实,将BAR的应用扩展到sHDMSS域。12

Our surrogate 0-based BAR method and theory can be easily extended to a surrogate d-based BAR method for any d∈[0,1], by replacing (β^j(k1))2 with |β^j(k1)|2d in 4. We have observed empirically that, as d increases toward 1, the resulting estimator becomes less sparse, and the average number of FP as well as estimation bias tend to increase, especially for larger pn, while the average number of FN tends to decrease. In practice, d can be used as a resolution tuning parameter.
我们基于代理 l 0 的 BAR 方法和理论可以很容易地扩展到任何 d∈[0,1] 的代理 l 的 BAR 方法,只需将 |β^j(k1)|2d in 4 替换 (β^j(k1))2 为代理 l d 。我们根据经验观察到,当 d 向 1 增加时,得到的估计器变得不那么稀疏,FP 的平均数和估计偏差趋于增加,尤其是对于较大的 p n ,而 FN 的平均数趋于减少。在实践中,d 可以用作分辨率调谐参数。

Our theoretical and empirical results have established the BAR method as a valid and viable tool for variable selection and parameter estimation under the pn<n setting although pn is allowed to diverge with n. Theoretical properties of the BAR estimator for the high-dimensional setting (pnn) remain to be investigated. Furthermore, as pointed out by a referee, although BAR is selection consistent and oracle, it is subject to the same postselection inference issues as other variable selection methods.50, 51 Lastly, although iteratively performing reweighted 2-penalizations allows BAR to enjoy the best of 0- and 2-penalized regressions and to readily adopt an existing efficient implementation of 2-penalization for sHDMSS data, its iterative nature does present another layer of computational complexity. While this added layer of computational complexity is not a practical concern for moderate size data, it can considerably increase the runtime in a large data setting when both n and p are large. As illustrated in our real data example, trying a prefixed tuning parameter value based on the extended BIC λn=clog(pn) can reduce the runtime of BAR with reasonably good performance. To further improve its computational efficiency, we are currently developing some modified BAR algorithms including a cyclic coordinatewise BAR algorithm, which will have comparable computational complexity and runtime to other popular variable methods such as LASSO. This line of further developments is beyond the scope of this paper and will be fully studied in a sequel paper.
我们的理论和实证结果表明,尽管p n 允许与n发散,但BAR方法在p n ≫n)的BAR估计器的理论性质仍有待研究。此外,正如裁判所指出的,尽管BAR是选择一致的,并且是预言机,但它与其他变量选择方法一样,会受到相同的选择后推断问题的影响。50, 51 最后,尽管迭代执行重加权 l 2 惩罚化允许 BAR 享受 l 0 和 l 2 惩罚回归的最佳效果,并随时采用现有的 l 2 惩罚化 sHDMSS 数据的有效实现,但其迭代性质确实带来了另一层计算复杂性。虽然对于中等大小的数据来说,这种增加的计算复杂性层并不是一个实际问题,但当 n 和 p 都很大时,它可以大大增加大型数据设置中的运行时间。如我们的真实数据示例所示,尝试基于扩展 BIC λn=clog(pn) 的前缀调优参数值可以减少 BAR 的运行时间,并具有相当好的性能。为了进一步提高其计算效率,我们目前正在开发一些改进的BAR算法,包括循环坐标BAR算法,该算法的计算复杂度和运行时间将与其他流行的变量方法(如LASSO)相当。这一系列的进一步发展超出了本文的范围,将在后续论文中进行全面研究。

ACKNOWLEDGEMENTS 确认

The authors are grateful to the referees for their insightful comments and suggestions that have greatly improved the paper. The authors are also grateful to Drs Randall Burd and Sushil Mittal for providing us access to the NTDB data. Gang Li's research was supported in part by National Institutes of Health grants P30 CA-16042, UL1TR000124-02, and P01AT003960.
作者感谢审稿人提出的有见地的意见和建议,大大改进了论文。作者还感谢Randall Burd博士和Sushil Mittal博士为我们提供了对NTDB数据的访问。李刚的研究得到了美国国立卫生研究院(National Institutes of Health)P30 CA-16042、UL1TR000124-02和P01AT003960的部分支持。

    CONFLICT OF INTEREST 利益冲突

    The authors declare no potential conflict of interest.
    作者声明没有潜在的利益冲突。

    DATA AVAILABILITY STATEMENT
    数据可用性声明

    The data that support the findings of this study are available from the American College of Surgeons.1 Restrictions apply to the availability of these data.
    支持这项研究结果的数据可从美国外科医师学会获得。1 这些数据的可用性受到限制。

      The full text of this article hosted at iucr.org is unavailable due to technical difficulties.