Abstract 抽象

Motivation 赋予动机

Identification of genomic, molecular and clinical markers prognostic of patient survival is important for developing personalized disease prevention, diagnostic and treatment approaches. Modern omics technologies have made it possible to investigate the prognostic impact of markers at multiple molecular levels, including genomics, epigenomics, transcriptomics, proteomics and metabolomics, and how these potential risk factors complement clinical characterization of patient outcomes for survival prognosis. However, the massive sizes of the omics datasets, along with their correlation structures, pose challenges for studying relationships between the molecular information and patients’ survival outcomes.
确定患者生存预后的基因组、分子和临床标志物对于开发个性化的疾病预防、诊断和治疗方法非常重要。现代组学技术使得在多个分子水平上研究标志物的预后影响成为可能,包括基因组学、表观基因组学、转录组学、蛋白质组学和代谢组学,以及这些潜在风险因素如何补充患者生存预后预后的临床特征。然而,组学数据集的庞大规模及其相关结构对研究分子信息与患者生存结果之间的关系构成了挑战。

Results 结果

We present a general workflow for survival analysis that is applicable to high-dimensional omics data as inputs when identifying survival-associated features and validating survival models. In particular, we focus on the commonly used Cox-type penalized regressions and hierarchical Bayesian models for feature selection in survival analysis, which are especially useful for high-dimensional data, but the framework is applicable more generally.
我们提出了一种生存分析的一般工作流程,适用于在识别生存相关特征和验证生存模型时作为输入的高维组学数据。特别是,我们关注了生存分析中常用的 Cox 型惩罚回归和分层贝叶斯模型进行特征选择,它们对高维数据特别有用,但该框架更普遍地适用。

Availability and implementation
可用性和实施

A step-by-step R tutorial using The Cancer Genome Atlas survival and omics data for the execution and evaluation of survival models has been made available at https://ocbe-uio.github.io/survomics.
https://ocbe-uio.github.io/survomics 上提供了使用癌症基因组图谱生存和组学数据执行和评估生存模型的分步 R 教程。

1 Introduction 1 引言

Personalized medicine improves patient diagnosis and treatment by making use of patient-specific genomic and molecular markers that are indicative of disease development. Time to an event of interest (e.g. time to death or disease progression) is a widely used end point and patient outcome for many diseases, and therefore it has become popular to identify genomic, molecular and clinical markers for survival or progression prediction of patients suffering from complex diseases such as cancer. In this tutorial, we will only consider so-called right-censored survival data, where a patient has been followed for a certain time period and the event of interest is either observed in this time period or might occur at a later (as yet unobserved) time point. Right-censored survival data include both a time and the status of each patient at that time as joint outcomes, where time is a continuous variable and status is a binary variable indicating whether the event of interest has been observed up to the given time or not; in the latter case, we refer to this observation as censored at the observed time. See Table 1 for an example illustration, and more details of the survival data in Section 2. When using the status label as an outcome in an ordinary logistic regression, the regression coefficients will become increasingly uncertain and less reliable with increasing follow-up time (Green and Symons 1983). When using the observed time (or its transformation, e.g. logarithm of time) as an outcome in an ordinary linear regression, the presence of censored observations (i.e. patients still alive by the end of follow-up period) causes considerable difficulties for assessing the accuracy of predictions (Henderson 1995). For example, Table 1 shows an example where the proportion of patients surviving past 10 years is 1/5=20% based on the observed data, but the (unobserved) factual proportion surviving past 10 years is 3/5=60%.
个性化医疗通过利用指示疾病发展的患者特异性基因组和分子标志物来改善患者的诊断和治疗。发生感兴趣事件的时间(例如死亡或疾病进展的时间)是许多疾病广泛使用的终点和患者结果,因此识别基因组、分子和临床标志物以预测患有癌症等复杂疾病的患者的生存或进展已变得很流行。在本教程中,我们将只考虑所谓的右删失生存数据,其中患者已经被跟踪了一段时间,并且感兴趣的事件要么在该时间段内观察到,要么可能发生在较晚的(尚未观察到的)时间点。右删失生存数据包括时间和每个患者在该时间的状态作为联合结果,其中时间是一个连续变量,状态是一个二进制变量,表明在给定时间之前是否观察到感兴趣的事件;在后一种情况下,我们将此观测称为在观测时间删失。有关示例说明,请参见表 1,并在第 2 节中有关生存数据的更多详细信息。当在普通 logistic 回归中使用状态标签作为结果时,随着随访时间的增加,回归系数将变得越来越不确定和不可靠 (Green 和 Symons 1983)。当使用观测时间(或其变换,例如 时间的对数)作为普通线性回归的结果时,删失观察的存在(即 患者在随访期结束时仍然活着)对评估预测的准确性造成相当大的困难(Henderson 1995)。 例如,表 1 显示了一个例子,根据观察到的数据,存活超过 10 年的患者比例为 1/5=20%,但(未观察到的)存活超过 10 年的事实比例为 3/5=60%。

Table 1. 表 1.

An example of right-censored time-to-event (survival) outcomes, illustrated in an example where the outcome of interest is the time from disease diagnosis to death.a
右删失事件时间(生存)结果的示例,在感兴趣的结果是从疾病诊断到死亡的时间的示例中进行了说明。一个

Patient ID 患者 IDObserved time (years) 观察时间(年)Status at observed time 观察时的状态Factual time to death, possibly unobserved (years)
实际死亡时间,可能未观察到(年)
100111Censored 审查20
10024Dead 4
10035Censored 审查12
10049Dead 9
10051Censored 审查11
Patient IDObserved time (years)Status at observed timeFactual time to death, possibly unobserved (years)
100111Censored20
10024Dead4
10035Censored12
10049Dead9
10051Censored11
a

If a patient was alive at the last observed time point, the measurement is censored at that time point.
如果患者在最后一个观察到的时间点还活着,则在该时间点删失测量值。

Table 1. 表 1.

An example of right-censored time-to-event (survival) outcomes, illustrated in an example where the outcome of interest is the time from disease diagnosis to death.a
右删失事件时间(生存)结果的示例,在感兴趣的结果是从疾病诊断到死亡的时间的示例中进行了说明。一个

Patient ID 患者 IDObserved time (years) 观察时间(年)Status at observed time 观察时的状态Factual time to death, possibly unobserved (years)
实际死亡时间,可能未观察到(年)
100111Censored 审查20
10024Dead 4
10035Censored 审查12
10049Dead 9
10051Censored 审查11
Patient IDObserved time (years)Status at observed timeFactual time to death, possibly unobserved (years)
100111Censored20
10024Dead4
10035Censored12
10049Dead9
10051Censored11
a

If a patient was alive at the last observed time point, the measurement is censored at that time point.
如果患者在最后一个观察到的时间点还活着,则在该时间点删失测量值。

The availability of multiple types of genomic and molecular data poses great opportunities but also further challenges for building effective statistical models to identify biomarkers that are prognostic of patient survival. For example, omics profiles, such as those from mRNA expression, DNA copy number, single-point and other genetic mutations, may be available from the same patient, and these high-dimensional data come with intra- and inter-dataset correlations, heterogeneous measurement scales, missing values, technical variability, and other background noise (Hasin et al. 2017). Rahnenführer et al. (2023) provided a general guideline for high-dimensional omics and electronic health records data analysis, and discussed statistical challenges and opportunities for survival modeling. Bøvelstad et al. (2007) and Bøvelstad et al. (2009) compared various machine learning methods and showed improved survival prediction performance by coefficient shrinkage methods that combine several data sources, in their case clinical and gene expression data. Another recent study by Bommert et al. (2022) performed a benchmark of filter methods for feature selection in high-dimensional gene expression data for survival prediction, and recommended using the simple variance filter before fitting a 2-regularized Cox regression model (Simon et al. 2011) for accurate survival prediction and stable feature selection. Vinga (2021) reviewed structured penalized regressions for analyzing high-dimensional omics data for survival prediction or evaluation. Multiple studies have shown that it is possible to further improve the prediction accuracy and feature selection by considering more complex structures, such as biological pathways, or by identifying significant features among functional relationships between the omics features (Chekouo et al. 2015, Kundu et al. 2018, Wang and Liu 2020, Madjar et al. 2021).
多种类型的基因组和分子数据的可用性带来了巨大的机遇,但也为构建有效的统计模型以识别预测患者生存的生物标志物带来了进一步的挑战。例如,组学图谱,例如来自 mRNA 表达、DNA 拷贝数、单点和其他基因突变的组学图谱,可能来自同一患者,并且这些高维数据带有数据集内和数据集间的相关性、异质性测量尺度、缺失值、技术可变性和其他背景噪声(Hasin 等人,2017 年)。Rahnenführer 等人(2023 年)为高维组学和电子健康记录数据分析提供了通用指南,并讨论了生存建模的统计挑战和机遇。Bøvelstad 等人 (2007)Bøvelstad 等人 (2009) 比较了各种机器学习方法,并表明结合多个数据源的系数收缩方法提高了生存预测性能,在他们的案例中是临床和基因表达数据。Bommert 等人 (2022) 最近的另一项研究对高维基因表达数据中的特征选择进行了过滤方法的基准测试,用于生存预测,并建议在拟合 l2 正则化 Cox 回归模型之前使用简单的方差过滤器 (Simon et al. 2011) 进行准确的生存预测和稳定的特征选择。Vinga (2021) 回顾了用于分析高维组学数据以进行生存预测或评估的结构化惩罚回归。 多项研究表明,通过考虑更复杂的结构(例如生物途径),或通过识别组学特征之间的功能关系中的显着特征,可以进一步提高预测准确性和特征选择(Chekouo 等人,2015 年,Kundu 等人,2018 年,Wang 和 Liu 2020 年,Madjar 等人,2021 年)。

In this tutorial, we describe a general workflow for survival analysis with omics data, as well as review the commonly used statistical methods for feature selection and survival prediction; importantly, we provide a step-by-step R tutorial using publicly available omics data from The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih.gov). In this example dataset, the overall survival time, demographic and gene expression data from primary invasive breast cancer patients in TCGA (TCGA-BRCA) were retrieved from the Genomic Data Commons Data Portal data release 32.0–36.0. Compared to the previous reviews and benchmarks of survival models in bioinformatic applications, this tutorial provides a complete workflow ranging from data preparation to final calibrated models, with a particular focus on building survival models using high-dimensional omics data, and as such covers both the commonly used penalized regressions and Bayesian models for survival analysis with high-dimensional and generally noisy datasets. For this tutorial, we assume that readers have the knowledge of basic statistical methods. Terms beyond basic statistics are explained in corresponding text.
在本教程中,我们描述了使用组学数据进行生存分析的一般工作流程,并回顾了用于特征选择和生存预测的常用统计方法;重要的是,我们使用癌症基因组图谱 (TCGA) 项目 (http://cancergenome.nih.gov) 中公开可用的组学数据提供了分步 R 教程。在此示例数据集中,从 Genomic Data Commons Data Portal 数据版本 32.0-36.0 中检索了 TCGA (TCGA-BRCA) 原发性浸润性乳腺癌患者的总生存时间、人口统计学和基因表达数据。与之前生物信息学应用中生存模型的综述和基准测试相比,本教程提供了从数据准备到最终校准模型的完整工作流程,特别侧重于使用高维组学数据构建生存模型,因此涵盖了常用的惩罚回归和贝叶斯模型,用于高维和通常噪声数据集的生存分析。在本教程中,我们假设读者具有基本统计方法的知识。基本统计以外的术语在相应的文本中进行了解释。

2 Data categories 2 数据类别

We use omics data and overall survival outcomes from cancer patients as an example in this tutorial, but the methods are applicable also to other diseases with similar data types in personalized medicine applications. The ultimate goal of personalized medicine is to identify patient-specific risk factors to guide disease prevention, diagnostic and treatment strategies. The identification of potential risk factors for cancer patients often considers clinical, demographic, genomic and molecular information, and their associations with the patients’ time-to-event data (i.e. survival).
在本教程中,我们使用癌症患者的组学数据和总生存期结果作为示例,但这些方法也适用于个性化医疗应用中具有类似数据类型的其他疾病。个性化医疗的最终目标是确定患者特异性的风险因素,以指导疾病预防、诊断和治疗策略。确定癌症患者的潜在风险因素通常考虑临床、人口统计学、基因组和分子信息,以及它们与患者事件发生时间数据(即生存率)的关联。

2.1 Time-to-event data 2.1 事件发生时间数据

Time-to-event or survival data contain the event of interest (e.g. death is the event assumed in this section), together with the time from the beginning of a study period either to the occurrence of the event, end of the study, or patient loss to follow-up (i.e. right censoring; discussions of right, left and interval censoring can be found in Leung et al. 1997). Figure 1c shows the survival or right-censored times since cancer diagnosis of patients. The exact survival time of a patient may be not observed due to censoring. Therefore, a patient has two outcome indices: censoring indicator δ (also called status) and observed time T˜=min{T,C}, where T is the exact survival time and C is the censoring time. Indicator δ can also be denoted as δ=1{TC}, where 1{·} is an indicator function. To characterize the survival time of a patient, we can use survival function S(t)={T>t}, which gives the probability of the patient’s survival beyond time t. Another useful quantity is the hazard function
T IME 到事件或生存数据包含感兴趣的事件(例如,死亡是本节中假设的事件),以及从研究期开始到事件发生、研究结束或患者失访的时间(即右删失;右、左和区间删失的讨论可以在 Leung 等人 1997 中找到)。图 1c 显示了患者自癌症诊断以来的生存时间或右删失时间。由于删失,可能无法观察到患者的确切生存时间。因此,患者有两个结果指数:删失指标 δ (也称为状态)和观察时间 T ̃=min{T,C},其中 T 是确切的生存时间,C 是删失时间。指示符 δ 也可以表示为 δ=1{TC},其中 1{·} 是一个指示函数。为了描述患者的生存时间,我们可以使用生存函数 St=P{T>t}它给出了患者在时间 t 之后生存的概率。另一个有用的量是 hazard function
which is the instantaneous probability of the patient’s death at time t conditional on the patient having survived up to that time point.
这是患者在时间 t 死亡的瞬时概率,条件是患者存活到该时间点。
Pan-cancer survival data and omics signatures. (a) Multiple omics layers [modified from Haukaas et al. (2017) and Jendoubi (2021)]. The network illustrates intra- and cross-layered biological features or molecules (e.g. DNA methylation, mRNA, proteins, metabolites). (b) Different origins of tumors represented by the TCGA pan-cancer patient cohorts. (c) Overall survival times of cancer patients from TCGA. Novel methods are needed to model the high-dimensional multi-omics data and leverage information from heterogeneous cancer cohorts for improved survival prognosis and biomarker identification.
Figure 1.

Pan-cancer survival data and omics signatures. (a) Multiple omics layers [modified from Haukaas et al. (2017) and Jendoubi (2021)]. The network illustrates intra- and cross-layered biological features or molecules (e.g. DNA methylation, mRNA, proteins, metabolites). (b) Different origins of tumors represented by the TCGA pan-cancer patient cohorts. (c) Overall survival times of cancer patients from TCGA. Novel methods are needed to model the high-dimensional multi-omics data and leverage information from heterogeneous cancer cohorts for improved survival prognosis and biomarker identification.


图 1.
泛癌生存数据和组学特征。(a) 多组学层 [修改自 Haukaas 等人 (2017) 和 Jendoubi (2021)]。该网络说明了层内和跨层的生物学特征或分子(例如 DNA 甲基化、mRNA、蛋白质、代谢物)。(b) TCGA 泛癌患者队列代表的肿瘤的不同来源。(c) TCGA 癌症患者的总生存时间。需要新的方法来模拟高维多组学数据,并利用来自异质性癌症队列的信息来改善生存预后和生物标志物鉴定。

2.2 Clinical and demographic data
2.2 临床和人口统计数据

There are multiple sources of patient-level information that can be explored to identify risk factors for cancer patients. One can start with routinely collected and commonly used patient data, such as clinical and demographic variables. For example, an older male patient with a low body mass index (BMI) has a relatively high risk of gastric cancer (Nam et al. 2022). Table 2 illustrates selected clinical and demographic variables often available for cancer patients. Clinical and demographic variables are considered important sources of information for predicting survival and are often used to build reference models for omics-based prognostic models (Herrmann et al. 2021).
可以探索多种患者层面的信息来源,以确定癌症患者的风险因素。可以从常规收集和常用的患者数据开始,例如临床和人口统计变量。例如,体重指数 (BMI) 低的老年男性患者患胃癌的风险相对较高(Nam 等人,2022 年)。表 2 说明了癌症患者通常可用的选定临床和人口统计变量。临床和人口统计变量被认为是预测生存的重要信息来源,通常用于为基于组学的预后模型构建参考模型(Herrmann 等人,2021 年)。

Table 2. 表 2.

Examples of clinical and demographic variables.
临床和人口统计变量的示例。

Variable 变量Data type 数据类型
Sex Binary 二元的
Body mass index (BMI, kg/m2)
体重指数 (BMI, kg/m2
Continuous 连续的
Ethnicity 种族Nominal 名义
Age at first diagnosis in years
首次诊断年龄
Integer 整数
Pathological stage 病理阶段Ordinal 序数
Therapy type (e.g. chemo-, hormone, immuno-therapy)
治疗类型(例如 化疗、激素、免疫疗法)
Nominal 名义
VariableData type
SexBinary
Body mass index (BMI, kg/m2)Continuous
EthnicityNominal
Age at first diagnosis in yearsInteger
Pathological stageOrdinal
Therapy type (e.g. chemo-, hormone, immuno-therapy)Nominal
Table 2. 表 2.

Examples of clinical and demographic variables.
临床和人口统计变量的示例。

Variable 变量Data type 数据类型
Sex Binary 二元的
Body mass index (BMI, kg/m2)
体重指数 (BMI, kg/m2
Continuous 连续的
Ethnicity 种族Nominal 名义
Age at first diagnosis in years
首次诊断年龄
Integer 整数
Pathological stage 病理阶段Ordinal 序数
Therapy type (e.g. chemo-, hormone, immuno-therapy)
治疗类型(例如 化疗、激素、免疫疗法)
Nominal 名义
VariableData type
SexBinary
Body mass index (BMI, kg/m2)Continuous
EthnicityNominal
Age at first diagnosis in yearsInteger
Pathological stageOrdinal
Therapy type (e.g. chemo-, hormone, immuno-therapy)Nominal

2.3 Omics data 2.3 组学数据

Thanks to the rapid development of modern molecular biotechnology, large amounts of human genomic and molecular data have become available from many patient profiling projects. These projects often collect multiple levels of molecular information such as genomics data for DNA variation, transcriptomics data for mRNA expression, proteomics data for protein abundance and metabolomics data for metabolite processes, as illustrated in Fig. 1a. Among the multiple omics levels, metabolomics is the closest to observable phenotypes, such as tumor growth and proliferation (Cairns et al. 2011). To deeply understand the molecular biology of tumor development, multiple levels of omics data may deliver novel insights into the circuits of molecular interactions that underlie the disease initiation and progression (Tarazona et al. 2021).
由于现代分子生物技术的快速发展,许多患者分析项目已经提供了大量的人类基因组和分子数据。这些项目通常收集多个层次的分子信息,例如 DNA 变异的基因组学数据、mRNA 表达的转录组学数据、蛋白质丰度的蛋白质组学数据和代谢物过程的代谢组学数据,如图 1a 所示 。在多个组学水平中,代谢组学是最接近可观察的表型的,例如肿瘤生长和增殖(Cairns 等人,2011 年)。为了深入了解肿瘤发展的分子生物学,多层次的组学数据可能会为构成疾病发生和进展基础的分子相互作用回路提供新的见解(Tarazona 等人,2021 年)。

DNA-level omics data often include single nucleotide polymorphisms (SNP), DNA methylation, and somatic copy number variation, see illustration of these data types in Table 3. Each SNP feature can be coded as {0,1,2}, according to the number of minor alleles at the given locus in each individual, i.e. AA, AG, GC, or TT. DNA methylation reveals methyl groups added to the DNA molecule, which can be quantified as a β-value β=MM+U+a[0,1], where M and U are the fluorescence intensities of the methylated and unmethylated DNA at a CpG locus, and the offset a is often set to 100 to stabilize the β-value when M and U are small. Somatic copy number variation measures the number of repeated sections of the tumor genome. Table 3 shows the assumed distributions for the downstream statistical modeling. For example, DNA methylation β-value is a proportion, often either close to 1 or 0, which can be characterized by a mixture of two beta distributions.
DNA 水平组学数据通常包括单核苷酸多态性 (SNP)、DNA 甲基化和体细胞拷贝数变异,参见表 3 中这些数据类型的插图。每个 SNP 特征都可以编码为 {0,1,2},具体取决于每个个体中给定基因座的次要等位基因数量,即 AA、AG、GC 或 TT。DNA 甲基化揭示了添加到 DNA 分子中的甲基,可以量化为 β -值 β =MM+U+a[0,1],其中 MU是甲基化和未甲基化 DNA 在 CpG 位点的荧光强度,当 MU 较小时,偏移量 a 通常设置为 100 以稳定 β -值。体细胞拷贝数变异测量肿瘤基因组重复切片的数量。表 3 显示了下游统计建模的假设分布。例如,DNA 甲基化 β 值是一个比例,通常接近 1 或 0,这可以通过两个 β 分布的混合来表征。

Table 3. 表 3.

Illustration of omics data and distribution assumptions for data generated by commonly used high-throughput technologies.
组学数据图示和常用高通量技术生成的数据的分布假设。

Patient ID 患者 IDSNP SNP (SNP)Methylation (β-value)
甲基化 ( β -值)
Copy number variation (number of copies)
拷贝数变化(拷贝数)
Gene expression (reads) 基因表达 (reads)miRNA (reads) miRNA(读取)Protein (intensity/concentration)Metabolite (intensity/concentration)
100110.20550.070.07
100200.115220.10.1
100310.9510009.69.6
100420.5430302.82.8
Technology 科技SNP array SNP 阵列Infinium BeadChipSNP array (Gistic2) SNP 阵列 (Gistic2)RNA-Seq RNA-测序RNA-Seq RNA-测序Mass spectrometryMass spectrometry
Distribution assumptions 分配假设Ordinal 序数Beta mixture β 混合物Negative binomial 负二项式Negative binomial 负二项式Negative binomial 负二项式log-Gaussianlog-Gaussian
log2 ratio: Gaussian
对数2 比值:高斯
log2 scale: Gaussian
对数2 尺度:高斯
log2 scale: Gaussian
对数2 尺度:高斯
log2 scale: Gaussianlog2 scale: Gaussian
Patient IDSNPMethylation (β-value)Copy number variation (number of copies)Gene expression (reads)miRNA (reads)Protein (intensity/concentration)Metabolite (intensity/concentration)
100110.20550.070.07
100200.115220.10.1
100310.9510009.69.6
100420.5430302.82.8
TechnologySNP arrayInfinium BeadChipSNP array (Gistic2)RNA-SeqRNA-SeqMass spectrometryMass spectrometry
Distribution assumptionsOrdinalBeta mixtureNegative binomialNegative binomialNegative binomiallog-Gaussianlog-Gaussian
log2 ratio: Gaussianlog2 scale: Gaussianlog2 scale: Gaussianlog2 scale: Gaussianlog2 scale: Gaussian
Table 3. 表 3.

Illustration of omics data and distribution assumptions for data generated by commonly used high-throughput technologies.
组学数据图示和常用高通量技术生成的数据的分布假设。

Patient ID 患者 IDSNP SNP (SNP)Methylation (β-value)
甲基化 ( β -值)
Copy number variation (number of copies)
拷贝数变化(拷贝数)
Gene expression (reads) 基因表达 (reads)miRNA (reads) miRNA(读取)Protein (intensity/concentration)
蛋白质(强度/浓度)
Metabolite (intensity/concentration)
100110.20550.070.07
100200.115220.10.1
100310.9510009.69.6
100420.5430302.82.8
Technology 科技SNP array SNP 阵列Infinium BeadChipSNP array (Gistic2) SNP 阵列 (Gistic2)RNA-Seq RNA-测序RNA-Seq RNA-测序Mass spectrometry 质谱法Mass spectrometry
Distribution assumptions 分配假设Ordinal 序数Beta mixture β 混合物Negative binomial 负二项式Negative binomial 负二项式Negative binomial 负二项式log-Gaussian 对数高斯log-Gaussian
log2 ratio: Gaussian
对数2 比值:高斯
log2 scale: Gaussian
对数2 尺度:高斯
log2 scale: Gaussian
对数2 尺度:高斯
log2 scale: Gaussian
对数2 尺度:高斯
log2 scale: Gaussian
Patient IDSNPMethylation (β-value)Copy number variation (number of copies)Gene expression (reads)miRNA (reads)Protein (intensity/concentration)Metabolite (intensity/concentration)
100110.20550.070.07
100200.115220.10.1
100310.9510009.69.6
100420.5430302.82.8
TechnologySNP arrayInfinium BeadChipSNP array (Gistic2)RNA-SeqRNA-SeqMass spectrometryMass spectrometry
Distribution assumptionsOrdinalBeta mixtureNegative binomialNegative binomialNegative binomiallog-Gaussianlog-Gaussian
log2 ratio: Gaussianlog2 scale: Gaussianlog2 scale: Gaussianlog2 scale: Gaussianlog2 scale: Gaussian

RNA-level omics data usually include messenger RNA (mRNA) expression and microRNA (miRNA) expression. Traditionally, DNA microarrays were used to measure the expression levels of DNA sequences called probes, which acted as a proxy for the amount of reads representing a genomic feature of interest. The reads of microarray expression level can be characterized by a negative binomial distribution directly, or a Gaussian distribution after log2 transformation (see Table 3). Nowadays, RNA sequencing (RNA-seq) has replaced DNA microarrays, since it allows for the sequencing of the whole transcriptome, while DNA microarrays only profile predefined transcripts or gene probes. miRNAs are small noncoding regulatory RNAs that play an important role in regulating gene expression and are highly evolutionary conserved in mammals (Bartel 2018).
RNA 水平组学数据通常包括信使 RNA (mRNA) 表达和 microRNA (miRNA) 表达。传统上,DNA 微阵列用于测量称为探针的 DNA 序列的表达水平,探针充当代表目标基因组特征的读数量的代理。微阵列表达水平的读数可以直接用负二项分布或 log2 转换后的高斯分布来表征(见表 3)。如今,RNA 测序 (RNA-seq) 已经取代了 DNA 微阵列,因为它允许对整个转录组进行测序,而 DNA 微阵列仅分析预定义的转录本或基因探针。miRNA 是小的非编码调节 RNA,在调节基因表达中起重要作用,并且在哺乳动物中高度进化保守 (Bartel 2018)。

Protein-level omics data usually originates from mass spectrometry (MS)-based proteomics profiling of a particular cell, tissue, organ, which can detect and measure abundance levels of entire or phosphorylated proteins. In contrast to global proteomics with MS, TCGA consortium has produced more targeted proteomics profiles, involving a set of 180–250 protein features using reverse-phase protein arrays (RPPA) (Akbani et al. 2014). In contrast, The Clinical Proteomic Tumor Analysis Consortium (CPTAC) profiling has produced >10 000 protein features using MS technology (Edwards et al. 2015). Protein expression data can often be considered approximately Gaussian distributed after log2 transformation, depending on the data-generating platform. In global proteomics, there are often sample-specific missing values due to detection limits of protein quantification.
蛋白质水平组学数据通常来源于基于质谱 (MS) 的特定细胞、组织、器官的蛋白质组学分析,可以检测和测量完整或磷酸化蛋白质的丰度水平。与使用 MS 的全球蛋白质组学相比,TCGA 联盟产生了更具针对性的蛋白质组学概况,涉及使用反相蛋白质阵列 (RPPA) 的一组 180-250 个蛋白质特征(Akbani 等人,2014 年)。相比之下,临床蛋白质组学肿瘤分析联盟 (CPTAC) 分析使用 MS 技术产生了 >10 000 个蛋白质特征(Edwards 等人,2015 年)。根据数据生成平台,在 log2 转换后,蛋白质表达数据通常可以被认为是近似高斯分布的。在全球蛋白质组学中,由于蛋白质定量的检测限,经常存在样品特异性缺失值。

Metabolite-level omics data has similar statistical properties to proteomics data, and metabolomics profiling is usually done with MS-based technologies, enabling the detection, and quantification of many thousands of metabolic features simultaneously. Nuclear magnetic resonance (NMR) spectroscopy is the other main analytical technology to profile metabolic processes. An important limitation of NMR spectroscopy is its relatively low sensitivity, which may lead to relatively few detected metabolites. Metabolic concentrations often follow logarithmic Gaussian distribution. Similar to large-scale proteomics, missing values due to detection limits should be treated differently than missing values due to measurement artefacts, which are more frequent in metabolite-level omics data (Sun and Xia 2023).
代谢物水平的组学数据具有与蛋白质组学数据相似的统计特性,代谢组学分析通常使用基于 MS 的技术完成,能够同时检测和定量数千个代谢特征。核磁共振 (NMR) 波谱是描述代谢过程的另一种主要分析技术。NMR 波谱的一个重要局限性是其灵敏度相对较低,这可能导致检测到的代谢物相对较少。代谢浓度通常服从对数高斯分布。与大规模蛋白质组学类似,应区别对待由于检测限导致的缺失值与由于测量伪影导致的缺失值,后者在代谢物水平的组学数据中更为常见(Sun 和 Xia 2023)。

Single-cell sequencing is becoming increasingly more prevalent in many profiling studies. The modern single-cell omics technologies can produce multiple levels of omics data derived from the same samples, such as transcriptomics and chromatin accessibility. The single-cell data types are similar to the illustration in Table 3, but each measurement originates from the level of individual cells. This article focuses mainly on survival analysis with single bulk omics data.
单细胞测序在许多分析研究中变得越来越普遍。现代单细胞组学技术可以产生来自相同样本的多层次组学数据,例如转录组学和染色质可及性。单单元数据类型与表 3 中的插图类似,但每个测量值都源自单个单元的级别。本文主要关注使用单个批量组学数据的生存分析。

2.4 Missing data 2.4 缺失数据

Missing values are often observed in many types of high-dimensional omics data due to various experimental reasons (Aittokallio 2009, Kong et al. 2022). For example, mRNA transcriptomics data from microarrays have 1%–10% missing values affecting up to 95% of genes due to corruption of image, scratches on the slides, poor hybridization, inadequate resolution, fabrication errors (de Brevern et al. 2004, Tuikkala et al. 2005); MS-based metabolomics data have 10%–20% missing values affecting up to 80% of variables due to lack of peak identification by chromatogram, limitation of computational detection, measurement error, and deconvolution errors (Hrydziuszko and Viant 2012, Sun and Xia 2023). The aforementioned technical reasons can lead to missing data that is either missing at random (MAR) or missing not at random (MNAR). When dealing with MNAR data, traditional imputation methods like multiple imputation may introduce bias. It is thus recommended to remove omics features which have large proportion (e.g. 50%) missingness over patients, and then apply imputation methods (e.g. k-nearest neighbor imputation) for the rest of the features with missing values before doing any statistical analysis or modeling. Alternatively, imputation-free methods (e.g. mixture models) that can deal with missing values can be applied directly (Taylor et al. 2022). Single-cell RNA-seq (scRNA-seq) data has a vast number of zeros, so-called gene dropout events, leading highly scarce data matrices. Jiang et al. (2022) discussed the sources of biological and nonbiological zeros in scRNA-seq data and the existing approaches for handling them.
由于各种实验原因,在许多类型的高维组学数据中经常观察到缺失值(Aittokallio 2009,Kong et al. 2022)。例如,由于图像损坏、载玻片上的划痕、杂交不良、分辨率不足、制造错误,来自微阵列的 mRNA 转录组学数据有 1%-10% 的缺失值,影响高达 95% 的基因(de Brevern 等人,2004 年,Tuikkala 等人,2005 年);由于无法通过色谱图鉴定峰、计算检测的限制、测量误差和反卷积误差,基于 MS 的代谢组学数据存在 10%-20% 的缺失值,影响高达 80% 的变量(Hrydziuszko 和 Viant 2012,Sun 和 Xia 2023)。上述技术原因可能导致随机缺失 (MAR) 或非随机缺失 (MNAR) 缺失数据。在处理 MNAR 数据时,传统的插补方法(如多重插补)可能会引入偏倚。因此,建议删除患者缺失比例较大(例如 50%)的组学特征,然后在进行任何统计分析或建模之前对其余缺失值的特征应用插补方法(例如 k 最近邻插补)。或者,可以直接应用可以处理缺失值的无插补方法(例如 混合模型)(Taylor 等人,2022 年)。单细胞 RNA-seq (scRNA-seq) 数据有大量的零,即所谓的基因缺失事件,导致数据矩阵高度稀缺。et al. (2022) 讨论了 scRNA-seq 数据中生物和非生物零的来源以及处理它们的现有方法。

3 Survival analysis with low-dimensional input data
3 使用低维输入数据进行生存分析

Let us assume we have data D={(T˜i,δi,Xi):i=1,,n} for n patients, where T˜i is the observed survival time, δi the censoring indicator and Xi contains p covariates including clinical, demographic and omics features. To estimate a survival function S(t) given the data D, one needs to keep track both of the number of patients at risk and those who left the study at any time point (here we only consider the case of right censoring and assume no delayed entry). At time t there are Y(t)=i=1nYi(t) patients at risk, where Yi(t)=1{T˜it} is the indicator that patient i is at risk. The nonparametric Kaplan–Meier (KM) estimator (Kaplan and Meier 1958) uses the multiplication rule for conditional probabilities to obtain an estimation of the survival function S^(t)=k=1TktK{1dkY(Tk)}, where all events happen (e.g. patients die) at K distinct times, T1<T2<<TK (Kn) and there are dk1 events happened at the time Tk. If no two events happen at the same time point, dk=1 and k=1,,n. The KM estimator gives an estimate of the marginal survival function, i.e. when you disregard the information from the covariates.
假设我们有数据 D ={T ̃ i,δ i,Xi):i=1,...n} 表示 n 名患者,其中 T ̃i 是观察到的生存时间,δi 是删失指标,Xi包含 p 个协变量,包括临床、人口统计学和组学特征。为了在给定数据 D 的情况下估计生存函数 St),需要跟踪处于风险中的患者数量和在任何时间点退出研究的患者数量(这里我们只考虑右删失的情况,并假设没有延迟进入)。 在时间 tYt=i=1nYit 患者处于风险中,其中 Yit=1{T ̃it} 是患者 i 处于风险中的指标。 非参数 Kaplan-Meier (KM) 估计器 (Kaplan 和 Meier 1958) 使用条件概率的乘法规则来获得生存函数 S^t=k=1 的估计值TktK{1dkYTk}所有事件都发生的地方(例如 患者死亡)在 K 个不同时间,T1<T2<<TK Kn 和有 dk1 事件发生在 Tk 时间。如果同一时间点没有两个事件发生,dk=1k=1...n.KM 估计器给出了边际生存函数的估计值,即当您忽略协变量的信息时。

Figure 2a shows the KM curve for TCGA-BRCA primary breast cancer patients data. Some basic statistics can be revealed from the survival curve. For example, the estimated median survival time, i.e. the time when the survival probability is 50%, of all the breast cancer patients is 10.8 years (dashed line in Fig. 2a), and 1-, 5-, and 10-year survival probabilities are 0.988, 0.853, and 0.658, respectively. A log-rank test (Peto and Peto 1972) can be used to test whether two groups of patients [e.g. with treatment (pharmaceutical/radiation therapy) or nontreatment] have the same (null hypothesis) or different survival functions (alternative hypothesis), and provide a corresponding P-value (see Fig. 2b). The log-rank test can also be used to compare the survival probabilities of any subgroups of patients based on other categorical variables.
图 2a 显示了 TCGA-BRCA 原发性乳腺癌患者的 KM 曲线数据。从生存曲线中可以揭示一些基本统计数据。例如,所有乳腺癌患者的估计中位生存时间,即生存概率为 50% 的时间为 10.8 年(图 2a 中的虚线),1 年、5 年和 10 年生存概率分别为 0.988、0.853 和 0.658。对数秩检验(Peto 和 Peto 1972)可用于检验两组患者 [例如接受治疗(药物治疗/放射治疗)或未接受治疗] 是否具有相同(零假设)或不同的生存函数(替代假设),并提供相应的 P 值(见图 2b)。对数秩检验还可用于根据其他分类变量比较任何患者亚组的生存概率。

Kaplan–Meier curves of TCGA-BRCA data. (a) KM curve of all the TCGA-BRCA patients’ survival data. (b) KM curves of the TCGA-BRCA patients’ survival data grouped by treatment (i.e. pharmaceutical/radiation therapy) or nontreatment. The log-rank test is used to compare the two survival distributions corresponding to the two groups of patients.
Figure 2.

Kaplan–Meier curves of TCGA-BRCA data. (a) KM curve of all the TCGA-BRCA patients’ survival data. (b) KM curves of the TCGA-BRCA patients’ survival data grouped by treatment (i.e. pharmaceutical/radiation therapy) or nontreatment. The log-rank test is used to compare the two survival distributions corresponding to the two groups of patients.


图 2.

TCGA-BRCA 数据的 Kaplan-Meier 曲线。(a) 所有 TCGA-BRCA 患者生存数据的 KM 曲线。(b) TCGA-BRCA 患者生存数据的 KM 曲线,按治疗(即 药物/放射治疗)或非治疗。log-rank 检验用于比较对应于两组患者的两个生存分布。

In the case where multiple clinical, demographic or omics features are available, one can explore each variable’s association with survival outcomes separately. For a categorical variable, KM curves and log-rank tests can be used to investigate whether there is a difference between multiple survival curves categorized by the variable. For a continuous variable X, the semi-parametric Cox proportional hazards model (Cox model, Cox 1972) is often used:
在有多种临床、人口统计学或组学特征可用的情况下,可以分别探索每个变量与生存结果的关联。对于分类变量,可以使用 KM 曲线和对数秩检验来调查按变量分类的多条生存曲线之间是否存在差异。对于连续变量 X,通常使用半参数 Cox 比例风险模型(Cox 模型,Cox 1972):
(1)
where h0(t) is the baseline hazard function and is left unspecified. As the name ‘proportional hazards’ implies, the Cox-model estimated hazard functions of two individuals (with different values of the covariate X) are indeed proportional, because h0(t) does not depend on X and is thus assumed to be the same for all individuals. The functional form (1) describes the log-linear relationship between variable X and the hazard h(t|X) at any given timepoint t. It may be difficult to satisfy the log-linear relationship based on the original scale of some omics features, e.g. gene expression data from a DNA microarray study; in those cases, the use of log2-transformation of the data (Table 3) can be helpful. The Cox model can also be used to investigate the risk of a categorical variable, provided the above assumptions are satisfied, which provides an effect estimate (hazard ratio, HR), in addition to the associated P-value.
其中 H0t 是基线风险函数,未指定。顾名思义,两个个体(具有不同的协变量 X 值)的 Cox 模型估计风险函数确实是成比例的,因为 h0t 不依赖于 X,因此假设所有个体都是相同的。泛函形式 (1) 描述了变量 X 与危险性 ht|X 在任何给定的时间点 t 上。可能很难根据某些组学特征的原始尺度满足对数线性关系,例如 来自 DNA 微阵列研究的基因表达数据;在这些情况下,使用数据的 log 2 转换(表 3)可能会有所帮助。如果满足上述假设,Cox 模型还可用于调查分类变量的风险,除了相关的 P 值外,它还提供效果估计值(风险比,HR)。

It is often of interest to gain insights into the multiple factors and their cooperation for the survival outcomes. Multivariable statistical analysis plays an important role in such multi-modal survival modeling. The univariate Cox model (1) can be straightforwardly generalized by including multiple clinical, demographic or omics variables of interest, as long as the total number of covariates is much smaller than the number of samples; this often requires the use of variable or feature selection methods. Heinze et al. (2018) provides pragmatic recommendations for practitioners of multivariable analysis with variable selection methods in low-dimensional modeling problems.
深入了解多种因素及其对生存结果的合作通常是有趣的。多变量统计分析在这种多模态生存建模中起着重要作用。单变量 Cox 模型 (1) 可以通过包括多个感兴趣的临床、人口统计学或组学变量来直接推广,只要协变量的总数远小于样本的数量;这通常需要使用变量或特征选择方法。Heinze et al. (2018) 为在低维建模问题中使用变量选择方法进行多变量分析的从业者提供了务实的建议。

4 Survival analysis with high-dimensional input data
4 使用高维输入数据的生存分析

Since some omics data contain high-number of variables (e.g. RNA-seq with ca. 60 000 transcriptomic features and DNA methylation with ca. 450 000 features), there is a need to reduce the computational and modeling burden in multivariable analyses. One heuristics approach to pre-select a subset of features is to include only omics features at a specific statistical significance level when fitting a univariate Cox model (1); the pre-specified significance level would usually be higher than the commonly used 0.05 threshold, e.g. 0.1 or 0.2, to avoid losing important features. However, this univariate approach focuses on features that are independently associated with the outcome and might miss variables that are important in combination with other features (Okser et al. 2013).
由于一些组学数据包含大量变量(例如,具有约 60 000 个转录组特征的 RNA-seq 和具有约 450 000 个特征的 DNA 甲基化),因此需要减少多变量分析中的计算和建模负担。一种预先选择特征子集的启发式方法是在拟合单变量 Cox 模型时仅包括特定统计显着性水平的组学特征 (1);预先设定的显著性水平通常会高于常用的 0.05 阈值,例如 0.1 或 0.2,以避免丢失重要特征。然而,这种单变量方法侧重于与结果独立相关的特征,并且可能会错过与其他特征相结合的重要变量(Okser 等人,2013 年)。

Another simple approach is to pre-select omics features by variance, since larger variability across patients usually implies higher biological information, or at least predictive signal. One can for example pre-select omics features explaining 50% of the total cumulative variance (Zhao and Zucknick 2020). Such unsupervised feature pre-selection is a recommended method to reduce dimensionality when dealing with hundreds or thousands of omics features, with the aim to improve the stability of final feature selection. For example, Bommert et al. (2022) showed that the simple variance filter was the best method among all considered filter methods in terms of the predictive accuracy, run time and the feature selection stability in their benchmark study. Zhao et al. (2024) showed that the stability of final feature subset critically depends on the pre-selected feature set when using a standard Bayesian stochastic search variable selection method (see Section 4.3), and their proposed method that used known biological relationships between omics features lead to a more stable feature selection and slightly improved outcome prediction.
另一种简单的方法是通过方差预先选择组学特征,因为患者之间更大的变异性通常意味着更高的生物学信息,或者至少是预测信号。例如,可以预先选择组学特征来解释总累积方差的 50%(Zhao 和 Zucknick 2020)。这种无监督特征预选是在处理成百上千个组学特征时降低维度的推荐方法,目的是提高最终特征选择的稳定性。例如,Bommert 等人(2022 年)表明,在他们的基准研究中,就预测准确性、运行时间和特征选择稳定性而言,简单方差滤波器是所有考虑的滤波器方法中最好的方法。Zhao et al. (2024) 表明,当使用标准贝叶斯随机搜索变量选择方法时,最终特征子集的稳定性在很大程度上取决于预先选择的特征集(参见第 4.3 节),他们提出的方法使用组学特征之间已知的生物学关系导致更稳定的特征选择和略微改善的结果预测。

When drawing conclusions on survival differences solely from the univariate Cox model (1), it is important to adjust the P-values of risk features for multiple comparisons to control false positives globally, e.g. by controlling the family-wise error rate (FWER) or false discovery rate (FDR). The problem of multiple testing is beyond the scope of this tutorial, but the interested readers are referred to a recent review article (Korthauer et al. 2019). We note also that univariate analysis does not consider any relationships between multiple omics features, e.g. potential confounders for survival outcomes (Clark et al. 2003). To avoid making seriously misleading conclusions in such cases, it is necessary to perform multivariable survival analysis (Bradburn et al. 2003).
当仅从单变量 Cox 模型 (1) 得出关于生存差异的结论时,重要的是要调整风险特征的 P 值以进行多重比较以全局控制假阳性,例如通过控制家庭错误率 (FWER) 或错误发现率 (FDR)。多重测试的问题超出了本教程的范围,但感兴趣的读者可以参考最近的一篇评论文章(Korthauer et al. 2019)。我们还注意到,单变量分析不考虑多个组学特征之间的任何关系,例如 生存结果的潜在混杂因素 (Clark et al. 2003)。为避免在这种情况下做出严重误导性的结论,有必要进行多变量生存分析(Bradburn 等人,2003 年)。

Omics data can have hundreds of thousands of variables measured at various molecular levels, which greatly challenges the classical multivariable regression models for time-to-event endpoints, since the number of variables is often much larger than the number of patients (i.e. pn). To proceed with survival analysis, one option is to reduce the dimensionality of the omics features via unsupervised learning, and then investigate the association of the learned low-dimensional variables with survival outcomes (see next section). An alternative approach is to directly use supervised learning methods, such as penalized regressions or sparse Bayesian models (Table 4), which enable the modeling of high-dimensional omics features, and the selection of key important features associated with the survival outcomes.
组学数据可以有数十万个在不同分子水平上测量的变量,这极大地挑战了用于事件发生时间终点的经典多变量回归模型,因为变量的数量通常比患者的数量大得多(即 pn)。要进行生存分析,一种选择是通过无监督学习来降低组学特征的维数,然后研究学习到的低维变量与生存结果的关联(见下一节)。另一种方法是直接使用监督学习方法,例如惩罚回归或稀疏贝叶斯模型(表 4),它们能够对高维组学特征进行建模,并选择与生存结果相关的关键重要特征。

Table 4. 表 4.

Cox-type supervised learning methods.
Cox 型监督学习方法。

Method 方法Feature selection via 特征选择Grouping effects 分组效果Uncertainty quantification
不确定性量化
Comment 评论
Penalized regressions: 惩罚回归:Penalty 罚款
 Lasso Cox (Tibshirani 1997)
Lasso Cox (Tibshirani 1997
1-norm
l1 范数
 Adaptive Lasso Cox (Zhang and Lu 2007)
适应性套索 Cox (Zhang 和 Lu 2007
Weighted 1-norm
加权 l1 范数
Less false positives than Lasso
比 Lasso 更少的误报
 Elastic Net Cox (Simon et al. 2011)
Elastic Net Cox (Simon et al. 2011
1/2-norm
l1/l2 标准杆
 Group-Lasso Cox (Kim et al. 2012)
Group-Lasso Cox (Kim et al. 2012
2-norm
l2 范数
Independent groups of features selected
选定的独立功能组
 Sparse Group-Lasso Cox (Simon et al. 2013)
稀疏群-Lasso Cox (Simon et al. 2013
1/2-norm
l1/l2 标准杆
 SCAD Cox (Fan and Li 2002)
SCAD Cox (Fan 和 Li 2002
Quadratic spline and symmetric penalty
二次样条和对称罚
Selection of relatively large effects
选择相对较大的效应
 SIS Cox (Fan et al. 2010)
SIS Cox (Fan et al. 2010
Top-ranked variables and any penalty
排名靠前的变量和任何惩罚
Suited to ultra-high dimensions
适用于超高尺寸
Bayesian models: 贝叶斯模型:Shrinkage prior 收缩先验
 Lee et al. (2011)
Lee 等人 (2011)
Lasso (Laplace) prior 套索 (拉普拉斯) priorSelection of posterior mean with a cutoff
选择具有截止值的后平均值
 Lee et al. (2015)
Lee 等人 (2015)
Elastic Net, group/fused Lasso priors
Elastic Net,分组/融合 Lasso 先验
Selection of posterior mean with a cutoff
选择具有截止值的后平均值
 Konrath et al. (2013)
Konrath 等人 (2013)
Spike-and-slab prior 钉板先验
 Madjar et al. (2021)
Madjar 等人 (2021)
Spike-and-slab and MRF priors
Spike-and-slab 和 MRF 先验
 Mu et al. (2021)
Mu 等人 (2021)
Horseshoe prior 马蹄形Selection of posterior mean with a cutoff
选择具有截止值的后平均值
MethodFeature selection viaGrouping effectsUncertainty quantificationComment
Penalized regressions:Penalty
 Lasso Cox (Tibshirani 1997)1-norm
 Adaptive Lasso Cox (Zhang and Lu 2007)Weighted 1-normLess false positives than Lasso
 Elastic Net Cox (Simon et al. 2011)1/2-norm
 Group-Lasso Cox (Kim et al. 2012)2-normIndependent groups of features selected
 Sparse Group-Lasso Cox (Simon et al. 2013)1/2-norm
 SCAD Cox (Fan and Li 2002)Quadratic spline and symmetric penaltySelection of relatively large effects
 SIS Cox (Fan et al. 2010)Top-ranked variables and any penaltySuited to ultra-high dimensions
Bayesian models:Shrinkage prior
 Lee et al. (2011)Lasso (Laplace) priorSelection of posterior mean with a cutoff
 Lee et al. (2015)Elastic Net, group/fused Lasso priorsSelection of posterior mean with a cutoff
 Konrath et al. (2013)Spike-and-slab prior
 Madjar et al. (2021)Spike-and-slab and MRF priors
 Mu et al. (2021)Horseshoe priorSelection of posterior mean with a cutoff
Table 4. 表 4.

Cox-type supervised learning methods.
Cox 型监督学习方法。

Method 方法Feature selection via 特征选择Grouping effects 分组效果Uncertainty quantification
不确定性量化
Comment 评论
Penalized regressions: 惩罚回归:Penalty 罚款
 Lasso Cox (Tibshirani 1997)
Lasso Cox (Tibshirani 1997
1-norm
l1 范数
 Adaptive Lasso Cox (Zhang and Lu 2007)
适应性套索 Cox (Zhang 和 Lu 2007
Weighted 1-norm
加权 l1 范数
Less false positives than Lasso
比 Lasso 更少的误报
 Elastic Net Cox (Simon et al. 2011)
Elastic Net Cox (Simon et al. 2011
1/2-norm
l1/l2 标准杆
 Group-Lasso Cox (Kim et al. 2012)
Group-Lasso Cox (Kim et al. 2012
2-norm
l2 范数
Independent groups of features selected
选定的独立功能组
 Sparse Group-Lasso Cox (Simon et al. 2013)
稀疏群-Lasso Cox (Simon et al. 2013
1/2-norm
l1/l2 标准杆
 SCAD Cox (Fan and Li 2002)
SCAD Cox (Fan 和 Li 2002
Quadratic spline and symmetric penalty
二次样条和对称罚
Selection of relatively large effects
选择相对较大的效应
 SIS Cox (Fan et al. 2010)
SIS Cox (Fan et al. 2010
Top-ranked variables and any penalty
排名靠前的变量和任何惩罚
Suited to ultra-high dimensions
适用于超高尺寸
Bayesian models: 贝叶斯模型:Shrinkage prior 收缩先验
 Lee et al. (2011)
Lee 等人 (2011)
Lasso (Laplace) prior 套索 (拉普拉斯) priorSelection of posterior mean with a cutoff
选择具有截止值的后平均值
 Lee et al. (2015)
Lee 等人 (2015)
Elastic Net, group/fused Lasso priors
Elastic Net,分组/融合 Lasso 先验
Selection of posterior mean with a cutoff
选择具有截止值的后平均值
 Konrath et al. (2013)
Konrath 等人 (2013)
Spike-and-slab prior 钉板先验
 Madjar et al. (2021)
Madjar 等人 (2021)
Spike-and-slab and MRF priors
Spike-and-slab 和 MRF 先验
 Mu et al. (2021)
Mu 等人 (2021)
Horseshoe prior 马蹄形Selection of posterior mean with a cutoff
选择具有截止值的后平均值
MethodFeature selection viaGrouping effectsUncertainty quantificationComment
Penalized regressions:Penalty
 Lasso Cox (Tibshirani 1997)1-norm
 Adaptive Lasso Cox (Zhang and Lu 2007)Weighted 1-normLess false positives than Lasso
 Elastic Net Cox (Simon et al. 2011)1/2-norm
 Group-Lasso Cox (Kim et al. 2012)2-normIndependent groups of features selected
 Sparse Group-Lasso Cox (Simon et al. 2013)1/2-norm
 SCAD Cox (Fan and Li 2002)Quadratic spline and symmetric penaltySelection of relatively large effects
 SIS Cox (Fan et al. 2010)Top-ranked variables and any penaltySuited to ultra-high dimensions
Bayesian models:Shrinkage prior
 Lee et al. (2011)Lasso (Laplace) priorSelection of posterior mean with a cutoff
 Lee et al. (2015)Elastic Net, group/fused Lasso priorsSelection of posterior mean with a cutoff
 Konrath et al. (2013)Spike-and-slab prior
 Madjar et al. (2021)Spike-and-slab and MRF priors
 Mu et al. (2021)Horseshoe priorSelection of posterior mean with a cutoff

4.1 Unsupervised learning
4.1 无监督学习

Unsupervised learning methods aim to identify hidden patterns or data groupings, and are for example useful when a phenotype (e.g. a cancer type) is to be divided into several subtypes (e.g. to explain heterogeneity among patients). For example, breast cancer has been traditionally categorized into five conceptual molecular classes, originally using pairwise average-linkage cluster analysis of DNA microarray data, to better understand tumor biology and guide clinical decision making (Perou et al. 2000). Unsupervised methods learn underlying patterns from unlabeled data by transforming high-dimensional omics features into a lower dimensional space. Principal component analysis (PCA) is a classical multivariate technique that represents high-dimensional features in a low-dimensional space by building orthogonal (uncorrelated) linear combinations of the features that best capture the variance in the high-dimensional data.
无监督学习方法旨在识别隐藏的模式或数据分组,例如,当表型(例如癌症类型)要分为几个亚型时(例如,解释患者之间的异质性)时,这种方法非常有用。例如,乳腺癌传统上被分为五个概念分子类别,最初使用 DNA 微阵列数据的成对平均连锁聚类分析,以更好地了解肿瘤生物学并指导临床决策(Perou 等人,2000 年)。无监督方法通过将高维组学特征转换为低维空间,从未标记的数据中学习基本模式。主成分分析 (PCA) 是一种经典的多元技术,它通过构建最能捕获高维数据中方差的特征的正交(不相关)线性组合来表示低维空间中的高维特征。

Different from the distance-based PCA with linear transformation, nonlinear techniques have recently emerged, such as t-stochastic neighbor embedding (t-SNE). t-SNE uses pairwise similarities of individuals based on Kullback–Leibler divergence to project similarities into a lower dimensional space, ensuring that individuals with similar omics features are close in the generated embedding (van der Maaten and Hinton 2008). The focus of t-SNE is on preserving local distances between neighboring data points. Another nonlinear alternative to PCA is UMAP (Uniform Manifold Approximation and Projection), which is a general dimension reduction method built upon Riemannian geometry (McInnes et al. 2018). Compared to other nonlinear methods of dimension reduction such as t-SNE, UMAP can sometimes provide better visualization quality in a shorter amount of time, while preserving the global structure of the omics data better.
与线性变换的基于距离的 PCA 不同,最近出现了非线性技术,例如 t 随机邻域嵌入 (t-SNE)。t-SNE 使用基于 Kullback-Leibler 散度的个体成对相似性,将相似性投影到较低维空间中,确保具有相似组学特征的个体在生成的嵌入中接近(van der Maaten 和 Hinton 2008)。t-SNE 的重点是保持相邻数据点之间的本地距离。PCA 的另一种非线性替代方案是 UMAP(均匀流形近似和投影),这是一种建立在黎曼几何基础上的通用降维方法(McInnes 等人,2018 年)。与其他非线性降维方法(如 t-SNE)相比,UMAP 有时可以在更短的时间内提供更好的可视化质量,同时更好地保留组学数据的全局结构。

Unsupervised methods only make use of the input data matrix X and are agnostic to the survival information. To find out whether a given omics profile (i.e. the full set of omics features, not individual features) is associated with survival outcomes, a straightforward approach is to use a few representative components from PCA, t-SNE, or UMAP as covariates in a multivariable Cox model. An alternative is to use semi-supervised methods (Bair and Tibshirani 2004) that combine the clustering procedure and survival modeling together. However, such principal component regression and semi-supervised methods lose interpretability of the individual omics features, since each component is a linear or nonlinear combination of all omics features.
无监督方法仅使用输入数据矩阵 X ,并且与生存信息无关。为了找出给定的组学概况(即完整的组学特征集,而不是单个特征)是否与生存结果相关,一种简单的方法是使用 PCA、t-SNE 或 UMAP 中的一些代表性成分作为多变量 Cox 模型中的协变量。另一种方法是使用半监督方法 (Bair 和 Tibshirani 2004),将聚类过程和生存建模结合在一起。然而,这种主成分回归和半监督方法失去了单个组学特征的可解释性,因为每个成分都是所有组学特征的线性或非线性组合。

4.2 Supervised learning via penalized regressions
4.2 通过惩罚回归进行监督学习

For the purpose of personalized cancer medicine, one is typically interested in identifying risk factor combinations from clinical and omics features. These factors can be targeted (directly or indirectly) via therapeutic strategies or used for diagnostics. Therefore, the objective is to identify a parsimonious set of features linked to survival outcomes by utilizing the wealth of information present in, e.g. the vast amount of available omics data. Penalized Lasso Cox regression (Tibshirani 1997) can be used to select a few relevant omics features by estimating their coefficients as nonzero (the nonrelevant features’ coefficients are shrunk to zero) via maximizing the penalized partial log-likelihood function of the regression coefficients with 1-norm penalty
出于个性化癌症医学的目的,人们通常对从临床和组学特征中识别风险因素组合感兴趣。这些因素可以通过治疗策略(直接或间接)靶向或用于诊断。因此,目标是通过利用大量可用的组学数据等大量信息来识别一组与生存结果相关的简洁特征。惩罚 Lasso Cox 回归 (Tibshirani 1997) 可用于选择一些相关的组学特征,方法是将它们的系数估计为非零(不相关特征的系数缩小到零),通过最大化具有 l1 范数惩罚的回归系数的惩罚部分对数似然函数
(2)
Here, 2/n is a scaling factor for convenience, D={(T˜i,δi,Xi):i=1,,n}, Xi includes p (omics) features of the ith patient, λ0 is a tuning parameter to control the overall penalty strength of the coefficients, ||β||1=j=1p|βj|, and the partial log-likelihood is (β|D)=logi=1n{exp(Xiβ)lRkexp(Xlβ)}δi, where Rk={l:Yl(Tk)=1} is the risk set at time Tk. The Elastic Net Cox model (Simon et al. 2011) considers both the Lasso feature selection and the grouping effect of correlated omics features in ridge regression via a combination of the 1- and 2-norm (i.e. 1/2-norm) penalty λ{α||β||1+12(1α)||β||22} (where ||β||22=j=1p|βj|2), which can usually improve the prediction performance over the Lasso Cox model. Figure 3 shows an example of Elastic Net Cox model feature selection from gene expression features associated with breast cancer patients’ survival. Note that often we wish to include a small set of well-established clinical risk factors in a survival model. Since they are established as important covariates, they can be included in the Lasso or Elastic Net Cox model as mandatory covariates without penalization. Then the penalized partial log-likelihood function becomes
这里,2/n 是方便使用的比例因子,D={T ̃ i,δ i,Xi:i=1...n}Xi 包括第 i个患者的 p(组学)特征,λ0 是一个调整参数,用于控制系数的整体惩罚强度, ||β||1=j=1p|βj|,部分对数似然为 lβ|D=log i=1n{ expXiβlRk exp Xlβ}δi其中Rk={lYlTk=1} 是时间设置的风险k。弹性网络考克斯模型(Simon 等人,2011 年)通过结合 l1l2 范数(即 L1/L2-norm) 罚 λ{α||β||1+121α||β||22} (其中 ||β||22=j=1p|βj|阿拉伯数字),这通常可以提高 Lasso Cox 模型的预测性能。图 3 显示了从与乳腺癌患者生存相关的基因表达特征中选择 Elastic Net Cox 模型特征的示例。请注意,我们通常希望在生存模型中包括一小部分公认的临床风险因素。由于它们被确立为重要协变量,因此它们可以作为强制协变量包含在 Lasso 或 Elastic Net Cox 模型中,而无需进行惩罚。然后,被惩罚的偏对数似然函数变为
(3)
where β0 are coefficients corresponding to the ith individual’s mandatory covariates X0i, pen(β) is a 1- or 1/2-norm penalty for feature selection of omics features Xi. De Bin et al. (2014) investigated more strategies to combine a low-dimensional set of well-established clinical factors and high-dimensional omics features into a global prediction model.
其中 β 0 是对应于第 i个个体的强制协变量 X0i 的系数,pen(β) 是 a l 1- 或 l1/l2 范数惩罚组学特征选择 XiDe Bin et al. (2014) 研究了更多策略,将一组低维的成熟临床因素和高维组学特征结合到一个全局预测模型中。
Coefficient trace plot of an Elastic Net Cox model for overall survival prognosis of breast cancer patients from TCGA based on transcriptomic data and mandatory demographic variables age and ethnicity. The y-axis shows the magnitude of each feature’s coefficient given the strength of penalization displayed on the x-axis (from left to right the penalization decreases). The vertical line indicates the optimal λ=0.032 (maximizes the partial likelihood via cross-validation) and its corresponding selected feature names are listed on the right side. Note that the demographic variables age and ethnicity were not penalized, so that their coefficient paths did not start from zero in the figure.
Figure 3.

Coefficient trace plot of an Elastic Net Cox model for overall survival prognosis of breast cancer patients from TCGA based on transcriptomic data and mandatory demographic variables age and ethnicity. The y-axis shows the magnitude of each feature’s coefficient given the strength of penalization displayed on the x-axis (from left to right the penalization decreases). The vertical line indicates the optimal λ=0.032 (maximizes the partial likelihood via cross-validation) and its corresponding selected feature names are listed on the right side. Note that the demographic variables age and ethnicity were not penalized, so that their coefficient paths did not start from zero in the figure.


图 3.
基于转录组数据和强制性人口统计变量年龄和种族的 TCGA 乳腺癌患者总生存预后的弹性网考克斯模型的系数轨迹图。y 轴显示给定 x 轴上显示的惩罚强度的每个特征系数的大小(从左到右,惩罚减少)。垂直线表示最佳 λ = 0.032 (通过交叉验证最大化部分似然)及其相应的所选特征名称列在右侧。请注意,人口统计变量 age 和 ethnicity 没有受到惩罚,因此它们的系数路径在图中不是从零开始的。

There are many alternative penalties that will achieve feature selection which have been applied to Cox proportional hazards regression for survival outcomes, such as the Adaptive Lasso Cox model that incorporates different penalties for different coefficients to retain important variables (Zhang and Lu 2007), Group-Lasso that performs group selection on (predefined) groups of variables (Kim et al. 2012), Sparse Group-Lasso that introduces sparse effects both on a group and within group level (Simon et al. 2013), the smoothly clipped absolute deviation (SCAD) Cox model that overcomes substantially biased estimates for large coefficients in ultra-sparse models (Fan and Li 2002), sure independence screening (SIS) procedure in combination with Cox model that speeds-up the feature selection dramatically and can also improve the accuracy of estimation when dimensionality becomes ultra-high, i.e. log(p)=O(nξ) for some ξ>0 (Fan et al. 2010). However, all these penalized Cox models do not directly provide uncertainty of feature selection or survival prediction. One empirical way for uncertainty quantification is through additional resampling-based methods, see Section 5 for more details.
有许多替代惩罚可以实现特征选择,这些惩罚已应用于生存结果的 Cox 比例风险回归,例如自适应套索 Cox 模型,该模型对不同系数进行了不同的惩罚以保留重要变量(Zhang 和 Lu 2007),对(预定义)变量组执行组选择的 Group-Lasso(Kim 等人,2012 年)、在组级别和组内引入稀疏效应的稀疏组-套索(Simon et al. 2013),平滑裁剪绝对偏差 (SCAD) Cox 模型克服了超稀疏模型中大系数的实质性偏差估计(Fan 和 Li 2002),确定独立性筛选 (SIS) 程序与 Cox 模型相结合,大大加快了特征选择的速度,并且还可以提高维数变得超高,即 logp=Onξ 对于某些 ξ>0Fan et al. 2010)。然而,所有这些被惩罚的 Cox 模型并不直接提供特征选择或生存预测的不确定性。不确定性量化的一种实证方法是通过其他基于重采样的方法,有关详细信息,请参见第 5 节。

4.3 Supervised learning via Bayesian priors
4.3 通过贝叶斯先验进行监督学习

Bayesian inference is an appealing approach for survival analysis due to its ability to provide straightforward uncertainty quantification (e.g. credible intervals) of parameters and survival probabilities. For instance, Lee et al. (2011) proposed a Bayesian version of the Lasso Cox (Bayesian Lasso Cox) model that provides credible intervals of coefficients fairly straightforward (see Fig. 4), but it is not easy to derive confidence intervals of coefficients in a Lasso-type model.
贝叶斯推理是一种有吸引力的生存分析方法,因为它能够提供参数和生存概率的直接不确定性量化(例如可信的区间)。例如,Lee 等人(2011 年)提出了套索考克斯(Bayesian Lasso Cox)模型的贝叶斯版本,该模型提供了相当简单的可信系数区间(见图 4),但在套索型模型中推导出系数的置信区间并不容易。

Point estimation and uncertainty quantification of regression coefficients using Bayesian Lasso Cox model with Laplace prior for overall survival prognosis of breast cancer patients from TCGA based on transcriptomic data and mandatory demographic variables age and ethnicity. Solid dots indicate the posterior mean over 20000 MCMC iterations excluding burn-in period, and horizontal lines show the corresponding 95% credible intervals.
Figure 4.

Point estimation and uncertainty quantification of regression coefficients using Bayesian Lasso Cox model with Laplace prior for overall survival prognosis of breast cancer patients from TCGA based on transcriptomic data and mandatory demographic variables age and ethnicity. Solid dots indicate the posterior mean over 20000 MCMC iterations excluding burn-in period, and horizontal lines show the corresponding 95% credible intervals.


图 4.

使用贝叶斯 Lasso Cox 模型和拉普拉斯先验根据转录组数据和强制性人口统计变量年龄和种族对 TCGA 乳腺癌患者总生存期预后的回归系数进行点估计和不确定性量化。实线表示 20000 次 MCMC 迭代的后验平均值,不包括老化期,水平线表示相应的 95% 可信区间。

The fundamental theorem of Bayesian methods is Bayes’ rule. Let β be the parameters of interest, e.g. gene effects, external to the data D. To estimate the parameters β given the data information, one can use Bayes’ rule to obtain the conditional (posterior) distribution of β:
贝叶斯方法的基本定理是贝叶斯规则。设 β 为数据 D 外部的感兴趣参数,例如基因效应。为了在给定数据信息的情况下估计参数 β ,可以使用贝叶斯规则来获得 的条件(后验)分布: β
where f(D) is a normalization constant that can be neglected in inference since the data are already observed, f(D|β) is the likelihood of the data viewed as a function of the parameters of a statistical model and f(β) is the prior distribution of β. The prior distribution can be chosen either based on historical data from past similar studies or from popular (non)informative priors (Ibrahim et al. 2001). The estimation of β is to maximize the posterior f(β|D) or logf(β|D), which is equivalent to maximize the sum of the log-likelihood and the log prior, i.e. logf(β|D)=logf(D|β)+logf(β), which takes into account both the observed data information and the prior information in an optimal way.
其中 fD 是一个归一化常数,由于已经观察到数据,因此在推理中可以忽略它,fD| β 是将数据视为统计模型参数函数的可能性,f() β β先验分布。可以根据过去类似研究的历史数据或流行的(非)信息先验(Ibrahim 等人,2001 年)来选择先验分布。β估计是为了最大化后验 fβ|Dlog fβ|D 的 d) 进行调用,它等效于最大化对数似然和对数先验之和,即 log fβ|D=log fD|β+log fβ ,它以最佳方式同时考虑观测数据信息和先验信息。
The Bayesian version of the Lasso Cox model can have a log posterior similar to the frequentist penalized partial log-likelihood function (2), if we assign independent double exponential (also known as Laplace, Fig. 5a) prior, f(β)=j=1pλ2exp{λ|βj|} with a scale parameter λ>0, for all the coefficients, i.e.
如果我们分配独立的双指数(也称为拉普拉斯,图 5a)先验,fβ=j=1pλ2,则 Lasso Cox 模型的贝叶斯版本可以具有类似于频率惩罚偏对数似然函数 (2) 的对数后验 实验 {λ|βj|} 的比例参数 λ>0,对于所有系数,即
(4)
where *(β|D) is the full log-likelihood function i.e. log(0th0(s)ds)+(β|D), C is a normalization constant independent of β and the 1-norm penalty tends to choose only a few nonzero coefficients. Markov chain Monte Carlo (MCMC) sampling can be performed for posterior inference of β. Note that instead of using the partial log-likelihood (β|D) in (4), a full log-likelihood function is used, which includes the baseline hazard function. This can be achieved, e.g. by assigning a stochastic process prior, e.g. a gamma process, for the cumulative baseline hazard function. More details about the prior setup and inference can be found in Lee et al. (2011). Lee et al. (2015) extended the Laplace prior to Elastic Net prior, fused Lasso prior and group Lasso prior, which are often more suitable for correlated omics features in survival analysis. But Lee et al. (2011, 2015) assign the same shrinkage priors to all covariates indiscriminately, see Zucknick et al. (2015) for an extended Bayesian Lasso Cox model which permits the use of mandatory covariates.
其中 l* β |D 是完整的对数似然函数,log0th0sds+l β |D 中,C 是独立于 β 的归一化常数,并且 l 1 范数惩罚往往只选择几个非零系数。马尔可夫链蒙特卡洛 (MCMC) 采样可用于β的后验推断。请注意,不要使用部分对数似然 lβ|D(4) 中,使用了完整的对数似然函数,其中包括基线风险函数。这可以通过实现,例如,通过为累积基线风险函数分配一个先验的随机过程,例如伽马过程。有关先验设置和推理的更多详细信息,请参见 Lee et al. (2011)。Lee et al. (2015) 扩展了 Elastic Net 先验之前的 Laplace,融合了 Lasso 先验和组 Lasso 先验,这通常更适合于生存分析中的相关组学特征。但 Lee 等人。 (20112015) 不加选择地将相同的收缩先验分配给所有协变量,参见 Zucknick 等人(2015 年)的扩展贝叶斯套索 Cox 模型,该模型允许使用强制性协变量。
Density of shrinkage priors in Bayesian survival modeling. (a) Density of the double exponential (Laplace) prior. (b) Density of the mixture spike-and-slab prior. The spike component δ0(βj) induces βj=0 and the slab component N(0,τj2) induces βj≠0. (c) Density of the horseshoe prior. The shrinkage weight κj close to 0 shrinks βj toward zero, and κj close to 1 allows βj to escape the shrinkage.
Figure 5.

Density of shrinkage priors in Bayesian survival modeling. (a) Density of the double exponential (Laplace) prior. (b) Density of the mixture spike-and-slab prior. The spike component δ0(βj) induces βj=0 and the slab component N(0,τj2) induces βj0. (c) Density of the horseshoe prior. The shrinkage weight κj close to 0 shrinks βj toward zero, and κj close to 1 allows βj to escape the shrinkage.


图 5.

贝叶斯生存模型中的收缩先验密度。(a) 双指数 (Laplace) 先验的密度。(b) 混合物 spike-slab 的密度。尖峰分δ0βj 导致βj=0,板分量 N0,τ j2 诱导 βJ0。(c) 马蹄形先验的密度。接近 0 的收缩权重 κjβj 向零收缩,而接近 1 的 κj 会允许 βj 逃避收缩。

Note that the Bayesian version of the Lasso with Laplace-type priors in practice do not result in automatic feature selection, because only the posterior modes of the coefficients are equivalent to the frequentist Lasso solution, while in Bayesian inference one usually focuses on the posterior means as point estimates. As an alternative, a particular omics feature can be excluded if the estimated credible interval of the corresponding coefficient covers zero.
请注意,在实践中,具有拉普拉斯类型先验的 Lasso 的贝叶斯版本不会导致自动特征选择,因为只有系数的后验模式等价于频率拉索解,而在贝叶斯推理中,人们通常将后验均值作为点估计。作为替代方案,如果相应系数的估计可信区间覆盖零,则可以排除特定的组学特征。

Stochastic search variable selection (SSVS) is an alternative approach to identify important covariates (George and McCulloch 1993, Konrath et al. 2013). SSVS uses independent spike-and-slab priors for regression coefficients, e.g.
随机搜索变量选择 (SSVS) 是识别重要协变量的另一种方法(George 和 McCulloch 1993,Konrath 等人,2013 年)。SSVS 使用独立的 spike-and-slab 先验作为回归系数,例如
(5)
where γj (j=1,,p) is a latent variable (which can have a Bernoulli prior with a fixed probability π) for feature selection indicating βj0 if γj=1 and βj=0 if γj=0, τj2 is an additional shrinkage parameter which can be assigned with an additional prior (e.g. exponential or inverse gamma prior), and δ0(·) is the Dirac delta function. Figure 5b shows the two components of the spike-and-slab mixture distribution. Madjar et al. (2021) proposed graph-structured feature selection priors for Cox model by assigning a Markov random field prior on the latent variables in the spike-and-slab priors (5), in which the graph helps to identify pathways of functionally related genes or proteins that are simultaneously prognostic in different patient cohorts. Formulation (5) implies independence between the priors of the individual βj in the slab component. In contrast, a variant of the spike-and-slab prior has a g-prior slab (Zellner 1986, Held et al. 2016)
其中 γjj=1p) 是一个潜在变量(可以具有具有固定概率 π 的伯努利先验),用于指示 βj如果 j=1则为 0 γ β如果 j=0,则γ j=0τj2 是一个附加的收缩参数,可以分配一个附加的先验(例如指数或逆伽马先验),δ 0· 是 Dirac delta 函数。图 5b 显示了 spike-splate 混合物分布的两个分量。Madjar et al. (2021) 通过在尖峰和平板先验 (5) 中的潜在变量上分配马尔可夫随机场先验,为 Cox 模型提出了图结构特征选择先验,其中该图有助于识别功能相关基因或蛋白质的通路,这些基因或蛋白质在不同患者队列中同时预后。 公式 (5) 意味着板组件中单个 βj 的先验之间的独立性。相比之下,尖峰和平板先验的变体具有 g 先验平板(Zellner 1986,Held 等人。 2016 年)
where βγ={βj:γj=1,j=1,,p}, γ={γj:j=1,,p}, g is either a scalar estimated by Empirical Bayes or assigned with additional prior, and Iβγ,βγ1 is the expected Fisher information for βγ.
其中 βγ={βjγj=1,j=1...p}γ={γjj=1...p}g 是由经验贝叶斯估计的标量,或者是分配了额外的先验的标量,Iβ γ,βγ1 是预期的 Fisher 信息βγ
Another popular shrinkage prior is the horseshoe prior, a continuous and global-local shrinkage prior, in which the global parameters allow for sharing information across omics features and the local parameters allow for adjustments at the individual omics feature level (Carvalho et al. 2009, Mu et al. 2021). In a similar setup to the Cox model with spike-and-slab priors in (5), a horseshoe prior for the regression coefficient is
另一种流行的收缩先验是马蹄形先验,这是一种连续的全局-局部收缩先验,其中全局参数允许在组学特征之间共享信息,而局部参数允许在单个组学特征水平进行调整(Carvalho 等人,2009 年,Mu 等人,2021 年)。在与 (5) 中具有尖峰和平板先验的 Cox 模型类似的设置中,回归系数的马蹄形先验为
where the local parameter λj and global parameter τ are both half-Cauchy distributed C+(·,·). With the horseshoe prior, the posterior mean of βj will be shrunk by a weight κj=11+λj2(0,1) as in Fig. 5c, where κj0 induces βj0. Using a user-adjustable cutoff value, many coefficients can be shrunk to zero, enabling the selection of only a few associated omics features (with nonzero coefficients).
其中,局部参数 λj 和全局参数 τ 都是半柯西分布C+·· 中。使用马蹄形先验时,βj后验均值将缩小权重 κj=11+λj 20,1),如图 5c 所示 ,其中 κj0 诱导βj0。使用用户可调的截止值,许多系数可以缩小到零,从而只能选择少数相关的组学特征(系数为非零)。

Although Bayesian models can quantify uncertainty of estimators more straightforward than penalized regressions, most Bayesian Cox-type models for high-dimensional covariates do not have user-friendly and standalone R packages on CRAN or GitHub. The main reason is the high computational cost of running a high-dimensional Bayesian Cox model. Advanced users with programming capabilities can contact the corresponding authors for original scripts. Since Bayesian priors are more flexible than frequentist Lasso-type penalties, it can be easier to extend Bayesian models by changing shrinkage priors while keeping almost the same algorithm framework. This means that the Bayesian framework can be more suitable if one is interested in tailoring the shrinkage effects, e.g. to include prior knowledge about the importance of omics features, e.g. features corresponding to a molecular pathway that is known to be affected in the disease under study. For more information on different Bayesian priors in cancer prognosis, we suggest a recent review by Chu et al. (2022) which summarized other different shrinkage priors on regression coefficients, such as Gaussian-gamma, Gaussian, Cauchy, pMOM (product moment distribution), piMOM (product inverse moment distribution) and peNMIG (parameter-expanded normal-mixture-of-inverse-gamma distribution) priors.
尽管贝叶斯模型可以比惩罚回归更直接地量化估计量的不确定性,但大多数用于高维协变量的贝叶斯 Cox 型模型在 CRAN 或 GitHub 上没有用户友好的独立 R 包。主要原因是运行高维 Bayesian Cox 模型的高计算成本。具有编程能力的高级用户可以联系相应的作者获取原始脚本。由于贝叶斯先验比频率 Lasso 类型的惩罚更灵活,因此在保持几乎相同的算法框架的同时,通过更改收缩先验来扩展贝叶斯模型会更容易。这意味着如果有人对定制收缩效应感兴趣,例如包括关于组学特征重要性的先验知识,例如对应于已知在所研究疾病中受影响的分子途径的特征,贝叶斯框架可能更合适。有关癌症预后中不同贝叶斯先验的更多信息,我们建议 Chu 等人(2022 年)最近的一篇综述,该综述总结了回归系数的其他不同收缩先验,例如高斯伽马、高斯、柯西、pMOM(乘积矩分布)、piMOM(乘积逆矩分布)和 peNMIG(参数扩展的正态混合逆伽马分布)先验。

5 Survival model validation
5 生存模型验证

Model validation plays an important role in identifying potential issues, such as model misspecification or overfitting. This is achieved by revisiting the model’s specifications and assumptions following model estimation. For example, the Cox model (1) requires proportional hazards and the logarithm of the hazard to be linear with respect to the model covariates. The former assumption can for example be checked by the cumulative Schoenfeld residuals (Grambsch and Therneau 1994), and the latter assumption can be checked by plotting a nonlinear functional form (e.g. spline) for the effect of a covariate. If the Cox model assumptions are not satisfied, one can try certain transformations of covariates (e.g. Box-Cox power transformations, Box and Cox 1964), allow time-varying coefficients or model interactions among covariates (Ng et al. 2023), or investigate patient stratification using unsupervised approaches (Cristescu et al. 2015). However, the assumption checks are usually suitable only for low-dimensional models, i.e. for a few clinical variables or a few factors projected from the high-dimensional omics feature space. Novel approaches for assumption checks in general high-dimensional settings require further methodological developments. Johnson and Long (2011) used heuristic methods to investigate the Cox model assumptions by separately fitting univariate Cox models one feature at one time, and check P-values for the score tests of individual features and P-values for testing the proportional hazards assumption of univariate Cox models. But the univariate Cox models do not take into account confounding variables. An alternative approach is to loosen the model assumptions for a more robust modeling approach. One example developed specifically for feature selection under possibly nonproportional hazards in a high-dimensional space is concordance regression (Dunkler et al. 2010).
模型验证在识别潜在问题(例如模型错误指定或过度拟合)方面起着重要作用。这是通过在模型估计后重新审视模型的规范和假设来实现的。例如,Cox 模型 (1) 要求比例风险和风险的对数相对于模型协变量呈线性。例如,前一个假设可以通过累积 Schoenfeld 残差 (Grambsch and Therneau 1994) 来检查,后一个假设可以通过绘制协变量效应的非线性函数形式(例如样条)来检查。如果不满足 Cox 模型假设,可以尝试协变量的某些变换(例如 Box-Cox 幂变换、Box 和 Cox 1964),允许协变量之间的时变系数或模型交互(Ng 等人,2023 年),或使用无监督方法研究患者分层(Cristescu 等人,2015 年).然而,假设检查通常仅适用于低维模型,即从高维组学特征空间投射的少数临床变量或少数因素。在一般高维设置中进行假设检查的新方法需要进一步的方法论发展。Johnson 和 Long (2011) 使用启发式方法通过一次单独拟合单变量 Cox 模型一个特征来研究 Cox 模型假设,并检查单个特征的分数测试的 P 值和用于测试单变量 Cox 模型的比例风险假设的 P 值。但是单变量 Cox 模型没有考虑混杂变量。 另一种方法是放宽模型假设,以获得更稳健的建模方法。专门为高维空间中可能的非比例风险下的特征选择开发的一个例子是一致性回归(Dunkler 等人,2010 年)。

5.1 Feature stability analysis
5.1 特征稳定性分析

One important aspect in model validation when using omics or other high-dimensional data is the potential instability of feature selection (Kalousis et al. 2007). Feature selection using penalized regressions as described in Section 4.2 heavily depends on the values of the penalty parameters [e.g. for the λ parameter in Lasso Cox model (2)]. The penalty parameters are often optimized by cross-validation (CV) or other resampling methods, and the uncertainty associated with the random selection of subsets may result in uncertainty in the feature selection, e.g. different CV folds will typically result in different selected features. A straightforward way to identify the most stable features is to find the overlap of identified omics features among different data subsets (e.g. CV folds or resamples) to avoid high false discovery rate (Zucknick et al. 2008). One can also perform stability selection (Meinshausen and Bühlmann 2010), which allows to select the most stable features at a given Type-I error level for a Lasso or Elastic Net Cox model (Sill et al. 2014).
使用组学或其他高维数据时,模型验证的一个重要方面是特征选择的潜在不稳定性(Kalousis 等人,2007 年)。如第 4.2 节所述,使用惩罚回归的特征选择在很大程度上取决于惩罚参数的值 [例如,对于 Lasso Cox 模型 (2) 中的参数]。 λ 惩罚参数通常通过交叉验证 (CV) 或其他重采样方法进行优化,与随机选择子集相关的不确定性可能会导致特征选择的不确定性,例如,不同的 CV 折叠通常会导致不同的选定特征。识别最稳定特征的一种直接方法是找到不同数据子集(例如 CV 折叠或重新采样)之间已识别的组学特征的重叠,以避免高错误发现率(Zucknick 等人,2008 年)。还可以进行稳定性选择(Meinshausen 和 Bühlmann 2010),这允许在给定的 I 类误差级别为 Lasso 或 Elastic Net Cox 模型选择最稳定的特征(Sill 等人,2014 年)。

For the Bayesian models in Section 4.3, feature selection stability is naturally assessed by the uncertainty of coefficients’ estimators, as reflected in the posterior variances of βj or the posterior selection probabilities p(γj|D) (in SSVS), which is a natural benefit of utilizing full Bayesian inference. Although the uncertainty in feature selection introduces increased variability in the predicted survival probabilities, in the Bayesian framework, this can be addressed quite naturally by averaging the survival predictions over all models using Bayesian model averaging (Volinsky et al. 1997). If one is interested in a single model, rather than model averaging, the median probability model (Barbieri and Berger 2004) can be used for uncertainty analyses in survival and high-dimensional omics data (Madjar et al. 2021).
对于第 4.3 节中的贝叶斯模型,特征选择稳定性自然是通过系数估计量的不确定性来评估的,这反映在 βj后验方差或后验选择概率 pγj|D (在 SSVS 中),这是利用完整贝叶斯推理的自然好处。尽管特征选择的不确定性增加了预测生存概率的可变性,但在贝叶斯框架中,这可以通过使用贝叶斯模型平均对所有模型的生存预测进行平均来自然地解决(Volinsky 等人,1997 年)。如果对单个模型感兴趣,而不是模型平均,中位概率模型(Barbieri 和 Berger 2004)可用于生存和高维组学数据的不确定性分析(Madjar 等人,2021 年)。

5.2 Survival prediction and calibration
5.2 生存预测和校准

The fundamental goal of any statistical prediction model is to achieve a better prediction performance than an existing statistical model which we could call the ‘conventional model’ (Gerds and Kattan 2021). The special nature of the combination of clinical and/or other known prognostic factors (typically low-dimensional, with established effect) and novel omics features (high-dimensional, with unknown effect) should especially be taken into account. Therefore, a survival model consisting of only established clinical and/or other known prognostic factors should serve as the benchmark (i.e. conventional model) for the upcoming modeling. The inclusion of new covariates (e.g. omics features) into a prognostic model only makes sense if the new covariates have added prognostic value over the established clinical prognostic factors (De Bin et al. 2014), i.e. the new prognostic model consisting of the new covariates plus benchmark covariates improves the prediction performance over the conventional model.
任何统计预测模型的基本目标都是实现比现有统计模型更好的预测性能,我们可以称之为 “传统模型”(Gerds 和 Kattan 2021)。应特别考虑临床和/或其他已知预后因素(通常是低维的,具有既定效果)和新的组学特征(高维,效果未知)组合的特殊性质。因此,仅由已建立的临床和/或其他已知预后因素组成的生存模型应作为即将进行的建模的基准(即 常规模型)。只有当新的协变量在已建立的临床预后因素上增加了预后价值时,将新的协变量(例如组学特征)纳入预后模型才有意义(De Bin 等人,2014 年),即由新协变量和基准协变量组成的新预后模型提高了预测性能优于传统模型。

To confirm that the identified clinical and omics features have prognostic power with respect to the prediction of patients’ survival outcomes, a model should be both accurate (low prediction error) and precise (low prediction uncertainty). The simplest way to demonstrate the prognostic power is to dichotomize the prognostic scores [i.e. linear predictor Xiβ in the Cox model, Cox (1957)] by its median value, and then use a log-rank test to compare the survival probabilities of the patients in the two groups, see Fig. 2b. Similarly, one can categorize the prognostic scores by multiple quantiles (e.g. 25%, 50% median, and 75%) into multiple groups of patients and perform a log-rank test.
为了确认已确定的临床和组学特征在预测患者生存结果方面具有预后能力,模型应同时准确(低预测误差)和精确(低预测不确定性)。证明预后能力的最简单方法是将预后评分 [即 Cox 模型中的线性预测因子 Xiβ,Cox (1957)] 按其中位数,然后使用对数秩检验来比较两组中患者的生存概率,见图 2b.同样,可以按多个分位数(例如 25%、50% 中位数和 75%)将预后评分分类为多组患者,并执行对数秩检验。

To validate a prediction model systematically (Rahman et al. 2017, Royston and Altman 2013), the predictive performance of the model is commonly addressed by
为了系统地验证预测模型(Rahman 等人,2017 年,Royston 和 Altman 2013 年),模型的预测性能通常由

  • discrimination: the ability of the model to distinguish between low and high risk patients,
    鉴别力:模型区分低风险和高风险患者的能力,

  • calibration: the agreement between the observed outcomes and predicted survival probabilities, and
    校准:观察到的结果与预测的生存概率之间的一致性,以及

  • overall performance: the distance between the observed and predicted survival probabilities.
    总体性能:观测到的生存概率和预测的生存概率之间的距离。

5.2.1 Discrimination performance
5.2.1 鉴别表现

If one focuses on survival prediction at a fixed time point (e.g. 5-year survival probability), a receiver operating characteristic (ROC) curve can be used to evaluate the prognostic (i.e. prediction or discrimination of survival) ability of the survival model, often summarized by its area under the ROC curve (AUC) (Heagerty et al. 2000), see Fig. 7a for an example. An AUC of 0.5 is equivalent to the toss of a coin, and the closer the AUC is to 1, the more predictive is the model. When making predictions at multiple time points, ROC curves can be summarized as time-dependent AUC scores, i.e. AUC scores calculated at prespecified time points. Alternatively, the concordance index (C-index) provides a more global, time-independent assessment of the discrimination ability of a prognostic model, such that a better model predicts higher prognostic scores for patients with shorter survival times (Harrell et al. 1982, Antolini et al. 2005), i.e.
如果专注于固定时间点的生存预测(例如 5 年生存概率),则可以使用受试者工作特征 (ROC) 曲线来评估生存模型的预后(即预测或鉴别生存)能力,通常用其 ROC 曲线下面积 (AUC) 来总结(Heagerty 等人,2000 年),见图 7a例如。AUC 为 0.5 相当于抛硬币,AUC 越接近 1,模型的预测性就越强。当在多个时间点进行预测时,ROC 曲线可以概括为时间依赖性的 AUC 分数,即在预先指定的时间点计算的 AUC 分数。或者,一致性指数 (C-index) 对预后模型的鉴别能力提供了更全面的、时间独立的评估,因此更好的模型对生存时间较短的患者预测了更高的预后评分(Harrell 等人,1982 年,Antolini 等人,2005 年),即
which means in the absence of censoring, any pair of individuals {i,j} with survival times Ti<Tj is concordant if and only if S(t|Xi(t))<S(t|Xj(t)) for any t (equivalent to ranking the prognostic scores Xiβ>Xjβ in a Cox model), where t denotes the time instants where there are covariate variations. The C-index can be expressed as a weighted average of the time-dependent AUC over time (Heagerty and Zheng 2005). Therefore, its interpretation is similar to the AUC, where a C-index of 0.5 indicates random predictions, while a perfect prognostic model would have a C-index of 1. There are multiple types of C-indices for survival modeling, in particular the most frequently used Harrell’s (Harrell et al. 1982) and Uno’s C-index (Uno et al. 2011). Uno’s C-index is more robust than Harrell’s C-index, in case there is dependence of the concordance on the study-specific censoring distribution.
这意味着在没有删失的情况下 t当且仅当 St|Xit<St|Xjt 对于任何 t(相当于在 Cox 模型中对预后评分 Xiβ>Xjβ 进行排名),其中 t表示存在协变量变化的时间时刻。C 指数可以表示为随时间变化的 AUC 随时间变化的加权平均值(Heagerty 和 Zheng 2005)。因此,它的解释类似于 AUC,其中 C 指数为 0.5 表示随机预测,而完美的预后模型的 C 指数为 1。用于生存建模的 C 指数有多种类型,特别是最常用的 Harrell 指数 (Harrell et al. 1982) 和 Uno 的 C 指数 (Uno et al. 2011)。Uno 的 C-index 比 Harrell 的 C-index 更稳健,以防一致性依赖于研究特定的删失分布。

In the classical Cox modeling framework, both Harrell’s and Uno’s C-indices only depend on the linear predictors Xiβ, which is independent of t. But if a model includes covariates with time-dependent effects β(t) and/or time-dependent covariates Xi(t), Harrell’s and Uno’s C-indices are difficult to be calculated, since they require the calculation of survival functions for each individual over time. In this context, Antolini et al. (2005) proposed a time-dependent C-index, which assesses the concordance of a model’s survival distribution predictions S^(t). This means that Antolini’s C-index requires the full specification of S^(t), even though a C-index only compares the survival probabilities between any pair of individuals, that is, it only assesses whether the relative order of estimated survival probabilities is concordant with observed survival outcomes (Blanche et al. 2019). Time-dependent prediction indices can better evaluate a model including candidate features with time-dependent effects and/or time-dependent features. To avoid C-hacking among different C-indices in model comparison, Sonabend et al. (2022) recommended that if all models make survival distribution predictions, then select a time-dependent C-index; otherwise choose a time-independent measure (e.g. Uno’s C-index); if there is a combination of risk- and distribution-predicting models, then choose a transformation method for analysis (e.g. expected mortality).
在经典的 Cox 建模框架中,Harrell 和 Uno 的 C 指数都只依赖于线性预测变量 Xiβ,它与 t 无关。但是,如果模型包括具有时间依赖效应 βt 和/或时间依赖协变量 Xit 的协变量,则 Harrell 和 Uno 的 C 指数很难计算,因为它们需要计算每个个体随时间变化的生存函数。在这种情况下,Antolini 等人 (2005) 提出了一个时间依赖性的 C 指数,它评估模型的生存分布预测 S^t 的一致性。这意味着 Antolin 的 C 指数需要 S^t 的完整规格,即使 C 指数仅比较任何一对个体之间的生存概率,也就是说,它只评估估计生存概率的相对顺序是否与观察到的生存结果一致(Blanche 等人,2019 年).时间依赖性预测指数可以更好地评估模型,包括具有时间依赖效应的候选特征和/或时间依赖性特征。为了避免在模型比较中不同 C 指数之间的 C-hacking,Sonabend 等人。 (2022) 建议,如果所有模型都进行生存分布预测,则选择瞬态 C 指数;否则选择与时间无关的度量(例如 Uno 的 C-index);如果存在风险预测模型和分布预测模型的组合,则选择一种转换方法进行分析(例如预期死亡率)。

5.2.2 Calibration performance
5.2.2 校准性能

Calibration is to quantify the agreement between the observed and predicted outcomes, which is useful for both internal and external model validation and is recommended to report routinely. The calibration slope is commonly used (van Houwelingen 2000), which is the slope of the regression of the observed/actual survival probabilities on the model-predicted survival probabilities. A survival model can be reported with the estimated t-year survival probabilities in predefined subgroups, denoted as Smodel(t|g) for subgroups g=1,,G. The observed t-year survival probabilities in the subgroups can be estimated by the KM method, denoted as SKM(t|g). Using the ln(ln(·))-link, the calibration model is
校准是为了量化观察到的结果和预测的结果之间的一致性,这对于内部和外部模型验证都很有用,建议定期报告。通常使用校准斜率 (van Houwelingen 2000),它是观测/实际生存概率对模型预测生存概率的回归斜率。生存模型可以用预定义的子组中估计的 t 年生存概率来报告,表示为 S模型t|g 对于子组 g=1...G.在亚组中观察到的 t 年生存概率可以通过 KM 方法估计,表示为 SKM(t|g使用 lnln·)-link,则校准模型为
(6)
where ϵ is an error term. If the intercept α=0 and the slope β=1, it means that the survival prediction model is well calibrated. For example, Fig. 7b shows a calibration plot, visualizing the calibration of the estimated 5-year survival probabilities (with 95% confidence interval by bootstrapping) using the KM method for TCGA-BRCA patients grouped by the quartiles of Cox-model predicted survival probabilities. Furthermore, one can calibrate a Cox model in terms of the baseline cumulative hazard and prognostic score. For nonproportional hazard models, calibration using the model cumulative hazard function can be considered (van Houwelingen 2000).
其中 ϵ 是错误项。如果截距 α=0 且斜率 β=1,则表示生存预测模型校准良好。例如,图 7b 显示了一个校准图,可视化了使用 KM 方法对 TCGA-BRCA 患者的估计 5 年生存概率(通过自举法获得 95% 置信区间)的校准,按 Cox 模型预测生存概率的四分位数分组。此外,人们可以根据基线累积风险和预后评分来校准 Cox 模型。对于非比例风险模型,可以考虑使用模型累积风险函数进行校准 (van Houwelingen 2000)。

As an alternative to the calibration slope at a single time point, Andres et al. (2018) and Haider et al. (2020) suggested the distributional calibration (D-Calibration) for accounting survival probabilities across all time points. This can be useful when assessing the entire post-treatment survival prediction, e.g. assessing post liver transplantation survival utility in Andres et al. (2018).
作为单个时间点校准斜率的替代方案,Andres 等人 (2018)Haider 等人 (2020) 提出了分布校准 (D-Calibration) 来计算所有时间点的生存概率。这在评估整个治疗后生存预测时很有用,例如 Andres 等人 (2018) 评估肝移植后生存效用。

5.2.3 Overall performance
5.2.3 整体性能

Scoring rules can evaluate the accuracy and confidence of probabilistic predictions, and assess both discrimination and calibration (Gneiting and Raftery 2007, Avati et al. 2020). The idea of scoring rules dates back to Brier (1950) which assigned a numerical score for verifying ensemble-based probabilistic predictions of discrete outcomes.
评分规则可以评估概率预测的准确性和置信度,并评估判别和校准(Gneiting 和 Raftery 2007,Avati 等人,2020 年)。评分规则的想法可以追溯到 Brier (1950),他分配了一个数字分数来验证基于集成的离散结果的概率预测。

Graf et al. (1999) proposed the time-dependent Brier score, which is the expected mean-squared error of survival probability prediction at different time points, i.e.
Graf 等人 (1999) 提出了时间依赖性 Brier 评分,它是不同时间点生存概率预测的预期均方误差,即
where ti is the survival time of ith individual, S^(t|Xi) is the Cox-model predicted survival probability and G^(t) is the KM estimate of the censoring distribution. The benefit of the Brier score is that it does not only measure discrimination, similar to evaluation measures like the C-index, but also calibration performance of a survival model. The integrated Brier score (IBS) is as a single measure of prediction accuracy integrating BS(t) over an entire follow-up time period. Hielscher et al. (2010) presented a comparison between the IBS and a D measure (Schemper and Henderson 2000), which is an integrated measure based on the mean absolute deviation rather than the mean-squared error used in IBS. The D measure is more robust toward extreme observations and has a smaller variance than the IBS.
其中 ti 是第 i个个体的生存 ti me,S^t|Xi 是 Cox 模型预测的生存概率,G^t 是删失分布的 KM 估计值。Brier 评分的好处是,它不仅可以测量辨别度(类似于 C 指数等评估指标),还可以测量生存模型的校准性能。综合 Brier 评分 (IBS) 是在整个随访时间内整合 BS(t) 的预测准确性的单一衡量标准。Hielscher 等人(2010 年)提出了 IBS 与 D 度量(Schemper 和 Henderson 2000)之间的比较,D 度量是一种基于平均绝对偏差而不是 IBS 中使用的均方误差的综合度量。D 测度对极端观测值更稳健,并且比 IBS 具有更小的方差。

To overcome potential overfitting when using feature selection and model estimation, the survival predictions and model calibration should be evaluated in independent validation datasets. As independent validation data are seldom available, we can split the available data into training and validation datasets. However, any single split of the data hides partial data from all steps of model building, which might introduce bias, it is recommended to use resampling-based methods for assessing the survival model’s performance. In addition, using resampling-based methods also allow us to estimate the uncertainty of the performance estimator (Sill et al. 2014; Gerds and Kattan 2021, chapter 7). This can be done for example by repeatedly splitting the data to training and validation sets, and evaluating a survival model’s performance on the different validation sets using various discrimination or calibration indices. The 0.632+ bootstrap estimator for a discrimination or calibration index can balance the apparent (training) error and the average out-of-bag bootstrap error, and in addition accounts for the relative overfitting based on a no-information error rate in high-dimensional settings (Schumacher et al. 2007, Binder and Schumacher 2008a). This is a typical machine learning approach with two levels of resampling. The outer layer of resampling is to evaluate the prediction performance and the inner layer of resampling (usually CV) is to optimize model’s tuning parameters. We note that the preparatory steps, such as multi-modal data standardization and feature pre-selection in the context of high-dimensional input data, may affect the survival prediction performance, and thus should be included in the resampling steps for model validation.
为了克服使用特征选择和模型估计时可能出现的过拟合,应在独立的验证数据集中评估生存预测和模型校准。由于很少有独立的验证数据,我们可以将可用数据拆分为训练数据集和验证数据集。但是,数据的任何一次拆分都会隐藏模型构建所有步骤的部分数据,这可能会引入偏差,建议使用基于重采样的方法来评估生存模型的性能。此外,使用基于重采样的方法还允许我们估计性能估算器的不确定性(Sill 等人,2014 年;Gerds 和 Kattan 2021 年,第 7 章)。例如,可以通过将数据重复拆分为训练集和验证集,并使用各种判别或校准指数评估生存模型在不同验证集上的性能来实现。判别或校准指数的 0.632+ bootstrap 估计器可以平衡表观(训练)误差和平均袋外 bootstrap 误差,此外还可以解释基于高维设置中无信息误差率的相对过拟合(Schumacher et al. 2007Binder and Schumacher 2008a).这是一种典型的机器学习方法,具有两个级别的重新采样。重采样的外层是评估预测性能,重采样的内层(通常是 CV)是优化模型的调优参数。 我们注意到,在高维输入数据背景下的多模态数据标准化和特征预选等准备步骤可能会影响生存预测性能,因此应包含在模型验证的重采样步骤中。

5.2.4 Graphical representation
5.2.4 图形表示

After confirming that a model is valid (assumptions hold), accurate (low prediction error), precise (uncertainty of performance measures properly quantified) and its predictions generalizable beyond the training dataset (using independent validation data if available), a prognostic nomogram (Kattan et al. 1999) can be used to summarize the prognostic effect of the identified clinical and omics features on the risk of a specific year’s overall survival (Fig. 6), which may help the clinicians to enhance the patient management and personalized treatment strategies. For example, the red colored dots in Fig. 6 show the information of the identified five variables from an example patient and the corresponding scoring points. The summed scoring points of 263 maps to the predicted 1-year, 3-year and 5-year survival probabilities of this patient. Note that most nomograms treat the identified variables independently in the risk calculation, even though there may be significant interactions among the model features that were used in the feature selection step. However, visualizing such interaction effects would make the nomograms less accessible and interpretable, and so, there is still a room for improvement in how to translate the multivariate risk scores into clinical practice.
在确认模型有效(假设成立)、准确(低预测误差)、精确(正确量化性能测量的不确定性)及其预测可在训练数据集之外推广(使用独立的验证数据,如果可用)后,可以使用预后列线图(Kattan 等人,1999 年)来总结已确定的临床和组学特征对特定年份总生存期风险的预后影响(图 6),这可能有助于临床医生加强患者管理和个性化治疗策略。例如,图 6 中的红色点显示了从示例患者中识别出的 5 个变量的信息以及相应的评分点。263 的总得分点映射到该患者预测的 1 年、3 年和 5 年生存概率。请注意,大多数列线图在风险计算中独立处理已识别的变量,即使在特征选择步骤中使用的模型特征之间可能存在显著的交互。然而,可视化这种交互效应会使列线图更难获得和解释,因此,在如何将多变量风险评分转化为临床实践方面仍有改进的空间。

Nomogram developed to estimate the overall survival probability for TCGA-BRCA patients based on clinical (age) demographic (ethnicity) and three selected mRNA features form a Lasso Cox model. The dot symbols represent example patient’s information and the triangle symbols represent predicted probabilities of 1-, 3-, and 5-year survival.
Figure 6.

Nomogram developed to estimate the overall survival probability for TCGA-BRCA patients based on clinical (age) demographic (ethnicity) and three selected mRNA features form a Lasso Cox model. The dot symbols represent example patient’s information and the triangle symbols represent predicted probabilities of 1-, 3-, and 5-year survival.


图 6.

开发列线图以根据临床 (年龄) 人口统计学 (种族) 估计 TCGA-BRCA 患者的总体生存概率,并且三个选定的 mRNA 特征形成 Lasso Cox 模型。点符号表示示例患者的信息,三角形符号表示 1 年、3 年和 5 年生存的预测概率。

When an independent validation dataset is available, it is recommended to report a calibration plot corresponding to the nomogram. Using independent validation data to obtain Smodel(t|g) in the calibration model (6) is for the generalization capacity of the model. Since we here do not have independent validation data besides TCGA-BRCA data, Fig. 7 shows an example calibration plot at 5-year survival evaluation time point based on the built Cox model in Fig. 6 for a split 20% TCGA-BRCA dataset.
当有独立的验证数据集可用时,建议报告与列线图对应的校准图。使用独立验证数据得到 S模型t|g 在校准模型 (6) 中是针对模型的泛化能力。由于除了 TCGA-BRCA 数据之外,我们这里没有独立的验证数据,因此图 7 显示了基于图 6 中构建的 Cox 模型的 5 年生存评估时间点的示例校准图, 用于拆分的 20% TCGA-BRCA 数据集。

Discrimination and calibration of survival prediction for 20% TCGA-BRCA validation data. TCGA-BRCA data were split to a 80% training set and a 20% validation set. (a) Receiver operating characteristic (ROC) curve estimated at 5-year survival evaluation time point based on the Cox model in Fig. 6. The AUC score is the area under the ROC curve. The 45-degree line represents the performance of a random prediction of the outcome event with AUC = 0.5. (b) Calibration plot estimating 5-year survival probabilities in TCGA-BRCA patients grouped by the quartiles of the predicted probabilities based on the Cox model in Fig. 6. The actual 5-year survival probabilities in each group were estimated by the Kaplan–Meier (KM) method with a 95% confidence interval by bootstrapping. The cross symbol indicates a bias-corrected KM estimate. The predicted survival probability of each quartile group was the mean of the 5-year survival probabilities based on the Cox model for the corresponding group.
Figure 7.

Discrimination and calibration of survival prediction for 20% TCGA-BRCA validation data. TCGA-BRCA data were split to a 80% training set and a 20% validation set. (a) Receiver operating characteristic (ROC) curve estimated at 5-year survival evaluation time point based on the Cox model in Fig. 6. The AUC score is the area under the ROC curve. The 45-degree line represents the performance of a random prediction of the outcome event with AUC = 0.5. (b) Calibration plot estimating 5-year survival probabilities in TCGA-BRCA patients grouped by the quartiles of the predicted probabilities based on the Cox model in Fig. 6. The actual 5-year survival probabilities in each group were estimated by the Kaplan–Meier (KM) method with a 95% confidence interval by bootstrapping. The cross symbol indicates a bias-corrected KM estimate. The predicted survival probability of each quartile group was the mean of the 5-year survival probabilities based on the Cox model for the corresponding group.


图 7.
20% TCGA-BRCA 验证数据的生存预测的区分和校准。TCGA-BRCA 数据被分成 80% 的训练集和 20% 的验证集。(a) 根据图 6 中的 Cox 模型估计的 5 年生存评估时间点的受试者工作特征 (ROC) 曲线。AUC 分数是 ROC 曲线下的面积。45 度线表示 AUC = 0.5 时结果事件的随机预测性能。(b) 估计 TCGA-BRCA 患者 5 年生存概率的校准图,按基于 6 中 Cox 模型的预测概率的四分位数分组。每组的实际 5 年生存概率通过 Kaplan-Meier (KM) 方法估计,通过自举法估计 95% 置信区间。叉号表示偏差校正的 KM 估计值。每个四分位数组的预测生存概率是基于相应组的 Cox 模型的 5 年生存概率的平均值。

6 Beyond penalized and Bayesian Cox models
6 超越罚函数模型和贝叶斯 Cox 模型

In this tutorial, we mainly focused on penalized regressions and Bayesian hierarchical models in the Cox proportional hazards framework. One can extend this framework in several ways. For instance, one can stay in a likelihood-based modeling framework, but replace the partial likelihood function of the semi-parametric Cox model by alternative likelihood functions (which do not necessarily need to imply proportional hazards), e.g. parametric survival models like exponential, Weibull, or accelerated failure time (AFT) models, or Aalen’s additive hazard model (Gorst-Rasmussen and Scheike 2012). Alternatively, one can move to a more algorithmic machine learning approach, such as tree-based boosting or bagging methods, e.g. random survival forests (Hothorn et al. 2006, Binder and Schumacher 2008b, Jaeger et al. 2019), or (deep) neural networks (Wiegrebe et al. 2024).
在本教程中,我们主要关注 Cox 比例风险框架中的惩罚回归和贝叶斯分层模型。可以通过多种方式扩展此框架。例如,可以停留在基于似然的建模框架中,但用替代似然函数(不一定需要暗示比例风险)替换半参数 Cox 模型的偏似然函数,例如指数、Weibull 或加速失效时间 (AFT) 模型等参数生存模型,或 Aalen 的加性风险模型(Gorst-Rasmussen 和 Scheike 2012).或者,人们可以转向更具算法性的机器学习方法,例如基于树的提升或装袋方法,例如随机生存森林(Hothorn 等人,2006 年,Binder 和 Schumacher 2008b,Jaeger 等人,2019 年)或(深度)神经网络(Wiegrebe 等人,2024 年)。

Hothorn et al. (2006) introduced ensemble tree methods for analyzing right-censored survival data, which construct ensembles from base learners, e.g. binary survival trees for each omics feature. Hothorn et al. (2006) also proposed a gradient boosting algorithm to predict the survival time of patients with acute myeloid leukemia (AML), based on clinical and omics features. Similarly, Binder and Schumacher (2008b) developed a likelihood-based boosting method, which aims to maximize the Cox partial likelihood function, for modeling time-to-event data based on high-dimensional omics input data and which also allows the inclusion of a small number of mandatory covariates. In general, one needs to be cautious if using some machine learning methods that are not well-suited for high-dimensional features. For example, Kvamme et al. (2019) proposed extensions of the Cox model with neural networks, which are only valid if the number of covariates is smaller than the number of samples, i.e. if p<n. A systematic review of deep learning for survival analysis, which includes a survey of methods suitable for high-dimensional data (p>n), is provided by Wiegrebe et al. (2024).
Hothorn et al. (2006) 引入了用于分析右删失生存数据的集合树方法,该方法从基本学习器构建集合,例如每个组学特征的二元生存树。Hothorn 等人 (2006) 还提出了一种梯度提升算法,以根据临床和组学特征预测急性髓性白血病 (AML) 患者的生存时间。同样,Binder 和 Schumacher (2008b) 开发了一种基于似然的提升方法,旨在最大化 Cox 偏似然函数,用于基于高维组学输入数据对事件发生时间数据进行建模,并且还允许包含少量的强制性协变量。一般来说,如果使用一些不太适合高维特征的机器学习方法,则需要谨慎。例如,Kvamme 等人(2019 年)提出了使用神经网络对 Cox 模型进行扩展,只有当协变量的数量小于样本的数量时,即如果 p<n.Wiegrebe 等人 (2024) 提供了对用于生存分析的深度学习的系统评价,其中包括对适用于高维数据 (p>n) 的方法的调查。

In the case of nonproportional hazards, many likelihood-based survival models beyond Cox-type models have also been extended to account for high-dimensional omics as input data. For example, Ma et al. (2006) combined Lin and Yin’s additive hazard model (Lin and Ying 1994) with principal component regression for dimension reduction of omics features, which was applied to the study of gene expression-based survival prediction for diffuse large B-cell lymphoma. Engler and Li (2009) added the Elastic Net penalty in an accelerated ATF model, which assumes that the effect of a covariate accelerates or decelerates the life course of patients. Schmid and Hothorn (2008) and Barnwal et al. (2022) used boosting algorithms to learn parametric AFT models. Ha et al. (2014) considered the Lasso, SCAD and a penalized h-likelihood for feature selection in frailty models which assume that individuals have unobserved heterogeneity captured by a latent random term Z, which adapts the Cox model (1) into h(t|X)=Zh0(t)exp(Xβ).
在非比例风险的情况下,除了 Cox 型模型之外,许多基于可能性的生存模型也被扩展为将高维组学作为输入数据。例如,等人(2006)将Lin和Yin的加性危害模型(Lin和Ying 1994)与组学特征降维的主成分回归相结合,该模型应用于弥漫性大B细胞淋巴瘤的基于基因表达的生存预测研究。Engler 和 Li (2009) 在加速 ATF 模型中添加了弹性网惩罚,该模型假设协变量的影响会加速或减慢患者的生命进程。Schmid 和 Hothorn (2008) 以及 Barnwal 等人 (2022) 使用提升算法来学习参数化 AFT 模型。Ha 等人(2014 年)考虑了虚弱模型中特征选择的套索、SCAD 和惩罚 h 似然,这些模型假设个体具有由潜在随机项 Z 捕获的未观察到的异质性,该模型将 Cox 模型 (1) 改编为 ht|X=Zh0t expXβ.

6.1 Advanced survival models: cure models, competing risks and multi-state models
6.1 高级生存模型:治愈模型、竞争风险和多状态模型

In some situations, survival data may be different from Fig. 1c (also Section 2.1), where it was presumed that all individuals will eventually experience the event of interest. Liu et al. (2012) studied the Lasso and SCAD feature selection for the proportional hazard mixture cure model, in which a certain fraction of individuals will never experience the event of interest. Tapak et al. (2015) investigated Lasso, Elastic Net and likelihood-based boosting for microarray-based survival modeling with competing risks, such as ‘progression’ versus ‘death from noncancer cause’, i.e. the event of a patient can occur due to one of multiple distinct causes. There is a growing awareness of the impact of competing risks when developing prognostic models with high-dimensional input data, e.g. Binder et al. (2009), Ambrogi and Scheike (2016) and Fu et al. (2017). For a single individual who can experience several possible events, Dutta et al. (2017) proposed a multi-state model to identify risk factors in different stages of disease based on high-dimensional input data.
在某些情况下,生存数据可能与图 1c(也是第 2.1 节)不同,其中假设所有个体最终都会经历感兴趣的事件。Liu et al. (2012) 研究了比例风险混合治愈模型的 Lasso 和 SCAD 特征选择,其中一部分人永远不会经历感兴趣的事件。Tapak 等人(2015 年)研究了套索、弹性网和基于可能性的提升,用于基于微阵列的生存建模,具有竞争风险,例如“进展”与“非癌症原因死亡”,即患者的事件可能是由于多种不同原因之一而发生的。在开发具有高维输入数据的预后模型时,人们越来越意识到竞争风险的影响,例如 Binder 人(2009 年)、Ambrogi 和 Scheike (2016 年)以及 Fu 等人(2017 年)。对于可能经历多种可能事件的单个个体,Dutta 等人(2017 年)提出了一种多状态模型,以基于高维输入数据识别疾病不同阶段的风险因素。

7 Toward single-cell data analysis
7 迈向单细胞数据分析

The cellular heterogeneity of complex sample mixtures pose challenges and also opportunities for precision medicine and survival prediction. For example, Zhou et al. (2019) showed that tumor microenvironment-related gene expression signatures do not only accurately predict the survival among colon cancer patients, but also serve as biomarkers for identifying patients who could potentially benefit from adjuvant chemotherapy. Single-cell technologies provide an unprecedented opportunity for dissecting the interplay between the cancer cells and the associated tumor microenvironment, and the produced high-dimensional data should also augment existing survival modeling approaches. The emerging single-cell atlases are providing a detailed and quantitative overview of tissue composition and organization, and will advance both biomedical research and clinical practice (Elmentaite et al. 2022).
复杂样品混合物的细胞异质性为精准医学和生存预测带来了挑战和机遇。例如,et al. (2019) 表明,肿瘤微环境相关的基因表达特征不仅可以准确预测结肠癌患者的生存率,还可以作为识别可能从辅助化疗中受益的患者的生物标志物。单细胞技术为剖析癌细胞与相关肿瘤微环境之间的相互作用提供了前所未有的机会,生成的高维数据也应增强现有的生存建模方法。新兴的单细胞图谱提供了组织组成和组织的详细和定量概述,并将推进生物医学研究和临床实践(Elmentaite 等人,2022 年)。

Currently, the focus of statistical model development for single-cell data analysis is to understand cell type composition and its impact on gene regulation and transcriptional dynamics, usually based on only a small number of samples/individuals. On the one hand, the underlying statistical models for single-cell data analysis are still in development and continuously being re-evaluated and challenged (Kharchenko 2021). On the other hand, survival analysis tackles disease and health contexts at the individual level, and usually requires a relatively large number of individuals for sufficient statistical power. In particular, the development of improved computational methods is urgently needed to enable the consideration of multiple layers (e.g. individual-level, cellular-level, molecular-level) of information when integrating groups of individuals and omics data from a variety of molecular modalities. Therefore, current approaches often map the findings from single-cell omics data to large-scale bulk sequencing omics and survival data, rather than jointly analyzing single-cell omics and survival data. For example, Guo et al. (2018) introduced a generalizable approach to first study T-cells in 14 nonsmall cell lung cancer (NSCLC) patients and to identify gene signatures from the tumor-enriched T cell clusters, and then investigated these gene signatures using bulk RNA-seq and survival data from larger TCGA-NSCLC patient cohort. Similarly, Zhang et al. (2020) developed a scRNA-seq-based approach to reconstruct a multilayer signaling network based on 16 glioma patients, and then investigated the network genes using survival data from TCGA Chinese glioma genome atlas (TCGA-CGGA) patients. However, direct joint analysis of survival and single-cell omics data from multiple cellular hierarchies requires further methodological developments and new statistical and machine learning methods.
目前,单细胞数据分析统计模型开发的重点是了解细胞类型组成及其对基因调控和转录动力学的影响,通常仅基于少量样本/个体。一方面,单细胞数据分析的基础统计模型仍在开发中,并不断受到重新评估和挑战 (Kharchenko 2021)。另一方面,生存分析在个体层面处理疾病和健康背景,通常需要相对大量的个体才能获得足够的统计能力。特别是,迫切需要开发改进的计算方法,以便在整合来自各种分子模式的个体群体和组学数据时能够考虑多层(例如个体水平、细胞水平、分子水平)的信息。因此,目前的方法往往将单细胞组学数据的结果映射到大规模批量测序组学和生存数据,而不是联合分析单细胞组学和生存数据。例如,Guo 等人 (2018) 引入了一种通用方法,首先研究 14 名非小细胞肺癌 (NSCLC) 患者的 T 细胞,并从富含肿瘤的 T 细胞簇中识别基因特征,然后使用大量 RNA-seq 和来自更大 TCGA-NSCLC 患者队列的生存数据来研究这些基因特征。同样,Zhang et al. (2020) 开发了一种基于 scRNA-seq 的方法,基于 16 名神经胶质瘤患者重建多层信号网络,然后使用来自 TCGA 中国神经胶质瘤基因组图谱 (TCGA-CGGA) 患者的生存数据研究网络基因。 然而,对来自多个细胞层次结构的存活和单细胞组学数据进行直接联合分析需要进一步的方法学发展以及新的统计和机器学习方法。

8 Discussion 8 讨论

Although survival analysis faces many modeling challenges, mainly due to censored outcomes, it represents a well-established methodology for finding risk factors associated with patients’ survival. The identification of omics biomarkers for survival prognosis may provide systematic means to guide patient management and personalized treatment and diagnostic strategies. In this tutorial, we provided a comprehensive workflow for survival analysis with high-dimensional omics and standard clinical data, with a specific focus on feature selection of survival-associated omics features and survival model validation. We covered many penalized regressions and Bayesian models for feature selection and survival prediction, accounting for their specific assumptions and applications. Examples of real data and R scripts have been made available to illustrate the use of different methods, which should help researchers to choose and apply suitable methods for their survival analysis applications (https://ocbe-uio.github.io/survomics). We note that this review only considers methods for right-censored time-to-event data, i.e. where all individuals are assumed to be followed continuously from the start of the study, but where the follow-up period might end before the event (e.g. death) was observed. Other types of censoring include interval censoring and left truncation, and appropriate statistical methods dealing with these censoring patterns should be chosen accordingly.
尽管生存分析面临许多建模挑战,主要是由于删失结果,但它代表了一种发现与患者生存相关的风险因素的成熟方法。确定组学生物标志物的生存预后可能提供系统性的手段来指导患者管理以及个性化治疗和诊断策略。在本教程中,我们提供了一个全面的生存分析工作流程,包括高维组学和标准临床数据,特别关注生存相关组学特征的特征选择和生存模型验证。我们介绍了许多用于特征选择和生存预测的惩罚回归和贝叶斯模型,并考虑了它们的具体假设和应用。已经提供了真实数据和 R 脚本的示例来说明不同方法的使用,这应该有助于研究人员为他们的生存分析应用程序 (https://ocbe-uio.github.io/survomics) 选择和应用合适的方法。我们注意到,本综述仅考虑了右删失事件发生时间数据的方法,即假设所有个体从研究开始就被连续跟踪,但随访期可能在观察到事件(例如死亡)之前结束。其他类型的删失包括区间删失和左截断,应相应地选择处理这些删失模式的适当统计方法。

Most of the current methods for survival analysis do not explicitly take into account the complex structures within and between multi-omics data, such as gene regulation and DNA-protein interactions. Regulatory networks constructed either based on prior biological knowledge or using data-driven, yet biologically explainable approaches, may help establish useful methodologies for survival analysis that are more effective for deriving biological insights as well as enable improved clinical translation. However, to achieve a comprehensive and biologically meaningful integration of high-dimensional multi-omics data, there is a need for continued development of computational and statistical approaches that consider both technical and biological intricacies of the data and technologies, respectively (Wissel et al. 2023). This is currently a very active research field, and we expect to see many improved multi-omics methods for survival prediction in the future.
目前的大多数生存分析方法都没有明确考虑多组学数据内部和之间的复杂结构,例如基因调控和 DNA-蛋白质相互作用。基于先前的生物学知识或使用数据驱动但生物学上可解释的方法构建的监管网络可能有助于建立有用的生存分析方法,这些方法更有效地获得生物学见解并能够改进临床转化。然而,为了实现高维多组学数据的全面和生物学意义的整合,需要继续开发计算和统计方法,分别考虑数据和技术的技术和生物学复杂性(Wissel 等人,2023 年)。这是目前一个非常活跃的研究领域,我们预计未来会看到许多改进的多组学生存预测方法。

Another limitation of most of the reviewed methods is that they identify omics features prognostic of survival, but they cannot determine causal relationships. Causality is a fundamental notion to understand omics features causing disease progression, which will allow one to reliably intervene omics features for targeted therapies. There are two popular causal inference models, Pearl’s structural causal model (SCM) and Rubin’s causal model (RCM), both of which introduce perturbations to draw causal inference. Farooq et al. (2023) utilized SCM-based causal discovery approaches to unravel relationships between omics features and survival of breast cancer patients. However, to identify reliable causal relations for clinical applications, laboratory-based experiments, e.g. clustered regularly interspaced short palindromic repeats (CRISPR) techniques (Wang and Doudna 2023), are often necessary to verify the functional relevance of the identified omics features. High-dimensional RCM-based mediation analysis has been used to investigate the indirect effect transmitted by omics features between an exposure and survival outcomes (Song et al. 2020, 2021). Causal mediation analysis is an important tool, which considers the problem of decomposing the causal effect of treatment/exposure into direct and indirect effects (Lange and Hansen 2011, VanderWeele 2011). The direct effect corresponds to the effect of a treatment directly on the survival outcome, while an indirect effect corresponds to the effect of a treatment on the outcome that is due to its effect on an intermediate variable (e.g. gene expression) that also has a causal effect on the survival outcome. Targeted learning also fills a much needed gap between statistical modeling and causal inference (van der Laan and Rose 2011, 2018). Tuglus and van der Laan (2011) used targeted maximum likelihood estimation to provide an interpretable causal measure of variable importance for the discovery of biomarkers and omics features. Another way to formalize personalized medicine is dynamic treatment regimes (Chakraborty and Moodie 2013, Tsiatis et al. 2019, Deliu et al. 2023) that encompasses causal inference and takes into account for the variability in omics, environment and lifestyle factors for each individual to improve the treatment of a particular patient. However, all the causal machine learning methods require further methodological developments for adaptation to survival modeling with high-dimensional input data.
大多数综述方法的另一个局限性是它们识别了生存预后的组学特征,但它们无法确定因果关系。因果关系是理解导致疾病进展的组学特征的基本概念,这将使人们能够可靠地干预组学特征以进行靶向治疗。有两种流行的因果推理模型,Pearl 的结构因果模型 (SCM) 和 Rubin 的因果模型 (RCM),这两种模型都引入了扰动来得出因果推理。Farooq 等人(2023 年)利用基于 SCM 的因果发现方法来揭示组学特征与乳腺癌患者生存之间的关系。然而,为了确定临床应用的可靠因果关系,通常需要基于实验室的实验,例如成簇的规则间隔短回文重复序列 (CRISPR) 技术 (Wang 和 Doudna 2023),以验证已确定的组学特征的功能相关性。基于高维 RCM 的中介分析已被用于研究组学特征在暴露和生存结果之间传递的间接影响(Song 等人,2020 年,2021 年)。因果中介分析是一个重要的工具,它考虑了将治疗/暴露的因果效应分解为直接和间接效应的问题(Lange 和 Hansen 2011,VanderWeele 2011)。直接效应对应于治疗对生存结果的直接影响,而间接效应对应于治疗对结果的影响,这是由于其对中间变量(例如 基因表达)的影响,该变量也对生存结果有因果影响。 有针对性的学习还填补了统计建模和因果推理之间急需的空白(van der Laan 和 Rose 2011,2018)。 Tuglus 和 van der Laan (2011) 使用目标最大似然估计为生物标志物和组学特征的发现提供了可变重要性的可解释因果测量。另一种将个性化医疗正规化的方法是动态治疗方案(Chakraborty 和 Moodie 2013,Tsiatis 等人,2019 年,Deliu 等人,2023 年),它包含因果推断,并考虑到每个人的组学、环境和生活方式因素的可变性,以改善对特定患者的治疗。然而,所有因果机器学习方法都需要进一步的方法论发展,以适应使用高维输入数据的生存建模。

Acknowledgements 确认

The authors thank Professor Ørnulf Borgan for his valuable suggestions and comments.
作者感谢 Ørnulf Borgan 教授的宝贵建议和评论。

Conflict of interest 利益冲突

None declared. 没有人宣布。

Funding 资金

This work was supported by grants from Helse Sør-Øst [2020026 to T.A.], the Norwegian Cancer Society [216104 to T.A.], the Radium Hospital Foundation, the Academy of Finland [326238, 340141, 344698, 345803 to T.A.], Cancer Society of Finland [to T.A.], the European Union’s Horizon 2020 research and innovation programme [‘PANCAIM’ 101016851 to J.Z. and T.A., ‘RESCUER’ 847912 to M.Z.], the Innovative Medicines Initiative 2 Joint Undertaking of the European Union’s Horizon 2020 research and innovation programme and EFPIA and JDRF INTERNATIONAL [‘imSAVAR’ 853988 to M.Z.], and ERA PerMed under the ERA-NET Cofund scheme of the European Union’s Horizon 2020 research and innovation framework programme [‘SYMMETRY’ 342752 to M.Z.].
这项工作得到了 Helse Sør-Øst [2020026 to TA]、挪威癌症协会 [216104 to TA]、镭医院基金会、芬兰科学院 [326238、340141、344698、345803 to TA]、芬兰癌症协会 [to TA]、欧盟地平线 2020 研究和创新计划 ['PANCAIM' 101016851 to J.Z. 和 T.A., “RESCUEER”847912 M.Z.]、欧盟 Horizon 2020 研究和创新计划的创新药物倡议 2 联合项目以及 EFPIA 和 JDRF INTERNATIONAL ['imSAVAR' 853988 M.Z.],以及欧盟 Horizon 2020 研究和创新框架计划的 ERA-NET 共同基金计划下的 ERA PerMed ['SYMMETRY' 342752 M.Z.]。

Data availability 数据可用性

Supplementary step-by-step R tutorial is available online at https://ocbe-uio.github.io/survomics. TCGA data are publicly available at https://portal.gdc.cancer.gov.
补充的分步 R 教程可在 https://ocbe-uio.github.io/survomics 在线获取。TCGA 数据在 https://portal.gdc.cancer.gov 上公开提供。

References 引用

Aittokallio
T.
Dealing with missing values in large-scale studies: microarray data imputation and beyond
.
Brief Bioinform
2009
;
11
:
253
64
.
Aittokallio
T.
处理大规模研究中的缺失值:微阵列数据插补及其他
简介 Bioinform
2009
;
11
253-64

Akbani
R
,
Ng
PKS
,
Werner
HMJ
et al. 
A pan-cancer proteomic perspective on the cancer genome atlas
.
Nat Commun
2014
;
5
:
3887
.
Akbani
R
Ng
PKS
Werner
HMJ
等人。
癌症基因组图谱的泛癌蛋白质组学观点
Nat Commun
2014
年;
5
3887
页。

Ambrogi
F
,
Scheike
TH.
Penalized estimation for competing risks regression with applications to high-dimensional covariates
.
Biostatistics
2016
;
17
:
708
21
.
安布罗吉
F
谢克
TH。
应用于高维协变量的竞争风险回归的惩罚估计
生物统计学
2016
;
17
708-21

Andres
A
,
Montano-Loza
A
,
Greiner
R
et al. 
A novel learning algorithm to predict individual survival after liver transplantation for primary sclerosing cholangitis
.
PLoS One
2018
;
13
:
e0193523
.
Andres
A
Montano-Loza
A
Greiner
R
等人。
一种新颖的学习算法,用于预测原发性硬化性胆管炎肝移植后个体的生存率
公共科学图书馆一号
2018
;
13
e0193523
.

Antolini
L
,
Boracchi
P
,
Biganzoli
E.
A time-dependent discrimination index for survival data
.
Stat Med
2005
;
24
:
3927
44
.
Antolini
L
Boracchi
P
Biganzoli
E.A
生存数据的时间依赖性判别指数
统计医学
2005
;
24
3927-44

Avati
A
,
Duan
T
,
Zhou
S
et al.  Countdown regression: sharp and calibrated survival predictions. In:
Adams
RP
,
Gogate
V
(eds.), Proceedings of The 35th Uncertainty in Artificial Intelligence Conference, volume 115 of Proceedings of Machine Learning Research (PMLR). Tel Aviv, Israel: ML Research Press.
2020
,
145
55
.
Avati
A
T
S
等人。 倒计时回归:敏锐且校准的生存预测。在:
Adams
RP
Gogate
V
(编辑),第 35 届人工智能不确定性会议论文集,机器学习研究论文集 (PMLR) 第 115 卷。以色列特拉维夫:ML 研究出版社。
2020
年,
145-55

Bair
E
,
Tibshirani
R.
Semi-supervised methods to predict patient survival from gene expression data
.
PLoS Biol
2004
;
2
:
e108
.
Bair
E
Tibshirani
R.
从基因表达数据预测患者生存的半监督方法
PLoS 生物学
2004
;
2
e108
.

Barbieri
MM
,
Berger
JO.
Optimal predictive model selection
.
Ann. Stat
2004
;
32
:
870
97
.
巴比里
MM,
伯杰
JO。
最优预测模型选择
Ann. Stat
2004
;
32
870-97

Barnwal
A
,
Cho
H
,
Hocking
T.
Survival regression with accelerated failure time model in XGBoost
.
J. Comput. Graph Stat
2022
;
31
:
1292
302
.
Barnwal
A
Cho
H
Hocking
T.XGBoost
中加速失效时间模型的生存回归
J. 计算。图形统计
2022
;
31
1292
302

Bartel
DP.
Metazoan MicroRNAs
.
Cell
2018
;
173
:
20
51
.
巴特尔
DP。
后生动物 MicroRNA
细胞
2018
;
173
20-51

Binder
H
,
Allignol
A
,
Schumacher
M
et al. 
Boosting for high-dimensional time-to-event data with competing risks
.
Bioinformatics
2009
;
25
:
890
6
.
Binder
H
Allignol
A
Schumacher
M
等人。
提升具有竞争风险的高维事件发生时间数据
生物信息学
2009
;
25
890-6

Binder
H
,
Schumacher
M.
Adapting prediction error estimates for biased complexity selection in high-dimensional bootstrap samples
.
Stat Appl Genet Mol
2008a
;
7
:
1
.
Binder
H
Schumacher
M.
调整高维 bootstrap 样本中偏倚复杂性选择的预测误差估计
统计应用基因分子
2008a
;
7
1

Binder
H
,
Schumacher
M.
Allowing for mandatory covariates in boosting estimation of sparse high-dimensional survival models
.
BMC Bioinformatics
2008b
;
9
:
14
.
Binder
H
Schumacher
M.
允许强制性协变量促进稀疏高维生存模型的估计
BMC 生物信息学
2008b
;
9
14

Blanche
P
,
Kattan
MW
,
Gerds
TA.
The c-index is not proper for the evaluation of t-year predicted risks
.
Biostatistics
2019
;
20
:
347
57
.
Blanche
P
Kattan
MW、
Gerds
TA。
c 指数不适合用于评估 t 年预测风险
生物统计学
2019
;
20
347-57

Bommert
A
,
Welchowski
T
,
Schmid
M
et al. 
Benchmark of filter methods for feature selection in high-dimensional gene expression survival data
.
Brief. Bioinform
2022
;
23
:
bbab354
.
Bommert
A
Welchowski
T
Schmid
M
等人。
高维基因表达存活数据中特征选择的过滤方法的基准
短。生物通知
2022
;
23
bbab354

Bøvelstad
HM
,
Nygård
S
,
Størvold
HL
et al. 
Predicting survival from microarray data – a comparative study
.
Bioinformatics
2007
;
23
:
2080
7
.
Bøvelstad
HM
Nygård
S
Størvold
HL
et al.
从微阵列数据预测生存 – 一项比较研究
生物信息学
2007
;
23
2080-7

Bøvelstad
HM
,
Nygård
S
,
Borgan
O.
Survival prediction from clinico-genomic models – a comparative study
.
BMC Bioinformatics
2009
;
10
:
413
.
Bøvelstad
HM
Nygård
S
Borgan
O.
临床基因组模型的生存预测——一项比较研究
BMC 生物信息学
2009
;
10
413
.

Box
GEP
,
Cox
DR.
An analysis of transformations
.
J R Stat Soc B Met
1964
;
26
:
211
52
.
Box
GEP
Cox
DR.
转换分析
JR Stat Soc B Met
1964
年;
26
211-52

Bradburn
MJ
,
Clark
TG
,
Love
SB
et al. 
Survival analysis part II: multivariate data analysis – an introduction to concepts and methods
.
Br J Cancer
2003
;
89
:
431
6
.
Bradburn
MJ
Clark
TG
Love
SB
等人。
生存分析第二部分:多变量数据分析 – 概念和方法简介
Br J 癌症
2003
;
89
431-6

Brier
GW.
Verification of forecasts expressed in terms of probability
.
Mon Weather Rev
1950
;
78
:
1
3
.
布赖尔
GW。
验证以概率表示的预测
Mon Weather Rev
1950
;
78
1-3

Cairns
RA
,
Harris
IS
,
Mak
TW.
Regulation of cancer cell metabolism
.
Nat Rev Cancer
2011
;
11
:
85
95
.
凯恩斯
RA
哈里斯
IS,
TW.
癌细胞代谢的调节
Nat Rev Cancer
2011
;
(约翰福音11
85-95

Carvalho
CM
,
Polson
NG
,
Scott
JG.
Handling sparsity via the horseshoe. In:
van Dyk
D
,
Welling
M
(eds.), Proceedings of the Twelth International Conference on Artificial Intelligence and Statistics, volume 5 of Proceedings of Machine Learning Research, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA. PMLR.
2009
,
73
80
.
Carvalho
CM
Polson
NG,
Scott
JG.
通过马蹄铁处理稀疏性。载于:
van Dyk
D
Welling
M
(编辑),第十二届人工智能与统计国际会议论文集,机器学习研究论文集第 5 卷,美国佛罗里达州克利尔沃特海滩希尔顿克利尔沃特海滩度假村。PMLR.
2009
年,
73-80

Chakraborty
B
,
Moodie
EE.
Statistical Methods for Dynamic Treatment Regimes: Reinforcement Learning, Causal Inference and Personalized Medicine
.
New York
:
Springer
,
2013
.
Chakraborty
B
穆迪
EE。
动态治疗方案的统计方法:强化学习、因果推理和个性化医学
纽约
施普林格
2013
年。

Chekouo
T
,
Stingo
FC
,
Doecke
JD
et al. 
miRNA-target gene regulatory networks: a Bayesian integrative approach to biomarker selection with application to kidney cancer
.
Biometrics
2015
;
71
:
428
38
.
Chekouo
T
Stingo
FC
Doecke
JD
等人。
miRNA 靶基因调控网络:一种应用于肾癌的生物标志物选择的贝叶斯综合方法
生物识别学
2015
;
71
428-38

Chu
J
,
Sun
NA
,
Hu
W
et al. 
The application of Bayesian methods in cancer prognosis and prediction
.
Cancer Genomics Proteomics
2022
;
19
:
1
11
.
Chu
J
Sun
NA
W
等.
贝叶斯方法在癌症预后和预测中的应用
癌症基因组学蛋白质组学
2022
;
19
1-11

Clark
TG
,
Bradburn
MJ
,
Love
SB
et al. 
Survival analysis part IV: further concepts and methods in survival analysis
.
Br J Cancer
2003
;
89
:
781
6
.
Clark
TG
Bradburn
MJ
Love
SB
等人。
生存分析第四部分:生存分析的更多概念和方法
Br J 癌症
2003
;
89
781-6

Cox
DR.
Note on grouping
.
J Am Stat Assoc
1957
;
52
:
543
7
.
Cox
DR.
关于分组的说明
J Am Stat Assoc
1957
年;
52
543-7

Cox
DR.
Regression models and life-tables
.
J R Stat Soc B Met
1972
;
34
:
187
202
.
Cox
DR.回归
模型和寿命表
JR Stat Soc B Met
1972
年;
34
187-202

Cristescu
R
,
Lee
J
,
Nebozhyn
M
et al. 
Molecular analysis of gastric cancer identifies subtypes associated with distinct clinical outcomes
.
Nat Med
2015
;
21
:
449
56
.
Cristescu
R
Lee
J
Nebozhyn
M
等人。
胃癌的分子分析确定了与不同临床结果相关的亚型
国家医学会
2015
;
21
449-56

De Bin
R
,
Sauerbrei
W
,
Boulesteix
A-L.
Investigating the prediction ability of survival models based on both clinical and omics data: two case studies
.
Stat Med
2014
;
33
:
5310
29
.
De Bin
R
Sauerbrei
W
Boulesteix
A-L.
根据临床和组学数据调查生存模型的预测能力:两个案例研究
统计医学
2014
;
33
5310-29

de Brevern
A
,
Hazout
S
,
Malpertuy
A.
Influence of microarrays experiments missing values on the stability of gene groups by hierarchical clustering
.
BMC Bioinformatics
2004
;
5
:
114
.
de Brevern
A
Hazout
S
Malpertuy
A.
通过分层聚类,微阵列实验缺失值对基因组稳定性的影响
BMC 生物信息学
2004
;
5
114

Deliu
N
,
Williams
JJ
,
Chakraborty
B.
Reinforcement learning in modern biostatistics: Constructing optimal adaptive interventions. arXiv, arXiv:2203.02605,
2023
, preprint: not peer reviewed.
Deliu
N
Williams
JJ
Chakraborty
B.
现代生物统计学中的强化学习:构建最佳适应性干预。arXiv, arXiv:2203.02605,
2023
, 预印本:未经同行评审。

Dunkler
D
,
Schemper
M
,
Heinze
G.
Gene selection in microarray survival studies under possibly non-proportional hazards
.
Bioinformatics
2010
;
26
:
784
90
.
Dunkler
D
Schemper
M
Heinze
G.
在可能的不成比例风险下微阵列存活研究中的基因选择
生物信息学
2010
;
26
784-90

Dutta
S
,
Datta
S
,
Datta
S.
Temporal prediction of future state occupation in a multistate model from high-dimensional baseline covariates via pseudo-value regression
.
J Stat Comput Simul
2017
;
87
:
1363
78
.
Dutta
S
Datta
S
Datta
S.
通过伪值回归从高维基线协变量对多状态模型中未来状态占用的时间预测
J Stat Comput Simul
2017
年;
87
1363-78

Edwards
NJ
,
Oberti
M
,
Thangudu
RR
et al. 
The CPTAC data portal: a resource for cancer proteomics research
.
J Proteome Res
2015
;
14
:
2707
13
.
Edwards
NJ
Oberti
M
Thangudu
RR
等人。
CPTAC 数据门户:癌症蛋白质组学研究的资源
J 蛋白质组研究
2015
;
14
2707-13

Elmentaite
R
,
Domínguez Conde
C
,
Yang
L
et al. 
Single-cell atlases: shared and tissue-specific cell types across human organs
.
Nat Rev Genet
2022
;
23
:
395
410
.
Elmentaite
R
Domínguez Conde
C
Yang
L
等人。
单细胞图谱:跨人体器官的共享和组织特异性细胞类型
Nat Rev Genet
2022
;
23
395
410

Engler
D
,
Li
Y.
Survival analysis with high-dimensional covariates: an application in microarray studies
.
Stat Appl Genet Mol
2009
;
8
:
1
22
.
Engler
D
Li
Y.
高维协变量的生存分析:在微阵列研究中的应用
统计应用基因 Mol
2009
;
8
1-22

Fan
J
,
Feng
Y
,
Wu
Y.
High-dimensional variable selection for Cox’s proportional hazards model
.
IMS Collections
2010
;
6
:
70
86
.
Fan
J
Feng
Y
Wu
Y.Cox
比例风险模型的高维变量选择
.
IMS 收藏
2010
;
6
70-86

Fan
J
,
Li
R.
Variable selection for Cox’s proportional hazards model and frailty model
.
Ann. Stat
2002
;
30
:
74
99
.
Fan
J
Li
R.Cox
比例风险模型和虚弱模型的变量选择
Ann. Stat
2002
;
30
74-99

Farooq
M
,
Hardan
S
,
Zhumbhayeva
A
et al.  Understanding breast cancer survival: using causality and language models on multi-omics data. arXiv, arXiv:2305.18410,
2023
, preprint: not peer reviewed.
Farooq
M
Hardan
S
Zhumbhayeva
A
等人。 了解乳腺癌生存率:在多组学数据上使用因果关系和语言模型。arXiv,arXiv:2305.18410,2023 年,预印本:未经同行评审。

Fu
Z
,
Parikh
CR
,
Zhou
B.
Penalized variable selection in competing risks regression
.
Lifetime Data Anal
2017
;
23
:
353
76
.
Fu
Z
Parikh
CR
B.
竞争风险回归中的惩罚变量选择
Lifetime Data Anal
2017
;
23
353-76

George
E
,
McCulloch
R.
Variable selection via Gibbs sampling
.
J Am Stat Assoc
1993
;
88
:
881
9
.
George
E
McCulloch
R.
通过 Gibbs 采样进行变量选择
J Am Stat Assoc
1993
年;
88
881-9

Gerds
TA
,
Kattan
MW.
Medical Risk Prediction Models: With Ties to Machine Learning
.
Boca Raton, FL
:
Chapman and Hall/CRC
,
2021
.
Gerds
TA,Kattan
MW。
医疗风险预测模型:与机器学习相关
佛罗里达州博卡拉顿
查普曼和霍尔/CRC,2021
年。

Gneiting
T
,
Raftery
AE.
Strictly proper scoring rules, prediction, and estimation
.
J Am Stat Assoc
2007
;
102
:
359
78
.
Gneiting
T,Raftery
AE。
严格正确的评分规则、预测和估计
J Am Stat Assoc
2007
年;
102
359-78

Gorst-Rasmussen
A
,
Scheike
TH.
Coordinate descent methods for the penalized semiparametric additive hazards model
.
J Stat Soft
2012
;
47
:
1
17
.
Gorst-Rasmussen
A
Scheike
TH.
协调惩罚半参数加性灾害模型的下降方法
J 统计软件
2012
;
47
1-17

Graf
E
,
Schmoor
C
,
Sauerbrei
W