Introduction 介绍

In recent years, there has been a significant transformation in medical research methodologies towards the adoption of Deep Learning (DL) techniques for predicting critical events, such as disease development and patient mortality. Despite their potential to handle complex data, practical applications in this domain still need to be expanded, with most studies still relying on traditional statistical methods.
近年来,医学研究方法发生了重大转变,采用深度学习 (DL) 技术来预测关键事件,例如疾病发展和患者死亡率。尽管它们有可能处理复杂数据,但该领域的实际应用仍需要扩展,大多数研究仍然依赖于传统的统计方法。

Survival Analysis (SA), or time-to-event analysis, is an essential tool for studying specific events in various disciplines, not only in medicine but also in fields such as recommendation systems1, employee retention2, market modeling3, and financial risk assessment4.
生存分析 (SA) 或事件发生时间分析是研究各个学科中特定事件的重要工具,不仅在医学领域,而且在推荐系统 1 (recommendation systems)、员工保留 2 (employee retention)、市场建模 3 (market modeling 和金融风险评估4 等领域。

According to the existing literature, the Cox proportional hazards model (Cox-PH)5 is the dominant SA method that offers a semiparametric regression solution to the non-parametric Kaplan-Meier estimator problem6. Unlike the Kaplan-Meier method, which uses a single covariate, Cox-PH incorporates multiple covariates to predict event times and assess their impact on the hazard rate at specific time points. However, it is crucial to acknowledge that the Cox-PH model is built on certain strong assumptions. One of these is the proportional hazards assumption, which posits that different individuals have hazard functions that remain constant over time. Also, the model assumes a linear relation between the natural logarithm of the relative hazard (the ratio of the hazard at time t to the baseline hazard) and the covariates. Although the standard Cox-PH model assumes the absence of interactions among these covariates, it can be extended by introducing interaction terms, such as quadratic or higher-order terms, allowing the modeling of more complex relations between covariates. However, even with these extensions, the model may struggle to capture non-linearities in real-world datasets, where intricate interactions between covariates and non-linear relationships might exist. Other traditional parametric statistical models for SA make specific assumptions about the distribution of event times. For instance, there are models that assume exponential and Weibull distributions, respectively, for event times7,8. However, one drawback of these models is that they lack flexibility when changing the assumed distribution for survival times, making them less adaptable to diverse datasets.
根据现有文献,Cox 比例风险模型 (Cox-PH)5 是主要的 SA 方法,它为非参数 Kaplan-Meier 估计器问题6 提供了半参数回归解决方案。与使用单个协变量的 Kaplan-Meier 方法不同,Cox-PH 包含多个协变量来预测事件时间并评估它们在特定时间点对风险率的影响。但是,重要的是要承认 Cox-PH 模型是建立在某些强有力的假设之上的。其中之一是比例风险假设,它假设不同的个体具有随时间保持不变的风险函数。此外,该模型还假设相对风险的自然对数(时间 t 的风险与基线风险的比率)与协变量之间存在线性关系。尽管标准 Cox-PH 模型假设这些协变量之间不存在交互作用,但可以通过引入交互作用项(例如二次项或高阶项)来扩展它,从而允许对协变量之间更复杂的关系进行建模。然而,即使有了这些扩展,该模型也可能难以在实际数据集中捕获非线性,其中协变量和非线性关系之间可能存在复杂的交互。SA 的其他传统参数统计模型对事件时间的分布做出了具体假设。例如,有些模型分别假设事件时间7,8 为指数分布和 Weibull 分布。然而,这些模型的一个缺点是,它们在改变生存时间的假设分布时缺乏灵活性,这使得它们对不同数据集的适应性较差。

In response, researchers have explored Deep Neural Networks (DNNs) to effectively capture the intricate and non-linear relations between predictive variables and a patient’s risk of failure. Significant emphasis has been placed on improving the Cox PH model, the standard SA approach. Recent approaches have introduced Neural Networks (NN) in various configurations, either enhancing the Cox-PH model with neural components or proposing entirely novel architectures. This exploration of NN applications for SA traces back to 19959, when a simple feed-forward NN was employed to replace linear interaction terms while incorporating non-linearities. Subsequently, the field saw the emergence of DeepSurv10, a model designed to extract non-linearities from input data, albeit still assuming the proportional hazards assumption. This assumption persists in other related models11. Beyond addressing non-linearity, some researchers have sought to enhance prediction accuracy and model interpretability by combining Bayesian networks with the Cox-PH model12. Additionally, efforts have been made to introduce concepts that facilitate analysis when data availability is limited13,14. However, it is essential to note that all these models still depend on the assumption of proportional hazards. As a result, novel architectures such as DeepHit15 have emerged as alternatives that do not rely on this assumption. While DeepHit has exhibited superior performance compared to other state-of-the-art models, it operates exclusively in the discrete-time domain, which comes with certain limitations, notably the requirement for a dataset with a substantial number of observations. This condition may not be feasible in real-world scenarios.
作为回应,研究人员探索了深度神经网络 (DNN),以有效捕获预测变量与患者失败风险之间的复杂非线性关系。重点放在改进 Cox PH 模型,即标准 SA 方法上。最近的方法引入了各种配置的神经网络 (NN),要么使用神经组件增强 Cox-PH 模型,要么提出全新的架构。对 SA 的 NN 应用的探索可以追溯到 19959 年,当时采用简单的前馈 NN 来替换线互项,同时结合非线性。随后,该领域出现了 DeepSurv10,这是一个旨在从输入数据中提取非线性的模型,尽管它仍然假设了比例风险。这种假设在其他相关模型中仍然存在11。除了解决非线性问题外,一些研究人员还试图通过将贝叶斯网络与 Cox-PH 模型相结合来提高预测准确性和模型可解释性12。此外,还努力引入在数据可用性有限时促进分析的概念13,14。然而,必须注意的是,所有这些模型仍然依赖于比例风险的假设。因此,DeepHit15 等新型架构已成为不依赖于此假设的替代方案。虽然与其他最先进的模型相比,DeepHit 表现出卓越的性能,但它仅在离散时间域中运行,这有一定的局限性,特别是需要具有大量观察的数据集。 这种情况在实际场景中可能不可行。

In light of the persistent limitations of existing approaches in the realm of SA, this paper introduces a novel, versatile algorithm grounded in DL advances named SAVAE (Survival Analysis Variational Autoencoder). SAVAE has been meticulously designed to predict the time distribution leading to a predefined event and adapts to application in various domains, explicitly focusing on the medical context. Then, our main contributions consist of:
鉴于现有方法在 SA 领域的持续局限性,本文介绍了一种基于 DL 进步的新型多功能算法,名为 SAVAE(生存分析变分自动编码器)。SAVAE 经过精心设计,可预测导致预定义事件的时间分布,并适应各个领域的应用,明确关注医疗环境。然后,我们的主要贡献包括:

  • We introduce a generative approach that underpins the development of a flexible tool, SAVAE, based on Variational Autoencoders (VAEs). SAVAE can effectively reproduce the data by analytically modeling the discrete or continuous time to a specific event. This analytical approach enables the precision calculation of all necessary statistics, as the output provided by SAVAE are the estimated parameters of the predicted time distribution.
    我们介绍了一种生成方法,它支持基于变分自动编码器 (VAE) 的灵活工具 SAVAE 的开发。SAVAE 可以通过对特定事件的离散或连续时间进行分析建模来有效地再现数据。这种分析方法可以精确计算所有必要的统计数据,因为 SAVAE 提供的输出是预测时间分布的估计参数。

  • SAVAE is a flexible tool that enables us to use various distributions to model the time-to-event and the covariates. This allows us not to assume proportional hazards. Using NN permits modeling complex, non-linear relations between the covariates and the time-to-event, as opposed to linearity assumptions in the state of the art. Also, the time-to-event is trained with standard likelihood techniques, unlike state-of-the-art models like DeepHit, which trains the Concordance Index (C-index). This makes our approach more general and flexible, as any differentiable distribution could be used to model the time and the covariates.
    SAVAE 是一个灵活的工具,使我们能够使用各种分布来模拟事件发生时间和协变量。这使我们能够不假设成比例的风险。使用 NN 可以对协变量和事件发生时间之间复杂的非线性关系进行建模,而不是最先进的线性假设。此外,事件发生时间是使用标准似然技术训练的,这与 DeepHit 等最先进的模型不同,后者训练一致性指数 (C-index)。这使得我们的方法更加通用和灵活,因为任何可微分布都可以用于对时间和协变量进行建模。

  • Furthermore, our proposal can be trained on right-censored data, effectively leveraging information from patients who have not yet experienced the event of interest.
    此外,我们的提案可以在右删失数据上进行训练,有效地利用尚未经历过感兴趣事件的患者的信息。

  • We have conducted comprehensive time-to-event estimation experiments using datasets characterized by continuous and discrete time-to-event values and varying covariate natures, encompassing clinical and genomic data. These experiments involve a comparative analysis with the traditional Cox-PH model and other DL techniques. The results indicate that SAVAE is competitive with these models regarding the C-index and the Integrated Brier score (IBS).
    我们使用以连续和离散事件发生时间值和不同协变量性质为特征的数据集进行了全面的事件发生时间估计实验,包括临床和基因组数据。这些实验涉及与传统 Cox-PH 模型和其他 DL 技术的比较分析。结果表明,SAVAE 在 C 指数和综合 Brier 评分 (IBS) 方面与这些模型相比具有竞争力。

Methods 方法

Survival analysis 生存分析

In a conventional time-to-event or SA setup, N observations are given. Each of these observations is described by \(D = (x_i, t_i, d_i)^N_{i=1}\) triplets, where \(x_i=(x_i^1,\ldots , x_i^L)\) is an L-dimensional vector where \(l=1,2,\ldots ,L\) indexes the covariates, \(t_i\) is the time-to-event, and \(d_i \in \{0,1\}\) is the censor indicator. When \(d_i = 0\) (censored), the subject has not experienced an event up to time \(t_i\), while \(d_i = 1\) indicates the observed events (ground truth). SA models are conditional on covariates: time probability density function \(p(t\vert x)\), hazard rate function (the instantaneous rate of occurrence of the event at a specific time) \(h(t\vert x)\), or survival function \(S(t\vert x) = P(T>t) = 1-F(t\vert x)\), also known as the probability of a failure occurring after time t, where \(F(t\vert x)\) is the Cumulative Distribution Function (CDF) of the time. From standard definitions of the survival function, the relations between these three characterizations are formulated as follows:
在传统的事件发生时间或 SA 设置中,给出 N 个观测值。这些观测值中的每一个都由 \(D = (x_i, t_i, d_i)^N_{i=1}\) 三元组描述,其中 \(x_i=(x_i^1,\ldots , x_i^L)\) 是一个 L 维向量,其中 \(l=1,2,\ldots ,L\) 为协变量编制索引,\(t_i\) 是事件发生时间,\(d_i \in \{0,1\}\) 是删失指标。当\(d_i = 0\)(删失)时,受试者在时间\(t_i\)之前没有经历过事件,而\(d_i = 1\)表示观察到的事件(真实值)。SA 模型以协变量为条件:时间概率密度函数 \(p(t\vert x)\))、风险率函数(事件在特定时间的瞬时发生率)\(h(t\vert x)\)或生存函数 \(S(t\vert x) = P(T>t) = 1-F(t\vert x)\),也称为时间 t 之后发生故障的概率, 其中 \(F(t\vert x)\) 是当时的累积分布函数 (CDF)。根据生存函数的标准定义,这三个特征之间的关系表述如下:

$$\begin{aligned} p(t\vert x) = h(t\vert x)S(t\vert x). \end{aligned}$$
$$\begin{aligned} p(t\vert x) = h(t\vert x)S(t\vert x).\end{aligned}$$
(1)

Vanilla variational autoencoder
Vanilla 变分自动编码器

The original VAE was proposed in 201316, a robust approach employing DNNs for Bayesian inference. It addresses the problem of a dataset consisting of N i.i.d. samples \(x_i\) of a continuous or discrete variable, where \(i \in 1,2,\ldots , N\), \(x_i\) are generated by the following random process, which is depicted in Fig. 1:
最初的 VAE 于 201316 年提出,这是一种使用 DNN 进行贝叶斯推理的稳健方法。它解决了一个数据集由连续或离散变量的 N 个 i.i.d. 样本 \(x_i\) 组成的问题,其中 \(i \in 1,2,\ldots , N\)\(x_i\) 由以下随机过程生成,如图 1 所示。1

Figure 1 图 1
figure 1

Bayesian VAE vanilla model. The shaded circle refers to the latent variable z, and the white circle refers to the observable x. Probabilities \(p_\theta (x\vert z)\) and \(q_\phi (z\vert x)\) denote, respectively, the generative model and the variational approximation to the posterior, since the true posterior \(p_{\theta }(z\vert x)\) is unknown.
Bayesian VAE vanilla 模型.阴影圆圈是指潜在变量 z,白色圆圈是指可观察对象 x。概率 \(p_\theta (x\vert z)\)\(q_\phi (z\vert x)\) 分别表示生成模型和后验的变分近似值,因为真正的后验 \(p_{\theta }(z\vert x)\) 是未知的。

  1. 1.

    A latent variable \(z_i\) is sampled from a given prior probability distribution p(z). The original research16 assumes a form \(p_\theta (z)\), i.e., the prior depends on some parameters \(\theta\), but its main result drops this dependence. Therefore, a simple prior p(z) is assumed in this paper.
    潜在变量 \(z_i\) 是从给定的先验概率分布 pz) 中采样的。最初的研究16假设形式为\(p_\theta (z)\),即先验取决于一些参数\(\theta\),但其主要结果放弃了这种依赖性。因此,本文假设了一个简单的先验 pz)。

  2. 2.

    A conditional distribution, \(p_\theta (x\vert z)\), with parameters \(\theta\) generates the observed values, \(x_i\). A generative model governs this process. Certain assumptions are made, including the differentiability of probability density functions (pdfs), p(z), and \(p_\theta (x\vert z)\), regarding \(\theta\) and z.
    条件分布 \(p_\theta (x\vert z)\) 和参数 \(\theta\) 生成观测值 \(x_i\)。生成模型控制此过程。我们做出了某些假设,包括概率密度函数 (pdfs)、pz) 和 \(p_\theta (x\vert z)\) 关于 \(\theta\)z 的可微性。

The latent variable z and the parameters \(\theta\) are unknown. Without simplifying assumptions, evaluating the marginal likelihood \(p_\theta (x) = \int p(z)p_\theta (x\vert z)dz\) is infeasible. The true posterior density \(p_\theta (z\vert x)\), which we aim to approximate, can be defined as Eq. (2) using Bayes’ theorem:
潜在变量 z 和参数 \(\theta\) 是未知的。如果不简化假设,评估边际似然 \(p_\theta (x) = \int p(z)p_\theta (x\vert z)dz\) 是不可行的。我们旨在近似的真实后验密度 \(p_\theta (z\vert x)\) 可以用贝叶斯定理定义为方程 (2):

$$\begin{aligned} p_\theta (z\vert x) = \frac{p_\theta (x\vert z)p(z)}{p_\theta (x)}. \end{aligned}$$
$$\begin{aligned} p_\theta (z\vert x) = \frac{p_\theta (x\vert z)p(z)}{p_\theta (x)}.\end{aligned}$$
(2)

However, since the marginal likelihood \(p_\theta (x)\) is often intractable, direct computation of the true posterior \(p_\theta (z\vert x)\) is not practicable.
然而,由于边际似然 \(p_\theta (x)\) 通常很难处理,因此直接计算真正的后验 \(p_\theta (z\vert x)\) 是不可行的。

Variational methods offer a solution by introducing a variational approximation, \(q_\phi (z\vert x)\), to the true posterior. This approximation involves optimizing the best parameters for a chosen family of distributions. The quality of the approximation depends on the expressiveness of this parametric family.
变分方法通过将变分近似 \(q_\phi (z\vert x)\) 引入真正的后验来提供解决方案。此近似涉及优化所选分布系列的最佳参数。近似的质量取决于此参数族的表现力。

ELBO derivation ELBO 派生

Since an optimization problem must be solved, the optimization target needs to be developed. Considering \(x_i\) are assumed to be i.i.d., the marginal likelihood of a set of points \(\{x_i\}_{i=1}^{N}\) can be expressed as
由于必须解决优化问题,因此需要制定优化目标。假设 \(x_i\) 是 i.i.d.,一组点 \(\{x_i\}_{i=1}^{N}\) 的边际似然可以表示为

$$\begin{aligned} \log p_\theta (x_1, x_2,\ldots , x_N) = \sum _{i=1}^{N}\log p_\theta (x_i), \end{aligned}$$
$$\begin{aligned} \log p_\theta (x_1, x_2,\ldots, x_N) = \sum _{i=1}^{N}\log p_\theta (x_i), \end{aligned}$$
(3)

where 哪里

$$\begin{aligned} \begin{aligned} p_\theta (x) = \int p_\theta (x,z)dz = \int p_\theta (x,z) \frac{q_\phi (z\vert x)}{q_\phi (z\vert x)}dz = {\mathbb {E}}_{q_\phi (z\vert x)} \left[ \frac{p_\theta (x,z)}{q_\phi (z\vert x)}\right] . \end{aligned} \end{aligned}$$
$$\begin{aligned} \begin{aligned} p_\theta (x) = \int p_\theta (x,z)dz = \int p_\theta (x,z) \frac{q_\phi (z\vert x)}{q_\phi (z\vert x)}dz = {\mathbb {E}}_{q_\phi (z\vert x)} \left[ \frac{p_\theta (x,z)}{q_\phi (z\vert x)}\right] .\end{对齐} \end{对齐}$$
(4)

Using Jensen’s inequality, we can obtain:
使用 Jensen 不等式,我们可以得到:

$$\begin{aligned} \begin{aligned} \log p_\theta (x) = \log \Bigg [{\mathbb {E}}_{q_\phi (z\vert x)}\left[ \frac{p_\theta (x,z)}{q_\phi (z\vert x)}\right] \Bigg ] \ge {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log \frac{p_\theta (x,z)}{q_\phi (z\vert x)}\right] . \end{aligned} \end{aligned}$$
$$\begin{aligned} \log p_\theta (x) = \log \Bigg [{\mathbb {E}}_{q_\phi (z\vert x)}\left[ \frac{p_\theta (x,z)}{q_\phi (z\vert x)}\right] \Bigg ] \ge {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log \frac{p_\theta (x,z)}{q_\phi (z\vert x)}\right] .\end{对齐} \end{对齐}$$
(5)

Rearranging Eq. (5), we can express it as follows:
重新排列方程 (5),我们可以将其表示如下:

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{q_\phi (z\vert x)}\Bigg [\log \left( \frac{p_\theta (x,z)}{q_\phi (z\vert x)}\right) \Bigg ]\\&\quad = \int q_\phi (z\vert x)\log \frac{p_\theta (x\vert z)p(z)}{q_\phi (z\vert x)}dz\\&\quad = \int q_\phi (z\vert x)\log \frac{p(z)}{q_\phi (z\vert x)}dz + \int q_\phi (z\vert x)\log p_\theta (x\vert z)dz\\&\quad = - \int q_\phi (z\vert x)\log \frac{q_\phi (z\vert x)}{p(z)}dz + \int q_\phi (z\vert x)\log p_\theta (x\vert z)dz\\&\quad = -D_{KL}(q_\phi (z\vert x) \vert \vert p(z)) + {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log p_\theta (x\vert z)\right] \\&\quad = \mathcal {L}(x, \theta , \phi ), \end{aligned} \end{aligned}$$
$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{q_\phi (z\vert x)}\Bigg [\log \left( \frac{p_\theta (x,z)}{q_\phi (z\vert x)}\right) \Bigg ]\\&\quad = \int q_\phi (z\vert x)\log \frac{p_\theta (x\vert z)p(z)}{q_\phi (z\vert x)}dz\\&\quad = \int q_\phi (z\vert x)\log \frac{p(z)}{q_\phi (z\vert x)}dz + \int q_\phi (z\vert x)\log p_\theta (x\vert z)dz\\&\quad = - \int q_\phi (z\vert x)\log \frac{q_\phi (z\vert x)}{p(z)}dz + \int q_\phi (z\vert x)\log p_\theta (x\vert z)dz\\&\quad = -D_{KL}(q_\phi (z\vert x) \vert \vert p(z)) + {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log p_\theta (x\vert z)\right] \\&\quad = \mathcal {L}(x, \theta , \phi ), \end{对齐} \end{对齐}$$
(6)

where \(D_{KL}(p\vert \vert q)\) is the Kullback-Leibler divergence between distributions p and q, and \(\mathcal {L}(x, \theta , \phi )\) is the Evidence Lower BOund (ELBO), whose name comes from Eq. (5):
其中 \(D_{KL}(p\vert \vert q)\) 是分布 pq 之间的 Kullback-Leibler 散度,\(\mathcal {L}(x, \theta , \phi )\) 是证据下 BOund (ELBO),其名称来自方程 (5):

$$\begin{aligned} \begin{aligned} \log p_\theta (x) \ge - D_{KL}(q_\phi (z\vert x)\vert \vert p(z)) + {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log p_\theta (x\vert z)\right] = \mathcal {L}(x, \theta , \phi ), \end{aligned} \end{aligned}$$
$$\begin{aligned} \log p_\theta (x) \ge - D_{KL}(q_\phi (z\vert x)\vert \vert p(z)) + {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log p_\theta (x\vert z)\right] = \mathcal {L}(x, \theta , \phi ), \end{aligned} \end{aligned}$$
(7)

the ELBO is a lower bound for the marginal log-likelihood of the relevant set of points. Thus, maximizing the ELBO maximizes the log-likelihood of the data. This would be the optimization problem to solve.
ELBO 是相关点集的边际对数似然的下限。因此,最大化 ELBO 会最大化数据的对数似然。这将是要解决的优化问题。

Implementation 实现

The ELBO derived from Eq. (7) can be effectively implemented using a DNN-based architecture. However, computing the gradient of the ELBO concerning \(\phi\) presents challenges due to the presence of \(\phi\) in the expectation term (the second part of the ELBO in Eq. (7)). To address this issue, the original research16 introduced the reparameterization trick. This method involves modifying the latent space sampling process to make it differentiable, enabling gradient-based optimization techniques. Rather than sampling directly from the latent space distribution, VAEs sample \(\epsilon\) from a simple distribution, often a standard normal distribution. Subsequently, a deterministic transformation \(g_\phi\) is applied to \(\epsilon\), producing \(z = g_\phi (x, \epsilon )\) where \(z \sim q_\phi (z\vert x)\) and \(\epsilon \sim p(\epsilon )\). In this case, the ELBO can be estimated as follows.
从方程 (7) 派生的 ELBO 可以使用基于 DNN 的架构有效地实现。然而,计算关于 \(\phi\) 的 ELBO 梯度存在挑战,因为期望项(方程 (7) 中 ELBO 的第二部分)中存在 \(\phi\)。为了解决这个问题,最初的研究16 引入了重新参数化技巧。这种方法涉及修改潜在空间采样过程以使其可微分,从而启用基于梯度的优化技术。VAE 不是直接从潜在空间分布中采样,而是从简单分布(通常是标准正态分布)中采样 \(\epsilon\)。随后,对 \(\epsilon\) 应用确定性变换 \(g_\phi\),生成 \(z = g_\phi (x, \epsilon )\),其中 \(z \sim q_\phi (z\vert x)\)\(\epsilon \sim p(\epsilon )\)。在这种情况下,ELBO 可以按以下方式估算。

$$\begin{aligned} \begin{aligned} \mathcal {{\hat{L}}}(x, \theta , \phi ) = \frac{1}{N}\sum _{i=1}^{N} \bigg (- D_{KL}(q_\phi (z\vert x_i)\vert \vert p(z)) + \log p_\theta (x_i\vert g_\phi (x_i, \epsilon _{i})) \bigg ). \end{aligned} \end{aligned}$$
$$\begin{aligned} \begin{aligned} \mathcal {{\hat{L}}}(x, \theta , \phi ) = \frac{1}{N}\sum _{i=1}^{N} \bigg (- D_{KL}(q_\phi (z\vert x_i)\vert \vert p(z)) + \log p_\theta (x_i\vert g_\phi (x_i, \epsilon _{i})) \bigg )。\end{对齐} \end{对齐}$$
(8)

This modification facilitates the calculation of the ELBO gradient concerning \(\theta\) and \(\phi\), allowing the application of standard gradient optimization methods.
此修改有助于计算有关 \(\theta\)\(\phi\) 的 ELBO 梯度,从而允许应用标准梯度优化方法。

Equation (8) offers a solution using DNNs, with functions parameterized by \(\phi\) and \(\theta\). Gradients can be conveniently computed using the Backpropagation algorithm, which various programming libraries automate. The term VAE derives from the fact that Eq. (8) resembles the architecture of an Autoencoder (AE)17, as illustrated in Fig. 2. The variational distribution \(q_\phi\) can be implemented using a DNN with weights \(\phi\), taking an input sample x and outputting parameters for the deterministic transformation \(g_\phi\). The VAE’s latent space comprises the latent variable z distribution, a deterministic transformation \(g_\phi\) of the encoder DNN output and random ancillary noise \(\epsilon\). A sampled value \(z_i\) is drawn from the latent distribution and used to generate an output sample, where another DNN with weights \(\theta\) acts as a decoder, taking z as input and providing parameters of the distribution \(p_\theta (x\vert z)\) as output.
方程 (8) 提供了一个使用 DNN 的解,函数由 \(\phi\)\(\theta\) 参数化。使用 Backpropagation 算法可以方便地计算梯度,各种编程库都可以自动使用该算法。术语 VAE 源于方程 (8) 类似于自动编码器 (AE)17 的架构,如图 8 所示。2. 变分分布 \(q_\phi\) 可以使用权重为 \(\phi\) 的 DNN 来实现,取输入样本 x 并输出确定性变换 \(g_\phi\) 的参数。VAE 的潜在空间包括潜在变量 z 分布、编码器 DNN 输出的确定性变换 \(g_\phi\) 和随机辅助噪声 \(\epsilon\)。从潜在分布中抽取一个采样值 \(z_i\) 并用于生成输出样本,其中另一个权重为 \(\theta\) 的 DNN 充当解码器,将 z 作为输入并提供分布 \(p_\theta (x\vert z)\) 的参数作为输出。

Figure 2 图 2
figure 2

VAE vanilla model implementation using DNNs.
使用 DNN 的 VAE Vanilla 模型实现。

Two key observations emerge.
出现了两个关键的观察结果。

  1. 1.

    The ELBO losses in Eq. (7) include a regularization term penalizing deviations from the prior in the latent space and a reconstruction error term that enforces similarity between generated samples from the latent space and inputs.
    方程(7)中的 ELBO 损失包括一个正则化项,用于惩罚潜在空间中与先验的偏差,以及一个重建误差项,该项强制从潜在空间生成的样本和输入之间产生相似性。

  2. 2.

    In contrast to standard AEs, VAEs incorporate intermediate sampling, rendering them non-deterministic. This dual sampling process is retained in applications where the distribution of output variables is of interest, facilitating the derivation of input value distribution parameters.
    与标准 AE 相比,VAE 包含中间采样,使其具有不确定性。在关注输出变量分布的应用程序中保留这种双采样过程,从而有助于推导输入值分布参数。

Our contribution 我们的贡献

The interest lies in using VAEs to obtain the predictive distribution of time-to-event given covariates. The proposed approach termed Survival Analysis VAE (SAVAE), depicted in Fig. 3, extends the Vanilla VAE. SAVAE includes a continuous latent variable z, two vectors (an observable covariate vector x and the time-to-event t), and generative models \(p_{\theta _1}(x\vert z)\) and \(p_{\theta _2}(t\vert z)\), assuming conditional independence, which is a characteristic inherent to VAEs and their ability to model the joint distribution of variables effectively. This means that knowing z, the components of the vector x and t can be generated independently. A single variational distribution estimates the variational posterior \(p(z\vert x)\) to define the predictive distribution based on covariates. While it is possible to include the effect of time (\(p(z\vert t, x)\)), this approach focuses on using only covariates to obtain the latent space, as the time t can be unknown to predict survival times for test patients and could be censored. SAVAE combines VAEs and survival analysis, offering a flexible framework for modeling complex event data.
兴趣在于使用 VAE 获得给定协变量的事件发生时间的预测分布。所提出的方法称为生存分析 VAE (SAVAE),如图 2 所示。3,扩展了 Vanilla VAE。SAVAE 包括一个连续的潜在变量 z、两个向量(一个可观察的协变量向量 x 和事件发生时间 t)以及生成模型 \(p_{\theta _1}(x\vert z)\)\(p_{\theta _2}(t\vert z)\),假设条件独立性,这是 VAE 固有的特征,以及它们有效模拟变量联合分布的能力。这意味着知道 z,向量 xt 的分量可以独立生成。单个变分分布估计变分后验 \(p(z\vert x)\) 以定义基于协变量的预测分布。虽然可以包括时间的影响 (\(p(z\vert t, x)\)),但这种方法侧重于仅使用协变量来获得潜在空间,因为时间 t 可能是未知的,无法预测测试患者的生存时间,并且可以被删失。SAVAE 结合了 VAE 和生存分析,为复杂事件数据建模提供了一个灵活的框架。

Figure 3 图 3
figure 3

SAVAE Bayesian model. The shadowed circle refers to the latent variable, and the white circles refer to the observables. Note that the probabilities \(p_{\theta _1}(x\vert z)\) and \(p_{\theta _2}(t\vert z)\) denote the generative models, and \(q_\phi (z|x)\) denotes the variational approximation to the posterior, since the true posterior \(p_{\theta }(z\vert x)\) is unknown.
SAVAE 贝叶斯模型。阴影圆圈是指 latent 变量,白色圆圈是指 observables。请注意,概率 \(p_{\theta _1}(x\vert z)\)\(p_{\theta _2}(t\vert z)\) 表示生成模型,而 \(q_\phi (z|x)\) 表示后验的变分近似,因为真正的后验 \(p_{\theta }(z\vert x)\) 是未知的。

Goal 目标

To achieve the main objective, which is to obtain the predictive distribution for the time to event, variational methods will be used in the following way18:
为了实现主要目标,即获得事件发生时间的预测分布,将按以下方式使用变分方法18

$$\begin{aligned} \begin{aligned} p\left( t^*\vert x^*, \left\{ x_i, t_i\right\} ^N_{i=1}\right) = \int p \left( t^*\vert z, \left\{ x_i, t_i\right\} ^N_{i=1}\right) p \left( z\vert x^*, \left\{ x_i, t_i\right\} ^N_{i=1}\right) dz, \end{aligned} \end{aligned}$$
$$\begin{aligned} \begin{aligned} p\left( t^*\vert x^*, \left\{ x_i, t_i\right\} ^N_{i=1}\right) = \int p \left( t^*\vert z, \left\{ x_i, t_i\right\} ^N_{i=1}\right) p \left( z\vert x^*, \left\{ x_i, t_i\right\} ^N_{i=1}\right) dz, \end{aligned} \end{aligned}$$
(9)

where \(x^*\) represents the covariates of a particular patient, and its survival time distribution \(p \left( t^*\vert z, \left\{ x_i, t_i\right\} ^N_{i=1}\right)\) needs to be estimated.
其中 \(x^*\) 表示特定患者的协变量,其生存时间分布 \(p \left( t^*\vert z, \left\{ x_i, t_i\right\} ^N_{i=1}\right)\) 需要估计。

ELBO derivation ELBO 派生

Considering our main objective and the use of VAE as the architecture on which we base our approach, the previous ELBO development can be extended to apply to our case. SAVAE assumes that the two generative models \(p_{\theta _1}(x\vert z)\) and \(p_{\theta _2}(t\vert z)\) are conditionally independent. This implies that if z is known, generating x or t is possible. Furthermore, due to the VAE architecture, it is assumed that each component of the covariate vector x is also conditionally independent given z. Therefore,
考虑到我们的主要目标以及使用 VAE 作为我们方法的基础架构,之前的 ELBO 开发可以扩展到适用于我们的案例。SAVAE 假设两个生成模型 \(p_{\theta _1}(x\vert z)\)\(p_{\theta _2}(t\vert z)\) 在条件上是独立的。这意味着如果 z 已知,则生成 xt 是可能的。此外,由于 VAE 架构,假设协变量向量 x 的每个分量在给定 z 的情况下也是条件独立的。因此

$$\begin{aligned} p(x, t, z) = p_{\theta _1}(x\vert z)p_{\theta _2}(t\vert z)p(z) = p_\theta (x, t\vert z)p(z). \end{aligned}$$
$$\begin{对齐} p(x, t, z) = p_{\theta _1}(x\vert z)p_{\theta _2}(t\vert z)p(z) = p_\theta (x, t\vert z)p(z). \end{aligned}$$
(10)

It also assumes that the distribution families of \(p_{\theta _1}(x\vert z)\) and \(p_{\theta _2}(t\vert z)\) are known, but not the parameters \(\theta _1\) and \(\theta _2\). Taking into account these assumptions, the ELBO can be computed in a similar way to the Vanilla VAE. First, the conditional likelihood of a set of points \(\left\{ x_i, t_i\right\} ^N_{i=1}\) can be expressed as follows:
它还假设 \(p_{\theta _1}(x\vert z)\)\(p_{\theta _2}(t\vert z)\) 的分布族是已知的,但参数 \(\theta _1\)\(\theta _2\) 是已知的。考虑到这些假设,ELBO 的计算方式与 Vanilla VAE 类似。首先,一组点 \(\left\{ x_i, t_i\right\} ^N_{i=1}\) 的条件似然可以表示如下:

$$\begin{aligned} \begin{aligned} \log p_\theta (x_1, x_2,\ldots , x_N, t_1, t_2,\ldots , t_N\vert z) = \sum _{i=1}^{N}\log p_\theta (x_i, t_i\vert z) = \sum _{i=1}^{N}\left( \log p_{\theta _2}(t_i\vert z) + \sum _{l=1}^{L} \log p_{\theta _{1}}(x_{i}^{l}\vert z)\right) , \end{aligned} \end{aligned}$$
$$\begin{aligned} \begin {aligned} \log p_\theta (x_1, x_2,\ldots , x_N, t_1, t_2,\ldots , t_N\vert z) = \sum _{i=1}^{N}\log p_\theta (x_i, t_i\vert z) = \sum _{i=1}^{N}\left( \log p_{\theta _2}(t_i\vert z) + \sum _{l=1}^{L} \log p_{\theta _{1}}(x_{i}^{l}\vert z)\right) , \end{对齐} \end{对齐}$$
(11)

where the expected conditional likelihood can be expressed as:
其中,预期条件似然可以表示为:

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_z\left[ p_\theta (x, t\vert z)\right] \\&\quad = \int p_\theta (x, t\vert z)p(z)dz \\&\quad = \int \frac{p_\theta (x, t, z)}{p(z)}p(z)dz \\&\quad = \int p_\theta (x, t, z)dz \\&\quad =p_\theta (x, t) = \int p_\theta (x, t, z)\frac{q_\phi (z\vert x)}{q_\phi (z\vert x)}dz \\&\quad = {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \frac{p_\theta (x,t,z)}{q_\phi (z\vert x)}\right] . \end{aligned} \end{aligned}$$
$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_z\left[ p_\theta (x, t\vert z)\right] \\&\quad = \int p_\theta (x, t\vert z)p(z)dz \\&\quad = \int \frac{p_\theta (x, t, z)}{p(z)}p(z)dz \\&\quad = \int p_\theta (x, t, z)dz \\&\quad =p_\theta (x, t) = \int p_\theta (x, t, z)\frac{q_\phi (z\vert x)}{q_\phi (z\vert x)}dz \\&\quad = {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \frac{p_\theta (x,t,z)}{q_\phi (z\vert x)}\right] .\end{对齐} \end{对齐}$$
(12)

As the interest lies in computing the log-likelihood:
由于计算对数似然的兴趣在于:

$$\begin{aligned} \begin{aligned} \log p_\theta (x, t) = \log \left[ {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \frac{p_\theta (x,t,z)}{q_\phi (z\vert x)}\right] \right] \ge {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log \frac{p_\theta (x,t,z)}{q_\phi (z\vert x)}\right] , \end{aligned} \end{aligned}$$
$$\begin{对齐} \log p_\theta (x, t) = \log \left[ {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \frac{p_\theta (x,t,z)}{q_\phi (z\vert x)}\right] \right] \ge {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log \frac{p_\theta (x,t,z)}{q_\phi (z\vert x)}\right] , \end{aligned} \end{aligned}$$
(13)

where the inequality comes from applying Jensen’s inequality. Then, this could be rearranged as:
其中不等式来自应用 Jensen 不等式。然后,可以将其重新排列为:

$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log \left( \frac{p_\theta (x,t,z)}{q_\phi (z\vert x)}\right) \right] \\ \\&\quad = \int q_\phi (z\vert x)\log \frac{p_{\theta _1}(x\vert z)p_{\theta _2}(t\vert z)p(z)}{q_\phi (z\vert x)}dz \\&\quad = - \int q_\phi (z\vert x)\log \frac{q_\phi (z\vert x)}{p(z)}dz + \int q_\phi (z\vert x)\left( \log p_{\theta _1}(x\vert z)+\log p_{\theta _2}(t\vert z)\right) dz \\ \\&\quad = -D_{KL}(q_\phi (z\vert x) \vert \vert p(z)) + {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log p_{\theta _1}(x\vert z) + \log p_{\theta _2}(t\vert z)\right] \\&\quad = \mathcal {L}(x, \theta _1, \theta _2, \phi ). \end{aligned} \end{aligned}$$
$$\begin{aligned} \begin{aligned}&{\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log \left( \frac{p_\theta (x,t,z)}{q_\phi (z\vert x)}\right) \right] \\ \\&\quad = \int q_\phi (z\vert x)\log \frac{p_{\theta _1}(x\vert z)p_{\theta _2}(t\vert z)p(z)}{q_\phi (z\vert x)}dz \\&\quad = - \int q_\phi (z\vert x)\log \frac{q_\phi (z\vert x)}{p(z)}dz + \int q_\phi (z\vert x)\left( \log p_{\theta _1}(x\vert z)+\log p_{\theta _2}(t\vert z)\right) dz \\ \\&\quad = -D_{KL}(q_\phi (z\vert x) \vert \vert p(z)) + {\mathbb {E}}_{q_\phi (z\vert x)}\left[ \log p_{\theta _1}(x\vert z) + \log p_{\theta _2}(t\vert z)\right] \\&\quad = \mathcal {L}(x, \theta _1, \theta _2, \phi )。\end{对齐} \end{对齐}$$
(14)

After computing this ELBO, it can be seen that it is similar to the Vanilla VAE’s one (Eq. 8). The only difference lies in the reconstruction term, which is expressed differently to distinguish between the covariates and the time-to-event explicitly. By using Eq. (11) and the reparameterization trick, the ELBO estimator is obtained, explicitly accounting for each dimension of the covariates vector:
计算了这个 ELBO 后,可以看出它与 Vanilla VAE 的 ELBO 相似(方程 8)。唯一的区别在于重建项,它以不同的方式表示,以明确区分协变量和事件发生时间。通过使用方程 (11) 和重新参数化技巧,获得了 ELBO 估计器,显式考虑了协变量向量的每个维度:

$$\begin{aligned} \begin{aligned} \mathcal {{\hat{L}}}(x, \theta _1, \theta _2, \phi ) = \frac{1}{N} \sum _{i=1}^{N}\Bigg (- D_{KL}(q_\phi (z\vert x_i)\vert \vert p(z)) + \log p_{\theta _2}(t_i\vert g_\phi (x_i, \epsilon _{i})) + \sum _{l=1}^{L}\log p_{\theta _{1}}(x_{i}^{l}\vert g_\phi (x_i, \epsilon _{i}))\Bigg ). \end{aligned} \end{aligned}$$
$$\begin{aligned} \mathcal {{\hat{L}}}(x, \theta _1, \theta _2, \phi ) = \frac{1}{N} \sum _{i=1}^{N}\Bigg (- D_{KL}(q_\phi (z\vert x_i)\vert \vert p(z)) + \log p_{\theta _2}(t_i\vert g_\phi (x_i, \epsilon _{i})) + \sum _{l=1}^{L}\log p_{\theta _{1}}(x_{i}^{l}\vert g_\phi (x_i, \epsilon _{i}))\Bigg )。\end{对齐} \end{对齐}$$
(15)

Three DNNs have been used in implementation, as specified in Fig. 4. Note that the decoder DNNs output the parameters of each distribution.
在实现中使用了三个 DNN,如图 1 所示。4. 请注意,解码器 DNN 输出每个分配的参数。

Figure 4 图 4
figure 4

SAVAE implementation using DNNs. One of them acts as an encoder, which has the covariates vector as input. The other two act as decoders, one for the covariates and the other one for the time.
使用 DNN 的 SAVAE 实现。其中一个充当编码器,将协变量向量作为输入。另外两个充当解码器,一个用于协变量,另一个用于时间。

Divergence computation 发散计算

SAVAE assumes that \(q_\phi (z\vert x)\) follows a multidimensional Gaussian distribution defined by a vector of means \(\mu\), where each element is \(\mu _j\) and by a diagonal covariance matrix \({\textbf {C}}\), where the main diagonal consists of variances \(\sigma ^2_j\). It can be stated that:
SAVAE假设\(q_\phi (z\vert x)\)服从由均值向量\(\mu\)定义的多维高斯分布,其中每个元素都是\(\mu _j\)和对角协方差矩阵\({\textbf {C}}\),其中主对角线由方差\(\sigma ^2_j\)组成。可以说:

$$\begin{aligned} - D_{KL}(q_\phi (z\vert x)\vert \vert p(z)) = \frac{1}{2} \sum _{j=1}^{J} (1 + \log (\sigma _j^2) - \mu _j^2 -\sigma _j^2), \end{aligned}$$
$$\begin{对齐} - D_{KL}(q_\phi (z\vert x)\vert \vert p(z)) = \frac{1}{2} \sum _{j=1}^{J} (1 + \log (\sigma _j^2) - \mu _j^2 -\sigma _j^2), \end{aligned}$$
(16)

where J is the dimension of the latent space z16. This means the Kullback-Leibler divergence from the ELBO equation 15 can be calculated analytically.
其中 J 是潜在空间 z16 的维度。这意味着可以通过解析计算与 ELBO 方程 15 的 Kullback-Leibler 背离。

Time modeling 时间建模

One significant challenge in handling survival data is the issue of censorship, which occurs when a patient has not yet experienced the event of interest. In such cases, the survival time remains unknown, resulting in partial or incomplete observations. Consequently, SA models must employ techniques capable of accommodating censored observations and uncensored ones to estimate relevant parameters reliably.
处理生存数据的一个重大挑战是审查问题,当患者尚未经历感兴趣的事件时,就会出现审查问题。在这种情况下,生存时间仍然未知,从而导致部分或不完全的观测值。因此,SA 模型必须采用能够容纳删失观察和未删失观察的技术来可靠地估计相关参数。

In our case, to account for censoring in survival data, we start from the time t reconstruction term from Eq. 15 for a single patient:
在我们的例子中,为了解释生存数据中的删失,我们从方程 15 中单个患者的时间 t 重建项开始:

$$\begin{aligned} \mathcal {\hat{L}}_{time}(x_i, \theta _2, \phi ) = \log p_{\theta _2}(t_i\vert g_\phi (x_i, \epsilon _{i})). \end{aligned}$$
$$\begin{aligned} \mathcal {\hat{L}}_{time}(x_i, \theta _2, \phi ) = \log p_{\theta _2}(t_i\vert g_\phi (x_i, \epsilon _{i})).\end{aligned}$$
(17)

Taking into account the censoring indicator \(d_i\):
考虑到删失指示符 \(d_i\):

$$\begin{aligned} d_i = {\left\{ \begin{array}{ll} 0 \;\;\;\text {if censored}\\ 1 \;\;\;\text {if event experienced} \end{array}\right. }, \end{aligned}$$
$$\begin{对齐} d_i = {\left\{ \begin{array}{ll} 0 \;\;\;\text {如果删失}\\ 1 \;\;\;\text {如果经历事件} \end{array}\right. }, \end{对齐}$$
(18)

we could just use the information given by uncensored patients. However, we would waste information since we know that the censored patients have not experienced the event until time \(t_i\). Hence, considering Eq. (1) and following19, we model the time pdf as:
我们可以使用未经审查的患者提供的信息。但是,我们会浪费信息,因为我们知道被删失的患者直到时间 \(t_i\) 才经历过该事件。因此,考虑到方程 (1) 和以下19,我们将时间 pdf 建模为:

$$\begin{aligned} p_{\theta _2}(t_i\vert g_\phi (x_i, \epsilon _{i})) = h(t_i \vert g_\phi (x_i, \epsilon _{i}))^{d_i}S(t_i \vert g_\phi (x_i, \epsilon _{i})). \end{aligned}$$
$$\begin{aligned} p_{\theta _2}(t_i\vert g_\phi (x_i, \epsilon _{i})) = h(t_i \vert g_\phi (x_i, \epsilon _{i}))^{d_i}S(t_i \vert g_\phi (x_i, \epsilon _{i}))).\end{aligned}$$
(19)

Therefore, the hazard function term is only considered when the event has been experienced, when the data are not censored. This way, SAVAE incorporates information from censored observations, providing consistent parameter estimates.
因此,仅当经历过事件且数据未删失时,才会考虑灾害函数项。这样,SAVAE 整合了来自删失观察的信息,从而提供了一致的参数估计。

Regarding the distribution chosen for the time event, we have followed several publications such as8, where the Weibull distribution model is used. This distribution is two-parameter, with positive support, that is, \(p(t) = 0, \forall t < 0\). The two scalar parameters of the distribution are \(\lambda\) and \(\alpha\), where \(\lambda > 0\) controls the scale and \(\alpha > 0\) controls the shape as follows:
关于为时间事件选择的分布,我们遵循了几篇出版物,例如8,其中使用了 Weibull 分布模型。这个分布是双参数的,有正支持,即 \(p(t) = 0, \forall t < 0\)。分布的两个标量参数是 \(\lambda\)\(\alpha\),其中 \(\lambda > 0\) 控制尺度,\(\alpha > 0\) 控制形状,如下所示:

$$\begin{aligned} {\left\{ \begin{array}{ll} p(t; \alpha , \lambda ) = \frac{\alpha }{\lambda } \left( \frac{t}{\lambda } \right) ^{\alpha - 1} \exp { \left( -\left( \frac{t}{\lambda }\right) ^\alpha \right) }\\ S(t; \alpha , \lambda ) = \frac \exp {\left( -\left( \frac{t}{\lambda }\right) ^\alpha \right) }\\ h(t; \alpha , \lambda ) = \frac{p(t; \alpha , \lambda )}{S(t; \alpha , \lambda )} = \frac{\alpha }{\lambda }\left( \frac{t}{\lambda }\right) ^{\alpha -1} \end{array}\right. }. \end{aligned}$$
$$\begin{aligned} {\left\{ \begin{array}{ll} p(t; \alpha , \lambda ) = \frac{\alpha }{\lambda } \left( \frac{t}{\lambda } \right) ^{\alpha - 1} \exp { \left( -\left( \frac{t}{\lambda }\right) ^\alpha \right) }\\ S(t; \alpha , \lambda ) = \frac \exp {\left( -\left( \frac{t}{\lambda }\right) ^\alpha \right) }\\ h(t; \alpha , \lambda ) = \frac{p(t; \alpha , \lambda )}{S(t; \alpha , \lambda )} = \frac{\alpha }{\lambda }\left( \frac{t}{\lambda }\right) ^{\alpha -1} \end{array}\right. }.\end{aligned}$$
(20)

Although the Weibull distribution is our primary choice for modeling time-to-event data in SAVAE, it is crucial to highlight that other distributions are feasible as long as their hazard functions and CDFs can be analytically calculated. This versatility distinguishes SAVAE from other models. For example, the exponential distribution, a particular case of Weibull with \(\alpha = 1\), can represent constant hazard functions. Integrating alternative distributions, such as the exponential, into SAVAE is straightforward and only requires adjusting the terms in Eq. (19). The ability of SAVAE to predict the distribution parameters for each patient facilitates the calculation of various statistics, such as means, medians, and percentiles, providing flexibility beyond the models customized to a single distribution.
尽管 Weibull 分布是我们在 SAVAE 中对事件发生时间数据进行建模的主要选择,但重要的是要强调其他分布是可行的,只要它们的危险函数和 CDF 可以分析计算。这种多功能性使 SAVAE 与其他型号区分开来。例如,指数分布是 \(\alpha = 1\) 的 Weibull 的一个特殊情况,可以表示常数风险函数。将替代分布(例如指数)集成到 SAVAE 中很简单,只需要调整方程 (19) 中的项。SAVAE 预测每位患者的分布参数的能力有助于计算各种统计数据,例如平均值、中位数和百分位数,从而提供超越为单个分布定制的模型的灵活性。

Marginal log-likelihood computation
边际对数似然计算

Assigning distribution models to patient covariates in the reconstruction term is essential in SAVAE. This choice enables control over the resulting output variable distribution, but it also implies that the model approximates the chosen distribution even if the actual distribution differs. The third component of the ELBO (15) depends on the log-likelihood of the data, which for some representative distributions is:
在 SAVAE 中,将分布模型分配给重建项中的患者协变量是必不可少的。此选项可以控制生成的输出变量分布,但这也意味着即使实际分布不同,模型也会近似所选分布。ELBO 的第三个分量 (15) 取决于数据的对数似然,对于某些代表性分布,该对数似然为:

  • Gaussian distribution: Suitable for real-numbered variables (\(x_{i}^{l} \in (-\infty , +\infty )\)), it has parameters \(\mu \in (-\infty , +\infty )\) and \(\sigma \in (0, +\infty )\), known for its symmetric nature. Its log-likelihood function is:
    高斯分布:适用于实数变量(\(x_{i}^{l} \in (-\infty , +\infty )\)),它有参数 \(\mu \in (-\infty , +\infty )\)\(\sigma \in (0, +\infty )\),以其对称性而闻名。它的对数似然函数为:

    $$\begin{aligned} \begin{aligned} \log (p(x_{i}^{l};\mu , \sigma )) = -\log (\sigma \sqrt{2\pi }) - \frac{1}{2} \left( \frac{x_{i}^{l}-\mu }{\sigma }\right) ^2. \end{aligned} \end{aligned}$$
    $$\begin{aligned} \begin{aligned} \log (p(x_{i}^{l};\mu , \sigma )) = -\log (\sigma \sqrt{2\pi }) - \frac{1}{2} \left( \frac{x_{i}^{l}-\mu }{\sigma }\right) ^2.\end{对齐} \end{对齐}$$
    (21)
  • Bernoulli distribution: Applied to binary variables (\(x_{i}^{l} \in \{0,1\}\)), it has a single parameter \(\beta \in [0,1]\), representing the probability of \(x_{i}^{l}=1\).Its log-likelihood function is:
    伯努利分布:应用于二元变量(\(x_{i}^{l} \in \{0,1\}\)),它只有一个参数\(\beta \in [0,1]\),代表\(x_{i}^{l}=1\)的概率。它的对数似然函数为:

    $$\begin{aligned} \log (p(x_{i}^{l};\beta )) = x_{i}^{l}\log (\beta ) + (1-x_{i}^{l})\log (1-\beta ). \end{aligned}$$
    $$\begin{aligned} \log (p(x_{i}^{l};\beta )) = x_{i}^{l}\log (\beta ) + (1-x_{i}^{l})\log (1-\beta ). \end{aligned}$$
    (22)
  • Categorical distribution: Models discrete variables with K possible values. We can think of \(x_{i}^{l}\) as a categorical scalar random variable with K different values. Each possible outcome is assigned a probability \(\theta _k\) (note that \(\sum _{k=1}^K \theta _k = 1\)). The log-likelihood function can be computed based on the Probability Mass Function (PMF) following the expression:
    分类分布:使用 K 个可能值对离散变量进行建模。我们可以将 \(x_{i}^{l}\) 视为具有 K 个不同值的分类标量随机变量。每个可能的结果都被分配了一个概率 \(\theta _k\) (注意 \(\sum _{k=1}^K \theta _k = 1\))。对数似然函数可以根据表达式后面的概率质量函数 (PMF) 计算:

    $$\begin{aligned} \log (p(x_{i}^{l} \vert \theta _1, \theta _2,\ldots , \theta _k)) = \log \left( \prod _{k=1}^K \theta _k^{{\mathbb {I}}(x_{i}^{l}=k)}\right) , \end{aligned}$$
    $$\begin{aligned} \log (p(x_{i}^{l} \vert \theta _1, \theta _2,\ldots , \theta _k)) = \log \left( \prod _{k=1}^K \theta _k^{{\mathbb {I}}(x_{i}^{l}=k)}\right) , \end{aligned}$$
    (23)

    where the indicator function means:
    其中,指示符函数的含义为:

    $$\begin{aligned} {\mathbb {I}}(x_{i}^{l}=k) = {\left\{ \begin{array}{ll} 1 \quad x_{i}^{l} = k\\ 0 \quad x_{i}^{l} \ne k \end{array}\right. }. \end{aligned}$$
    $$\begin{aligned} {\mathbb {I}}(x_{i}^{l}=k) = {\left\{ \begin{array}{ll} 1 \quad x_{i}^{l} = k\\ 0 \quad x_{i}^{l} \ne k \end{array}\right. }. \end{aligned}$$
    (24)

Recall that other desired distributions can be implemented in SAVAE if their log-likelihood is differentiable.
回想一下,如果其他期望的分布的对数似然是可微的,则可以在 SAVAE 中实现。

Results 结果

This section proceeds with the experimental validation of SAVAE. First, we describe the survival data and the performance metrics used to validate the model. Then, we define the experimental setup (network architecture and training process). Finally, we analyze the different experiments carried out. The code can be found in https://github.com/Patricia-A-Apellaniz/savae.
本节继续进行 SAVAE 的实验验证。首先,我们描述了用于验证模型的生存数据和性能指标。然后,我们定义实验设置(网络架构和训练过程)。最后,我们分析了进行的不同实验。代码可以在 https://github.com/Patricia-A-Apellaniz/savae 中找到。

Survival data 生存数据

In SA datasets, each patient contributes information about whether events of interest occurred during a study period, categorizing them as censored or uncensored and indicating their respective follow-up times. To evaluate SAVAE, we trained it in nine diverse disease datasets, including WHAS, SUPPORT, GBSG, FLCHAIN, NWTCO, METABRIC, PBC, STD, and PNEUMON. We followed pre-processing procedures similar to state-of-the-art models, ensuring a fair evaluation of established benchmarks in SA.
在 SA 数据集中,每位患者都提供了有关研究期间是否发生感兴趣事件的信息,将其分类为删失或未删失,并指示他们各自的随访时间。为了评估 SAVAE,我们在 9 种不同的疾病数据集中对其进行了训练,包括 WHAS、SUPPORT、GBSG、FLCHAIN、NWTCO、METABRIC、PBC、STD 和 PNEUMON。我们遵循类似于最先进模型的预处理程序,确保对 SA 中已建立的基准进行公平评估。

The Worcester Heart Attack Study (WHAS)20 focuses on patients with acute myocardial infarction (AMI), providing clinical and demographic data. The Study to Understand Prognoses Outcomes and Risks of Treatment (SUPPORT)21 investigates seriously ill hospitalized adults and includes information on demographics, comorbidities, and physiological measurements. The Rotterdam & German Breast Cancer Study Group (GBSG)22,23 combines data from node-positive breast cancer patients and a chemotherapy trial. The FLCHAIN24 dataset studies the relationship between mortality and serum immunoglobulin-free Light Chains, which are essential in hematological disorders. NWTCO25 studies Wilms tumor in children, Molecular Taxonomy of Breast Cancer International Consortium (METABRIC)26 explores breast cancer, PBC27 focuses on Primary Biliary Cholangitis, STD deals with sexually transmitted diseases, and PNEUMON examines infant pneumonia.
伍斯特心脏病发作研究 (WHAS)20 侧重于急性心肌梗死 (AMI) 患者,提供临床和人口统计数据。了解预后结果和治疗风险的研究 (SUPPORT)21 调查了重病住院的成年人,包括有关人口统计学、合并症和生理测量的信息。鹿特丹和德国乳腺癌研究组(GBSG)22,23结合了淋巴结阳性乳腺癌患者的数据和一项化疗试验。FLCHAIN24 数据集研究了死亡率与血清无免疫球蛋白轻链之间的关系,轻链在血液系统疾病中至关重要。NWTCO25 研究儿童肾母细胞瘤,乳腺癌国际联盟分子分类学 (METABRIC)26 探讨乳腺癌,PBC27 侧重于原发性胆汁性胆管炎,性传播疾病处理性传播疾病,PNEUMON 检查婴儿肺炎。

Table 1 offers a more comprehensive view of the temporal aspects and occurrences of events within the various datasets considered. It becomes evident that a deliberate selection of diverse disease datasets has been made, each characterized by distinct types and quantities of information. This diversity in the disease datasets showcases the model’s versatility. Significantly, the evaluation of the model has been carried out systematically in datasets that show varying proportions of censored samples and differing time-to-event ranges. This strategic approach aims to provide a broader perspective on how the model might perform when applied to other real-world datasets.
1 提供了所考虑的各种数据集中事件的时间方面和发生情况的更全面视图。很明显,已经对不同的疾病数据集进行了精心选择,每个数据集都有不同的类型和数量的信息。疾病数据集的这种多样性展示了该模型的多功能性。值得注意的是,该模型的评估已在数据集中系统地进行,这些数据集显示了不同比例的删失样本和不同的事件发生时间范围。这种战略方法旨在提供更广泛的视角,了解模型在应用于其他实际数据集时的性能。

Table 1 Data information from datasets used to train SAVAE model. We have analyzed nine different disease datasets with different proportions of samples, censored data, and varying survival times.
表 1 来自用于训练 SAVAE 模型的数据集的数据信息。我们分析了 9 个不同的疾病数据集,这些数据集具有不同的样本比例、删失数据和不同的生存时间。

Performance metrics 性能指标

Recalling from the Survival Analysis Section, each dataset is described by \(D = (x_i, t_i, d_i)^N_{i=1}\) triplets, where \(x_i = x_{i}^{1},\ldots , x_{i}^{L}\) is an L-dimensional vector of covariates, \(t_i\) is the time to event and \(d_i \in \{0,1\}\) is the censoring indicator.
回顾生存分析部分,每个数据集都由 \(D = (x_i, t_i, d_i)^N_{i=1}\) 三元组描述,其中 \(x_i = x_{i}^{1},\ldots , x_{i}^{L}\) 是协变量的 L 维向量,\(t_i\) 是事件发生的时间,\(d_i \in \{0,1\}\) 是删失指标。

When evaluating an SA model, the literature shows that the most commonly used metric is the C-index, which is the generalization of the ROC curve for all data. It measures the rank correlation between predicted risk and observed times. The concept arises from the intuition that a higher risk of an event occurring has a complete relation with a short time to the event. Therefore, a high number of correlating pairs, i.e., pairs of samples that meet this expectation, is decisive to say that the model has good predictive quality.
在评估 SA 模型时,文献显示最常用的指标是 C 指数,它是所有数据的 ROC 曲线的泛化。它度量预测风险与观测时间之间的秩相关性。这个概念源于一种直觉,即事件发生的较高风险与事件发生的短时间完全相关。因此,大量相关对,即满足此期望的样本对,是说模型具有良好的预测质量的决定性因素。

In this case, the time-dependent C-index28 will be used since the original one29 cannot reflect the possible changes in risk over time being only computed at the initial observation time. This C-index is defined as follows:
在这种情况下,将使用与时间相关的 C 指数28,因为原始指数29 无法反映仅在初始观察时间计算的风险随时间可能的变化。此 C 指数定义如下:

$$\begin{aligned} \begin{aligned} C_{index} = P\Big ( {\hat{F}}(t\vert x_i) > {\hat{F}}(t\vert x_j)\vert d_i = 1, t_i <t_j, t_i \le t \Big ), \end{aligned} \end{aligned}$$
$$\begin{aligned} \begin{aligned} C_{index} = P\Big ( {\hat{F}}(t\vert x_i) > {\hat{F}}(t\vert x_j)\vert d_i = 1, t_i <t_j, t_i \le t \Big ), \end{aligned} \end{aligned}$$
(25)

where \({\hat{F}}(t\vert x_i)\) is the CDF estimated by the model at the time t given a set of covariates \(x_i\). The probability is estimated by comparing the relative risks pairwise, as already mentioned.
其中 \({\hat{F}}(t\vert x_i)\) 是模型在给定一组协变量 \(x_i\) 的情况下,在时间 t 时估计的 CDF。如前所述,通过成对比较相对风险来估计概率。

Based on the prediction index proposed30,31, the second evaluation metric that has been used in this analysis: Brier Score (BS). It is essentially a square prediction error based on the Inverse Probability of Censoring Weighting (IPCW)32, a technique designed to recreate an unbiased scenario compensating for censored samples by giving more weight to samples with similar features that are not censored. So, given a time t the BS can be calculated as follows, with \(G(\cdot )\) being the survival function corresponding to censoring (1/G(t) is the IPCW):
根据提议的预测指数30,31,本分析中使用的第二个评估指标:Brier 评分 (BS)。它本质上是基于删失加权逆概率 (IPCW)32 的平方预测误差,该技术旨在通过为具有未删失的相似特征的样本赋予更多权重来重新创建补偿删失样本的无偏场景。因此,给定时间 t,BS 可以计算如下,其中 \(G(\cdot )\) 是对应于删失的生存函数(1/Gt) 是 IPCW):

$$\begin{aligned} \begin{aligned} BS(t) = \frac{1}{N} \sum _{i=1}^N \Bigg [ \frac{ (S(t\vert x_i))^2}{G(t_i)} \cdot {\mathbb {I}}(t_i < t, d_i=1) + \frac{(1 - S(t\vert x_i))^2}{G(t)} \cdot {\mathbb {I}}(t_i \ge t) \Bigg ]. \end{aligned} \end{aligned}$$
$$\begin{aligned} \begin{aligned} BS(t) = \frac{1}{N} \sum _{i=1}^N \Bigg [ \frac{ (S(t\vert x_i))^2}{G(t_i)} \cdot {\mathbb {I}}(t_i < t, d_i=1) + \frac{(1 - S(t\vert x_i))^2}{G(t)} \cdot {\mathbb {I}}(t_i \ge t) \Bigg ]。\end{对齐} \end{对齐}$$
(26)

Since the C-index does not take into account the actual values of the predicted risk scores, BS can be used to assess calibration, i.e., if a model predicts a 10% risk of experiencing an event at time t, the observed frequency in the data should match this percentage for a well-calibrated model. On the other hand, it is also a measure of discrimination: whether a model can predict risk scores that allow us to determine the order of events correctly.
由于 C 指数不考虑预测风险评分的实际值,因此 BS 可用于评估校准,即,如果模型预测在时间 t 发生事件的风险为 10%,则数据中观察到的频率应与校准良好的模型的这个百分比相匹配。另一方面,它也是衡量辨别度的指标:模型是否可以预测风险评分,从而使我们能够正确确定事件的顺序。

In this case, the evaluation is made using the integral form of BS since it does not depend on the selection of a specific time t:
在这种情况下,使用 BS 的积分形式进行评估,因为它不依赖于特定时间 t 的选择:

$$\begin{aligned} IBS(t_{max}) = \frac{1}{t_{max}}\int _0^{t_{max}}BS(t)dt. \end{aligned}$$
$$\begin{aligned} IBS(t_{max}) = \frac{1}{t_{max}}\int _0^{t_{max}}BS(t)dt.\end{aligned}$$
(27)

To statistically assess each model’s performance based on the global C-index, we propose the Mean Reciprocal Rank (MRR) as the third metric. It measures the effectiveness of a prediction by considering the rank of the first relevant C-index within a list composed of the C-indices obtained from each model. Formally, the Reciprocal Rank (RR) for a set of results for each model is the inverse of the position of the first pertinent result. For example, if the first relevant result is in position 1, its RR is 1; if it is in position 2, the RR is 0.5; if it is in position 3, the RR is approximately 0.33, and so on. Thus, the MRR is the average of the RRs for a set of models:
为了根据全局 C 指数对每个模型的性能进行统计评估,我们建议将平均倒数秩 (MRR) 作为第三个指标。它通过考虑由从每个模型获得的 C 指数组成的列表中第一个相关 C 指数的排名来衡量预测的有效性。从形式上讲,每个模型的一组结果的倒数秩 (RR) 是第一个相关结果位置的倒数。例如,如果第一个相关结果位于位置 1,则其 RR 为 1;如果位于位置 2,则 RR 为 0.5;如果位于位置 3,则 RR 约为 0.33,依此类推。因此,MRR 是一组模型的 RR 的平均值:

$$\begin{aligned} MRR = \frac{1}{Q}\sum _{i=1}^{Q}\frac{1}{rank_i}, \end{aligned}$$
$$\begin{aligned} MRR = \frac{1}{Q}\sum _{i=1}^{Q}\frac{1}{rank_i}, \end{aligned}$$
(28)

where Q is the total number of models being compared, and \(rank_i\) is the position of the first relevant C-index for the \(i-th\) model. Higher MRR values indicate that relevant results appear higher in the list.
其中 Q 是被比较的模型的总数,\(rank_i\)\(i-th\) 模型的第一个相关 C-index 的位置。MRR 值越高,表明相关结果在列表中显示得靠前。

Additionally, to add more statistical information on the performance of the models, we performed hypothesis testing to compare the mean C-index and IBS values of our model with those of the state-of-the-art models in multiple folds since we are using a five-fold cross-validation method. Specifically, we formulated a null hypothesis that assumes that the mean performance metrics of the state-of-the-art models are more significant than our model’s mean performance metrics. To assess the validity of this null hypothesis, we used p-values as a statistical measure. We established a significance threshold of 0.05, a common practice in hypothesis testing. When the obtained p-value for each case fell below this threshold, we rejected the null hypothesis. In practical terms, this indicated that our model exhibited superior performance compared to the other models. On the contrary, if the p-value exceeded 0.05, we concluded that there were no statistically significant differences between our model and the others. It is important to note that this approach considered variations in results across different folds, providing a more comprehensive assessment of model performance beyond just the average results. Given the multiple hypothesis tests performed, we acknowledge that the Family-Wise Error Rate (FWER)33 increases as the number of tests grows, as established in the literature34,35. This increase in FWER raises the risk of Type I errors, where false positives may occur due to the accumulation of multiple tests. To address this issue, we have applied an appropriate method to control this inflation, the Holm adjustment36. The results of these adjustments can be seen in the Supplementary Information, ensuring the robustness of our findings.
此外,为了添加更多有关模型性能的统计信息,我们进行了假设检验,以将模型的平均 C 指数和 IBS 值与最先进模型的平均 C 指数和 IBS 值进行多次比较,因为我们使用的是五重交叉验证方法。具体来说,我们制定了一个原假设,该假设假设最先进模型的平均性能指标比我们模型的平均性能指标更显著。为了评估这个零假设的有效性,我们使用 p 值作为统计度量。我们建立了 0.05 的显著性阈值,这是假设检验的常见做法。当每个个案获得的 p 值低于此阈值时,我们拒绝了原假设。实际上,这表明我们的模型与其他模型相比表现出卓越的性能。相反,如果 p 值超过 0.05,我们得出结论,我们的模型与其他模型之间没有统计学上的显著差异。需要注意的是,这种方法考虑了不同折叠的结果变化,提供了对模型性能的更全面的评估,而不仅仅是平均结果。鉴于执行的多重假设检验,我们承认家庭错误率 (FWER)33 随着检验数量的增加而增加,如文献34,35 所示。FWER 的增加增加了 I 类错误的风险,由于多次测试的积累,可能会出现假阳性。为了解决这个问题,我们采用了一种适当的方法来控制这种通货膨胀,即霍尔姆调整36。 这些调整的结果可以在补充信息中看到,从而确保我们研究结果的稳健性。

Finally, we performed a sensitivity analysis to assess the robustness of our model and to understand how variations in the input data influence its predictions. This analysis provides insights into the impact of individual features on the model’s performance and contributes to a better understanding of the model’s decision-making process. Furthermore, we analyzed the computational complexity of our model by comparing its runtime with state-of-the-art models. This analysis considers the time required for training and validating SAVAE across multiple datasets, providing insights into its efficiency relative to other methods. The findings illustrate that while SAVAE’s computational demands are higher due to its complex architecture, they remain manageable, making it suitable for practical applications. The detailed results of these analyses can be found in the Supplementary Information.
最后,我们进行了敏感性分析,以评估模型的稳健性,并了解输入数据的变化如何影响其预测。此分析提供了对各个特征对模型性能影响的见解,并有助于更好地了解模型的决策过程。此外,我们通过将模型的运行时间与最先进的模型进行比较来分析模型的计算复杂性。该分析考虑了跨多个数据集训练和验证 SAVAE 所需的时间,从而深入了解其相对于其他方法的效率。研究结果表明,虽然 SAVAE 的计算需求由于其复杂的架构而更高,但它们仍然是可管理的,使其适合实际应用。这些分析的详细结果可以在补充信息中找到。

Experimental setting 实验设置

The implementation of SAVAE was executed using the PyTorch framework37. As defined in Section ELBO derivation , three different DNNs were trained, consisting of one encoder and two decoders. These decoders were designed to infer covariates and time parameters, respectively. The Gaussian encoder exhibits a straightforward architecture characterized by a single hidden linear layer featuring a Rectified Linear Unit (ReLU) activation function and an output linear layer with hyperbolic tangent activation. The input to this encoder consists of the covariate vectors from the training dataset, while the output generates a Gaussian latent space. The dimensionality of this latent space has been fixed to 5. The generated latent space is input for both decoders, each featuring two linear layers. The first layer employs a ReLU activation function and incorporates a dropout rate of 20%. However, the final layer of the decoders employs different activation functions based on the specified distribution, thereby tailoring the output to the parameters of the respective covariate distribution. Furthermore, the number of neurons in each hidden layer was also fixed at 50. The training process involved 3,000 epochs with a batch size of 64 samples while incorporating an Early Stop mechanism in case of an insufficient reduction in validation loss.
SAVAE 的实现是使用 PyTorch 框架37 执行的。如 Section ELBO derivation 中所定义,训练了三种不同的 DNN,由一个编码器和两个解码器组成。这些解码器旨在分别推断协变量和时间参数。高斯编码器具有简单的架构,其特点是具有修正线性单元 (ReLU) 激活函数的单个隐藏线性层和具有双曲正切激活的输出线性层。该编码器的输入由来自训练数据集的协变量向量组成,而输出则生成一个高斯潜在空间。这个潜在空间的维数被固定为 5。生成的潜在空间是两个解码器的输入,每个解码器都有两个线性层。第一层采用 ReLU 激活函数,并包含 20% 的 dropout 率。但是,解码器的最后一层根据指定的分布采用不同的激活函数,从而根据相应协变量分布的参数定制输出。此外,每个隐藏层中的神经元数量也固定为 50。训练过程涉及 3000 个 epoch,批量大小为 64 个样本,同时在验证损失减少不足的情况下加入了 Early Stop 机制。

To better understand the behavior of the SAVAE model and to justify the selection of the defined hyperparameters, we conducted an ablation study. This study analyzed how changes in key hyperparameters, such as latent space dimensionality, number of neurons, and dropout rates, affect model performance. We could identify settings that balance performance and computational efficiency by systematically varying these parameters. The results of this ablation study are provided in the Supplementary Information, offering further insights into the rationale behind our chosen hyperparameter configuration.
为了更好地理解 SAVAE 模型的行为并证明选择定义的超参数的合理性,我们进行了一项消融研究。本研究分析了关键超参数(例如潜在空间维度、神经元数量和退出率)的变化如何影响模型性能。我们可以通过系统地改变这些参数来确定平衡性能和计算效率的设置。这项消融研究的结果在补充信息中提供,进一步了解我们选择的超参数配置背后的基本原理。

We used a five-fold cross-validation technique to evaluate the results while ensuring their robustness against data partitioning. This method was applied to our model and the state-of-the-art models used for performance comparison and result evaluation, including Cox-PH, DeepHit, and DeepSurv. Moreover, due to the inherent sensitivity of VAE architectures to initial conditions, we conducted training using up to 10 different random seeds. Subsequently, the C-index was averaged among the three best-performing seeds. The average performance of the three seeds provides a representative and sufficient evaluation. Lastly, note that the three state-of-the-art models have been implemented using the Pycox package38 and the different metrics used for validation, C-index, and IBS. The MRR has been calculated manually, while the p-value has been obtained using the SciPy39 package.
我们使用了五重交叉验证技术来评估结果,同时确保它们对数据分区的稳健性。该方法应用于我们的模型以及用于性能比较和结果评估的最先进模型,包括 Cox-PH 、 DeepHit 和 DeepSurv 。此外,由于 VAE 架构对初始条件具有固有的敏感性,我们使用多达 10 种不同的随机种子进行了训练。随后,将 C 指数平均为表现最好的 3 种种子。三个种子的平均表现提供了具有代表性和充分的评估。最后,请注意,这三个最先进的模型是使用 Pycox 包38 和用于验证、C-index 和 IBS 的不同指标实现的。MRR 是手动计算的,而 p 值是使用 SciPy39 包获得的。

Experiments and results 实验和结果

In this section, we present a comprehensive assessment of the performance of our proposed model, SAVAE, compared to three well-established state-of-the-art models. Cox-PH, DeepSurv, and DeepHit. Across multiple datasets encompassing a diverse range of medical and clinical scenarios, we conducted extensive experiments to assess the performance of these models. The key focus was evaluating their ability to predict survival outcomes, considering censored and uncensored data points.
在本节中,我们全面评估了我们提出的模型 SAVAE 与三个成熟的最先进模型的性能。Cox-PH、DeepSurv 和 DeepHit。在涵盖各种医疗和临床场景的多个数据集中,我们进行了广泛的实验来评估这些模型的性能。重点是评估他们预测生存结果的能力,考虑删失和无删失的数据点。

As the initial set of results, we focus on comparing the performance and results in terms of the C-index. Table 2 provides a comprehensive view of how our model is completely comparable to the state-of-the-art models regarding the average C-index. Additionally, note that all intervals for the minimum and maximum values across various folds overlap, indicating consistent performance across different data subsets. The results displayed in the table reveal that our model consistently achieves a higher MRR compared to others across multiple datasets, showcasing its superiority in many cases regarding the average C-index. However, it is essential to acknowledge that the C-index results among the different models are generally similar, highlighting the competitiveness of our model within the field. Furthermore, it is important to note that the broad intervals are primarily attributed to the limited sample sizes commonly found in medical databases, a characteristic that poses challenges when assessing model performance. To address this issue, we employed cross-validation, as previously mentioned, ensuring that our model’s performance is robust and reliable. In summary, while our model demonstrates its strength by outperforming other models in terms of MRR and achieving competitive average C-index scores, the overall similarity in C-index results underscores its robustness and suitability for various medical datasets.
作为初始结果集,我们专注于根据 C 指数比较性能和结果。表 2 全面展示了我们的模型如何与平均 C 指数方面最先进的模型完全可比。此外,请注意,不同折叠中最小值和最大值的所有间隔都重叠,这表明不同数据子集的性能一致。表中显示的结果表明,与多个数据集中的其他模型相比,我们的模型始终保持更高的 MRR,在许多情况下展示了其在平均 C 指数方面的优势。然而,必须承认不同模型之间的 C 指数结果通常是相似的,这突出了我们的模型在该领域的竞争力。此外,重要的是要注意,宽区间主要归因于医学数据库中常见的样本量有限,这一特性在评估模型性能时构成了挑战。为了解决这个问题,如前所述,我们采用了交叉验证,以确保我们模型的性能稳健可靠。总之,虽然我们的模型通过在 MRR 方面优于其他模型并取得有竞争力的平均 C 指数分数来证明其优势,但 C 指数结果的总体相似性强调了其稳健性和对各种医学数据集的适用性。

Table 2 C-index average results across different folds for each state-of-the-art model.
表 2 每个最先进模型在不同折叠下的平均 C 指数结果。
Table 3 p-values obtained to determine whether the mean of SAVAE is greater than the state-of-the-art folds C-indexes.
表 3 为确定 SAVAE 的平均值是否大于最先进的折叠 C 指数而获得的 p 值。

In our validation process, we performed a statistical analysis using p-values to determine whether our model exhibited superior performance in terms of the C-index. To carry out this analysis, we compared the average C-index of our model with the mean C-index values obtained from multiple folds for each state-of-the-art model. The objective was to determine whether the performance of our model was statistically better than the alternative models. We established a significance threshold of 0.05, a common practice in hypothesis testing. Our findings in Table 3 reveal several instances in which our model outperformed the state-of-the-art models, as evidenced by p-values below the 0.05 threshold. These results highlight the effectiveness and competitiveness of our proposed approach. This comprehensive analysis, which considers the diverse C-index values in multiple folds, provides a robust evaluation of the model’s performance, extending beyond simple average comparisons.
在我们的验证过程中,我们使用 p 值进行了统计分析,以确定我们的模型在 C 指数方面是否表现出卓越的性能。为了进行这项分析,我们将模型的平均 C 指数与每个最先进模型从多个折叠中获得的平均 C 指数值进行了比较。目的是确定我们模型的性能在统计学上是否优于替代模型。我们建立了 0.05 的显著性阈值,这是假设检验的常见做法。我们在表 3 中的研究结果揭示了我们的模型优于最先进模型的几个实例,低于 0.05 阈值的 p 值证明了这一点。这些结果突出了我们提出的方法的有效性和竞争力。这种综合分析将不同的 C 指数值考虑在多个方面,提供了对模型性能的可靠评估,超越了简单的平均比较。

Our validation through IBS values (Tables 4 and 5) yielded conclusions that closely parallel those derived from the C-index analysis. Overall, it is essential to note that our model’s IBS results align closely with the state-of-the-art models, demonstrating comparable performance. However, our proposed model consistently demonstrated competitiveness and emerged as the top performer in the various datasets used in our study. This convergence of results across different evaluation metrics reinforces the robustness and effectiveness of our novel approach. While our model maintains a competitive edge within the context of the state-of-the-art models, further solidifying its potential and utility in the field of SA, it also stands out as a top-performing solution.
我们通过 IBS 值(表 45)进行验证,得出的结论与 C 指数分析得出的结论非常相似。总体而言,必须注意的是,我们模型的 IBS 结果与最先进的模型密切相关,表现出可比的性能。然而,我们提出的模型始终表现出竞争力,并在我们研究中使用的各种数据集中成为表现最好的。不同评估指标的结果的这种融合加强了我们新方法的稳健性和有效性。虽然我们的模型在最先进的模型背景下保持了竞争优势,进一步巩固了其在 SA 领域的潜力和实用性,但它也是一个表现最好的解决方案。

It is essential to recall that, like DeepSurv and Cox-PH, SAVAE is a parametric model. However, unlike these models, we do not limit ourselves to the exponential distribution to model survival time. Our approach allows for the use of any differentiable distribution. Unlike DeepHit, which trains the model using loss functions, our framework uses likelihood functions, providing considerable flexibility. We specifically assumed the Weibull distribution for these experiments, deriving the shape parameter \(\alpha\) and the scale parameter \(\lambda\) for each patient, although any differentiable distribution could have been used. This ability enables us to extract vital statistical information for personalized patient treatments, offering a significant advantage in medical applications.
必须记住,与 DeepSurv 和 Cox-PH 一样,SAVAE 是一个参数模型。然而,与这些模型不同的是,我们并不局限于指数分布来模拟生存时间。我们的方法允许使用任何可微分布。与使用损失函数训练模型的 DeepHit 不同,我们的框架使用似然函数,提供了相当大的灵活性。我们特别假设了这些实验的 Weibull 分布,得出每个患者的形状参数 \(\alpha\) 和尺度参数 \(\lambda\),尽管可以使用任何可微分布。这种能力使我们能够提取重要的统计信息,用于个性化的患者治疗,在医疗应用中具有显著优势。

Table 4 IBS average results across different folds for each state-of-the-art model.
表 4 每种最先进型号在不同折叠下的平均 IBS 结果。
Table 5 p-values obtained to determine whether the mean of SAVAE is greater than the state-of-the-art folds IBS values.
表 5 为确定 SAVAE 的平均值是否大于最先进的折叠 IBS 值而获得的 p 值。

Conclusion 结论

In this paper, we have successfully described an SA model (SAVAE), which stands out for its ability to avoid assumptions that can limit performance in real-world scenarios. It is a model based on VAEs in charge of estimating continuous or discrete survival times, first, modeling complex non-linear relations among covariates due to the use of highly expressive DNNs, and second, taking advantage of a combination of loss functions that capture the censoring inherent to survival data. Our model demonstrates efficiency compared to various state-of-the-art models, namely Cox-PH, DeepSurv, and DeepHit, because of its freedom from assumptions related to linearity and proportional hazards. In contrast to DeepHit, which directly learns the C-Index metric, we train using standard likelihood techniques. Note that this means that our approach is more flexible, as it allows using many different distributions to model the data, and the performance is competitive, as it performs well in C-Index and IBS, instilling confidence in its capabilities.
在本文中,我们成功地描述了一个 SA 模型 (SAVAE),该模型因其能够避免在实际场景中限制性能的假设而脱颖而出。这是一个基于 VAE 的模型,负责估计连续或离散生存时间,首先,由于使用高表达性 DNN 而对协变量之间的复杂非线性关系进行建模,其次,利用捕获生存数据固有删失的损失函数组合。与各种最先进的模型(即 Cox-PH、DeepSurv 和 DeepHit)相比,我们的模型展示了效率,因为它不受与线性和比例风险相关的假设的影响。与直接学习 C-Index 指标的 DeepHit 相比,我们使用标准似然技术进行训练。请注意,这意味着我们的方法更加灵活,因为它允许使用许多不同的分布来对数据进行建模,并且性能具有竞争力,因为它在 C-Index 和 IBS 中表现良好,从而对其功能充满信心。

Furthermore, the adaptability of our model is a notable strength. While we have assumed specific distributions for both survival times and covariates in our experiments, SAVAE’s versatility extends to accommodating any other parametric distribution, as long as their CDF and hazard function are differentiable, making it a scalable tool. Notably, our model can efficiently handle censoring to mitigate bias, introducing a novel improvement in results. However, it is essential to acknowledge that the model’s reliance on specific parametric distributions could pose limitations. If the chosen distribution does not align well with the underlying data distribution, the model may perform suboptimally. This is a known challenge in parametric survival analysis models, and further research could explore more flexible non-parametric or semi-parametric approaches to address this limitation40.
此外,我们模型的适应性是一个显着的优势。虽然我们在实验中假设了生存时间和协变量的特定分布,但 SAVAE 的多功能性扩展到适应任何其他参数分布,只要它们的 CDF 和风险函数是可微的,使其成为一个可扩展的工具。值得注意的是,我们的模型可以有效地处理删失以减少偏差,从而在结果中引入了新的改进。但是,必须承认模型对特定参数分布的依赖可能会带来限制。如果所选分布与基础数据分布不一致,则模型的性能可能不佳。这是参数生存分析模型中的一个已知挑战,进一步的研究可以探索更灵活的非参数或半参数方法来解决这一限制40

This work raises several attractive lines for the future. Since the parameters estimated by SAVAE are subject to statistical uncertainty, we propose as future work using Monte Carlo sampling from the latent space to derive confidence intervals for survival predictions, providing more robust patient-wise survival curves with associated margins of error. An additional advantage lies in our model’s architecture, where time and covariates are reconstructed from latent space information. This feature opens opportunities for its utility to be expanded to various tasks that have been developed using VAEs, including clustering41, imputation of missing data42, and data augmentation43 by the generation of synthetic patients. Thus, this tool has great potential and can be exploited in future work to have different functionalities even in the world of Federated Learning44,45.
这项工作为未来提出了几条有吸引力的线路。由于 SAVAE 估计的参数受统计不确定性的影响,我们建议作为未来的工作,使用来自潜在空间的蒙特卡洛采样来得出生存预测的置信区间,提供更稳健的患者生存曲线和相关误差幅度。另一个优势在于我们模型的架构,其中时间和协变量是根据潜在空间信息重建的。此功能为其实用程序扩展到使用 VAE 开发的各种任务提供了机会,包括聚类41、缺失数据的插补42 和通过生成合成患者进行数据增强43。因此,该工具具有巨大的潜力,即使在联邦学习44,45 的世界中,也可以在未来的工作中利用它来具有不同的功能。

In summary, SAVAE emerges as a versatile and robust SA model, surpassing state-of-the-art methods while offering extensibility to a broader range of healthcare applications. It presents a compelling solution for healthcare professionals seeking enhanced performance and adaptability in SA tasks.
总之,SAVAE 是一种多功能且强大的 SA 模型,超越了最先进的方法,同时为更广泛的医疗保健应用提供了可扩展性。它为寻求增强 SA 任务性能和适应性的医疗保健专业人员提供了一个引人注目的解决方案。