Introduction 导言

Genomic instability is a hallmark of human cancer which leads to copy number alterations (CNAs) in cancer cell genomes, and extensive intra-tumor heterogeneity1,2,3. It is well established that CNAs of driver oncogenes and tumor suppressors are causal determinants that change the fitness of cancer cells4,5, leading to clonal expansions, clone-clone variation6 and tumor evolution. In addition to impacting specific genes, CNAs often span chromosome arms or whole chromosomes and therefore potentiate transcriptional impact across hundreds of genes with a single genomic event. Recent reports on the extent of cell-to-cell variation of CNAs in tumors (including in well understood oncogenes)3 raise the critical question of how granular subpopulations are phenotypically impacted by subclonal CNAs. Importantly, phenotypic impact of subclonal CNAs can have both cell intrinsic effects and act as cell-extrinsic determinants of the tumor microenvironment7, further illustrating the importance of dissecting how CNAs modulate phenotypic intra-tumor heterogeneity.
基因组不稳定性是人类癌症的一大特征,它会导致癌细胞基因组中的拷贝数改变(CNAs)和广泛的肿瘤内异质性 1,2,3 。 已有研究证实,驱动癌基因和肿瘤抑制因子的CNAs是改变癌细胞适应性的决定性因素 4,5 ,会导致克隆扩增、克隆-克隆变异 6 和肿瘤进化。除了影响特定基因外,CNAs 还经常跨越染色体臂或整条染色体,因此单个基因组事件就能对数百个基因的转录产生潜在影响。最近关于肿瘤中 CNA 的细胞间变异程度的报道(包括已被充分理解的致癌基因) 3 提出了一个关键问题:亚克隆 CNA 如何对颗粒亚群的表型产生影响。重要的是,亚克隆CNA对表型的影响既有细胞内在的影响,也有肿瘤微环境的细胞外决定因素 7 ,这进一步说明了研究CNA如何调节肿瘤内表型异质性的重要性。

Previous studies using bulk sequencing techniques have investigated the association between clonal CNAs and gene expression8,9,10,11. The expression level of a gene can be influenced by copy-number dosage effects reflected by the significant positive correlation between gene expression and the underlying copy number (CN)12. However, gene dosage effects are not deterministic and may be subject to compensatory mechanisms, rendering the impact of CNAs on expression as highly variable across the genome. Transcriptional adaptive mechanisms13 including epigenetic modifications and downstream transcriptional regulation, can modulate CN dosage effects14,15,16, further obscuring the direct impact of gene dosage. For example, the expression of certain immune response pathways often exhibit both CNA-dependent and CNA-independent expression8.
以往使用批量测序技术的研究调查了克隆 CNA 与基因表达之间的关联 8,9,10,11 。基因的表达水平会受到拷贝数剂量效应的影响,基因表达与底层拷贝数(CN)之间存在显著的正相关 12 。 然而,基因剂量效应并不是决定性的,可能会受到补偿机制的影响,使得CNAs对表达的影响在整个基因组中变化很大。转录适应机制 13 包括表观遗传修饰和下游转录调控,可调节 CN 剂量效应 14,15,16 ,进一步模糊了基因剂量的直接影响。例如,某些免疫反应途径的表达往往同时表现出依赖 CNA 和不依赖 CNA 的表达 8

Theoretically, measuring single cell RNA and DNA data should elucidate how genotypes translate to phenotypes at single cell resolution. Technologies that sequence both RNA and DNA modalities co-registered in the same cell would be ideal for linking genomic alterations to transcriptional changes in tumor evolution. However, pioneering technologies17,18 have had limited throughput, lower quality and are still not mature enough for large-scale profiling of cancer cells. Sequencing single cell RNA or DNA independently allows more cells to be profiled and reveals a more comprehensive view of the cell populations, but requires computational integration of the two data modalities.
从理论上讲,测量单细胞 RNA 和 DNA 数据应能阐明基因型如何在单细胞分辨率下转化为表型。对同一细胞中的RNA和DNA模式进行测序的技术是将基因组变化与肿瘤演变过程中的转录变化联系起来的理想选择。然而,先驱技术 17,18 的通量有限,质量较低,仍不够成熟,无法对癌细胞进行大规模分析。对单细胞RNA或DNA进行独立测序可以分析更多的细胞,更全面地了解细胞群,但需要对两种数据模式进行计算整合。

Several computational methods have been proposed for joint analysis of single cell DNA and RNA data. CloneAlign19 is a probabilistic framework to assign transcriptional profiles to genomic subclones based on the assumption that the expression level of a gene is proportional to its underlying copy number. More recent methods SCATrEx20 and CCNMF21 are also based on this assumption but use different methods to model the similarity between copy number profiles and gene expression patterns. However, these methods do not consider the possibility that transcriptional effects of copy number could be variable between genes and therefore lack the specificity to decipher genes that may be subject to dosage effects from those that are independent of CNAs. In addition, these methods either require using predefined subclones from scDNA data or specify the number of subclones as input which may propagate errors of uninformative subclones or may miss more granular gene dosage effects. More importantly, the revelation of phenotypic plasticity as a driver of genetically independent transcription in cancer cells22,23,24 motivates the need to disentangle genetic from epigenetic mechanisms. No available methods directly model dosage effects of subclonal CNAs, which is critical to infer which genes are deterministically modulated by subclonal CNAs and which genes are independent of CNAs. Moreover, recent advances have illuminated the extent to which allele-specific copy number alterations can mark clonal haplotypes both in DNA-based3 and RNA-based25 single cell analysis, illustrating both a methodological gap and analytical opportunity for integration.
目前已经提出了几种计算方法,用于联合分析单细胞 DNA 和 RNA 数据。CloneAlign 19 是一种概率框架,根据基因表达水平与其基本拷贝数成正比的假设,为基因组亚克隆分配转录谱。最近的 SCATrEx 20 和 CCNMF 21 方法也是基于这一假设,但使用了不同的方法来模拟拷贝数图谱和基因表达模式之间的相似性。不过,这些方法没有考虑拷贝数的转录效应在不同基因间可能存在差异的可能性,因此缺乏特异性,无法区分可能受剂量效应影响的基因和不受 CNA 影响的基因。此外,这些方法要么要求使用来自 scDNA 数据的预定义子克隆,要么指定子克隆的数量作为输入,这可能会传播无信息子克隆的错误,或可能会错过更细粒度的基因剂量效应。更重要的是,表型可塑性是癌细胞中基因独立转录的驱动因素 22,23,24 ,这促使人们需要将遗传机制与表观遗传机制区分开来。现有的方法都无法直接模拟亚克隆 CNA 的剂量效应,而剂量效应对于推断哪些基因受亚克隆 CNA 的决定性调控以及哪些基因独立于 CNA 至关重要。此外,最近的研究进展表明,在基于 DNA 的 3 和基于 RNA 的 25 单细胞分析中,等位基因特异性拷贝数改变可在多大程度上标记克隆单倍型,这既说明了方法学上的差距,也提供了整合分析的机会。

In this study, we address the questions of how subclonal CNAs drive phenotypic divergence and evolution in cancer cells, and quantitatively encode (allele specific) CN dosage effects in this process. We present a Bayesian method, TreeAlign, to enumerate and define CNA-driven clone-specific phenotypes, and also a statistical framework to compare the transcriptional readouts of genomically defined clones. TreeAlign implements a Bayesian probabilistic model that maps gene expression profiles from scRNA to genomic subclones from scDNA which i) can refine subclone definition from single cell phylogenies through a recursive process suggested by transcriptional divergence, ii) explicitly models dosage effects of each gene and iii) models allele-specific CNAs to better resolve clonal mappings. Through extensive benchmarking, we demonstrate that TreeAlign outperforms alternative approaches in both clone assignment and gene dosage effect prediction. Applying TreeAlign to both primary tumors and cancer cell lines, we characterized the phenotypic differences between tumor subclones, investigated the contribution of subclonal CNAs to clone-specific gene expression patterns in cancer cells and common expression programs which are altered by subconal CNAs.
在这项研究中,我们探讨了亚克隆 CNA 如何驱动癌细胞的表型分化和进化,以及在这一过程中定量编码(等位基因特异性)CN 剂量效应的问题。我们提出了一种贝叶斯方法 TreeAlign,用于列举和定义 CNA 驱动的克隆特异性表型,还提出了一个统计框架,用于比较基因组学定义的克隆的转录读数。TreeAlign 采用贝叶斯概率模型,将 scRNA 中的基因表达谱映射到 scDNA 中的基因组亚克隆,该模型 i) 可以通过转录分化所建议的递归过程,从单细胞系统发生中完善亚克隆定义;ii) 明确模拟每个基因的剂量效应;iii) 模拟等位基因特异性 CNA,以更好地解析克隆映射。通过广泛的基准测试,我们证明 TreeAlign 在克隆分配和基因剂量效应预测方面都优于其他方法。将 TreeAlign 应用于原发性肿瘤和癌细胞系,我们描述了肿瘤亚克隆之间的表型差异,研究了亚克隆 CNA 对癌细胞中克隆特异性基因表达模式的贡献,以及亚克隆 CNA 改变的常见表达程序。

Results 成果

TreeAlign: a probabilistic graphical model for clone assignment and dosage effect inference
TreeAlign:用于克隆分配和剂量效应推断的概率图形模型

We developed TreeAlign, a probabilistic graphical model which maps scRNA sequenced cells to scDNA-derived subclones. TreeAlign employs a recursive algorithm for delineating subclones within phylogenies constructed from scDNA data (Fig. 1a). The model jointly infers clone assignments, clone-specific CN dosage effects and (optionally), models clone-specific allelic transcriptional effects (Fig. 1b). The TreeAlign framework assumes a subset of genes with positively correlated expressions to their underlying copy numbers. For each gene, expression is modeled by k, where k {0, 1} is a Bernoulli variable such that the probability p(k = 1) represents the probability the gene has clone-specific CN dosage effects (Fig. 1c). This encoding allows us to frame the problem as a conditional probability distributon which separates the expected expression into two components. To infer clone assignments and p(k), TreeAlign requires three inputs: (1) a cell × gene matrix of raw read counts from scRNA-seq, (2) a cell × gene copy number matrix estimated from scDNA data and (3) a phylogenetic tree (or optionally, predetermined clone labels) from scDNA profiles. TreeAlign can either assign expression profiles to predefined clone labels, similar to CloneAlign19 or can operate on a phylogenetic tree directly to assign cells to clades of the phylogeny (Fig. 1a). When using a phylogenetic tree, a Bayesian hierarchical model is recursively applied starting from the root of the tree, computing the probability that expression profiles in scRNA can be mapped to a subtree. The stopping condition of the recursion is satisfied when the genomic or phenotypic differences between two subtrees become too small to allow confident assignment of expression profiles.
我们开发了 TreeAlign,这是一种概率图形模型,可将 scRNA 测序细胞映射到 scDNA 衍生的亚克隆。TreeAlign 采用了一种递归算法,用于在由 scDNA 数据构建的系统发育中划分亚克隆(图 1a)。该模型可联合推断克隆分配、克隆特异性 CN 剂量效应以及(可选)克隆特异性等位基因转录效应模型(图 1b)。TreeAlign 框架假定基因子集的表达与其基础拷贝数呈正相关。对于每个基因,表达量由 k 来建模,其中 k ∈ {0, 1} 是一个伯努利变量,概率 p(k = 1) 代表该基因具有克隆特异性 CN 剂量效应的概率(图 1c)。通过这种编码,我们可以把问题看作是一个条件概率分布,它将预期表达分成两个部分。要推断克隆分配和 p(k),TreeAlign 需要三个输入:(1) 来自 scRNA-seq 的原始读数的细胞×基因矩阵;(2) 来自 scDNA 数据估计的细胞×基因拷贝数矩阵;(3) 来自 scDNA 图谱的系统发生树(或可选的预定克隆标签)。TreeAlign 既可以将表达谱分配到预定义的克隆标签,类似于 CloneAlign 19 ,也可以直接在系统发生树上操作,将细胞分配到系统发生的支系中(图 1a)。在使用系统发生树时,贝叶斯层次模型从树的根部开始递归应用,计算 scRNA 中的表达谱映射到子树的概率。当两个子树之间的基因组或表型差异太小,无法确信表达谱的分配时,递归的停止条件就满足了。

Fig. 1: Overview of TreeAlign.
图 1:TreeAlign 概述。
figure 1

a TreeAlign takes raw count data from scRNA-seq, the copy number matrix and the phylogenetic tree from scDNA-seq. By recursively assigning the expression profiles to phylogenetic subtrees, TreeAlign infers the clone-of-origin of cells identified in scRNA-seq and the dosage effects of subclonal CNAs. b Allelic imbalance can be inferred from DNA data and RNA data. We assume a positive correlation between the two measurements to improve clone assignment. c Graphical model of TreeAlign.
a TreeAlign 采用 scRNA-seq 的原始计数数据、拷贝数矩阵和 scDNA-seq 的系统发生树。通过将表达谱递归到系统发育子树,TreeAlign 可以推断出在 scRNA-seq 中识别出的细胞的原生克隆以及亚克隆 CNA 的剂量效应。c TreeAlign 的图形模型。

In addition to altered gene expression levels, allele-specific CNAs also lead to allele-specific expression imbalance which is detectable in scRNA data3,26 (Fig. 1b). For example, genomic segments harboring loss of heterozygosity (LOH) deterministically leads to mono-allelic expression of genes in the segment while allelic imbalance owing to allele specific gains will skew the relative expression of specific alleles. To exploit how allelic imbalance modulates allele specific expression, we extended TreeAlign to model both total CN and allelic imbalance (Fig. 1c, Fig. S1). Given the B allele frequencies (BAFs) estimated from scDNA haplotype blocks using, for example, SIGNALS3 and allele-specific expression at corresponding heterozygous SNPs in scRNA data, the allele-specific model contributes to clone assignment and infers the probability of the allele assignment p(a = 1), a {0, 1}, which indicates whether the SNP is on allele B or not. The total copy and allele-specific components of the probabilistic graphical model combine to form the ‘integrated model’.
除了改变基因表达水平外,等位基因特异性 CNA 还会导致等位基因特异性表达失衡,这在 scRNA 数据 3,26 中可以检测到(图 1b)。例如,携带杂合性缺失(LOH)的基因组片段会决定性地导致片段中基因的单等位基因表达,而等位基因特异性增益导致的等位基因不平衡则会使特定等位基因的相对表达出现偏差。为了探讨等位基因失衡如何调节等位基因特异性表达,我们对 TreeAlign 进行了扩展,以模拟总 CN 和等位基因失衡(图 1c,图 S1)。鉴于使用 SIGNALS 3 等方法从 scDNA 单倍型区块估算出的 B 等位基因频率(BAF)以及 scRNA 数据中相应杂合 SNP 的等位基因特异性表达,等位基因特异性模型有助于克隆分配,并推断出等位基因分配的概率 p(a=1),a ∈ {0, 1},这表明 SNP 是否在等位基因 B 上。概率图形模型的总拷贝和等位基因特异性部分组合成 "综合模型"。

The software for all models of TreeAlign (https://github.com/shahcompbio/TreeAlign) is implemented in Python using Pyro27 and is publicly available. Our implementation allows users to run the total CN model, allele-specific model and integrated model by providing different inputs. See “Methods” for additional mathematical, inference and implementation details.
TreeAlign ( https://github.com/shahcompbio/TreeAlign) 所有模型的软件都是用 Pyro 27 在 Python 中实现的,并已公开发布。我们的实现允许用户通过提供不同的输入运行全基因组模型、等位基因特异性模型和综合模型。更多数学、推理和实现细节请参见 "方法"。

Performance of TreeAlign on simulated data
TreeAlign 在模拟数据上的性能

We first evaluated TreeAlign on synthetic datasets, quantifying the effect of three main parameters in the input data: number of cells (100–5000), number of genes (100–1000) and proportions of genes with dosage effects (10–100%). Simulations were performed using the generative model of CloneAlign19. We compared the performance of assigning expression profiles to ground truth predefined clones between TreeAlign, CloneAlign and InferCNV28. InferCNV was originally developed for inferring CNAs from gene expression data, but has also been repurposed for clone assignment in some studies29. InferCNV analysis in this context acts as a way of inferring clone assignment without the benefit of the scDNA data. Compared to CloneAlign and InferCNV, TreeAlign performed significantly better in terms of clone assignment accuracy especially in the regime where fewer genes exhibit dosage effects (Fig. 2a). For example, in the regime of 60% of genes with dosage effects (1000 cells, 500 genes), TreeAlign achieved mean clone assignment accuracy of 91.1%, compared to CloneAlign with 75.1% accuracy. The improvement in clone assignment accuracy was consistent across all cell and gene number simulation scenarios (Fig. S2). We also tested performance with phylogenetic tree inputs to evaluate if TreeAlign could achieve similar results on tree input compared to pre-defined clone input. Similar to the ‘clone’ regime, these simulations varied the proportion of genes with gene dosage effects in 10% increments. TreeAlign was able to assign expression profiles back to the corresponding clades of the phylogeny with similar accuracies compared to the clone input in regimes with >40% genes with dosage effects (Fig. 2b, Fig. S3). Together these evaluations reflect that the model effectively obviates a priori tree cutting without paying a penalty in accurate clone mapping.
我们首先在合成数据集上对 TreeAlign 进行了评估,量化了输入数据中三个主要参数的影响:细胞数(100-5000)、基因数(100-1000)和具有剂量效应的基因比例(10-100%)。模拟使用了 CloneAlign 19 的生成模型。我们比较了 TreeAlign、CloneAlign 和 InferCNV 在将表达谱分配给基本真实的预定义克隆方面的性能 28 。 InferCNV 最初是为从基因表达数据中推断 CNA 而开发的,但在一些研究中也被重新用于克隆分配 29 。 在这种情况下,InferCNV 分析是一种推断克隆分配的方法,而无需借助 scDNA 数据。与 CloneAlign 和 InferCNV 相比,TreeAlign 在克隆分配准确性方面的表现要好得多,尤其是在表现出剂量效应的基因较少的情况下(图 2a)。例如,在 60% 的基因具有剂量效应的情况下(1000 个细胞,500 个基因),TreeAlign 的克隆分配平均准确率为 91.1%,而 CloneAlign 的准确率为 75.1%。克隆分配准确率的提高在所有细胞和基因数模拟情况下都是一致的(图 S2)。我们还测试了系统发育树输入的性能,以评估 TreeAlign 能否在系统发育树输入上取得与预定义克隆输入类似的结果。与 "克隆 "机制类似,这些模拟以 10%的增量改变了具有基因剂量效应的基因比例。与克隆输入相比,在剂量效应基因比例大于 40% 的情况下,TreeAlign 能够以相似的精确度将表达谱分配回系统发生的相应支系(图 2b,图 S3)。 这些评估结果共同表明,该模型有效地避免了先验地砍伐树木,同时又不影响精确的克隆映射。

Fig. 2: Performance of TreeAlign on simulated data.
图 2:TreeAlign 在模拟数据上的性能。
figure 2

a Clone assignment accuracy of TreeAlign, CloneAlign and InferCNV on simulated datasets (500 cells, 1000 genes, 3 clones) containing varying proportions of genes with CN dosage effects. Brackets: P values with two-sided Wilcoxon signed-rank test. For the box plot, box limits extend from the 25th to 75th percentile, while the middle line represents the median. Whiskers extend to the largest value no further than 1.5 times the inter-quartile range (IQR) from each box hinge. Points beyond the whiskers are outliers. b Phylogenetic tree (left) of cells from patient 081 constructed using scDNA data. Heat map (right) of clone assignment by TreeAlign. Each column shows the assignment of simulated expression profiles to subtrees of the phylogeny. The bar chart above shows the overall accuracy of clone assignment. c Scatter plots comparing inferred gene dosage effect frequencies and the simulated frequencies. Each panel groups genes with similar expression levels from low expression genes (0–10%, with normalized expression between 0.00076–0.008) to high expression genes (90–100%, with normalized expression between 1.7 and 5.6). Pearson correlation coefficients (R) and P values for the linear fit (Two-sided Student’s t-test) are shown. Source data are provided as a Source Data file.
a TreeAlign、CloneAlign 和 InferCNV 在含有不同比例 CN 剂量效应基因的模拟数据集(500 个细胞、1000 个基因、3 个克隆)上的克隆分配准确率。括号:P 值采用双侧 Wilcoxon 符号秩检验。方框图中,方框范围从第 25 百分位数到第 75 百分位数,中间线代表中位数。晶须的最大值不超过每个方框边缘四分位数间距(IQR)的 1.5 倍。b 利用 scDNA 数据构建的 081 号患者细胞的系统发生树(左)。TreeAlign 克隆分配热图(右)。每一列都显示了模拟表达谱对系统发生子树的分配。c 比较推断基因剂量效应频率和模拟频率的散点图。每个面板将具有相似表达水平的基因分组,从低表达基因(0-10%,归一化表达在 0.00076-0.008 之间)到高表达基因(90-100%,归一化表达在 1.7-5.6 之间)。图中显示了线性拟合的皮尔逊相关系数(R)和 P 值(双侧学生 t 检验)。源数据作为源数据文件提供。

We also evaluated the accuracy of predicting dosage effects for each gene in the input datasets. We compared the simulated and predicted (using p(k) as an estimate) frequency of genes with CN dosage effects. For high expression genes, simulated and predicted frequencies were highly concordant (Fig. 2c). For datasets with ≥50% of genes with dosage effects, the mean area under the receiver-operator curve (AUC) was ≥0.99 for genes with relatively high expression level (genes in top 40% in terms of normalized expression levels) (Fig. S4). We compared p(k) to a baseline estimation of CN dosage effects which is the per-gene Pearson correlation coefficient (R) of CN and expression after fitting CloneAlign. p(k) from TreeAlign had an overall higher AUC compared to R from CloneAlign for predicting CN dosage effects. This establishes that p(k) captures gene dosage effects and has the ability to distinguish genes with dosage effects from those without dosage effects.
我们还评估了预测输入数据集中每个基因剂量效应的准确性。我们比较了具有 CN 剂量效应的基因的模拟频率和预测频率(使用 p(k) 作为估计值)。对于高表达基因,模拟频率和预测频率高度一致(图 2c)。对于剂量效应基因≥50%的数据集,相对高表达水平基因(归一化表达水平前40%的基因)的平均接受者操作曲线下面积(AUC)≥0.99(图S4)。我们将 p(k)与 CN 剂量效应的基线估计值(即拟合 CloneAlign 后 CN 与表达的每基因皮尔逊相关系数 (R))进行了比较。这证明 p(k) 能够捕捉基因剂量效应,并有能力区分有剂量效应和无剂量效应的基因。

We then investigated how allele-specific information improves clone assignment with synthetic datasets. We simulated BAFs for varying numbers (0, 250, 500, 750 and 1000) of heterozygous SNPs with allelic-imbalance and simulated allele-specific expression from these SNPs using the generative model of allele-specific TreeAlign. We applied the integrated model on these synthetic datasets which contained total CN and allelic information, and confirmed that clone assignment accuracy was improved when more SNPs were included (Figs. S5 and S6).
然后,我们利用合成数据集研究了等位基因特异性信息如何改善克隆分配。我们模拟了具有等位基因不平衡的不同数量(0、250、500、750 和 1000)杂合 SNP 的 BAF,并使用等位基因特异性 TreeAlign 的生成模型模拟了这些 SNP 的等位基因特异性表达。我们在这些包含全部 CN 和等位基因信息的合成数据集上应用了集成模型,结果证实,当包含更多 SNP 时,克隆分配的准确性得到了提高(图 S5 和 S6)。

TreeAlign assigns HGSC expression profiles to phylogeny accurately
TreeAlign 将 HGSC 表达谱准确地分配到系统发育中

We next investigated TreeAlign’s performance on real-world patient derived data from high grade serous ovarian cancer (HGSC). We first applied TreeAlign on single cell sequencing data from an HGSC patient (patient 022)7. Tumor samples were obtained from both left and right adnexa sites of the patient. scDNA (n = 1050 cells) and scRNA (n = 4134 cells) data were generated through Direct Library Preparation (DLP+)30 and 10x genomics single-cell RNA-seq31 respectively. 3579 (86.6%) ovarian cancer cells profiled by scRNA were assigned to 4 subclones identified by scDNA-seq. The expression profiles of clone C and D are overlapped on the UMAP embedding, while separated from the profiles of clone A and clone B, which coincides with the shorter phylogenetic distance between clone C and D (Fig. 3a). The separation of cells by assigned clones on the expression-based UMAP also suggests that the genetic subclones possess distinct transcriptional phenotypes.
接下来,我们研究了 TreeAlign 在来自高级别浆液性卵巢癌(HGSC)的实际患者数据上的表现。我们首先将 TreeAlign 应用于一名 HGSC 患者(022 号患者)的单细胞测序数据 7 。scDNA(n=1050个细胞)和scRNA(n=4134个细胞)数据分别通过直接文库制备(DLP+) 30 和10倍基因组学单细胞RNA-seq 31 生成。3579个(86.6%)通过scRNA分析的卵巢癌细胞被归入通过scDNA-seq鉴定的4个亚克隆。克隆 C 和 D 的表达谱在 UMAP 包埋图上重叠,而与克隆 A 和克隆 B 的表达谱分离,这与克隆 C 和 D 之间较短的系统发育距离相吻合(图 3a)。在基于表达的 UMAP 上按指定克隆分离细胞也表明基因亚克隆具有不同的转录表型。

Fig. 3: TreeAlign assigns HGSC expression profiles to phylogeny accurately.
图 3:TreeAlign 将 HGSC 表达谱准确地分配到系统发育中。
figure 3

a UMAP plot of scRNA-data from patient 022 colored by clone labels assigned by TreeAlign. b Correlation between clone frequencies of patient 022 estimated by scRNA-data (x axis) and scDNA data (y axis). Pearson correlation coefficients (R) and P values for the linear fit (Two-sided Student’s t-test) are shown. c Single cell phylogenetic tree of patient 022 constructed with scDNA data (left). Pie charts on the tree showing how TreeAlign assigns cell expression profiles to subtrees recursively. The pie charts are colored by the proportions of cell expression profiles assigned to downstream subtrees. The outer ring color of the pie charts denotes the current subtree. For example, the leftmost pie chart represents the proportions of cells assigned to the two main subtrees. The outer ring represents the root of the phylogeny. The red proportion of the pie chart represents the subtree on the top or clone A. The blue proportion represents the bottom subtree which contains clone B, C and D. Left heat map, total copy number from scDNA; right heat map, InferCNV corrected expression from scRNA; middle Sankey chart, clone assignments from RNA to DNA. d Normalized expression of CLDN16 in clone A and clone B-D (Two-sided Wilcoxon signed-rank test). Source data are provided as a Source Data file. e Scaled expression and copy number profiles for regions on chromosome 9 and 12 as a function of genes ordered by genomic location.
a 根据 TreeAlign 分配的克隆标签着色的 022 号患者 scRNA 数据 UMAP 图。 b 根据 scRNA 数据(x 轴)和 scDNA 数据(y 轴)估算的 022 号患者克隆频率之间的相关性。c 利用 scDNA 数据构建的患者 022 的单细胞系统发生树(左)。树上的饼图显示了 TreeAlign 如何将细胞表达谱递归到子树上。饼图按分配给下游子树的细胞表达谱比例着色。饼图的外圈颜色表示当前子树。例如,最左边的饼图代表分配给两个主子树的细胞比例。外圈代表系统发生的根。左侧热图,来自 scDNA 的总拷贝数;右侧热图,来自 scRNA 的 InferCNV 校正表达;中间桑基图,来自 RNA 到 DNA 的克隆分配。e 9 号染色体和 12 号染色体上各区域的比例表达量和拷贝数图谱是按基因组位置排序的基因的函数。

We confirmed the clone assignment accuracy of TreeAlign by comparing the clonal frequencies estimated by RNA and DNA data (Fig. 3b). As both scRNA and scDNA data were generated by sampling from the same populations of cells, the clonal frequency estimated by the two methods should be consistent. Clonal frequencies in the left and right adnexa sample from the two modalities were significantly correlated (R = 0.99, P = 9 × 10−7). In addition, copy number alterations inferred for scRNA cells using InferCNV28 were concordant with the scDNA based CNA of the clones to which scRNA cells were assigned (Fig. 3c). For example, notable clone specific copy number changes can be seen in both scDNA and scRNA on chromosome X in clone A. Clone B, C and D-specific amplification on 3q can also be observed in both scDNA and scRNA (Fig. 3d). By comparing the RNA-derived copy number profiles with scDNA data, we noticed that inferring copy number from RNA data is not always accurate. For example, the inferred profiles missed the focal amplification on chromosome 18. We also held out genes from chromosome 9 and chromosome 12 and re-ran TreeAlign with the remaining genes. 98.8% cells assigned by both the full and held-out dataset had consistently clone assignment labels. Clone level gene expression on chromosome 9 and 12 was consistent with the corresponding copy numbers (Fig. 3e). These results demonstrated a proof of principle that TreeAlign can properly integrate scRNA and scDNA datasets and highlighted that scDNA-seq can provide valuable information on CNAs and tumor subclonal structures which would be difficult to detect with expression data only.
我们通过比较 RNA 和 DNA 数据估计的克隆频率,证实了 TreeAlign 的克隆分配准确性(图 3b)。由于 scRNA 和 scDNA 数据都是从相同的细胞群中取样生成的,因此两种方法估算的克隆频率应该是一致的。两种方法得出的左侧和右侧附件样本的克隆频率有明显的相关性(R = 0.99,P = 9 × 10 −7 )。此外,使用 InferCNV 28 推断出的 scRNA 细胞拷贝数变化与 scRNA 细胞所属克隆的基于 scDNA 的 CNA 一致(图 3c)。例如,在克隆 A 的 X 染色体上,scDNA 和 scRNA 都能看到明显的克隆特异性拷贝数变化;在 scDNA 和 scRNA 中还能观察到克隆 B、C 和 D 在 3q 上的特异性扩增(图 3d)。通过比较 RNA 导出的拷贝数图谱与 scDNA 数据,我们注意到从 RNA 数据推断拷贝数并不总是准确的。例如,推断出的图谱遗漏了 18 号染色体上的病灶扩增。我们还保留了 9 号染色体和 12 号染色体上的基因,并用其余基因重新运行 TreeAlign。完整数据集和保留数据集所分配的 98.8% 细胞具有一致的克隆分配标签。9 号和 12 号染色体上克隆水平的基因表达与相应的拷贝数一致(图 3e)。这些结果证明了 TreeAlign 可以正确整合 scRNA 和 scDNA 数据集,并强调了 scDNA-seq 可以提供有价值的 CNA 和肿瘤亚克隆结构信息,而这些信息仅靠表达数据是很难检测到的。

We also applied TreeAlign to previously published data from a gastric cell line NCI-N87 generated by 10x genomics single-cell CNV and 10x scRNA assays32. TreeAlign assigned 3212 cells from scRNA to three clones identified in scDNA. The clonal frequencies estimated by both assays were closely aligned (Fig. S7). As for the patient 022 data, the scRNA cells showed subclonal copy number similar to the scDNA clones to which they were assigned, thus illustrating that TreeAlign also performs well on platforms other than DLP+.
我们还将 TreeAlign 应用于之前发表的由 10 倍基因组学单细胞 CNV 和 10 倍 scRNA 检测 32 生成的胃细胞系 NCI-N87 的数据。TreeAlign 将 scRNA 中的 3212 个细胞分配给了 scDNA 中确定的三个克隆。两种检测方法估计的克隆频率非常接近(图 S7)。至于患者 022 的数据,scRNA 细胞显示的亚克隆拷贝数与它们被分配到的 scDNA 克隆相似,从而说明 TreeAlign 在 DLP+ 以外的平台上也有良好的表现。

Incorporating allele specific expression increases clone assignment resolution
结合等位基因特异性表达提高克隆分配分辨率

We next investigated the extent to which accurate clone assignment solely based on allele specific expression could be performed. We inferred allele specific CN and BAF in scDNA data from patient 022 using SIGNALS3. The allele specific heat map (Fig. 4a) revealed characteristic patterns of clonal LOH in whole chromosomes (e.g. chr 6, 13, 14, 17) as well as subclonal losses (e.g. chr 9q in clone A and parallel losses on chr 5 across multiple subclones). With the allele-specific model, we assigned cells from scRNA to clone A as identified by scDNA in patient 022. Clone assignments were consistent between the allele specific model and the total CN model with 87% cells concordant. The clone-specific frequencies of reads from B allele in scRNA accurately reflected scDNA BAF (Fig. S8a), with the exception of SNPs on chromosome X which showed allelic imbalance in scRNA but not in scDNA due to X-inactivation. These results suggest that allelic imbalance information can be effectively exploited for clonal mapping.
接下来,我们研究了仅根据等位基因特异性表达就能准确分配克隆的程度。我们使用 SIGNALS 3 在患者 022 的 scDNA 数据中推断等位基因特异性 CN 和 BAF。等位基因特异性热图(图 4a)显示了整个染色体(如 6、13、14、17 号染色体)以及亚克隆丢失(如克隆 A 中的 9q 号染色体和多个亚克隆中 5 号染色体的平行丢失)的克隆 LOH 特征模式。利用等位基因特异性模型,我们将来自 scRNA 的细胞分配到克隆 A 中,克隆 A 是由患者 022 的 scDNA 确定的。等位基因特异性模型和全 CN 模型的克隆分配是一致的,87% 的细胞是一致的。scRNA中来自B等位基因的克隆特异性读数频率准确地反映了scDNA的BAF(图S8a),但X染色体上的SNPs除外,它们在scRNA中显示出等位基因失衡,而在scDNA中则没有,原因是X失活。这些结果表明,等位基因不平衡信息可以有效地用于克隆作图。

Fig. 4: Incorporating allele specific expression increases clone assignment resolution.
图 4:纳入等位基因特异性表达可提高克隆分配分辨率。
figure 4

a Integrated TreeAlign model assigns expression profiles to phylogeny of patient 022. Left heat map, single cell BAF profiles estimated from scDNA data using SIGNALS, annotated with clone labels on the left side (BAF profiles without clone label represent cells ignored by TreeAlign) (Methods). b Proportions of reads from B allele in scRNA-data for clone D.1 and D.2 at region chr5:72,798,845-135,518,242. c BAF of subclones with scDNA. Heatmap color represents BAF following the scale in panel a. d Proportions of reads from B allele for subclones in scRNA. Heatmap color represents the proportion of reads from B allele following the scale in panel a. e, Correlation between % of reads from B allele in scRNA and BAF estimated with scDNA in patient 022. Annotations at the top indicate the Pearson correlation coefficient (R) and P value derived from a linear regression. Color represents BAF following the scale in panel a. f ROC curves for predicting p(a = 1) with allele-specific TreeAlign and integrated TreeAlign. Haplotype phasing from SIGNALS was treated as ground truth. g Robustness of clone assignment to gene subsampling in patient 022. Adjusted Rand index was calculated by comparing clone assignments using subsampled datasets (n = 10 subsampled datasets for each condition) to the complete dataset. h Robustness of clone assignment to cell subsampling in patient 022 (n = 10 subsampled datasets for each condition). For the box plots in g, h, box limits extend from the 25th to 75th percentile, while the middle line represents the median. Whiskers extend to the largest value no further than 1.5 times the inter-quartile range (IQR) from each box hinge. Points beyond the whiskers are outliers. Source data are provided as a Source Data file.
a 整合的 TreeAlign 模型将表达谱分配到 022 号患者的系统发育中。左侧热图,使用 SIGNALS 从 scDNA 数据估算的单细胞 BAF 图谱,左侧标注有克隆标签(无克隆标签的 BAF 图谱代表 TreeAlign 忽略的细胞)(方法)。 b 位于 chr5:72,798,845-135,518,242 区域的克隆 D.1 和 D.2 的 scRNA 数据中来自 B 等位基因的读数比例。热图颜色表示按照 a 面板中的比例计算的 BAF。 d scRNA 中来自 B 等位基因的亚克隆读数比例。热图颜色代表来自 B 等位基因的读数比例,比例如图 a 所示。 e, ScRNA 中来自 B 等位基因的读数百分比与用 scDNA 估计的 022 号患者 BAF 之间的相关性。顶部的注释表示线性回归得出的皮尔逊相关系数(R)和 P 值。f 用等位基因特异性 TreeAlign 和整合 TreeAlign 预测 p(a = 1) 的 ROC 曲线。g 在 022 号患者中,克隆分配对基因子取样的稳健性。通过比较使用子取样数据集(每种情况 n = 10 个子取样数据集)和完整数据集的克隆分配,计算调整后的 Rand 指数。 h 022 号患者的克隆分配对细胞子取样的稳健性(每种情况 n = 10 个子取样数据集)。对于 g、h 中的方框图,方框范围从第 25 百分位数到第 75 百分位数,中间线代表中位数。晶须延伸至最大值,距离每个方框边缘不超过四分位数间距(IQR)的 1.5 倍。超出晶须的点为异常值。源数据以源数据文件的形式提供。

We then applied the integrated model utilizing both total CN and allele-specific information on data from patient 022. Relative to the total CN model, the integrated model mapped scRNA cells to smaller subclones (Fig. 4a). Specifically, we note when considering allele specificity, Clone B was subdivided into two subclones (B.1 and B.2). Clone B.1 had an additional deletion at 16q leading to LOH, whereas Clone B.2 had an amplification at 11q with increased BAF (Fig. 4a). Clone D was further divided into four subclones (D.1, D.2, D.3 and D.4). Clone D.1 and clone D.2 both had a deletion on chromosome 5, but the deletion events occurred on different alleles in the two subclones with different breakpoints, each of which was distinct from the 5q deletion on Clone B, indicative that parallel evolution is indeed reflected in transcription with the allele specific model (Fig. 4b). Moreover, allele-specific copy number profiles of individual clones (Fig. 4c) were broadly reflected in the allele-specific expression profiles from scRNA data (Fig. 4d). We computed proportions of B allele reads at each heterozygous SNP for each of the subclones assigned from the scRNA data. Subclonal BAF estimated with scDNA data and proportions of reads from B allele from scRNA were significantly correlated (0.25 < R < 0.53 for each subclone, P < 2.2 × 10−22) (Fig. 4e; Fig. S8c), consistent with more accurate clone assignment. With integrated TreeAlign, we also achieved better performance for predicting allele assignment parameter a of SNPs compared to the allele-specific model (Fig. 4f). We note that recent identifications of parallel allelic-specific alterations whereby maternal and paternal alleles are independently lost or gained in different cells3,26,33 would further complicate clonal mapping, if allele specificity is not taken into account. Here we show that mono-alleleic expression of maternal and paternal alleles is consistent with coincident maternal and paternal allelic loss in different clones (Fig. 4b). The allele-specific TreeAlign model correctly assigns cells at this level of granularity that would otherwise be missed.
然后,我们将利用总 CN 和等位基因特异性信息的综合模型应用于 022 号患者的数据。与总 CN 模型相比,综合模型将 scRNA 细胞映射到更小的亚克隆上(图 4a)。具体来说,我们注意到在考虑等位基因特异性时,克隆 B 被细分为两个亚克隆(B.1 和 B.2)。克隆 B.1 在 16q 处有一个额外的缺失,导致 LOH,而克隆 B.2 在 11q 处有一个扩增,导致 BAF 增加(图 4a)。克隆 D 又分为四个亚克隆(D.1、D.2、D.3 和 D.4)。克隆 D.1 和克隆 D.2 在 5 号染色体上都有缺失,但在两个亚克隆中,缺失事件发生在不同的等位基因上,断点也不同,每一个都与克隆 B 上的 5q 缺失不同,这表明等位基因特异性模型确实反映了转录的平行进化(图 4b)。此外,单个克隆的等位基因特异性拷贝数图谱(图 4c)大致反映了来自 scRNA 数据的等位基因特异性表达图谱(图 4d)。我们计算了根据 scRNA 数据分配的每个亚克隆的每个杂合 SNP 的 B 等位基因读数比例。用 scDNA 数据估算的亚克隆 BAF 与来自 scRNA 的 B 等位基因读数比例显著相关(每个亚克隆的 0.25 < R < 0.53,P < 2.2 × 10 −22 )(图 4e;图 S8c),这与更准确的克隆分配一致。与等位基因特异性模型相比,利用集成的 TreeAlign,我们在预测 SNP 的等位基因分配参数 a 方面也取得了更好的效果(图 4f)。 我们注意到,如果不考虑等位基因的特异性,最近发现的平行等位基因特异性改变(母系和父系等位基因在不同细胞中独立丢失或获得 3,26,33 )会使克隆图谱更加复杂。在这里,我们发现母系和父系等位基因的单等位基因表达与不同克隆中母系和父系等位基因的同时缺失是一致的(图 4b)。等位基因特异性 TreeAlign 模型在这种粒度水平上正确地分配了细胞,否则这些细胞就会被遗漏。

The predicted allele assignments of SNPs from the allele-specific model were consistent with haplotype phasing from scDNA reported by SIGNALS (AUC = 0.84). With the integrated model considering total CN expression, the prediction of allele assignments of SNPs can be further improved (AUC = 0.88) (Fig. 4f, Methods). We compared the performance of total CN, allele-specific and integrated TreeAlign using subsampled datasets of patient 022 and evaluating against results from the full dataset. Compared to the total CN model, the integrated model performed significantly better when fewer genomic regions were included in the input suggesting it is more robust when there are fewer copy number differences between subclones (Fig. 4g). Both the total CN and integrated model were robust to reduced numbers of cells (Fig. 4h). The allele-specific model without total CN is inferior, as expected.
等位基因特异性模型预测的 SNP 等位基因分配与 SIGNALS 报告的 scDNA 单倍型相位一致(AUC = 0.84)。在考虑了 CN 总表达量的综合模型中,SNP 的等位基因分配预测结果得到了进一步改善(AUC = 0.88)(图 4f,方法)。我们使用 022 号患者的子采样数据集比较了总 CN、等位基因特异性和集成 TreeAlign 的性能,并与完整数据集的结果进行了评估。与全 CN 模型相比,当输入中包含的基因组区域较少时,整合模型的表现明显更好,这表明当亚克隆之间的拷贝数差异较少时,整合模型更稳健(图 4g)。总 CN 模型和综合模型对细胞数量减少都很稳健(图 4h)。不包含总 CN 的等位基因特异性模型不如预期。

To investigate the influence of inaccurate phylogeny input on TreeAlign, we randomly selected different proportions of CN profiles from scDNA and shuffled their cell labels in patient 022. With more cell labels being shuffled, the tree will become less accurate in reflecting the true phylogeny of the population. When less than 20% of cells were shuffled, TreeAlign was able to resolve the same number of subclones as with the original data (Fig. S9). When more than 50% cells were shuffled, TreeAlign failed and assigned all expression profiles to the unassigned state. These results suggest that TreeAlign can tolerate inaccurate phylogeny input to some extent.
为了研究不准确的系统发生输入对 TreeAlign 的影响,我们从 scDNA 中随机选取了不同比例的 CN 图谱,并对 022 号患者的细胞标签进行了洗牌。细胞标签被洗牌的次数越多,树在反映群体真实系统发生方面的准确性就越低。当洗牌的细胞少于 20% 时,TreeAlign 能够解析与原始数据相同数量的亚克隆(图 S9)。当洗牌的细胞数超过 50%时,TreeAlign 就会失败,并将所有表达谱分配到未分配状态。这些结果表明 TreeAlign 可以在一定程度上容忍不准确的系统发生输入。

Inferring copy number dosage effects in human cancer data
推断人类癌症数据中的拷贝数剂量效应

We next compared the integrated model to the total CN model on a recently published cohort of cell lines and patient derived xenografts (PDXs) with scDNA and scRNA matched data (Fig. 5a) from Funnell et al.3. We applied TreeAlign on data from PDXs of Triple Negative Breast Cancer (TNBC) (n = 3) and HGSC (n = 6). In addition, we tested the model on one ovarian cancer control cell line and 6 184-hTERT cell lines engineered to induce genomic instability from a diploid background with CRISPR loss of function of TP53 combined with BRCA1 or BRCA2. Both integrated and total CN TreeAlign were run on matched DLP+ and 10x scRNA-seq data (Figs. S24S53). The integrated model was fitted for 1-10 rounds and the total CN model was fitted for 1-3 rounds when we ran TreeAlign with the phylogeny input (Figs. S10 and S11). The integrated model only failed to assign expression profiles to any subclones for cell line SA906a, due to a low number of genes (n = 32) with CN differences and heterozygous SNPs (n = 7) with BAF differences between subclones. In comparison, the total CN model failed in 8 cases due to lack of allelic information. As expected, the integrated model characterized more clones (Fig. 5b) and achieved a lower number of cells not confidently assigned to a subclone (Fig. 5c). For cells that were assigned confidently by the integrated model but not the total CN model, their InferCNV-corrected expression showed higher correlation coefficients with the CN profiles of subclones assigned by the integrated model compared to random subclones (Fig. 5d; Fig. S14), implying better performance of the integrated model.
接下来,我们在 Funnell 等人最近发表的一组细胞系和患者衍生异种移植物 (PDX) 的 scDNA 和 scRNA 匹配数据(图 5a)中,将综合模型与全 CN 模型进行了比较。 3 .我们将 TreeAlign 应用于三阴性乳腺癌(TNBC)(n = 3)和 HGSC(n = 6)的 PDX 数据。此外,我们还在一个卵巢癌对照细胞系和 6 个 184-hTERT 细胞系上测试了该模型,这些细胞系是通过 CRISPR 缺失 TP53 与 BRCA1 或 BRCA2 的功能来诱导二倍体背景的基因组不稳定性。对匹配的 DLP+ 和 10x scRNA-seq 数据(图 S24-S53)运行了综合 CN TreeAlign 和总 CN TreeAlign。当我们使用系统发生输入运行 TreeAlign 时,综合模型拟合了 1-10 轮,而全 CN 模型拟合了 1-3 轮(图 S10 和 S11)。对于细胞系 SA906a,整合模型仅未能为任何亚克隆分配表达谱,原因是亚克隆之间存在 CN 差异的基因(n = 32)和存在 BAF 差异的杂合 SNP(n = 7)数量较少。相比之下,由于缺乏等位基因信息,全 CN 模型有 8 个失败。不出所料,整合模型表征了更多的克隆(图 5b),而且不能确定分配到子克隆的细胞数量较少(图 5c)。与随机亚克隆相比,综合模型有把握分配到而总 CN 模型没有分配到的细胞,其 InferCNV 校正表达与综合模型分配到的亚克隆的 CN 图谱显示出更高的相关系数(图 5d;图 S14),这意味着综合模型的性能更好。

Fig. 5: Incorporating allele specific expression increases clone assignment resolution.
图 5:纳入等位基因特异性表达可提高克隆分配分辨率。
figure 5

a Heat map representations of genes in CSCN regions in HGSC PDX SA1053BX1XB01603. Top heat map: clone-level total CN from scDNA; bottom heat map: InferCNV-corrected expression profiles from scRNA; bottom track: p(k) estimated by TreeAlign. b Number of clones characterized by total CN and integrated model (n = 15, Two-sided Wilcoxon signed-rank test). c Frequencies of unassigned cells (Methods) from total CN and integrated model (n = 15, Two-sided Wilcoxon signed-rank test). d Distribution of Pearson correlation coefficients (R) between scDNA-estimated total CN and InferCNV-corrected expression for cells assigned by the integrated model but unassigned by the total CN model (n = 8829 cells, Two-sided Wilcoxon signed-rank test). Left, correlation distribution calculated by comparing InferCNV profiles to CN profiles of a random subclone; Right, correlation distribution calculated by comparing InferCNV profiles to CN profiles of subclones assigned by integrated TreeAlign. e Variance of p(k) sampled from the same genomic regions and across regions (n = 314 regions, Two-sided Wilcoxon signed-rank test). For the box plots in be, box limits extend from the 25th to 75th percentile, while the middle line represents the median. Whiskers extend to the largest value no further than 1.5 times the inter-quartile range (IQR) from each box hinge. Points beyond the whiskers are outliers. Source data are provided as a Source Data file.
a HGSC PDX SA1053BX1XB01603 中 CSCN 区域基因的热图表示。顶部热图:来自 scDNA 的克隆级总 CN;底部热图:b 以总 CN 和整合模型为特征的克隆数量(n = 15,双侧 Wilcoxon 符号秩检验)。 c 总 CN 和整合模型中未分配细胞(方法)的频率(n = 15,双侧 Wilcoxon 符号秩检验)。d 对于综合模型分配但总 CN 模型未分配的细胞,scDNA 估算的总 CN 与 InferCNV 校正表达之间的皮尔逊相关系数(R)分布(n = 8829 个细胞,双侧 Wilcoxon 符号秩检验)。左图,将 InferCNV 图谱与随机子克隆的 CN 图谱进行比较计算得出的相关性分布;右图,将 InferCNV 图谱与集成 TreeAlign 分配的子克隆的 CN 图谱进行比较计算得出的相关性分布。 e 从相同基因组区域和跨区域采样的 p(k) 方差(n = 314 个区域,双侧 Wilcoxon 符号秩检验)。对于 b-e 中的方框图,方框范围从第 25 百分位数到第 75 百分位数,中间线代表中位数。晶须延伸至最大值,距离每个方框边缘不超过四分位数间距(IQR)的 1.5 倍。超出晶须的点为异常值。源数据以源数据文件的形式提供。

For high expression genes (mean normalized expression >0.1) located in clone specific copy number (CSCN) regions, 76.7% (64.4–86.6% across cases) had p(k) > 0.5 suggesting their expression is dependent on copy number (Fig. S15, Fig. S16a). Taking together the simulation results and the fact that there are 13.4–35.6% genes with low CN dosage effects (p(k) ≤ 0.5), we would expect benefits of incorporating k and p(k) in TreeAlign as compared to CloneAlign (Fig. S16c). It was reported that cancer genes tend to have stronger CN-expression correlation compared to non-cancer genes in HGSCs34. We also observed concordant results that cancer genes annotated by COSMIC Cancer Gene Census35 tend to have higher p(k) compared to non-cancer genes suggesting stronger CN dosage effects in cancer genes (Fig. S16d, e).
对于位于克隆特异拷贝数(CSCN)区域的高表达基因(平均归一化表达>0.1),76.7%(64.4-86.6%)的 p(k) > 0.5,表明它们的表达依赖于拷贝数(图 S15、图 S16a)。综合模拟结果和 13.4-35.6% 的基因具有较低的 CN 剂量效应(p(k) ≤ 0.5)这一事实,我们认为与 CloneAlign 相比,在 TreeAlign 中加入 k 和 p(k) 会有好处(图 S16c)。据报道,在 HGSCs 中,与非癌基因相比,癌基因往往具有更强的 CN 表达相关性 34 。我们还观察到,与非癌症基因相比,由 COSMIC 癌症基因普查注释的癌症基因 35 往往具有更高的 p(k),这表明癌症基因具有更强的 CN 剂量效应(图 S16d、e)。

When we summarized p(k) by genomic locations, genes located at the same CSCN region had more consistent p(k). Notably, p(k) of genes in a contiguous region exhibited significantly lower variation compared to randomly sampled genes across different regions (Fig. 5a, e). It should be noted that we only included CN events that span more than 10 genes in this analysis. In addition to broad regions of the genome, we note subclonal high-level amplifications affecting known oncogenes which have been identified previously3. Using TreeAlign, we also identified subclonal amplifications of oncogenes accompanied by consistent changes in gene expression. For example, in OV2295, subclonal upregulation of MYC expression coincides with the clone-specific MYC amplification with p(k) > 0.8 (Fig. S17). To investigate whether MYC pathway activation was also impacted by non-CNA driven effects, we performed pathway enrichment on genes with low p(k) and found genes in the Hallmark MYC Target V1 gene set36 significantly enriched in low p(k) genes. Combined with HLAMP results, this suggests the pathway can be regulated by both CN dosage effects and other (potentially non-genomic) effects at the subclonal level (Fig. S18a, b), further highlighting the importance of p(k) for interpreting the mechanism of gene dysregulation.
当我们按基因组位置总结 p(k) 时,位于同一 CSCN 区域的基因具有更一致的 p(k)。值得注意的是,与在不同区域随机取样的基因相比,连续区域内基因的 p(k) 变异显著较低(图 5a,e)。值得注意的是,我们只将跨度超过 10 个基因的 CN 事件纳入本分析。除了基因组的广泛区域外,我们还注意到影响已知致癌基因的亚克隆高水平扩增,这些基因以前已经被鉴定过 3 。利用 TreeAlign,我们还发现了伴随基因表达一致变化的癌基因亚克隆扩增。例如,在 OV2295 中,MYC 表达的亚克隆上调与 p(k) > 0.8 的克隆特异性 MYC 扩增相吻合(图 S17)。为了研究 MYC 通路激活是否也受到非 CNA 驱动效应的影响,我们对 p(k) 较低的基因进行了通路富集,发现 Hallmark MYC Target V1 基因集中的基因 36 显著富集在 p(k) 较低的基因中。结合 HLAMP 的结果,这表明该通路在亚克隆水平上可能受到 CN 剂量效应和其他(潜在的非基因组)效应的调控(图 S18a、b),进一步突出了 p(k) 对解读基因失调机制的重要性。

Clone-specific transcriptional profiles highlight clonal divergence in immune pathways
克隆特异性转录谱突显免疫途径中的克隆差异

We next sought to interpret clone-specific transcriptional phenotypes and phenotypic divergence during clonal evolution from TreeAlign mappings. For patient 022, differential expression and gene set enrichment analysis identified genes and pathways upregulated in each clone (Fig. 6a, b). In total, we found 1346 genes significantly upregulated (adjusted P < 0.05, MAST37) in at least one of the subclones in patient 022. 52.1% (701) of these genes were not located in CSCN regions, while 47.9% (645) genes were located within CSCN regions. For 90.7% (585/645) of genes in CSCN regions, p(k) was > 0.5, reflecting probable gene dosage effects.
接下来,我们试图从 TreeAlign 映射中解释克隆进化过程中克隆特异性转录表型和表型分化。对于患者 022,差异表达和基因组富集分析确定了每个克隆中上调的基因和通路(图 6a、b)。我们共发现 1346 个基因在患者 022 的至少一个亚克隆中明显上调(调整后 P < 0.05,MAST 37 )。其中52.1%(701个)的基因不位于CSCN区域,而47.9%(645个)的基因位于CSCN区域内。90.7%(585/645)的基因位于 CSCN 区域,p(k) > 0.5,反映出可能存在基因剂量效应。

Fig. 6: Clone-specific transcriptional profiles highlight clonal divergence in immune pathways.
图 6:克隆特异性转录谱突显了免疫途径的克隆分化。
figure 6

a Scaled expression of upregulated genes in each subclone in patient 022, showing genes in rows and subclones in columns. Genes in the COSMIC Cancer Gene Census35 are highlighted. b UMAP embedding of expression profiles from patient 022 colored by clone labels assigned by integrated TreeAlign model. c Differentially expressed genes between clone A and other subclones (clone B-D) in patient 022. Adjusted P values were calculated by MAST37. Proportions of subclonal differentially expressed genes located in CSCN regions for d 184-hTERT cell lines, e an HGSC control cell line and PDXs. f Pathways with clone-specific expression patterns in TNBC and HGSC PDXs. Source data are provided as a Source Data file.
a 022 号患者每个亚克隆中上调基因的比例表达,以行显示基因,以列显示亚克隆。b 022 号患者表达谱的 UMAP 嵌入,由集成 TreeAlign 模型分配的克隆标签着色。 c 022 号患者克隆 A 和其他亚克隆(克隆 B-D)之间的差异表达基因。调整后的 P 值由 MAST 37 计算得出。d 184-hTERT 细胞系、e HGSC 对照细胞系和 PDX 中位于 CSCN 区域的亚克隆差异表达基因比例。源数据作为源数据文件提供。

Immune related pathways such as IFN-α and IFN-γ response were differentially expressed, and with increased relative expression in clone A (Figs. 6c, S19e, S21). Clone A contains cells from both right and left adnexa, thus dysregulation of these pathways cannot be simply explained by the microenvironment of clone A. Differential expression of immune related pathways was also found between more closely related subclones. Compared to clone B.2, clone B.1 also has enriched expression in IFN-α and IFN-γ signaling pathways and downregulation in MYC targets V1 and G2M checkpoint gene sets (Figs. S19a, 19f and 22). Clone D.4, compared to other clone D subclones, had down-regulated TNF-α signaling via NFκB (Figs. S19b, g and S23). Seeking to explain the relative contribution of subclonal CNAs to differentially expressed pathways, we analyzed the proportion of differentially expressed genes found in subclonal CNAs for each pathway. Only 17.4% (4/23) of differentially expressed genes in the Allograft Rejection gene set are in CSCN regions compared to 61.5% (24/39) in the MYC Targets V1 gene set highlighting the distinct impact of subclonal CNA between pathways (Fig. S20).
免疫相关通路,如 IFN-α 和 IFN-γ 反应,在克隆 A 中有差异表达,且相对表达量增加(图 6c、S19e、S21)。克隆 A 含有来自右侧和左侧附件的细胞,因此不能简单地用克隆 A 的微环境来解释这些通路的失调。与克隆 B.2 相比,克隆 B.1 在 IFN-α 和 IFN-γ 信号通路中也有富集表达,而在 MYC 靶点 V1 和 G2M 检查点基因集中则有下调(图 S19a、19f 和 22)。与其他克隆 D 亚克隆相比,克隆 D.4 通过 NFκB 下调 TNF-α 信号转导(图 S19b、g 和 S23)。为了解释亚克隆 CNA 对差异表达通路的相对贡献,我们分析了亚克隆 CNA 中发现的各通路差异表达基因的比例。异体移植排斥基因集中只有 17.4% (4/23)的差异表达基因位于 CSCN 区域,而 MYC Targets V1 基因集中则有 61.5% (24/39)的差异表达基因位于 CSCN 区域,这凸显了亚克隆 CNA 对不同通路的不同影响(图 S20)。

We conducted a similar analysis on data from Funnell et al.3. Differential expression analysis revealed varying proportions of DE genes located in CSCN regions ranging from 1.3% to 63.9%, indicating that transcriptional heterogeneity due to cis-acting subclonal CNAs varied across tumors (Fig. 6d, e). In addition to pathways such as KRAS signaling, IFN-α and IFN-γ response pathways also show frequent variable expression within subclones of TNBC and HGSC PDXs (Fig. 6f). IFN signaling has important immune modulatory effects, and has been previously linked to immune evasion and resistance to immunotherapy38. The recurrent differential expression of immune related pathways between subclones suggests their importance in clonal divergence in these cancers of genomic instability.
我们对 Funnell 等人的数据进行了类似分析。 3 .差异表达分析显示,位于 CSCN 区域的 DE 基因比例从 1.3% 到 63.9% 不等,这表明顺式作用亚克隆 CNA 所导致的转录异质性在不同肿瘤之间存在差异(图 6d,e)。除了 KRAS 信号转导等通路外,IFN-α 和 IFN-γ 反应通路在 TNBC 和 HGSC PDXs 亚克隆中也经常出现不同的表达(图 6f)。IFN 信号传导具有重要的免疫调节作用,以前曾被认为与免疫逃避和免疫治疗抗药性有关 38 。免疫相关通路在亚克隆间的反复差异表达表明,它们在这些基因组不稳定的癌症的克隆分化中具有重要作用。

To investigate transcriptional diversity within and across subclonal populations, we calculated Pearson correlation coefficients (R) and Euclidean distance between cells using the top 20 principal components of the gene expression matrices. In addition to TreeAlign, we also used InferCNV to assign cells from scRNA to genomic clones. Clonal frequencies in scRNA estimated by TreeAlign are more consistent with scDNA compared to InferCNV estimations (Fig. S12). We found that cells sampled from the same TreeAlign clone or InferCNV clone tend to have higher correlation and lower distance (Fig. S13), suggesting lower transcriptional diversity within the subclonal populations.
为了研究亚克隆群体内部和之间的转录多样性,我们使用基因表达矩阵的前 20 个主成分计算了细胞间的皮尔逊相关系数(R)和欧氏距离。除 TreeAlign 外,我们还使用 InferCNV 将 scRNA 中的细胞分配到基因组克隆中。与 InferCNV 估算的结果相比,TreeAlign 估算的 scRNA 中的克隆频率与 scDNA 更为一致(图 S12)。我们发现,从相同的 TreeAlign 克隆或 InferCNV 克隆取样的细胞往往具有较高的相关性和较低的距离(图 S13),这表明亚克隆群体内的转录多样性较低。

Discussion 讨论

TreeAlign establishes a probabilistic framework for integration of scRNA and scDNA data and inference of dosage effects of subclonal CNAs. TreeAlign achieves high accuracy of assigning single cell expression profiles to genetic subclones and was built to operate on phylogenetic trees directly, therefore informing phenotypically divergent subclones during the recursive clone assignment process. In addition to scRNA and scDNA integration, TreeAlign disentangles the in cis dosage effects of subclonal CNAs which highlights highly regulated pathways in clonal evolution. The model also has improved flexibility allowing either total or allelic copy number or both to be used as input. With additional allele-specific information, TreeAlign has improved prediction accuracy and model robustness and is able to identify more refined clonal structure.
TreeAlign 建立了一个概率框架,用于整合 scRNA 和 scDNA 数据以及推断亚克隆 CNA 的剂量效应。TreeAlign 能高精度地将单细胞表达谱分配给基因亚克隆,并能直接在系统发生树上操作,因此在递归克隆分配过程中能为表型不同的亚克隆提供信息。除了 scRNA 和 scDNA 整合外,TreeAlign 还能区分亚克隆 CNA 的顺式剂量效应,从而突出克隆进化中的高度调控途径。该模型还提高了灵活性,允许使用总拷贝数或等位基因拷贝数或两者作为输入。有了额外的等位基因特异性信息,TreeAlign 提高了预测的准确性和模型的稳健性,并能识别更精细的克隆结构。

In terms of limitations, TreeAlign was designed to integrate matched scRNA and scDNA datasets. For partially matched datasets with different clonal compositions, TreeAlign may have compromised performance. TreeAlign also assigns expression profiles based on clone-specific CNAs. For cancer types not driven by CN events, TreeAlign is not suitable due to lack of input features. The way TreeAlign encodes the relationship between gene expression and CN could also be further improved. As TreeAlign uses the binary parameter k to indicate whether gene expression is conditioned on copy number, it would be straightforward to change how expected expression is modeled in both conditions and explore more complex patterns of transcriptional regulations in the future. By default, TreeAlign truncates CNs > 10 to 10 and represents the CN-expression relationship with a linear function. Functions that are more biologically meaningful could be used as a replacement. Another extension could be to model k as a categorical variable with a 1 of K Multinomial distribution where the different components represent different functional forms of gene dosage (e.g. linear vs logistic vs exponential).
就局限性而言,TreeAlign 的设计目的是整合匹配的 scRNA 和 scDNA 数据集。对于具有不同克隆组成的部分匹配数据集,TreeAlign 的性能可能会受到影响。TreeAlign 还根据克隆特异性 CNA 分配表达谱。对于不是由 CN 事件驱动的癌症类型,由于缺乏输入特征,TreeAlign 并不适用。TreeAlign 对基因表达和 CN 之间的关系进行编码的方式还可以进一步改进。由于 TreeAlign 使用二进制参数 k 来表示基因表达是否以拷贝数为条件,因此可以直接改变这两种条件下的预期表达建模方式,并在未来探索更复杂的转录调控模式。默认情况下,TreeAlign 会将大于 10 的 CN 截断为 10,并用线性函数表示 CN 表达关系。可以使用更具生物意义的函数来替代。另一种扩展方法是将 k 作为分类变量建模,并采用 K 的 1 多叉分布,其中不同的分量代表基因剂量的不同功能形式(如线性 vs logistic vs 指数)。

We expect potential extensions of TreeAlign for integration of other single cell data modalities such as single-cell epigenetic data. Current methods for integration of scRNA and scATAC data are primarily based on nearest neighbor graphs or other distance metrics to match similar cells across multimodal datasets39. The advantage of TreeAlign is that it estimates how well the expression of a gene matches with the given biological assumption, hence it is more interpretable and provides explanations for gene expression variations.
我们预计 TreeAlign 有可能扩展用于整合其他单细胞数据模式,如单细胞表观遗传数据。目前整合 scRNA 和 scATAC 数据的方法主要基于近邻图或其他距离度量,以匹配多模态数据集中的相似细胞 39 。TreeAlign 的优势在于它能估计基因表达与给定生物假设的匹配程度,因此更容易解释,并提供基因表达变化的解释。

The emergence of more single cell multimodal datasets enables future studies to further reveal how genotypes translate to phenotypes and how ongoing mutational processes drive clonal diversification and evolution in cancer cells. It remains an open question whether the CN-expression relation is consistent across tumors and whether application at scale can reveal phenotypic consequences of CNAs at subclonal resolution. Furthermore, as TreeAlign also integrates allele-specific CN and expression, it would be interesting to investigate patterns of LOH and allele-specific expression on a subclone level as modulators of germline alterations and bi-allelic inactivation to better understand these events in the context of tumor heterogeneity and clonal evolution. We expect that concepts introduced in TreeAlign will facilitate the integration of single cell multimodal datasets and the interpretation of associations between modalities.
更多单细胞多模态数据集的出现使未来的研究能够进一步揭示基因型如何转化为表型,以及持续的突变过程如何驱动癌细胞的克隆多样化和进化。CN与表达的关系在不同肿瘤中是否一致,大规模应用是否能揭示CNA在亚克隆分辨率下的表型后果,这些仍是未决问题。此外,由于 TreeAlign 还整合了等位基因特异性 CN 和表达,因此在亚克隆水平上研究 LOH 和等位基因特异性表达模式作为种系改变和双等位基因失活的调节因子,以更好地了解肿瘤异质性和克隆进化背景下的这些事件,将是非常有趣的。我们希望 TreeAlign 中引入的概念能促进单细胞多模态数据集的整合和模态间关联的解释。

In conclusion, we anticipate that studying how CNAs impact gene expression programs in cancer applies broadly to different questions in cancer biology including etiology, tumor evolution, drug resistance and metastasis. In these settings, TreeAlign provides a flexible and scalable method for explaining gene expression with subclonal CNAs as a quantitative framework to arrive at mechanistic hypotheses from multimodal single cell data. Our approach provides a computational tool to disentangle the relative contribution of fixed genomic alterations and other dynamic processes on gene expression programs in cancer.
总之,我们预计研究 CNA 如何影响癌症中的基因表达程序可广泛应用于癌症生物学的不同问题,包括病因学、肿瘤进化、耐药性和转移。在这些情况下,TreeAlign 提供了一种灵活、可扩展的方法,利用亚克隆 CNA 作为定量框架来解释基因表达,从而从多模态单细胞数据中得出机理假设。我们的方法提供了一种计算工具,用于区分固定基因组改变和其他动态过程对癌症基因表达程序的相对贡献。

Methods 方法

TreeAlign total CN model TreeAlign CN 总模型

The TreeAlign model is a probabilistic graphical model as shown in Fig. 1c. Here we describe the model in detail. Let X be a cell × gene expression matrix of raw counts from scRNA-seq for N cells and G genes, and xng be the scRNA read count for cell n and gene g. Let Λ be a gene × clone copy number matrix for G genes and C clones, and λgc be the copy number at gene g for clone c. To assign cells from the expression matrix to a clone in copy number matrix, we use a categorical variable zn which indicates the clone to which a cell should be assigned. zn = c if cell n is assigned to clone c. zn is drawn from a Categorical distribution with Dirichlet prior.
TreeAlign 模型是一个概率图形模型,如图 1c 所示。下面我们将详细介绍该模型。假设 X 是一个细胞 × 基因表达矩阵,包含 N 个细胞和 G 个基因的 scRNA-seq 原始计数,x ng 是细胞 n 和基因 g 的 scRNA 读数。让Λ成为 G 个基因和 C 个克隆的基因 × 克隆拷贝数矩阵,λ gc 是克隆 c 在基因 g 处的拷贝数。要将表达矩阵中的细胞分配给拷贝数矩阵中的克隆,我们使用一个分类变量 z n 来表示细胞应分配给哪个克隆。

zn=1NCategorical(π)
(1)
πDir(α)
(2)

To indicate whether the expression of a gene is dependent on the underlying CN, we introduced another indicator variable kg. kg = 0 if expression of gene g is not dependent on CN. kg = 1 if expression of gene g is dependent on CN. kg is a Bernoulli random variable with Beta prior.
如果基因 g 的表达不依赖于 CN,则 k g = 0;如果基因 g 的表达依赖于 CN,则 k g = 1。k g 是一个具有 Beta 先验的伯努利随机变量。

kg=1GBernoulli(p(kg))
(3)
p(kg)Beta(β1,β2)
(4)

where we have β1 = 1, β2 = 1 as default.
其中,我们默认 β 1 = 1,β 2 = 1。

Our assumption is that yng, the expected expression of gene g in cell n, will be proportional to the copy number of gene g in clone c to which cell n is assigned, if expression of gene g is dependent on copy number as indicated by kg. Based on this assumption, our model is:
我们的假设是,如果基因 g 的表达与拷贝数有关(如 k g 所示),那么基因 g 在细胞 n 中的预期表达量 y ng 将与细胞 n 所在克隆 c 中基因 g 的拷贝数成正比。基于这一假设,我们的模型是

yng=E[xng|zn=c]=ln×[μg0×λgc×kg+μg1×(1kg)]×eψnwgTg=1G[μg0×λgc×kg+μg1×(1kg)]×eψnwgT
(5)
Xn=(xn1,,xnG)
(6)
Yn=(yn1,,ynG)
(7)
Xn=1NMultinomial(ln,Yn)
(8)

where ln is the total scRNA read count from cell n. Vector Yn represents the expected read count for each gene in cell n. Xn is the actual read count from each gene in cell n we want to model. μg0 is the per-copy expression of gene g if the expression is dependent on copy number while μg1 is the base expression of gene g if its expression is independent of copy number. The intuition is when kg = 1, we expect the expression of g to be proportional to its copy number; when kg = 0, the expression of g is not dependent on the underlying copy number. We specified a softplus transformed Normal prior over the per-copy expression μg0 and CN-independent base expression μg1.
其中,l n 是细胞 n 的 scRNA 总读数;向量 Y n 代表细胞 n 中每个基因的预期读数;X n 是我们要建立模型的细胞 n 中每个基因的实际读数;μ g0 是基因 g 的每拷贝表达量(如果表达量取决于拷贝数);μ g1 是基因 g 的基本表达量(如果表达量与拷贝数无关)。直觉上,当 k g = 1 时,我们希望基因 g 的表达量与其拷贝数成正比;当 k g = 0 时,基因 g 的表达量与基本拷贝数无关。我们在每个拷贝表达式 μ g0 和与 CN 无关的基础表达式 μ g1 上指定了一个软加变换正态先验。

μg0,μg1log(1+eN(μg,10))
(9)

where we set μg to the softplus inverse transformed mean read count of gene g across all cells.
其中,我们将 μg 设为所有细胞中基因 g 的软加逆变换平均读数。

The inner product ψnwgT introduces noise into the model to avoid over-fitting. We set the following priors: ψnN(0,1), wgiN(0,χi1), χi ~Gamma(2, 1)19.
内积 ψnwgT 将噪声引入模型,以避免过度拟合。我们设置了以下先验: ψnN(0,1) , wgiN(0,χi1) , χ ~Gamma(2, 1) 19

TreeAlign allele-specific model
TreeAlign 等位基因特异性模型

To use allele specific copy number information for clone assignment, we set up a separate model, allele-specific TreeAlign which only takes in allele specific information. The input to allele-specific TreeAlign includes single cell level B allele frequencies at heterozygous SNPs estimated from scDNA data and read counts of reference allele and alternative allele of these SNPs from scRNA-data.
为了使用等位基因特定的拷贝数信息进行克隆分配,我们建立了一个单独的模型,即等位基因特定的 TreeAlign,它只接收等位基因特定的信息。等位基因特异性 TreeAlign 的输入包括从 scDNA 数据中估算的杂合 SNP 的单细胞水平 B 等位基因频率,以及从 scRNA 数据中估算的这些 SNP 的参考等位基因和替代等位基因的读数计数。

Let tns be the scRNA read count at a heterozygous SNP s in cell n, rns be the scRNA read count from the reference allele at heterozygous SNP s in cell n. Both tns and rns can be obtained by genotyping heterozygous SNPs of interest with tools such as cellsnp-lite40.
t ns 和 r ns 都可以通过细胞np-lite 40 等工具对感兴趣的杂合SNP进行基因分型获得。

With scDNA data, we can estimate bsc, the BAF for SNP s and clone c using tools such as SIGNALS3. We assume that fns(zn = c), the expressed reference allele frequency at cell n and SNP s when cell n is assigned to clone c is controlled by the followings: (1). DNA BAF at that SNP of clone c and (2). whether the reference allele in scRNA data is B allele or not. We use a binary variable as to indicate whether the reference allele at SNP s should be assigned as B allele. as can be obtained using SIGNALS which can use information from scDNA to phase the SNPs in scRNA and assign alleles accordingly. We can also treat as as a hidden variable and jointly infer it from the allele-specific model of TreeAlign. Comparing as inferred from TreeAlign to SIGNALS output allows us to estimate the performance of TreeAlign.
利用 scDNA 数据,我们可以使用 SIGNALS 3 等工具估计 SNP s 和克隆 c 的 BAF b sc 。我们假设,当细胞 n 被分配到克隆 c 时,细胞 n 和 SNP s 的表达参考等位基因频率 f ns (z n = c)受以下因素控制:(1).克隆 c SNP 上的 DNA BAF;(2)scRNA 数据中的参考等位基因是否为 B 等位基因。我们使用二进制变量 a s 来表示 SNP s 上的参考等位基因是否应被分配为 B 等位基因。a s 可以通过 SIGNALS 获得,它可以利用 scDNA 中的信息对 scRNA 中的 SNP 进行分期,并相应地分配等位基因。我们还可以将 s 作为一个隐藏变量,并从 TreeAlign 的等位基因特异性模型中联合推断出来。将从 TreeAlign 中推断出的 s 与 SIGNALS 的输出结果进行比较,我们就能估算出 TreeAlign 的性能。

p(as)Beta(β1,β2)
(10)
as=1SBernoulli(p(as))
(11)
fns(zn=c)=as×bsc+(1as)×(1bsc)
(12)
rnsBinomial(tns,fns)
(13)

where we have β1=1,β2=1 as default.
其中 β1=1,β2=1 为默认值。

The total CN model and allele-specific model share categorical variable zn which indicates the clone assignment of cell n. Therefore, zn can be inferred from the two models separately or combined depending on the input data provided. The integrated model is illustrated in Fig. 1c. The prior distributions of all random variables are summarised in (Fig. S1).
因此,根据所提供的输入数据,z n 可以从这两个模型中分别或合并推断。综合模型如图 1c 所示。所有随机变量的先验分布见(图 S1)。

Model implementation and inference
模型实施与推理

TreeAlign is implemented with Pyro27 which is a universal probabilistic programming language written in Python and supported by PyTorch. Inference of TreeAlign is done by Pyro’s Stochastic Variational Inference (SVI) functions automatically. Specifically, we use the AutoDelta function which implements the delta method variational inference41. The delta method variational inferences use a Taylor approximation around the maximum a posterior (MAP) to approximate the posterior. Optimization is performed using the Adam optimizer. By default, we set a learning rate of 0.1 and the convergence is determined when the relative change in ELBO is lower than 10−5 by default.
TreeAlign 是用 Pyro 27 实现的,Pyro 是用 Python 编写的通用概率编程语言,由 PyTorch 支持。TreeAlign 的推理由 Pyro 的随机变异推理 (SVI) 功能自动完成。具体来说,我们使用的 AutoDelta 函数实现了 delta 法变分推理 41 。德尔塔法变分推理使用围绕最大后验(MAP)的泰勒近似来近似后验。优化使用 Adam 优化器进行。默认情况下,我们设置的学习率为 0.1,当 ELBO 的相对变化低于 10 时,即确定收敛 −5

Incorporating phylogeny as input
将系统发育纳入输入

In addition to the gene × clone copy number matrix, TreeAlign can also take the cell × gene copy number matrix from scDNA directly along with the phylogenetic tree constructed from this matrix as input. Starting from the root of the phylogeny, TreeAlign summarizes the copy number of gene g for each clade by taking the mode of copy number, and assigns cells from scRNA to clade-level CN profiles. This process is repeated recursively from the root of the phylogeny to smaller clades until: i) TreeAlign can no longer assign cells consistently in multiple runs (less than 70% cells have consistent assignments between runs by default), or ii) the number of genes located in CSCN regions becomes too small (100 genes in CSCN regions by default), or iii) Limited number of cells remain in scDNA or scRNA (100 by default). By default, TreeAlign also ignores subclades with less than 20 cells in scDNA. Some scRNA cells may remain unassigned to the scDNA phylogenetic tree. For a single cell, if the clone assignment probability < 0.8 or clone assignments are not consistent in 70% of repeated runs, the cell will be denoted as unassigned. This feature is important to the model because there might be incomplete sampling of a given tumor, leading to a subclone only appearing in one of the two data modalities. Note, all parameters are fully configurable at run time by the user.
除了基因 × 克隆拷贝数矩阵外,TreeAlign 还能直接从 scDNA 中获取细胞 × 基因拷贝数矩阵,并将根据该矩阵构建的系统发生树作为输入。TreeAlign 从系统发育树的根部开始,通过拷贝数模式总结每个支系的基因 g 拷贝数,并将 scRNA 中的细胞分配到支系级的 CN 图谱中。这一过程从系统发生的根部向更小的支系递归重复,直到:i) TreeAlign 无法在多次运行中一致地分配细胞(默认情况下运行之间一致分配的细胞少于 70%),或 ii) 位于 CSCN 区域的基因数量变得太少(默认情况下 CSCN 区域的基因数量为 100 个),或 iii) 保留在 scDNA 或 scRNA 中的细胞数量有限(默认情况下为 100 个)。默认情况下,TreeAlign 也会忽略 scDNA 中少于 20 个细胞的子支系。一些 scRNA 细胞可能仍未分配到 scDNA 系统发生树中。对于单个细胞,如果克隆分配概率小于 0.8 或在 70% 的重复运行中克隆分配不一致,该细胞将被标记为未分配。这一特征对模型非常重要,因为对特定肿瘤的取样可能不完整,导致子克隆只出现在两种数据模式中的一种。请注意,用户可在运行时对所有参数进行完全配置。

Benchmarking clone assignment and dosage effect prediction with simulations

To generate simulated data for benchmarking, TreeAlign model was fit to the MSKSPECTRUM patient 081 dataset to obtain the empirical estimations of model parameters. Then we simulated from TreeAlign considering the following scenarios: 1. Varying proportion (10%, 20%, 30%, …, 90%) of genes with dosage effect. 2. Varying number of genes (100, 500 and 1000) in CSCN regions. 3. Varying number of cells (100, 1000 and 5000) in scRNA.

We compared TreeAlign to CloneAlign and InferCNV v.1.3.5 in terms of the performance of clone assignment. For CloneAlign, we summarized clone-level copy number by calculating the mode of copy number for each gene and ran CloneAlign with default parameters. For InferCNV, we used the recommended setting for 10x. 3200 non-cancer cells were randomly sampled from the MSK SPECTRUM dataset and used as the set of reference “normal” cells. To assign clones with InferCNV, we calculated Pearson correlation coefficient between InferCNV corrected gene expression profile (expr.infercnv.dat) and the clone-level copy number profiles from scDNA. Cells from scRNA-seq were assigned to the clone according to the highest correlation coefficient. Accuracy of clone assignment was computed to compare the performance of the three methods. We also evaluated TreeAlign’s performance on predicting CN dosage effects.

To evaluate TreeAlign’s performance on predicting CN dosage effects, we calculated the area under the curve (AUC) using p(k) output by TreeAlign, and compared it to a baseline model. The baseline model for CN dosage effects was constructed by (1). assigning expression profiles to genomic clones using CloneAlign (2). calculating Pearson correlation coefficients (R) between normalized read count from scRNA and clone-specific CN from scDNA for each gene in input. The resulting R can be viewed as a metric for CN dosage effects. We calculated the baseline model AUC using R and compared to TreeAlign model.

To demonstrate the performance of allele-specific TreeAlign, for the simulated datasets with 30% CN-dependent genes, we also simulated reference allele and total read counts for varying number of heterozygous SNPs (0, 250, 500, 750, 1000 and 1250) from the generative model of allele-specific TreeAlign. Adjusted Rand index of clone assignments was calculated to evaluate the performance of the integrated TreeAlign model on simulated datasets with varying number of heterozygous SNPs.

To evaluate TreeAlign’s performance on inaccurate trees, we randomly shuffled labels for different proportions (10%, 20%,…,90%) of cells on the phylogeny of patient 22. TreeAlign was run with the shuffled phylogenies. Clone assignment results were compared to results obtained from the original phylogeny using adjusted Rand index.
为了评估 TreeAlign 在不准确的树上的性能,我们在患者 22 的系统发生图上随机地对不同比例(10%、20%......、90%)的单元格进行了洗牌。使用洗牌后的系统发生树运行 TreeAlign。使用调整后的兰德指数将克隆分配结果与原始系统发生的结果进行比较。

Human participants 人类参与者

All patient data were obtained from the MSK SPECTRUM cohort published before. Information regarding patient consent and ethical approval can be found in the previous publication7.
所有患者数据均来自之前发表的 MSK SPECTRUM 队列。有关患者同意书和伦理批准的信息可参见之前的出版物 7

MSK SPECTRUM data MSK SPECTRUM 数据

We obtained matched scRNA and scDNA from two HGSC patients (patient 022 and patient 081) from the MSK SPECTRUM cohort7. Samples were collected under Memorial Sloan Kettering Cancer Center’s institutional IRB protocol 15-200 and 06-107. Single cell suspensions from surgically excised tissues were generated and flow sorted on CD45 to separate the immune component. CD45 negative fractions were then sequenced using the DLP+ platform3,6,30.
我们从 MSK SPECTRUM 队列 7 中的两名 HGSC 患者(患者 022 和患者 081)身上获得了匹配的 scRNA 和 scDNA。样本是根据纪念斯隆-凯特琳癌症中心(Memorial Sloan Kettering Cancer Center)机构IRB协议15-200和06-107收集的。从手术切除的组织中提取单细胞悬浮液,并进行 CD45 流式分选,以分离免疫成分。然后使用 DLP+ 平台 3,6,30 对 CD45 阴性部分进行测序。

Gastric cancer cell line data
胃癌细胞系数据

Preprocessed scDNA data and scRNA count matrix of the gastric cancer cell line (NCI-N87)32 were downloaded from SRA (PRJNA498809) and GEO (GSE142750). Copy number calling for scDNA was performed using the Cellranger-DNA pipeline using default parameters.
胃癌细胞系(NCI-N87) 32 的预处理 scDNA 数据和 scRNA 计数矩阵从 SRA (PRJNA498809) 和 GEO (GSE142750) 下载。scDNA 的拷贝数调用使用 Cellranger-DNA 管道,使用默认参数。

PDXs and additional cell line data
PDXs 和其他细胞系数据

scRNA and scDNA from 6 HGSC PDX samples (SA1052BX1XB01516, SA1052JX1XB01535, SA1053BX1XB01603, SA1091AX1XB01790, SA1093CX1XB01917, SA1181AX1XB02700), 3 TNBC PDX samples (SA1035X6XB03216, SA1035X7XB03502, SA610X3XB03802), 1 ovarian cancer cell line (OV2295) and 6 hTERT-184 cell lines (SA039, SA1054, SA1055, SA1188, SA906a, SA906b) were downloaded from https://zenodo.org/records/6998936 according to Funnell et al.3.

scDNA data analysis

scDNA DLP+ data was processed3,30 using the Isabl platform42. Single cell copy number was inferred using HMMcopy43. Cells with quality score > 0.75 and not in S-phase were retained for downstream analysis. Allele specific copy number was called using SIGNALS, which provides allele specific copy number of the from AB in 500kb bins across the genome. A and B being the copy number of alleles A and B respectively with totalCN = A + B. As the single cell data is sparse, only a subset of germline SNPs have coverage in each cell, therefore to produce the input required for TreeAlign (B-Allele frequencies per SNP per cell), we impute the BAF of each SNP assuming that a SNP will have the same BAF as the bin in which the SNP resides.

Clustering and phylogenetic inference

Clustering and phylogenetic inference of scDNA was performed using UMAP and HDBSCAN (parameters min_samples=20, min_cluster_size=30, cluster_selection_epsilon=0.2). For patient 022, we also constructed phylogenetic trees using Sitka6.

Genotyping SNPs in scRNAseq cells

SNPs identified in scDNA-seq and matched bulk whole genome sequencing were genotyped in each single cell using cell-snplite with default parameters.

scRNA data analysis

For scRNA data processing, read alignment and barcode filtering were performed by CellRanger v.3.1.0. Cancer cell identification was performed with CellAssign44. Principal-component analysis (PCA) was performed on the top 2000 highly variable features output by function FindVariableFeatures using Seurat v.4.245. UMAP embeddings and visualization were generated using the first 20 principal components. Unsupervised clustering was performed using FindNeighbors function followed by FindClusters function (resolution = 0.2). To compare transcriptional heterogeneity across or within clones, we randomly sampled 100 expression profiles from the following groups: 1. all cancer cells in a patient/cell line/PDX 2. cancer cells in the same TreeAlign clone 3. cancer cells in the same InferCNV clone. Pearson correlation coefficients and Euclidean distance between the sampled expression profiles were calculated using the top 20 principal components.

Differential expression and gene set enrichment analysis

Differential expression analysis was performed using FindAllMarkers and FindMarkers function (test.use=”MAST”, latent.vars=c(”nCount_RNA”, ”nFeature_RNA”)) in Seurat v.4.2. Only G1 cells were used in differential expression analysis to avoid confounding of cycling cells. Cell cycle phase was annotated with CellCycleScoring function in Seurat.

We used the fgsea v.1.24.046 package to conduct gene set enrichment analysis with Hallmark gene sets (n=50) downloaded from MSigDB47. We set the following parameters for the gene set enrichment analysis: nperm = 1000, minSize = 15, maxSize = 500.

Statistical analysis and visualization

Statistical tests and visualization were performed with R (v.4.2) package ggpubr (v.0.5.0) and ggplot2 (v.3.4).

Reporting on sex and gender

Data from patient 081 and patient 022 from study cohort MSK SPECTRUM were used in this study. All patients are female with high-grade serous ovarian cancer.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.