Abstract 摘要
Single-cell bisulfite sequencing (scBS) is a technique that enables the assessment of DNA methylation at single-base pair and single-cell resolution. The analysis of large datasets obtained from scBS requires preprocessing to reduce the data size, improve the signal-to-noise ratio and provide interpretability. Typically, this is achieved by dividing the genome into large tiles and averaging the methylation signals within each tile. Here we demonstrate that this coarse-graining approach can lead to signal dilution. We propose improved strategies to identify more informative regions for methylation quantification and a more accurate quantitation method than simple averaging. Our approach enables better discrimination of cell types and other features of interest and reduces the need for large numbers of cells. We also present an approach to detect differentially methylated regions between groups of cells and demonstrate its ability to identify biologically meaningful regions that are associated with genes involved in the core functions of specific cell types. Finally, we present the software tool MethSCAn for scBS data analysis (https://anders-biostat.github.io/MethSCAn).
单细胞亚硫酸氢盐测序(scBS)是一种能够在单碱基对和单细胞分辨率上评估 DNA 甲基化的技术。从 scBS 获得的大数据集需要进行预处理以减少数据量、提高信噪比并提供可解释性。通常,这通过将基因组划分为大块并计算每个块内的甲基化信号平均值来实现。我们证明,这种粗粒度的方法会导致信号稀释。我们提出了改进的策略来识别更多具有信息量的区域以进行甲基化定量,并提出了一种比简单平均更准确的定量方法。我们的方法能够更好地区分细胞类型和其他感兴趣的特征,并减少了对大量细胞的需求。我们还提出了一种检测细胞群之间不同甲基化区域的方法,并证明了其能够识别与特定细胞类型核心功能相关的具有生物学意义的区域。最后,我们介绍了 MethSCAn 软件工具,用于 scBS 数据分析(https://anders-biostat.github.io/MethSCAn)。
Similar content being viewed by others
其他人正在查看相似内容
Main 主
Sequencing-based assays with single-cell resolution have offered new means to understand the differences between the cells making up a sample. Single-cell RNA sequencing (scRNA-seq) techniques have matured at great pace in recent years, with well-developed analysis methodologies, and methods to study epigenetics at single-cell resolution are rapidly catching up.
基于测序的单细胞分辨率检测方法为理解样本中细胞之间的差异提供了新的手段。单细胞 RNA 测序(scRNA-seq)技术在近年来取得了巨大的进展,已有成熟的分析方法,而用于单细胞分辨率研究表观遗传学的方法也在迅速跟进。
Briefly, in a bisulfite sequencing assay, DNA is treated with bisulfite, which converts unmethylated cytosines to uracils that are read as thymine in subsequent PCR, while methylated cytosines are protected from conversion. After sequencing, these conversions allow for the determination of the methylation status of all cytosines covered by reads1. Bisulfite sequencing can also be performed at single-cell resolution2 and even in parallel with scRNA-seq3,4,5.
简而言之,在一种亚硫酸氢盐测序实验中,DNA 会用亚硫酸氢盐处理,未甲基化的胞嘧啶会被转化为尿嘧啶,在后续的 PCR 中被读作胸腺嘧啶,而甲基化的胞嘧啶则受到保护,不会发生转化。测序后,这些转化使得可以确定所有被测序覆盖的胞嘧啶的甲基化状态 1 。亚硫酸氢盐测序也可以在单细胞分辨率下进行 2 ,甚至可以与单细胞 RNA 测序(scRNA-seq)并行进行 3,4,5 。
In this Article, we discuss strategies to analyze single-cell bisulfite sequencing (scBS) data. We suggest improvements to current approaches, and demonstrate their value in benchmarks, using four real-world datasets. Furthermore, we discuss how to perform comparative analyses. Finally, we present MethSCAn, a comprehensive software toolkit to perform scBS data analysis.
在本文中,我们讨论了分析单细胞亚硫酸氢盐测序(scBS)数据的策略。我们建议改进当前的方法,并通过四个真实数据集在基准测试中展示了这些方法的价值。此外,我们讨论了如何进行比较分析。最后,我们介绍了 MethSCAn,一个全面的软件工具包,用于进行 scBS 数据的分析。
The standard approach to analyze scBS data is based on methodology developed for the analysis of scRNA-seq data. Therefore, we start by briefly reviewing how scRNA-seq data are commonly analyzed, before we discuss scBS data analysis.
分析 scBS 数据的标准方法是基于用于分析 scRNA-seq 数据的方法学。因此,我们先简要回顾 scRNA-seq 数据通常是如何分析的,然后再讨论 scBS 数据的分析。
The starting point in most scRNA-seq analyses is a matrix of unique molecular identifier (UMI) counts (that is, counts of distinct RNA molecules), with one row for each cell and one column for each gene. A first goal is usually to assign cell types or states to cells. To this end, one needs to establish which cells are similar to each other, that is, quantify the distance (that is, dissimilarity) between any two given cells’ transcriptional profile. A standard approach, used with minor variation in virtually all recent research and automated by popular software such as Seurat6 or Scanpy7, is as follows: one first accounts for cell-to-cell variation in sequencing depth by dividing each UMI count by the respective cell’s total UMI count, then transforms to a homoskedastic scale by taking the logarithm. To avoid matrix elements with zero count to be transformed to minus infinity, one typically adds a very small ‘pseudocount’ (often 10−4) to the normalized fractions before taking the logarithm. Now, one could use Euclidean distances of these vectors of logarithmized fractions as the dissimilarity score. However, these scores would be exceedingly noisy owing to the strong Poisson noise introduced by the many genes with very low counts. Therefore, one performs a principal component analysis (PCA), keeping only the top few (typically, 20–50) components. As Poisson noise is uncorrelated between genes, it will average out in the top principal components, as these are all linear combinations with weight on a large number of genes. Therefore, Euclidean distances between these ‘PCA space’ vectors provide a robust dissimilarity score. Hence, the PCA space representation is suitable as input to methods such as t-distributed stochastic neighbor embedding (t-SNE) and uniform manifold approximation and projection (UMAP), which provide a two-dimensional representation of the data, or to methods for clustering (assigning cells to groups by similarity) and trajectory finding (identifying elongated manifolds in PCA space and assigning cells to quasi-one-dimensional positions along them).
大多数单细胞 RNA 测序(scRNA-seq)分析的起点是一个独特的分子标识符(UMI)计数矩阵(即,不同 RNA 分子的计数),每一行代表一个细胞,每一列代表一个基因。通常的第一个目标是将细胞类型或状态分配给细胞。为此,需要确定哪些细胞彼此相似,即量化任意两个给定细胞转录谱之间的距离(即,差异性)。一种标准的方法,几乎在所有最近的研究中都略有变化,并且由流行的软件如 Seurat 6 或 Scanpy 7 自动化,如下所述:首先通过将每个 UMI 计数除以相应细胞的总 UMI 计数来考虑细胞间测序深度的差异,然后通过取对数转换为同方差尺度。为了避免矩阵元素中的零计数被转换为负无穷大,通常会在取对数之前向归一化分数中添加一个非常小的“伪计数”(通常为 10 −4 )。现在,可以使用这些对数化分数向量的欧几里得距离作为差异性评分。 然而,由于引入了大量低计数基因的强泊松噪声,这些得分将极其嘈杂。因此,需要进行主成分分析(PCA),只保留前几项(通常为 20–50 项)。由于泊松噪声在基因之间是不相关的,它将在前几项主成分中平均掉,因为这些成分都是对大量基因的线性组合。因此,这些“PCA 空间”向量之间的欧几里得距离提供了一个稳健的差异评分。因此,PCA 空间表示适合作为 t 分布随机邻嵌入(t-SNE)和均匀流形逼近与投影(UMAP)等方法的输入,这些方法提供数据的二维表示,或者作为聚类(根据相似性将细胞分配到组)和轨迹查找(在 PCA 空间中识别拉长的流形并沿它们将细胞分配到准一维位置)方法的输入。
This procedure is commonly adapted when working with single-cell DNA methylation data, because once one gets to the PCA step, one can then continue with the established methods just mentioned. However, constructing a matrix suitable as input for PCA from methylation data requires deviation from the standard scRNA-seq workflow owing to considerable differences in data structure. First, while scRNA-seq quantifies the RNA abundance of genes or transcripts, scBS is genome-wide and thus lacks a natural choice for features in which methylation is to be quantified. Second, instead of counts, scBS generates binary data that inform us whether certain cytosines in a given cell are methylated. A simple and common approach to construct a methylation matrix suitable for PCA, used for instance by Luo et al.8, is to divide the genome into tiles of, for example, 100 kb size, and calculate for each cell the average methylation of the DNA within each tile. To this end, one identifies in the tile all CpG sites that are covered by at least one read and averages their methylation state, that is, one denotes as average DNA methylation of the tile in a given cell the proportion of the observed CpG sites in the tile that were found to be methylated (Fig. 1a). This yields a matrix, with one row for each cell and one column for each genomic tile, comprising numbers (‘methylation fractions’) between 0 and 1. This matrix is now subjected to PCA. After PCA, one can proceed with dimensionality reduction and clustering approaches known from scRNA-seq.
这种程序在处理单细胞 DNA 甲基化数据时通常会被采用,因为在进行 PCA 步骤时,可以继续使用之前提到的标准化方法。然而,从甲基化数据构建适合 PCA 输入的矩阵需要偏离标准的单细胞 RNA 测序工作流程,因为数据结构存在显著差异。首先,单细胞 RNA 测序量化的是基因或转录本的 RNA 丰度,而单细胞 BS(单细胞表观基因组学)是全基因组的,因此缺乏一个自然的选择来量化甲基化。其次,单细胞 BS 生成的是二进制数据,告诉我们某个细胞中特定胞嘧啶是否被甲基化。Luo 等人 8 采用的一种简单且常见的构建适合 PCA 的甲基化矩阵的方法是将基因组分为例如 100 kb 大小的片段,并计算每个细胞内每个片段的 DNA 平均甲基化水平。 为了实现这一目标,识别出所有在标题中至少被一个读取覆盖的 CpG 站点,并计算它们的甲基化状态,即在给定细胞中,该标题的平均 DNA 甲基化程度定义为标题中观察到的 CpG 站点中被甲基化的比例(图 1a)。这将生成一个矩阵,每一行代表一个细胞,每一列代表一个基因组片段,矩阵中的数值(“甲基化分数”)在 0 和 1 之间。然后对这个矩阵进行主成分分析(PCA)。PCA 后,可以使用来自 scRNA-seq 的降维和聚类方法进行处理。
While this simple procedure is straightforward and produces usable results, it is not optimal. In this paper, we discuss weaknesses of the simple approach and suggest several refinements to overcome them. Using benchmarks and an application to real data, we show that our improvements substantially increase the information content of the processed data. In the main text, we explain the proposed methods and their motivation in a qualitative manner, while mathematical details are given later in Methods. Then, we demonstrate the value of our methods using benchmarks and an application to real data. Finally, we describe and demonstrate an approach to detect differentially methylated regions (DMRs) in scBS data. We also describe our software toolkit, MethSCAn, that facilitates all these analyses.
虽然这种简单的程序直观且能产生可用的结果,但它并不是最优的。在本文中,我们讨论了简单方法的弱点,并提出了一些改进措施以克服这些问题。通过基准测试和对实际数据的应用,我们展示了我们的改进显著增加了处理数据的信息含量。在正文部分,我们以定性的方法解释了所提出的方法及其动机,而数学细节则在方法部分给出。然后,我们通过基准测试和对实际数据的应用来展示我们方法的价值。最后,我们描述并演示了一种检测单细胞 BS 数据中差异甲基化区域(DMRs)的方法。我们还描述了我们的软件工具包 MethSCAn,它有助于进行所有这些分析。
Results 结果
Read-position-aware quantitation
读位置感知定量分析
We first discuss the task of quantifying the level of methylation in a given, fixed genomic interval. Typically, the read coverage per cell is sparse in scBS data. In the example shown in Fig. 1a, the depicted interval is covered by a single read for two of the three cells shown and no read in the third. The read from cell 2 shows much more methylation than the read from cell 1, and the standard analysis would therefore consider cell 2 to have higher methylation in the interval than cell 1. However, given that the two reads agree wherever they overlap, a more parsimonious interpretation would be that the cells do not show a difference in methylation within the interval. Rather, both cells, and similarly maybe most other cells, might have stronger methylation in the left third of the interval than in the middle one.
我们首先讨论在给定的固定基因区间内量化甲基化水平的任务。通常,在 scBS 数据中,每个细胞的读取覆盖度是稀疏的。如图 1a 所示的示例中,三个细胞中有两个细胞的区间仅被一个读取覆盖,而第三个细胞则没有读取。来自细胞 2 的读取显示出比来自细胞 1 的读取更多的甲基化,因此标准分析会认为细胞 2 在该区间内的甲基化水平高于细胞 1。然而,鉴于两个读取在重叠部分是一致的,更简洁的解释是这些细胞在该区间内没有甲基化差异。相反,两个细胞,也许还有大多数其他细胞,在区间的左三分之一部分可能比中间部分有更强的甲基化。
Therefore, we propose to first obtain, for each CpG position, a smoothed average of the methylation across all cells and then quantify each cell’s deviation from this average. In Fig. 1b, the curved line depicts such an average over all cells, and the red vertical lines show an individual cell’s deviation from the ensemble average. We take the lengths of the red lines as signed values (‘residuals’), being positive for lines extending upward from the curve (methylated CpG) and negative for lines extending downward (unmethylated CpG). For each cell, we then take the average over the residuals for all the CpGs in the interval that are covered by reads from this cell. In this average, we perform shrinkage toward zero via a pseudocount (to trade bias for variance, see Methods for detail) to dampen the signal in cells with low coverage of the interval.
因此,我们建议首先为每个 CpG 位置获得所有细胞的甲基化平均值的平滑平均值,然后量化每个细胞与这一平均值的偏差。在图 1b 中,弯曲的线条表示所有细胞的这种平均值,而红色的垂直线则显示单个细胞与群体平均值的偏差。我们将红色线的长度视为带符号的值(‘残差’),向上延伸的线为正(甲基化 CpG),向下延伸的线为负(未甲基化 CpG)。然后,对于每个细胞,我们计算该区间内被该细胞读取覆盖的所有 CpG 的残差的平均值。在这一平均值中,通过伪计数的方式向零收缩(以权衡偏差和方差,详情见方法部分)来减弱低覆盖区间细胞中的信号。
The average thus obtained, that is, the shrunken mean of the residuals, is what we use to quantify the cell’s (relative) methylation in the interval. For a genome tiled into such intervals, we thus obtain a matrix, with one row per cell and one column per interval, that can be used for downstream analysis, for example, as input for PCA. The signal-to-noise ratio in this matrix will be better than in a matrix obtained by simply averaging absolute methylation (0 or 1) over all the cells’ covered CpG sites in a region. The reason for this is that we reduce the variation in situations such as the one depicted in Fig. 1a, where the methylation of the reads might differ strongly even though there is no actual evidence for a difference between the two cells.
因此得到的平均值,即残差的收缩均值,就是我们用来量化细胞(相对)甲基化程度的方法,在该区间内。对于被划分为这种区间的基因组,我们因此会得到一个矩阵,每一行代表一个细胞,每一列代表一个区间,该矩阵可用于下游分析,例如作为主成分分析(PCA)的输入。该矩阵中的信噪比将优于简单地在区域内所有细胞覆盖的 CpG 位点上平均绝对甲基化(0 或 1)得到的矩阵。这是因为我们减少了类似于图 1a 所示的情况下的变异,即使甲基化读数之间可能存在很大差异,但实际上两个细胞之间并不存在差异。
How should one obtain the ensemble average (Fig. 1b, curved line)? A simple approach to get a value for a specific CpG would be to take all cells with read coverage for the CpG and use the fraction of these that show the CpG as methylated. However, especially when only few cells offer coverage, these averages will be very noisy. Therefore, we propose to smoothen using a kernel smoother, that is, by performing a kernel-weighted average over the CpG site’s neighborhood. The kernel bandwidth (that is, the size of the neighborhood to average over) is a tuning parameter; for the examples presented here, we used 1,000 bp.
如何获得 ensemble 平均值(图 1b 中的曲线)?获得特定 CpG 值的一个简单方法是取所有具有 CpG 读取覆盖的细胞,并使用这些细胞中表现出 CpG 甲基化的比例。然而,特别是在只有少量细胞提供覆盖时,这些平均值将会非常嘈杂。因此,我们建议使用核平滑器进行平滑处理,即通过对 CpG 位点邻域进行核加权平均来实现。核带宽(即要平均的邻域大小)是一个调节参数;对于这里呈现的示例,我们使用了 1,000 bp。
A minor remaining issue is how to deal with the case that a cell has no reads at all within a given interval. Here, it is justified to simply put zero into the matrix element, because a shrunken residual average of zero indicates that there is no evidence of the cell deviating from the mean. We slightly refine this by using an iterative imputation within the PCA (‘iterative PCA’, see Methods for details).
一个剩余的较小问题是,如何处理在给定区间内某个细胞完全没有读数的情况。在这种情况下,将矩阵元素设为零是合理的,因为收缩后的残差平均值为零表明没有证据显示该细胞偏离均值。我们通过在主成分分析中使用迭代插补(“迭代 PCA”,详见方法部分)稍微改进了这一点。
Taken together, this shrunken mean of residuals quantitation reduces the variance in comparison to simple averaging of raw methylation calls. We show further below how this improves the results.
综合起来,这种缩减后的残差均值定量减少了与简单平均原始甲基化调用相比的方差。我们将在下面进一步展示这如何改进结果。
Finding variably methylated regions
寻找变异性甲基化区域
Typically, some regions in a chromosome will have a very similar methylation status in all cells, while other regions show variability in methylation across cells. For instance, it has long been known that CpG-rich promoters of housekeeping genes are unmethylated, and that a large proportion of the remaining genome is highly methylated regardless of cell type9. In contrast, DNA methylation at certain genomic features such as enhancers is more dynamic10 and thus more variable across cells. Only the latter regions are of value for our goal of quantitating the dissimilarity between cells. We call these the variably methylated regions (VMRs).
通常,在染色体中的一些区域,在所有细胞中都会表现出非常相似的甲基化状态,而其他区域则在不同细胞中表现出甲基化状态的差异。例如,长期以来人们知道,管家基因的 CpG 富集启动子是未甲基化的,而剩余基因组的很大一部分无论细胞类型如何都是高度甲基化的 9 。相比之下,某些基因组特征如增强子的 DNA 甲基化则更为动态 10 ,因此在不同细胞之间表现出更大的变化性。只有这些后一类区域对于衡量细胞之间的差异性是有价值的。我们将这些区域称为可变甲基化区域(VMRs)。
In the standard approach, one divides up (tiles) each chromosome into non-overlapping, equally sized intervals and quantitates the methylation of each tile. Such rigid placement of interval boundaries is unlikely to be optimal: for example, a VMR might be much smaller than a tile, and the signal from its CpG sites will hence be drowned out by the larger number of uninformative CpG sites that are equal in all cells, when averaging over all the CpG sites in the tile.
在标准方法中,会将每条染色体分割(平铺)为不重叠的等大小区间,并量化每个区间的甲基化程度。这样的区间边界固定放置很可能不是最优的:例如,一个 VMR 可能比一个区间小得多,因此其 CpG 位点的信号在对区间内所有 CpG 位点进行平均时会被大量无信息的 CpG 位点淹没。
Therefore, we propose the following approach (Fig. 2): Divide up the chromosome into many overlapping windows that start at regular multiples of a fixed, small step size. Quantify the methylation of each cell in each window by averaging the cell’s methylation residuals over all CpGs in the window, as described earlier and depicted in Fig. 1b. Next, calculate for each window the variance of these values over all cells. Select, say, the top 2% windows with the highest variances and mark them as VMRs. Wherever the marked windows overlap, merge them into one larger VMR. Then, calculate for each of these merged VMRs the methylation signal, as before, by averaging for each cell over the residuals of all contained CpG sites.
因此,我们提出以下方法(图 2):将染色体划分为许多重叠的窗口,这些窗口从固定的小步长的整数倍开始。通过计算窗口内所有 CpG 位点的甲基化残差的平均值,量化每个细胞在每个窗口中的甲基化程度,如前所述并在图 1b 中所示。接下来,计算每个窗口中这些值在所有细胞中的方差。选择方差最高的前 2%窗口,并将它们标记为 VMR。如果标记的窗口重叠,则将它们合并为一个较大的 VMR。然后,通过计算每个合并后的 VMR 中每个细胞的甲基化信号,如前所述,即通过计算包含的所有 CpG 位点的残差的平均值来计算。
In this manner, we obtain a methylation matrix, with one row per cell and one column per VMR, that is (in a sense) richer in information and has better signal-to-noise ratio than the matrix obtained by the simple analysis sketched at the very beginning. As we demonstrate below, a PCA performed on such a matrix provided a distance metric for the cells that contains more information on biological detail than one from a simpler analysis.
通过这种方式,我们获得了一个甲基化矩阵,每一行代表一个细胞,每一列代表一个 VMR,这个矩阵在信息量上更为丰富,并且信噪比更好,比初始简单分析得到的矩阵要好。如我们下面所展示的,对这种矩阵进行主成分分析(PCA),可以为细胞提供一个包含更多生物细节的 distance metric,而简单的分析则不能。
Application and benchmarks
应用和基准
To demonstrate the value of our proposed improvements, we benchmarked various combinations of analysis methods on five diverse single-cell methylome datasets, starting with a dataset from our own research.
为了展示我们提出改进的价值,我们在五个不同的单细胞甲基组数据集中 benchmark 了各种分析方法的组合,首先是从我们自己的研究中获取的一个数据集。
Correlating VMR methylation with gene expression
将 VMR 甲基化与基因表达相关联
Our dataset11 comprises the single-cell methylomes of 1,566 cells isolated from mouse forebrains as well as matched single-cell transcriptomes of the same cells. Among these cells are distinct cell types such as oligodendrocytes, oligodendrocyte precursor cells and endothelial cells, as well as cellular substates that are part of the continuous neural stem cell differentiation trajectory. To assess whether our VMR detection method captures genomic intervals that are biologically meaningful, we probed whether their methylation level correlates with the expression of nearby genes.
我们的数据集 11 包含从小鼠前脑中分离出的 1,566 个单细胞的甲基化组,以及这些细胞的匹配单细胞转录组。这些细胞包括不同的细胞类型,如少突胶质细胞、少突胶质细胞前体细胞和内皮细胞,以及处于连续神经干细胞分化轨迹中的细胞亚状态。为了评估我们的 VMR 检测方法是否捕获了具有生物学意义的基因区间,我们探究了这些甲基化水平是否与附近基因的表达相关。
We first note that gene expression is more strongly correlated with the methylation of nearby VMRs than with the methylation of their promoters (Fig. 3a), indicating that VMR methylation is often a better predictor of gene expression than promoter methylation. Indeed, a gene-wise comparison revealed many genes whose expression is correlated with the nearest VMR but not with promoter methylation. One such example gene, Htra1, is depicted in Fig. 3b. While the promoter of this gene is lowly methylated regardless of gene expression, a VMR located downstream of the promoter is lowly methylated in cells with high Htra1 expression.
我们首先注意到,基因表达与附近 VMRs 的甲基化程度相关性更强,而不是与启动子的甲基化程度相关(图 3a),这表明 VMR 甲基化通常是比启动子甲基化更好的基因表达预测指标。事实上,基因层面的比较显示,有许多基因的表达与最近的 VMRs 相关,但与启动子甲基化无关。其中一个这样的例子基因是 Htra1,如图 3b 所示。尽管该基因的启动子无论基因表达如何都低度甲基化,但在 Htra1 表达高的细胞中,位于启动子下游的 VMR 是低度甲基化的。
Improved identification of cell types
改进细胞类型识别
We next tested whether our methods improve the ability to distinguish cell types and cell states (Fig. 4). To this end, we obtained cell type/state labels based on the single-cell transcriptomes from Kremer et al.11 (Fig. 4a). We consider these cell labels as ground truth and tested whether we are able to distinguish the same groups of cells on the basis of their methylomes. To do this, we subjected the methylomes to various combinations of analysis methods including our own proposed methods and others that are commonly used. Specifically, we selected four different sets of genomic intervals at which CpG methylation is to be quantified: either VMRs detected with our approach, 100 kb genomic tiles, promoter regions (that is, transcription start site (TSS) ±2,000 bp) or candidate cis-regulatory elements (ENCODE cCREs12). To quantify methylation at these features, we either simply averaged in these intervals, obtaining methylation percentages, or calculated the shrunken mean of residuals, as described earlier. Finally, the resulting methylation matrices were subjected to iterative PCA (Methods) and subsequent UMAP for visualization.
我们接下来测试我们的方法是否能够提高区分细胞类型和细胞状态的能力(图 4)。为此,我们根据 Kremer 等人 11 的单细胞转录组数据获得了细胞类型/状态标签(图 4a)。我们将这些细胞标签视为真实标签,并测试我们是否能够根据细胞的甲基化组区分相同的细胞群。为此,我们将甲基化组应用了各种分析方法的组合,包括我们提出的方法和其他常用的方法。具体来说,我们选择了四种不同的 CpG 甲基化量化基因区间集合:使用我们方法检测到的 VMRs、100 kb 基因瓦片、启动子区域(即,转录起始位点(TSS)±2,000 bp)或候选的顺式调控元件(ENCODE cCREs 12 )。为了量化这些特征的甲基化,我们或直接在这些区间内取平均值,得到甲基化百分比,或计算残差的收缩均值,如前所述。最后,生成的甲基化矩阵进行了迭代 PCA(方法部分)和后续的 UMAP 可视化。
Visual inspection of the resulting UMAPs revealed that our proposed combination of methods results in more clearly separated cell types, compared to a UMAP obtained with default analysis methods (Fig. 4b,c). While all cells form a continuous point cloud when using default methods, our improvements led to a clear separation of oligodendrocytes, oligodendrocyte precursor cells and endothelial cells. Furthermore, even cellular substates of cells in the continuous neural stem cell lineage were partially separated. To quantify this performance gain in a more rigorous manner, we used a score that quantifies whether cells were placed, in 15-dimensional principal component (PC) space, in a neighborhood comprising cells of the same cell type (‘neighbor score’; Fig. 4e, see Methods for details). A higher neighbor score implies better separation of cell types inferred from single-cell transcriptomes of the same cells.
视觉检查所得到的 UMAP 图显示,我们提出的方法组合使得细胞类型更加清晰地分离,相比于使用默认分析方法得到的 UMAP 图(图 4b、c)。虽然使用默认方法时所有细胞形成一个连续的点云,但我们的改进使得少突胶质细胞、少突胶质细胞前体细胞和内皮细胞得到了明确的分离。此外,即使在连续的神经干细胞谱系中的细胞亚状态也部分被分离出来。为了以更严谨的方式量化这种性能提升,我们使用了一个评分系统,该评分系统量化了在 15 维主成分(PC)空间中,细胞是否被放置在一个包含相同细胞类型细胞的邻域内(“邻域评分”;图 4e,详见方法部分)。邻域评分越高,意味着从单细胞转录组推断出的细胞类型分离得越好。
The neighborhood relation in PCA space is relevant as it is used in many downstream analyses, for example, for clustering. We ask how the neighbor score depends on the number of available cells. This is relevant as scBS protocols are costly and labor-intensive, and only few laboratories are currently able to obtain thousands of single-cell methylomes. To simulate smaller datasets, we subsampled the 1,566-cell dataset into smaller data sets, analyzed them again with all combinations of analysis methods, and calculated the mean neighbor score for each (Fig. 4d). The results confirmed that quantifying methylation at VMRs leads to a cleaner separation of cell states than using promoter regions or 100 kb tiles. Using the shrunken mean of residuals as a measure of DNA methylation improved the results further. This effect was most noticeable when quantifying promoters or genomic tiles, presumably because an individual promoter region or tile might span genomic regions with varying levels of DNA methylation as depicted in Fig. 1b, which our method accounts for. Although cell cluster separation generally becomes more difficult in such cases, performance gains were observed also in smaller datasets.
PCA 空间中的邻域关系是相关的,因为它在许多下游分析中被使用,例如聚类。我们询问邻域得分如何依赖于可用细胞的数量。这很重要,因为 scBS 协议成本高且劳动密集,目前只有少数实验室能够获得数千个单细胞甲基组。为了模拟较小的数据集,我们将 1,566 个细胞的数据集子采样成更小的数据集,再次使用所有组合的分析方法进行分析,并计算每个数据集的平均邻域得分(图 4d)。结果证实,在 VMRs 处量化甲基化比使用启动子区域或 100 kb 的片段能更清晰地分离细胞状态。使用残差的压缩均值作为 DNA 甲基化的度量可以进一步改善结果。这种效应在量化启动子或基因组片段时最为明显,可能是因为单个启动子区域或片段可能跨越图 1b 所示的不同 DNA 甲基化水平的基因组区域,而我们的方法能够考虑到这一点。 虽然在这种情况下细胞簇分离通常变得更加困难,但在较小的数据集中也观察到了性能提升。
Overall, VMR quantification yielded results similar to those obtained when quantifying ENCODE regulatory regions, even though the number of detected VMRs (63,421) is considerably smaller than the number of ENCODE cCREs (339,815). When we repeated our analysis using only the 63,421 cCREs with the highest coverage, the ability to distinguish cell types was diminished, suggesting that the average VMR is more informative for this task then the average cCRE (Extended Data Fig. 1a,b). This demonstrates the ability of our de novo VMR detection approach to identify in scBS data a parsimonious set of relevant elements. The overlap between VMRs and cCREs is limited (Fig. 4f), indicating that VMR detection yields information that is complementary to other epigenetic marks. A further benefit of VMR detection is that this approach is available even in the absence of such annotations, for instance when studying species other than human or mouse. Lastly, using VMRs over regulatory regions resulted in decreased RAM requirements as well as a shorter runtime, even when accounting for the additional step of VMR detection (Extended Data Fig. 1c,d).
总体而言,VMR 量化得到的结果与量化 ENCODE 调控区域的结果相似,尽管检测到的 VMR 数量(63,421 个)远少于 ENCODE cCRE 的数量(339,815 个)。当我们仅使用覆盖度最高的 63,421 个 cCRE 重复分析时,区分细胞类型的能力有所下降,这表明平均 VMR 比平均 cCRE 更适合这项任务(扩展数据图 1a、b)。这表明我们新的 VMR 检测方法能够从 scBS 数据中识别出一组精简的相关元素。VMR 和 cCRE 之间的重叠有限(图 4f),表明 VMR 检测提供了与其他表观遗传标记互补的信息。VMR 检测的另一个好处是,即使在没有此类注释的情况下,这种方法也是可用的,例如在研究人类或小鼠以外的物种时。最后,使用 VMRs 而不是调控区域可以减少 RAM 要求并缩短运行时间,即使考虑了 VMR 检测的额外步骤(扩展数据图 1c、d)。
We repeated this benchmark on an additional three published single-cell methylome datasets (Extended Data Fig. 2). These include neuronal subtypes of the murine cortex8 (using cell type labels derived from CH methylation in genomic tiles instead of CpG methylation as ground truth), cells isolated from mouse embryos during the onset of gastrulation10 (using RNA-derived cell clusters or alternatively embryonic stage as ground truth) and human colorectal cancer cells13 (using sampling region as ground truth). Again, we subjected each dataset to all possible combinations of genomic feature selection and methylation quantification. We furthermore included three additional approaches to perform dimensionality reduction in our benchmarks, including PCA with two different preprocessing strategies, as well as Multi-Omics Factor Analysis version 2 (MOFA+), a dimensionality reduction technique designed for multimodal single-cell data that can also process methylation data14.
我们在另外三个已发表的单细胞甲基组数据集中重复了这一基准测试(扩展数据图 2)。这些数据集包括小鼠皮层神经亚型 8 (使用来自 CH 甲基化在基因组瓦片中的细胞类型标签而不是 CpG 甲基化作为真实值),胚胎发育起始期的小鼠胚胎细胞 10 (使用 RNA 导出的细胞簇或胚胎发育阶段作为真实值)和人类结肠直肠癌细胞 13 (使用采样区域作为真实值)。我们再次对每个数据集进行了所有可能的基因组特征选择和甲基化量化组合。此外,我们在基准测试中还包括了三种额外的降维方法,包括两种不同的预处理策略的 PCA,以及专为多模态单细胞数据设计的 Multi-Omics Factor Analysis 版本 2(MOFA+),这是一种可以处理甲基化数据的降维技术 14 。
These extensive benchmarks confirmed that our proposed combination of methods, that is, using the shrunken means of residuals of VMRs for dimensionality reduction, is able to distinguish diverse cellular properties such as cell type, colorectal cancer stage (normal tissue, primary tumor and metastasis), embryonic stage and germ layer.
这些广泛的基准测试证实,我们提出的结合方法能够区分多种细胞特性,如细胞类型、结直肠癌分期(正常组织、原发肿瘤和转移灶)、胚胎发育阶段和胚层。
Robustness to parameter changes
参数变化的稳健性
Next, we assessed whether our proposed workflow requires fine-tuning of parameters (Extended Data Fig. 3). To this end, we re-analyzed two datasets, as well as subsamples of the data with different VMR detection parameters, namely the width of the sliding window in bp (set with the option ‘--bandwidth’ in our MethSCAn software, default 2,000), the variance threshold above which windows are merged to VMRs (‘--var-threshold’, default 0.02) and the step size of the sliding window (‘--stepsize’, default 100 bp). This parameter sweep showed that our workflow gives good results over a wide range of parameter values. For the CpG methylation data of Luo et al.8, the results are nearly independent of the parameters (Extended Data Fig. 3b). In the more challenging dataset of Kremer et al.11, cell types were less cleanly separated when very large bandwidths, very strict variance thresholds or a very large step size was selected (Extended Data Fig. 3a,c). However, very small bandwidths or very lenient thresholds resulted in a much higher number of VMRs and thus long computing times. Overall, our default parameter combination provided good results and fast compute times in both datasets.
接下来,我们评估了所提出的流程是否需要对参数进行微调(扩展数据图 3)。为此,我们重新分析了两个数据集,并且使用不同的 VMR 检测参数对数据进行了子样本分析,具体包括滑动窗口的宽度(使用我们 MethSCAn 软件中的 ‘--bandwidth’ 选项设置,默认值为 2,000 bp)、超过该阈值的方差窗口将被合并为 VMR(使用 ‘--var-threshold’ 选项设置,默认值为 0.02)以及滑动窗口的步长(使用 ‘--stepsize’ 选项设置,默认值为 100 bp)。参数扫面结果显示,我们的流程在广泛的参数值范围内都能获得良好的结果。对于 Luo 等人 8 的 CpG 甲基化数据,结果几乎与参数无关(扩展数据图 3b)。在 Kremer 等人 11 的更具挑战性的数据集中,当选择非常大的带宽、非常严格的方差阈值或非常大的步长时,细胞类型之间的区分效果较差(扩展数据图 3a、c)。然而,非常小的带宽或非常宽松的阈值会导致 VMR 数量大幅增加,从而导致较长的计算时间。总体而言,我们的默认参数组合在两个数据集中都提供了良好的结果和较快的计算时间。
Further applications 进一步的应用
Lastly, we asked whether our methods are also suitable for the analysis of DNA methylation outside the default CpG context. To this end, we revisited the Luo et al.8 dataset but this time only considered CH methylation. VMR detection with default options produced results that were qualitatively similar to those reported in Luo et al.8, suggesting that our methods are also suitable for this data type (Extended Data Fig. 4a). Finally, as single-cell methylome datasets are expected to rapidly grow in size in the coming years, we furthermore performed a stress test on a large dataset comprising 100,350 cells15 (Extended Data Fig. 4b).
最后,我们询问我们的方法是否也适用于默认 CpG 上下文之外的 DNA 甲基化分析。为此,我们重新审视了 Luo 等人的 8 数据集,但这次仅考虑了 CH 甲基化。使用默认选项进行 VMR 检测产生的结果与 Luo 等人 8 报道的结果在定性上相似,这表明我们的方法也适用于这种数据类型(扩展数据图 4a)。最后,由于单细胞甲基组数据集预计在未来几年会迅速增长,我们还对包含 100,350 个细胞 15 的大数据集进行了压力测试(扩展数据图 4b)。
Finding DMRs 寻找 DMRs
A common task in the analysis of bulk bisulfite-sequencing data is the detection of DMRs between conditions, tissues or cell types16,17. As DNA methylation affects gene expression, DMRs can provide insights into the unique epigenetic and gene regulatory characteristics of cell types. However, to date, no approach to detect DMRs in scBS data has been reported. To enable DMR detection in scBS data, we thus developed an approach that detects DMRs of variable size between two groups of cells and controls the false discovery rate (FDR) (Fig. 5a).
在批量 bisulfite 测序数据的分析中,一个常见的任务是在不同条件、组织或细胞类型之间检测差异甲基化区域(DMRs) 16,17 。由于 DNA 甲基化会影响基因表达,DMRs 可以提供有关细胞类型独特表观遗传和基因调控特征的见解。然而,到目前为止,尚未有在单细胞 bisulfite 测序(scBS)数据中检测 DMRs 的方法。为了能够在 scBS 数据中检测 DMRs,我们因此开发了一种方法,在两个细胞群之间检测可变大小的 DMRs,并控制假发现率(FDR)(图 5a)。
Similar to the previously described approach for VMR identification (Fig. 2), we divide each chromosome into overlapping windows shifted by a small and fixed size (step size) and quantify the methylation of each cell in each window. Next, instead of the variance, we obtain the t statistic as a measure of differential methylation between the two cell groups. We identify the windows with the most extreme t statistics, for example, windows in the 2% upper and lower tails. We then merge any overlapping windows in the upper tail into larger DMRs and do the same for windows in the lower tail, and then we recalculate the t statistic for each larger DMR.
类似于之前描述的 VMR 识别方法(图 2),我们将每条染色体划分为重叠窗口,并将窗口之间移动一个小且固定的步长(步长)。然后,我们量化每个窗口中每个细胞的甲基化程度。接下来,我们不再使用方差,而是使用 t 统计量作为两个细胞群之间差异甲基化的度量。我们识别出 t 统计量最极端的窗口,例如,位于上尾部和下尾部的 2%窗口。然后,我们将上尾部的重叠窗口合并为较大的 DMR,并对下尾部的窗口做同样的处理,然后我们重新计算每个较大 DMR 的 t 统计量。
To assess the statistical significance of DMRs, we repeat the same procedure on permutations of the scBS data, that is, the same dataset with randomly shuffled cell labels. The DMR t statistics obtained from permuted data are then used to estimate the FDR, yielding an adjusted P value for each DMR. While the primary purpose of VMRs is to provide better input for PCA and distance calculations, DMR detection facilitates the discovery of epigenetic differences between conditions or cell types, as we demonstrate next.
为了评估差异甲基化区域(DMR)的统计显著性,我们在 scBS 数据的随机打乱细胞标签的置换数据上重复相同的程序。从置换数据中获得的 DMR t 统计量随后用于估计 FDR,从而为每个 DMR 提供一个调整后的 P 值。虽然 VMR 的主要目的是为 PCA 和距离计算提供更好的输入,但 DMR 检测有助于发现不同条件或细胞类型之间的表观遗传差异,如我们在接下来的内容中所展示的。
Detecting DMRs between oligodendrocytes and neural stem cells
检测少突胶质细胞与神经干细胞之间的 DMRs
We used the single-cell multi-omics dataset11 to evaluate our DMR detection approach. We first selected two cell populations from the healthy murine ventricular–subventricular zone: neural stem cells (NSCs, 130 cells) and oligodendrocytes (58 cells). Then, we used the method just described to identify DMRs between these two cell types (Fig. 5b). Repeating this after permuting the cell-type labels, in order to obtain a null distribution of the t statistics, yielded DMRs with much weaker methylation differences. Consequently, we could assign to many of the DMRs detected in the unpermuted data a low adjusted P value (Fig. 5b, colors).
我们使用了单细胞多组学数据集 11 来评估我们的 DMR 检测方法。我们首先从健康的 murine 脑室-次室区选择了两种细胞群体:神经干细胞(NSCs,130 个细胞)和少突胶质细胞(58 个细胞)。然后,我们使用刚才描述的方法来识别这两种细胞类型之间的 DMR(图 5b)。通过在细胞类型标签上进行置换以获得 t 统计量的 null 分布,我们得到了甲基化差异较弱的 DMR。因此,我们可以将许多在未置换数据中检测到的 DMR 分配一个较低的调整后 P 值(图 5b,颜色)。
Gene Ontology (GO) enrichment with the Genomic Regions Enrichment of Annotations Tool (GREAT)18 revealed that DMRs lowly methylated in oligodendrocytes are located near genes involved in myelination, the main function of oligodendrocytes. Similarly, DMRs specifically demethylated in NSCs occur near genes involved in stem cell population maintenance. This demonstrates that our DMR detection approach is able to identify biologically meaningful DMRs, even in scBS datasets of modest cell number. Figure 5d,e depicts an exemplary DMR, located at the gene encoding myelin-basic protein (Mbp), the major component of myelin that is essential for myelination of neuronal axons19. Our results suggest that oligodendrocyte-specific gene expression of Mbp is supported by low methylation at the detected DMR.
基因本体论(GO)富集分析使用 Genomic Regions Enrichment of Annotations Tool(GREAT) 18 表明,在少突胶质细胞中低甲基化的 DMRs 位于参与髓鞘形成的基因附近,这是少突胶质细胞的主要功能。同样,在 NSCs 中特异性去甲基化的 DMRs 位于参与干细胞群体维持的基因附近。这表明我们的 DMR 检测方法能够在少量细胞的 scBS 数据集中识别出生物学上有意义的 DMR。图 5d、e 展示了位于髓鞘基本蛋白(Mbp)基因处的一个示例 DMR,Mbp 是髓鞘的主要成分,对于神经元轴突的髓鞘形成至关重要 19 。我们的结果表明,检测到的 DMR 处的低甲基化支持了少突胶质细胞特异性的 Mbp 基因表达。
The MethSCAn software toolkit
The MethSCAn 软件工具包
We have implemented the methods just described in a Python package with a command line interface, MethSCAn (Single-Cell Analysis of Methylation data), which also offers a number of other functionalities for the analysis of scBS data. Figure 6 illustrates a typical scBS data analysis with MethSCAn and provides an overview of the implemented core functionalities. A tutorial that showcases the analysis of a small example dataset can be found at https://anders-biostat.github.io/MethSCAn.
我们已经将刚刚描述的方法实现为一个具有命令行界面的 Python 包 MethSCAn(单细胞甲基化数据分析),该包还提供了许多其他功能以分析 scBS 数据。图 6 展示了使用 MethSCAn 进行典型 scBS 数据分析的过程,并概述了实现的核心功能。一个展示如何分析小型示例数据集的教程可以在 https://anders-biostat.github.io/MethSCAn 找到。
The starting point of such an analysis are methylation files generated by tools such as Bismark20, methylpy21 or bisulfite-seq command line user interface toolkit22. Since it is inconvenient to work with hundreds or thousands of text files, MethSCAn provides the ‘prepare’ command that parses these methylation files and stores their content in a compressed format that enables efficient access to all CpG sites of the genome (Methods). The ‘methscan prepare’ command also computes a number of summary statistics for each cell, including the mean genome-wide methylation level and the number of observed CpG sites, that is, the number of CpG sites that have sequencing coverage. These summary statistics can be used to detect cells with poor quality. The quality of each single cell methylome can furthermore be inspected with ‘methscan profile’, which computes the average methylation profile of a set of user-defined genomic regions such as TSS at single-cell resolution. The TSS profile is a useful quality control plot since methylation shows a characteristic dip roughly ±1 kb around the TSS in mammalian genomes. Cells that do not show this pattern, or cells with few observed CpG sites, may then be discarded with ‘methscan filter’.
此类分析的起点是由 Bismark 20 、methylpy 21 或 bisulfite-seq 命令行用户界面工具包 22 生成的甲基化文件。由于处理数百或数千个文本文件不方便,MethSCAn 提供了“prepare”命令,该命令解析这些甲基化文件并将它们的内容以压缩格式存储,从而可以高效地访问整个基因组中的所有 CpG 站点(Methods)。MethSCAn 的“methscan prepare”命令还为每个细胞计算了一些汇总统计信息,包括全基因组平均甲基化水平和观察到的 CpG 站点数量,即具有测序覆盖的 CpG 站点数量。这些汇总统计信息可用于检测质量较差的细胞。此外,单个细胞甲基组的质量还可以通过“methscan profile”命令进行检查,该命令可以计算用户定义的一组基因组区域(如 TSS)的单细胞分辨率平均甲基化谱。TSS 谱是一种有用的质控图,因为甲基化在哺乳动物基因组中大约在 TSS 两侧 ±1 kb 处显示出一个特征性的凹陷。 不显示这种模式的细胞,或观察到的 CpG 站点较少的细胞,随后可以用“methscan filter”去除。
After quality control, the user has access to the genome-wide VMR detection approach described earlier, using the ‘methscan scan’ command. This produces a browser extensible data (BED) file that lists the genome coordinates of VMRs. To finally obtain a methylation matrix analogous to a scRNA-seq count matrix, this BED file can be used as input for ‘methscan matrix’, which quantifies the methylation at genomic intervals in all cells. The command produces both the simple percentage of methylated CpG sites as well as our proposed methylation measure, that is, the shrunken mean of the residuals, which is more robust to variation in sequencing coverage or stochastic differences in read position between cells. We note that both ‘matrix’ and ‘profile’ accept any valid BED file as input, which means that the user can quantify and visualize methylation at any set of genomic features of interest, such as promoters, enhancers or transcription factor binding sites, obtaining one profile plot per cell. The methylation matrix produced by the ‘matrix’ function can then serve as input for established methods used for the exploration of single-cell data, such as dimensionality reduction and cell clustering.
经过质量控制后,用户可以使用‘methscan scan’命令访问之前描述的全基因组 VMR 检测方法。这会产生一个浏览器可扩展数据(BED)文件,列出 VMR 的基因组坐标。为了最终获得类似于 scRNA-seq 计数矩阵的甲基化矩阵,可以将这个 BED 文件作为输入使用‘methscan matrix’命令,该命令可以量化所有细胞中基因组区间内的甲基化程度。该命令生成简单百分比的甲基化 CpG 位点以及我们提出的甲基化度量,即残差的收缩均值,这种度量对测序覆盖度的变化或细胞间读取位置的随机差异更具鲁棒性。我们注意到,‘matrix’和‘profile’都接受任何有效的 BED 文件作为输入,这意味着用户可以对感兴趣的任何一组基因组特征进行甲基化量化和可视化,例如启动子、增强子或转录因子结合位点,每个细胞获得一个特征图。 “matrix”函数生成的甲基化矩阵可以作为输入,用于探索单细胞数据的已建立方法,如降维和细胞聚类。
Finally, after annotation and exploration of the dataset, the user may specify two groups of cells for DMR detection with ‘methscan diff’. This command produces a BED file listing the genome coordinates, adjusted P values, as well as several other metrics associated with DMRs. These DMRs may then be associated with nearby genes, or used for GO enrichment with tools such as GREAT18 to enable functional interpretation.
最后,在对数据集进行注释和探索之后,用户可以使用“methscan diff”指定两个细胞群进行 DMR 检测。该命令会生成一个 BED 文件,列出基因组坐标、调整后的 P 值以及其他与 DMR 相关的多种指标。这些 DMR 可以与附近的基因相关联,或者使用如 GREAT 18 之类的工具进行 GO 富集分析,以实现功能解释。
Discussion 讨论
Here, we have proposed an improved strategy to preprocess single-cell bisulfite sequencing data. On the basis of the observation that incomplete read coverage of a genomic interval can lead to inaccurate methylation estimates, we suggest a scoring scheme that is aware of a read’s local context rather than just treating all reads in an interval alike.
在这里,我们提出了一种改进的策略来预处理单细胞亚硫酸氢盐测序数据。基于基因区间读取覆盖不完整会导致甲基化估计不准确的观察,我们建议采用一种评分方案,该方案考虑了读取的局部上下文,而不是将区间内的所有读取视为相同。
Furthermore, we show a way to pinpoint minimal regions of high variability across cells, which we call VMRs. Unlike other tools for scBS data analysis23,24,25, which rely on the user to manually specify which genomic intervals should be quantified, MethSCAn implements an approach to discover these intervals in the data itself. This not only reduces noise and allows to focus only on the features that are important for the given dataset but also provides useful input for genomics-style analyses. Depending on the research question at hand, individual VMRs may be related to nearby genomic features such as gene bodies or known regulatory elements, or subjected to gene ontology and motif enrichment.
此外,我们展示了如何确定细胞间高变异性的小区域,我们称之为 VMRs。与其他 scBS 数据分析工具 23,24,25 不同,这些工具依赖用户手动指定需要量化的基因区间,MethSCAn 实施了一种在数据本身中发现这些区间的方法。这不仅减少了噪声,使分析专注于对给定数据集重要的特征,还为基因组学风格的分析提供了有用的输入。根据手头的研究问题,单个 VMRs 可能与附近的基因特征(如基因体或已知调控元件)相关,或者进行基因 ontology 和 motif 富集分析。
To furthermore aid interpretation of scBS data, we developed and implemented an algorithm for genome-wide detection of DMRs in single-cell methylomes. FDR estimation via permutation allows us to report the statistical significance of each DMR. In a similar manner as VMRs, pinpointing the regions that differ between groups of cells and hence have a putative regulatory role aids biological interpretability. For instance, applying our approach to NSCs and oligodendrocytes demonstrated that the obtained DMRs locate near meaningful loci associated with cell-type specific functions.
为了进一步帮助解释 scBS 数据,我们开发并实现了一个算法,用于单细胞甲基组中全基因组 DMR 检测。通过置换估计 FDR,使我们可以报告每个 DMR 的统计显著性。与 VMRs 类似,确定不同细胞群之间的区域并因此具有潜在调控作用,有助于生物学解释。例如,将我们的方法应用于 NSCs 和少突胶质细胞表明,获得的 DMR 位于与细胞类型特异性功能相关的有意义位点附近。
We also presented an open-source software tool called MethSCAn that provides an easy-to-use implementation of the described algorithms. It can start directly from the output of methylation callers such as Bismark20, bisulfite-seq command line user interface toolkit22 and methylpy21 and produce a cell × region matrix. We suggest iterative PCA as an approach to map the count matrix to a reduced-dimensional space overcoming the abundance of missing values. Alternatively, one may use for this purpose established tools based on matrix factorization such as MOFA+14 (included in our benchmarks) or Linked Inference of Genomic Experimental Relationships26. Once a low-dimensional embedding is obtained, one can switch to well-established methods from scRNA-seq analysis, including visualization tools such as t-SNE and UMAP, Leiden/Louvain clustering, pseudotime trajectory analyses, and so on, for example, by using either overall toolkits such as Seurat6 and ScanPy/EpiScanPy7,25 or any of the many tools available for specific tasks. By offering these functions, MethSCAn bridges a gap in the chain of existing tools that has so far hindered practitioners in their data analysis. Our implementation can handle datasets of various sizes up to a 100,000 cells (Extended Data Fig. 4b).
我们还介绍了一个开源软件工具 MethSCAn,它提供了所描述算法的易用实现。它可以直接从 Bismark 20 、bisulfite-seq 命令行用户界面工具包 22 和 methylpy 21 等甲基化调用器的输出开始,并生成一个细胞×区域矩阵。我们建议使用迭代主成分分析将计数矩阵映射到低维空间,以克服大量缺失值的问题。或者,可以使用基于矩阵分解的现有工具,如 MOFA+ 14 (包含在我们的基准测试中)或基因组实验关系的链接推断 26 。一旦获得低维嵌入,就可以使用单细胞 RNA 测序分析中的成熟方法,包括 t-SNE 和 UMAP 等可视化工具、Leiden/Louvain 聚类、假时间轨迹分析等,例如,可以使用诸如 Seurat 6 和 ScanPy/EpiScanPy 7,25 等整体工具包,或任何针对特定任务的众多工具之一。通过提供这些功能,MethSCAn 填补了现有工具链中的空白,此前这些空白阻碍了实践者的数据分析。 我们的实现可以处理多达 100,000 个细胞的各种大小的数据集(扩展数据图 4b)。
By re-analyzing four published datasets, we show that these improvements to data preprocessing help to increase signal and decrease noise, resulting in a more informative intermediate-dimensional representation of the data. As examples of practical benefits, we demonstrate that our preprocessing allows for better distinction of cell subtypes, especially for challenging datasets comprising cellular substates and lineage transitions or for datasets with small cell number.
通过重新分析四组已发布的数据集,我们展示了这些数据预处理改进有助于增强信号并减少噪声,从而获得更具信息量的中间维度数据表示。作为实际益处的例子,我们证明了我们的预处理方法能够更好地区分细胞亚型,尤其是在包含细胞亚状态和谱系转换的挑战性数据集或细胞数量较少的数据集中。
In conclusion, we have presented powerful improvements to scBS data preprocessing and a software tool that implements these.
总之,我们提出了 scBS 数据预处理的强大改进,并实现了一个软件工具来实施这些改进。
Methods 方法
Raw data 原始数据
Let us write xij for the methylation status of CpG i in cell j. The index i runs over all CpG positions present in the genome, and the index j over all cells in the assay. We write xij = 0 if position i was found to be unmethylated in cell j by bisulfite sequencing, xij = 1 if it was methylated and xij = NA if position i was not covered by reads from cell j and the methylation status is therefore not available (NA).
让我们将 x ij 写作 CpG i 在细胞 j 的甲基化状态。索引 i 遍历基因组中存在的所有 CpG 位置,而索引 j 遍历实验中的所有细胞。如果位置 i 在细胞 j 中通过重亚硫酸盐测序被发现未甲基化,则 x ij = 0;如果被发现甲基化,则 x ij = 1;如果位置 i 未被细胞 j 的读取覆盖且甲基化状态不可用,则 x ij = NA。
These values can be readily obtained from single-cell bisulfite sequencing data using tools such as Bismark.
这些值可以使用 Bismark 等工具从单细胞亚硫酸氢盐测序数据中轻松获得。
If multiple reads from the same cell cover a position, these will typically be PCR duplicates of each other and hence agree. Of course, the two alleles of a CpG may rarely differ in their methylation status. While it is, in principle, possible that one obtains discordant reads stemming from the same position on both the paternal and the maternal chromosomes of the same cell, this is so unlikely that we can ignore such cases. Hence, whenever the methylation caller reports multiple reads covering the same position in the same cell, we set xij to 0 or 1 whenever all reads agree. When there is disagreement, we put xij = NA by default, or optionally follow the majority of reads whenever possible.
如果同一个细胞的多条读取覆盖同一位置,这些读取通常会是 PCR 重复,因此会一致。当然,一个 CpG 的两个等位基因偶尔可能在甲基化状态上有所不同。虽然原则上有可能在同一细胞的父系和母系染色体的同一位置上获得不一致的读取,但这种情况极为罕见,可以忽略不计。因此,每当甲基化识别器报告同一个细胞的同一位置有多条读取时,如果所有读取一致,则设置 x ij 为 0 或 1。当存在分歧时,默认情况下我们设置 x ij 为 NA,或者尽可能遵循多数读取。
For later use, we define C as the set of all cells in the assay (that is, C is the index set for the cell indices j). Moreover, we define Ci ⊂ C as the set of all those cells j that have reads covering position i:
为了后续使用,我们将 C 定义为实验中所有细胞的集合(即,C 是细胞索引 j 的索引集)。此外,我们将 C⊂ C 定义为所有具有覆盖位置 i 的读取的细胞 j 的集合:
Conversely, we define Gj as the set of all the CpG positions i covered by reads from cell j:
相反,我们将 G 定义为细胞 j 的读取覆盖的所有 CpG 位置 i 的集合:
Data storage 数据存储
The function ‘methscan prepare’ reads a set of methylation files (for example, produced by Bismark) and produces one file per chromosome. These files store the matrix x, where each column represents a cell and each row represents a base pair, in a space-efficient format as follows: x is represented as a SciPy sparse matrix27, encoding the actual values 0, 1 and NA as −1, 1 and 0, respectively. Since the vast majority of values in this matrix are missing owing to the sparsity of scBS data (and because rows for base pairs not corresponding to a CpG site contain no data), we encode missing values as zero and then store the data in compressed sparse row format. This format does not explicitly store zeroes (here, missing values) and is optimized for row-wise access, which results in substantial compression and allows fast access to the methylation status of genomic intervals. In all that follows here, any mention of x will, however, always mean the encoding as xij ∈ {0, 1, NA}.
函数 ‘methscan prepare’ 读取一组甲基化文件(例如,由 Bismark 生成),并为每条染色体生成一个文件。这些文件以空间高效的方式存储矩阵 x,其中每列代表一个细胞,每行代表一个碱基对:x 以 SciPy 稀疏矩阵的形式表示 27 ,将实际值 0、1 和 NA 分别编码为 −1、1 和 0。由于 scBS 数据的稀疏性,矩阵中的绝大多数值是缺失的(并且因为不对应 CpG 站点的碱基对行中没有数据),我们将缺失值编码为零,然后以压缩稀疏行格式存储数据。这种格式不显式存储零(这里为缺失值),并针对行访问进行了优化,从而实现了显著的压缩,并允许快速访问基因组区间上的甲基化状态。在此后的一切中,任何提到 x 的地方,都始终意味着 x ij ∈ {0, 1, NA}。
Smoothing 平滑
For each CpG position i, we write
对于每个 CpG 位置 i,我们写
for the average methylation at position i, where 〈⋅〉 denotes averaging, and the average runs over all the cells j ∈ Ci, that is, over those cells for which methylation data are available for position i.
对于位置 i 的平均甲基化程度,其中〈⋅〉表示平均化,平均化过程覆盖所有细胞 j ∈ C,即对于位置 i 有甲基化数据的所有细胞。
We then run a kernel smoother over these per-position averages to obtain the smoothed averages . Specifically, we use
我们然后在这些位置平均值上运行核平滑器以获得平滑平均值 。具体来说,我们使用
that is, is the weighted average over the per-position averages , taken over the CpG sites in the neighborhood of i, and weighted using a smoothing kernel kh with bandwidth h. Here, is the distance between CpG positions i and , measured in base pairs, h is the smoothing bandwidth in base pairs (by default, h = 1,000) and kh is the tricube kernel
即, 是在 CpG 位置平均值 上加权平均的结果,这些平均值是在位置 i 的邻近 CpG 站点 上计算的,并使用带宽为 h h 的平滑核 k 加权。这里, 是 CpG 位置 i 与 之间的距离,以碱基对为单位测量,h 是平滑带宽,以碱基对为单位(默认情况下,h = 1,000),k h 是三次立方核。
Methylation for an interval
区间甲基化
Next, we discuss averaging methylation over a range of CpG sites.
接下来,我们讨论在一系列 CpG 位点上平均甲基化程度。
Given an interval I on the chromosome, we wish to quantify the average methylation mIj of the CpG sites within the interval for cell j. If we interpret I as the set of CpG positions i in the interval, we may write
给定染色体上的区间 I,我们希望量化区间内 CpG 站点的平均甲基化程度 m Ij 对于细胞 j。如果我们将 I 解释为区间内的 CpG 位置 i 的集合,我们可以写出
Here, the average runs over all those sites i that lie within the interval I and are covered by reads from cell j.
在这里,平均值涵盖了所有位于区间 I 内且被细胞 j 的读取覆盖的站点 i。
If we wish to compare cells, it can be helpful to center this quantity by subtracting its average using
如果我们想要比较细胞,可以通过减去其平均值来中心化这个量,从而有所帮助
As an alternative, we suggest to consider the residuals of the individual CpG methylation values xij from the smoothed average
作为替代方案,我们建议考虑个体 CpG 甲基化值 x 从平滑平均值 的残差
and averaging over these, obtaining
并对这些进行平均,得到
This is a shrunken average, with denominator n + 1. This extra pseudocount has the effect of shrinking the value toward the ‘neutral’ value 0, with the shrinkage becoming stronger if the data are ‘weak’ because the number ∣I ∩ Ci∣ of positions covered by reads from cell j is low. In the extreme case of none of the reads from cell j covering I, the sum becomes 0 and the denominator 1; that is, rIj = 0 in this case.
这是一个收缩平均值,分母为 n + 1。这个额外的伪计数的效果是使值向‘中性’值 0 收缩,如果数据‘弱’,即细胞 j 的读取覆盖位置 ∣I ∩ C∣ 较少,收缩会变得更强烈。在细胞 j 的读取 none 覆盖 I 的极端情况下,和变为 0,分母为 1;也就是说,在这种情况下,r Ij = 0。
Finding VMRs 查找 VMRs
For any interval I, we denote by the variance of its residual averages rIj:
对于任意区间 I,我们用 表示其残差平均值 r Ij 的方差
where the average runs only over the set CI = ⋃i∈I Ci of those cells j that have reads covering interval I.
其中平均值仅对那些被区间 I 覆盖的细胞 j 集合 C = ⋃ i∈I C 进行求和。
To find VMRs, we define intervals I1, I2, …, all of the same width, and with stepwise increasing starts, then calculate v1, v2, … for these intervals. We then mark the intervals with the 2% highest variances. We take the union of all these intervals, split the union into connected components and call each component a VMR. Putting that last step in other words: We take all the intervals with variance in the top 2 percentile, fuse intervals that overlap and call the regions thus obtained the VMRs.
为了找到 VMRs,我们定义了一系列宽度相同的区间 I 1 、I 2 、…,并且这些区间的起始位置逐步增加,然后计算这些区间对应的 v 1 、v 2 等值。接着,我们标记出方差最高的前 2%的区间。我们将所有这些区间合并,将其划分为连续的成分,并将每个成分称为一个 VMR。用另一种方式来说最后一步:我们将方差处于前 2 百分位的所有区间合并,将重叠的区间合并,将得到的区域称为 VMR。
Finding DMRs 寻找 DMRs
When searching for DMRs, we compare two groups of cells, whose index sets we denote by CA and CB. For a given interval I, we calculate the mean each of the mean shrunken residuals rij (equation (1)) over the cells j in each of the two groups:
在搜索 DMRs 时,我们将两组细胞进行比较,这两组细胞的索引集分别表示为 C A 和 C B 。对于给定的区间 I,我们计算每个组中每个细胞 j 的均值收缩残差 r ij (方程(1))的均值:
We also calculate a variance as in equation (2):
我们还计算方程(2)中的方差:
From this, we calculate Welch’s t statistic as usual:
从这里,我们像往常一样计算 Welch’s t 统计量:
To find candidate DMRs, we again define overlapping and stepwise-shifted intervals I1, I2, … as for the VMRs and calculate t statistics t1, t2, … for these. As before, we take the top 2 percentile of these values, fuse intervals that overlap and call the regions thus obtained candidate DMRs. We repeat the procedure for the bottom 2 percentile to get the candidate DMRs for the other sign.
为了找到候选的 DMRs,我们再次定义重叠和逐步偏移的区间 I 1 、I 2 等,就像对 VMRs 所做的那样,并计算这些区间的 t 统计量 t 1 、t 2 等。就像之前一样,我们取这些值的前 2 百分位数,合并重叠的区间,并将由此获得的区域称为候选 DMRs。我们重复该过程以底 2 百分位数来获得另一符号的候选 DMRs。
Next, we need to check these candidate DMRs for statistical significance. We first remind the readers here that, as this is a within-sample analysis, cells, not samples, are the statistical unit. Therefore, a call as significant implies that the same DMR is likely to be called again if we repeated the analysis with another set of cells taken from the same biological sample, not that it would generalize to further samples. This fact, although often overlooked, is common to all within-sample analyses in the single-cell field, for example, also to the differential expression tests performed in scRNA-seq analyses to find marker genes that differentiate clusters.
接下来,我们需要检查这些候选的 DMR 的统计显著性。我们在此提醒读者,由于这是在样本内进行的分析,因此细胞而不是样本是统计单位。因此,一个显著的调用意味着如果我们用同一生物样本中的另一组细胞重复该分析,很可能会再次调用相同的 DMR,而不是意味着它会在其他样本中泛化。这一事实虽然经常被忽视,但在单细胞领域内的所有样本内分析中都是共通的,例如,在 scRNA-seq 分析中进行的差异表达检验,以寻找区分簇的标记基因。
It may seem that we could use the standard procedure for the Welch t-test here, that is, use the Welch–Satterthwaite formula to get an approximate degree of freedom and then calculate the tail probability of the corresponding t distribution. However, this is unlikely to hold for two reasons: First, the Welch–Satterthwaite degrees of freedom are only based on the number of cells per group and do not account for the fact that the read coverage might vary from cell to cell. Second, the fusing of the DMRs obtained in the scanning step to obtain fused candidate DMRs would invalidate subsequent P value-based adjustment for multiple testing.
这里似乎可以使用 Welch t 检验的标准程序,即使用 Welch-Satterthwaite 公式得到一个近似的自由度,然后计算相应 t 分布的尾概率。然而,由于两个原因,这很可能不适用:首先,Welch-Satterthwaite 自由度仅基于每个组的单元格数量,而不考虑读覆盖可能在单元格之间有所不同。其次,扫描步骤中获得的 DMR 进行融合以获得融合候选 DMR 会使得后续基于 P 值的多重检验调整无效。
Therefore, we have instead implemented a permutation procedure, which works as follows: We randomly shuffle the assignment of the cells in CA ⋃ CA to either of the two groups and then rerun the whole procedure, that is, the scanning step, the DMR fusing and the calculation of t values from the (potentially fused) candidate DMRs. This needs to be done for a sufficiently large number of permutations. Running through the whole genome for each permutation would be too computationally expensive. Instead, we go through the genome only once, but reshuffle the group labels every 2 Mb.
因此,我们反而实现了一个排列程序,其工作方式如下:我们随机重新分配 C A ⋃ C A 中的细胞到两个组中的任意一个,并然后重新运行整个程序,即扫描步骤、DMR 融合以及从(潜在融合的)候选 DMR 中计算 t 值。这需要进行足够多次的排列。对整个基因组进行每次排列的完整扫描会过于计算密集。相反,我们只对基因组进行一次扫描,但每 2 Mb 重新洗牌组标签。
All the t values obtained from this permutation procedure are taken together to obtain an empirical null distribution. Then, we can use this null distribution to control the FDR by applying the Benjamini–Hochberg procedure in its P value-free form: Let us write T for the set of all t values obtained from the ‘real’ assignment of cells to group labels and T0 from the set of all t values obtained from the shuffled assignments, that is, the empirical null distribution. To adjust a specific t value ti ∈ T, we calculate
从这种置换程序获得的所有 t 值被一起用来获得经验零分布。然后,我们可以使用这种零分布通过 Benjamini–Hochberg 程序的 P 值自由形式来控制 FDR:让我们用 T 表示从“真实”分配细胞到组标签获得的所有 t 值,用 T 0 表示从打乱分配获得的所有 t 值,即经验零分布。为了调整特定的 t 值 t∈ T,我们计算
In words, we calculate which fraction of the null t values is greater than t by absolute value, and which fraction of the real t values is. The ratio gives us the FDR we should expect if we used the t value as threshold.
用语言来说,我们计算在绝对值意义上,null t 值中有多少比例大于 t 值,以及真实 t 值中有多少比例大于 t 值。这个比率给出了如果我们使用 t 值作为阈值,我们应预期的 FDR。
Calculating cell-to-cell distances
计算细胞间距离
Given a set of intervals corresponding to VMRs, we get a relative methylation fraction rij for each VMR and each cell j from equation (1). The matrix thus obtained can then be centered and used as input for a PCA. If we calculate the top R principal components, we thus obtain for each cell j an R-dimensional principal component vector . For any two cells j and , we use the Euclidean distance as the measure of dissimilarity of the two cells. Thus, the matrix of PC scores can be used as input to dimension reduction methods such as t-SNE or UMAP, and to clustering methods like the Louvain or Leiden algorithm, which require such a matrix as input to the approximate nearest neighbor finding algorithm that is their first step.
给定一组对应于 VMR 的区间集合 ,我们从方程(1)中得到每个 VMR ij 和每个细胞 j 的相对甲基化分数 r 。由此得到的矩阵可以进行中心化处理,并作为 PCA 的输入。如果我们计算前 R 个主成分,我们就可以为每个细胞 j 得到一个 R 维的主成分向量 。对于任意两个细胞 j 和 ,我们使用欧几里得距离 作为两个细胞的不相似性度量。因此,主成分得分矩阵可以作为 t-SNE 或 UMAP 等降维方法的输入,以及 Louvain 或 Leiden 算法等聚类方法的输入,这些方法的第一步都需要这样的矩阵来进行近似最近邻查找算法。
PCA with iterative imputation
迭代插补的主成分分析
Whenever a region is not covered by any read in a cell, the corresponding data entry in the input data matrix for PCA will be missing. The standard approach to calculate PCA, commonly done using the implicitly restarted Lanczos bidiagonalization algorithm28, is not suited to deal with missing data. We circumvent this issue by simply using the PCA itself to impute the missing value in an approach that we call ‘iterative PCA’.
每当某个区域在细胞中没有被任何读取覆盖时,输入数据矩阵中对应的 PCA 数据项将是缺失的。通常用于计算 PCA 的标准方法,通常是使用隐重新启动 Lanczos 双对角线化算法 28 ,不适用于处理缺失数据。我们通过简单地使用 PCA 本身来填补缺失值,这种方法我们称之为“迭代 PCA”。
Let us write A for the matrix to which the PCA is to be applied, with the features (here, regions) represented by the matrix rows. The matrix has already been centered, that is, ∑iaij = 0. To establish notation, we remind the reader that performing a PCA on A means finding the singular value decomposition A = UDR⊤, with D diagonal and U and R orthonormal. The PCA scores are contained in X = UD, the loadings in R. To reconstruct the input data A from the PCA representation, one may use A = XR⊤, that is, aij = ∑rxirrjr, where the equation is exact if r runs over all principal components and approximate if it is truncated to the leading ones.
让我们用 A 表示要应用 PCA 的矩阵,特征(这里,区域)由矩阵的行表示。该矩阵已经中心化了,即 ∑a = 0。为了建立符号,我们提醒读者,对 A 进行 PCA 意味着找到奇异值分解 A = UDR,其中 D 是对角矩阵,U 和 R 是正交的。PCA 得分包含在 X = UD 中,载荷包含在 R 中。要从 PCA 表示重构输入数据 A,可以使用 A = XR,即 a = ∑ x r,如果 r 跑过所有主成分,则方程精确成立,如果截断为前导主成分,则为近似。
Our iterative imputation strategy is now simply the following: We first replace all missing values in the row-centered input matrix A with zeroes and perform the (truncated) PCA. Then, we use the PCA predictions for the missing values, that is, the aij = ∑rxirrjr, as refined stand-ins for the missing values in A and run PCA once more. This can now be iterated until convergence.
我们的迭代插补策略现在如下:首先,我们将行中心化的输入矩阵 A 中的所有缺失值用零替换,并执行(截断的)主成分分析(PCA)。然后,我们使用 PCA 对缺失值的预测,即 a ij = ∑ r x ir r jr ,作为缺失值的改进替代值,并再次运行 PCA。现在可以迭代直到收敛。
We note that similar approaches have also been used elsewhere29.
我们注意到,在其他地方也采用了类似的方法 29 。
Analysis of scBS datasets for benchmarks
scBS 数据集的基准分析
To analyze scBS data from Kremer et al.11, single-cell CpG methylation reports from all conditions were first stored with ‘methscan prepare’ and then smoothed with ‘methscan smooth’ using the default bandwidth of 1,000 bp. These data were then analyzed multiple times with different combinations of analysis methods, namely four ways to divide the genome into intervals, two ways to quantify methylation in these intervals and four approaches for dimensionality reduction. The following four sets of genomic intervals were used: VMRs, obtained with ‘methscan scan’ using the current default options (bandwidth of 2,000, step size of 10, variance threshold of 0.02 and minimum cell requirement of 6); adjacent tiles of 100 kb width; promoter regions as defined by the ±2 kb domain around the TSS of coding genes; and candidate cis-regulatory regions annotated by the ENCODE consortium12. Methylation was quantified either by averaging binary methylation values, or by calculating the shrunken mean of the residuals as described earlier.
要分析 Kremer 等人 11 的 scBS 数据,首先使用 ‘methscan prepare’ 将所有条件的单细胞 CpG 甲基化报告存储起来,然后使用默认带宽 1,000 bp 的 ‘methscan smooth’ 对其进行平滑处理。然后,使用不同的分析方法组合多次分析这些数据,具体而言,有四种方式将基因组分为区间,两种方式在这些区间内量化甲基化水平,以及四种降维方法。使用以下四种基因组区间集:使用 ‘methscan scan’ 并采用当前默认选项(带宽 2,000,步长 10,方差阈值 0.02,最小细胞要求 6)获得的 VMRs;100 kb 宽的相邻瓷砖;由编码基因的 TSS ±2 kb 域定义的启动子区域;以及由 ENCODE 合作组织注释的候选 cis 调控区域 12 。甲基化水平通过平均二进制甲基化值或如前所述计算残差的收缩均值来量化。
We used four different approaches for dimensionality reduction. Three of them involve imputation of missing values followed by PCA: The first approach, iterative PCA, was described earlier. Second, ‘PCA on high-quality features’ imputes missing values with the mean methylation level of a given interval, while only retaining high-quality features selected as in Luo et al.8: We selected tiles spanning ≥20 CpG sites and with sequencing coverage in at least 70% of cells. We then imputed missing values with the mean of each tile, centered the values and performed PCA. The third approach, ‘mean-imputed PCA’ is identical to the second approach but without the quality-filtering step. Lastly, we used MOFA+ with default parameters instead of PCA, which does not require imputation of missing values. In all four cases, we reduced the dimensionality of the input data to 15 PCs or MOFA factors. In some cases, MOFA+ returned a smaller number of factors, since some of the requested 15 had zero variance. For visualization, these 15 PCs or factors were subjected to UMAP with parameters min_dist = 0.2 and init = ‘spca’. To flexibly adapt to datasets of different sizes, we set n_neighbors (rounded to the nearest integer), where n is the total cell number.
我们使用了四种不同的降维方法。其中三种方法涉及缺失值填充,随后进行 PCA:第一种方法,迭代 PCA,之前已经描述过。第二种,‘高质量特征的 PCA’,用给定区间平均甲基化水平填充缺失值,同时仅保留如 Luo 等人 8 所选的高质量特征。我们选择了跨越≥20 个 CpG 位点且在至少 70%的细胞中有测序覆盖的斑块。然后用每个斑块的平均值填充缺失值,对值进行中心化处理,并执行 PCA。第三种方法,‘均值填充的 PCA’,与第二种方法相同,但没有质量过滤步骤。最后,我们使用了 MOFA+,并使用默认参数代替 PCA,这种方法不需要填充缺失值。在所有四种情况下,我们将输入数据的维度减少到 15 个主成分或 MOFA 因子。在某些情况下,MOFA+返回的因子数量较少,因为一些请求的 15 个因子中有些方差为零。为了可视化,这些 15 个主成分或因子被提交给 UMAP 处理,参数为 min_dist = 0.2 和 init = ‘spca’。 为了灵活适应不同大小的数据集,我们将 n_neighbors 设置为 (四舍五入到最接近的整数),其中 n 是总的细胞数。
The same analysis was repeated for three additional scBS datasets8,10,13 and for smaller datasets generated by randomly subsampling cells separately from these datasets.
对另外三个 scBS 数据集 8,10,13 进行了相同的分析,并对从这些数据集独立随机抽样细胞生成的小数据集进行了分析。
VMRs that intersect protein-coding gene bodies, CpG islands (from the University of California Santa Cruz genome browser) or cCREs were quantified by subtracting VMRs with at least 1 bp of overlap using ‘bedtools subtract -A’30 and counting the remaining VMRs.
与蛋白质编码基因体、CpG 岛(来自加州大学圣克鲁兹分校基因浏览器)或 cCREs 重叠的 VMRs 被通过使用 ‘bedtools subtract -A’ 30 减去至少 1 个碱基对重叠的 VMRs 并统计剩余的 VMRs 来量化。
To test our DMR detection approach, we selected oligodendrocytes and NSCs from healthy wild-type mice of Kremer et al.11 and ran ‘methscan diff’ with default parameters. For GO enrichment analysis, DMRs with adjusted P < 0.01 were uploaded to GREAT 4.0.4 (ref. 18) with the association rule ‘basal plus extension, 0 bp upstream, 20 kb downstream, 1 Mbp max extension, curated regulatory domains included’.
为了测试我们的 DMR 检测方法,我们选择了 Kremer 等人 11 的健康野生型小鼠的少突胶质细胞和 NSCs,并运行了‘methscan diff’,使用默认参数。对于 GO 富集分析,调整后的 P < 0.01 的 DMR 被上传到 GREAT 4.0.4(参考 18 ),关联规则为“基线加上扩展,上游 0 bp,下游 20 kb,最大扩展 1 Mb,包含已 curated 的调控域”。
Mean neighbor score 邻域平均得分
To assess the performance of our methods, we employed a score that quantifies how well cell types (or cell states) are separated in 15-dimensional PCA space. For data from Luo et al.8, we used cell type labels that the authors manually curated based on CH methylation. For the multi-omic dataset, we repeated the single-cell transcriptomics analysis described in Kremer et al.11 with two adjustments: We did not filter off-target cells such as endothelial cells, and we annotated cell types using Leiden clustering with the Seurat6 function ‘FindClusters(resolution = 0.1)’. The score, based on the Γ score31, varies between 0 and 1, where higher scores reflect better separation of cell types. For each cell j, we count how many of its k nearest neighbors have been assigned to the same cell type as cell j. We denote this count by , where we have chosen , rounded. The overall score used to evaluate a given combination of methods is then simply the mean of all cell-wise scores.
为了评估我们方法的性能,我们采用了一个评分标准,该标准量化了细胞类型(或细胞状态)在 15 维 PCA 空间中的分离程度。对于 Luo 等人 8 的数据,我们使用了作者根据 CH 甲基化手动整理的细胞类型标签。对于多组学数据集,我们按照 Kremer 等人 11 所述进行了单细胞转录组学分析,并进行了两项调整:我们没有过滤掉目标细胞以外的细胞,如内皮细胞,并使用 Leiden 聚类和 Seurat 6 函数‘FindClusters(resolution = 0.1)’对细胞类型进行了注释。该评分基于Γ分数 31 ,范围在 0 到 1 之间,分数越高表示细胞类型分离得越好。对于每个细胞 j,我们计算其 k 个最近邻中有多少个被分配到了与细胞 j 相同的细胞类型。我们用 表示这个计数,其中我们选择了 ,并进行了四舍五入。用于评估给定方法组合的整体评分则是所有细胞评分的平均值。
Correlation of DNA methylation and gene expression
DNA 甲基化与基因表达的相关性
To assess the correlation between gene expression and the methylation status of VMRs or promoters, we first detected VMRs with ‘methscan scan --bandwidth 1000 --var-threshold 0.05’. We then quantified DNA methylation at VMRs and promoters with ‘methscan matrix’. We defined promoters as ±2,000 bp regions centered on a gene’s TSS. When multiple TSSs were annotated, we chose the TSS of the ‘principal’ isoform32. Log-normalized expression values reported by Kremer et al.11 were then correlated with methylation of the gene’s promoter or with methylation of the VMR closest to the gene body. When multiple VMRs intersected the gene body, we chose the VMR with the highest methylation variance. As a measure of methylation, we used the shrunken mean of the residuals. We omitted lowly expressed genes (with scRNA-seq counts in <5% of cells) and promoters and VMRs with low scBS coverage (in <10 cells).
为了评估基因表达与 VMRs 或启动子的甲基化状态之间的相关性,我们首先使用“methscan scan --bandwidth 1000 --var-threshold 0.05”检测 VMRs。然后使用“methscan matrix”定量 VMRs 和启动子的 DNA 甲基化水平。我们将启动子定义为中心在基因 TSS 周围±2,000 bp 的区域。当有多个 TSS 注释时,我们选择“主要”isoform 的 TSS 32 。克雷默等人 11 报告的对数正态化表达值随后与基因启动子的甲基化或与基因体附近甲基化最高的 VMR 的甲基化相关。当多个 VMRs 与基因体交叠时,我们选择甲基化方差最高的 VMR。作为甲基化度量,我们使用残差的收缩均值。我们省去了低表达基因(scRNA-seq 计数在<5%的细胞中)以及 scBS 覆盖度低(<10 个细胞)的启动子和 VMR。
MethSCAn was implemented in Python 3.8 using the packages NumPy 1.20.1, SciPy 1.6.1, numba 0.53.0 and Pandas 1.2.3. Benchmarks were performed on MethSCAn version 0.6.2 using Snakemake 7.26 and analyzed/visualized with tidyverse 1.3.1 packages.
MethSCAn 是使用 Python 3.8 语言以及 NumPy 1.20.1、SciPy 1.6.1、numba 0.53.0 和 Pandas 1.2.3 这些包实现的。基准测试是在 MethSCAn 0.6.2 版本中使用 Snakemake 7.26 进行的,并使用 tidyverse 1.3.1 包进行分析和可视化。
Reporting summary 报告摘要
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
有关研究设计的更多信息,请参阅与本文链接的 Nature Portfolio Reporting Summary。
Data availability 数据可用性
All data used to benchmark and showcase the MethSCAn software is publicly available under the following National Center for Biotechnology Information Gene Expression Omnibus accessions: single-cell multi-omics of the murine forebrain11: GSE210806; mouse neurons8: GSE97179; colorectal cancer13: GSE97693; mouse gastrulation10: GSE121708; 100k brain cells15: GSE132489.
所有用于基准测试和展示 MethSCAn 软件的数据均在以下国家生物技术信息中心基因表达综合数据库中公开获取:小鼠前脑单细胞多组学数据 11 : GSE210806;小鼠神经元数据 8 : GSE97179;结直肠癌数据 13 : GSE97693;小鼠胚胎发生数据 10 : GSE121708;100,000 个脑细胞数据 15 : GSE132489。
Code availability 代码可用性
MethSCAn is available on the Python Package Index and can be installed by using the Python package installer ‘pip’. At https://anders-biostat.github.io/MethSCAn we provide detailed documentation including a link to the source code and a tutorial that demonstrates a complete MethSCAn analysis on an example dataset.
MethSCAn 可在 Python 包索引上获得,并可通过 Python 包安装器 ‘pip’ 安装。在 https://anders-biostat.github.io/MethSCAn 上,我们提供了详细的文档,包括源代码链接和一个教程,该教程演示了在示例数据集上进行完整的 MethSCAn 分析的过程。
References 参考文献
Frommer, M. et al. A genomic sequencing protocol that yields a positive display of 5-methylcytosine residues in individual DNA strands. Proc. Natl Acad. Sci. USA 89, 1827–1831 (1992).
Frommer, M. et al. 一种基因组测序协议,该协议能够在单个 DNA 链上正向显示 5-甲基胞嘧啶残基。Proc. Natl. Acad. Sci. USA 89, 1827–1831 (1992).Smallwood, S. A. et al. Single-cell genome-wide bisulfite sequencing for assessing epigenetic heterogeneity. Nat. Methods 11, 817–820 (2014).
Smallwood, S. A. 等. 全基因组单细胞亚硫酸氢盐测序评估表观遗传异质性. 自然方法 11, 817–820 (2014).Hu, Y. et al. Simultaneous profiling of transcriptome and DNA methylome from a single cell. Genome Biol. 17, 1–11 (2016).
Hu, Y. 等. 从单个细胞同时分析转录组和 DNA 甲基化组. 《基因组生物学》. 17, 1–11 (2016).Clark, S. J. et al. scNMT-seq enables joint profiling of chromatin accessibility DNA methylation and transcription in single cells. Nat. Commun. 9, 781 (2018).
Clark, S. J. 等. scNMT-seq 使研究人员能够对单个细胞中的染色质可访问性、DNA 甲基化和转录进行联合分析。自然通讯 9, 781 (2018).Cerrizuela, S. et al. High-throughput scNMT protocol for multiomics profiling of single cells from mouse brain and pancreatic organoids. STAR Protoc. 3, 101555–101016 (2022).
Cerrizuela, S. et al. 高通量单细胞多组学 profiling 协议:来自小鼠大脑和胰腺类器官的单细胞多组学分析。STAR Protoc. 3, 101555–101016 (2022).Hao, Y. et al. Dictionary learning for integrative, multimodal and scalable single-cell analysis. Nat. Biotechnol. 42, 293–304 (2023).
Hao, Y. 等. 综合、多模态和可扩展单细胞分析的字典学习. 自然-生物技术 42, 293–304 (2023).Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 1–5 (2018).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: 大规模单细胞基因表达数据分析. Genome Biol. 19, 1–5 (2018).Luo, C. et al. Single-cell methylomes identify neuronal subtypes and regulatory elements in mammalian cortex. Science 357, 600–604 (2017).
Luo, C. 等. 哺乳动物皮层中神经亚型和调控元件的单细胞甲基化图谱. 科学 357, 600–604 (2017).Bird, A. P. CpG-rich islands and the function of DNA methylation. Nature 321, 209–213 (1986).
Argelaguet, R. et al. Multi-omics profiling of mouse gastrulation at single-cell resolution. Nature 576, 487–491 (2019).
Argelaguet, R. 等. 小鼠胚层形成过程的多组学单细胞解析. 自然 576, 487–491 (2019).Kremer, L.P. et al. Single-cell triple-omics uncovers DNA methylation as key feature of stemness in the healthy and ischemic adult brain. Preprint at bioRxiv https://doi.org/10.1101/2022.07.13.499860 (2022).
Kremer, L.P. 等. 健康和缺血成人脑中干细胞特征的单细胞三组学揭示 DNA 甲基化为核心特征. 预印本于 bioRxiv https://doi.org/10.1101/2022.07.13.499860 (2022).Moore, J. E. et al. Expanded encyclopaedias of DNA elements in the human and mouse genomes. Nature 583, 699–710 (2020).
Moore, J. E. 等. 人类和小鼠基因组的 DNA 元件扩展百科全书. 自然 583, 699–710 (2020).Bian, S. et al. Single-cell multiomics sequencing and analyses of human colorectal cancer. Science 362, 1060–1063 (2018).
Bian, S. 等. 人类结直肠癌的单细胞多组学测序与分析. 科学 362, 1060–1063 (2018).Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 21, 1–17 (2020).
Argelaguet, R. et al. MOFA+: 一种综合多模态单细胞数据的统计框架. Genome Biol. 21, 1–17 (2020).Liu, H. et al. DNA methylation atlas of the mouse brain at single-cell resolution. Nature 598, 120–128 (2021).
Liu, H. 等. 小鼠大脑的单细胞分辨率 DNA 甲基化图谱. 自然 598, 120–128 (2021).Hebestreit, K., Dugas, M. & Klein, H.-U. Detection of significantly differentially methylated regions in targeted bisulfite sequencing data. Bioinformatics 29, 1647–1653 (2013).
Hebestreit, K., Dugas, M. & Klein, H.-U. 在靶向 bisulfite 测序数据中检测显著差异甲基化区域. Bioinformatics 29, 1647–1653 (2013).Korthauer, K., Chakraborty, S., Benjamini, Y. & Irizarry, R. A. Detection and accurate false discovery rate control of differentially methylated regions from whole genome bisulfite sequencing. Biostatistics 20, 367–383 (2018).
Korthauer, K., Chakraborty, S., Benjamini, Y. & Irizarry, R. A. 从全基因组亚硫酸氢盐测序检测和准确的假发现率控制甲基化差异区域. 生物统计学 20, 367–383 (2018).McLean, C. Y. et al. GREAT improves functional interpretation of cis-regulatory regions. Nat. Biotechnol. 28, 495–501 (2010).
McLean, C. Y. 等. GREAT 提高了 cis-调控区域的功能解释. 自然生物技术 28, 495–501 (2010).Boggs, J. M. Myelin basic protein: a multifunctional protein. Cell. Mol. Life Sci. 63, 1945–1961 (2006).
Boggs, J. M. 神经髓鞘基本蛋白:一种多功能蛋白.《细胞与分子生命科学》. 63, 1945–1961 (2006).Krueger, F. & Andrews, S. R. Bismark: a flexible aligner and methylation caller for bisulfite-seq applications. Bioinformatics 27, 1571–1572 (2011).
Krueger, F. & Andrews, S. R. Bismark: 一种灵活的比对器和甲基化检测器,适用于 bisulfite-seq 应用。Bioinformatics 27, 1571–1572 (2011).Schultz, M. D. et al. Human body epigenome maps reveal noncanonical DNA methylation variation. Nature 523, 212–216 (2015).
Schultz, M. D. 等. 人类身体表观基因组图谱揭示非经典的 DNA 甲基化变异. 自然 523, 212–216 (2015).Zhou, W. et al. BISCUIT: an efficient, standards-compliant tool suite for simultaneous genetic and epigenetic inference in bulk and single-cell studies. Nucleic Acids Res. 52, e32 (2024).
周, W. 等. BISCUIT: 一种高效且符合标准的工具套件,用于同时进行 bulk 和单细胞研究中的遗传和表观遗传推断. Nucleic Acids Res. 52, e32 (2024).Kapourani, C.-A. & Sanguinetti, G. Melissa: Bayesian clustering and imputation of single-cell methylomes. Genome Biol. 20, 61 (2019).
Kapourani, C.-A., Argelaguet, R., Sanguinetti, G. & Vallejos, C. A. scMET: Bayesian modeling of DNA methylation heterogeneity at single-cell resolution. Genome Biol. 22, 1–21 (2021).
Kapourani, C.-A., Argelaguet, R., Sanguinetti, G. & Vallejos, C. A. scMET: 基于贝叶斯模型的单细胞分辨率 DNA 甲基化异质性分析. Genome Biol. 22, 1–21 (2021).Danese, A. et al. EpiScanpy: integrated single-cell epigenomic analysis. Nat. Commun. 12, 5228 (2021).
Danese, A. et al. EpiScanpy: 集成单细胞表观基因组分析. Nat. Commun. 12, 5228 (2021).Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).
Welch, J. D. 等. 单细胞多组学整合比较和对比了脑细胞身份的特征. 细胞 177, 1873–1887 (2019).Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Virtanen, P. et al. SciPy 1.0: 基于 Python 的科学计算基础算法. Nat. Methods 17, 261–272 (2020).Baglama, J. & Reichel, L. Augmented implicitly restarted Lanczos bidiagonalization methods. SIAM J. Sci. Comput. 27, 19–42 (2005).
Baglama, J. & Reichel, L. 增广隐重启 Lanczos 双对角化方法. SIAM J. Sci. Comput. 27, 19–42 (2005).Josse, J. & Husson, F. Handling missing values in exploratory multivariate data analysis methods. J. Soc. Fr. Stat. 153, 79–99 (2012).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841–842 (2010).
Kireeva, N. V. et al. Nonlinear dimensionality reduction for visualizing toxicity data: distance-based versus topology-based approaches. ChemMedChem 9, 1047–1059 (2014).
Kireeva, N. V. 等. 用于可视化毒理学数据的非线性维数约简:基于距离的方法与基于拓扑的方法. 化学生物化学 9, 1047–1059 (2014).Rodriguez, J. M. et al. APPRIS: annotation of principal and alternative splice isoforms. Nucleic Acids Res. 41, D110–D117 (2013).
Rodriguez, J. M. 等. APPRIS: 注解主要和替代剪接异构体. Nucleic Acids Res. 41, D110–D117 (2013).
Acknowledgements 致谢
S.A. acknowledges funding by the Klaus Tschira Foundation (project 00.022.2019). A.M.-V. acknowledges funding from the European Commission via ERC grant 771376, from the DFG via SFB 873 and from the DKFZ. We thank A. Uvarovskii for code refactoring and helpful suggestions on the source code.
S.A. 感谢 Klaus Tschira 基金会(项目 00.022.2019)的资金支持。A.M.-V. 感谢欧洲委员会通过 ERC 基金 771376、德国研究基金会通过 SFB 873 以及德国癌症研究中心的支持。感谢 A. Uvarovskii 对代码重构以及源代码的有益建议。
Funding 资助
Open access funding provided by Deutsches Krebsforschungszentrum (DKFZ).
开放获取资金由德国癌症研究中心(DKFZ)提供。
Ethics declarations 伦理声明
Competing interests 竞合利益
The authors declare no competing interests.
作者声明无竞争利益。
Peer review 同行评审
Peer review information 同行评审信息
Nature Methods thanks Andrew Adey, Simon Andrews and Wanding Zhou for their contribution to the peer review of this work. Primary Handling editor: Lei Tang, in collaboration with the Nature Methods team. Peer reviewer reports are available.
Nature Methods 感谢 Andrew Adey、Simon Andrews 和 Wanding Zhou 对本文同行评审工作的贡献。主要处理编辑:Lei Tang,与 Nature Methods 团队合作。同行评审报告可供查阅。
Additional information 附加信息
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
出版社须知 Springer Nature 对出版地图中的管辖争端声明以及机构关系保持中立。
Extended data 扩展数据
Extended Data Fig. 1 VMRs are more informative than an equal number of regulatory regions with high sequencing coverage.
Extended Data Fig. 1 VMRs 比具有高测序覆盖度的等数量调控区域更具信息量。
(A-B) Mean neighbor score obtained when quantifying CpG methylation at VMRs, all ENCODE regulatory regions, or a subset of regulatory regions for the data sets of Kremer et al.11 and Luo et al.8. The subset of high-quality (HQ) regulatory regions was selected in such a way that it matches the number of VMRs and contains those regulatory regions with the highest coverage. (C-D) Runtime (top) and RAM usage (bottom) of the two dimensionality reduction techniques shown in A and B.
(A-B) 在量化 Kremer 等人 11 和 Luo 等人 8 数据集中 VMRs、所有 ENCODE 调控区域或数据集的一部分调控区域的 CpG 甲基化时获得的邻域评分均值。该部分高质量 (HQ) 调控区域的选择方式是使其数量与 VMRs 相匹配,并包含覆盖率最高的调控区域。 (C-D) A 和 B 中所示的两种降维技术的运行时间(顶部)和 RAM 使用量(底部)。
Extended Data Fig. 2 Benchmark of several combinations of analysis methods on four single-cell methylation data sets.
Extended Data Fig. 2 几种分析方法在四个单细胞甲基化数据集上的基准测试。
(A-E) Benchmarking the ability to separate groups of cells based on CpG methylation data, using different analysis approaches. Methylation matrices were obtained either by averaging CpG methylation in genomic intervals (dashed lines, hollow points) or using the shrunken mean of residuals (solid lines and points). Each analysis was performed for the full data sets and sub-samples of each. Four sets of genomic intervals (VMRs, ENCODE regulatory elements, 100 kb tiles or promoters) were quantified and separately subjected to four dimensionality reduction techniques (see Methods for details): iterative PCA as proposed by us, mean-imputed PCA on intervals with high sequencing coverage8, mean-imputed PCA on all intervals, and MOFA+14 on all intervals. For each combination of methods, we quantified the ability to use distance in 15-dimensional PCA space to separate ground-truth cell labels reported by the authors (neighbor score). (A) Separation of neural cell types/states annotated based on scRNA-seq of the same cells11. (B) Separation of neuron sub-types, annotated by averaging CH-methylation (mCH) in 100 kb genomic tiles8. (C) Separation of human colorectal cancer sampling regions (primary tumor, normal adjacent tissue, lymph node metastasis, liver metastasis before and after treatment, omental metastasis)13. (D-E) Separation of three properties of mouse embryo cells: The developmental stage (E4.5, E5.5, E6.5, E7.5), the germ layer or extra-embryonic tissue of origin, or the cell lineage. Germ layer and lineage (E, only E7.5-cells) were annotated by Argelaguet et al.10 based on scRNA-seq of the same cells.
(A-E) 基于 CpG 甲基化数据,评估不同分析方法区分细胞群的能力。甲基化矩阵要么通过基因区间内 CpG 甲基化平均值获得(虚线,空心点),要么通过残差的收缩均值获得(实线和实心点)。每种分析方法分别对完整数据集和每个数据集的子样本进行了处理。四个基因区间集(VMRs、ENCODE 调控元件、100 kb 瓦片或启动子)被量化并分别应用了四种降维技术(详见方法部分):我们提出的迭代 PCA,高测序覆盖度区间上的均值填充 PCA,所有区间上的均值填充 PCA,以及所有区间上的 MOFA+。对于每种方法组合,我们量化了在 15 维 PCA 空间中使用距离区分作者报告的真实细胞标签(邻居得分)的能力。(A)基于单细胞 RNA 测序(scRNA-seq)对同一细胞的神经细胞类型/状态进行区分。(B)基于 100 kb 基因组瓦片中 CH-甲基化(mCH)平均值对神经元亚型进行区分。 (C) 人类结直肠癌取样区域的分离(原发肿瘤、正常相邻组织、淋巴结转移、治疗前后的肝转移、腹膜转移) 13 。(D-E) 小鼠胚胎细胞的三个属性分离:发育阶段(E4.5、E5.5、E6.5、E7.5),起源于胚层或胚外组织,或细胞谱系。胚层和谱系(E,仅 E7.5 细胞)由 Argelaguet 等人 10 根据相同细胞的单细胞 RNA 测序进行注释。
Extended Data Fig. 3 Effect of VMR detection parameters on separation of cell types.
Extended Data Fig. 3 VMR 检测参数对细胞类型分离的影响。
(A-B) Mean neighbor scores obtained after analyzing single-cell methylomes from Kremer et al.11 (A) or Luo et al.8 (B) with our proposed methods. VMRs were detected with 'methscan scan' using various sliding window bandwidths, variance thresholds, and on sub-samples of the full data sets. (C) Mean neighbor scores obtained after analyzing the Kremer et al.11 data set with various sliding window step sizes.
(A-B) 分析 Kremer 等人 11 (A) 或 Luo 等人 8 (B) 单细胞甲基化组后获得的邻近节点得分。使用 'methscan scan' 检测 VMRs,并使用不同的滑动窗口带宽、方差阈值,在整个数据集的子样本上进行。 (C) 使用不同的滑动窗口步长分析 Kremer 等人 11 数据集后获得的邻近节点得分。
Extended Data Fig. 4 MethSCAn performs well on CH-methylation data and on extremely large data sets.
Extended Data Fig. 4 MethSCAn 在 CH-甲基化数据和极其庞大的数据集上表现良好。
(A) UMAPs obtained when analysing data from Luo et al.8 with three different strategies for producing a methylation matrix: top: averaging CH-methylation (mCH) in 100 kb genomic tiles (as done in ref. 8 to obtain the depicted cell type labels); center: using the shrunken mean of residuals to quantify mCH in mCH-VMRs; bottom: shrunken mean of residuals to quantify CpG methylation in CpG-VMRs. (B) UMAP obtained when analysing 100 350 neural single-cell methylomes15 with the default MethSCAn workflow. The depicted cell labels were reported by Liu et al.15. Analysing this large data set takes approximately a week on a computer with 256 GB RAM and 48 CPUs, with almost all of this time (152 hours) spent by the 'prepare' step, which is run only once per data set. The speed of this step can be increased considerably by processing chromosomes in parallel, which reduced the total runtime to approximately two days in our case.
(A) 当分析 Luo 等人 8 的数据时,使用三种不同的方法生成甲基化矩阵获得的 UMAP:顶部:在 100 kb 基因组瓦片中平均 CH-甲基化 (mCH)(如参考文献 8 中所述,以获得所示的细胞类型标签);中间:使用残差的收缩均值来量化 mCH 在 mCH-VMRs 中;底部:使用残差的收缩均值来量化 CpG 甲基化在 CpG-VMRs 中。 (B) 当使用默认的 MethSCAn 工作流分析 100 个 350 神经单细胞甲基组 15 时获得的 UMAP。所示的细胞标签由 Liu 等人 15 报告。分析这个大数据集大约需要一周时间,使用 256 GB 内存和 48 个 CPU 的计算机,其中几乎所有时间(152 小时)都花费在 'prepare' 步骤上,该步骤仅在每次数据集运行一次。通过并行处理染色体可以大大加快此步骤的速度,在我们的情况下,总运行时间减少到大约两天。
Supplementary information
补充信息
Rights and permissions 权利和许可
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
开放获取 本文采用知识共享 Attribution 4.0 International License 许可协议,根据该协议,您可以自由使用、分享、改编和分发本文,但需向原始作者(们)和来源提供适当 attribution,并附上知识共享许可链接,如需对本文进行修改,则需标明所做的修改。本文中的图片或其他第三方材料,除非材料的致谢说明中另有说明,否则均受本文的知识共享许可协议覆盖。如果材料未包含在本文的知识共享许可协议中,且您的使用方式未被相关法律法规允许或超出许可范围,您需要直接向版权持有者获得许可。要查看该许可协议的副本,请访问 http://creativecommons.org/licenses/by/4.0/。
About this article
Cite this article
Kremer, L.P.M., Braun, M.M., Ovchinnikova, S. et al. Analyzing single-cell bisulfite sequencing data with MethSCAn. Nat Methods 21, 1616–1623 (2024). https://doi.org/10.1038/s41592-024-02347-xIF: 36.1 Q1 IF: 36.1 Q1
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1038/s41592-024-02347-xIF: 36.1 Q1 IF: 36.1 Q1