Abstract
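Screening new compounds for potential bioactivity against cellular targets is vital for drug discovery and chemical safety. Transcriptomics offers an efficient way to assess global changes in gene expression, but interpreting chemical mechanisms from these data is often challenging. Connectivity mapping is a potential data-driven avenue for linking chemicals to mechanisms, based on the observation that many biological processes are associated with unique gene expression signatures (gene signatures). However, because transcriptomic data contain thousands of noisy genes, mining the effects of chemicals on gene signatures of biological mechanisms is challenging. Spurred by the promise of discovering chemical mechanisms, new drugs, and disease targets from ever-growing transcriptomic data, new connectivity mapping approaches that seek to distinguish signal from noise continue to be developed. Here, we analyze these approaches in terms of the different transcriptomic technologies, public databases, gene signatures, pattern-matching algorithms, and statistical evaluation criteria. To navigate the complexity of connectivity mapping, we propose a harmonized scheme to coherently organize and compare published workflows. First, we standardize the basic concepts of transcriptomic profiles and gene signatures across transcriptomic technologies such as microarrays, RNA-Seq, and L1000, and discuss widely used data sources such as the Gene Expression Omnibus (GEO), ArrayExpress, and MSigDB. Next, we generalize connectivity mapping as a pattern-matching task for finding similarity between a query (e.g., the transcriptomic profile of a new chemical) and a reference (e.g., the gene signature of a known target). Published pattern-matching approaches fall into two main categories: vector-based approaches, which use metrics such as correlation and the Jaccard index, and aggregation-based approaches, which use parametric and nonparametric statistics (e.g., gene set enrichment analysis). We describe the statistical methods for evaluating the performance of different approaches, along with comparisons on benchmark transcriptomic data sets reported in the literature. Finally, we review applications of connectivity mapping in toxicology and offer guidance on evaluating chemical-induced toxicity with concentration-response transcriptomic data. In addition to serving as a high-level guide and tutorial for understanding and implementing connectivity mapping workflows, we hope this review will stimulate new algorithms for evaluating chemical safety and for drug discovery using transcriptomic data.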
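Keywords: chemical screening, gene expression, bioactivity fingerprints, transcriptomic profiling, gene signatures, similarity algorithms, gene set enrichment, transcriptomic concentration-response, new approach methodologies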
Graphical Abstract
Introduction
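Drug discovery and chemical safety require efficient tools to screen new compounds for potential bioactivity against cellular targets. Transcriptomics is one of the widely used technologies for assessing the biological effects of chemicals through their impact on global gene expression.1 Because chemicals induce gene expression changes either directly by binding to receptors2 or indirectly by disrupting cellular homeostasis,3 inferring their targets from transcriptomic data is challenging. Connectivity mapping addresses this problem by using the “universal language” of genes4,5 to measure the similarity between transcriptomic profiles and gene signatures associated with cellular targets. Transcriptomics has developed rapidly in the two decades since the sequencing of the human genome.6 Beginning with spotted cDNA arrays,7 followed by high-density oligonucleotide arrays,8 and, most recently, RNA sequencing technologies,9 transcriptomics has steadily improved in reproducibility, reliability, and cost-effectiveness.10 As a result, public-domain repositories11,12 now provide millions of transcriptomic profiles covering thousands of diseases.4,13 Innovative tools are needed to exploit this wealth of transcriptomic data and reveal new relationships between chemicals, pathways, and diseases. Connectivity mapping4 is one such tool: it can facilitate drug discovery,14 aid the repurposing of existing drugs,15 and help produce safer chemicals.16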
Connectivity mapping with transcriptomic data is one of many techniques in a rich landscape of computational methods for inferring the putative interactions between chemicals and biological targets or pathways. This landscape can be broadly divided into approaches based on binding, similarity, and machine learning (ML). Binding-based methods attempt to model physico-chemical interactions between a chemical and a protein target with three-dimensional structure data using molecular dynamics17,18 or, more recently, using ML.19 There have been impressive advances predicting new ligands for specific protein targets,20 and with predicted three-dimensional structures for all known proteins,21 virtually screening all chemicals against thousands of protein targets could be within reach.22
Connectivity mapping is conceptually related to other similarity-based approaches, which attempt to infer the properties of a new chemical using pair-wise similarity with chemicals of known properties, including physico-chemical properties or biological activities. If two chemicals have significant structural similarities, then similarity-based approaches assume they also have similar properties. Similarity-based approaches have two essential ingredients: a vector of attributes and a measure of similarity based on the attributes. Similarity-based pattern-matching techniques are also considered instance-based learning methods23 in ML, which includes approaches like k-nearest neighbor (KNN) classification. Chemical similarity-based approaches use molecular structure descriptors (such as extended connectivity fingerprints24) to represent chemicals and measure similarity using set operations (for a review of similarity measures, see Bero et al.25). For example, a query chemical can be searched against a database to find other structurally similar chemicals from which the unknown biological role can be inferred. Chemical structure-based similarity is widely used to infer molecular targets.26 One of the problems with using chemical similarity-based techniques is that minor alterations in structure can lead to drastic changes in their affinity for the same target, which are known as “activity cliffs” in structure-activity relationship (SAR) research.27 Another issue is that new structural categories of chemicals can be discovered or synthesized that have no existing analogues. If they bear insufficient resemblance to known chemicals, it is not possible to infer their properties based on structural similarity alone. Despite these limitations, structure-based automated prediction approaches28 are routinely used to fill data gaps for untested chemicals based on the known properties of analogues in the same local domains. 
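The two ingredients named above, a vector of attributes and a similarity measure, can be illustrated with a minimal Python sketch. It assumes each chemical is represented by the set of fingerprint bits that are switched on and uses the Jaccard (Tanimoto) coefficient as the set-based similarity. The chemical names and bit positions are hypothetical values chosen for illustration and do not come from any real fingerprinting scheme.

```python
def tanimoto(bits_a, bits_b):
    """Jaccard (Tanimoto) similarity between two sets of 'on' fingerprint bits."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty fingerprints score 0
    return len(a & b) / len(union)


def nearest_neighbors(query_bits, library, k=1):
    """Rank library chemicals by similarity to the query and return the top k."""
    scored = [(name, tanimoto(query_bits, bits)) for name, bits in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical fingerprints: each value is the set of bit positions that are set.
library = {
    "chem_A": {1, 4, 7, 9},
    "chem_B": {2, 3, 8},
    "chem_C": {1, 4, 7, 10, 12},
}
query = {1, 4, 7, 9, 12}

# The properties of the query would be inferred from its closest analogues.
hits = nearest_neighbors(query, library, k=2)
for name, score in hits:
    print(f"{name}: {score:.2f}")
```

In this toy case the query shares four of five bits with chem_A, so chem_A is the nearest analogue. The same ranking logic underlies k-nearest neighbor approaches; only the descriptor and the similarity measure change.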
More recently, structural and bioactivity similarity between chemicals has been used to infer the toxicity of untested chemicals.29–32
Finding pair-wise similarities using biological and chemical descriptors is a practical strategy for inferring the properties of untested chemicals; however, if hundreds of chemicals are associated with different classes of biological activities (e.g., protein target, pathway activation, toxicity, etc.), then ML can be more effective. ML algorithms systematically mine patterns in data (i.e., vector representations of data derived from biological and chemical descriptors) to build accurate predictive models of various biological activities.33 For example, ML algorithms mine chemical structure representations to build models, referred to as quantitative structure-activity relationships (QSARs).34 QSAR models have been used to classify potential nuclear receptor activators,35,36 cellular stress responses,37,38 and toxicities.39,40 Similarly, ML algorithms mine transcriptomic data on chemicals (derived from different cellular contexts) to build models of biological mechanisms41–44 and toxicities.45,46 Models derived by ML can predict the bioactivity or toxicity of new or untested chemicals using vector representations of data (i.e., attribute-value vectors of the same form as those used to train the model). Different ML methods have varying requirements for training data to produce reliable predictive models. Whereas similarity-based approaches such as KNN may only require a few examples because of their simplicity, more complex ML algorithms need varying amounts of training data to tune model parameters reliably. Furthermore, for in vivo toxicity prediction, it is also essential to consider the chemical dose, duration, and route of exposure. A systematic comparison of similarity-based and other ML algorithms is beyond the scope of this review.
Connectivity mapping may be considered an automated biological read-across47,48 technique to infer properties of untested substances using transcriptomic profiles in place of chemical structure representations. Gene-based descriptors in transcriptomic profiles measure the expression of specific genes in the genome, just like structure descriptors capture the presence of substructural moieties in chemicals. Transcriptomic profiles, however, can capture the biological response to chemical treatments, genetic perturbations, or pathological conditions using continuous expression levels of genes in ways that chemical structure descriptors cannot. The ability of transcriptomics to capture a diverse array of physiological states also makes it a powerful tool for finding similarity-based connections. This review is a guide for navigating connectivity mapping in terms of the diverse array of technologies to generate transcriptomic profiles, define biological states using gene-based descriptors, and organize the plethora of algorithms to measure transcriptomic similarity.
Historical Background
Connectivity mapping originates from functional discovery studies,49 which aimed to interpret the molecular phenotypes of biological samples using transcriptomics.50,51 A pivotal study by Hughes et al. produced one of the earliest and largest compendia of transcriptomic profiles for 300 genetic and chemical perturbations in yeast.52 The authors used similarity between transcriptomic profiles to cluster known mutants, uncharacterized mutants, and pharmacologic agents. For example, deleting YER044c, an uncharacterized yeast open reading frame (ORF), produced transcriptional profiles similar to those of the sterol isomerase (ERG2) deletion mutant. Further experiments determined that the YER044c ORF encoded the endoplasmic reticulum protein (ERG28). Because ERG2 and ERG28 are both involved in ergosterol biosynthesis, their deletion mutants produced similar transcriptomic profiles. Hughes et al. also showed that transcriptomic responses to the drug fenpropimorph were similar to the responses of ERG2 deletion mutants. This is not surprising, as fenpropimorph is a fungicide that disrupts eukaryotic sterol biosynthesis pathways. Surprisingly, fenpropimorph was also a potent mammalian antagonist of the sigma-1 receptor (SIGMAR1), which participates in neuromodulatory pathways involved in pain. SIGMAR1 antagonists are being explored as a novel class of analgesic agents for treating pain.53 There is growing evidence that ERG2 disruptors in yeast are SIGMAR1 antagonists,54 and such pharmacological agents can be identified by connectivity mapping. The ability to link chemicals to mechanisms within and across species showed the value of transcriptomics as a “universal phenotype” for fingerprinting global biological states and of transcriptomic similarity for uncovering novel relationships between chemicals and their targets.
Before connectivity mapping approaches, transcriptomics mainly identified differentially expressed genes between cases and controls using p-value and fold-change thresholds. Lists of differentially expressed genes helped identify statistically over-represented pathways (e.g., using Fisher’s Exact Test55) and provided insight into putative biological mechanisms (see Khatri and Draghici56 and Rivals et al.57). However, because gene lists are sensitive to the choice of differential expression thresholds, using varying statistical cut-offs can produce inconsistent biological interpretations. Mootha et al. showed that over-representation analysis of gene lists ignored the subtle yet coordinated regulation of gene sets relevant to a pathway. They found a gene set for the oxidative phosphorylation pathway “enriched” in diabetic versus healthy muscle tissues even though individual genes in the set were not significantly differentially expressed.58,59 Mootha et al.58 and Subramanian et al.59 called this approach gene set enrichment analysis (GSEA). Other gene set analysis (GSA) approaches subsequently used for pathway and function enrichment60–62 have been reviewed extensively elsewhere.56,57
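The threshold sensitivity of over-representation analysis can be made concrete with a short Python sketch. It computes the one-sided Fisher's exact p-value as the upper tail of the hypergeometric distribution, using only the standard library. The gene identifiers, pathway membership, and the "strict" and "relaxed" gene lists are all hypothetical and chosen only to show how the result moves with the cut-off.

```python
from math import comb


def over_representation_p(universe, gene_list, pathway):
    """One-sided Fisher's exact (hypergeometric) p-value for pathway over-representation.

    universe  : all genes measured in the experiment
    gene_list : differentially expressed genes (depends on the chosen cut-offs)
    pathway   : genes annotated to the pathway of interest
    """
    universe = set(universe)
    gene_list = set(gene_list) & universe
    pathway = set(pathway) & universe
    n_total, n_path, n_list = len(universe), len(pathway), len(gene_list)
    overlap = len(gene_list & pathway)
    # P(X >= overlap) when n_list genes are drawn without replacement.
    upper = min(n_path, n_list)
    tail = sum(
        comb(n_path, k) * comb(n_total - n_path, n_list - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(n_total, n_list)


# Hypothetical experiment: 20 measured genes, 5 of them in the pathway.
universe = [f"g{i}" for i in range(20)]
pathway = ["g0", "g1", "g2", "g3", "g4"]

# A strict cut-off yields 4 genes, 3 of which are pathway members.
strict_list = ["g0", "g1", "g2", "g10"]
# A relaxed cut-off adds 6 non-pathway genes to the same list.
relaxed_list = strict_list + [f"g{i}" for i in range(11, 17)]

p_strict = over_representation_p(universe, strict_list, pathway)
p_relaxed = over_representation_p(universe, relaxed_list, pathway)
print(f"strict list:  p = {p_strict:.3f}")
print(f"relaxed list: p = {p_relaxed:.3f}")
```

The same three pathway genes appear in both lists, yet the p-value rises from about 0.03 to 0.5 once the cut-off is relaxed, which illustrates why interpretations based on gene lists can be inconsistent.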
The connectivity map (CMap) project, which gave rise to the eponymous “connectivity mapping” approach, was the first publicly available large-scale compendium of transcriptomic profiles generated by treating human cells with a library of small molecules.5 Connectivity mapping used this compendium of 564 unique transcriptomic profiles for 164 chemicals (Build 01 of the CMap database, which we refer to as CMap v1). Connectivity, or similarity, was measured using a modified version of GSEA for analyzing “gene signatures” derived from highly up- and down-regulated genes in the transcriptomic profiles (see Figure 1). For example, Lamb et al.5 searched a signature of histone deacetylase (HDAC) inhibitors against the CMap v1 reference database. The HDAC inhibitor signature, which comprised eight up-regulated and five down-regulated genes (illustrated in Figure 1(a)), was derived from an independent study of HDAC inhibitors in bladder and breast cancer cells.63 Searching the entire CMap v1 database (illustrated in Figure 1(e)) with this HDAC signature using GSEA (illustrated in Figure 1(c)) identified the most robust connections with vorinostat and trichostatin A, both HDAC inhibitors (an example of such a match is shown in Figure 1(f)). The ability of GSEA to link signatures of HDAC inhibitors from disparate experiments provided compelling evidence for the utility of connectivity mapping approaches.
In a second example, López et al. searched a gene signature of diet-induced obesity in rats.64 Using GSEA, they found a strong match between this signature and transcriptomic profiles for troglitazone, rosiglitazone, and indomethacin, all peroxisome proliferator-activated receptor gamma (PPARG) agonists. However, the directions of gene expression in the diet-induced obesity signature (i.e., up- and down-regulated genes) were found to be opposite to the directions of the genes in the profile for PPARG agonists. Such matches are referred to as “negative connections” as they have negative GSEA scores (see Figure 1(i) for a visual example of a negative connection). Interestingly, PPARG agonists are prescribed as hypolipidemic agents for the treatment of diabetes but can produce weight gain and liver injury as unwanted side effects. Thus, connectivity mapping revealed that the biological state of diet-induced obesity is “negatively connected” with PPARG-mediated hypolipidemic activity, notwithstanding differences in cells, treatment conditions, and gene expression assaying technologies. Finding negative connections between disease gene signatures and transcriptomic profiles of approved drugs forms the basis of some drug-repurposing approaches.15 These findings further demonstrated the utility of transcriptomic connectivity mapping for linking disease phenotypes with putative chemical treatments based on gene signatures. The initial success of connectivity mapping led to an expansion of the CMap (Build 02 of the CMap database, which we refer to as CMap v2) to cover 1,309 chemicals and 6,100 transcriptomic profiles.65
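Positive and negative connections can be illustrated with a simplified Python sketch of a signed, Kolmogorov-Smirnov-style score. It is written in the spirit of the GSEA-based scoring described above and is not the published CMap implementation: the function names, the ten-gene profile, and the two-gene signature are hypothetical, and normalization and permutation-based significance testing are omitted.

```python
def ks_score(ranked_genes, tags):
    """KS-like enrichment of `tags` in a profile ranked from most up- to most down-regulated.

    Positive values mean the tags cluster near the top of the ranking,
    negative values mean they cluster near the bottom.
    """
    n = len(ranked_genes)
    position = {gene: i + 1 for i, gene in enumerate(ranked_genes)}  # 1-based ranks
    ranks = sorted(position[g] for g in tags if g in position)
    t = len(ranks)
    if t == 0:
        return 0.0
    a = max((j + 1) / t - r / n for j, r in enumerate(ranks))
    b = max(r / n - j / t for j, r in enumerate(ranks))
    return a if a > b else -b


def connectivity_score(ranked_genes, up_genes, down_genes):
    """Combine the up and down enrichments into one signed score (assumed scheme)."""
    ks_up = ks_score(ranked_genes, up_genes)
    ks_down = ks_score(ranked_genes, down_genes)
    if (ks_up > 0) == (ks_down > 0):
        return 0.0  # up and down tags point the same way: no consistent connection
    return ks_up - ks_down


# Hypothetical reference profile, ranked from most up-regulated to most down-regulated.
profile = [f"g{i}" for i in range(1, 11)]
signature_up, signature_down = ["g1", "g2"], ["g9", "g10"]

positive = connectivity_score(profile, signature_up, signature_down)
# Reversing the ranking mimics a treatment with the opposite transcriptional effect.
negative = connectivity_score(profile[::-1], signature_up, signature_down)
print(f"same direction:     {positive:+.2f}")
print(f"opposite direction: {negative:+.2f}")
```

When the signature and the profile agree in direction the score is positive; when the profile is reversed the same signature yields a score of equal magnitude and opposite sign, which corresponds to a negative connection.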
Connectivity mapping and toxicology
A key challenge in toxicology is evaluating the safety of chemicals by determining their potency and potential for activating molecular targets that can lead to adverse health outcomes.66 In computational toxicology, transcriptomic profiling is used to rapidly screen thousands of untested chemicals to identify their putative targets, mechanisms of action, or other effects.10,67–69 This is possible because high-throughput transcriptomic profiling using mRNA sequencing (RNA-Seq)9 and, more recently, targeted RNA-Seq70 is an extremely promising and cost-effective approach for generating transcriptomic profiles for tens of thousands of chemical treatments. Whether evaluating new chemical entities for drug discovery or untested environmental chemicals for public health protection, transcriptomic connectivity mapping is a robust and high-throughput alternative to existing techniques.16 Therefore, it is essential to examine the landscape of connectivity mapping approaches, understand their operation transparently, and assess their utility for specific toxicology applications.
Harmonizing connectivity mapping approaches
Dozens of refinements or alternatives to connectivity mapping have been proposed and are reviewed elsewhere.71,72 In this review, we develop a coherent view of various connectivity mapping approaches with an emphasis on three main ingredients: a transcriptomic profile produced by a perturbagen, a gene signature associated with a biological state, and an approach for matching the profile with the signature. The connectivity mapping workflow can be generalized as a database search and retrieval operation (see Figure 1) in which a “query” object (Figure 1(a)) is compared with an extensive collection of “reference” objects (Figure 1(b)) from a reference database (Figure 1(e)) using a pattern-matching algorithm (Figure 1(c)) to find the most similar “hits” (Figure 1(f) and (h)). We employ the generic term “object” to cover several kinds of gene set-based inputs (summarized visually in Figure 2) for pattern matching. The variation between the connectivity mapping approaches is explained by the differences in the choice of the query, the reference database, and the pattern-matching algorithm. For example, Mootha et al.58 used a transcriptomic profile (derived from diabetic versus healthy muscle tissue) (Figure 2(a)) as the query, pathway-based gene sets (Figure 2(d) and (e)) for the reference database, and GSEA for pattern matching. On the other hand, Lamb et al.4 used gene signatures of HDAC inhibitors as the query, transcriptomic profiles as the reference database, and a modified version of GSEA. Although the workflow used by Mootha et al. is generally referred to as “pathway enrichment,” using the harmonized scheme presented here, we discuss how “enrichment” and “connectivity mapping” may be considered different types of similarity measures for comparing gene set objects composed of gene signatures and transcriptomic profiles.
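The search-and-retrieval view can be sketched in a few lines of Python. The sketch assumes a generic `search` function that accepts any query object, any reference database, and any scoring function, so that swapping these three choices reproduces the two configurations contrasted above. The scoring functions are deliberately simple stand-ins for GSEA, and all gene, treatment, and pathway names and expression values are hypothetical.

```python
from statistics import mean


def search(query, reference_db, score_fn, top_n=3):
    """Score the query against every reference object and return ranked hits."""
    hits = [(name, score_fn(query, ref)) for name, ref in reference_db.items()]
    hits.sort(key=lambda h: abs(h[1]), reverse=True)  # strongest connections first
    return hits[:top_n]


def signature_vs_profile(signature, profile):
    """Toy score: mean expression of 'up' genes minus mean expression of 'down' genes."""
    up = [profile[g] for g in signature["up"] if g in profile]
    down = [profile[g] for g in signature["down"] if g in profile]
    if not up or not down:
        return 0.0
    return mean(up) - mean(down)


def profile_vs_gene_set(profile, gene_set):
    """Toy score: mean expression of the gene-set members in the query profile."""
    values = [profile[g] for g in gene_set if g in profile]
    return mean(values) if values else 0.0


# Hypothetical data (log fold-changes); gene and treatment names are made up.
profiles = {
    "treatment_X": {"g1": 2.0, "g2": 1.5, "g3": -1.0, "g4": -2.0},
    "treatment_Y": {"g1": -1.0, "g2": -2.0, "g3": 1.0, "g4": 2.0},
    "treatment_Z": {"g1": 0.1, "g2": -0.1, "g3": 0.0, "g4": 0.1},
}
signature = {"up": ["g1", "g2"], "down": ["g3", "g4"]}
gene_sets = {"pathway_P": ["g1", "g2"], "pathway_Q": ["g3", "g4"]}

# Configuration 1: a gene signature queried against a database of profiles.
signature_hits = search(signature, profiles, signature_vs_profile)
# Configuration 2: a transcriptomic profile queried against a database of gene sets.
pathway_hits = search(profiles["treatment_X"], gene_sets, profile_vs_gene_set)

print(signature_hits)
print(pathway_hits)
```

Ranking by absolute score is a design choice made here so that strong negative connections are retained alongside positive ones; a workflow interested only in positive matches would sort on the signed score instead.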
For example, the query object used by Lamb et al.5 was a gene signature derived from transcriptomic profiles of HDAC inhibitors in bladder and breast cancer cells.63 This gene signature for “HDAC inhibition” was defined by a set of up- and down-regulated genes (visualized in Figure 2(c)). In contrast, the reference objects were transcriptomic profiles (Figure 2(a)) in the CMap v1 database. The similarity between the query gene signature for HDAC inhibition and each CMap v1 reference transcriptomic profile was measured using the same scoring metric as GSEA. The top-scoring matches, vorinostat and trichostatin A, are both well-known HDAC inhibitors. In other words, GSEA “connected” the biological state of the query object, represented by a gene signature, with HDAC inhibition. This approach for connectivity mapping can be used in toxicity testing for new chemicals by generating gene signatures using transcriptomics, matching signature