Dear Editors,
We hereby submit a manuscript entitled “Supergenomes of Scenedesmaceae (Chlorophyta) illuminate lipid accumulation and stress adaptation in oleaginous microalgae” to be considered for publication in Cell Host & Microbe.
Oleaginous microalgae combine high lipid content with the ability to grow under diverse environmental conditions. The oleaginous species of the family Scenedesmaceae (Sphaeropleales, Chlorophyta), an important constituent of freshwater phytoplankton globally, have emerged as a promising source for biofuels, biostimulants, and bioremediation. Long-read genome sequencing of 38 species of Scenedesmaceae, including 16 chromosome-level genome assemblies, provided genomic underpinnings for the exceptional adaptive properties of their vegetative cells against stress. Our high-quality genome assemblies enabled us to establish the first pan-genomes of green algae at the family, genus and species levels. The pan-genomes revealed a surprisingly high proportion of dispensable gene families relative to core gene families, even at the species level. This suggests that genomes in the Scenedesmaceae are highly dynamic, probably reflecting their extraordinary adaptive properties.
Analyses of gene family gains and expansions gave evidence that horizontal gene transfer (HGT) from non-viridiplant sources contributed to gene innovations in four major processes that have been implicated in stress adaptation: sulfur metabolism, lipid metabolism, heterotrophy, and stress resistance. Although the mechanism(s) of HGT remain to be further investigated, we identified endogenous viral elements from the phylum Nucleocytoviricota in most of the Scenedesmaceae genomes, suggesting that virus-mediated HGT could have donated non-viridiplant eukaryotic genes (from animals, fungi and protists) as well as bacterial genes (from intracellular bacteria) to the genomes of Scenedesmaceae. A second putative mechanism for gene innovation, whole genome duplication (WGD), has previously received relatively little attention in Chlorophyta and in microalgae in general. We identified it in the Scenedesmaceae in the form of genome diploidization. About a quarter of the investigated strains were found to be diploid, and preliminary phylogenetic analyses suggest that interspecific hybridization may have been involved in diploidization. We conclude that HGT and WGD (diploidization), in combination with waves of expansion of transposable elements, have played and continue to play major roles in the adaptive evolution of Scenedesmaceae and probably of microalgae in general.
We feel that the results of our study should be of general interest to readers of the journal, be they geneticists, genome researchers or evolutionary biologists. We would therefore be delighted if you would consider this manuscript for publication in your prestigious journal.
¹BGI-Research, Wuhan 430074, China
²Department of Biosciences, Swansea University, Swansea, United Kingdom
³College of Life Sciences, University of Chinese Academy of Sciences, Beijing, China
⁴BGI Research, Sanya 572025, China
⁵State Key Laboratory of Agricultural Genomics, BGI Research, Shenzhen 518083, China
⁶Culture Collection of Algae at Göttingen University SAG, Göttingen, Germany
⁷Culture Collection of Algae and Protozoa CCAP, Scottish Association for Marine Science, Oban, Argyll, United Kingdom
⁸Department of Plant Microbe Interactions, Max Planck Institute for Plant Breeding Research, Cologne, Germany
⁹Department of Biological Sciences and Department of Medicine, University of Alberta, Edmonton, Alberta, Canada
¹¹These authors contributed equally: Linzhou Li, Hongli Wang, Xiayi Chen, Jingmin Kang
*Correspondence:
Tong Wei, weitong@genomics.cn,
Michael Melkonian, mmelkonian@mpipz.mpg.de,
Sibo Wang, wangsibo1@genomics.cn
SUMMARY
Oleaginous microalgae combine high lipid content with the ability to grow under diverse environmental conditions. The oleaginous species of the family Scenedesmaceae (Sphaeropleales, Chlorophyta) have emerged as a promising source for biofuels, biostimulants, and bioremediation. Long-read genome sequencing of 38 species of Scenedesmaceae, including 16 chromosome-level genome assemblies, provided genomic underpinnings for the exceptional adaptive properties of their vegetative cells against stress. Pan-genome analysis identified core and dispensable genomes in the Scenedesmaceae and unveiled gene family expansions and gains in four major metabolic processes that enhance survival of cells under stress: lipid metabolism, sulfolipid metabolism, resistance to oxidative stress, and heterotrophy. Gene family gains by horizontal gene transfer (HGT) from non-viridiplant donors were discovered in all four processes. Additionally, ten diploid species were identified, which may have evolved by fusion of homothallic gametes. We conclude that HGT and diploidization have fostered adaptive processes in the Scenedesmaceae, contributing to the unique properties of these oleaginous microalgae.
Microalgae are emerging as promising sustainable sources for biofuels, biostimulants in agriculture, bioremediation of polluted water and soil, and feed and food for livestock and human nutrition^{1,2}. Oleaginous microalgae have received particular attention in this respect, because they often contain high amounts of lipids and can grow under a wide range of environmental conditions, including various types of wastewater^{3-7}. Lipid accumulation in these algae is often accompanied by the biosynthesis of ketocarotenoids, such as astaxanthin, which add value to their biomass^{8,9}.
Among the most widely studied oleaginous microalgae are members of the Sphaeropleales (Chlorophyta). This large order (with ~25 families and 1,005 species^{10}) comprises some of the most common freshwater microalgae with a cosmopolitan distribution (e.g., species of Scenedesmus, Desmodesmus, and Tetradesmus) in the phytoplankton of eutrophic/hypertrophic ponds and lakes^{11}. As part of the nano/microplankton, they are subjected to turbulence and must have some means to survive long periods of darkness, drawing on intracellular reserves of energy-rich substances, low rates of basal metabolism and/or the use of organic substances for heterotrophic growth^{12-14}.
Being abundant, these microalgae are also able to withstand periods of strong grazing pressure, because they have evolved an enormous variety of cell shapes and, often, associations of cells into colonies (coenobia) of specific forms^{15}. Members of the Scenedesmaceae, the most species-rich family in the Sphaeropleales, have also been shown to cope in their vegetative stage with harsh environmental conditions such as heat, salinity, high light intensities, and exposure to high concentrations of heavy metals^{16-18}, indicating efficient molecular responses to oxidative stress^{19}. This contrasts with the sister lineage of Sphaeropleales, the Volvocales^{20}, in which blooms of planktonic species usually end under environmental stress through the transformation of vegetative, flagellate cells into resting cells, often following sexual reproduction^{21-23}.
Understanding the molecular basis for the evolution of these different ecological adaptations and life histories in the two major orders of Chlorophyceae requires comparative analyses of high-quality genome assemblies^{24}. Such genome assemblies are also invaluable resources for designing genetic tools needed for microalgal biotechnology and bioengineering applications^{25}. While reference genomes have been established for several unicellular and multicellular members of Volvocales^{26-29}, only one chromosome-level genome assembly has recently been published for the Scenedesmaceae (Tetradesmus [‘Scenedesmus’] obliquus)^{30}. In the related, monotypic family Chromochloridaceae^{31}, the reference genome of Chromochloris zofingiensis has yielded important insights into the regulation of photosynthesis^{32}, mixotrophy^{33}, lipids, and astaxanthin and wax biosynthesis^{34-38} in the Sphaeropleales. The availability of a large number of strains from all major genera of Scenedesmaceae in public algal culture collections enabled us to assemble the first pan-genome of eukaryotic algae at the family (Scenedesmaceae) level. For this purpose, 38 high-quality genomes (16 at chromosome level) of Scenedesmaceae were newly established, spanning the taxonomic diversity of the family. The pan-genome was used to identify core, dispensable, and private orthogroups, as well as private unique genes. Analyses of gene family expansions and gains identified four major metabolic processes that are implicated in enhanced resistance to abiotic and biotic stresses, namely the sulfur pathway^{39,40}, the lipid pathway^{41-44}, heterotrophy^{45,46}, and resistance against oxidative stress elicited by abiotic and biotic factors^{47,48}.
Using phylogenomics and genome structure analyses, we show that genes involved in these processes have been acquired through horizontal gene transfer from non-viridiplant sources. Endogenized viral elements of large DNA viruses were identified in most genomes of the studied Scenedesmaceae species and are hypothesized to have acted as vectors for horizontally transferred genes. Another mechanism fostering evolution of novel gene functions is polyploidization, i.e. whole genome duplication (WGD)^{49-52}. In particular, diploidization, either cytological diploidization or post-polyploid diploidization^{53-55}, has been considered a driving force in plant evolution^{56}. While post-polyploid diploidization has received much attention recently^{55,57}, the evolutionary transition from a haploid to a diploid life history through cytological processes has been much less studied^{58,59}. A survey of ~1,000 plant transcriptomes, including algae, concluded in 2019 that most algal representatives of Archaeplastida showed little or no evidence of WGD^{60}. In fact, in a recent review of WGD in major plant lineages and key evolutionary nodes, highlighting their contributions to morphological innovation and adaptive evolution, the Chlorophyta drew a blank^{61}. Increased application of long-read sequencing technologies and chromosome-level genome assemblies is, however, beginning to change that notion^{24,30,62,63}. Here, we provide evidence that diploid genomes are widespread among Scenedesmaceae, characterizing more than a quarter of the studied genomes, suggesting that diploidization may be an important factor in the adaptive evolution of Scenedesmaceae and microalgae in general.
RESULTS
Genome assemblies and annotation
We assembled 38 new genomes of Scenedesmaceae from ten genera, i.e., Desmodesmus, Tetradesmus, Coelastrella, Coelastrum, Scenedesmus, Enallax, Chodatodesmus, Neodesmus, Pectinodesmus and Verrucodesmus (Fig. 1 and Extended Data Fig. 1), using a combination of short reads and long reads with a mean sequencing depth of approximately 145× (Supplementary Table 1). The assembled haplotype sizes were close to their respective estimates from a K-mer analysis (Supplementary Fig. 1). The genome sizes ranged from 37.7 Mb (Neodesmus danubialis) to 186.7 Mb (Coelastrella striolata SAG 16.95) and differed considerably among the Scenedesmaceae, even within genera (Supplementary Table 1). The mean contig N50 length of these 38 assemblies was 2.7 Mb, indicating high contiguity of the assemblies. Chromatin conformation capture (Hi-C) data were generated for 16 representative species from 9 genera, and the genomes were scaffolded to pseudochromosome level (Supplementary Figs. 2-4). With the exception of Neodesmus, which had 16 pseudochromosomes, all chromosome-level assemblies had 17 pseudochromosomes (Fig. 1). The Scenedesmaceae genomes covered an average of 91.8% of the complete Benchmarking Universal Single-Copy Orthologs (BUSCO), with the lowest BUSCO value (87.6%) found for Verrucodesmus verrucosus (Supplementary Table 2).
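To make the contig N50 statistic quoted above concrete, the following is a minimal Python sketch of how an N50 can be computed from a list of contig lengths. The function name `n50` and the toy contig lengths are illustrative assumptions; they are not taken from the assemblies or from the pipeline used in this study.

```python
def n50(lengths):
    """Return the N50 of a set of contigs.

    The N50 is the length L such that contigs of length >= L together
    cover at least half of the total assembly size.
    """
    if not lengths:
        raise ValueError("no contig lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical contig lengths in bp (illustrative only, not study data).
toy_contigs = [5_000_000, 3_000_000, 2_700_000, 1_200_000, 800_000, 300_000]
print(n50(toy_contigs))
```

In this toy set the two longest contigs already cover more than half of the 13 Mb total, so the N50 is the length of the second contig. A higher N50 for a given genome size means the assembly is split into fewer, longer pieces.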
These results indicated that the Scenedesmaceae genome assemblies were of high quality and completeness. By combining ab initio gene prediction, transcript alignment, and evidence of protein homology, an average of 14,774 gene models were predicted in the Scenedesmaceae genomes, varying between 9,755 (N. danubialis) and 24,415 (Coelastrella striolata) (Supplementary Table 3). The number and length of genes, exons, and introns were comparable to those of other published Scenedesmaceae genomes, and 89-98% of the putative genes could be assigned known functions (Supplementary Tables 3-4). Repeat annotation of the Scenedesmaceae revealed that transposable elements (TEs) dominated the repeats, and the proportion of TEs fluctuated greatly both between and within genera (Supplementary Table 5). Most of the Scenedesmaceae featured a LINE-dominated TE composition, whereas many Volvocales exhibit LTR- or DNA-dominated TE sets (Supplementary Table 5). We next compared the evolutionary patterns of TEs by using Kimura distance-based copy divergence analyses (Supplementary Fig. 5). A total of 16 species experienced only one wave of TE amplification, whereas 22 species experienced two waves, one ancient (with the peak at 20-30% divergence) and one recent (<3%). Within the same genus, species that experienced two waves of TE amplification had larger genomes than those that experienced only one. In general, fluctuations in genome size correlated positively with TE content (Supplementary Fig. 6), suggesting that expansions of TEs contribute to the increase of genome size in Scenedesmaceae^{28,64,65}.
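The copy divergence analysis above rests on the Kimura two-parameter (K2P) distance, which corrects the observed proportion of differences between a TE copy and its consensus for multiple substitutions. As a hedged illustration, the sketch below computes the standard K2P distance, d = -1/2 ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The function name and the toy sequences are assumptions for illustration; the actual analysis in this study may include further corrections that are not shown here.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences.

    Positions where either sequence has a gap or an ambiguous base are
    skipped. Raises ValueError if the sequences are too diverged for the
    correction to be defined.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    if 1 - 2 * q <= 0 or 1 - 2 * p - q <= 0:
        raise ValueError("sequences too diverged for the K2P correction")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# Toy alignment (hypothetical): one transition (A/G) and one
# transversion (A/C) across ten aligned positions.
consensus = "AAAAAAAAAA"
te_copy = "GAAAAAAAAC"
print(round(100 * k2p_distance(consensus, te_copy), 1), "% divergence")
```

Binning such distances over all copies of a TE family gives a divergence landscape in which older amplification waves appear as peaks at higher divergence and recent waves as peaks near zero.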
A deep phylogeny of the Scenedesmaceae
A resolved species phylogeny is a prerequisite for analyses of changes in gene family composition (expansions, contractions) and innovations (gains and losses) at specified nodes. For Sphaeropleales and Scenedesmaceae this has been challenging, because previous molecular phylogenies using small gene sets or chloroplast genome sequences have either yielded conflicting results or lacked resolution^{66-75}. A maximum likelihood phylogeny (model: CAT+GTR), based on a concatenated sequence alignment of 90 single-copy genes from genomes of 52 Scenedesmaceae, 7 Volvocales, and 3 Trebouxiophyceae species/strains, was inferred and yielded a nearly fully resolved tree of the Scenedesmaceae (Fig. 1b). In particular, the monophyly of Sphaeropleales, Scenedesmaceae and the genera Tetradesmus, Scenedesmus, Coelastrella, Coelastrum and Desmodesmus received maximal support. Phylogenetic relationships among these genera were also resolved: almost all internal branches in Scenedesmaceae received maximal support, and none of the four branches that were not maximally supported received less than 75% support. Please note that we kept genus/species designations in Fig. 1b as used in the respective publications, databases and culture collections. It is clear, however, that all species/strains positioned within, e.g., clade Tetradesmus should be regarded as species of genus Tetradesmus (not Scenedesmus, Pectinodesmus or Acutodesmus). The situation is less clear for genus Enallax, which we found to be nested within Coelastrella, with its two species paraphyletically arranged (Fig. 1b). The taxonomic status of Enallax is uncertain as no authentic strain of the type species exists. The genus may eventually be synonymized with Coelastrella Chodat 1922, which has priority. If one plots the preferred habitats (aquatic vs. terrestrial/halophilic) of the sequenced Scenedesmaceae strains/species on the phylogenetic tree (Fig. 1b), the most parsimonious conclusion is that the origin of the family is aquatic, with several (minimally five) independent transitions to terrestrial habitats. Interestingly, in one of these transitions, a species from desert soil crusts (T. deserticola) is sister to two halophilic strains (Scenedesmus sp. NREL 46 B-D3, Scenedesmus rubescens SAG 5.95), the habitat of their last common ancestor remaining unknown. A second transition to a terrestrial habitat in Tetradesmus refers to a strain (SAG 3.99) labeled T. wisconsinensis, which most likely has been misidentified, because T. wisconsinensis (the type species of the genus) is a well-known aquatic species. We find no phylogenetic evidence that a reversal from a terrestrial to an aquatic habitat occurred in Tetradesmus^{76}.
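The parsimony argument about habitat transitions can be illustrated with Fitch's algorithm, which counts the minimum number of state changes needed to explain the tip states on a fixed tree. The sketch below is a generic illustration on a small hypothetical tree; the tip names, the topology and the habitat assignments are invented for the example and do not reproduce Fig. 1b or the method used in this study.

```python
def fitch(tree, states):
    """Fitch parsimony on a rooted binary tree.

    `tree` is either a tip name (str) or a pair (left, right) of subtrees.
    `states` maps each tip name to its observed state.
    Returns (candidate_states_at_this_node, minimum_number_of_changes).
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, states)
    right_set, right_changes = fitch(right, states)
    shared = left_set & right_set
    if shared:
        # Children agree on at least one state: no extra change needed.
        return shared, left_changes + right_changes
    # Children disagree: one additional change on this part of the tree.
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical six-tip tree and habitats (illustrative only).
toy_tree = ((("tipA", "tipB"), ("tipC", "tipD")), ("tipE", "tipF"))
toy_habitat = {
    "tipA": "aquatic", "tipB": "terrestrial",
    "tipC": "aquatic", "tipD": "aquatic",
    "tipE": "aquatic", "tipF": "terrestrial",
}

root_states, changes = fitch(toy_tree, toy_habitat)
print(root_states, changes)
```

In this toy case the root is reconstructed as aquatic and two changes are required, so the two terrestrial tips are most simply explained as two independent transitions from an aquatic ancestor.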
Diversity of ploidy and heterozygosity in Scenedesmaceae
A K-mer analysis was used to estimate the genome characteristics of each species. The K-mer frequency distribution reflected the characteristics of each genome (such as genome size, heterozygosity and duplication) (Supplementary Fig. 1). The haploid genome sizes varied greatly among species and were close to those of the actual assembled genomes. The K-mer plots of most species in the Scenedesmaceae showed only one peak and little to no evidence of heterozygosity, which indicated that these species were haploid. Interestingly, we found that the K-mer plots of 8 species from 5 genera had two peaks and featured a pronounced heterozygous peak (Fig. 2a), suggesting that these species may be diploids with extremely high levels of heterozygosity. Traditionally, life histories in the Chlorophyta have been recorded as primarily haplontic or diplohaplontic (the latter in many marine Ulvophyceae), and only rarely as diplontic. Next, a SNP-aware method was used to further analyze chromosome ploidy (Fig. 2b and Supplementary Fig. 7; see Methods for details). In haploid genomes, the most frequent allele had an abundance close to 95%, whereas the second most frequent allele had an abundance of less than 5% and most likely represented sequencing errors. In diploid genomes, the most frequent allele and the second most frequent allele were both close to 50%, forming a monomorphic peak. Our analysis suggested that 10 Scenedesmaceae species (seven of them newly sequenced) exhibit diploid characteristics. The GC depth plot of these genomes revealed two clusters, with the average sequencing depth of the top cluster being twice that of the lower cluster (Fig. 2c and Supplementary Fig. 2), indicating the diploid nature of these genomes. We then calculated SNPs and InDels for each putative diploid genome (Fig. 2d, Supplementary Table 6), which not only further confirmed their diploid status but also indicated a high level of heterozygosity between haplotypes^{77}. Additionally, the age distribution of paralogous gene pairs was analyzed for four selected diploid genomes (Cst, Cte, Ecoe, and UTEX3031) (Fig. 2e), providing evidence of a recent whole genome duplication (neopolyploidy) in these species. Further whole genome alignment of selected diploid genomes (Fig. 2f) revealed large-scale duplicated syntenic blocks within Cst and Sob2, respectively. Given the high heterozygosity observed, we hypothesize that self-diploidization in the Scenedesmaceae likely occurred through allopolyploidy (i.e., the mating of two clonal cells or the fusion of homothallic gametes via sexual reproduction) rather than autopolyploidy (endoreduplication, where the nucleus replicates its DNA without division)^{78}. Moreover, the phylogenetically distinct haplotypes of UTEX3031 further support our hypothesis (Fig. 2g).
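The allele-frequency reasoning behind the SNP-aware ploidy analysis can be sketched as follows: at candidate variable sites, a haploid genome shows one dominant allele (close to 95% of reads, the remainder being errors), whereas a diploid genome shows two alleles near 50% each. The code below is a simplified illustration only. The function names, the thresholds (0.90 and the 0.40-0.60 window) and the toy read counts are assumptions made for this example and are not the parameters or the software used in this study.

```python
from statistics import median


def allele_fractions(counts):
    """Return (major, second) allele fractions from per-base read counts."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("site has no read support")
    ranked = sorted(counts.values(), reverse=True)
    ranked.append(0)  # guarantees a second entry for monomorphic sites
    return ranked[0] / total, ranked[1] / total


def call_ploidy(sites, haploid_min=0.90, diploid_window=(0.40, 0.60)):
    """Classify a genome from the median major-allele fraction.

    Thresholds are illustrative assumptions, not values from the study.
    """
    majors = [allele_fractions(site)[0] for site in sites]
    typical = median(majors)
    if typical >= haploid_min:
        return "haploid-like"
    if diploid_window[0] <= typical <= diploid_window[1]:
        return "diploid-like"
    return "ambiguous"


# Hypothetical read counts per base at three candidate sites each.
haploid_sites = [{"A": 96, "G": 4}, {"C": 98, "T": 2}, {"G": 95, "A": 5}]
diploid_sites = [{"A": 52, "G": 48}, {"C": 49, "T": 51},
                 {"G": 55, "A": 44, "T": 1}]

print(call_ploidy(haploid_sites))
print(call_ploidy(diploid_sites))
```

Using the median rather than the mean keeps a few aberrant sites (for example, in collapsed repeats) from dominating the call; a real analysis would also filter sites by depth and mapping quality.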
Collinearity and comparative genomics of Scenedesmaceae
Genome conservation in the Scenedesmaceae has not previously been systematically investigated. We therefore conducted a large-scale collinearity study based on the diverse and high-quality Scenedesmaceae genomes generated in this study. An alignment of pseudochromosomes in 17 species/strains of Scenedesmaceae showed good collinearity across the family (Fig. 1b and Fig. 3a). In general, collinearity was higher among species within a genus than between genera. For example, a higher collinearity was observed within Tetradesmus, Coelastrella, Coelastrum and Desmodesmus than between them. In addition, most of the Scenedesmaceae had low genome collinearity with Other Sphaeropleales, and very low collinearity with Volvocales and Trebouxiophyceae. However, the genomes of N. danubialis and V. verrucosus exhibited a relatively high collinearity with Other Sphaeropleales but also with Volvocales and Chlorella variabilis among the Trebouxiophyceae (especially N. danubialis, which is sister to all other Scenedesmaceae in the species phylogeny, Fig. 1a). N. danubialis also displayed the smallest genome among the Scenedesmaceae studied, suggesting that this represents the ancestral condition in the Scenedesmaceae and perhaps also the Sphaeropleales. The relatively low collinearity among genera of Scenedesmaceae may be due to WGD, proliferation of TEs and additional genetic variations.
We also performed comparative phylogenomic analyses among representative genomes of Chlorophyceae. A homolog matrix of orthogroups generated from gene family clustering was analyzed to infer ancestral and lineage-specific gene dynamics within the phylogenetic tree. We found 627 newly gained and 101 expanded orthogroups in the ancestor of Sphaeropleales (Extended Data Fig. 2a), and many orthogroups related to fatty acid biosynthetic processes were found at this node. After the divergence of Other Sphaeropleales, 507 orthogroups were newly gained and 70 were expanded in the last common ancestor of the Scenedesmaceae. Many of these genes were found to be involved in lipid metabolism and stress resistance. For example, orthogroups related to ‘response to oxidative stress’ and ‘triglyceride biosynthetic process’ were significantly enriched in the gained and expanded orthogroups (Extended Data Fig. 2b).
Gene-based pan-genome of the Scenedesmaceae
We performed a pan-genome analysis integrating 38 high-quality Scenedesmaceae genomes generated in this study with 13 previously published Scenedesmaceae genomes. In addition, we constructed a pan-genome of Chlorophyta, ‘Core Chlorophyta’, Sphaeropleales, and Volvocales for comparison. The gene sets of 115 Chlorophyta covered nine prasinophytes and 106 core chlorophytes (including one Pedinophyceae, 18 Trebouxiophyceae, two Ulvophyceae and 85 Chlorophyceae) (Supplementary Table 7). OrthoFinder was used to cluster all the collected gene sets of Chlorophyta, the Chlorophyta genomes included 57, 153 classified orthogroups and 97,841 unclassified accession-specific genes (i.e., private unique genes) (Supplementary Table 7). In this study, the pan-genome is made up of four parts: core orthogroups, dispensable orthogroups, private orthogroups and private unique genes (see Methods for details). When considering all orthogroups and private unique genes together, a plateau (saturation of number of orthogroups) was not observed for pangenomes at all taxonomic levels, despite extensive sampling across the phylogeny (Fig. 3c), suggesting that many orthogroups and/or genes still remain to be discovered. However, when considering only orthogroups (i.e., excluding private unique genes), for the Sphaeropleales and Scenedesmaceae, saturation of the number of orthogroups is expected, when the number of genomes is about twice the current number (Fig. 3d). In the pan-genomes of Chlorophyta, the number of total core orthogroups (i.e., orthogroups in which each accession has at least one gene) rapidly decreased with sequential analyses of randomly added genomes and stabilized after species were represented by n >= 100\mathrm{n} \geq 100 genomes, reaching a plateau at a number of ∼400\sim 400 orthogroups. According to the slope, the number of stabilized core orthogroups of the Scenedesmaceae is estimated to be ∼1,500\sim 1,500 (Fig. 3e). 
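The saturation analysis described above (sequentially adding genomes in random order and tracking the number of orthogroups) can be illustrated with a minimal sketch. It assumes each genome is reduced to the set of orthogroup identifiers it contains. The function name `accumulation_curves`, the number of permutations and the toy data are hypothetical and are not taken from the Methods.

```python
import random
from statistics import mean


def accumulation_curves(genome_to_orthogroups, n_permutations=100, seed=0):
    """Mean pan- and core-orthogroup counts after adding 1..N genomes in random order.

    genome_to_orthogroups: dict mapping a genome name to the set of orthogroup IDs
    present in that genome (an assumed input format, not the authors' own).
    """
    rng = random.Random(seed)
    genomes = list(genome_to_orthogroups)
    pan_runs, core_runs = [], []
    for _ in range(n_permutations):
        order = genomes[:]
        rng.shuffle(order)
        pan, core = set(), None
        pan_sizes, core_sizes = [], []
        for genome in order:
            orthogroups = genome_to_orthogroups[genome]
            pan |= orthogroups  # pan-genome: union of everything seen so far
            # core genome: intersection of everything seen so far
            core = set(orthogroups) if core is None else core & orthogroups
            pan_sizes.append(len(pan))
            core_sizes.append(len(core))
        pan_runs.append(pan_sizes)
        core_runs.append(core_sizes)
    # Average each position (number of genomes added) over all permutations.
    pan_curve = [mean(column) for column in zip(*pan_runs)]
    core_curve = [mean(column) for column in zip(*core_runs)]
    return pan_curve, core_curve


# Hypothetical toy input: three genomes sharing a single orthogroup.
toy_genomes = {
    "A": {"OG1", "OG2", "OG3"},
    "B": {"OG1", "OG2", "OG4"},
    "C": {"OG1", "OG5"},
}
pan_curve, core_curve = accumulation_curves(toy_genomes, n_permutations=50)
print(pan_curve, core_curve)
```

A pan curve that keeps rising with each added genome corresponds to an open pan-genome (no plateau), whereas a core curve that flattens indicates a stabilized core set.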
A comparison of gene families was performed based on the orthogroups found in the Chlorophyta. A Venn diagram showed that 34,119 orthogroups (including accession-specific genes) were unique to the Scenedesmaceae compared to 38,993 in the Volvocales (Fig. 3b). We analyzed the functional categories of these gene families by KOG (Eukaryotic Orthologous Groups)$^{79}$ annotation, which resulted in 2,424 KOG categories. After excluding unknown hits, the most frequent KOG category was “Secondary metabolites biosynthesis, transport and catabolism”, followed by “Cell motility” (Supplementary Table 8). At the Pfam domain level, our results showed that 213 types of domains were present only in Scenedesmaceae, such as Heavy-metal resistance (PF13801), Sucrose synthase (PF00862), Sugar transport protein (PF06800) and WS/DGAT C-terminal domain (PF06974) (Supplementary Table 9). In some cases, domains exclusive to Scenedesmaceae were likely acquired by HGT.
With a focus on the pan-genome of Scenedesmaceae (51 species/strains), 2,092 orthogroups were present in all 51 accessions and were defined as core orthogroups, accounting for an average of only 19.7% in individual accessions. Dispensable orthogroups, in which genes were present in 2-50 accessions, accounted for an average of 75.6% in individual accessions, while private orthogroups plus private unique genes, which were detected in only one accession, accounted for an average of 4.6% in individual accessions (Supplementary Table 10). Notably, when the whole pan-genome rather than individual accessions was considered, private orthogroups plus private unique genes together accounted for 54.8% of the total gene sets in the 51 Scenedesmaceae accessions (Supplementary Table 10). For the two genera with the largest number of sequenced genomes (Desmodesmus [15 species/strains] and Tetradesmus [17 species/strains]), the number of core orthogroups increased to 5,314 (with an average of 51.5% in individual accessions) and 4,743 (with an average of 43.3%), and the percentages of dispensable orthogroups decreased to 43.2% and 52.1%, respectively (Supplementary Table 10). Although the number of genomes sequenced at species level in the Scenedesmaceae is still too low for generalizations (e.g., only 9 strains in Tetradesmus obliquus), we tentatively conclude that, when more accessions are included, the percentage of core orthogroups will not exceed 50% (currently 56.9% for 9 strains of T. obliquus, Supplementary Table 10) and the percentage of dispensable orthogroups will remain significant (39.7% in T. obliquus), suggesting considerable genome dynamics at lower (genus, species) taxonomic levels in the Scenedesmaceae.
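The core/dispensable/private partition used above can be sketched as follows. This is only an illustration: it assumes a gene-count table keyed by orthogroup and accession (similar in spirit to an orthogroup gene-count matrix) and computes per-accession percentages over orthogroups, which may differ from the exact definition in the Methods. Private unique (unclustered) genes are not modeled separately here, and all names and toy values are hypothetical.

```python
from collections import Counter


def classify_orthogroups(counts, accessions):
    """Label each orthogroup by the number of accessions in which it occurs.

    counts: dict mapping orthogroup ID -> dict of accession -> gene count.
    Present in all accessions -> 'core'; in 2..N-1 -> 'dispensable'; in 1 -> 'private'.
    """
    n = len(accessions)
    categories = {}
    for orthogroup, per_accession in counts.items():
        present = sum(1 for a in accessions if per_accession.get(a, 0) > 0)
        if present == n:
            categories[orthogroup] = "core"
        elif present >= 2:
            categories[orthogroup] = "dispensable"
        elif present == 1:
            categories[orthogroup] = "private"
    return categories


def per_accession_shares(counts, accessions, categories):
    """Percentage of each accession's orthogroups that fall into each category."""
    shares = {}
    for accession in accessions:
        tally = Counter(
            categories[orthogroup]
            for orthogroup, per_accession in counts.items()
            if per_accession.get(accession, 0) > 0
        )
        total = sum(tally.values())
        shares[accession] = {
            category: (100.0 * tally[category] / total if total else 0.0)
            for category in ("core", "dispensable", "private")
        }
    return shares


# Hypothetical toy table with three accessions.
accessions = ["s1", "s2", "s3"]
counts = {
    "OG1": {"s1": 1, "s2": 2, "s3": 1},  # present everywhere
    "OG2": {"s1": 1, "s2": 1},           # present in two accessions
    "OG3": {"s3": 4},                    # present in one accession only
}
categories = classify_orthogroups(counts, accessions)
shares = per_accession_shares(counts, accessions, categories)
print(categories)
print(shares)
```

With 51 accessions, the same rule reproduces the thresholds stated in the text: 51 of 51 for core, 2-50 for dispensable and exactly 1 for private.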
We further observed that core genes had longer peptides and larger exon numbers per gene than dispensable and private genes (Fig. 3f-i). Based on domain analyses, approximately 75% of the core orthogroups were assigned a putative function in Scenedesmaceae, whereas only 42% of the dispensable orthogroups and 7% of the private orthogroups plus private unique genes had annotated domain information (Fig. 3j). For private orthogroups and private unique genes, we cannot exclude false positives caused by errors in sequencing, assembly and annotation, which may result in many ‘private’ genes without functional annotation. It is undeniable, however, that private orthogroups and private unique genes represent important genetic resources for the evolutionary innovation of species; their often low expression levels may indicate that they are in functional transition or that the environmental conditions for elevated gene expression have yet to be discovered. We note that some private orthogroups and private unique genes in species of Scenedesmaceae have apparently been acquired by HGT (Supplementary Fig. 9d-f, Supplementary Table 14). Further Gene Ontology (GO) analyses showed that core genes were enriched in categories associated with intermediary metabolism such as ‘ATP metabolic process’, ‘cellular amino acid metabolic process’ and ‘tricarboxylic acid cycle’, whereas dispensable genes were enriched in GO categories associated with methylation, translational initiation, DNA recombination and others. GO enrichment further revealed that private genes were related to functions such as ‘glycogen phosphorylase activity’, ‘4-hydroxytetrahydrodipicolinate reductase’ and ‘phosphoenolpyruvate carboxykinase’ (Supplementary Table 11).
Pan-genome analysis based on our high-quality genomes shows that the Scenedesmaceae possess a high proportion of dispensable orthogroups and relatively low percentages of core orthogroups. In fungi and embryophyte plants, core orthogroups usually constitute the majority of the total number of orthogroups at the species level. For example, in model fungal species, core genomes have been found to constitute 80-90% of the orthogroups of the total pan-genomes$^{80}$. In embryophyte model species for which hundreds of accessions have been included in pan-genome analyses (e.g., Arabidopsis thaliana, Oryza sativa), the percentage of dispensable orthogroups can be higher (40% and 37.9%, respectively) but is still lower than that of the respective core genomes$^{81-83}$.
Several independent origins for novel eukaryotic genes are currently being recognized: whole genome duplications (WGDs), local tandem duplications, TE-mediated duplications, segmental duplications, introgression from related species, horizontal gene transfer and de novo gene birth$^{83-86}$. In addition, genes can be lost due to deletions by intrachromosomal recombination and pseudogenization$^{87}$. We have already highlighted the possible role of genome duplications (diploidization) and TE bursts in the generation of genome complexity and diversity in the Scenedesmaceae at different taxonomic levels. Next, we analyzed gains and expansions of orthogroups in the Scenedesmaceae in a phylogenetic context, and discovered extensive HGT in four metabolic processes that are thought to play major roles in stress resistance and thus adaptive evolution, adding another level of genome variability in the Scenedesmaceae.
Comparison of lipid accumulation in the Scenedesmaceae and their genomic underpinnings
The metabolite levels of lipids were determined for representative species of Scenedesmaceae. N. danubialis presented the highest total lipid content, which was mainly attributed to the high accumulation of monoglycerides (Extended Data Fig. 3 and Supplementary Table 12). Among the various lipid components, Tetradesmus species generally have a high proportion of fatty acids/acyls (FAs) (avg. 55.9%) and a low proportion of glycerolipids (GL) (avg. 27.5%) and glycerophospholipids (GP) (avg. 17.0%), with the exception of T. distendus CCAP 276/37 and T. lagerheimii CCAP 276/30, which displayed a low proportion of FAs (2.7% and 18.2%) but a high proportion of GP (53.7%) and GL (71.3%), respectively. For most species within Coelastrum and Desmodesmus, FA was also the main component of the total lipids. However, the total FA content in many Tetradesmus species is greater than that in Coelastrum and Desmodesmus species, mainly due to the greater accumulation of C16-type FAs in most Tetradesmus species (Supplementary Table 12). Notably, D. ultrasquamatus CCAP 258/1 has very high contents of linoleic acid (C18:2 FA) and linolenic acid (C18:3 FA), suggesting that this species is potentially valuable for nutritional oil development. Among the Scenedesmaceae selected for quantitative lipidome analysis, D. armatus SAG 276-4d had the highest triglyceride (TG) level (Supplementary Table 12), which further supports the potential commercial use of this strain$^{88}$. In brief, our quantitative lipidome results showed that lipid accumulation levels and lipid composition varied widely in Scenedesmaceae, probably reflecting both species/strain specificity and dependence on the growth status of the cultures, which was not further characterized.
We next sought to identify the genomic underpinnings of high lipid accumulation in Scenedesmaceae, with special attention to the GL, GP and FA types of lipids and to TG. To analyze whether the level of metabolite accumulation is related to the copy number of the gene families involved in their synthesis and breakdown, and thus to expansion of the respective gene families, we calculated the Spearman correlation coefficient between the lipid levels (as a quantitative trait) and the copy number of candidate genes, as well as between the lipid levels and transcript abundance levels, in Scenedesmaceae species with high accumulation levels of GL, GP, FA and TG lipids. We detected strong positive correlations between the transcript abundance levels of 13, 6, 2 and 10 genes and high accumulation levels of GL, GP, FA and TG, respectively (Extended Data Fig. 3 and Supplementary Table 13). For example, the transcript abundance level of methylcrotonyl-CoA carboxylase (MCCC) was strongly correlated with the level of GL accumulation in Scenedesmaceae. MCCC was reported to regulate triacylglycerol accumulation in the model diatom Phaeodactylum tricornutum$^{89}$. Notably, many of the identified genes are involved in the TCA cycle and glycolysis; the TCA cycle is known to oxidize acetyl-CoA derived from carbohydrates, fatty acids, amino acids, and ketone bodies, and provides intermediates that are utilized for the formation of glucose, lipids, and amino acids$^{90}$. Lipid synthesis requires the glycolytic intermediate dihydroxyacetone phosphate and the TCA cycle intermediate citrate to generate glycerol 3-phosphate and acetyl-CoA, respectively. This might suggest highly active carbon reallocation through the TCA cycle in algae that accumulate high levels of various lipids$^{91}$. At the gene copy number level, 19, 3, 7 and 18 genes were strongly positively correlated with the lipid levels of GL, GP, FA and TG, respectively (Extended Data Fig. 3). For example, MYB-like genes were highly duplicated in species of Scenedesmaceae with high GL and TG accumulation levels. In C. reinhardtii, the MYB transcription factor is involved in regulating lipid metabolic pathways for oil biosynthesis$^{92}$. In conclusion, based on the correlation of transcript abundance levels and gene copy number with the accumulation of various types of lipids, we identified many highly correlated candidate genes, suggesting that gene family expansions of lipid-related genes might contribute to high lipid accumulation in Scenedesmaceae.
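The correlation step described above can be illustrated with a small, dependency-free sketch of Spearman's rank correlation (average ranks for ties, followed by a Pearson correlation on the ranks). The function names and the toy copy-number and lipid values are hypothetical and are not data from this study. In practice a library routine such as `scipy.stats.spearmanr` would normally be used.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one variable is constant
    return covariance / sqrt(var_x * var_y)


# Hypothetical example: gene copy number vs. lipid level across five strains.
copy_number = [1, 2, 2, 4, 6]
lipid_level = [10.0, 12.5, 11.0, 20.0, 35.0]
rho = spearman_rho(copy_number, lipid_level)
print(round(rho, 3))
```

Because the coefficient depends only on ranks, it captures monotonic relationships between copy number (or transcript abundance) and lipid level without assuming linearity.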
Frequent HGT contributed to adaptive evolution of the Scenedesmaceae
Based on the homolog matrix of orthogroups generated from the 113 collected Archaeplastida genomes, we performed a systematic investigation of the global patterns of HGT and its function in post-transfer adaptation in the Scenedesmaceae. Using a published pipeline$^{93}$, we identified a total of 261 HGT-derived gene families in Scenedesmaceae (Fig. 4a and Supplementary Table 14), including HGT events that occurred at genus-specific levels within the Scenedesmaceae (141 gene families). The screening of putative donors for the identified 261 HGT gene families via both BLAST and phylogenetic analyses suggested that 56.7% of these HGT gene families might have been acquired from bacteria, while metazoans and fungi, as putative donors, provided 16.1% and 7.7%, respectively, of all identified HGT gene families (Fig. 4b). About 42.5% of the identified HGT genes were classified as having unknown functions, as they did not match any existing database entries. The HGT genes with known functions, however, were significantly enriched in several KEGG pathways associated with lipid metabolism, such as ‘glycosphingolipid biosynthesis’, ‘linolenic acid metabolism’, ‘fatty acid biosynthesis’, ‘glycerolipid metabolism’, and ‘sphingolipid metabolism’ (Fig. 4c).
We further compared structural features between the identified HGT-acquired genes and the core genes in the Scenedesmaceae. We found that peptide lengths of the HGT genes were significantly shorter than those of the core genes, and that the exon number was larger for the core genes than for the HGT genes, although both displayed similar GC content (Fig. 4d-f). However, HGT genes carry a larger proportion of transposable elements upstream and downstream of the genes than the core genes (Supplementary Fig. 8). The average transcript abundance levels of the core genes were significantly higher than those of the HGT genes (Fig. 4g). Interestingly, the transcript abundance levels of almost all of the identified HGT genes in Tetradesmus lagerheimii and Scenedesmus sp. NREL46BD3 exhibited significant changes under different stress conditions (Supplementary Fig. 8 and Supplementary Table 15), suggesting that HGT genes may be involved in environmental adaptation of Scenedesmaceae.
HGT is generally acknowledged as a major driver of evolution in prokaryotic organisms$^{94}$. Until relatively recently, however, the extent and significance of HGT in eukaryotes were controversial$^{95,96}$. With the advent of long-read sequencing technology and the exponential growth in genome data of eukaryotes, the widespread occurrence of HGT in eukaryotes is now generally accepted$^{97-99}$. Although the mechanism(s) of HGT into eukaryote genomes are still largely unknown, evidence is accumulating that mobile genetic elements (viruses, transposons) play a major role by integrating into the genome and persisting as endogenous viral elements (EVEs)$^{100-104}$. We identified EVEs of giant DNA viruses (NCLDV, phylum Nucleocytoviricota) in many (18) Scenedesmaceae genomes (Extended Data Fig. 4 and Supplementary Fig. 9) (see Methods for details). Phylogenetic analyses identified major viral hallmark genes, such as mcp (major capsid protein) (Fig. 4h) and PolB (DNA polymerase B) (Fig. 4i) from Phycodnaviridae (and related viral lineages), as likely donors for homologs in Scenedesmaceae. Mapping viral hallmark genes on pseudochromosomes of selected Scenedesmaceae (Fig. 4j-l) identified EVEs with characteristic features such as clustering of viral genes (hallmark genes, viral best hits, NCVOG [nucleocytoplasmic virus orthologous groups]), clustering of non-Viridiplantae genes and low density of Viridiplantae genes, and lower TE and intron density in the central part of the EVEs.
HGT and gene family expansion enrich lipid metabolism of the Scenedesmaceae
Many Scenedesmaceae species are rich in lipids and have important potential applications. We therefore performed phylogenomic analyses to study the genomic basis of lipid metabolism, including fatty acid biosynthesis, the Kennedy pathway and wax ester metabolism, in selected Chlorophyta to gain insights into the evolutionary novelties of lipid metabolism in the Scenedesmaceae.
To explore differences in metabolism revealed by gene content, we compared the number of enzymes present in lipid-related KEGG pathways (Extended Data Fig. 5). In general, the number of genes associated with ‘Fatty acid metabolism’, ‘Fatty acid elongation’, ‘Fatty acid degradation’, ‘Glycerolipid metabolism’, ‘Biosynthesis of unsaturated fatty acids’, ‘Glycerophospholipid metabolism’ and ‘Cutin, suberine and wax biosynthesis’ was higher in the Scenedesmaceae genomes than, for example, in those of the Volvocales.
After further comparison of gene copy numbers of Scenedesmaceae at each step of the lipid metabolism pathway with those of the Volvocales (Fig. 5a), we found several genes, such as phosphatidic acid phosphatase (PAP), 3-ketoacyl-CoA synthase (KCS), fatty acyl-CoA reductase (FAR), acyl-CoA:diacylglycerol acyltransferase (DGAT) and lipase, which displayed an expanded gene repertoire in almost all the Scenedesmaceae compared to Volvocales (Fig. 5a, Supplementary Fig. 10 and Supplementary Table 16). 1) DGAT is a key enzyme catalyzing the final step in the biosynthesis of triacylglycerol (TAG), and our analysis revealed signs of expansion in ‘DGAT like 1’ and ‘DGAT like 4’ (Fig. 5 and Supplementary Fig. 11). 2) PAP catalyzes the conversion of phosphatidic acid to diacylglycerol, which is the penultimate step of TAG biosynthesis. 3) Compared with Volvocales, the gene copy numbers of KCS and FAR, which participate in wax ester biosynthesis, were markedly higher in the Scenedesmaceae (Fig. 5a).
In addition to oil body biosynthesis, oleosin hydrolase is involved in the degradation of TAG and wax esters in oil bodies to yield free fatty acids through the catalysis of lipase$^{105}$ (Supplementary Fig. 12). We found two types of TAG hydrolases, triacylglycerol lipase (TAGL) and lipase member 1, which were more abundant in the Scenedesmaceae than, e.g., in Volvocales (Fig. 5 and Supplementary Fig. 13). In addition to TAGL and lipase members, the genomes of Scenedesmaceae and other Sphaeropleales encode hormone-sensitive lipase (HSL), which also functions in the conversion of TAG to diacylglycerols and monoacylglycerols in oil bodies$^{106,107}$ and has not been found in other Chlorophyta (Fig. 5d and Supplementary Fig. 13), suggesting acquisition by an HGT event in the last common ancestor of Sphaeropleales. Similarly, long-chain fatty acid omega-hydroxylase (cytochrome P450-704B1) seems to be confined to Scenedesmaceae within Chlorophyta (Fig. 5a and Supplementary Table 16).
Finally, another lipid metabolism-related gene family that expanded in the Scenedesmaceae is phosphatidylcholine-sterol acyltransferase (LCAT), which is involved in the metabolism of phospholipids, sterols and extracellular plasma lipoproteins$^{108}$. Our phylogenetic analysis revealed that neither Sphaeropleales nor Volvocales encode plant-type LCATs (LCAT group 3 and group 5 in Supplementary Fig. 14)$^{109}$. We identified homologs of both bacterial-type (LCAT group 4) and animal-type (LCAT group 2) LCATs in many Volvocales. However, the Scenedesmaceae displayed a specific LCAT (LCAT group 1). Moreover, a remarkable expansion signature of this LCAT clade was found in the Scenedesmaceae (Supplementary Table 17). The differences in LCATs between Scenedesmaceae and Volvocales suggest a complex evolutionary history and origin for LCATs, which likely involved horizontal gene transfer events involving bacterial and animal sequences.
Based on the results of the HGT identifications, we further investigated the involvement of these genes in lipid metabolism (Extended Data Fig. 6). For example, our phylogenetic analysis of the KAS III family showed that a unique clade of KAS III homologs in several Scenedesmaceae species clustered with genes from bacteria, supporting the notion that this KAS III homolog might have been acquired from bacteria via HGT. In addition, bacteria or viruses were likely HGT donors for a particular class of FAD homologs in some Scenedesmaceae. KAS III catalyzes the condensation reaction of fatty acid synthesis, whereas FAD is involved in the desaturation of fatty acids.
When the Scenedesmaceae were exposed to environmental stresses, we observed significant changes in various lipid levels and in the transcript abundance levels of many genes in the lipid metabolism pathway (Supplementary Table 18, Supplementary Fig. 15, Extended Data Fig. 8). For example, 36% of the lipid genes exhibited significant changes in transcript abundance levels under nitrogen deficiency in T. lagerheimii. These results suggest that Scenedesmaceae may regulate lipid metabolism in response to environmental stress by changing the transcription levels of related genes.
Evolutionary novelties in sulfur metabolism in the Scenedesmaceae
Many oleaginous species of the Scenedesmaceae can produce high levels of sulfolipids$^{110}$. Thus, we investigated the evolution of genes involved in sulfur metabolism, especially those related to transporters. The sulfite exporter TauE/SafE showed a remarkable expansion in most Scenedesmaceae, with the average number of gene copies being 3.5 times larger than in the Volvocales (Fig. 5a and Supplementary Table 19). In addition, none of the Volvocales genomes encoded homologs of the sulfite efflux pump SSU1, despite the widespread occurrence of this gene in multiple copies in the Scenedesmaceae (Fig. 5a). Our phylogenetic analyses revealed that the Scenedesmaceae SSU1 homologs (and two sequences of Trebouxiophyceae) were nested within a radiation of fungal SSU1/MAE1 genes (Fig. 5e), suggesting that the green algal SSU1 homologs might have been acquired by HGT from fungi. However, whether the functions of these genes are related to those of fungal SSU1 (sulfite efflux) or MAE1 (enhancing oil accumulation) genes has yet to be determined.
We also found another key gene of the sulfur metabolic pathway, persulfide dioxygenase ETHE1, to be present in many Scenedesmaceae but absent in all Volvocales and the Chlorellales of the Trebouxiophyceae (Supplementary Table 19). Sulfur dioxygenase plays an essential role in hydrogen sulfide (H$_2$S) catabolism in the mitochondrial matrix by catalyzing the oxidation of sulfide/persulfide to generate sulfite and preventing the accumulation of toxic H$_2$S$^{111}$.
The dramatic expansion of the sulfite exporter TauE/SafE as well as the widespread occurrence of homologs of the sulfite efflux pump SSU1/MAE1 and sulfur dioxygenase suggest that the Scenedesmaceae might take advantage of sulfite as a preferred sulfur source to produce sulfolipids. In this context, we note that the unsaturated chondroitin disaccharide hydrolase (UCDH) is present in most Scenedesmaceae and other Sphaeropleales but absent from Volvocales (Supplementary Table 20). UCDH belongs to the GH88 glycoside hydrolase class and preferentially catalyzes the hydrolysis of unsaturated hyaluronate and chondroitin disaccharides. According to our phylogenetic analysis, the Sphaeropleales-specific UCDH was presumably acquired by HGT from bacteria (Supplementary Fig. 16). Furthermore, we found that Scenedesmaceae encode the bacterial homolog chondroitin synthase and the animal homolog N-acetylgalactosamine-6-sulfatase (Supplementary Fig. 17), which are responsible for chondroitin sulfate disaccharide metabolism; both genes are absent in Volvocales (Supplementary Table 20).
Given the expansion of sulfite transporters, homologs of sulfite efflux pumps and sulfide dioxygenases across the Scenedesmaceae, sulfated glycosaminoglycans or related nonglycosaminoglycan-sulfated glycans may be present in the cell walls of the Scenedesmaceae and could be involved in diverse stress responses (desiccation, antiviral)$^{76,112}$.
Pathogen resistance and stress adaptation in the Scenedesmaceae
Many Scenedesmaceae have a global distribution and are able to grow or survive in various types of extreme environments (such as industrial wastewater, hot and cold deserts, and ice-covered lakes). We therefore investigated the distribution of genes associated with abiotic and biotic stress responses (Supplementary Table 21). Chlorophyta are thought to possess only a few intracellular immune receptors and thus to lack access to the variety of immunogenic signals associated with pathogen infections in plants; only Chromochloris zofingiensis among Chlorophyceae was reported to contain a few nucleotide-binding leucine-rich repeat (NLR) proteins$^{113}$. Our comparative genome analysis across Chlorophyta revealed that not only Chromochloris but also Scenedesmaceae and other Sphaeropleales encode NLR genes (Extended Data Fig. 7a and Supplementary Table 21). Consistent with the findings of previous studies, no NLRs were detected in any of the Volvocales or Mamiellophyceae (Extended Data Fig. 7a).
The subtilisin-like protease (SBT) gene family was found to be more abundant in Sphaeropleales than in Volvocales (Extended Data Fig. 7b). Phylogenetic analysis indicated that several specific groups of SBTs (classes 1-8) exist in the Scenedesmaceae. SBTs are a diverse family of serine peptidases that are present in many organisms, including plants, and have a broad spectrum of biological functions, ranging from protein turnover to development and defense responses against diverse pathogens$^{114,115}$. In addition, SBTs are reportedly involved in the formation of the cuticle, a continuous lipophilic layer located at the outer epidermal cell walls of embryophytes$^{116}$. We therefore hypothesize that the remarkable expansion of SBTs in Sphaeropleales might also be related to the formation and breakdown of wax ester-based oleosins. Additionally, considering that many Scenedesmaceae, such as Tetradesmus, Desmodesmus and Scenedesmus, synthesize large amounts of lipids as a stress response, wax ester- and TAG-based oil bodies could also function to deter pathogens.
Animals, plants, algae and other eukaryotic organisms all require L-ascorbic acid (vitamin C) to enable many of their enzymes to work properly, since ascorbate protects cells from damage by reactive oxygen species (ROS)$^{117}$. Animals synthesize ascorbate via D-glucuronic acid and L-gulonolactone (L-GulL), with L-gulonolactone oxidase (GULO) catalyzing the oxidation of L-GulL to ascorbate (GULO is nonfunctional in some Metazoa)$^{118,119}$. In contrast, ascorbate biosynthesis in plants occurs via a different route using D-mannose and L-galactose, which employs L-galactonolactone dehydrogenase (GLDH) as the terminal enzyme instead of GULO (Extended Data Fig. 7c)$^{120,121}$. Green algae generally use the “plant ascorbate pathway”$^{122}$. Although animal GULO is absent from all Archaeplastida genomes, an enzyme family (termed “GULO-like” in this study) exhibiting weak similarity to GULO has been reported in Arabidopsis and shown to contribute to the biosynthesis of ascorbic acid$^{123}$. Interestingly, in addition to the “plant ascorbate pathway”, we also identified a complete “animal ascorbate pathway” in Sphaeropleales (and almost all Scenedesmaceae) but not in other Chlorophyta (Extended Data Fig. 7c). No homologs of gluconolactonase (SMP30), which catalyzes the transformation of L-gulonate into L-gulonolactone, were detected in the Volvocales, whereas gluconolactonase homologs were detected in the Sphaeropleales, Mamiellophyceae and Trebouxiophyceae (Extended Data Fig. 7c and Supplementary Table 22). Interestingly, compared with Volvocales, the copy number of GULO-like genes increased sharply in Scenedesmaceae (Sphaeropleales/Volvocales GULO-like in Supplementary Fig. 18). Notably, although some Mamiellophyceae and Trebouxiophyceae encode gluconolactonase, GULO-like was absent in these clades. We also found that the number of UDP-glucuronosyltransferase (UGT) genes was larger in many Scenedesmaceae than in Volvocales (Extended Data Fig. 7c and Supplementary Table 22). UGT was reported to be involved in the stimulation of ascorbic acid biosynthesis in animals$^{124}$. In conclusion, the Scenedesmaceae exhibit not only complete plant and animal ascorbate biosynthesis pathways but also GULO-like pathways, with a large expansion of gene copies, indicating evolutionary innovations in stress responses in Scenedesmaceae. Our phylogenetic analysis (Supplementary Fig. 18) indicated that GULO-like may have originated from gene duplication and functional diversification of the ALO (D-arabinono-1,4-lactone oxidase) gene in the common ancestor of Streptophyta and Chlorophyta.
Heterotrophic nutrition
In comparison with Volvocales, many Sphaeropleales can survive for several months in the dark in the vegetative state and have been shown to grow in various types of wastewater$^{125}$. We therefore examined gene families related to heterotrophic energy metabolism across Chlorophyta and found major differences between Volvocales and Sphaeropleales (Fig. 6 and Supplementary Table 23). Several key gene families, such as sucrose synthase (SuSy) (Supplementary Fig. 19), sugar transport protein (STP; except for C. eustigma), sucrose phosphorylase (SP), and xylose isomerase (XI), could not be identified in the Volvocales. In contrast, many Scenedesmaceae had >10 gene copies of STP (the Trebouxiophyceae genomes also contained an expanded set of 5-14 STP gene copies) (Supplementary Table 23). Sucrose phosphorylase (SP) catalyzes the reversible phosphorolysis of sucrose with inorganic phosphate, producing α-D-glucose 1-phosphate (Glc1P) and fructose$^{126}$. It was previously thought that SP is present only in bacteria and is not found in either plants or cyanobacteria$^{127}$. A genome-wide exploration of SP across Viridiplantae, however, revealed that many Sphaeropleales contain this gene, often in multiple copies (Supplementary Table 23), and further phylogenetic analysis indicated that the Sphaeropleales as well as K. nitens and C. subellipsoidea likely acquired SP through multiple independent HGTs from bacteria (Fig. 6b). Our phylogenetic analysis identified five groups of XI (XI groups 1-5) with Viridiplantae confined to group 5, emerging from a radiation of bacterial XI groups (XI groups 2-5, three of which also contained some fungal sequences) (Supplementary Fig. 20). XI is also commonly known as a fructose isomerase because it can interconvert D-fructose and D-glucose (Fig. 6c). The presence of SuSy, SP and XI in most Sphaeropleales suggests that the sucrose metabolic pathway is an evolutionary novelty of the Sphaeropleales, as this pathway could effectively drive the transformation of sucrose to F-6-P for ATP generation via glycolysis and to G-3-P for TAG biosynthesis (Fig. 6c). Sucrose is the most abundant disaccharide in the environment because of its origin in higher plant tissues$^{128}$. The presence of a complete sucrose metabolic pathway in Sphaeropleales, together with a large number of sugar transport proteins (comparable in number to that of embryophytes; Supplementary Table 23), allows Sphaeropleales (and in particular Scenedesmaceae) to obtain and utilize sucrose, e.g., from wastewater, as the main carbon source, opening new niches and enhancing their survival under light limitation and/or anaerobiosis. We hypothesize that key enzymes involved in sucrose metabolism were acquired by Viridiplantae/Sphaeropleales through HGT, as has been suggested for plant-pathogenic nematodes and arthropods and for bacteria via mobile genetic elements and transposons.
In addition, microorganisms utilize glycerol as a source of carbon under anaerobic conditions, which requires glycerol dehydrogenase. Interestingly, we also detected glycerol dehydrogenase homologs in the Scenedesmaceae, and further phylogenetic analysis indicated that this gene was also likely acquired via HGT from bacteria (Extended Data Fig. 6b and Supplementary Table 24). Glycerol can enter cells from the environment via aquaporins$^{129}$; thus, the expanded aquaporin gene family found in Scenedesmaceae (Supplementary Table 25 and Supplementary Fig. 21) may also have facilitated the uptake of glycerol.
To investigate whether these genes function under heterotrophic conditions, we used T. lagerheimii, cultivated with 0.5% glucose in the dark, and analyzed transcript abundance levels as well as the metabolome (Extended Data Fig. 8). We focused mainly on transcripts related to sucrose metabolism. Sucrose synthase (Tetla_jg16120.t1) and sucrose phosphorylase (Tetla_jg18837.t1 and Tetla_jg10016.t1) exhibited significantly higher transcript abundance levels in T. lagerheimii exposed to dark/glucose conditions than in cells exposed to light/glucose conditions (Extended Data Fig. 8). Since sucrose-phosphate synthase (SPS) and several copies of STP and sucrose phosphatase (SPP) also showed increased transcript abundance levels in the dark/glucose condition, we conclude that glucose taken up by the cells in the dark is used to provide energy. Surprisingly, many photosynthetic antenna genes also exhibited increased transcript abundance levels (the culture remained dark green macroscopically) (Extended Data Fig. 8). We hypothesize that the enhanced synthesis of antenna proteins in the dark is an adaptation to low-light situations, enabling cells to absorb light efficiently once they are again exposed to light (as at daybreak). In summary, HGT events have given the Scenedesmaceae (and Sphaeropleales in general) a unique ability to obtain sufficient carbon sources and possibly survive in darkness for extended periods of time in the vegetative condition.
DISCUSSION
High-quality genome assemblies from 38 species/strains of the Scenedesmaceae, a globally distributed family of green algae (Chlorophyceae) that is abundant in freshwater phytoplankton and of potential significance for biotechnology and environmental biotechnology, enabled us to establish the first pan-genomes of green algae at the family, genus and species levels. The pan-genomes revealed a surprisingly high proportion of dispensable orthogroups relative to core orthogroups, the latter not exceeding half of the total number of orthogroups, even at the species level. This suggests that genomes in the Scenedesmaceae are highly dynamic, probably reflecting the extraordinary adaptive properties of their constituent species to diverse environments and stresses. Analyses of gains and expansions of gene families gave evidence that HGT from non-viridiplant sources contributed to gene innovations in four major processes that have been linked to stress adaptation, namely sulfur and lipid metabolism, heterotrophy, and resistance (to stress). Although the mechanism(s) of HGT remain to be further investigated, we identified endogenous viral elements from phylum Nucleocytoviricota (large, double-stranded DNA viruses), which carry viral hallmark genes, in most of the Scenedesmaceae genomes. Viral-mediated HGT could have donated non-viridiplant eukaryotic genes (from animals, fungi and protists) as well as bacterial genes (from intracellular bacteria$^{130}$) to the genomes of Scenedesmaceae. A second putative mechanism for gene innovation, whole-genome duplication (WGD), which previously received relatively little attention in Chlorophyta and in microalgae in general, has also been identified in the Scenedesmaceae in the form of genome diploidization. About a quarter of the investigated strains were found to be diploid, and preliminary phylogenetic analyses suggest that interspecific hybridization may have contributed to diploidization. We conclude that HGT and WGD have played major roles in the adaptive evolution of Scenedesmaceae and possibly of microalgae in general. Putative mechanisms and processes of adaptive evolution to stress in the Scenedesmaceae are summarized in the “Spinning Wheel of Evolution” of Fig. 7.
STAR★METHODS
Cultivation of algae and nucleic acid extraction
The Scenedesmaceae strains used in this study (see Figure 1b, bold strain numbers) were obtained from the Culture Collection of Algae at Göttingen University (SAG; https://uni-goettingen.de/en/45175.html), the Culture Collection of Algae and Protozoa, Oban, Scotland (CCAP; https://www.ccap.ac.uk/), and the Central Collection of Algal Cultures at the University of Duisburg-Essen (CCAC; https://www.unidue.de/biology/ccac/). Only axenic strains were used. If necessary, axenic cultures were prepared by streaking algae on agar and picking single cell-derived clones from the plates. Algae were grown in 3N-BBM+V culture medium (https://www.ccap.ac.uk/index.php/media-recipes/) in aerated 1 L culture flasks in a 14/10 h light/dark cycle at 18 µmol photons m$^{-2}$ s$^{-1}$ (white LED illumination) and 20 °C. During all scale-up steps of the cultures until nucleic acid extraction, axenicity was monitored via sterility tests and light microscopy. The cells were harvested by centrifugation (5 min, 250-1000 × g, depending on the strain processed), rapidly frozen in liquid nitrogen, and subsequently stored at -80 °C until freeze-drying. High-quality DNA was extracted from the freeze-dried material using QIAGEN® Genomic kits, and DNA quantification was carried out with Nanodrop and Qubit instruments. The SQK-LSK109 Ligation Sequencing Kit was used to prepare the sequencing libraries, and genomes were sequenced on a PromethION nanopore sequencer. RNA was extracted from freeze-dried material using RNA-easy Isolation Reagent (Vazyme, Nanjing), and stranded RNA-seq libraries were prepared using the TruSeq Stranded mRNA Library Prep Kit (Illumina). Libraries were sequenced on an Illumina HiSeq platform.
Nucleic acid sequencing and genome survey
DNA libraries for short-read WGS were constructed using an Illumina TruSeq DNA PCR-free library preparation kit with 300-500 bp fragment sizes and sequenced on the Illumina NovaSeq 6000 platform to generate 150 bp paired-end (PE) reads. DNA long-read libraries were constructed with a Ligation Sequencing Kit from fragments longer than 30 kb and sequenced on a nanopore PromethION sequencer. The poly-A-selected transcriptome libraries were constructed with a TruSeq RNA Library Prep Kit v2 (Illumina, CA, USA) with an insert size of 200-400 bp and sequenced on the Illumina NovaSeq 6000 platform to generate 150 bp PE reads.
Jellyfish v2.3.0$^{131}$ and Kmerfreq$^{132}$ were used to calculate k-mer frequencies for the third-generation ONT data and the second-generation Illumina data, respectively. Subsequently, GenomeScope 2.0$^{131}$ was used to visualize the results. In the majority of cases, genome size was estimated with “-p 1” and default settings for all other parameters. The parameter was changed to “-p 2” only when the k-mer frequency distribution showed two discernible peaks whose frequencies exhibited a precise twofold relationship. This pattern indicated that the alga is potentially diploid, prompting adjustment of the genome size estimate.
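The decision rule described above (one peak versus two peaks in a twofold relationship) can be illustrated with a minimal sketch. This is not the procedure implemented by GenomeScope; the function names, the simple local-maximum peak finder and the `tolerance` parameter are assumptions introduced here for illustration, and real histograms would normally need smoothing before peak calling.

```python
def find_peaks(hist):
    """Return k-mer depths that are local maxima of a k-mer histogram.

    hist[d] is the number of distinct k-mers observed d times (index 0 unused).
    Depth 1 is never reported, so the error-k-mer spike at low depth is ignored.
    """
    peaks = []
    for depth in range(2, len(hist) - 1):
        if hist[depth] > hist[depth - 1] and hist[depth] >= hist[depth + 1]:
            peaks.append(depth)
    return peaks


def suggest_ploidy(hist, tolerance=0.1):
    """Suggest 2 only if the two tallest peaks lie at depths d and roughly 2d.

    `tolerance` (hypothetical) is the allowed absolute deviation of the
    depth ratio from 2.0; otherwise 1 is suggested.
    """
    peaks = find_peaks(hist)
    if len(peaks) >= 2:
        tallest = sorted(peaks, key=lambda d: hist[d], reverse=True)[:2]
        low, high = sorted(tallest)
        if abs(high / low - 2.0) <= tolerance:
            return 2
    return 1
```

A histogram with peaks at depths 5 and 10 would yield a suggestion of 2, whereas a single-peak histogram, or one whose peaks are not in a twofold relationship, would yield 1.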
Genome assembly, decontamination and evaluation
The primary genome framework was initially constructed by NextDenovo$^{133}$ based on Oxford Nanopore Technology (ONT) data. The seed_cutoff parameter for each species was determined by the estimated genome size of the respective species. Moreover, the read_cutoff was adjusted based on the coverage of the sequencing data, thereby ensuring that sufficient and longer reads were used for the assembly process. Subsequently, the original ONT reads and Illumina reads were fed to NextPolish$^{134}$ for error correction of the above contigs. To obtain more complete genomes, a subset of representative species was selected to connect contigs into pseudochromosomes using Hi-C data. Specifically, Juicer v1.$^{135}$ was first employed to individually align paired Hi-C reads to the genome. Subsequently, the valid data were extracted to generate a sequence contact matrix based on the paired-read information. Using the information from the contact matrix, the 3D-DNA pipeline$^{136}$ was used to correct, anchor, order, and orient the input contigs. Juicebox v1.11.08$^{137}$ was used for the final manual correction of any misconnections or inversions. The chloroplast and mitochondrial genomes were then independently assembled by GetOrganelle$^{138}$.
To remove potential contamination of the assembly, the contigs were first subjected to BLASTX searches against the NCBI nonredundant protein (NR) database. The contigs with >50% of hits matching non-plants were referred to as candidate contamination contigs. We further calculated the GC content and coverage depth of all contigs. We compared the difference in GC content and coverage depth between the clean contigs and the candidate contamination contigs obtained in the previous step and manually removed the contigs with abnormal GC content and coverage depth.
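The first, automated part of this screen can be sketched as follows. The sketch assumes that BLASTX hits have already been reduced to (contig, taxon label) pairs; the label "plant", the function names and the input layout are hypothetical, and the final removal step in the text remains a manual decision based on GC content and coverage depth.

```python
from collections import Counter


def flag_candidate_contigs(hits, plant_label="plant", threshold=0.5):
    """Return contigs for which more than `threshold` of hits are non-plant.

    hits: iterable of (contig_id, taxon_label) pairs, one per BLASTX hit.
    """
    total = Counter()
    non_plant = Counter()
    for contig, label in hits:
        total[contig] += 1
        if label != plant_label:
            non_plant[contig] += 1
    return sorted(c for c in total if non_plant[c] / total[c] > threshold)


def gc_content(sequence):
    """Fraction of G and C among unambiguous bases (A, C, G, T) of a contig."""
    sequence = sequence.upper()
    counted = sum(sequence.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (sequence.count("G") + sequence.count("C")) / counted
```

Flagged contigs would then be compared with unflagged ones for GC content and coverage depth before any removal.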
The evaluation of assembly quality was carried out via DNA/RNA mapping and BUSCO. The high-quality short-insert PE reads were aligned onto the assembled genome to evaluate the accuracy of the results by using BWA$^{139}$ with the default parameters. The distribution of the sequencing depth at each position was calculated to measure the completeness of the genome assembly. Genome completeness was also assessed using the algae dataset (chlorophyta_odb10) of Benchmarking Universal Single-Copy Orthologs (BUSCO)$^{140}$. Gene region completeness was evaluated from the transcripts assembled by Trinity using transcriptome data of algal samples cultured under different growth conditions.
Repetitive sequence annotation and comparison
Repetitive sequence annotation was performed by integrating results from several software programs, primarily relying on two strategies. The first strategy relies on de novo annotation using RepeatScout$^{141}$, LTR_retriever$^{142}$ and RepeatModeler2$^{143}$ to construct a custom library, with RepeatMasker$^{144}$ employed to annotate transposable elements. The second strategy involves repeat annotation via RepeatMasker based on the Repbase database. The results from each software package were integrated to yield the final repetitive sequence annotations. To determine the divergence of transposable elements among different species, Kimura distance-based copy divergence analysis of transposable elements was performed based on the results produced by RepeatMasker.
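For readers unfamiliar with the Kimura distance, the sketch below computes the standard Kimura two-parameter distance, K = -1/2 ln[(1 - 2p - q) sqrt(1 - 2q)], between a repeat copy and its consensus, where p and q are the proportions of transitions and transversions. It is an illustration only: the function name is hypothetical, the input is assumed to be a pair of already aligned sequences, and RepeatMasker's own utilities may apply further corrections that are not shown here.

```python
import math

# Purine<->purine and pyrimidine<->pyrimidine substitutions.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def kimura_2p_distance(seq_a, seq_b):
    """Kimura two-parameter distance between two aligned DNA sequences.

    Columns containing gaps or ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    p = transitions / compared
    q = transversions / compared
    if 1 - 2 * q <= 0 or 1 - 2 * p - q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))
```

Binning such distances over all copies of a transposable element family gives the kind of divergence landscape typically used to compare the age of expansions among species.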
Annotations and comparisons of protein-coding genes
Three different types of gene annotation evidence (transcriptome evidence, homologous protein evidence, and de novo gene prediction) were combined and integrated by BRAKER2$^{145}$. RNA-seq data generated from each sample were used for gene annotation, processed by TopHat and Cufflinks, and subsequently assembled by Trinity$^{146}$. For gene functional annotation, the annotated protein sequences were aligned against the KEGG and SwissProt databases by NCBI BLASTP with an E-value cutoff of 1e-5. The protein domains were predicted with InterProScan v5.51-85.0. The total mRNA length, average gene length, total CDS length, average CDS length, exons per gene, average exon length, total intron length and average intron length were calculated for the identified genes.
Phylogenetic analysis and comparative genomics
Representative green algal genomes were selected for gene family clustering using OrthoFinder$^{147}$ with default parameters. The protein sequences from the single-copy gene families were aligned with MAFFT, and the alignments were trimmed with Gblocks using the least stringent settings: -b3 (maximum number of contiguous non-conserved positions) set to 8, -b4 (minimum length of a block) set to 5, -b5 (allowed gap positions) set to half, and -b6 (use similarity matrices) set to yes. The final phylogenetic trees were constructed with RAxML$^{148}$ or IQ-TREE$^{149}$ for the concatenation method and with ASTRAL$^{150}$ for the coalescent method. Count$^{151}$ was used for gene family expansion and contraction analysis. For the phylogenetic analysis of specific gene families, the best-fitting evolutionary model was determined using ModelFinder2, and trees were reconstructed using IQ-TREE. Branch support was assessed with 10,000 ultrafast bootstrap replicates. Trees were visualized using the online iTOL website (https://itol.embl.de/).
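The concatenation step that precedes tree inference with RAxML or IQ-TREE can be sketched as below. This is an assumed, simplified representation (alignments held as dictionaries of taxon to aligned sequence, hypothetical gene names `gene1`, `gene2`, ...), not the script used in this study; taxa missing from a gene are padded with gaps, which is one common convention.

```python
def concatenate_alignments(alignments, taxa):
    """Concatenate per-gene alignments into a supermatrix.

    alignments: list of dicts mapping taxon name -> aligned sequence.
    taxa: ordered list of all taxon names to include.
    Returns (supermatrix, partitions) where partitions holds
    (gene_name, start, end) with 1-based inclusive coordinates.
    """
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for index, alignment in enumerate(alignments, start=1):
        lengths = {len(seq) for seq in alignment.values()}
        if len(lengths) != 1:
            raise ValueError(f"gene{index}: sequences differ in length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa lacking this gene are represented by an all-gap row.
            pieces[taxon].append(alignment.get(taxon, "-" * length))
        partitions.append((f"gene{index}", start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions
```

The partition coordinates can be written to a partition file so that a separate substitution model is fitted per gene, whereas the coalescent approach instead takes the individual gene trees as input.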
Analysis of polyploidy
Three different methods were used to evaluate the possibility of polyploidy for each species. The first method is based on the k-mer frequency distribution. In general, the k-mer frequency distribution of a regular green algal genome with few repetitive sequences exhibits only one peak. Therefore, species with two or more peaks in their k-mer frequency distribution were selected as candidates for polyploidy. The second method relies on the allele frequency. Initially, the BWA-MEM algorithm was used to map Illumina short reads onto the assembly that had undergone contig redundancy removal using purge_dups$^{152}$. The proportions of the two most frequent bases in the coverage depth were subsequently calculated for each SNP using ploidyNGS$^{153}$. If the genome is diploid, the percentages of the most abundant and the second most abundant allele will both be close to 50%. nQuire$^{154}$ was also used to confirm the result of ploidyNGS. The third method is based on the coverage depth of ONT reads against the assembly. The minimap2 algorithm was used with the parameter “--secondary=no” to align ONT reads to the assembly that had undergone contig redundancy removal. A GC-depth distribution plot was generated to examine whether a twofold depth relationship existed in the assembly while GC content remained consistent.
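The logic of the allele-frequency method can be illustrated with a minimal sketch. It does not reproduce ploidyNGS or nQuire; the per-site base counts are assumed to be available as dictionaries, and the `window` and `min_fraction_of_sites` thresholds are hypothetical values chosen only for the example.

```python
def top_two_allele_fractions(base_counts):
    """Fractions of read depth carried by the two most frequent bases at a site.

    base_counts: dict mapping base -> number of reads supporting it.
    """
    total = sum(base_counts.values())
    if total == 0:
        raise ValueError("site has no coverage")
    ranked = sorted(base_counts.values(), reverse=True) + [0]
    return ranked[0] / total, ranked[1] / total


def looks_diploid(sites, window=0.1, min_fraction_of_sites=0.5):
    """True if most variable sites have both top alleles close to 50%.

    sites: list of per-site base-count dicts (heterozygous candidates).
    """
    if not sites:
        return False
    near_half = 0
    for counts in sites:
        first, second = top_two_allele_fractions(counts)
        if abs(first - 0.5) <= window and abs(second - 0.5) <= window:
            near_half += 1
    return near_half / len(sites) >= min_fraction_of_sites
```

In a haploid assembly most apparent variants are sequencing errors, so the second allele stays close to 0% and the check fails; higher ploidies would instead produce peaks near 33%/67% or 25%/75%, which this simple check does not distinguish.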
To determine the synteny between the chromosome-level assembled algal species, JCVI$^{155}$ was used to identify the synteny blocks and visualize them using the parameter “--minspan=60”. WGDI$^{156}$ was used to calculate the intraspecies and interspecies Ks distributions.
Gene-based pan-genome construction and whole-genome alignment 基于基因的泛基因组构建和全基因组比对
We initially gathered high-quality gene sets from 115 chlorophytes, including 9 prasinophytes, 1 Pedinophyceae, 18 Trebouxiophyceae, 2 Ulvophyceae and 85 Chlorophyceae (see Supplementary Table S8). OrthoFinder was then used with default parameters to construct the gene-based pan-genome. In this study, the pan-genome consisted of four parts: core orthogroups, dispensable orthogroups, private orthogroups, and private unique genes. Core orthogroups were defined as those present in all accessions of the candidate lineages, dispensable orthogroups as those present in at least two but not all accessions, private orthogroups as those present in only one accession, and private unique genes (i.e., unclassified genes) as genes found in only one accession that could not be clustered with any other gene. To avoid errors in the calculation of physical characteristics (such as GC content and gene length) caused by the unequal number of genes per orthogroup, the sequence with the longest amino acid length in each orthogroup was chosen as the representative sequence for further calculation, comparison and statistical analysis. Whole-genome alignment was performed with Cactus v2.6.6$^{157}$. We first generated a maximum-likelihood phylogenetic tree to serve as the guide tree for Cactus, using single-copy genes extracted from 65 species of the Scenedesmaceae. A total of 65 assemblies were then fed into Cactus after repeat masking to generate the whole-genome alignment. A MAF-format file was derived from the output alignment using the command hal2maf --onlyOrthologs --refGenome Coelastrella_oocystiformis --noDupes.
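The four-way partition defined above can be illustrated with a minimal Python sketch. This is not the code used in this study; the function names, the input layout (a dictionary from orthogroup ID to per-accession gene lists) and the toy data are assumptions made for illustration only. Private unique genes are the genes that the clustering step leaves unassigned, so they are not handled by the orthogroup classifier.

```python
def classify_orthogroups(orthogroups, accessions):
    """Label each orthogroup as core, dispensable or private.

    orthogroups: dict mapping orthogroup ID -> {accession: [gene IDs]}
    accessions:  list of all accessions in the candidate lineage
    (hypothetical input layout, chosen for this sketch)
    """
    n_total = len(accessions)
    classes = {}
    for og_id, members in orthogroups.items():
        # Count the accessions that contribute at least one gene.
        n_present = sum(1 for acc in accessions if members.get(acc))
        if n_present == n_total:
            classes[og_id] = "core"
        elif n_present >= 2:
            classes[og_id] = "dispensable"
        elif n_present == 1:
            classes[og_id] = "private"
    return classes


def longest_representative(proteins):
    """Return the ID of the longest protein (proteins: gene ID -> sequence)."""
    return max(proteins, key=lambda gene_id: len(proteins[gene_id]))


# Toy example with three accessions (hypothetical data).
accessions = ["A", "B", "C"]
orthogroups = {
    "OG1": {"A": ["a1"], "B": ["b1", "b2"], "C": ["c1"]},
    "OG2": {"A": ["a2"], "B": ["b3"]},
    "OG3": {"C": ["c2", "c3"]},
}
classes = classify_orthogroups(orthogroups, accessions)
representative = longest_representative({"b1": "MKT", "b2": "MKTAYIAK"})
```

In this toy example OG1 is core, OG2 is dispensable and OG3 is private, and b2 is chosen as the representative sequence because it is the longer of the two proteins.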
Horizontal gene transfer (HGT) analysis
We employed the published HGT pipeline of Ma et al.$^{93}$. To identify HGT-acquired genes, each algal protein must receive hits from as many taxa as possible, rather than simply a larger number of target proteins. We therefore first divided the NCBI NR database into seven groups (referred to as non-viridiplant in this study): Bacteria, Fungi, Archaea, viruses, Metazoa, Archaeplastida, and other protists (SAR group, Cryptophyceae, Haptista, Euglenozoa). DIAMOND was used to align the proteins of each species against each of these seven databases, with the parameters "--sensitive -e 1e-5 -f 6 --max-target-seqs 500 --max-hsps 1 --taxon-k 1 --id 40". We used the Alien index (AI) to identify candidate HGT genes, calculated as $\mathrm{AI}=\ln\left(\text{E-value of best hit in Viridiplantae}+1\times10^{-200}\right)-\ln\left(\text{E-value of best hit in donor group}+1\times10^{-200}\right)$$^{158}$. Five groups, Bacteria (excluding cyanobacteria), Fungi, Archaea, viruses, and Metazoa, were treated as potential donor groups. With this definition, $\mathrm{AI}>0$ means that the best hit in the donor group has a smaller E-value than the best hit in Viridiplantae, i.e., the gene is more similar to sequences from the donor group than to those from other Viridiplantae; we regard such genes as potential HGT candidates. Furthermore, we consider genes with $\mathrm{AI}>30$ to be high-confidence candidate HGT genes. A phylogenetic tree was constructed for each high-confidence candidate to verify the HGT.
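As an illustration of the Alien index thresholds, the following Python sketch computes AI from two best-hit E-values and assigns a candidate class. It is not the pipeline code. The sign convention (Viridiplantae term minus donor term) is the one under which AI > 0 corresponds to a stronger donor-group hit, matching the interpretation given in the text. Treating a missing hit as an E-value of 1 is an assumption of this sketch, as are the function and label names.

```python
import math

PSEUDOCOUNT = 1e-200   # keeps the logarithm finite when an E-value is 0
NO_HIT_EVALUE = 1.0    # assumption: a group with no hit is scored as E = 1


def alien_index(evalue_viridiplantae, evalue_donor):
    """AI = ln(E_viridiplantae + 1e-200) - ln(E_donor + 1e-200)."""
    e_in = NO_HIT_EVALUE if evalue_viridiplantae is None else evalue_viridiplantae
    e_out = NO_HIT_EVALUE if evalue_donor is None else evalue_donor
    return math.log(e_in + PSEUDOCOUNT) - math.log(e_out + PSEUDOCOUNT)


def classify_candidate(ai):
    """Apply the AI > 30 and AI > 0 thresholds (labels are hypothetical)."""
    if ai > 30:
        return "high_confidence"
    if ai > 0:
        return "candidate"
    return "not_candidate"


# Toy E-values (hypothetical): (best Viridiplantae hit, best donor-group hit).
examples = {
    "gene1": (1e-10, 1e-50),   # donor hit much stronger
    "gene2": (1e-15, 1e-20),   # donor hit slightly stronger
    "gene3": (1e-80, 1e-30),   # Viridiplantae hit stronger
    "gene4": (None, 0.0),      # no Viridiplantae hit, perfect donor hit
}
labels = {gene: classify_candidate(alien_index(e_in, e_out))
          for gene, (e_in, e_out) in examples.items()}
```

For gene1 the index is 40 ln(10), about 92, so it passes the high-confidence threshold; gene2 gives 5 ln(10), about 11.5, and is only a candidate.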
Identification of endogenized viral regions in Scenedesmaceae genomes
To identify virus-like regions in each genome, the geNomad pipeline$^{159}$ was first run to find potential viral regions or mobile genetic elements. Specifically, genes were first predicted with Prodigal$^{160}$ and then aligned to a viral marker database to demarcate preliminary proviral regions based on virus-specific hallmark genes. Subsequently, common features and neural networks were used to classify the candidate regions, yielding tentative regions containing viral mobile genetic elements. Finally, we manually verified each identified virus-like region by examining its physical characteristics (e.g., GC content, TE distribution, and the distribution of eukaryotic gene introns), gene accuracy (RNA support), alignments against the NCBI NR database (split into Viridiplantae and non-Viridiplantae), best-hit alignments of the viral genes against the nucleocytoplasmic large DNA virus database to assign nucleocytoplasmic virus orthologous groups (NCVOGs), and comparison with viral hallmark genes.
Identification of key functional genes
Candidate genes were systematically filtered using the following criteria: (1) the candidate gene sequences showed significant similarity to query genes obtained from previous studies or databases (BLAST E-value $<10 \times 10^{-5}$); and (2) the functional attributes of the candidate genes (Swiss-Prot functional annotations or online NR BLAST) were concordant with those of the query genes.
HMM profiles of the various domains of NLR-like genes were first downloaded from the Pfam database using their respective PF numbers. The HMM profiles downloaded in this study included the TIR (PF01582), TIR_2 (PF13676), RPW8 (PF05659), coiled coil (PF05710 and PF14916), NB-ARC (PF00931), NACHT (PF05729), and Lipase_3 (PF01764) domains, as well as all profiles of the TPR, WD40, ankyrin, and LRR domains. An HMM search with an E-value cutoff of 1e-5 was conducted to obtain candidate sequences from representative species, and the NLR-related domains were further verified via online Pfam annotation (http://pfam-legacy.xfam.org/). We used pepcoil$^{161}$ to confirm the coiled-coil structure. The sequences of the NLR-related domains were then extracted according to the annotation results for subsequent analyses. We used a local installation of iTAK (v 1.7a)$^{162}$ with default parameters to predict transcription factors and transcription regulators of Scenedesmaceae. For cell wall-related gene annotation, we used the CAZyme database as a query, and the web meta-server dbCAN3 (http://bcb.unl.edu/dbCAN2/index.php) was subsequently used to detect CAZymes. TCDB (tcdb.org) was used to identify potential transporters of Scenedesmaceae with an E-value cutoff of 1e-10.
To analyze the correlation between gene copy number or transcript level and the content of different types of lipids, we conducted an association analysis between the copy number and the transcript level of each gene family and the lipid content. For gene family copy numbers, we first clustered gene families with OrthoFinder in the 26 species for which lipid data were available and then calculated the Spearman correlation coefficient between the copy number of each gene family and the content of each type of lipid. For transcript levels, we took the transcript level of the longest gene in each gene family of each species to represent the transcript level of that gene family in that species. If a species had no gene copy in a family, the transcript level was recorded as zero. To eliminate differences in transcript levels among species, we normalized the transcript levels to values from 0 to 500. Specifically, we set the minimum and maximum transcript levels to 0 and 500, respectively, scaled the remaining transcript levels proportionally between 0 and 500, and then created a matrix of normalized transcript values. Finally, we calculated the Spearman correlation coefficient between the normalized transcript values of each gene family and the content of each type of lipid.
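The normalization and correlation steps can be sketched in Python as follows. This is an illustrative sketch rather than the analysis code: it assumes that the 0 to 500 scaling is applied within each species across gene families, it implements Spearman's coefficient directly as the Pearson correlation of average ranks, and all names and numbers are hypothetical.

```python
import math


def min_max_scale(values, upper=500.0):
    """Map the minimum to 0, the maximum to `upper`, the rest proportionally."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) * upper for v in values]


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns NaN if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return float("nan")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data (hypothetical): transcript levels of three gene families in one species.
scaled = min_max_scale([0.0, 10.0, 40.0])  # [0.0, 125.0, 500.0]

# Toy data (hypothetical): normalized transcript level of one gene family
# and lipid content across four species.
rho = spearman([120.0, 300.0, 300.0, 480.0], [11.0, 18.0, 25.0, 31.0])
```

With the tie between the second and third species, rho in this toy example is about 0.949. The same `spearman` function applies unchanged to the copy-number matrix, since only ranks enter the calculation.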
Stress experiments and differential gene expression analysis
For differential transcriptome analyses, algae were grown under different cultivation conditions. In brief, 800 mL of a dense culture of T. lagerheimii was concentrated by centrifugation (5 min, 250 × g) to 60 mL. Five milliliters of the concentrated suspension was applied to each agar plate (9 cm diameter). In total, 12 agar plates (1% [w/w] agar) were inoculated. Algae were grown on the agar plates for three weeks as follows: the control was grown on 3N BBM+V agar under regular cultivation conditions (see above for suspension cultures); for nitrogen limitation, 0.1N BBM+V was used. To test for heterotrophy, 0.5% (w/w) glucose in 3N BBM+V was applied, and the plates were either exposed to regular cultivation conditions or stored in the dark. Three replicate plates were used for each treatment. Algae were harvested by applying 2 mL of the respective culture medium (3N BBM+V or 0.1N BBM+V) to the plates, after which all cells were removed with a scraper and a micropipette. The cells were frozen in liquid N$_2$ and further processed as described above for suspension cultures.
Analyses of differences in transcript levels under different growth conditions
The RNA-seq reads were trimmed using Trimmomatic$^{166}$ and mapped against the annotated gene models using Bowtie2$^{167}$, retaining the best alignments. FPKM (fragments per kilobase of transcript per million mapped reads) values were calculated using RSEM$^{168}$, as incorporated in the Trinity package$^{146}$. Genes with an FDR $\leq 0.01$ and at least a 2-fold change in transcript level were identified as differentially expressed using DESeq2$^{169}$.
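To make the two quantities in this step concrete, the sketch below shows the textbook FPKM formula and the filter implied by the FDR and fold-change thresholds. It is an illustration only: RSEM estimates abundances with an effective transcript length and a statistical model, which this simplified formula does not reproduce, and the function names are hypothetical. The filter is meant to be applied to the `log2FoldChange` and `padj` columns of a DESeq2 results table.

```python
import math


def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments.

    Simplified: uses the plain transcript length, not an effective length.
    """
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def is_differential(log2_fold_change, padj, max_fdr=0.01, min_fold=2.0):
    """True if padj <= max_fdr and the change is at least min_fold-fold."""
    if padj is None or math.isnan(padj):
        # A missing adjusted p-value is treated as not significant
        # (an assumption of this sketch).
        return False
    return padj <= max_fdr and abs(log2_fold_change) >= math.log2(min_fold)


# Toy values (hypothetical): 500 fragments on a 2 kb transcript,
# 10 million mapped fragments in the library.
value = fpkm(500, 2000, 10_000_000)  # 25.0

# Toy DESeq2-style rows (hypothetical): (log2FoldChange, padj).
rows = {"g1": (-2.3, 0.001), "g2": (3.0, 0.05), "g3": (0.5, 1e-6)}
selected = [gene for gene, (lfc, padj) in rows.items() if is_differential(lfc, padj)]
```

Only g1 passes both thresholds here: g2 fails the FDR cutoff and g3 changes by less than 2-fold.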
Statistical analysis
GO enrichment was tested using Fisher's exact test with Benjamini-Hochberg false discovery rate (FDR) correction at a threshold of 0.01, and the statistical tests were carried out in R (https://www.r-project.org/). All details of the statistics applied are provided alongside the respective analysis in each Methods section.
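The enrichment test and multiple-testing correction can be sketched as follows. The analyses in this study were run in R; this Python version is only an illustration. It assumes a one-sided (over-representation) Fisher's exact test, which the text does not specify, and the function names and numbers are hypothetical.

```python
from math import comb


def fisher_enrichment_p(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    k: genes in the test set annotated with the GO term
    n: size of the test set
    K: genes in the background annotated with the GO term
    N: size of the background
    Returns P(X >= k) for X ~ Hypergeometric(N, K, n).
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy example (hypothetical): all 3 test-set genes carry a term that is
# annotated on 3 of 6 background genes.
p = fisher_enrichment_p(k=3, n=3, K=3, N=6)  # 1/20 = 0.05

# Toy p-values for four GO terms (hypothetical).
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
enriched = [i for i, q in enumerate(padj) if q <= 0.01]
```

In the toy example the adjusted values are 0.02, 0.04, 0.04 and 0.02, so no term passes the 0.01 threshold.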