

BioCLIP: A Vision Foundation Model for the Tree of Life

Samuel Stevens, Jiaman Wu, Matthew J Thompson, The Ohio State University
Elizabeth G Campolongo, The Ohio State University
Chan Hee Song, The Ohio State University
David Edward Carlyn, The Ohio State University
Li Dong, Microsoft Research
Wasila M Dahdul, University of California, Irvine
Charles Stewart, Rensselaer Polytechnic Institute
Tanya Berger-Wolf, The Ohio State University
Wei-Lun Chao, The Ohio State University
Yu Su, The Ohio State University
Abstract

Images of the natural world, collected by a variety of cameras, from drones to individual phones, are increasingly abundant sources of biological information. There is an explosion of computational methods and tools, particularly computer vision, for extracting biologically relevant information from images for science and conservation. Yet most of these are bespoke approaches designed for a specific task and are not easily adaptable or extendable to new questions, contexts, and datasets. A vision model for general organismal biology questions on images is of timely need. To approach this, we curate and release TreeOfLife-10M, the largest and most diverse ML-ready dataset of biology images. We then develop BioCLIP, a foundation model for the tree of life, leveraging the unique properties of biology captured by TreeOfLife-10M, namely the abundance and variety of images of plants, animals, and fungi, together with the availability of rich structured biological knowledge. We rigorously benchmark our approach on diverse fine-grained biology classification tasks, and find that BioCLIP consistently and substantially outperforms existing baselines (by 17% to 20% absolute). Intrinsic evaluation reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life, shedding light on its strong generalizability. (See github.com/Imageomics/bioclip for models, data and code.)

* Equal contribution. {stevens.994,su.809}@osu.edu
[Figure 1 images, including panels (d) Onoclea sensibilis and (e) Onoclea hintonii]
Figure 1: (a) Two taxa, or taxonomic labels, for two different plants, Onoclea sensibilis (d) and Onoclea hintonii (e). These taxa are identical except for the species. (b) The autoregressive text encoder naturally encodes the hierarchical structure of the taxonomy. See how the Order token(s) (Polypodiales) can incorporate information from the Kingdom, Phylum and Class tokens, but nothing later in the hierarchy. This helps align the visual representations to this same hierarchical structure (see Sec. 4.6). (c) These hierarchical representations of taxonomic labels are fed into the standard contrastive pre-training objective and are matched with image representations (d) and (e).

1 Introduction

Digital images and computer vision are quickly becoming pervasively used tools to study the natural world, from evolutionary biology [13, 48] to ecology and biodiversity [79, 5, 74]. The capability to rapidly convert vast quantities of images from museums [61], camera traps [6, 7, 74, 56, 1], and citizen science platforms [37, 57, 76, 2, 51, 75, 77, 59, 72, 83, 85, 55] into actionable information (e.g., species classification, individual identification, and trait detection) has accelerated and enabled new advances in tasks such as species delineation [30], understanding mechanisms of adaptation [36, 21], abundance and population structure estimation [37, 78, 55, 3], and biodiversity monitoring and conservation [79].

However, applying computer vision to answer a biological question is still a laborious task requiring substantial machine learning expertise and effort—biologists must manually label sufficient data for the specific taxa and task of interest, and find and train a suitable model for the task. Meanwhile, foundation models [12] such as CLIP [66] and GPT-3 [14] are extraordinarily valuable by enabling zero-shot or few-shot learning for a wide range of tasks. An analogous vision foundation model for biology should be useful for tasks spanning the entire tree of life [34, 50] instead of just the taxa it has been trained on. That would significantly lower the barrier to apply AI to biology.

In this work, we aim to develop such a vision foundation model for the tree of life. To be broadly useful for real-world biology tasks, this model should meet the following criteria. First, it should generalize to the entire tree of life, where possible, to ensure it supports researchers studying many different clades rather than a niche. Furthermore, it is infeasible to collect training data that covers the millions of known taxa [35, 41], so the model must generalize to taxa not present in training data. Second, it should learn fine-grained representations of images of organisms as biology frequently engages with organisms that are visually similar, like closely related species within the same genus [64] or species mimicking others’ appearances for a fitness advantage [36]. This fine-grained granularity is crucial because the tree of life organizes living things into both broad categories (animal, fungus, and plant) and very fine-grained ones (see Fig. 1). Finally, due to the high cost of data collection and labeling in biology, strong performance in the low-data regime (i.e., zero-shot or few-shot) is critical.

While the goals of generalization, fine-grained classification, and data efficiency are not new in computer vision, existing general-domain vision models [66, 92, 58] trained on hundreds of millions of images fall short when applied to evolutionary biology and ecology. Specifically, existing vision models produce general fine-grained representations, useful for comparing common organisms like dogs and wolves, but not for more fine-grained comparisons, e.g., Onoclea sensibilis L. and Onoclea hintonii (see Fig. 1).

We identify two major barriers to developing a vision foundation model for biology. First, there is a need for suitable pre-training datasets: existing datasets [86, 84, 82, 26] lack either scale, diversity, or fine-grained labels. Second, there is a need to investigate suitable pre-training strategies that leverage special properties of the biology domain to better achieve the three pivotal goals, e.g., the tree of life taxonomy, which is insufficiently considered in mainstream pre-training algorithms [45, 58, 66].

In light of these goals and challenges in achieving them, we introduce 1) TreeOfLife-10M, a large-scale ML-ready biology image dataset, and 2) BioCLIP, a vision foundation model for the tree of life, trained with suitable use of taxa in TreeOfLife-10M. We outline the contributions, conceptual framework, and design decisions below:

1. TreeOfLife-10M: a large-scale, diverse ML-ready biology image dataset. We curate and release the largest-to-date ML-ready dataset of biology images with associated taxonomic labels, containing over 10 million images covering 454 thousand taxa in the tree of life. (By ML-ready, we mean the data is standardized in a format suitable for training ML models and is readily available for downloading.) In comparison, the current largest ML-ready biology image dataset, iNat21 [82], contains only 2.7 million images covering 10 thousand taxa. TreeOfLife-10M integrates existing high-quality datasets like iNat21 and Bioscan-1M [26]. More importantly, it includes newly curated images from the Encyclopedia of Life (eol.org), which supplies most of TreeOfLife-10M’s data diversity. Every image in TreeOfLife-10M is labeled with its taxonomic hierarchy to the finest level possible, as well as higher taxonomic ranks in the tree of life (see Figs. 1 and 3 for examples of taxonomic ranks and labels). TreeOfLife-10M enables training BioCLIP and future biology foundation models.

2. BioCLIP: a vision foundation model for the tree of life. With a large-scale labeled dataset like TreeOfLife-10M, a standard, intuitive training strategy (as adopted by other vision models like ResNet50 [31] and Swin Transformer [45]) is to use a supervised classification objective and learn to predict the taxonomic indices from an image. However, this fails to recognize and leverage the rich structure of taxonomic labels—taxa do not exist in isolation but are interconnected in a comprehensive taxonomy. Consequently, a model trained via plain supervised classification may not generalize well to taxa unseen in training, nor could it support zero-shot classification of unseen taxa.

Instead, we propose a novel strategy combining CLIP-style multimodal contrastive learning [66] with the rich biological taxonomy for BioCLIP. We “flatten” the taxonomy from Kingdom to the distal-most taxon rank into a string called the taxonomic name, and use the CLIP contrastive learning objective to learn to match images with their corresponding taxonomic names. Intuitively, this helps the model generalize to unseen taxa—even if the model has not seen a species, it has likely learned a reasonable representation for that species’ genus or family (see Fig. 1). BioCLIP also supports zero-shot classification with taxonomic names of unseen taxa. We further propose, and demonstrate the effectiveness of, a mixed text type training strategy; by mixing different text types (e.g., taxonomic vs. scientific vs. common names) during training, we retain the generalization from taxonomic names while gaining more flexibility at test time. For example, BioCLIP still excels even if only common species names are offered by downstream users.

3. Comprehensive benchmarking. We comprehensively evaluate BioCLIP on 10 fine-grained image classification datasets covering animals, plants, and fungi, including a newly curated Rare Species dataset unseen in training. BioCLIP achieves strong performance in both zero-shot and few-shot settings and substantially outperforms both CLIP [66] and OpenCLIP [39], leading to an average absolute improvement of 18% (zero-shot) and 17% (few-shot). Intrinsic analysis further reveals that BioCLIP has learned a more fine-grained hierarchical representation conforming to the tree of life, explaining its superior generalization.

| Dataset | Description | Images | Unique Classes |
|---|---|---|---|
| iNat21 | Citizen scientist labeled image dataset from iNaturalist for fine-grained classification. | 2.7M | 10,000 |
| Bioscan-1M | Expert labeled image dataset of insects for classification. | 1.1M | 7,831 |
| EOL | A new dataset with citizen scientist images sourced from Encyclopedia of Life and taxonomic labels standardized by us. | 6.6M | 448,910 |
| TreeOfLife-10M | Largest-to-date ML-ready dataset of biology images with taxonomic labels. | 10.4M | 454,103 |

Table 1: Training data sources used in TreeOfLife-10M. We integrate and canonicalize taxonomic labels across the sources (Sec. 2.2).

2 TreeOfLife-10M

Recent work has shown that data quality and diversity are critical when training CLIP models [22, 54, 24]. We curate TreeOfLife-10M, the most diverse large-scale public ML-ready dataset for computer vision models in biology.

2.1 Images

The largest ML-ready biology image dataset is iNat21 [82], which contains 2.7M images of 10K species. Despite this class breadth compared to popular general-domain datasets like ImageNet-1K [67], 10K species are rather limited in the context of biology. The International Union for Conservation of Nature (IUCN) reported over 2M total described species in 2022, with over 10K bird species and over 10K reptile species alone [41]. iNat21’s species diversity limits its potential for pre-training a foundation model for the entire tree of life.

Motivated to find high-quality biology images with a focus on species diversity, we turn to the Encyclopedia of Life project (EOL; eol.org). EOL collaborates with a variety of institutions to gather and label millions of images. We download 6.6M images from EOL and expand our dataset to cover an additional 440K taxa.

Species are not evenly distributed among the different subtrees in the tree of life; insects (of the class Insecta with 1M+ species), birds (of the class Aves with 10K+ species) and reptiles (of the class Reptilia with 10K+ species) are examples of highly diverse subtrees with many more species. To help a foundation model learn extremely fine-grained visual representations for insects, we also incorporate Bioscan-1M [26], a recent dataset of 1M lab images of insects, covering 494 different families. (We note that Bioscan-1M’s label granularity may still be limited for insects: 98.6% of Bioscan-1M’s images are labeled to the family level but only 22.5% and 7.5% of the images have genus or species indicated, respectively. Lack of label granularity is an inherent challenge.) Furthermore, Bioscan-1M contains lab images, rather than in situ images like iNat21, diversifying the image distribution.

Figure 2: Treemap of the 108 phyla in TreeOfLife-10M. Different colors are different phyla; nested boxes represent classes, orders, and families. Box size, not number of inner boxes, represents relative number of samples.

2.2 Metadata & Aggregation

The TreeOfLife-10M dataset integrates iNat21 (training split), our curated EOL dataset, and Bioscan-1M by aggregating the images and canonicalizing the labels. This is a highly non-trivial task because taxonomic hierarchies are notoriously noisy and rarely consistent between sources [33, 29, 49, 60, 4], likely contributing to the prior lack of image datasets large enough to train a foundation-scale vision model for the entire tree of life. We carefully unify and backfill taxonomic hierarchies from EOL, the Integrated Taxonomic Information System (ITIS) [40] and iNaturalist with special consideration for the existence of homonyms (genus-species labels shared among higher-order taxa). For more information on this process, the challenges, our solutions, and remaining issues, see Appendix C.
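To illustrate the kind of canonicalization involved, here is a simplified sketch that backfills missing higher ranks from a reference taxonomy keyed on genus and flags homonyms for manual review. The column names, the reference table, and the assumption that records already carry kingdom-through-family columns (possibly empty) are illustrative; the actual pipeline (Appendix C) handles many more cases.

```python
import pandas as pd

HIGHER_RANKS = ["kingdom", "phylum", "class", "order", "family"]

def backfill_higher_ranks(records: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Fill in missing higher ranks for each record by looking up its genus in a
    reference taxonomy (e.g., assembled from EOL and ITIS exports). Genera that map
    to more than one higher-rank lineage are homonyms and get flagged for review."""
    lineages = reference[["genus"] + HIGHER_RANKS].drop_duplicates()
    homonyms = set(lineages.loc[lineages["genus"].duplicated(keep=False), "genus"])

    merged = records.merge(lineages.drop_duplicates("genus"), on="genus",
                           how="left", suffixes=("", "_ref"))
    for rank in HIGHER_RANKS:
        merged[rank] = merged[rank].fillna(merged[rank + "_ref"])  # backfill missing ranks only
    merged["needs_review"] = merged["genus"].isin(homonyms)
    return merged[["genus", "species"] + HIGHER_RANKS + ["needs_review"]]
```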

2.3 Release & Statistics

Tab. 1 presents dataset statistics: TreeOfLife-10M has over 10M images across more than 450K unique taxonomic names. Fig. 2 shows the distribution of images by phyla and the respective lower-rank taxa (order through family).

We will release our curated training and test data (TreeOfLife-10M and Rare Species, described in Sec. 4.2) on Hugging Face (with DOIs) under a public domain waiver, to the extent primary source licenses allow. This will include a CSV with image metadata and links to the primary sources, accompanied by a GitHub repository with the scripts to generate the datasets. (We encourage future work to cite iNat21 [82] and Bioscan-1M [26], and to appropriately attribute images from EOL based on their licenses, if citing TreeOfLife-10M.)

| Group | Name | Description | Examples | Classes | Labels |
|---|---|---|---|---|---|
| Animals | Birds 525 | Scraped dataset of bird images from web search [65]. | 89,885 | 525 | Taxonomic |
| Animals | Plankton | Expert-labeled in situ images of plankton [32]. | 4,080 | 102 | Mixed |
| Animals | Insects | Expert and volunteer-labeled in-the-wild citizen science images of insects [71]. | 4,680 | 117 | Scientific |
| Animals | Insects 2 | Mixed common and scientific name classification for insect pests [88]. | 4,080 | 102 | Mixed |
| Plants & Fungi | PlantNet | Citizen science species-labeled plant images, some drawings [25]. | 1,000 | 25 | Scientific |
| Plants & Fungi | Fungi | Expert-labeled images of Danish fungi [63]. | 1,000 | 25 | Scientific |
| Plants & Fungi | PlantVillage | Museum-style leaf specimens labeled with common names [23]. | 1,520 | 38 | Common |
| Plants & Fungi | Medicinal Leaf | Species classification of leaves from mature, healthy medicinal plants [68]. | 1,040 | 26 | Scientific |
| Plants & Fungi | PlantDoc | 17 diseases for 13 plant species [73]. | 1,080 | 27 | Common |
| — | Rare Species | Subset of species in the IUCN Red List categories: Near Threatened through Extinct in the Wild (iucnredlist.org). | 12,000 | 400 | Taxonomic |

Table 2: Datasets used for evaluation. All tasks are classification, evaluated with top-1 accuracy.

3 Modeling

BioCLIP is initialized from OpenAI’s public CLIP checkpoint and continually pre-trained on TreeOfLife-10M with CLIP’s multimodal contrastive learning objective.

3.1 Why CLIP?

Compared with general domain computer vision tasks, one of the most salient differences for the biology domain is its rich label space. Not only are the taxon labels large in quantity (there are 2M+ recorded species as of 2022 [41]), but they are also connected with each other in a hierarchical taxonomy. This is a challenge for training a good foundation model that can achieve satisfactory coverage and generalization. Despite this, the intricate structure in the label space, accumulated through centuries of biology research, provides a very rich signal for learning better generalization. Intuitively, if the label space’s structure is successfully encoded in a foundation model, even if the model has not seen a certain species, it will likely have learned a good representation for that species’ corresponding genus or family. Such a hierarchical representation serves as a strong prior to enable few-shot or even zero-shot learning of new taxa.

Many vision foundation models, such as ResNet [31] and Swin Transformer [45], adopt a supervised classification objective and directly learn the mapping from input images to class indices. As a result, each class label is treated as a distinct symbol, and their relationships are neglected. A key realization of our work is that the multimodal contrastive learning objective used in CLIP can be repurposed for leveraging the hierarchical structure of the label space. This is not an obvious choice; after all, TreeOfLife-10M is largely labeled with class labels and not with free-form text like image captions. The autoregressive text encoder naturally embeds the taxonomic hierarchy into a dense label space by conditioning later taxonomic rank representations on higher ranks (Fig. 1). While hierarchical classification [9, 93, 11] can also leverage taxonomy, we empirically show that CLIP-style contrastive learning significantly improves generalization (Sec. 4.4). We note that repurposing CLIP’s multimodal contrastive learning objective for learning hierarchical representations conforming to a taxonomy is a novel and non-trivial technical contribution.
许多视觉基础模型,如 ResNet [31] 和 Swin Transformer [45],采用监督分类目标,直接学习从输入图像到类索引的映射。因此,每个类标签都被视为一个不同的符号,它们的关系被忽略了。我们工作的一个关键实现是,CLIP 中使用的多模态对比学习目标可以重新用于利用标签空间的层次结构。这不是一个明显的选择;毕竟,TreeOfLife-10M 主要使用类标签进行标记,而不是使用图像标题等自由格式的文本。自回归文本编码器通过对更高等级的后续分类等级表示进行调节,自然地将分类层次结构嵌入到密集的标签空间中(图 D)。 1).虽然分层分类 [99311] 也可以利用分类法,但我们实证表明 CLIP 风格的对比学习显着提高了泛化性(4.4  )。我们注意到,重新利用 CLIP 的多模态对比学习目标来学习符合分类法的分层表示是一项新颖且非平凡的技术贡献。

CLIP trains two uni-modal embedding models, a vision encoder and a text encoder, to (1) maximize feature similarity between positive (image, text) pairs and (2) minimize feature similarity between negative (image, text) pairs, where positive pairs come from the training data and negative pairs are all other possible (image, text) pairings in a batch. After training, CLIP’s encoder models embed individual instances of their respective modalities into a shared feature space. Next, we discuss formatting the text input to CLIP to incorporate the taxonomic structure.
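The contrastive objective can be made concrete with a short sketch. Below is a minimal PyTorch version of the symmetric CLIP loss, assuming pre-computed (un-normalized) image and text embeddings; the function name and temperature value are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) pairs.

    image_feats, text_feats: (batch, dim) tensors from the two encoders.
    The i-th image and i-th text form the positive pair; every other pairing
    in the batch serves as a negative.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)

    logits = image_feats @ text_feats.t() / temperature          # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # positives on the diagonal

    loss_i2t = F.cross_entropy(logits, targets)      # match each image to its text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # match each text to its image
    return (loss_i2t + loss_t2i) / 2
```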

3.2 Text Types

| Text Type | Example |
|---|---|
| Common | black-billed magpie |
| Scientific | Pica hudsonia |
| Taxonomic | Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia |
| Scientific + Common | Pica hudsonia with common name black-billed magpie |
| Taxonomic + Common | Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia with common name black-billed magpie |

Table 3: Text types considered in the training of BioCLIP.

An advantage of CLIP is that the text encoder accepts free-form text. In biology, unlike other classification tasks, class names are diversely formatted. We consider the following:

Taxonomic name. A standard seven-level biology taxonomy from higher to lower level is kingdom, phylum, class, order, family, genus and species. For each species, we “flatten” the taxonomy by concatenating all labels from root to leaf into a single string, which we call the taxonomic name.

Scientific name. Scientific names are composed of genus and species (e.g., Pica hudsonia).

Common name. Taxonomy categories are usually Latin, which is not often seen in generalist image-text pre-training datasets. Instead, the common name, such as “black-billed magpie,” is more widespread. Note that common names may not have a 1-to-1 mapping to taxa: A single species may have multiple common names, or the same common name may refer to multiple species.

For certain downstream use cases of BioCLIP, it might be the case that only one type of label, e.g., scientific names, is available. To improve the flexibility at inference time, we propose a mixed text type training strategy: at each training step, we pair each input image with a text randomly sampled from all of its available text types (shown in Tab. 3). We empirically show that this simple strategy retains the generalization benefits of taxonomic names while providing more flexibility in using other names at inference time (Sec. 4.3). The final text input to CLIP is the name in the standard CLIP template, e.g., “a photo of Pica hudsonia”.
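To make the mixed text type strategy concrete, here is a minimal sketch of how one training caption could be assembled and sampled per step. The record layout, field names, and helper functions are illustrative assumptions, not the paper’s actual data format.

```python
import random

RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]

def make_text_types(record):
    """Build the text variants of Tab. 3 from one labeled record (a dict of rank -> name)."""
    taxonomic = " ".join(record[r] for r in RANKS if record.get(r))  # "flattened" taxonomy
    scientific = f"{record['genus']} {record['species']}"
    texts = {"taxonomic": taxonomic, "scientific": scientific}
    if record.get("common"):
        texts["common"] = record["common"]
        texts["scientific+common"] = f"{scientific} with common name {record['common']}"
        texts["taxonomic+common"] = f"{taxonomic} with common name {record['common']}"
    return texts

def sample_caption(record):
    """Mixed text type strategy: pick one available text type at random each training step."""
    name = random.choice(list(make_text_types(record).values()))
    return f"a photo of {name}."

example = {"kingdom": "Animalia", "phylum": "Chordata", "class": "Aves",
           "order": "Passeriformes", "family": "Corvidae",
           "genus": "Pica", "species": "hudsonia", "common": "black-billed magpie"}
print(sample_caption(example))
```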

4 Experiments

We train BioCLIP on TreeOfLife-10M, compare BioCLIP to general vision models, and investigate how our modeling choices affect BioCLIP’s performance.

4.1 Training and Evaluation Details

To train BioCLIP, we initialize from OpenAI’s CLIP weights [66] with a ViT-B/16 vision transformer [20] image encoder and a 77-token causal autoregressive transformer text encoder. We continue pre-training on TreeOfLife-10M for 100 epochs with a cosine learning rate schedule [46]. We train on 8 NVIDIA A100-80GB GPUs over 2 nodes with a global batch size of 32,768. We also train a baseline model on only the iNat21 dataset and multiple ablation models on 1M examples randomly sampled from TreeOfLife-10M (Secs. 4.3 and 4.4), following the same procedure as for BioCLIP except with a smaller global batch size of 16,384 on 4 NVIDIA A100 GPUs on 1 node. All hyperparameters and training details are in Appendix D; the training and evaluation code will be released.
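For reference, the initialization step can be reproduced with the open_clip library. This is only a minimal sketch of loading the starting checkpoint and tokenizing a taxonomic name; the paper’s actual data pipeline, optimizer, and schedule are not shown.

```python
import open_clip

# Start from OpenAI's released ViT-B/16 CLIP weights, then continue contrastive
# pre-training on (image, taxonomic name) pairs from TreeOfLife-10M.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-16")

# Example text input in the standard CLIP template (a TreeOfLife-10M dataloader
# would supply the matching image batch after `preprocess`).
texts = tokenizer(["a photo of Animalia Chordata Aves Passeriformes Corvidae Pica hudsonia"])
```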

We evaluate on 10 different classification tasks: the 8 biologically-relevant tasks from Meta-Album [80], Birds 525 [65] and our new Rare Species task (described in Sec. 4.2). Meta-Album is a dataset collection for meta-learning, encompassing various subjects. Specifically, we use the Plankton, Insects, Insects 2, PlantNet, Fungi, PlantVillage, Medicinal Leaf, and PlantDoc datasets. Our classification tasks cover all four multi-celled kingdoms in the tree of life (animals, plants, fungi, and protists) and have a diverse image distribution (photographs, microscope images, drawings, and museum specimens). Tab. 2 provides an overview of the datasets; they comprise a variety of label types from full taxonomic names to only scientific or common name, or a mix of the latter.

For zero-shot learning, we follow the same procedure as CLIP. For few-shot learning, we follow SimpleShot [87] and use a nearest-centroid classifier. For k-shot learning, we first randomly sample k examples for each class and obtain the image embedding from the visual encoder of the pre-trained models. We then compute the average feature vector of the k embeddings as the centroid for each class. All the examples left in the dataset are used for testing. After applying mean subtraction and L2-normalization to each centroid and test feature vector, we choose the class with the nearest centroid to the test vector as the prediction. We repeat each few-shot experiment 5 times with different random seeds and report the mean accuracy in Tab. 4. Results with standard deviations are reported in Appendix E.
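A minimal sketch of this nearest-centroid evaluation, assuming image embeddings have already been extracted with the frozen vision encoder. Array names are illustrative, and the mean used for mean subtraction here (the mean of the class centroids) is an assumption rather than a detail taken from the paper.

```python
import numpy as np

def nearest_centroid_accuracy(support_feats, support_labels, query_feats, query_labels):
    """k-shot evaluation in the SimpleShot style: build one centroid per class from the
    k support embeddings, apply mean subtraction and L2 normalization, then predict the
    class of the nearest centroid for every query embedding."""
    support_labels = np.asarray(support_labels)
    classes = np.unique(support_labels)
    centroids = np.stack([support_feats[support_labels == c].mean(axis=0) for c in classes])

    mean = centroids.mean(axis=0)  # assumed choice of mean for mean subtraction
    def normalize(x):
        x = x - mean
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    centroids, queries = normalize(centroids), normalize(query_feats)
    # After L2 normalization, the nearest centroid in Euclidean distance is also the one
    # with the highest cosine similarity, so a dot product suffices.
    preds = classes[np.argmax(queries @ centroids.T, axis=1)]
    return float((preds == np.asarray(query_labels)).mean())
```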

We compare \modelname with the original OpenAI CLIP [66] as well as OpenCLIP [39] trained on LAION-400M [70]. Intuitively, common names of organisms are most pervasive in the training data of CLIP and OpenCLIP and these models work best with common names. This is also confirmed in our preliminary tests. Therefore, we use common names as class labels for CLIP and OpenCLIP by default unless unavailable for a dataset. \modelname can leverage taxonomic names, so we use taxonomic + common names by default. However, as noted in Tab. 2, the test datasets come in a variety of labels. Whenever the preferred label type is not available, we use labels that come with the dataset.

| Model | Birds 525 | Plankton | Insects | Insects 2 | PlantNet | Fungi | PlantVillage | Med. Leaf | PlantDoc | Rare Species | Mean (Δ) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Random Guessing | 0.2 | 1.2 | 1.0 | 1.0 | 4.0 | 4.0 | 2.6 | 4.0 | 3.7 | 0.3 | 2.2 |
| Zero-Shot Classification | | | | | | | | | | | |
| CLIP | 49.9 | 3.2 | 9.1 | 9.8 | 58.5 | 10.2 | 5.4 | 15.9 | 26.1 | 26.6 | 21.4 |
| OpenCLIP | 54.7 | 2.2 | 6.5 | 9.6 | 50.2 | 5.7 | 8.0 | 12.4 | 25.8 | 31.0 | 20.6 (−0.8) |
| BioCLIP | 72.1 | 6.1 | 34.8 | 20.4 | 91.4 | 40.7 | 24.4 | 38.6 | 28.4 | 37.8 | 39.4 (+18.0) |
| – iNat21 Only | 56.1 | 2.6 | 30.7 | 11.5 | 88.2 | 43.0 | 18.4 | 25.6 | 20.5 | 19.4 | 31.6 (+10.2) |
| One-Shot Classification | | | | | | | | | | | |
| CLIP | 43.7 | 25.1 | 21.6 | 13.7 | 42.1 | 17.2 | 49.7 | 70.1 | 24.8 | 28.4 | 33.6 |
| OpenCLIP | 53.7 | 32.3 | 23.2 | 14.3 | 45.1 | 18.4 | 53.6 | 71.2 | 26.8 | 29.3 | 36.7 (+3.1) |
| BioCLIP | 71.8 | 30.6 | 57.4 | 20.4 | 64.5 | 40.3 | 58.8 | 84.3 | 30.7 | 45.3 | 50.4 (+16.8) |
| – iNat21 Only | 74.8 | 29.6 | 53.9 | 19.7 | 67.4 | 35.5 | 55.2 | 75.1 | 27.8 | 37.6 | 47.6 (+14.0) |
| Five-Shot Classification | | | | | | | | | | | |
| CLIP | 73.5 | 41.2 | 39.9 | 24.6 | 65.2 | 27.9 | 71.8 | 89.7 | 35.2 | 46.9 | 51.5 |
| OpenCLIP | 81.9 | 52.5 | 42.6 | 25.0 | 68.0 | 30.6 | 77.8 | 91.3 | 42.0 | 48.4 | 56.0 (+4.5) |
| BioCLIP | 90.0 | 49.3 | 77.8 | 33.6 | 85.6 | 62.3 | 80.9 | 95.9 | 47.5 | 66.4 | 68.9 (+17.4) |
| – iNat21 Only | 90.1 | 48.2 | 73.7 | 32.1 | 84.7 | 55.6 | 77.2 | 93.5 | 41.0 | 56.0 | 65.1 (+13.6) |

Table 4: Zero-, one-, and five-shot classification top-1 accuracy for different CLIP models (Birds 525 through Insects 2 cover animals; PlantNet through PlantDoc cover plants and fungi). All models use the same architecture: ViT-B/16 vision encoder, 77-token text encoder. “iNat21 Only” follows the same procedure as BioCLIP but uses iNat21 instead of TreeOfLife-10M. Δ denotes the difference in mean accuracy from CLIP.
| Dataset | Train ↓ Test → | Com | Sci | Tax | Sci+Com | Tax+Com |
|---|---|---|---|---|---|---|
| ToL-1M | Com | 25.5 | 9.5 | 10.6 | 22.6 | 23.1 |
| ToL-1M | Sci | 11.1 | 22.2 | 4.4 | 20.9 | 7.7 |
| ToL-1M | Tax | 13.5 | 10.5 | 26.5 | 15.9 | 24.1 |
| ToL-1M | Sci+Com | 25.2 | 12.8 | 12.5 | 27.2 | 25.3 |
| ToL-1M | Tax+Com | 22.1 | 8.0 | 19.5 | 25.1 | 30.0 |
| ToL-1M | Mixture | 26.8 | 24.7 | 26.4 | 30.0 | 30.5 |
| iNat21-2.7M | Mixture | 20.9 | 14.7 | 15.6 | 21.7 | 21.8 |
| ToL-10M | Mixture | 31.1 | 30.3 | 33.4 | 36.6 | 37.8 |

Table 5: Zero-shot accuracy on species not seen during training (Rare Species task). Rows and columns indicate the text types used during training and at test time, respectively. Using the taxonomic name over the scientific name always improves accuracy (22.2→26.5 and 27.2→30.0). The final rows use the full iNat21 dataset and TreeOfLife-10M for reference.

4.2 Can BioCLIP Generalize to Unseen Taxa?

Taxonomic names are added, removed, and changed as biologists discover and classify new and existing species. BioCLIP should generalize to unseen taxonomic names so that it doesn’t need to be re-trained for every new species. To empirically answer whether BioCLIP generalizes well to unseen taxa, we introduce a new evaluation task that is both biologically and empirically motivated: Rare Species.

Classifying “rare” species is an important and challenging computer vision application in biology, particularly in the context of conservation efforts around the world [79]. To the best of our knowledge, there is no diverse, publicly available rare species classification dataset. Recently published examples [53, 44] lack species diversity with only a dozen classes, compared to our 400. We aim to fill this gap and collect all the species on the IUCN Red List (iucnredlist.org) classified as Near Threatened, Vulnerable, Endangered, Critically Endangered, or Extinct in the Wild. (IUCN has classified 150,388 species and generally updates their list twice per year; the classifications used for this dataset are current as of July 13, 2023.) There are approximately 25,000 species in these categories, and we select 400 species from this list where each is represented by at least 30 images in our EOL dataset. These species are then completely removed from TreeOfLife-10M, creating an unseen Rare Species test set with 30 images per species. This dataset demonstrates (1) BioCLIP’s out-of-distribution generalization to unseen taxa, (2) BioCLIP’s potential applications, and (3) a crucial large Rare Species dataset for the community to address the ongoing biodiversity crisis.
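A sketch of this construction logic follows. The DataFrame columns, the way the 400 eligible species are chosen, and the helper name are illustrative assumptions rather than the exact procedure used to build the released split.

```python
import pandas as pd

def build_rare_species_split(eol: pd.DataFrame, red_list_species: set,
                             n_species: int = 400, images_per_species: int = 30):
    """Select Red List species with at least 30 EOL images, hold out 30 test images
    for each, and remove those species from the training pool entirely."""
    candidates = eol[eol["species"].isin(red_list_species)]
    counts = candidates.groupby("species").size()
    eligible = list(counts[counts >= images_per_species].index)[:n_species]

    held_out = candidates[candidates["species"].isin(eligible)]
    test = held_out.groupby("species").head(images_per_species)  # 30 test images per species
    train = eol[~eol["species"].isin(eligible)]                  # these taxa stay unseen in training
    return train, test
```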

Results. Tab. 4 shows that BioCLIP substantially outperforms both baseline CLIP models, as well as the iNat21-trained CLIP model, at zero-shot classification, especially on unseen taxa (see the column labeled “Rare Species”).

We attribute BioCLIP’s strong zero-shot performance on this broad and diverse set of tasks to the broad and diverse classes present in TreeOfLife-10M. We further explore how data diversity leads to broadly useful image representations in Sec. 4.4. With multimodal contrastive learning (Sec. 3.1) and the incorporation of taxonomic structure (Sec. 3.2), BioCLIP capitalizes on this data diversity for strong zero-shot generalization.

| Objective | Mean 1-Shot | Mean 5-Shot |
|---|---|---|
| Cross-entropy | 16.7 | 26.3 |
| Hier. cross-entropy | 19.3 | 30.7 |
| CLIP | 45.1 | 64.2 |

Table 6: One- and five-shot classification top-1 accuracy for different pre-training objectives on TreeOfLife-1M. Results are macro-averaged over all the test sets in Tab. 4.

4.3 How Do Text Types Affect Generalization?

We investigate how different text types affect zero-shot generalization by training BioCLIP on a 10% subset of TreeOfLife-10M (10% due to computational constraints). We use our Rare Species dataset because the test classes have every text type, and all species are excluded from training, making it ideal for testing generalization to unseen taxa. Nguyen et al. [54] find that the diversity of captions makes for stronger vision models, and [69] randomly use one of five different captions for each image during training rather than a single fixed caption. Similarly, we use a mixed text type strategy (Sec. 3.2). How does that affect performance?

Results. The zero-shot ablation results are in Tab. 5; there are several salient observations. First, using taxonomic + common names yields the strongest performance, showing the importance of incorporating the taxonomic structure for generalization. Second, when only using a single text type for training, performance degrades substantially when a different text type is used at test time. Using mixed text types for training, while not the strongest, yields consistently strong performance across the board. These results indicate that mixed text type pre-training largely retains the generalization benefits of using taxonomic names while also providing flexibility of different text types for inference, an important property for a foundation model that may be used for diverse downstream tasks. Finally, using 1M examples from TreeOfLife-10M outperforms using 2.7M examples from iNat21, further confirming the importance of the added data diversity from TreeOfLife-10M.

Figure 3: T-SNE visualization of image features, colored by taxonomic labels. BioCLIP (B) is visualized in the first and third rows and OpenAI’s CLIP (O) is visualized in the second and fourth rows. BioCLIP’s features better preserve the hierarchical structure: while both BioCLIP and CLIP cleanly separate the phyla in the Animalia Kingdom (top left), only BioCLIP successfully separates the orders in the Insecta Class (top right) and the families in the Lepidoptera Order (bottom left).

4.4 Is the CLIP Objective Necessary?

Using the CLIP objective to pre-train on a labeled image dataset is an unintuitive decision (Goyal et al. [27] fine-tune using the CLIP objective, but do not pretrain). We justify our choice by training two ViT-B/16 models on TreeOfLife-1M using a cross-entropy classification loss and a multitask hierarchical variant, then compare them against the CLIP objective under the few-shot setting. The multitask hierarchical training objective is to predict the labels for kingdom, phylum, etc., down to species, using cross entropy for each level of the taxonomy, then summing those losses [11]. Pseudo-code is provided in Fig. D1.
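A minimal sketch of the multitask hierarchical baseline, assuming one linear head per taxonomic rank on top of a shared image backbone; the class and function names are illustrative, and the actual pseudo-code is in Fig. D1.

```python
import torch
import torch.nn as nn

class HierarchicalClassifier(nn.Module):
    """Shared backbone with one classification head per taxonomic rank
    (kingdom, phylum, ..., species)."""
    def __init__(self, backbone, feat_dim, num_classes_per_rank):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleList([nn.Linear(feat_dim, n) for n in num_classes_per_rank])

    def forward(self, images):
        feats = self.backbone(images)
        return [head(feats) for head in self.heads]  # one logit tensor per rank

def hierarchical_loss(rank_logits, rank_targets):
    """Sum of per-rank cross-entropy losses; rank_targets[i] holds the class index at rank i."""
    ce = nn.CrossEntropyLoss()
    return sum(ce(logits, targets) for logits, targets in zip(rank_logits, rank_targets))
```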

Results. We evaluate each model on the same set of 10 tasks but only in the one-shot and five-shot settings because non-CLIP models cannot do zero-shot classification. We report mean accuracy in Tab. 6. The hierarchical classification model outperforms simple classification and is comparable to the CLIP baseline (see Tab. 4). However, the CLIP objective massively outperforms both baselines and strongly justifies our repurposing of the CLIP objective.

4.5 Can BioCLIP Classify More Than Species?

BioCLIP is trained on a (contrastive) species-classification objective, which might limit its downstream use beyond species classification. We consider plant diagnosis with the PlantVillage and PlantDoc datasets, which require classifying both species and the disease present (if any).

Large-scale data labeling is expensive in biology, but biologists always label several instances for use in field guides or museum collections. Few-shot classification is thus an ideal setting for this sort of task transfer.

Results. We find that BioCLIP outperforms baselines at classifying plant diseases based on visual symptoms, in both zero-shot and few-shot settings (see the PlantVillage and PlantDoc columns in Tab. 4).

Notably, while Radford et al. [66] find that CLIP one-shot and two-shot classification is often worse than zero-shot (because few-shot settings cannot use the semantic information in the class name), BioCLIP has learned visual representations that are useful even with only one labeled example: BioCLIP’s mean one-shot accuracy is 9.1% higher than its zero-shot accuracy.

4.6 Does BioCLIP Learn the Hierarchy?

BioCLIP demonstrates strong performance in the low-data regime on our extrinsic evaluation, but why? We further conduct an intrinsic evaluation and directly visualize the image feature representations BioCLIP has learned to shed light on this question (Fig. 3). We embed image representations from iNat21’s validation set (which is not seen during training) in two dimensions using t-SNE [81], then color the points by the image’s taxonomic label. For each plot, we run t-SNE independently on the subset of examples under the labeled taxonomic rank. Each plot visualizes one rank of the taxonomic hierarchy and the top six categories of the following rank, e.g., the top left plot visualizes the six most common phyla of the Animalia Kingdom. At higher ranks like kingdom (omitted for space) and phylum, both CLIP and BioCLIP have good separation, but BioCLIP’s representation is more fine-grained and contains a richer clustering structure. At lower ranks, BioCLIP produces evidently more separable features, while CLIP’s features are cluttered and lack a clear clustering structure. Appendix F has more qualitative results and visuals.
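This visualization can be reproduced with off-the-shelf tools. A minimal sketch using scikit-learn’s t-SNE and matplotlib, assuming the image features and their taxonomic labels at one rank have already been extracted (variable names are illustrative):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_rank(features, labels, title):
    """Embed image features in 2-D with t-SNE and color points by taxonomic label."""
    labels = np.asarray(labels)
    coords = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    for name in np.unique(labels):
        mask = labels == name
        plt.scatter(coords[mask, 0], coords[mask, 1], s=4, label=name)
    plt.legend(markerscale=3)
    plt.title(title)
    plt.show()

# e.g., run independently on the subset of validation features under one phylum,
# coloring by the six most common classes of the next rank down.
```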

5 Related Work

Multimodal foundation model training data. CLIP [66] trained state-of-the-art vision models from noisy, web-scale (100M+) image-text datasets using a contrastive objective that is optimized for image retrieval. ALIGN [42] and BASIC [62] further scaled the number of training examples from 400M to 6.6B, improving vision representation quality. However, subsequent work [22, 54, 24, 90, 91] finds that dataset diversity and better alignment between the image and caption semantics are more important than dataset size and lead to stronger performance on downstream tasks with less compute. TreeOfLife-10M emphasizes the importance of diversity, adding over 440K classes to iNat21’s 10K and leading to stronger zero-shot performance even when training on fewer images (TreeOfLife-1M).

Domain-specific CLIPs. Domain-specific training often beats general training [28, 17]. Labeling domain-specific datasets can be prohibitively expensive because annotators must be subject-matter experts. Image-text training is thus particularly potent because annotations are optional; models can learn from noisy image-text pairs. Ikezogwo et al. [38] and Lu et al. [47] gathered 1M+ image-text pairs for use in computational pathology, where expert-labeled examples are difficult to gather due to both time and privacy. We gather 10× the images, emphasizing class diversity.

Hierarchy in computer vision. Hierarchy in computer vision is a well-studied topic, in part because ImageNet [67] concepts are scraped from the hierarchical WordNet [52]. Bilal et al. [10] develop a visual interface to study vision model predictions on ImageNet and find that model confusion patterns follow the same hierarchical structure present in the classes. They leverage this knowledge to incorporate hierarchy into AlexNet’s architecture [43] and improve the top-1 error on ImageNet by 8% absolute. Bertinetto et al. [9] measure the severity of image classifiers’ mistakes and propose alternative training objectives that incorporate class hierarchy. This dramatically reduces mistake severity at the expense of worsening top-1 accuracy. Zhang et al. [93] propose a contrastive learning objective, HiMulCon, where the hierarchical distance between labels corresponds to the desired distance in the embedding space. They use HiMulCon to fine-tune a ResNet that was pre-trained with cross-entropy on ImageNet and outperform cross-entropy on both ImageNet and iNat17 [84].

We repurpose the CLIP objective and feed taxonomic names to an autoregressive text encoder, which forms a dense hierarchical label space into which a vision encoder learns to embed images. This dense label space is better suited to the 454K unique labels, while prior work applied hierarchies to smaller label spaces.

Computer vision for biology. Fine-grained classification is a classic challenge in computer vision, and biological images are often used to benchmark models. Wah et al. [86], Berg et al. [8], Piosenka [65] all use bird species classification to evaluate fine-grained classification ability. Xiao et al. [89], Cole et al. [19] use biology tasks in contrastive learning frameworks, and Cole et al. [18] uses a biology task in weakly supervised object detection.

6 Conclusion

We introduce TreeOfLife-10M and BioCLIP, a large-scale diverse biology image dataset and a foundation model for the tree of life, respectively. Through extensive evaluation, we show that BioCLIP is a strong fine-grained classifier for biology in both zero- and few-shot settings. We corroborate our hypothesis that using the entire taxonomic name leads to stronger generalization than other caption types through an ablation on unseen species and by visualizing BioCLIP’s representations, finding that BioCLIP-embedded images better match the taxonomic hierarchy.

Although we leverage the CLIP objective to efficiently learn visual representations over hundreds of thousands of taxa, BioCLIP is fundamentally still trained with a classification objective. In future work, we will further scale up the data, e.g., incorporating research-grade images from inaturalist.org with 100M+ images, and collect richer textual descriptions of species’ appearances such that BioCLIP can extract fine-grained trait-level representations.

Acknowledgements

The authors would like to thank Josef Uyeda, Jim Balhoff, Dan Rubenstein, Hank Bart, Hilmar Lapp, Sara Beery, and colleagues from the Imageomics Institute and the OSU NLP group for their valuable feedback. We also thank the Bioscan-1M team and the iNaturalist team for making their data available and easy to use, and Jennifer Hammack at EOL for her invaluable help in accessing EOL’s images.

This research was supported by NSF OAC 2118240 and used computing resources from the Ohio Supercomputer Center [15].

References

  • Ahumada et al. [2020] Jorge A Ahumada, Eric Fegraus, Tanya Birch, Nicole Flores, Roland Kays, Timothy G O’Brien, Jonathan Palmer, Stephanie Schuttler, Jennifer Y Zhao, Walter Jetz, Margaret Kinnaird, Sayali Kulkarni, Arnaud Lyet, David Thau, Michelle Duong, Ruth Oliver, and Anthony Dancer. Wildlife Insights: A Platform to Maximize the Potential of Camera Trap and Other Passive Sensor Wildlife Data for the Planet. Environmental Conservation, 47(1):1–6, 2020. Edition: 2019/09/26 ISBN: 0376-8929 Publisher: Cambridge University Press.
  • Antonelli et al. [2023] Alexandre Antonelli, Kiran L. Dhanjal‐Adams, and Daniele Silvestro. Integrating machine learning, remote sensing and citizen science to create an early warning system for biodiversity. PLANTS, PEOPLE, PLANET, 5(3):307–316, 2023.
  • Araujo et al. [2022] Gonzalo Araujo, Ariana Agustines, Steffen S. Bach, Jesse E. M. Cochran, Emilio De La Parra-Galván, Rafael De La Parra-Venegas, Stella Diamant, Alistair Dove, Steve Fox, Rachel T. Graham, Sofia M. Green, Jonathan R. Green, Royale S. Hardenstine, Alex Hearn, Mahardika R. Himawan, Rhys Hobbs, Jason Holmberg, Ibrahim Shameel, Mohammed Y. Jaidah, Jessica Labaja, Savi Leblond, Christine G. Legaspi, Rossana Maguiño, Kirsty Magson, Stacia D. Marcoux, Travis M. Marcoux, Sarah Anne Marley, Meynard Matalobos, Alejandra Mendoza, Joni A. Miranda, Brad M. Norman, Cameron T. Perry, Simon J. Pierce, Alessandro Ponzo, Clare E. M. Prebble, Dení Ramírez-Macías, Richard Rees, Katie E. Reeve-Arnold, Samantha D. Reynolds, David P. Robinson, Christoph A. Rohner, David Rowat, Sally Snow, Abraham Vázquez-Haikin, and Alex M. Watts. Improving sightings-derived residency estimation for whale shark aggregations: A novel metric applied to a global data set. Frontiers in Marine Science, 9:775691, 2022.
  • A. Rees and Cranston [2017] Jonathan A. Rees and Karen Cranston. Automated assembly of a reference taxonomy for phylogenetic data synthesis. Biodiversity Data Journal, 5:e12581, 2017.
  • Beery [2021] Sara Beery. Scaling Biodiversity Monitoring for the Data Age. XRDS: Crossroads, The ACM Magazine for Students, 27(4):14–18, 2021.
  • Beery et al. [2020] Sara Beery, Elijah Cole, and Arvi Gjoka. The iwildcam 2020 competition dataset. arXiv preprint arXiv:2004.10340, 2020.
  • Beery et al. [2021] Sara Beery, Arushi Agarwal, Elijah Cole, and Vighnesh Birodkar. The iWildCam 2021 Competition Dataset, 2021. arXiv:2105.03494 [cs].
  • Berg et al. [2014] Thomas Berg, Jiongxin Liu, Seung Woo Lee, Michelle L. Alexander, David W. Jacobs, and Peter N. Belhumeur. Birdsnap: Large-scale fine-grained visual categorization of birds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • Bertinetto et al. [2020] Luca Bertinetto, Romain Mueller, Konstantinos Tertikas, Sina Samangooei, and Nicholas A Lord. Making better mistakes: Leveraging class hierarchies with deep networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12506–12515, 2020.
  • Bilal et al. [2018] Alsallakh Bilal, Amin Jourabloo, Mao Ye, Xiaoming Liu, and Liu Ren. Do convolutional neural networks learn class hierarchy? IEEE Transactions on Visualization and Computer Graphics, 24(1):152–162, 2018.
  • Bjerge et al. [2023] Kim Bjerge, Quentin Geissmann, Jamie Alison, Hjalte MR Mann, Toke T Høye, Mads Dyrmann, and Henrik Karstoft. Hierarchical classification of insects with multitask learning and anomaly detection. Ecological Informatics, 77:102278, 2023.
  • Bommasani et al. [2021] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  • Borowiec et al. [2022] Marek L Borowiec, Rebecca B Dikow, Paul B Frandsen, Alexander McKeeken, Gabriele Valentini, and Alexander E White. Deep learning as a tool for ecology and evolution. Methods in Ecology and Evolution, 13(8):1640–1660, 2022.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Center [1987] Ohio Supercomputer Center. Ohio supercomputer center, 1987.
  • Chao et al. [2016] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 52–68. Springer, 2016.
  • Chia et al. [2022] Patrick John Chia, Giuseppe Attanasio, Federico Bianchi, Silvia Terragni, Ana Rita Magalhães, Diogo Goncalves, Ciro Greco, and Jacopo Tagliabue. Contrastive language and vision learning of general fashion concepts. Scientific Reports, 12(1):18958, 2022.
  • Cole et al. [2022a] Elijah Cole, Kimberly Wilber, Grant Van Horn, Xuan Yang, Marco Fornoni, Pietro Perona, Serge Belongie, Andrew Howard, and Oisin Mac Aodha. On label granularity and object localization. In European Conference on Computer Vision, pages 604–620. Springer, 2022a.
  • Cole et al. [2022b] Elijah Cole, Xuan Yang, Kimberly Wilber, Oisin Mac Aodha, and Serge Belongie. When does contrastive visual representation learning work? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14755–14764, 2022b.
  • Dosovitskiy et al. [2020] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  • Ezray et al. [2019] Briana D. Ezray, Drew C. Wham, Carrie E. Hill, and Heather M. Hines. Unsupervised machine learning reveals mimicry complexes in bumblebees occur along a perceptual continuum. Proceedings of the Royal Society B: Biological Sciences, 286(1910):20191501, 2019.
  • Fang et al. [2022] Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, and Ludwig Schmidt. Data determines distributional robustness in contrastive language image pre-training (clip). In International Conference on Machine Learning, pages 6216–6234. PMLR, 2022.
  • G. and J. [2019] Geetharamani G. and Arun Pandian J. Identification of plant leaf diseases using a nine-layer deep convolutional neural network. Computers & Electrical Engineering, 76:323–338, 2019.
  • Gadre et al. [2023] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108, 2023.
  • Garcin et al. [2021] Camille Garcin, alexis joly, Pierre Bonnet, Antoine Affouard, Jean-Christophe Lombardo, Mathias Chouet, Maximilien Servajean, Titouan Lorieul, and Joseph Salmon. Pl@ntnet-300k: a plant image dataset with high label ambiguity and a long-tailed distribution. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2), 2021.
  • Gharaee et al. [2023] Zahra Gharaee, ZeMing Gong, Nicholas Pellegrino, Iuliia Zarubiieva, Joakim Bruslund Haurum, Scott C Lowe, Jaclyn TA McKeown, Chris CY Ho, Joschka McLeod, Yi-Yun C Wei, et al. A step towards worldwide biodiversity assessment: The bioscan-1m insect dataset. arXiv preprint arXiv:2307.10455, 2023.
  • Goyal et al. [2023] Sachin Goyal, Ananya Kumar, Sankalp Garg, Zico Kolter, and Aditi Raghunathan. Finetune like you pretrain: Improved finetuning of zero-shot vision models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19338–19347, 2023.
  • Gu et al. [2021] Yu Gu, Robert Tinn, Hao Cheng, Michael Lucas, Naoto Usuyama, Xiaodong Liu, Tristan Naumann, Jianfeng Gao, and Hoifung Poon. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare (HEALTH), 3(1):1–23, 2021.
  • Guralnick et al. [2015] Robert P. Guralnick, Nico Cellinese, John Deck, Richard L. Pyle, John Kunze, Lyubomir Penev, Ramona Walls, Gregor Hagedorn, Donat Agosti, John Wieczorek, Terry Catapano, and Roderic D. M. Page. Community next steps for making globally unique identifiers work for biocollections data. ZooKeys, 494:133–154, 2015.
  • Hansen et al. [2020] Oskar L. P. Hansen, Jens-Christian Svenning, Kent Olsen, Steen Dupont, Beulah H. Garner, Alexandros Iosifidis, Benjamin W. Price, and Toke T. Høye. Species-level image classification with convolutional neural network enables insect identification from habitus images. Ecology and Evolution, 10(2):737–747, 2020.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Heidi M. Sosik [2015] Heidi M. Sosik, Emily E. Peacock, and Emily F. Brownlee. Annotated plankton images - data set for developing and evaluating classification methods, 2015.
  • Hinchliff et al. [2015a] Cody E. Hinchliff, Stephen A. Smith, James F. Allman, J. Gordon Burleigh, Ruchi Chaudhary, Lyndon M. Coghill, Keith A. Crandall, Jiabin Deng, Bryan T. Drew, Romina Gazis, Karl Gude, David S. Hibbett, Laura A. Katz, H. Dail Laughinghouse, Emily Jane McTavish, Peter E. Midford, Christopher L. Owen, Richard H. Ree, Jonathan A. Rees, Douglas E. Soltis, Tiffani Williams, and Karen A. Cranston. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proceedings of the National Academy of Sciences, 112(41):12764–12769, 2015a.
  • Hinchliff et al. [2015b] Cody E Hinchliff, Stephen A Smith, James F Allman, J Gordon Burleigh, Ruchi Chaudhary, Lyndon M Coghill, Keith A Crandall, Jiabin Deng, Bryan T Drew, Romina Gazis, et al. Synthesis of phylogeny and taxonomy into a comprehensive tree of life. Proceedings of the National Academy of Sciences, 112(41):12764–12769, 2015b.
  • Hobern et al. [2021] Donald Hobern, Saroj K Barik, Les Christidis, Stephen T. Garnett, Paul Kirk, Thomas M Orrell, Thomas Pape, Richard L Pyle, Kevin R Thiele, Frank E Zachos, et al. Towards a global list of accepted species vi: The catalogue of life checklist. Organisms Diversity & Evolution, 21(4):677–690, 2021.
  • Hoyal Cuthill et al. [2019] Jennifer F Hoyal Cuthill, Nicholas Guttenberg, Sophie Ledger, Robyn Crowther, and Blanca Huertas. Deep learning on butterfly phenotypes tests evolution’s oldest mathematical model. Science advances, 5(8):eaaw4967, 2019.
  • Høye et al. [2021] Toke T Høye, Johanna Ärje, Kim Bjerge, Oskar LP Hansen, Alexandros Iosifidis, Florian Leese, Hjalte MR Mann, Kristian Meissner, Claus Melvad, and Jenni Raitoharju. Deep learning and computer vision will transform entomology. Proceedings of the National Academy of Sciences, 118(2):e2002545117, 2021.
  • Ikezogwo et al. [2023] Wisdom Oluchi Ikezogwo, Mehmet Saygin Seyfioglu, Fatemeh Ghezloo, Dylan Stefan Chan Geva, Fatwir Sheikh Mohammed, Pavan Kumar Anand, Ranjay Krishna, and Linda Shapiro. Quilt-1m: One million image-text pairs for histopathology. arXiv preprint arXiv:2306.11207, 2023.
  • Ilharco et al. [2021] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, 2021. If you use this software, please cite it as below.
  • [40] ITIS. Integrated taxonomic information system (itis) on-line database. www.itis.gov. Retrieved July 21, 2023.
  • IUCN [2022] IUCN. Iucn red list summary table 1a, 2022.
  • Jia et al. [2021] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR, 2021.
  • Krizhevsky et al. [2012] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. Curran Associates, Inc., 2012.
  • [44] Dan Liu, Jin Hou, Shaoli Huang, Jing Liu, Yuxin He, Bochuan Zheng, Jifeng Ning, and Jingdong Zhang. Lote-animal: A long time-span dataset for endangered animal behavior understanding.
  • Liu et al. [2021] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
  • Loshchilov and Hutter [2017] Ilya Loshchilov and Frank Hutter. SGDR: stochastic gradient descent with warm restarts. In 5th International Conference on Learning Representations, 2017.
  • Lu et al. [2023] Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Ivy Liang, Tong Ding, Guillaume Jaume, Igor Odintsov, Andrew Zhang, Long Phi Le, et al. Towards a visual-language foundation model for computational pathology. arXiv preprint arXiv:2307.12914, 2023.
  • Lürig et al. [2021] Moritz D Lürig, Seth Donoughe, Erik I Svensson, Arthur Porto, and Masahito Tsuboi. Computer vision, machine learning, and the promise of phenomics in ecology and evolutionary biology. Frontiers in Ecology and Evolution, 9:642774, 2021.
  • L. Pyle [2016] Richard L. Pyle. Towards a global names architecture: The future of indexing scientific names. ZooKeys, 550:261–281, 2016.
  • Maddison and Schultz [2007] David R. Maddison and K.-S. Schultz. The Tree of Life Web Project. 2007.
  • McKinley et al. [2017] Duncan C. McKinley, Abe J. Miller-Rushing, Heidi L. Ballard, Rick Bonney, Hutch Brown, Susan C. Cook-Patton, Daniel M. Evans, Rebecca A. French, Julia K. Parrish, Tina B. Phillips, Sean F. Ryan, Lea A. Shanley, Jennifer L. Shirk, Kristine F. Stepenuck, Jake F. Weltzin, Andrea Wiggins, Owen D. Boyle, Russell D. Briggs, Stuart F. Chapin, David A. Hewitt, Peter W. Preuss, and Michael A. Soukup. Citizen science can improve conservation science, natural resource management, and environmental protection. Biological Conservation, 208:15–28, 2017.
  • Miller [1995] George A. Miller. Wordnet: A lexical database for english. Commun. ACM, 38(11):39–41, 1995.
  • Mou et al. [2023] Chao Mou, Aokang Liang, Chunying Hu, Fanyu Meng, Baixun Han, and Fu Xu. Monitoring endangered and rare wildlife in the field: A foundation deep learning model integrating human knowledge for incremental recognition with few data and low cost. Animals, 13(20):3168, 2023.
  • Nguyen et al. [2022] Thao Nguyen, Gabriel Ilharco, Mitchell Wortsman, Sewoong Oh, and Ludwig Schmidt. Quality not quantity: On the interaction between dataset design and robustness of clip. Advances in Neural Information Processing Systems, 35:21455–21469, 2022.
  • Norman et al. [2017] Bradley M. Norman, Jason A. Holmberg, Zaven Arzoumanian, Samantha D. Reynolds, Rory P. Wilson, Dani Rob, Simon J. Pierce, Adrian C. Gleiss, Rafael De La Parra, Beatriz Galvan, Deni Ramirez-Macias, David Robinson, Steve Fox, Rachel Graham, David Rowat, Matthew Potenski, Marie Levine, Jennifer A. Mckinney, Eric Hoffmayer, Alistair D. M. Dove, Robert Hueter, Alessandro Ponzo, Gonzalo Araujo, Elson Aca, David David, Richard Rees, Alan Duncan, Christoph A. Rohner, Clare E. M. Prebble, Alex Hearn, David Acuna, Michael L. Berumen, Abraham Vázquez, Jonathan Green, Steffen S. Bach, Jennifer V. Schmidt, Stephen J. Beatty, and David L. Morgan. Undersea Constellations: The Global Biology of an Endangered Marine Megavertebrate Further Informed through Citizen Science. BioScience, 67(12):1029–1043, 2017.
  • Norouzzadeh et al. [2021] Mohammad Sadegh Norouzzadeh, Dan Morris, Sara Beery, Neel Joshi, Nebojsa Jojic, and Jeff Clune. A deep active learning system for species identification and counting in camera trap images. Methods in Ecology and Evolution, 12(1):150–161, 2021.
  • Nugent [2018] Jill Nugent. Inaturalist. Science Scope, 41(7):12–13, 2018.
  • Oquab et al. [2023] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, and Piotr Bojanowski. Dinov2: Learning robust visual features without supervision, 2023.
  • Parham et al. [2017] Jason Parham, Jonathan Crall, Charles Stewart, Tanya Berger-Wolf, and Daniel Rubenstein. Animal Population Censusing at Scale with Citizen Science and Photographic Identification. In AAAI 2017 Spring Symposium on AISOC, 2017.
  • Patterson et al. [2016] David Patterson, Dmitry Mozzherin, David Peter Shorthouse, and Anne Thessen. Challenges with using names to link digital biodiversity information. Biodiversity Data Journal, 4:e8080, 2016.
  • Pearson et al. [2020] Katelin D Pearson, Gil Nelson, Myla FJ Aronson, Pierre Bonnet, Laura Brenskelle, Charles C Davis, Ellen G Denny, Elizabeth R Ellwood, Hervé Goëau, J Mason Heberling, et al. Machine learning using digitized herbarium specimens to advance phenological research. BioScience, 70(7):610–620, 2020.
  • Pham et al. [2023] Hieu Pham, Zihang Dai, Golnaz Ghiasi, Kenji Kawaguchi, Hanxiao Liu, Adams Wei Yu, Jiahui Yu, Yi-Ting Chen, Minh-Thang Luong, Yonghui Wu, et al. Combined scaling for zero-shot transfer learning. Neurocomputing, 555:126658, 2023.
  • Picek et al. [2021] Lukas Picek, Milan Sulc, Jiri Matas, Jacob Heilmann-Clausen, Thomas S. Jeppesen, Thomas Laessoe, and Tobias Froslev. Danish fungi 2020 - not just another image recognition dataset. 2021.
  • Pinho et al. [2022] Catarina Pinho, Antigoni Kaliontzopoulou, Carlos A Ferreira, and João Gama. Identification of morphologically cryptic species with computer vision models: wall lizards (Squamata: Lacertidae: Podarcis) as a case study. Zoological Journal of the Linnean Society, 198(1):184–201, 2022.
  • Piosenka [2023] Gerald Piosenka. Birds 525 species - image classification. 2023.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Russakovsky et al. [2015] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
  • S and J [2020] Roopashree S and Anitha J. Medicinal leaf dataset. 2020.
  • Santurkar et al. [2022] Shibani Santurkar, Yann Dubois, Rohan Taori, Percy Liang, and Tatsunori Hashimoto. Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning, 2022.
  • Schuhmann et al. [2021] Cristoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Jenia Jitsev, and Aran Komatsuzaki. LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs. In Proceedings of Neurips Data-Centric AI Workshop, 2021.
  • Serret et al. [2019] Hortense Serret, Nicolas Deguines, Yikweon Jang, Gregoire Lois, and Romain Julliard. Data quality and participant engagement in citizen science: comparing two approaches for monitoring pollinators in france and south korea. Citizen Science: Theory and Practice, 4(1):22, 2019.
  • Simpson et al. [2014] Robert Simpson, Kevin R. Page, and David De Roure. Zooniverse: observing the world’s largest citizen science platform. In Proceedings of the 23rd International Conference on World Wide Web, pages 1049–1054, Seoul, Korea, 2014. Association for Computing Machinery. Type: 10.1145/2567948.2579215.
  • Singh et al. [2020] Davinder Singh, Naman Jain, Pranjali Jain, Pratik Kayal, Sudhakar Kumawat, and Nipun Batra. Plantdoc: A dataset for visual plant disease detection. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD, pages 249–253, New York, NY, USA, 2020. Association for Computing Machinery.
  • Steenweg et al. [2017] Robin Steenweg, Mark Hebblewhite, Roland Kays, Jorge Ahumada, Jason T Fisher, Cole Burton, Susan E Townsend, Chris Carbone, J Marcus Rowcliffe, Jesse Whittington, Jedediah Brodie, J Andrew Royle, Adam Switalski, Anthony P Clevenger, Nicole Heim, and Lindsey N Rich. Scaling-up camera traps: monitoring the planet’s biodiversity with networks of remote sensors. Frontiers in Ecology and the Environment, 15(1):26–34, 2017. ISBN: 1540-9295 Publisher: John Wiley & Sons, Ltd Type: https://doi.org/10.1002/fee.1448.
  • Sullivan et al. [2014a] Brian L. Sullivan, Jocelyn L. Aycrigg, Jessie H. Barry, Rick E. Bonney, Nicholas Bruns, Caren B. Cooper, Theo Damoulas, André A. Dhondt, Tom Dietterich, Andrew Farnsworth, Daniel Fink, John W. Fitzpatrick, Thomas Fredericks, Jeff Gerbracht, Carla Gomes, Wesley M. Hochachka, Marshall J. Iliff, Carl Lagoze, Frank A. La Sorte, Matthew Merrifield, Will Morris, Tina B. Phillips, Mark Reynolds, Amanda D. Rodewald, Kenneth V. Rosenberg, Nancy M. Trautmann, Andrea Wiggins, David W. Winkler, Weng-Keen Wong, Christopher L. Wood, Jun Yu, and Steve Kelling. The eBird enterprise: An integrated approach to development and application of citizen science. Biological Conservation, 169:31–40, 2014a.
  • Sullivan et al. [2014b] Brian L Sullivan, Jocelyn L Aycrigg, Jessie H Barry, Rick E Bonney, Nicholas Bruns, Caren B Cooper, Theo Damoulas, André A Dhondt, Tom Dietterich, Andrew Farnsworth, et al. The ebird enterprise: An integrated approach to development and application of citizen science. Biological conservation, 169:31–40, 2014b.
  • Swanson et al. [2015] Alexandra Swanson, Margaret Kosmala, Chris Lintott, Robert Simpson, Arfon Smith, and Craig Packer. Snapshot Serengeti, high-frequency annotated camera trap images of 40 mammalian species in an African savanna. Scientific Data, 2(150026):1–14, 2015.
  • Teng et al. [2023] Mélisande Teng, Amna Elmustafa, Benjamin Akera, Hugo Larochelle, and David Rolnick. Bird Distribution Modelling using Remote Sensing and Citizen Science data, 2023. arXiv:2305.01079 [cs].
  • Tuia et al. [2022] Devis Tuia, Benjamin Kellenberger, Sara Beery, Blair R Costelloe, Silvia Zuffi, Benjamin Risse, Alexander Mathis, Mackenzie W Mathis, Frank van Langevelde, Tilo Burghardt, et al. Perspectives in machine learning for wildlife conservation. Nature communications, 13(1):792, 2022.
  • Ullah et al. [2022] Ihsan Ullah, Dustin Carrion, Sergio Escalera, Isabelle M Guyon, Mike Huisman, Felix Mohr, Jan N van Rijn, Haozhe Sun, Joaquin Vanschoren, and Phan Anh Vu. Meta-album: Multi-domain meta-dataset for few-shot image classification. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2022.
  • Van der Maaten and Hinton [2008] Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
  • Van Horn and Mac Aodha [2021] Grant Van Horn and Oisin Mac Aodha. inat challenge 2021 - fgvc8, 2021.
  • Van Horn et al. [2015] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 595–604, 2015. ISSN: 1063-6919.
  • Van Horn et al. [2018a] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018a.
  • Van Horn et al. [2018b] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The iNaturalist Species Classification and Detection Dataset, 2018b. arXiv:1707.06642 [cs].
  • Wah et al. [2011] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  • Wang et al. [2019] Yan Wang, Wei-Lun Chao, Kilian Q. Weinberger, and Laurens van der Maaten. Simpleshot: Revisiting nearest-neighbor classification for few-shot learning. CoRR, 2019.
  • Wu et al. [2019] Xiaoping Wu, Chi Zhan, Yukun Lai, Ming-Ming Cheng, and Jufeng Yang. Ip102: A large-scale benchmark dataset for insect pest recognition. In IEEE CVPR, pages 8787–8796, 2019.
  • Xiao et al. [2021] Tete Xiao, Xiaolong Wang, Alexei A Efros, and Trevor Darrell. What should not be contrastive in contrastive learning. In International Conference on Learning Representations, 2021.
  • Xu et al. [2023a] Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. CiT: Curation in Training for Effective Vision-Language Data, 2023a.
  • Xu et al. [2023b] Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data, 2023b.
  • Yuan et al. [2021] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
  • Zhang et al. [2022] Shu Zhang, Ran Xu, Caiming Xiong, and Chetan Ramaiah. Use all the labels: A hierarchical multi-label contrastive learning framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16660–16669, 2022.

Appendices

Many details are omitted in the main text because of space concerns; we present relevant details here.

  1. Appendix A: Reproducibility statement
  2. Appendix B: Ethics statement
  3. Appendix C: Details of training data aggregation
  4. Appendix D: Training details and hyperparameters
  5. Appendix E: Standard deviations for few-shot results
  6. Appendix F: Example zero-shot predictions on our evaluation tasks
  7. Appendix G: Additional text-type results
  8. Appendix H: Generalized zero-shot learning

Appendix A Reproducibility Statement

We will ensure reproducibility of our results by releasing our datasets (TreeOfLife-10M and Rare Species), data pre-processing code, training code, evaluation code, code to generate all figures (Figs. 2 and 3), and pre-trained model weights. With these resources, anyone with sufficient compute resources can download the original data, then reproduce the pre-processing, training, and evaluation. For those with limited compute, the pre-trained model weights will enable full reproducibility of our evaluation results (Sec. 4).

Appendix B Ethics Statement

We are not aware of any major ethical issues that arise from our work. \modelname is further pre-trained from the original CLIP model, so many of the same concerns (class design, surveillance, etc.) apply. Since Radford et al. [66] discuss these concerns in great detail, we focus here on how they relate to the biological scope added by \modelname.

How could \modelname affect endangered species? Does \modelname or TreeOfLife-10M pose a threat by aiding poachers? Although \modelname improves automatic species classification, it does not include specific geographic information such as GPS coordinates. Furthermore, animal conservation status is not included during training.

Could \modelname have a negative impact on biologists? \modelname is designed to combine visual cues with an established taxonomic hierarchy to aid scientific discovery. Concerns about over-reliance on model predictions accompany many, if not all, contemporary models and are not unique to \modelname. The goal is for \modelname to aid biologists in their work, not to replace them. As such, it is important for users to retain that understanding and context when applying \modelname to downstream tasks.

Appendix C Training Data Aggregation

We aggregate images and labels from the iNat21 training data, Bioscan-1M, and data downloaded from EOL. While every image has at least one taxonomic rank labeled, full taxonomic hierarchies and common names are scraped on a best-effort basis from metadata providers, including iNaturalist (iNaturalist Taxonomy DarwinCore Archive), Encyclopedia of Life (eol.org), and the Integrated Taxonomic Information System (ITIS, itis.gov).

We create one lookup from scientific name to taxonomic hierarchy and another from scientific name to common name. We populate these lookups from the following sources in descending order of priority; earlier sources are considered more authoritative, so if a duplicate appears in a later source, it is superseded by the higher-priority entry. First, we use the Bioscan-1M metadata. Second, we use the EOL aggregate datasets: information retrieved with EOL page IDs via the pages API, which checks for a match in the ITIS hierarchy to standardize higher-level taxa (setting aside homonyms for proper linkage). Third, we maintain the full list of taxa and vernacular names provided by iNaturalist, along with the iNat21 training set class names. Finally, any taxa that could not be resolved with these sources were fed through the Global Names Resolver (GNR) API. Overall, we achieve full taxonomic labels for 84% of the images in TreeOfLife-10M; for context, 10% of TreeOfLife-10M (the Bioscan-1M portion) is labeled only down to the family rank, so genus and species information is not available for those images.
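The prioritized merge of metadata sources can be sketched as a dictionary update in reverse priority order, so higher-priority sources overwrite duplicates. The snippet below is a minimal illustration; the source names and record fields are assumptions for exposition, not the released pre-processing code.

# Minimal sketch of the prioritized lookup merge described above.
# Source names and record formats are illustrative assumptions.
def build_lookup(sources):
    # sources: list of dicts mapping scientific name -> record,
    # ordered from highest to lowest priority.
    lookup = {}
    # Iterate from lowest to highest priority so that higher-priority
    # sources overwrite duplicate entries from lower-priority ones.
    for source in reversed(sources):
        lookup.update(source)
    return lookup

bioscan = {"Aedes aegypti": {"family": "Culicidae"}}
eol = {"Aedes aegypti": {"family": "Culicidae", "common": "yellow fever mosquito"},
       "Onoclea sensibilis": {"family": "Onocleaceae", "common": "sensitive fern"}}
inat = {"Onoclea sensibilis": {"family": "Onocleaceae"}}

# Bioscan-1M metadata takes priority over EOL, which takes priority over iNaturalist.
taxon_lookup = build_lookup([bioscan, eol, inat])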

Appendix D Hyperparameters & Training Details

Tabs. D1 and D2 contain our training hyperparameters for the different models. Tab. D2 notes the different epochs at which we had the lowest validation loss, as evaluated using the CLIP objective on the validation split of TreeOfLife-10M (even for the TreeOfLife-1M models). We will release our training code upon acceptance.

Hyperparameter | Value
Architecture | ViT-B/16
Max learning rate | 1×10^-4
Warm-up steps | 1,000
Weight decay | 0.2
Input resolution | 224×224
Table D1: Common hyperparameters among all models we train.
Dataset | Text Type | Batch Size | Epoch
TreeOfLife-10M | Mixture | 32K | 100
iNat21 Only | Mixture | 16K | 65
TreeOfLife-1M | Common | 16K | 86
TreeOfLife-1M | Scientific | 16K | 87
TreeOfLife-1M | Taxonomy | 16K | 87
TreeOfLife-1M | Sci+Com | 16K | 87
TreeOfLife-1M | Tax+Com | 16K | 86
TreeOfLife-1M | Mixture | 16K | 91
Table D2: Hyperparameters that differ between the various models we train. We use a smaller batch size and only 1M examples for our text-type ablation because of limited compute.
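For concreteness, the sketch below shows one way the hyperparameters in Tab. D1 could be wired into a PyTorch optimizer with a warm-up schedule. The choice of AdamW and cosine decay after warm-up is an assumption made for illustration (it mirrors common CLIP-style training), not a statement of our exact training setup.

import math
import torch

def make_optimizer_and_scheduler(model, total_steps, max_lr=1e-4,
                                 warmup_steps=1000, weight_decay=0.2):
    # Hyperparameter values match Tab. D1; AdamW + cosine decay is an
    # illustrative assumption.
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                                  weight_decay=weight_decay)

    def lr_lambda(step):
        # Linear warm-up to max_lr, then cosine decay toward zero.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler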

We trained a hierarchical classification model in Sec. 4.4. Python pseudocode for the training objective is in Fig. D1. We will publicly release full training code upon acceptance.

import torch.nn.functional as F

def forward(vit, heads, images, h_labels):
    """
    vit: vision transformer backbone.
    heads: linear layers, one for each taxonomic rank.
    images: batch of input images.
    h_labels: hierarchical labels; each image has 7 labels,
        one per taxonomic rank.
    """
    img_feats = vit(images)                          # dense image features
    h_logits = [head(img_feats) for head in heads]   # one set of logits per rank
    losses = [F.cross_entropy(logits, labels)
              for logits, labels in zip(h_logits, h_labels)]
    return sum(losses)
Figure D1: Python code to calculate the hierarchical multitask objective. Each image has 7 class labels: one for each taxonomic rank. The ViT calculates dense features for each image, then each taxonomic rank has its own linear layer that produces logits. By summing the losses, the ViT learns to produce image features that are useful for classifying images at multiple taxonomic ranks.
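As a usage sketch, the heads in Fig. D1 can be built as one linear classifier per taxonomic rank. The feature dimension and per-rank class counts below are illustrative placeholders, not the values from our experiments.

import torch.nn as nn

# Illustrative only: feature size and class counts are placeholders.
feat_dim = 768
ranks = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]
num_classes = [4, 40, 200, 1000, 5000, 30000, 100000]
heads = nn.ModuleList([nn.Linear(feat_dim, n) for n in num_classes])
# loss = forward(vit, heads, images, h_labels)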

Appendix E Standard Deviation of Main Results

Tabs. E3 and E4 show the accuracy with standard deviation over five runs on the test sets presented in Tab. 2. Since we randomly select the training examples from the datasets for few-shot, accuracies vary based on which examples are train examples and which are test examples. However, the variation is small enough that our conclusions in Sec. 4.5 still hold. Zero-shot results are deterministic and have no variation.
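To make the repeated-sampling protocol concrete, the sketch below estimates the mean and standard deviation of few-shot accuracy over random splits using nearest-centroid classification on pre-computed embeddings. The centroid-based classifier and the variable names are assumptions for illustration, not necessarily the exact protocol behind Tabs. E3 and E4.

import numpy as np

def few_shot_accuracy(feats, labels, k_shot, seed):
    # feats: (N, D) image embeddings; labels: (N,) integer class ids.
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    # Sample k_shot training examples per class; the rest form the test set.
    train_idx = np.concatenate([
        rng.choice(np.flatnonzero(labels == c), size=k_shot, replace=False)
        for c in classes])
    test_mask = np.ones(len(labels), dtype=bool)
    test_mask[train_idx] = False
    # One centroid per class from the sampled training examples.
    centroids = np.stack([
        feats[train_idx][labels[train_idx] == c].mean(axis=0) for c in classes])
    preds = classes[np.argmax(feats[test_mask] @ centroids.T, axis=1)]
    return (preds == labels[test_mask]).mean()

# Mean and standard deviation over five random splits, as reported below:
# accs = [few_shot_accuracy(feats, labels, k_shot=1, seed=s) for s in range(5)]
# print(np.mean(accs), np.std(accs))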

Model | Birds 525 | Plankton | Insects | Insects 2 | Rare Species
One-Shot Classification
CLIP | 43.7±0.26 | 25.1±0.71 | 21.6±1.05 | 13.7±1.09 | 28.4±0.92
OpenCLIP | 53.7±0.52 | 32.3±0.63* | 23.2±1.58 | 14.3±0.67 | 29.3±0.68
\modelname | 71.8±0.47 | 30.6±0.77 | 57.4±2.4* | 20.4±1.28* | 45.3±1.16*
– iNat21 Only | 74.8±0.89* | 29.6±0.82 | 53.9±0.97 | 19.7±0.80 | 37.6±0.63
Five-Shot Classification
CLIP | 73.5±0.37 | 41.2±1.01 | 39.9±0.86 | 24.6±0.90 | 46.9±0.21
OpenCLIP | 81.9±0.25 | 52.5±0.83* | 42.6±0.82 | 25.0±0.83 | 48.4±0.62
\modelname | 90.0±0.12 | 49.3±1.14 | 77.8±0.81* | 33.6±0.74* | 66.4±0.16*
– iNat21 Only | 90.1±0.08* | 48.2±1.24 | 73.7±0.65 | 32.1±1.97 | 56.0±0.36
Table E3: Accuracy with standard deviation of five runs on animals and rare species in Tab. 4. An asterisk (*) marks the best accuracy per column within each setting.
Model | PlantNet | Fungi | PlantVillage | Med. Leaf | PlantDoc
One-Shot Classification
CLIP | 42.1±3.40 | 17.2±0.78 | 49.7±2.53 | 70.1±2.83 | 24.8±1.61
OpenCLIP | 45.1±3.40 | 18.4±1.26 | 53.6±0.79 | 71.2±3.58 | 26.8±1.45
\modelname | 64.5±2.15 | 40.3±3.00* | 58.8±2.83* | 84.3±1.90* | 30.7±1.75*
– iNat21 Only | 67.4±4.54* | 35.5±2.93 | 55.2±1.58 | 75.1±1.16 | 27.8±1.31
Five-Shot Classification
CLIP | 65.2±1.25 | 27.9±2.54 | 71.8±1.46 | 89.7±1.45 | 35.2±1.59
OpenCLIP | 68.0±0.86 | 30.6±1.26 | 77.8±1.28 | 91.3±0.85 | 42.0±1.32
\modelname | 85.6±1.79* | 62.3±1.82* | 80.9±1.04* | 95.9±1.07* | 47.5±1.35*
– iNat21 Only | 84.7±1.24 | 55.6±2.61 | 77.2±0.68 | 93.5±1.13 | 41.0±1.75
Table E4: Accuracy with standard deviation of five runs on plants and fungi in Tab. 4. An asterisk (*) marks the best accuracy per column within each setting.

Appendix F Example Predictions

Figure F2: Example predictions for \modelname and CLIP on Birds 525, Plankton, Insects, Insects2, PlantNet and Fungi tasks. Ground truth labels are green; incorrect predictions are red. Left: Correct \modelname predictions. Center, Right: Images that CLIP incorrectly labels, but \modelname correctly labels.
Figure F3: Example predictions for \modelname and CLIP on PlantVillage, Medicinal Leaf, PlantDoc and Rare Species. Ground truth labels are green; incorrect predictions are red. Left: Correct \modelname predictions. Center, Right: Images that CLIP incorrectly labels, but \modelname correctly labels.

Figs. F2 and F3 show \modelname and CLIP zero-shot predictions on all ten evaluation tasks. We randomly pick examples from each dataset that \modelname correctly labels, as well as examples that CLIP incorrectly labels but \modelname correctly labels. \modelname performs well on a variety of tasks, including out-of-distribution images (Plankton, Medicinal Leaf) and mixes of scientific and common names (PlantVillage, PlantDoc).
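For reference, these zero-shot predictions follow the standard CLIP recipe: each class label string (e.g., the taxonomic plus common name) is embedded by the text encoder, and an image is assigned to the label with the highest cosine similarity. A minimal sketch, assuming pre-computed image and label embeddings:

import torch.nn.functional as F

def zero_shot_predict(image_feats, text_feats):
    # image_feats: (N, D); text_feats: (C, D), one embedding per class label.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return (image_feats @ text_feats.T).argmax(dim=-1)  # (N,) predicted classes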

Appendix G More Results of Text-Type

We investigated the effects of text type during training and testing in Sec. 4.3 using the Rare Species task. Here we present zero-shot results for all text types on all tasks, using the same procedure as in Sec. 4.2: we use the taxonomic+common text type when it is available, and otherwise fall back to whichever text type is available.
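The text types compared in Tab. G5 can be assembled from a taxon record as sketched below. The record fields and exact caption templates are illustrative assumptions, not the precise formats used in training.

def make_captions(taxon):
    # taxon: dict with the seven taxonomic ranks and a common name.
    # Field names and caption formats are illustrative assumptions.
    sci = f"{taxon['genus']} {taxon['species']}"
    tax = " ".join(taxon[r] for r in
                   ("kingdom", "phylum", "class", "order",
                    "family", "genus", "species"))
    com = taxon["common_name"]
    return {
        "Common": com,
        "Scientific": sci,
        "Taxonomic": tax,
        "Sci+Com": f"{sci} with common name {com}",
        "Tax+Com": f"{tax} with common name {com}",
        # The "Mixture" text type draws on all of the above (see Sec. 4.3).
    }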

Training Text Type | Birds 525 | Plankton | Insects | Insects 2 | PlantNet | Fungi | PlantVillage | Med. Leaf | PlantDoc | Rare Species | Mean (Δ)
Random Guessing | 0.2 | 1.2 | 1.0 | 1.0 | 4.0 | 4.0 | 2.6 | 4.0 | 3.7 | 0.3 | 2.2
Common | 42.4 | 4.4* | 15.8 | 13.3 | 45.2 | 20.7 | 10.7 | 15.4 | 19.6 | 23.1 | 21.0 (-12.0)
Scientific | 11.3 | 3.8 | 18.7 | 11.0 | 84.8 | 35.3* | 12.5 | 20.3 | 13.9 | 7.7 | 21.9 (-11.1)
Taxonomic | 46.6 | 2.2 | 25.1 | 8.7 | 70.4 | 29.0 | 8.8 | 18.4 | 12.8 | 24.1 | 24.6 (-8.4)
Sci+Com | 55.0 | 2.2 | 19.2 | 12.6 | 71.5 | 24.8 | 17.6 | 21.5* | 20.0 | 25.3 | 26.9 (-6.1)
Tax+Com | 61.9 | 2.0 | 27.4 | 11.6 | 68.4 | 19.2 | 10.4 | 19.5 | 15.8 | 30.0 | 26.6 (-6.4)
Mixture | 66.6* | 3.5 | 30.6* | 17.3* | 86.3* | 32.8 | 19.9* | 18.7 | 24.5* | 30.5* | 33.0*
Table G5: Zero-shot classification top-1 accuracy for different text types used during training. Columns Birds 525 through Insects 2 are animal tasks; PlantNet through PlantDoc are plant and fungus tasks. An asterisk (*) marks the best accuracy per column. All models use the same architecture (ViT-B/16 vision encoder, 77-token text encoder) and are trained on the same dataset (TreeOfLife-1M). Δ denotes the difference in mean accuracy from "Mixture", which is the text type we used for \modelname.

Appendix H Generalized Zero-Shot Learning

Chao et al. [16] introduced generalized zero-shot learning, where a model must label images of unseen classes from a set of both seen and unseen labels. We pick out a set of 400 seen species from TreeOfLife-10M using the same methodology as we used for the Rare Species task. We classify the same images from the Rare Species task using this set of 800 labels (a mix of seen and unseen). CLIP and OpenCLIP achieve 24.1% and 21.7% top-1 accuracy, while \modelname achieves 26.4% top-1 accuracy in this challenging GZSL setting.
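A minimal sketch of this generalized zero-shot evaluation, assuming pre-computed embeddings: the unseen-class (Rare Species) images are scored against the union of seen and unseen label embeddings, so predicting any seen label counts as an error. Shapes and inputs are illustrative; this is not our exact evaluation code.

import torch
import torch.nn.functional as F

def gzsl_top1(image_feats, unseen_text_feats, seen_text_feats, labels):
    # image_feats: (N, D) embeddings of unseen-class images.
    # labels: (N,) indices into the unseen label set.
    all_text = torch.cat([unseen_text_feats, seen_text_feats], dim=0)
    sims = F.normalize(image_feats, dim=-1) @ F.normalize(all_text, dim=-1).T
    preds = sims.argmax(dim=-1)
    return (preds == labels).float().mean().item()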