
Grounding DINO: Marrying DINO with Grounded Pre-Training for
Open-Set Object Detection

Shilong Liu1,2*,  Zhaoyang Zeng2, Tianhe Ren2, Feng Li2,3, Hao Zhang2,3, Jie Yang2,4,

Chunyuan Li5, Jianwei Yang5, Hang Su1, Jun Zhu1†, Lei Zhang2.

1 Dept. of Comp. Sci. and Tech., BNRist Center, State Key Lab for Intell. Tech. & Sys.,

Institute for AI, Tsinghua-Bosch Joint Center for ML, Tsinghua University
2 International Digital Economy Academy (IDEA)

3 The Hong Kong University of Science and Technology

4 The Chinese University of Hong Kong (Shenzhen)  5 Microsoft Research, Redmond

liusl20@mails.tsinghua.edu.cn  {zengzhaoyang, rentianhe}@idea.edu.cn  {fliay, hzhangcx}@connect.ust.hk  jieyang5@link.cuhk.edu.cn


{chunyl, jianwei.yang}@microsoft.com  {suhangss, dcszj}@mail.tsinghua.edu.cn  leizhang@idea.edu.cn

†Corresponding author.
Abstract

In this paper, we present an open-set object detector, called Grounding DINO, by marrying Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder for cross-modality fusion. While previous works mainly evaluate open-set object detection on novel categories, we propose to also perform evaluations on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves a 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. After fine-tuning with COCO data, Grounding DINO reaches 63.0 AP. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.

Figure 1: (a) Closed-set object detection requires models to detect objects of pre-defined categories. (b) Previous works zero-shot transfer models to novel categories for model generalization. We propose to add referring expression comprehension (REC) as another evaluation for model generalization on novel objects specified with attributes. (c) We present an image editing application by combining Grounding DINO and Stable Diffusion [42]. Best viewed in color.
*This work was done when Shilong Liu, Feng Li, Hao Zhang, and Jie Yang were interns at IDEA. †Corresponding authors.

1 Introduction

Understanding novel concepts is a fundamental capability of visual intelligence. In this work, we aim to develop a strong system to detect arbitrary objects specified by human language inputs, which we name open-set object detection (we view the terms open-set object detection, open-world object detection, and open-vocabulary object detection as the same task in this paper; to avoid confusion, we always use open-set object detection). The task has wide applications for its great potential as a generic object detector. For example, we can combine it with generative models for image editing (as shown in Fig. 1 (c)).

The key to open-set detection is introducing language for unseen object generalization [26, 1, 7]. For example, GLIP [26] reformulates object detection as a phrase grounding task and introduces contrastive training between object regions and language phrases. It shows great flexibility on heterogeneous datasets and remarkable performance on both closed-set and open-set detection. Despite its impressive results, GLIP's performance can be constrained since it is designed based on a traditional one-stage detector, Dynamic Head [5]. As open-set and closed-set detection are closely related, we believe a stronger closed-set object detector can result in an even better open-set detector.

Motivated by the encouraging progress of Transformer-based detectors [58, 31, 24, 25], in this work, we propose to build a strong open-set detector based on DINO [58], which not only offers state-of-the-art object detection performance, but also allows us to integrate multi-level text information into its algorithm by grounded pre-training. We name the model Grounding DINO. Grounding DINO has several advantages over GLIP. First, its Transformer-based architecture is similar to language models, making it easier to process both image and language data. For example, as all the image and language branches are built with Transformers, we can easily fuse cross-modality features in its whole pipeline. Second, Transformer-based detectors have demonstrated a superior capability of leveraging large-scale datasets. Lastly, as a DETR-like model, DINO can be optimized end-to-end without using any hand-crafted modules such as NMS (Non-Maximum Suppression), which greatly simplifies the overall grounding model design.

Figure 2: Existing approaches to extending closed-set detectors to open-set scenarios. Note that some closed-set detectors contain only a subset of the phases shown in the figure.

Most existing open-set detectors are developed by extending closed-set detectors to open-set scenarios with language information. As shown in Fig. 2, a closed-set detector typically has three important modules, a backbone for feature extraction, a neck for feature enhancement, and a head for region refinement (or box prediction). A closed-set detector can be generalized to detect novel objects by learning language-aware region embeddings so that each region can be classified into novel categories in a language-aware semantic space. The key to achieving this goal is using contrastive loss between region outputs and language features at the neck and/or head outputs. To help a model align cross-modality information, some work tried to fuse features before the final loss stage. Fig. 2 shows that feature fusion can be performed in three phases: neck (phase A), query initialization (phase B), and head (phase C). For example, GLIP [26] performs early fusion in the neck module (phase A), and OV-DETR [56] uses language-aware queries as head inputs (phase B).

We argue that more feature fusion in the pipeline enables the model to perform better. It is worth noting that retrieval tasks prefer a CLIP-like two-tower architecture, which only performs multi-modality feature comparison at the end for efficiency. However, for open-set detection, the model is normally given both an image and a text input that specifies the target object categories or a specific object. In such a case, a tight (and early) fusion model is preferred for better performance [1, 26], as both image and text are available at the beginning. Although conceptually simple, it is hard for previous work to perform feature fusion in all three phases. The design of classical detectors like Faster RCNN makes it hard to interact with language information in most blocks. Unlike classical detectors, the Transformer-based detector DINO has a structure consistent with language blocks. The layer-by-layer design enables it to interact with language information easily. Under this principle, we design three feature fusion approaches in the neck, query initialization, and head phases. More specifically, we design a feature enhancer by stacking self-attention, text-to-image cross-attention, and image-to-text cross-attention as the neck module. We then develop a language-guided query selection method to initialize queries for the head. We also design a cross-modality decoder for the head phase with image and text cross-attention layers to boost query representations. The three fusion phases effectively help the model achieve better performance on existing benchmarks, as will be shown in Sec. 4.4.

Although significant improvements have been achieved in multi-modal learning, most existing open-set detection work evaluates models on objects of novel categories, as shown in the left column of Fig. 1 (b). We argue that another important scenario, where objects are described with attributes, should also be considered. In the literature, the task is named Referring Expression Comprehension (REC) [34, 30] (we use the terms Referring Expression Comprehension (REC) and Referring (Object) Detection interchangeably in this paper). We present some examples of REC in the right column of Fig. 1 (b). It is a closely related field but tends to be overlooked in previous open-set detection work. In this work, we extend open-set detection to support REC and also evaluate its performance on REC datasets.

We conduct experiments on all three settings, including closed-set detection, open-set detection, and referring object detection, to comprehensively evaluate open-set detection performance. Grounding DINO outperforms competitors by a large margin. For example, Grounding DINO reaches a 52.5 AP on COCO minival without any COCO training data. It also establishes a new state of the art on the ODinW [23] zero-shot benchmark with a 26.1 mean AP.

The contributions of this paper are summarized as follows:

  1. We propose Grounding DINO, which extends a closed-set detector DINO by performing vision-language modality fusion at multiple phases, including a feature enhancer, a language-guided query selection module, and a cross-modality decoder. Such a deep fusion strategy effectively improves open-set object detection.

  2. We propose to extend the evaluation of open-set object detection to REC datasets. It helps evaluate the performance of the model with free-form text inputs.

  3. The experiments on COCO, LVIS, ODinW, and RefCOCO/+/g datasets demonstrate the effectiveness of Grounding DINO on open-set object detection tasks.
| Model | Base Detector | Fusion Phases (Fig. 2) | Represent. Level (Sec. 3.4) | Closed-Set (COCO) / Zero-Shot (COCO, LVIS, ODinW) / Referring (RefCOCO/+/g) |
| ViLD [13] | Mask R-CNN [15] | - | sentence | partial label; partial label |
| RegionCLIP [62] | Faster RCNN [39] | - | sentence | partial label; partial label |
| FindIt [21] | Faster RCNN [39] | A | sentence | partial label; fine-tune |
| MDETR [18] | DETR [2] | A,C | word | fine-tune; zero-shot; fine-tune |
| DQ-DETR [46] | DETR [2] | A,C | word | zero-shot; fine-tune |
| GLIP [26] | DyHead [5] | A | word | zero-shot; zero-shot; zero-shot |
| GLIPv2 [59] | DyHead [5] | A | word | zero-shot; zero-shot; zero-shot |
| OV-DETR [56] | Deformable DETR [64] | B | sentence | partial label; partial label |
| OWL-ViT [35] | - | - | sentence | partial label; partial label; zero-shot |
| DetCLIP [53] | ATSS [60] | - | sentence | zero-shot; zero-shot |
| OmDet [61] | Sparse R-CNN [47] | C | sentence | zero-shot |
| Grounding DINO (Ours) | DINO [58] | A,B,C | sub-sentence | zero-shot; zero-shot; zero-shot; zero-shot |
Table 1: A comparison of previous open-set object detectors. Our summarization is based on the experiments in their papers, not on the ability to extend their models to other tasks. It is worth noting that some related works may not (only) have been designed for open-set object detection initially, like MDETR [18] and GLIPv2 [59], but we list them here for a comprehensive comparison with existing work. We use the term “partial label” for settings where models are trained on partial data (e.g., base categories) and evaluated on other cases.

2 Related Work

Detection Transformers. Grounding DINO is built upon the DETR-like model DINO [58], which is an end-to-end Transformer-based detector. DETR was first proposed in [2] and has since been improved in many directions [64, 33, 12, 5, 50, 17, 4] over the past few years. DAB-DETR [31] introduces anchor boxes as DETR queries for more accurate box prediction. DN-DETR [24] proposes a query denoising approach to stabilize bipartite matching. DINO [58] further develops several techniques, including contrastive de-noising, and sets a new record on the COCO object detection benchmark. However, such detectors mainly focus on closed-set detection and are difficult to generalize to novel classes because of the limited pre-defined categories.

Open-Set Object Detection. Open-set object detection is trained using existing bounding box annotations and aims at detecting arbitrary classes with the help of language generalization. OV-DETR [56] uses image and text embeddings encoded by a CLIP model as queries to decode the category-specified boxes in the DETR framework [2]. ViLD [13] distills knowledge from a CLIP teacher model into an R-CNN-like detector so that the learned region embeddings contain the semantics of language. GLIP [26] formulates object detection as a grounding problem and leverages additional grounding data to help learn aligned semantics at the phrase and region levels. It shows that such a formulation can even achieve stronger performance on fully-supervised detection benchmarks. DetCLIP [53] leverages large-scale image captioning datasets and uses the generated pseudo labels to expand the knowledge database. The generated pseudo labels effectively help extend the generalization ability of the detectors.

However, previous works only fuse multi-modal information in partial phases, which may lead to sub-optimal language generalization ability. For example, GLIP only considers fusion in the feature enhancement phase (phase A) and OV-DETR only injects language information at the decoder inputs (phase B). Moreover, the REC task, an important scenario for open-set detection, is normally overlooked in evaluation. We compare our model with other open-set methods in Table 1.

Figure 3: The framework of Grounding DINO. We present the overall framework, a feature enhancer layer, and a decoder layer in block 1, block 2, and block 3, respectively.

3 Grounding DINO

Grounding DINO outputs multiple pairs of object boxes and noun phrases for a given (Image, Text) pair. For example, as shown in Fig. 3, the model locates a cat and a table from the input image and extracts the words cat and table from the input text as corresponding labels. Both the object detection and REC tasks can be aligned with this pipeline. Following GLIP [26], we concatenate all category names as input texts for object detection tasks. REC requires a bounding box for each text input. We use the output object with the largest score as the output for the REC task.

Grounding DINO is a dual-encoder-single-decoder architecture. It contains an image backbone for image feature extraction, a text backbone for text feature extraction, a feature enhancer for image and text feature fusion (Sec. 3.1), a language-guided query selection module for query initialization (Sec. 3.2), and a cross-modality decoder for box refinement (Sec. 3.3). The overall framework is available in Fig. 3.

For each (Image, Text) pair, we first extract vanilla image features and vanilla text features using an image backbone and a text backbone, respectively. The two vanilla features are fed into a feature enhancer module for cross-modality feature fusion. After obtaining cross-modality text and image features, we use a language-guided query selection module to select cross-modality queries from image features. Like the object queries in most DETR-like models, these cross-modality queries will be fed into a cross-modality decoder to probe desired features from the two modal features and update themselves. The output queries of the last decoder layer will be used to predict object boxes and extract corresponding phrases.
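To make the data flow concrete, a minimal PyTorch-style sketch of this forward pass is given below. The function and module names (image_backbone, text_backbone, feature_enhancer, query_selection, cross_modality_decoder, bbox_head) are hypothetical placeholders for the components described in Secs. 3.1-3.3, not the released implementation.

import torch

def grounding_dino_forward(image, text_tokens,
                           image_backbone, text_backbone,
                           feature_enhancer, query_selection,
                           cross_modality_decoder, bbox_head):
    # 1. Vanilla (uni-modal) features.
    img_feat = image_backbone(image)        # (bs, num_img_tokens, ndim)
    txt_feat = text_backbone(text_tokens)   # (bs, num_text_tokens, ndim)
    # 2. Cross-modality feature fusion in the feature enhancer (Sec. 3.1).
    img_feat, txt_feat = feature_enhancer(img_feat, txt_feat)
    # 3. Language-guided query selection initializes decoder queries (Sec. 3.2).
    queries = query_selection(img_feat, txt_feat)   # (bs, num_query, ndim)
    # 4. The cross-modality decoder refines queries with image and text
    #    cross-attention (Sec. 3.3).
    queries = cross_modality_decoder(queries, img_feat, txt_feat)
    # 5. Boxes are regressed from queries; classification logits are dot
    #    products between queries and text token features (Sec. 3.5).
    boxes = bbox_head(queries)                              # (bs, num_query, 4)
    logits = torch.einsum("bqc,btc->bqt", queries, txt_feat)
    return boxes, logits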

3.1 Feature Extraction and Enhancer

Given an (Image, Text) pair, we extract multi-scale image features with an image backbone like Swin Transformer [32], and text features with a text backbone like BERT [8]. Following previous DETR-like detectors [64, 58], multi-scale features are extracted from the outputs of different blocks. After extracting vanilla image and text features, we feed them into a feature enhancer for cross-modality feature fusion. The feature enhancer includes multiple feature enhancer layers. We illustrate a feature enhancer layer in Fig. 3 block 2. We leverage deformable self-attention to enhance image features and vanilla self-attention for text features. Inspired by GLIP [26], we add an image-to-text cross-attention and a text-to-image cross-attention for feature fusion. These modules help align features of different modalities.
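A minimal sketch of one feature enhancer layer is given below, with standard multi-head attention standing in for the deformable image self-attention and with normalization and FFN sub-layers omitted for brevity; it illustrates the fusion pattern (per-modality self-attention plus bi-directional cross-attention) rather than the exact released implementation.

import torch.nn as nn

class FeatureEnhancerLayerSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8):
        super().__init__()
        self.img_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_to_txt_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_to_img_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, img_feat, txt_feat):
        # Enhance each modality with self-attention
        # (deformable self-attention for images in the paper).
        img_feat = img_feat + self.img_self_attn(img_feat, img_feat, img_feat)[0]
        txt_feat = txt_feat + self.txt_self_attn(txt_feat, txt_feat, txt_feat)[0]
        # Bi-directional cross-attention: image queries attend to text features,
        # and text queries attend to image features.
        img_fused = img_feat + self.txt_to_img_attn(img_feat, txt_feat, txt_feat)[0]
        txt_fused = txt_feat + self.img_to_txt_attn(txt_feat, img_feat, img_feat)[0]
        return img_fused, txt_fused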

3.2 Language-Guided Query Selection

Grounding DINO aims to detect objects from an image specified by an input text. To effectively leverage the input text to guide object detection, we design a language-guided query selection module to select features that are more relevant to the input text as decoder queries. We present the query selection process in Algorithm 1 in PyTorch style. The variables image_features and text_features are used for image and text features, respectively. num_query is the number of queries in the decoder, which is set to 900 in our implementation. We use bs and ndim for batch size and feature dimension in the pseudo-code. num_img_tokens and num_text_tokens are used for the number of image and text tokens, respectively.

The language-guided query selection module outputs num_query indices. We can extract features based on the selected indices to initialize queries. Following DINO [58], we use mixed query selection to initialize decoder queries. Each decoder query contains two parts: a content part and a positional part [33]. We formulate the positional part as dynamic anchor boxes [31], which are initialized with encoder outputs. The other part, the content queries, is set to be learnable during training.

3.3 Cross-Modality Decoder

We develop a cross-modality decoder to combine image and text modality features, as shown in Fig. 3 block 3. In each cross-modality decoder layer, each cross-modality query is fed into a self-attention layer, an image cross-attention layer to combine image features, a text cross-attention layer to combine text features, and an FFN layer. Compared with the DINO decoder layer, each decoder layer has an extra text cross-attention layer, as we need to inject text information into queries for better modality alignment.
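Following the ordering described above, a simplified sketch of one cross-modality decoder layer could look as follows, again with plain multi-head attention standing in for the deformable image cross-attention and with normalization layers omitted; the names are illustrative.

import torch.nn as nn

class CrossModalityDecoderLayerSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=8, d_ffn=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.img_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.txt_cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ffn), nn.ReLU(), nn.Linear(d_ffn, d_model))

    def forward(self, queries, img_feat, txt_feat):
        # Self-attention among the cross-modality queries.
        queries = queries + self.self_attn(queries, queries, queries)[0]
        # Image cross-attention: queries probe image features.
        queries = queries + self.img_cross_attn(queries, img_feat, img_feat)[0]
        # Text cross-attention (the extra layer compared with DINO):
        # queries probe text features for better modality alignment.
        queries = queries + self.txt_cross_attn(queries, txt_feat, txt_feat)[0]
        # Feed-forward network.
        return queries + self.ffn(queries)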

Algorithm 1 Language-guided query selection.

"""
Input:
    image_features: (bs, num_img_tokens, ndim)
    text_features: (bs, num_text_tokens, ndim)
    num_query: int.
Output:
    topk_proposals_idx: (bs, num_query)
"""
logits = torch.einsum("bic,btc->bit",
                      image_features, text_features)
# bs, num_img_tokens, num_text_tokens
logits_per_img_feat = logits.max(-1)[0]
# bs, num_img_tokens
topk_proposals_idx = torch.topk(
    logits_per_img_feat,
    num_query, dim=1)[1]
# bs, num_query

3.4 Sub-Sentence Level Text Feature

Figure 4: Comparisons of text representations.

Two kinds of text prompts have been explored in previous works, which we name sentence-level representation and word-level representation, as shown in Fig. 4. Sentence-level representation [53, 35] encodes a whole sentence into one feature. If some sentences in phrase grounding data have multiple phrases, it extracts these phrases and discards the other words. In this way, it removes the influence between words but loses fine-grained information in sentences. Word-level representation [11, 18] enables encoding multiple category names with one forward pass but introduces unnecessary dependencies among categories, especially when the input text is a concatenation of multiple category names in an arbitrary order. As shown in Fig. 4 (b), some unrelated words interact during attention. To avoid unwanted word interactions, we introduce attention masks to block attention among unrelated category names, named “sub-sentence” level representation. It eliminates the influence between different category names while keeping per-word features for fine-grained understanding.
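A minimal sketch of how such a sub-sentence attention mask could be built is shown below; the grouping of tokens into category names is illustrative and not tied to a specific tokenizer.

import torch

def build_subsentence_attention_mask(group_ids):
    # group_ids: 1-D tensor assigning each text token to a category name
    # (tokens of the same category name share an id; separator tokens such
    # as "." get their own unique ids).
    # Returns a boolean mask of shape (num_tokens, num_tokens) where True
    # means attention is allowed.
    return group_ids.unsqueeze(0) == group_ids.unsqueeze(1)

# Example: the prompt "cat . traffic light . dog" split into 6 tokens, with
# group ids: cat=0, '.'=1, traffic=2, light=2, '.'=3, dog=4.
group_ids = torch.tensor([0, 1, 2, 2, 3, 4])
mask = build_subsentence_attention_mask(group_ids)
# "traffic" and "light" can attend to each other, while "cat" and "dog" cannot.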

3.5 Loss Function

Following previous DETR-like works [2, 64, 33, 31, 24, 58], we use the L1 loss and the GIOU [41] loss for bounding box regressions. We follow GLIP [26] and use contrastive loss between predicted objects and language tokens for classification. Specifically, we dot product each query with text features to predict logits for each text token and then compute focal loss [28] for each logit. Box regression and classification costs are first used for bipartite matching between predictions and ground truths. We then calculate final losses between ground truths and matched predictions with the same loss components. Following DETR-like models, we add auxiliary loss after each decoder layer and after the encoder outputs.
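A sketch of the classification part of this loss is given below, assuming the bipartite matching has already been computed and omitting the box losses; torchvision's sigmoid_focal_loss is used for the focal loss, and the target alignment map is an illustrative construction.

import torch
from torchvision.ops import sigmoid_focal_loss

def contrastive_classification_loss(queries, text_feat, target_alignment):
    # queries:          (bs, num_query, ndim) decoder output queries
    # text_feat:        (bs, num_text_tokens, ndim) enhanced text features
    # target_alignment: (bs, num_query, num_text_tokens) 0/1 float map marking
    #                   which text tokens each matched query should align to
    #                   (all zeros for unmatched queries).
    logits = torch.einsum("bqc,btc->bqt", queries, text_feat)
    # Focal loss over every (query, text token) logit.
    return sigmoid_focal_loss(logits, target_alignment, reduction="mean")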

Table 2: Zero-shot domain transfer and fine-tuning on COCO. *The results in brackets are trained with 1.5× image sizes, i.e., with a maximum image size of 2000. †The models map a subset of O365 categories to COCO for zero-shot evaluations.
| Model | Backbone | Pre-Training Data | Zero-Shot (2017val) | Fine-Tuning (2017val / test-dev) |
| Faster R-CNN | RN50-FPN | - | - | 40.2 / - |
| Faster R-CNN | RN101-FPN | - | - | 42.0 / - |
| DyHead-T [5] | Swin-T | - | - | 49.7 / - |
| DyHead-L [5] | Swin-L | - | - | 58.4 / 58.7 |
| DyHead-L [5] | Swin-L | O365, ImageNet21K | - | 60.3 / 60.6 |
| SoftTeacher [52] | Swin-L | O365, SS-COCO | - | 60.7 / 61.3 |
| DINO (Swin-L) [58] | Swin-L | O365 | - | 62.5 / - |
| DyHead-T† [5] | Swin-T | O365 | 43.6 | 53.3 / - |
| GLIP-T (B) [26] | Swin-T | O365 | 44.9 | 53.8 / - |
| GLIP-T (C) [26] | Swin-T | O365, GoldG | 46.7 | 55.1 / - |
| GLIP-L [26] | Swin-L | FourODs, GoldG, Cap24M | 49.8 | 60.8 / 61.0 |
| DINO (Swin-T)† [58] | Swin-T | O365 | 46.2 | 56.9 / - |
| Grounding-DINO-T (Ours) | Swin-T | O365 | 46.7 | 56.9 / - |
| Grounding-DINO-T (Ours) | Swin-T | O365, GoldG | 48.1 | 57.1 / - |
| Grounding-DINO-T (Ours) | Swin-T | O365, GoldG, Cap4M | 48.4 | 57.2 / - |
| Grounding-DINO-L (Ours) | Swin-L | O365, OI [19], GoldG | 52.5 | 62.6 / 62.7 (63.0 / 63.0)* |
| Grounding-DINO-L (Ours) | Swin-L | O365, OI, GoldG, Cap4M, COCO, RefC | 60.7 | 62.6 / - |
Table 3: Zero-shot domain transfer to LVIS. *The models are fine-tuned on the LVIS dataset before evaluation.
| Model | Backbone | Pre-Training Data | MiniVal [18] AP | APr / APc / APf |
| MDETR [18]* | RN101 | GoldG, RefC | 24.2 | 20.9 / 24.9 / 24.3 |
| Mask R-CNN [18]* | RN101 | - | 33.3 | 26.3 / 34.0 / 33.9 |
| GLIP-T (C) | Swin-T | O365, GoldG | 24.9 | 17.7 / 19.5 / 31.0 |
| GLIP-T | Swin-T | O365, GoldG, Cap4M | 26.0 | 20.8 / 21.4 / 31.0 |
| Grounding-DINO-T | Swin-T | O365, GoldG | 25.6 | 14.4 / 19.6 / 32.2 |
| Grounding-DINO-T | Swin-T | O365, GoldG, Cap4M | 27.4 | 18.1 / 23.3 / 32.7 |
| Grounding-DINO-L | Swin-L | O365, OI, GoldG, Cap4M, COCO, RefC | 33.9 | 22.2 / 30.7 / 38.8 |

4 Experiments

4.1 Setup

We conduct extensive experiments on three settings: a closed-set setting on the COCO detection benchmark (Sec. C.1), an open-set setting on zero-shot COCO, LVIS, and ODinW (Sec. 4.2), and a referring detection setting on RefCOCO/+/g (Sec. 4.3). Ablations are then conducted to show the effectiveness of our model design (Sec. 4.4). We also explore a way to transfer a well-trained DINO to the open-set scenario by training a few plug-in modules in Sec. 4.5. The test of our model efficiency is presented in Sec. I.

Implementation Details. We trained two model variants: Grounding-DINO-T with Swin-T [32] and Grounding-DINO-L with Swin-L [32] as the image backbone. We leveraged BERT-base [8] from Hugging Face [51] as the text backbone. As we focus more on model performance on novel classes, we list zero-shot transfer and referring detection results in the main text. More implementation details are available in Appendix Sec. A.

4.2 Zero-Shot Transfer of Grounding DINO

In this setting, we pre-train models on large-scale datasets and directly evaluate models on new datasets. We also list some fine-tuned results for a more thorough comparison of our model with prior works.

COCO Benchmark

We compare Grounding DINO with GLIP and DINO in Table 2. We pre-train models on large-scale datasets and directly evaluate our model on the COCO benchmark. As the O365 dataset [44] has (nearly) covered all categories in COCO (it is not an exact mapping between O365 and COCO categories; we made some approximations during evaluation), we evaluate an O365 pre-trained DINO on COCO as a zero-shot baseline. The result shows that DINO performs better on the COCO zero-shot transfer than DyHead. Grounding DINO outperforms all previous models on the zero-shot transfer setting, with +0.5 AP and +1.8 AP compared with DINO and GLIP under the same setting. Grounding data is still helpful for Grounding DINO, introducing more than 1 AP (48.1 vs. 46.7) on the zero-shot transfer setting. With stronger backbones and larger data, Grounding DINO sets a new record of 52.5 AP on the COCO object detection benchmark without seeing any COCO images during training. Grounding DINO obtains a 62.6 AP on COCO minival, outperforming DINO's 62.5 AP. When enlarging the input images by 1.5×, the benefits reduce. We suspect that the text branch enlarges the gap between models with different input images. Even so, Grounding DINO gets an impressive 63.0 AP on COCO test-dev with fine-tuning on the COCO dataset (see the numbers in brackets in Table 2).

LVIS Benchmark

LVIS [14] is a dataset for long-tail objects. It contains more than 1000 categories for evaluation. We use LVIS as a downstream task to test the zero-shot abilities of our model. We use GLIP as the baseline for our models. The results are shown in Table 3. Grounding DINO outperforms GLIP under the same settings. We found two interesting phenomena in the results. First, Grounding DINO performs better than GLIP on common objects, but worse on rare categories. We suspect that the 900-query design limits the capacity for long-tailed objects. By contrast, the one-stage detector uses all proposals in the feature map for comparisons. The other phenomenon is that Grounding DINO has larger gains with more data than GLIP. For example, Grounding DINO gains +1.8 AP with the caption data Cap4M, whereas GLIP gains only +1.1 AP. We believe that Grounding DINO has better scalability than GLIP. Larger-scale training is left as future work.

Table 4: Results on the ODinW benchmark.
| Model | Backbone | Model Size | Pre-Training Data | APaverage | APmedian |
Zero-Shot Setting
| MDETR [18] | ENB5 [48] | 169M | GoldG, RefC | 10.7 | 3.0 |
| OWL-ViT [35] | ViT L/14 (CLIP) | >1243M | O365, VG | 18.8 | 9.8 |
| GLIP-T [26] | Swin-T | 232M | O365, GoldG, Cap4M | 19.6 | 5.1 |
| OmDet [61] | ConvNeXt-B | 230M | COCO, O365, LVIS, PhraseCut | 19.7 | 10.8 |
| GLIPv2-T [59] | Swin-T | 232M | O365, GoldG, Cap4M | 22.3 | 8.9 |
| DetCLIP [53] | Swin-L | 267M | O365, GoldG, YFCC1M | 24.9 | 18.3 |
| Florence [55] | CoSwinH | ≈841M | FLD900M, O365, GoldG | 25.8 | 14.3 |
| Grounding-DINO-T (Ours) | Swin-T | 172M | O365, GoldG | 20.0 | 9.5 |
| Grounding-DINO-T (Ours) | Swin-T | 172M | O365, GoldG, Cap4M | 22.3 | 11.9 |
| Grounding-DINO-L (Ours) | Swin-L | 341M | O365, OI, GoldG, Cap4M, COCO, RefC | 26.1 | 18.4 |
Few-Shot Setting
| DyHead-T [5] | Swin-T | ≈100M | O365 | 37.5 | 36.7 |
| GLIP-T [26] | Swin-T | 232M | O365, GoldG, Cap4M | 38.9 | 33.7 |
| DINO-Swin-T [58] | Swin-T | 49M | O365 | 41.2 | 41.1 |
| OmDet [61] | ConvNeXt-B | 230M | COCO, O365, LVIS, PhraseCut | 42.4 | 41.7 |
| Grounding-DINO-T (Ours) | Swin-T | 172M | O365, GoldG | 46.4 | 51.1 |
Full-Shot Setting
| GLIP-T [26] | Swin-T | 232M | O365, GoldG, Cap4M | 62.6 | 62.1 |
| DyHead-T [5] | Swin-T | ≈100M | O365 | 63.2 | 64.9 |
| DINO-Swin-T [58] | Swin-T | 49M | O365 | 66.7 | 68.5 |
| OmDet [61] | ConvNeXt-B | 230M | COCO, O365, LVIS, PhraseCut | 67.1 | 71.2 |
| DINO-Swin-L [58] | Swin-L | 218M | O365 | 68.8 | 70.7 |
| Grounding-DINO-T (Ours) | Swin-T | 172M | O365, GoldG | 70.7 | 76.2 |

ODinW Benchmark

ODinW (Object Detection in the Wild) [23] is a more challenging benchmark for testing model performance under real-world scenarios. It collects more than 35 datasets for evaluation. We report results under three settings, zero-shot, few-shot, and full-shot, in Table 4. Grounding DINO performs well on this benchmark. With only O365 and GoldG for pre-training, Grounding-DINO-T outperforms DINO on the few-shot and full-shot settings. Impressively, Grounding DINO with a Swin-T backbone outperforms DINO with Swin-L on the full-shot setting. Grounding DINO outperforms GLIP under the same backbone on the zero-shot setting and is comparable with GLIPv2 [59] without any new techniques like masked training. The results show the superiority of our proposed models. Grounding-DINO-L sets a new record on ODinW zero-shot with a 26.1 AP, even outperforming the giant Florence models [55]. The results show the generalization and scalability of Grounding DINO.

| Method | Backbone | Pre-Training Data | RefCOCO val | RefCOCO testA | RefCOCO testB | RefCOCO+ val | RefCOCO+ testA | RefCOCO+ testB | RefCOCOg val | RefCOCOg test |
| MAttNet [54] | R101 | None | 76.65 | 81.14 | 69.99 | 65.33 | 71.62 | 56.02 | 66.58 | 67.27 |
| VGTR [9] | R101 | None | 79.20 | 82.32 | 73.78 | 63.91 | 70.09 | 56.51 | 65.73 | 67.23 |
| TransVG [7] | R101 | None | 81.02 | 82.72 | 78.35 | 64.82 | 70.70 | 56.94 | 68.67 | 67.73 |
| VILLA_L [10] | R101 | CC, SBU, COCO, VG | 82.39 | 87.48 | 74.84 | 76.17 | 81.54 | 66.84 | 76.18 | 76.71 |
| RefTR [27] | R101 | VG | 85.65 | 88.73 | 81.16 | 77.55 | 82.26 | 68.99 | 79.25 | 80.01 |
| MDETR [18] | R101 | GoldG, RefC | 86.75 | 89.58 | 81.41 | 79.52 | 84.09 | 70.62 | 81.64 | 80.89 |
| DQ-DETR [46] | R101 | GoldG, RefC | 88.63 | 91.04 | 83.51 | 81.66 | 86.15 | 73.21 | 82.76 | 83.44 |
| GLIP-T (B) | Swin-T | O365, GoldG | 49.96 | 54.69 | 43.06 | 49.01 | 53.44 | 43.42 | 65.58 | 66.08 |
| GLIP-T | Swin-T | O365, GoldG, Cap4M | 50.42 | 54.30 | 43.83 | 49.50 | 52.78 | 44.59 | 66.09 | 66.89 |
| Grounding-DINO-T (Ours) | Swin-T | O365, GoldG | 50.41 | 57.24 | 43.21 | 51.40 | 57.59 | 45.81 | 67.46 | 67.13 |
| Grounding-DINO-T (Ours) | Swin-T | O365, GoldG, RefC | 73.98 | 74.88 | 59.29 | 66.81 | 69.91 | 56.09 | 71.06 | 72.07 |
| Grounding-DINO-T (Ours) | Swin-T | O365, GoldG, RefC | 89.19 | 91.86 | 85.99 | 81.09 | 87.40 | 74.71 | 84.15 | 84.94 |
| Grounding-DINO-L (Ours)* | Swin-L | O365, OI, GoldG, Cap4M, COCO, RefC | 90.56 | 93.19 | 88.24 | 82.75 | 88.95 | 75.92 | 86.13 | 87.02 |
Table 5: Top-1 accuracy comparison on the referring expression comprehension task. We mark the best results in bold. All models are trained with a ResNet-101 backbone. We use the notations “CC”, “SBU”, “VG”, “OI”, “O365”, and “YFCC” for Conceptual Captions [45], SBU Captions [36], Visual Genome [20], OpenImage [22], Objects365 [63], and YFCC100M [49], respectively. The term “RefC” is used for the RefCOCO, RefCOCO+, and RefCOCOg datasets. *There might be a data leak since COCO includes validation images in RefC. But the annotations of the two datasets are different.
| #ID | Model | COCO minival (Zero-Shot) | COCO minival (Fine-Tune) | LVIS minival (Zero-Shot) |
| 0 | Grounding DINO (Full Model) | 46.7 | 56.9 | 16.1 |
| 1 | w/o encoder fusion | 45.8 | 56.1 | 13.1 |
| 2 | static query selection | 46.3 | 56.6 | 13.6 |
| 3 | w/o text cross-attention | 46.1 | 56.3 | 14.3 |
| 4 | word-level text prompt | 46.4 | 56.6 | 15.6 |
Table 6: Ablations for our model. All models are trained on the O365 dataset with a Swin Transformer Tiny backbone.
| Model | DINO Pre-Train Data | Grounded Fine-Tune Data | COCO minival (Zero-Shot) | LVIS minival (Zero-Shot) | ODinW (Zero-Shot) |
| Grounding-DINO-T (from scratch) | - | O365 | 46.7 | 16.2 | 14.5 |
| Grounding-DINO-T (from scratch) | - | O365, GoldG | 48.1 | 25.6 | 20.0 |
| Grounding-DINO-T (from pre-trained DINO) | O365 | O365 | 46.5 | 17.9 | 13.6 |
| Grounding-DINO-T (from pre-trained DINO) | O365 | O365, GoldG | 46.4 | 26.1 | 18.5 |
Table 7: Transfer pre-trained DINO to Grounding DINO. We freeze shared modules between DINO and Grounding DINO during grounded fine-tuning. All models are trained with a Swin Transformer Tiny backbone.

4.3 Referring Object Detection Settings

We further explore our models' performance on the REC task. We leverage GLIP [26] as our baseline. We evaluate model performance on RefCOCO/+/g directly (we used the officially released code and checkpoints at https://github.com/microsoft/GLIP). The results are shown in Table 5. Grounding DINO outperforms GLIP under the same setting. Nevertheless, neither GLIP nor Grounding DINO performs well without REC data. More training data, such as caption data, or larger models help the final performance, but only marginally. After injecting RefCOCO/+/g data into training, Grounding DINO obtains significant gains. The results reveal that most of today's open-set object detectors need to pay more attention to fine-grained detection.

4.4 Ablations

We conduct ablation studies in this section. We propose a tight-fusion grounding model for open-set object detection and a sub-sentence level text prompt. To verify the effectiveness of the model design, we remove some fusion blocks to form different variants. Results are shown in Table 6. All models are pre-trained on O365 with a Swin-T backbone. The results show that each fusion helps the final performance. Encoder fusion is the most important design. The word-level text prompt has the smallest impact, but it is helpful as well. The language-guided query selection and text cross-attention have a larger influence on LVIS and COCO, respectively.

4.5 Transfer from DINO to Grounding DINO

Recent work has presented many large-scale image models for detection with the DINO architecture (see model instances at https://github.com/IDEA-Research/detrex). It is computationally expensive to train a Grounding DINO model from scratch. However, the cost can be significantly reduced if we leverage pre-trained DINO weights. Hence, we conduct some experiments to transfer pre-trained DINO to Grounding DINO models. We freeze the modules co-existing in DINO and Grounding DINO and fine-tune only the other parameters. (We compare DINO and Grounding DINO in Sec. E.) The results are available in Table 7.

It shows that we can achieve similar performance by training only the text and fusion blocks on top of a pre-trained DINO. Interestingly, the DINO-pre-trained Grounding DINO outperforms standard Grounding DINO on LVIS under the same setting. The results show that there might be much room for improvement in model training, which we leave for future work to explore. With a pre-trained DINO initialization, the model converges faster than Grounding DINO trained from scratch, as shown in Fig. 5. Notably, we use the results without exponential moving average (EMA) for the curves in Fig. 5, which results in a final performance different from that in Table 7. As the model trained from scratch needs more training time, we only show results of early epochs.

Figure 5: Comparison between two Grounding DINO variants: Training from scratch and transfer from DINO-pretrained models. The models are trained on O365 and evaluated on COCO directly.

5 Conclusion

We have presented the Grounding DINO model in this paper. Grounding DINO extends DINO to open-set object detection, enabling it to detect arbitrary objects given texts as queries. We review open-set object detector designs and propose a tight fusion approach to better fuse cross-modality information. We propose a sub-sentence level representation to use detection data for text prompts in a more reasonable way. The results show the effectiveness of our model design and fusion approach. Moreover, we extend open-set object detection to REC tasks and perform evaluation accordingly. We show that existing open-set detectors do not work well on REC data without fine-tuning. Hence, we call for extra attention to REC zero-shot performance in future studies.

Limitations: Despite the great performance on the open-set object detection setting, Grounding DINO cannot be used for segmentation tasks, unlike GLIPv2. Moreover, our training data is smaller than that of the largest GLIP model, which may limit our final performance.

6 Acknowledgement

We thank the authors of GLIP [26], Liunian Harold Li, Pengchuan Zhang, and Haotian Zhang, for their helpful discussions and instructions. We also thank Tiancheng Zhao, the author of OmDet [61], and Jianhua Han, the author of DetCLIP [53], for their responses regarding their model details. We thank He Cao of The Hong Kong University of Science and Technology for his help with diffusion models.

References

  • [1] Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. computer vision and pattern recognition, 2017.
  • [2] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European Conference on Computer Vision, pages 213–229. Springer, 2020.
  • [3] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4974–4983, 2019.
  • [4] Qiang Chen, Xiaokang Chen, Jian Wang, Haocheng Feng, Junyu Han, Errui Ding, Gang Zeng, and Jingdong Wang. Group detr: Fast detr training with group-wise one-to-many assignment. 2022.
  • [5] Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, and Lei Zhang. Dynamic head: Unifying object detection heads with attentions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7373–7382, 2021.
  • [6] Xiyang Dai, Yinpeng Chen, Jianwei Yang, Pengchuan Zhang, Lu Yuan, and Lei Zhang. Dynamic detr: End-to-end object detection with dynamic attention. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2988–2997, October 2021.
  • [7] Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. Transvg: End-to-end visual grounding with transformers. arXiv: Computer Vision and Pattern Recognition, 2021.
  • [8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [9] Ye Du, Zehua Fu, Qingjie Liu, and Yunhong Wang. Visual grounding with transformers. 2021.
  • [10] Zhe Gan, Yen-Chun Chen, Linjie Li, Chen Zhu, Yu Cheng, and Jingjing Liu. Large-scale adversarial training for vision-and-language representation learning. neural information processing systems, 2020.
  • [11] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
  • [12] Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, and Hongsheng Li. Fast convergence of detr with spatially modulated co-attention. arXiv preprint arXiv:2101.07448, 2021.
  • [13] Xiuye Gu, Tsung-Yi Lin, Weicheng Kuo, and Yin Cui. Open-vocabulary object detection via vision and language knowledge distillation. Learning, 2021.
  • [14] Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5356–5364, 2019.
  • [15] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [17] Ding Jia, Yuhui Yuan, Haodi He, Xiaopei Wu, Haojun Yu, Weihong Lin, Lei Sun, Chao Zhang, and Han Hu. Detrs with hybrid matching. 2022.
  • [18] Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. Mdetr-modulated detection for end-to-end multi-modal understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1780–1790, 2021.
  • [19] Ivan Krasin, Tom Duerig, Neil Alldrin, Vittorio Ferrari, Sami Abu-El-Haija, Alina Kuznetsova, Hassan Rom, Jasper Uijlings, Stefan Popov, Andreas Veit, et al. Openimages: A public dataset for large-scale multi-label and multi-class image classification. Dataset available from https://github. com/openimages, 2(3):18, 2017.
  • [20] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017.
  • [21] Weicheng Kuo, Fred Bertsch, Wei Li, AJ Piergiovanni, Mohammad Saffar, and Anelia Angelova. Findit: Generalized localization with natural language queries. 2022.
  • [22] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. arXiv: Computer Vision and Pattern Recognition, 2018.
  • [23] Chunyuan Li, Haotian Liu, Liunian Harold Li, Pengchuan Zhang, Jyoti Aneja, Jianwei Yang, Ping Jin, Yong Jae Lee, Houdong Hu, Zicheng Liu, and Jianfeng Gao. Elevater: A benchmark and toolkit for evaluating language-augmented visual models. 2022.
  • [24] Feng Li, Hao Zhang, Shilong Liu, Jian Guo, Lionel M Ni, and Lei Zhang. Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13619–13627, 2022.
  • [25] Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. 2023.
  • [26] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. arXiv preprint arXiv:2112.03857, 2021.
  • [27] Muchen Li and Leonid Sigal. Referring transformer: A one-step approach to multi-task visual grounding. arXiv: Computer Vision and Pattern Recognition, 2021.
  • [28] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. In Proceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017.
  • [29] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [30] Jingyu Liu, Liang Wang, and Ming-Hsuan Yang. Referring expression generation and comprehension via attributes. international conference on computer vision, 2017.
  • [31] Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. DAB-DETR: Dynamic anchor boxes are better queries for DETR. In International Conference on Learning Representations, 2022.
  • [32] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
  • [33] Depu Meng, Xiaokang Chen, Zejia Fan, Gang Zeng, Houqiang Li, Yuhui Yuan, Lei Sun, and Jingdong Wang. Conditional detr for fast training convergence. arXiv preprint arXiv:2108.06152, 2021.
  • [34] Peihan Miao, Wei Su, Lian Wang, Yongjian Fu, and Xi Li. Referring expression comprehension via cross-level multi-modal fusion. ArXiv, abs/2204.09957, 2022.
  • [35] Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. Simple open-vocabulary object detection with vision transformers. 2022.
  • [36] Vicente Ordonez, Girish Kulkarni, and Tamara L. Berg. Im2text: Describing images using 1 million captioned photographs. neural information processing systems, 2011.
  • [37] Bryan A Plummer, Liwei Wang, Chris M Cervantes, Juan C Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE international conference on computer vision, pages 2641–2649, 2015.
  • [38] Bryan A. Plummer, Liwei Wang, Christopher M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision, 2015.
  • [39] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems (NeurIPS), volume 28. Curran Associates, Inc., 2015.
  • [40] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(6):1137–1149, 2017.
  • [41] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 658–666, 2019.
  • [42] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models, 2021.
  • [43] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. meeting of the association for computational linguistics, 2015.
  • [44] Shuai Shao, Zeming Li, Tianyuan Zhang, Chao Peng, Gang Yu, Xiangyu Zhang, Jing Li, and Jian Sun. Objects365: A large-scale, high-quality dataset for object detection. In Proceedings of the IEEE international conference on computer vision, pages 8430–8439, 2019.
  • [45] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. meeting of the association for computational linguistics, 2018.
  • [46] Liu Shilong, Liang Yaoyuan, Huang Shijia, Li Feng, Zhang Hao, Su Hang, Zhu Jun, and Zhang Lei. DQ-DETR: Dual query detection transformer for phrase extraction and grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
  • [47] Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, Masayoshi Tomizuka, Lei Li, Zehuan Yuan, Changhu Wang, and Ping Luo. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14454–14463, 2021.
  • [48] Mingxing Tan and Quoc V. Le. Efficientnet: Rethinking model scaling for convolutional neural networks. international conference on machine learning, 2019.
  • [49] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas N. Poland, Damian Borth, and Li-Jia Li. Yfcc100m: the new data in multimedia research. Communications of The ACM, 2016.
  • [50] Yingming Wang, Xiangyu Zhang, Tong Yang, and Jian Sun. Anchor detr: Query design for transformer-based detector. national conference on artificial intelligence, 2021.
  • [51] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • [52] Mengde Xu, Zheng Zhang, Han Hu, Jianfeng Wang, Lijuan Wang, Fangyun Wei, Xiang Bai, and Zicheng Liu. End-to-end semi-supervised object detection with soft teacher. arXiv preprint arXiv:2106.09018, 2021.
  • [53] Lewei Yao, Jianhua Han, Youpeng Wen, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Chunjing Xu, and Hang Xu. Detclip: Dictionary-enriched visual-concept paralleled pre-training for open-world detection. 2022.
  • [54] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. Mattnet: Modular attention network for referring expression comprehension. computer vision and pattern recognition, 2018.
  • [55] Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, Ce Liu, Mengchen Liu, Zicheng Liu, Yumao Lu, Yu Shi, Lijuan Wang, Jianfeng Wang, Bin Xiao, Zhen Xiao, Jianwei Yang, Michael Zeng, Luowei Zhou, and Pengchuan Zhang. Florence: A new foundation model for computer vision. 2022.
  • [56] Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. 2022.
  • [57] Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. Open-vocabulary object detection using captions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14393–14402, 2021.
  • [58] Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M. Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection, 2022.
  • [59] Haotian Zhang, Pengchuan Zhang, Xiaowei Hu, Yen-Chun Chen, Liunian Harold Li, Xiyang Dai, Lijuan Wang, Lu Yuan, Jenq-Neng Hwang, and Jianfeng Gao. Glipv2: Unifying localization and vision-language understanding. 2022.
  • [60] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z. Li. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. computer vision and pattern recognition, 2019.
  • [61] Tiancheng Zhao, Peng Liu, Xiaopeng Lu, and Kyusong Lee. Omdet: Language-aware object detection with large-scale vision-language multi-dataset pre-training. 2022.
  • [62] Yiwu Zhong, Jianwei Yang, Pengchuan Zhang, Chunyuan Li, Noel Codella, Liunian Harold Li, Luowei Zhou, Xiyang Dai, Lu Yuan, Yin Li, and Jianfeng Gao. Regionclip: Region-based language-image pretraining. 2022.
  • [63] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
  • [64] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. In ICLR 2021: The Ninth International Conference on Learning Representations, 2021.

Appendix A More Implementation Details

By default, we use 900 queries in our model, following DINO. We set the maximum number of text tokens to 256. Using BERT as our text encoder, we tokenize texts with a BPE scheme [43], following BERT. The feature enhancer module contains six feature enhancer layers, and the cross-modality decoder is likewise composed of six decoder layers. We leverage deformable attention [64] in the image cross-attention layers.
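As a concrete illustration, the minimal sketch below shows how the text side of these settings could be wired up, assuming the HuggingFace transformers package and the bert-base-uncased checkpoint; names such as GroundingDINOConfig are ours for illustration, not from the released code.

import dataclasses
from transformers import BertTokenizerFast, BertModel

@dataclasses.dataclass
class GroundingDINOConfig:
    # illustrative config mirroring the defaults listed above
    num_queries: int = 900
    max_text_len: int = 256
    num_feature_enhancer_layers: int = 6
    num_decoder_layers: int = 6
    hidden_dim: int = 256

cfg = GroundingDINOConfig()
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
text_encoder = BertModel.from_pretrained("bert-base-uncased")

# Category names are concatenated into one prompt, tokenized with BERT's
# sub-word scheme, and truncated/padded to 256 tokens.
prompt = "person . bicycle . car . dog ."
tokens = tokenizer(prompt, max_length=cfg.max_text_len,
                   truncation=True, padding="max_length",
                   return_tensors="pt")
text_features = text_encoder(**tokens).last_hidden_state  # shape (1, 256, 768)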

Both matching costs and final losses include classification losses (or contrastive losses), box L1 losses, and GIOU [41] losses. Following DINO, we set the weight of classification costs, box L1 costs, and GIOU costs as 2.0, 5.0, and 2.0, respectively, during Hungarian matching. The corresponding loss weights are 1.0, 5.0, and 2.0 in the final loss calculation.
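For readers who want to see how these weights enter the matcher, here is a minimal sketch of the bipartite matching step under the stated cost weights (2.0 / 5.0 / 2.0). It is a simplification, not the released implementation: the class cost is reduced to a plain negative score rather than the focal-style cost used in DETR-family code, and boxes are assumed to be in xyxy format.

import torch
from scipy.optimize import linear_sum_assignment
from torchvision.ops import generalized_box_iou

def hungarian_match(pred_logits, pred_boxes, tgt_labels, tgt_boxes,
                    w_cls=2.0, w_l1=5.0, w_giou=2.0):
    """pred_boxes: (N, 4), tgt_boxes: (M, 4), both in xyxy format."""
    prob = pred_logits.sigmoid()                             # (N, num_classes)
    cost_cls = -prob[:, tgt_labels]                          # (N, M), simplified class cost
    cost_l1 = torch.cdist(pred_boxes, tgt_boxes, p=1)        # (N, M), box L1 cost
    cost_giou = -generalized_box_iou(pred_boxes, tgt_boxes)  # (N, M), GIoU cost
    cost = w_cls * cost_cls + w_l1 * cost_l1 + w_giou * cost_giou
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    return row, col  # matched (prediction index, target index) pairs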

Our Swin Transformer Tiny models are trained on 16 Nvidia V100 GPUs with a total batch size of 32. We extract three image feature scales, from 8× to 32× downsampling. The setting is named “4scale” in DINO since we further downsample the 32× feature map to 64× as an extra feature scale. For the model with Swin Transformer Large, we extract four image feature scales from the backbone, from 4× to 32×. The model is trained on 64 Nvidia A100 GPUs with a total batch size of 64.
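The extra 64× level mentioned above can be produced by one more strided convolution on the coarsest backbone map; the rough sketch below illustrates the idea (module and channel names are illustrative assumptions, not the released implementation).

import torch
import torch.nn as nn

class ExtraScale(nn.Module):
    """Downsample the 32x feature map once more to obtain a 64x level."""
    def __init__(self, in_dim, out_dim=256):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=2, padding=1),
            nn.GroupNorm(32, out_dim),
        )

    def forward(self, feat_32x):
        return self.proj(feat_32x)

# A Swin-T backbone yields features at 8x, 16x, and 32x strides, e.g.
# (B, 192, H/8, W/8), (B, 384, H/16, W/16), (B, 768, H/32, W/32);
# the 64x map is appended as the fourth scale for the "4scale" setup.
feat_32x = torch.randn(1, 768, 25, 25)
feat_64x = ExtraScale(768)(feat_32x)   # -> (1, 256, 13, 13)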

Item  Value
optimizer  AdamW
lr  1e-4
lr of image backbone  1e-5
lr of text backbone  1e-5
weight decay  0.0001
clip max norm  0.1
number of encoder layers  6
number of decoder layers  6
dim feedforward  2048
hidden dim  256
dropout  0.0
nheads  8
number of queries  900
set cost class  1.0
set cost bbox  5.0
set cost giou  2.0
ce loss coef  2.0
bbox loss coef  5.0
giou loss coef  2.0
Table 8: Hyper-parameters used in our pre-trained models.

Appendix B Data Usage

We use three types of data in our model pre-training.

  1. Detection data. Following GLIP [26], we reformulate the object detection task as a phrase grounding task by concatenating the category names into text prompts. We use COCO [29], O365 [44], and OpenImage (OI) [19] for model pre-training. To simulate different text inputs, we randomly sample category names from all categories in a dataset on the fly during training (see the prompt-construction sketch after this list).

  2. Grounding data. We use the GoldG and RefC data as grounding data. Both GoldG and RefC are preprocessed by MDETR [18] and can be fed into Grounding DINO directly. GoldG contains images from Flickr30k entities [37, 38] and Visual Genome [20]. RefC contains images from RefCOCO, RefCOCO+, and RefCOCOg.

  3. Caption data. To enhance the model performance on novel categories, we feed semantic-rich caption data to our model. Following GLIP, we use pseudo-labeled caption data for model training, where the pseudo labels are generated by a well-trained model.
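As an illustration of how detection annotations can be turned into text prompts with on-the-fly category sampling, here is a minimal Python sketch; the separator token, sampling budget, and function name are our assumptions rather than the released implementation.

import random

def build_detection_prompt(positive_names, all_names, max_names=80, sep=" . "):
    """Concatenate the image's positive categories with randomly sampled
    negatives so that the text input varies from iteration to iteration."""
    negatives = [n for n in all_names if n not in positive_names]
    k = max(0, max_names - len(positive_names))
    sampled = random.sample(negatives, min(k, len(negatives)))
    names = list(positive_names) + sampled
    random.shuffle(names)
    return sep.join(names) + " ."

# e.g. an O365 image annotated with {"person", "car"}:
prompt = build_detection_prompt({"person", "car"},
                                ["person", "car", "dog", "kite", "laptop"])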

There are two versions of the O365 dataset, which we termed O365v1 and O365v2, respectively. O365v1 is a subset of O365v2. O365v1 contains about 600K images, while O365v2 contains about 1.7M images. Following previous works [26, 53], we pre-train the Grounding-DINO-T on O365v1 for a fair comparison. The Grounding-DINO-L is pre-trained on O365v2 for a better result.

Appendix C More Results on COCO Detection Benchmarks

C.1 COCO Detection Results under the 1× Setting

Model  Epochs  AP  AP50  AP75  APS  APM  APL
Faster-RCNN(5scale) [40]  12  37.9  58.8  41.1  22.4  41.1  49.1
DETR(DC5) [2]  12  15.5  29.4  14.5  4.3  15.1  26.7
Deformable DETR(4scale) [64]  12  41.1  -  -  -  -  -
DAB-DETR(DC5) [31]  12  38.0  60.3  39.8  19.2  40.9  55.4
Dynamic DETR(5scale) [6]  12  42.9  61.0  46.3  24.6  44.9  54.4
Dynamic Head(5scale) [5]  12  43.0  60.7  46.8  24.7  46.4  53.9
HTC(5scale) [3]  12  42.3  -  -  -  -  -
DN-Deformable-DETR(4scale) [24]  12  43.4  61.9  47.2  24.8  46.8  59.4
DINO-4scale [58]  12  49.0  66.6  53.5  32.0  52.3  63.0
Grounding DINO (4scale)  12  48.1  65.8  52.3  30.4  51.3  62.3
Table 9: Results for Grounding DINO and other detection models with the ResNet-50 backbone on COCO val2017, trained for 12 epochs (the so-called 1× setting).

We present the performance of Grounding DINO on the standard COCO detection benchmark in Table 9. All models are trained with a ResNet-50 [16] backbone for 12 epochs. Grounding DINO achieves 48.1 AP under this setting, which shows that Grounding DINO is a strong closed-set detector. However, it is inferior to the original DINO. We suspect that the new components may make the model harder to optimize than DINO.

Appendix D Detailed Results on ODinW

We present detailed results of Grounding DINO on ODinW35 in Table 10, Table 11, and Table 12.

Dataset  AP  AP50  AP75  APS  APM  APL
AerialMaritimeDrone_large  9.48  15.61  8.35  8.72  10.28  2.91
AerialMaritimeDrone_tiled  17.56  26.35  13.89  0  1.61  28.7
AmericanSignLanguageLetters  1.45  2.21  1.39  -1  -1  1.81
Aquarium  18.83  34.32  18.19  10.65  20.64  21.52
BCCD_BCCD  6.17  11.31  6.04  1.27  9.09  6.89
ChessPiece  6.99  11.13  9.03  -1  -1  8.11
CottontailRabbits  71.93  85.05  85.05  -1  70  73.58
DroneControl_Drone_Control  6.15  10.95  6.23  2.08  6.91  6.16
EgoHands_generic  48.07  75.06  56.52  1.48  11.42  51.84
EgoHands_specific  0.66  1.25  0.64  0  0.02  0.92
HardHatWorkers  2.39  9.17  1.07  2.13  4.32  4.6
MaskWearing  0.58  1.43  0.56  0.12  0.51  4.66
MountainDewCommercial  18.22  29.73  21.33  0  23.23  49.8
NorthAmericaMushrooms  65.48  71.26  66.18  -1  -1  65.49
OxfordPets_by-breed  0.27  0.6  0.21  -1  1.38  0.33
OxfordPets_by-species  1.66  5.02  1  -1  0.65  1.89
PKLot_640  0.08  0.26  0.02  0.14  0.79  0.11
Packages  56.34  68.65  68.65  -1  -1  56.34
PascalVOC  47.21  57.59  51.28  16.53  39.51  58.5
Raccoon_Raccoon  44.82  76.44  46.16  -1  17.08  48.56
ShellfishOpenImages  23.08  32.21  26.94  -1  18.82  23.28
ThermalCheetah  12.9  19.65  14.72  0  8.35  50.15
UnoCards  0.87  1.52  0.96  2.91  2.18  -1
VehiclesOpenImages  59.24  71.88  64.69  7.42  32.38  72.21
WildfireSmoke  25.6  43.96  25.34  5.03  18.85  42.59
boggleBoards  0.81  2.92  0.12  2.96  1.13  -1
brackishUnderwater  1.3  1.88  1.4  0.99  1.75  11.39
dice_mediumColor  0.16  0.72  0.07  0.38  3.3  2.23
openPoetryVision  0.18  0.5  0.06  -1  0.25  0.17
pistols  46.4  66.47  47.98  4.51  22.94  55.03
plantdoc  0.34  0.51  0.35  -1  0.28  0.86
pothole  19.87  28.94  22.23  12.49  15.6  28.78
selfdrivingCa  9.46  19.13  8.19  0.85  6.82  16.51
thermalDogsAndPeople  72.67  86.65  79.98  33.93  30.2  86.71
websiteScreenshots  1.51  2.8  1.42  0.85  2.06  2.59
Table 10: Detailed results on 35 datasets in ODinW of Grounding DINO with Swin-T pre-trained on O365 and GoldG.
Dataset  AP  AP50  AP75  APS  APM  APL
AerialMaritimeDrone_large  10.3  18.17  9.21  8.92  11.2  7.35
AerialMaritimeDrone_tiled  17.5  28.04  18.58  0  3.64  24.16
AmericanSignLanguageLetters  0.78  1.17  0.76  -1  -1  1.02
Aquarium  18.64  35.27  17.29  11.33  17.8  21.34
BCCD_BCCD  11.96  22.77  8.65  0.16  5.02  13.15
ChessPiece  15.62  22.02  20.19  -1  -1  15.72
CottontailRabbits  67.61  78.82  78.82  -1  70  68.09
DroneControl_Drone_Control  4.99  8.76  5  0.65  5.03  8.61
EgoHands_generic  57.64  90.18  66.78  3.74  24.67  61.33
EgoHands_specific  0.69  1.37  0.63  0  0.02  1.03
HardHatWorkers  4.05  13.16  1.96  2.29  7.55  9.81
MaskWearing  0.25  0.81  0.15  0.09  0.13  2.78
MountainDewCommercial  25.46  39.08  28.89  0  32.53  58.38
NorthAmericaMushrooms  68.18  72.89  69.75  -1  -1  68.62
OxfordPets_by-breed  0.21  0.42  0.22  -1  2.91  0.17
OxfordPets_by-species  1.3  3.95  0.71  -1  0.28  1.62
PKLot_640  0.06  0.18  0.02  0.03  0.59  0.15
Packages  60.53  76.24  76.24  -1  -1  60.53
PascalVOC  55.65  66.51  60.47  19.61  44.25  67.21
Raccoon_Raccoon  60.07  84.81  66.5  -1  11.23  65.86
ShellfishOpenImages  29.56  38.08  33.5  -1  6.38  29.95
ThermalCheetah  17.72  25.93  19.61  1.04  20.02  63.69
UnoCards  0.81  1.3  1  2.6  1.01  -1
VehiclesOpenImages  58.49  71.56  63.64  8.22  28.03  71.1
WildfireSmoke  20.04  39.74  22.49  4.13  15.71  30.41
boggleBoards  0.29  1.15  0.04  1.8  0.57  -1
brackishUnderwater  1.47  2.34  1.58  2.32  3.31  9.96
dice_mediumColor  0.33  1.38  0.15  0.03  1.05  12.57
openPoetryVision  0.05  0.19  0  -1  0.09  0.21
pistols  66.99  86.34  72.65  16.25  39.24  75.98
plantdoc  0.36  0.47  0.39  -1  0.24  0.82
pothole  25.21  38.21  26.01  8.94  18.45  39.28
selfdrivingCa  9.95  20.55  8.28  1.36  7.27  15.46
thermalDogsAndPeople  67.89  80.85  78.66  45.05  30.24  85.56
websiteScreenshots  1.3  2.26  1.21  0.95  1.81  2.23
Table 11: Detailed results on 35 datasets in ODinW of Grounding DINO with Swin-T pre-trained on O365, GoldG, and Cap4M.
Dataset  AP  AP50  AP75  APS  APM  APL
AerialMaritimeDrone_large  12.64  18.44  14.75  9.15  19.16  0.98
AerialMaritimeDrone_tiled  20.47  34.81  12.79  0  7.61  26.93
AmericanSignLanguageLetters  3.94  4.84  4  -1  -1  4.48
Aquarium  28.14  45.47  30.97  12.1  24.71  39.42
BCCD_BCCD  23.85  36.92  28.88  0.3  10.8  24.43
ChessPiece  18.44  26.3  23.33  -1  -1  18.62
CottontailRabbits  71.66  88.48  88.48  -1  66  73.04
DroneControl_Drone_Control  7.16  11.56  7.67  2.29  10.6  7.68
EgoHands_generic  52.08  81.57  59.15  1.12  31.78  55.46
EgoHands_specific  1.22  2.28  1.2  0  0.05  1.5
HardHatWorkers  9.14  23.64  5.6  5.09  15.34  13.59
MaskWearing  1.64  4.69  1.18  0.44  1.05  8.67
MountainDewCommercial  33.28  53.59  32.76  0  35.86  80
NorthAmericaMushrooms  72.33  73.18  73.18  -1  -1  72.39
OxfordPets_by-breed  0.58  1.05  0.59  -1  4.46  0.6
OxfordPets_by-species  1.64  4.8  0.87  -1  1.51  1.8
PKLot_640  0.25  0.71  0.05  0.31  1.44  0.4
Packages  63.86  76.24  76.24  -1  -1  63.86
PascalVOC  66.01  76.65  71.8  32.01  55.7  75.37
Raccoon_Raccoon  65.81  90.39  69.93  -1  26  68.97
ShellfishOpenImages  62.47  74.25  70.07  -1  26  63.06
ThermalCheetah  21.33  26.11  24.92  2.39  15.84  75.34
UnoCards  0.52  0.84  0.66  3.02  0.92  -1
VehiclesOpenImages  62.74  75.15  67.23  10.66  47.46  76.36
WildfireSmoke  23.66  45.72  25.06  1.58  22.22  35.27
boggleBoards  0.28  1.04  0.05  5.64  0.7  -1
brackishUnderwater  2.41  3.39  2.79  4.43  3.88  21.22
dice_mediumColor  0.26  1.15  0.03  0  1.09  4.07
openPoetryVision  0.08  0.35  0.01  -1  0.15  0.11
pistols  71.4  90.69  77.21  18.74  39.58  80.78
plantdoc  2.02  2.64  2.37  -1  0.5  2.82
pothole  30.4  44.22  33.84  12.27  18.84  48.57
selfdrivingCa  9.25  17.72  8.39  1.93  7.03  13.02
thermalDogsAndPeople  72.02  86.02  79.47  29.16  68.05  86.75
websiteScreenshots  1.32  2.64  1.16  0.79  1.8  2.46
Table 12: Detailed results on 35 datasets in ODinW of Grounding DINO with Swin-L pre-trained on O365, OI, GoldG, Cap4M, COCO, and RefC.
Figure 6: Comparison between DINO and our Grounding DINO. We mark the modifications in blue. Best viewed in color.

Appendix E Comparison between DINO and Grounding DINO

To illustrate the difference between DINO and Grounding DINO, we compare DINO and Grounding DINO in Fig. 6. We mark the DINO blocks in gray, while the newly proposed modules are shaded in blue.

Figure 7: Visualizations of model outputs.

Appendix F Visualizations

We present some visualizations in Fig. 7. Our model generalizes well across different scenes and text inputs. For example, Grounding DINO accurately locates the “man in blue” and the “child in red” in the last image.

Appendix G Marrying Grounding DINO with Stable Diffusion

We present an image editing application in Fig. 1 (b). The results are generated in two steps. First, we detect objects with Grounding DINO and generate masks by masking out the detected objects or backgrounds. After that, we feed the original images, image masks, and generation prompts to an inpainting model (typically Stable Diffusion [42]) to render new images. We use the checkpoints released at https://github.com/Stability-AI/stablediffusion for new image generation. More results are available in Figure 8.

The “detection prompt” is the language input for Grounding DINO, while the “generation prompt” is for the inpainting model.
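A rough sketch of this two-step pipeline is shown below, assuming the diffusers StableDiffusionInpaintPipeline and the stabilityai/stable-diffusion-2-inpainting checkpoint; the detect_boxes() helper is a hypothetical stand-in for Grounding DINO inference, and its signature is our assumption.

import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

def detect_boxes(image, detection_prompt):
    # Hypothetical stand-in: in practice, run Grounding DINO here and return
    # absolute-pixel xyxy boxes for the phrases in `detection_prompt`.
    return [(120, 160, 380, 470)]

def boxes_to_mask(image, boxes):
    """Rasterize detected boxes (xyxy, absolute pixels) into a binary mask."""
    mask = Image.new("L", image.size, 0)
    draw = ImageDraw.Draw(mask)
    for x0, y0, x1, y1 in boxes:
        draw.rectangle([x0, y0, x1, y1], fill=255)
    return mask

image = Image.open("input.jpg").convert("RGB").resize((512, 512))
boxes = detect_boxes(image, detection_prompt="dog")      # detection prompt

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")
mask = boxes_to_mask(image, boxes)                       # edit the detected objects
result = pipe(prompt="a corgi wearing sunglasses",       # generation prompt
              image=image, mask_image=mask).images[0]
result.save("edited.jpg")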

Using GLIGEN for Grounded Generation

To enable fine-grained image editing, we combine the Grounding DINO with GLIGEN [GLIGEN]. We use the “phrase prompt” in Figure 9 as the input phrases of each box for GLIGEN.

GLIGEN accepts grounding results as inputs and can generate objects at specified positions. We can assign each bounding box an object with GLIGEN, as shown in Figure 9 (c) and (d). Moreover, GLIGEN can fully fill each bounding box, which results in better visualization, as in Figure 9 (a) and (b). For example, we use the same generation prompt in Figure 8 (b) and Figure 9 (b). The GLIGEN results ensure that each bounding box contains an object and fills the detected region.
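A hedged sketch of passing per-box phrase prompts to GLIGEN is shown below. It assumes the diffusers StableDiffusionGLIGENPipeline and the masterful/gligen-1-4-generation-text-box checkpoint with its gligen_phrases/gligen_boxes arguments; these names should be verified against the current diffusers documentation, and the boxes would come from Grounding DINO, normalized to [0, 1].

import torch
from diffusers import StableDiffusionGLIGENPipeline

# Assumed checkpoint and argument names; verify against the diffusers docs.
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", torch_dtype=torch.float16
).to("cuda")

# Boxes from Grounding DINO, normalized xyxy; one phrase prompt per box.
boxes = [[0.10, 0.55, 0.45, 0.95], [0.55, 0.50, 0.90, 0.95]]
phrases = ["a corgi", "a tabby cat"]

image = pipe(
    prompt="two pets sitting on a sofa, photorealistic",  # generation prompt
    gligen_phrases=phrases,
    gligen_boxes=boxes,
    gligen_scheduled_sampling_beta=1.0,
    num_inference_steps=50,
).images[0]
image.save("gligen_edit.jpg")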

Figure 8: Combination of Grounding DINO and Stable Diffusion. We first detect objects with Grounding DINO and then perform image inpainting with Stable Diffusion. “Detection Prompt” and “Generation Prompt” are inputs for Grounding DINO and Stable Diffusion, respectively. *The input human face in row (e) is generated by StyleGAN.
Figure 9: Combination of Grounding DINO and GLIGEN. We first detect objects with Grounding DINO and then perform image inpainting with GLIGEN. “Detection Prompt” and “Generation Prompt” are inputs for Grounding DINO and Stable Diffusion, respectively. “Phrase Prompts” are language inputs for each bounding box, separated by semicolons. *We assign phrase prompts to bounding boxes randomly.

Appendix H Effects of RefC and COCO Data

We add RefCOCO/+/g (denoted “RefC” in tables) and COCO to training in some settings. We explore the influence of these data in Table 13. The results show that RefC helps improve the COCO zero-shot and fine-tuning performance but hurts the LVIS and ODinW results. With COCO introduced, the COCO results are greatly improved, while COCO brings only marginal improvements on LVIS and slightly decreases the ODinW performance.

Model  Pre-Train  COCO minival Zero-Shot  COCO minival Fine-Tune  LVIS minival Zero-Shot  ODinW Zero-Shot
Grounding DINO T  O365, GoldG  48.1  57.1  25.6  20.0
Grounding DINO T  O365, GoldG, RefC  48.5  57.3  21.9  17.7
Grounding DINO T  O365, GoldG, RefC, COCO  56.1  57.5  22.3  17.4
Table 13: Impacts of RefC and COCO data for open-set settings. All models are trained with a Swin Transformer Tiny backbone.
Table 14: Comparison of model size and model efficiency between GLIP and Grounding DINO.
Model  Params  GFLOPS  FPS
GLIP-T [26]  232M  488G  6.11
Grounding DINO T (Ours)  172M  464G  8.37

Appendix I Model Efficiency

We compare the model size and efficiency of Grounding-DINO-T and GLIP-T in Table 14. The results show that our model has fewer parameters and higher efficiency than GLIP.
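For reference, parameter counts and FPS figures of this kind can be reproduced with a simple script like the sketch below; the model-building line is a placeholder rather than the released API, and GFLOPs would additionally require a FLOP counter such as fvcore.

import time
import torch

def count_params(model):
    return sum(p.numel() for p in model.parameters())

@torch.no_grad()
def measure_fps(model, image, warmup=10, iters=50):
    model.eval()
    for _ in range(warmup):
        model(image)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(image)
    torch.cuda.synchronize()
    return iters / (time.perf_counter() - start)

# model = build_grounding_dino(...)   # placeholder for the actual model builder
# image = torch.randn(1, 3, 800, 1333, device="cuda")
# print(f"params: {count_params(model) / 1e6:.0f}M, fps: {measure_fps(model, image):.2f}")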