SCANNER: Knowledge-Enhanced Approach for Robust Multi-modal Named Entity Recognition of Unseen Entities
Hyunjong Ok, Taeho Kil, Sukmin Seo, Jaeho Lee
POSTECH, NAVER Cloud, HJ AILAB
hyunjong.ok@gmail.com, taeho.kil@navercorp.com
Abstract
Recent advances in named entity recognition (NER) have pushed the boundary of the task to incorporate visual signals, leading to many variants, including multi-modal NER (MNER) or grounded MNER (GMNER). A key challenge to these tasks is that the model should be able to generalize to entities unseen during training, and should be able to handle training samples with noisy annotations. To address this obstacle, we propose SCANNER (Span CANdidate detection and recognition for NER), a model capable of effectively handling all three NER variants. SCANNER has a two-stage structure; we extract entity candidates in the first stage and use them as queries to retrieve knowledge, effectively pulling knowledge from various sources. We can boost our performance by utilizing this entity-centric extracted knowledge to address unseen entities. Furthermore, to tackle the challenges arising from noisy annotations in NER datasets, we introduce a novel self-distillation method, enhancing the robustness and accuracy of our model in processing training data with inherent uncertainties. Our approach demonstrates competitive performance on the NER benchmark and surpasses existing methods on both MNER and GMNER benchmarks. Further analysis shows that the proposed distillation and knowledge utilization methods improve the performance of our model on various benchmarks.
1 Introduction
Named entity recognition (NER) is a fundamental task in natural language processing to identify textual spans that correspond to named entities in the given text, and classify them into pre-defined categories, such as persons, locations, and organizations (Li et al., 2020). The extracted information can be utilized for various downstream tasks, including entity linking and relation extraction.
Figure 1: Illustrations of the NER, MNER, and GMNER tasks. The NER task aims to identify named entities from the given text. MNER extends this task to utilize additional image information. GMNER additionally requires the model to predict entity bounding boxes in the given image, if they are present.
The rapid growth of the amount of multi-modal contents on social media platforms has given rise to the multi-modal variants of NER. The most prominent example is multi-modal NER (MNER; Zhang et al. (2018)), which extends traditional NER to identifying named entities in the text based on additional image input paired with the text (Fig. 1b). Another recent example is the grounded MNER (GMNER; Yu et al. (2023)); here, one additionally aims to predict the bounding boxes of named entities appearing in the given image (Fig. 1c).
A major challenge in NER, MNER, and GMNER tasks is the presence of unseen entities in the test datasets, which are not found in the training datasets. Traditional models often struggle with low performance on these unseen entities (see Table 1). To tackle this problem effectively, it is important to use knowledge about unseen entities in a way that boosts the ability of the model to generalize
| Datasets | Methods | Seen entities | Unseen entities |
|---|---|---|---|
| CoNLL2003 | BERT-base | 93.78 | 80.90 |
| CoNLL2003 | Ours (w/o Knowledge) | 96.29 | 89.68 |
| Twitter-2015 | BERT-base | 79.81 | 57.81 |
| Twitter-2015 | Ours (w/o Knowledge) | 87.18 | 73.84 |
| Twitter-2017 | BERT-base | 93.81 | 67.76 |
| Twitter-2017 | Ours (w/o Knowledge) | 95.68 | 82.96 |

Table 1: A comparison of test F1 scores for the named entities that have appeared at least once in the training dataset, versus the entities that have not appeared.
Figure 2: 'Kroger' is an unseen entity that is hard to recognize as an Organization or Location. With the knowledge retrieved by our model, it is predicted correctly.
and perform well across different types of data. In this paper, we introduce SCANNER, which stands for Span CANdidate detection and recognition for Named Entity Recognition. Our approach is designed to effectively use knowledge about unseen entities, addressing the NER, MNER, and GMNER tasks with improved robustness. SCANNER adopts a two-stage structure, comprising a span candidate detection module and an entity recognition module. The span candidate detection module identifies named entity candidates within sentences. Following this, the entity recognition module uses these candidates as queries to extract relevant knowledge from various sources, effectively recognizing the class of each entity candidate. As illustrated in Fig. 2, we were able to accurately identify 'Kroger' as an 'organization' by utilizing object knowledge. SCANNER effectively gathers and uses knowledge from various sources, boosting its performance on the challenging NER, MNER, and GMNER benchmarks. Notably, the GMNER challenge involves the intricate process of identifying entities and determining their bounding boxes within images. The architecture of SCANNER, leveraging its comprehensive knowledge, is effective in addressing the GMNER task. The effectiveness of SCANNER in the GMNER task is highlighted by establishing a new baseline that is substantially higher than the previous
Table 2: Examples of gold annotations and potential alternatives. The gold annotations are marked in blue brackets, whereas the alternative annotations are in red brackets.
standard, as measured by the F1 score. Additionally, we introduce a novel self-distillation method called Trust Your Teacher. The NER task faces challenges with noisy annotations (Wang et al., 2019; Zhu and Li, 2022), particularly at entity boundaries, where exact span matching is crucial and ambiguity often leads to increased noise (see Table 2). Our distillation method, which softly utilizes both the prediction of the teacher model and the ground truth (GT) labels, addresses the challenges of noisy annotations.
Our approach demonstrates competitive performance on NER and surpasses existing methods on both MNER and GMNER. Further analysis shows that the proposed distillation and knowledge utilization methods improve the performance of our model on various benchmarks.
The contributions of SCANNER are summarized in three key aspects:
We propose a new distillation method that softly blends the predictions of the teacher model with ground truth annotations to enhance data quality and model training.
We develop SCANNER, a two-stage structured model that effectively utilizes knowledge to improve performance, particularly in recognizing unseen entities.
The SCANNER model shows competitive performance in NER benchmarks and demonstrates higher performance than existing methods in MNER and GMNER benchmarks.
2 Related work
Prior works on MNER typically operate by first extracting NER-related features from the image, and then combining these features with text features to recognize named entities. Roughly, existing
works fall into two categories according to how they extract image features.
Textual features. Several works extract textual metadata from the given image and utilize it as features for the subsequent NER task (Wang et al., 2022b,a; Li et al., 2023b). For instance, ITA (Wang et al., 2022b) extracts object tags, image captions, and OCR results from the given image. Similarly, Li et al. (2023b) also extract image captions, but additionally utilize a large language model as an implicit knowledge source to further refine the features. MoRe (Wang et al., 2022a) takes a slightly different approach, using an image-based retrieval system to retrieve textual descriptions of the closest images in the database.
Visual encoders. Another line of work attempts to extract the image features using a visual encoder, such as pre-trained ResNets, ViTs, or CLIP vision encoder (Wang et al., 2022e; Zhang et al., 2023; Chen et al., 2023). The extracted features are then combined with the text features extracted from a separate text encoder, which often involves additional alignment via cross-modal attention (Chen et al., 2022; Lu et al., 2022; Wang et al., 2022e; Zhang et al., 2023; Chen et al., 2023). Notably, PromptMNER (Wang et al., 2022d) calculates the similarity between visual features and various text prompts to extract visual cues that are loosely related to the input text.
In this paper, we take a different path and extract the image features conditioned on the information extracted from the given text. To the best of our knowledge, although related attempts exist in NER (Wang et al., 2021, 2022c; Tan et al., 2023), this is the first such attempt in the context of MNER, which is a more challenging task.
In addition, a new task has been introduced, which not only incorporates image inputs but also actively addresses the task of grounding entity locations within images (Yu et al., 2023).
3 Method
In this section, we first introduce the architecture of the proposed method, which comprises the span candidate detection module and the named entity recognition module (Sec. 3.1). Then, we describe the named entity recognition module, which performs entity recognition and visual grounding in the image for each entity candidate (Sec. 3.2). Finally, we explain a novel distillation method, named Trust Your Teacher, which is designed to
Figure 3: The overall architecture of the proposed SCANNER method. The two-stage structure allows for efficient extraction and utilization of knowledge, as knowledge is extracted only for those entity candidates that were filtered through in stage 1.
robustly train our model even in the presence of noisy dataset annotations (Sec. 3.3).
3.1 SCANNER Architecture
The primary focus of this paper is to perform MNER using both knowledge extracted from within images and external knowledge, even for entities not encountered during training. To achieve this, as illustrated in Fig. 3, we propose a two-stage architecture, known for its efficiency in extracting and searching for knowledge from various sources. In the first stage, we extract named entity candidates, and in the second stage, we efficiently search and extract only knowledge relevant to these candidates. This acquired knowledge is then utilized for entity recognition.
Stage 1: Span Candidate Detection Module. In the first stage of SCANNER, a transformer encoder (Liu et al., 2019) is employed to detect entity candidates from the input text. During this phase, we utilize BIO (Beginning, Inside, Outside) tagging to classify each token in the input text, determining whether it corresponds to the beginning, inside, or outside of an entity span. The classification process is guided by cross-entropy loss.
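To make the stage 1 procedure concrete, below is a minimal sketch of span candidate detection as token-level BIO classification; the encoder checkpoint, label set, and offset-based span decoding are illustrative assumptions rather than our exact training code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative label set: stage 1 only decides span boundaries, not entity types.
LABELS = ["O", "B", "I"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
# In practice this head is fine-tuned with token-level cross-entropy on BIO labels.
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=len(LABELS)
)

def detect_span_candidates(sentence: str):
    """Return text spans tagged as entity candidates under greedy BIO decoding."""
    enc = tokenizer(sentence, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        logits = model(**enc).logits[0]                 # (num_tokens, 3)
    tags = [LABELS[i] for i in logits.argmax(-1).tolist()]

    spans, start, prev_end = [], None, None
    for (s, e), tag in zip(offsets, tags):
        if s == e:                                      # special tokens
            continue
        if tag == "B":
            if start is not None:
                spans.append((start, prev_end))
            start = s
        elif tag == "O" and start is not None:
            spans.append((start, prev_end))
            start = None
        prev_end = e
    if start is not None:
        spans.append((start, prev_end))
    return [sentence[s:e] for s, e in spans]
```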
Stage 2: Entity Recognition Module. In Stage 2, SCANNER performs named entity recognition and visual grounding for each entity candidate detected in Stage 1. It utilizes each entity candidate as a query to extract and leverage the necessary knowledge for the tasks. During this process, SCANNER efficiently searches and extracts knowledge by focusing on the initially detected entity candidates rather than the entire input text. SCANNER utilizes both internal (image-based) and external (e.g., Wikipedia) knowledge sources to perform
Figure 4: An illustration of the entity recognition module (stage 2). Based on the entity candidates (extracted in stage 1), SCANNER utilizes various knowledge sources such as Wikipedia, an image captioner, and an object knowledge extractor. The knowledge collected from these sources is then processed by RoBERTa to give the final prediction.
MNER on unseen entities not encountered in training. Detailed information about these modules is provided in Section 3.2.
3.2 Entity Recognition Module
For each entity candidate identified by the span candidate detection module, the entity recognition module processes a text prompt that includes both the entity candidate and associated knowledge. This knowledge, extracted from images and external knowledge sources, allows for performing MNER on unseen entities that were not encountered during training. Our methodology involves extracting this knowledge from a variety of sources, utilizing the identified entity candidates as the basis for the extraction process. Then, this module classifies the class of each entity candidate and performs grounding to determine which object in the image corresponds to the entity. A detailed illustration is shown in Fig. 4.
3.2.1 Prompt construction with knowledge
The entity recognition module extracts and utilizes useful knowledge from various sources when constructing the text prompt corresponding to the input. The knowledge applied for constructing text prompts in our method includes the following.
Wikipedia knowledge. Initially, information is retrieved by using the entity candidate as a query against an external knowledge source, Wikipedia. This information can be valuable for classifying the type of each candidate entity and, moreover, enables the model to classify unseen entities that were not encountered during training. As illustrated in Fig. 4, for entity candidates like 'Steve Kerr', it enhances entity recognition performance by providing valuable information for classification, such as his being an American basketball player and coach.
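As a rough illustration of this retrieval step, the sketch below queries Wikipedia with the entity candidate using the public `wikipedia` package; the choice of search backend and the number of retained sentences are assumptions, since the paper does not prescribe a specific retrieval interface.

```python
import wikipedia  # pip install wikipedia

def fetch_wikipedia_knowledge(entity_candidate: str, max_sentences: int = 2) -> str:
    """Use the entity candidate as a query and return a short summary as knowledge."""
    try:
        # Take the top-ranked page for the query and keep a short summary.
        titles = wikipedia.search(entity_candidate, results=1)
        if not titles:
            return ""
        return wikipedia.summary(titles[0], sentences=max_sentences, auto_suggest=False)
    except Exception:
        return ""  # disambiguation errors, missing pages, network issues, etc.

# fetch_wikipedia_knowledge("Steve Kerr") would return a short biography
# that can be appended to the stage-2 text prompt as external knowledge.
```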
Image caption. To effectively utilize visual information, image captioning results are also used. We use BLIP-2 (Li et al., 2023a) to extract synthetic captions for the whole image.
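A hedged sketch of this caption extraction with BLIP-2 via Hugging Face Transformers follows; the checkpoint name, device placement, and generation length are illustrative choices rather than our exact configuration.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def caption_image(image_path: str) -> str:
    """Generate a synthetic caption for the whole image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```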
Object knowledge. In addition to global information about the image, object-level information is also beneficial for entity recognition. To achieve this, results obtained from the object detector are employed as knowledge. Initially, object classes are converted into text format and used as knowledge. Then, synthetic captions for each object region are also utilized in conjunction with the class names. This information is structured as details corresponding to each object, along with a special token denoted as [obj], as shown in Fig. 4. Additionally, during this process, the visual-language similarity between each object and the entity candidate is calculated, and objects are arranged in order of high similarity, which is then included in the text prompt. One of the problems with existing
methods for the MNER task is that the model sometimes references objects in the image that are irrelevant to the entity, leading to incorrect recognition. By arranging the object details in the text prompt according to the visual-language similarity order with the entity, our model can focus more on the object regions that are highly related to the entity. In this paper, CLIP (Radford et al., 2021) is employed for visual-language similarity, specifically calculating the similarity between the text representation of the entity candidate and the visual representation of each Region of Interest (RoI).
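The similarity-based ordering of object knowledge can be sketched as follows; the CLIP checkpoint and the crop-based RoI encoding are assumptions made purely for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_objects_by_entity(image: Image.Image, boxes, entity_candidate: str):
    """Sort detected object boxes by CLIP similarity to the entity candidate text."""
    crops = [image.crop(box) for box in boxes]           # RoIs from the object detector
    inputs = clip_proc(
        text=[entity_candidate], images=crops, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        text_emb = clip.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ text_emb.T).squeeze(-1)            # one similarity per RoI
    order = sims.argsort(descending=True).tolist()
    return [boxes[i] for i in order], sims[order].tolist()
```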
All such knowledge mentioned above is converted into a textual format and integrated with the text prompt for entity recognition and visual grounding.
The text prompt, structured to include the entity candidate, the entire input text sentence, and the extracted knowledge, is presented as "The entity is [mask] for {entity} in this sentence. {original sentence} {Wikipedia} {image caption} [obj] {object 1} [obj] {object 2} ..."
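Putting the pieces together, the prompt for a single entity candidate can be assembled roughly as below; the helper name, argument types, and joining behavior are our own assumptions, while the field order follows the template quoted above.

```python
def build_prompt(entity: str, sentence: str, wiki: str, caption: str, objects: list) -> str:
    """Assemble the stage-2 text prompt for a single entity candidate.

    `objects` holds textual object knowledge (class name plus region caption),
    already sorted by CLIP similarity to the entity candidate.
    """
    parts = [
        f"The entity is [mask] for {entity} in this sentence.",
        sentence,
        wiki,
        caption,
    ]
    parts += [f"[obj] {obj}" for obj in objects]
    return " ".join(p for p in parts if p)

# Example with hypothetical values:
# build_prompt("Steve Kerr", "Steve Kerr talks with his players.",
#              "Stephen Kerr is an American basketball coach ...",
#              "a man talking to basketball players",
#              ["person: a man in a suit", "person: a basketball player"])
```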
3.2.2 Encoder and Objective
The prompts constructed for each entity candidate are input into a transformer encoder model (Liu et al., 2019). For entity recognition, the output token representation of the [mask] token in the text prompt $t_i$ for the $i$-th entity candidate is fed into a linear layer to predict the probability distribution $p_i$. Given the ground truth $y_i$, the objective function is to minimize the cross-entropy loss between the predicted entity class distribution and the ground truth label:

$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{N}\sum_{i=1}^{N} y_i \log p_i,$$

where $N$ is the total number of entity candidates.
Additionally, visual grounding is performed by feeding the output token representation of the $j$-th [obj] token from the text prompt $t_i$ into a linear layer. This is followed by a sigmoid function, which aids in predicting the overlap score $s_{ij}$ between the ground truth image region grounding entity candidate $i$ and object $j$. The objective function of visual grounding is calculated based on the binary cross-entropy loss between the overlap score and the ground truth Intersection over Union (IoU):

$$\mathcal{L}_{\mathrm{vg}} = -\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\Big(u_{ij}\log s_{ij} + (1-u_{ij})\log\big(1-s_{ij}\big)\Big),$$

where $u_{ij}$ is the ground truth IoU between the ground truth image region of entity $i$ and object region $j$, and $M$ is the number of detected objects.
In the training stage, we combine the two losses as the final loss of our model:

$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda\,\mathcal{L}_{\mathrm{vg}},$$

where $\lambda$ is a weighting coefficient; we set $\lambda$ to 1 for the GMNER task and to 0 for the NER and MNER tasks in this paper.
3.3 Trust Your Teacher
We introduce a novel self-distillation method called Trust Your Teacher (TYT). Our distillation method, which softly utilizes both the prediction of the teacher model and the ground truth (GT) label, addresses the challenges of noisy annotations. First, we train the teacher model with the classification objective $\mathcal{L}_{\mathrm{cls}}$, and then train the final student model using both the predictions of the teacher model and the ground truth labels. The most significant feature of our proposed method is that it assesses the reliability of each sample by utilizing the prediction of the teacher model to determine whether the sample is trustworthy or noisy. Based on this assessment, the method sets the weights between the model prediction and the GT label, which are then reflected in the loss calculation. The objective of our proposed distillation method composes a cross-entropy loss with the ground truth and a Kullback-Leibler Divergence (KLD) loss with the teacher predictions:

$$\mathcal{L}_{\mathrm{TYT}} = \alpha\,\mathrm{CE}\big(p_s(x;\theta_s),\,y\big) + (1-\alpha)\,\mathrm{KL}\big(p_t(x;\theta_t)\,\big\|\,p_s(x;\theta_s)\big),$$

where $x$ is the input sample, $\theta_s$ and $\theta_t$ are the model parameters of the student and teacher, $p_s$ and $p_t$ are the prediction distributions of the student and teacher, and $\alpha$ is a balancing factor proposed in this paper. In detail, $\alpha$ determines whether to trust the teacher model prediction or the ground truth, and it represents the prediction score of the teacher model for the ground truth class index, i.e., $\alpha = p_t(x;\theta_t)[y]$. This implies that, since the teacher model is well-trained, if the score for the ground
Figure 5: Experiments on the text classification task on the MNLI dataset. 'matched' is in-domain, and 'mismatched' is out-of-domain.
| Methods | Twitter-2015 Pre. | Twitter-2015 Rec. | Twitter-2015 F1 | Twitter-2017 Pre. | Twitter-2017 Rec. | Twitter-2017 F1 | Twitter-GMNER Pre. | Twitter-GMNER Rec. | Twitter-GMNER F1 |
|---|---|---|---|---|---|---|---|---|---|
| Base | 83.28 | 87.68 | 85.43 | | 93.23 | 92.08 | 86.60 | 87.59 | 87.09 |
| Half | 83.36 | 87.69 | 85.47 | 90.31 | 92.92 | 91.60 | | 87.96 | 87.47 |
| Full | | 87.72 | 85.63 | 90.53 | 92.95 | 91.72 | 86.90 | 87.80 | 87.35 |
| TYT | 83.59 | | | 90.94 | | | 86.82 | | |
Table 3: Ablation study on the MNER and GMNER datasets in the first stage. 'Half' is when $\alpha$ is fixed at 0.5 and 'Full' is when $\alpha$ is 0. In 'TYT', $\alpha$ is adjusted through the Trust Your Teacher method.
truth class is high, then the sample is considered reliable and more weight is given to the cross-entropy with the ground truth label. Conversely, if the score is low, the sample is assumed to be an unreliable, noisy sample, and more weight is placed on the KLD loss with the prediction of the teacher model, rather than the ground truth label.
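Below is a minimal PyTorch sketch of the TYT objective described above: the weight on the ground-truth cross-entropy is the teacher's probability for the gold class, and the remaining weight goes to a KL term toward the teacher distribution. The function and variable names, and the mean reduction, are our own assumptions.

```python
import torch
import torch.nn.functional as F

def trust_your_teacher_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            labels: torch.Tensor) -> torch.Tensor:
    """TYT loss: per-sample blend of CE with the label and KLD with the teacher."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    p_t = F.softmax(teacher_logits, dim=-1).detach()      # teacher is frozen

    # alpha_i = teacher's confidence in the ground-truth class of sample i
    alpha = p_t.gather(-1, labels.unsqueeze(-1)).squeeze(-1)

    ce = F.nll_loss(log_p_s, labels, reduction="none")            # trust the label
    kld = F.kl_div(log_p_s, p_t, reduction="none").sum(-1)        # trust the teacher

    return (alpha * ce + (1.0 - alpha) * kld).mean()
```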
To demonstrate the significant impact of our TYT approach, we carried out additional experiments. Fig. 5 illustrates our experiments on a text classification task on the MNLI dataset. We extract a subset of the train set for experimental efficiency and intentionally add label noise at several rates to this subset. We then compare the performance of the model trained with our TYT method on the train set with added label noise against the baseline that does not use distillation. Fig. 5 indicates that using TYT demonstrates relatively robust performance under moderate noise conditions. Additionally, we compare our method with conventional soft distillation methods that do not dynamically vary the parameter $\alpha$ in the entity detection task, stage 1 of MNER and GMNER. Table 3 shows that our method has better performance on the MNER and GMNER benchmarks, and adaptively varying $\alpha$ is more effective than keeping it fixed.
We apply TYT to both stages 1 and 2, but in
NER, we only use it in stage 1. The TYT loss is applied only to the classification loss and not to the visual grounding loss.
4 Experiment
4.1 Dataset
Our methodology's efficacy was assessed using widely used datasets for each task. We utilize CoNLL2003 (Tjong Kim Sang and De Meulder, 2003) for NER, Twitter-2015 (Zhang et al., 2018) and Twitter-2017 (Lu et al., 2018) for MNER, and Twitter-GMNER (Yu et al., 2023) for GMNER. Details are in appendix B.
4.2 Experimental Setups
Evaluation metrics. To evaluate our method, we use entity-wise F1, precision, and recall scores for the NER and MNER tasks. For the GMNER task, there is an additional evaluation of visual grounding. For instances where the entity is ungroundable, a prediction is correct if it is classified as 'None.' For the others, correctness hinges on the IoU metric: a prediction is considered correct if the IoU score between the predicted visual region and the ground truth bounding boxes exceeds a threshold of 0.5. We use F1, precision, and recall scores, which are calculated based on the aggregate correctness across entity, type, and visual region predictions. Our primary focus is on the F1 score, in line with numerous preceding studies.
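For clarity, the correctness rule used for GMNER evaluation can be written as the following check; the box format, the handling of a single gold box, and the helper names are simplifying assumptions.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def gmner_prediction_correct(pred, gold, iou_threshold=0.5):
    """A prediction counts as correct only if entity span, type, and region all match."""
    if pred["span"] != gold["span"] or pred["type"] != gold["type"]:
        return False
    if gold["box"] is None:                     # ungroundable entity
        return pred["box"] is None              # must be predicted as 'None'
    if pred["box"] is None:
        return False
    return iou(pred["box"], gold["box"]) > iou_threshold
```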
Implementation details. Following the most recent works, we implement our model utilizing RoBERTa-large for NER and XLM-RoBERTa-large (Conneau et al., 2020) for MNER and GMNER, in both stage 1 and stage 2. For the object detector, we use VinVL (Zhang et al., 2021b), following the settings of ITA (Wang et al., 2022b). To address the requirements of visual-language similarity and image captioning, we use the CLIP and BLIP-2 models, respectively. Detailed hyper-parameter settings are shown in the appendix. All experiments were done on a single GeForce RTX 4090 GPU or NVIDIA H100 GPU, and we report the average score from 5 runs with different random seeds for each setting.
We also applied several minor methods to enhance performance. In the second stage, we incorporated a 'non-entity' label to account for instances
| Methods | 2015 PER | 2015 LOC | 2015 ORG | 2015 OTH. | 2015 Pre. | 2015 Rec. | 2015 F1 | 2017 PER | 2017 LOC | 2017 ORG | 2017 OTH. | 2017 Pre. | 2017 Rec. | 2017 F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Text | | | | | | | | | | | | | | |
| BERT-CRF | 85.37 | 81.82 | 63.26 | 44.13 | 75.56 | 73.88 | 74.71 | 90.66 | 84.89 | 83.71 | 66.86 | 86.10 | 83.85 | 84.96 |
| BERT-SPAN (Yamada et al., 2020) | 85.35 | 81.88 | 62.06 | 43.23 | 75.52 | 73.83 | 74.76 | 90.84 | 85.55 | 81.99 | 69.77 | 85.68 | 84.60 | 85.14 |
| RoBERTa-SPAN (Yamada et al., 2020) | 87.20 | 83.58 | 66.33 | 50.66 | 77.48 | 77.43 | 77.45 | 94.27 | 86.23 | 87.22 | 74.94 | 88.71 | 89.44 | 89.06 |
| Vision-LLM (w/ zero-shot) | | | | | | | | | | | | | | |
| Gemini | 73.12 | 65.53 | 35.80 | 20.72 | 48.24 | 64.88 | 55.34 | 84.36 | 71.65 | 61.24 | 22.02 | 64.02 | 69.90 | 66.83 |
| GPT4-V | 80.00 | 75.26 | 40.53 | 25.26 | 51.46 | 70.42 | 59.46 | 85.63 | 78.62 | 73.68 | 36.63 | 67.63 | 74.90 | 71.08 |
| Text+Image | | | | | | | | | | | | | | |
| UMT | 85.24 | 81.58 | 63.03 | 39.45 | 71.67 | 75.23 | 73.41 | 91.56 | 84.73 | 82.24 | 70.10 | 85.28 | 85.34 | 85.31 |
| UMGF (Zhang et al., 2021a) | 84.26 | 83.17 | 62.45 | 42.42 | 74.49 | 75.21 | 74.85 | 91.92 | 85.22 | 83.13 | 69.83 | 86.54 | 84.50 | 85.51 |
| MNER-QG (Jia et al., 2023) | 85.68 | 81.42 | 63.62 | 41.53 | 77.76 | 72.31 | 74.94 | 93.17 | 86.02 | 84.64 | 71.83 | 88.57 | 85.96 | 87.25 |
| R-GCN (Zhao et al., 2022) | 86.36 | 82.08 | 60.78 | 41.56 | 73.95 | 76.18 | 75.00 | 92.86 | 86.10 | 84.05 | 72.38 | 86.72 | 87.53 | 87.11 |
| ITA (Wang et al., 2022b) | - | - | - | - | - | - | 78.03 | - | - | - | - | - | - | 89.75 |
| PromptMNER (Wang et al., 2022d) | - | - | - | - | 78.03 | 79.17 | 78.60 | - | - | - | - | 89.93 | 90.60 | 90.27 |
| CAT-MNER (Wang et al., 2022e) | 88.04 | 84.70 | 68.04 | 52.33 | 78.75 | 78.69 | 78.72 | 94.61 | 88.40 | 88.14 | 80.50 | 90.27 | 90.67 | 90.47 |
| MoRe (Wang et al., 2022a) | - | - | - | - | - | - | 79.21 | - | - | - | - | - | - | 90.67 |
| PGIM (Li et al., 2023b) | 88.34 | 84.22 | 70.15 | 52.34 | 79.21 | 79.45 | 79.33 | 96.46 | 89.89 | 89.03 | 79.62 | 90.86 | 92.01 | 91.43 |
| SCANNER (Ours) | 88.24 | 85.16 | 69.86 | 52.23 | 79.72 | 79.03 | 79.38 | 95.18 | 88.52 | 88.45 | 79.71 | 90.40 | 90.67 | 90.54 |
Table 4: Experiment results on Twitter-2015 and Twitter-2017. The PER, LOC, ORG, and OTH. columns report single-type F1; the Pre., Rec., and F1 columns report overall scores. The results for some baseline methods are taken from Wang et al. (2022e); methods such as PGIM utilize LLMs (of ChatGPT scale) as knowledge sources.
| Methods | CoNLL2003 Pre. | CoNLL2003 Rec. | CoNLL2003 F1 |
|---|---|---|---|
| W2NER (Li et al., 2022) | 92.71 | | 93.07 |
| DiffusionNER (Shen et al., 2023a) | 92.99 | 92.56 | 92.78 |
| PromptNER (Shen et al., 2023b) | 92.96 | 93.18 | 93.08 |
| SCANNER (Ours) | | | |
Table 5: Experiment results on the CoNLL2003.
where the model erroneously predicts entity candidates not present in the dataset. This allowed for more accurate handling of such cases. We augmented the training data with non-entity examples by dividing the training set into four folds in stage 1 and validating each fold. Secondly, we employed adversarial weight perturbation (AWP) (Wu et al., 2020) in stage 1, which enhances the robustness and generalization capabilities of the model. We initiated AWP from an intermediate stage of our training process.
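A compact sketch of AWP as used in stage 1 is given below: weights are perturbed along the gradient direction within a small relative radius, an adversarial loss is backpropagated, and the original weights are restored. This follows the commonly used formulation of Wu et al. (2020); the hyper-parameter values and the choice of which parameters to perturb are illustrative.

```python
import torch

class AWP:
    """Adversarial Weight Perturbation for more robust fine-tuning (sketch)."""

    def __init__(self, model, adv_lr=1e-4, adv_eps=1e-2, target="weight"):
        self.model, self.adv_lr, self.adv_eps, self.target = model, adv_lr, adv_eps, target
        self.backup = {}

    def perturb(self):
        # Save weights, then push them along the gradient direction (relative step).
        for name, p in self.model.named_parameters():
            if p.requires_grad and p.grad is not None and self.target in name:
                self.backup[name] = p.data.clone()
                norm_grad, norm_w = p.grad.norm(), p.data.norm()
                if norm_grad != 0 and not torch.isnan(norm_grad):
                    step = self.adv_lr * p.grad / (norm_grad + 1e-12) * (norm_w + 1e-12)
                    p.data.add_(step)
                    # Keep the perturbation inside a small ball around the original weights.
                    limit = self.adv_eps * (self.backup[name].abs() + 1e-12)
                    p.data = torch.min(torch.max(p.data, self.backup[name] - limit),
                                       self.backup[name] + limit)

    def restore(self):
        for name, p in self.model.named_parameters():
            if name in self.backup:
                p.data = self.backup[name]
        self.backup = {}

# Typical loop (after the normal backward pass on a batch):
#   awp.perturb(); adv_loss = compute_loss(batch); adv_loss.backward(); awp.restore()
```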
4.3 Experimental results in various NER tasks
Experimental results in NER. To evaluate the effectiveness of our approach in NER, we primarily compare our model against the existing methods in Table 5. It shows that SCANNER exhibits competitive performance compared to the existing NER methods.
Experimental results in MNER. In assessing the effectiveness of SCANNER in MNER, we conducted comparative analyses against various leading models in this task. The results, detailed in
| Methods | Pre. | Rec. | F1 |
|---|---|---|---|
| Text | | | |
| HBiLSTM-CRF-None (Lu et al., 2018) | 43.56 | 40.69 | 42.07 |
| BERT-None (Devlin et al., 2019) | 42.18 | 43.76 | 42.96 |
| BERT-CRF-None | 42.73 | 44.88 | 43.78 |
| BARTNER-None (Yan et al., 2021a) | 44.61 | 45.04 | 44.82 |
| Text+Image | | | |
| GVATT-RCNN-EVG (Lu et al., 2018) | 49.36 | 47.80 | 48.57 |
| UMT-RCNN-EVG (Yu et al., 2020) | 49.16 | 51.48 | 50.29 |
| UMT-VinVL-EVG (Yu et al., 2020) | 50.15 | 52.52 | 51.31 |
| UMGF-VinVL-EVG (Zhang et al., 2021a) | 51.62 | 51.72 | 51.67 |
| ITA-VinVL-EVG (Wang et al., 2022b) | 52.37 | 50.77 | 51.56 |
| BARTMNER-VinVL-EVG (Yu et al., 2023) | 52.47 | 52.43 | 52.45 |
| H-Index (Yu et al., 2023) | 56.16 | 56.67 | 56.41 |
| SCANNER (Ours) | | | |
Table 6: Experiment results on the Twitter-GMNER. The reported figures for the baseline models are taken from Yu et al. (2023).
Table 4, reveal that our model achieves superior performance on Twitter-2015 and exhibits markedly impressive results on Twitter-2017. Notably, while PGIM shows outstanding performance on Twitter-2017, it utilizes large language models (LLMs) like ChatGPT, which incurs API costs, a notable drawback. In contrast, our model does not rely on LLM knowledge, freeing it from such disadvantages and demonstrating better performance on Twitter-2015. Additionally, we conduct an experiment using the same LLM knowledge as PGIM, which is presented in Appendix C.
Experimental results in GMNER. To show our effectiveness in GMNER, we make broad
Figure 6: Visualization results showing how various types of knowledge are brought in and utilized differently to perform the MNER task. Knowledge highlighted in blue positively influences correct predictions.
| Methods | Twitter-2015 Pre. | Twitter-2015 Rec. | Twitter-2015 F1 | Twitter-2017 Pre. | Twitter-2017 Rec. | Twitter-2017 F1 |
|---|---|---|---|---|---|---|
| SCANNER | 79.72 | 79.03 | 79.38 | 90.40 | 90.67 | 90.54 |
| - TYT | -0.26 | -0.17 | -0.21 | -0.24 | -0.11 | -0.18 |
| - OBK | +0.11 | -0.60 | -0.26 | -0.23 | -0.22 | -0.22 |
| - WKK | -1.12 | -0.14 | -0.64 | -0.51 | -0.44 | -0.48 |
| - ICK | -0.08 | -0.54 | -0.31 | -0.29 | -0.31 | -0.29 |
Table 7: Ablation studies on the MNER datasets. '- TYT' is without the Trust Your Teacher method. '- OBK' is without object knowledge. '- WKK' is without Wikipedia knowledge. '- ICK' is without image caption knowledge.
comparisons with all existing methods. Text-only models are made to predict all visual groundings as 'None'. Table 6 shows that our model achieves significant performance improvements over prior research and establishes a new, strong baseline for future GMNER studies.
4.4 Ablation study
Ablation study in MNER. We conduct ablation experiments on the MNER task to evaluate the effectiveness of the proposed method. The results are shown in Table 7. We observe that removing the Trust Your Teacher method leads to a decrease in performance. Our proposed distillation method effectively alleviates the dataset noise issue, making our model more robust when learning from noisy datasets. Additionally, to verify the effectiveness of the various types of knowledge used in our study, we compare the results with experiments where each type of knowledge is removed. We confirm that the object knowledge, Wikipedia knowledge, and image caption knowledge used in our paper all contribute to the performance improvement on the MNER task.
Case study. As shown in Fig. 6, all three types of knowledge can be utilized as useful information for named entity recognition. In the case of the first image, knowledge from Wikipedia such as "American retail company" and object knowledge containing the logo information of "Kroger" both help in predicting the "Kroger" entity as an organization. For the image on the bottom left, image caption and object knowledge aided in named entity recognition. Moreover, in the image on the bottom right, vision information like image caption and object knowledge led to incorrect entity recognition results, but it was corrected through external knowledge from Wikipedia. Thus, the three types of knowledge proposed in this paper complement each other, enabling accurate MNER performance.
Effectiveness on unseen entities. Table 8 shows the effectiveness of knowledge on unseen entities. As
| Datasets | w/o Knowledge: Seen | w/o Knowledge: Unseen | w/ Knowledge: Seen | w/ Knowledge: Unseen |
|---|---|---|---|---|
| CoNLL2003 | 96.29 | 89.68 | 96.35 | 89.70 |
| Twitter-2015 | 87.18 | 73.84 | 87.50 | 75.45 |
| Twitter-2017 | 95.68 | 82.96 | 95.90 | 83.71 |
Table 8: Test F1 scores on seen and unseen entities, comparing the model with extracted knowledge against the baseline without it.
SCANNER utilizes various knowledge sources in MNER, it greatly increases performance on unseen entities. In NER, where no image is available, less of this knowledge can be leveraged, so the performance improves only slightly.
5 Conclusions
We introduce SCANNER, a novel approach for performing NER tasks by utilizing knowledge from various sources. To efficiently fetch diverse knowledge, SCANNER employs a two-stage structure, which first detects entity candidates and then performs named entity recognition and visual grounding on these candidates. Additionally, we propose a novel distillation method, which robustly trains the model against dataset noise, demonstrating superior performance in various NER benchmarks. We believe that our method can be easily extended to utilize knowledge from multiple sources that were not covered in this paper.
Limitations
In this study, we extract knowledge from various sources and utilize it to perform MNER tasks. By leveraging several vision experts such as CLIP, and also fetching external knowledge, our method takes relatively longer inference time compared to approaches that do not use knowledge. However, the use of vision experts and knowledge is essential for an MNER model that functions well even with unseen entities, and we efficiently extract information through a two-stage structure.
Ethics statement
All experimental results we provide in this paper are based on publicly available datasets and open-source models.
Acknowledgments
This work was partly supported by the Artificial Intelligence Industrial Convergence Cluster Development project funded by the Ministry of Science and ICT (MSIT, Korea) & Gwangju Metropolitan City, the National Research Foundation of Korea (NRF) grant (RS-2023-00213710, Neural Network Optimization with Minimal Optimization Costs), and the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01906, Artificial Intelligence Graduate School Program (POSTECH)).
References
F. Chen, J. Liu, K. Ji, W. Ren, J. Wang, and J. Chen. 2023. Learning implicit entity-object relations by bidirectional generative alignment for multimodal NER.
X. Chen, N. Zhang, L. Li, S. Deng, C. Tan, C. Xu, F. Huang, L. Si, and H. Chen. 2022. Hybrid transformer with multi-level fusion for multimodal knowledge graph completion. In SIGIR.
A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In ACL.
J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
M. Jia, L. Shen, X. Shen, L. Liao, M. Chen, X. He, Z. Chen, and J. Li. 2023. MNER-QG: An end-to-end MRC framework for multimodal named entity recognition with query grounding. In AAAI.
J. Li, H. Fei, J. Liu, S. Wu, M. Zhang, C. Teng, D. Ji, and F. Li. 2022. Unified named entity recognition as word-word relation classification. In AAAI.
J. Li, D. Li, S. Savarese, and S. Hoi. 2023a. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
J. Li, H. Li, Z. Pan, and G. Pan. 2023b. Prompt ChatGPT in MNER: Improved multimodal named entity recognition method based on auxiliary refining knowledge from ChatGPT. arXiv preprint arXiv:2305.12212.
J. Li, A. Sun, J. Han, and C. Li. 2020. A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 34.
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
I. Loshchilov and F. Hutter. 2019. Decoupled weight decay regularization. In ICLR.
D. Lu, L. Neves, V. Carvalho, N. Zhang, and H. Ji. 2018. Visual attention model for name tagging in multimodal social media. In ACL.
J. Lu, D. Zhang, J. Zhang, and P. Zhang. 2022. Flat multi-modal interaction transformer for named entity recognition. In COLING.
M. E. Peters, W. Ammar, C. Bhagavatula, and R. Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. 2021. Learning transferable visual models from natural language supervision. In ICML.
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV.
Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang. 2023a. DiffusionNER: Boundary diffusion for named entity recognition. In ACL.