SCANNER: Knowledge-Enhanced Approach for Robust Multi-modal Named Entity Recognition of Unseen Entities
Hyunjong Ok, Taeho Kil, Sukmin Seo, Jaeho Lee
POSTECH, NAVER Cloud, HJ AILAB
hyunjong.ok@gmail.com, taeho.kil@navercorp.com
Abstract
Recent advances in named entity recognition (NER) have pushed the boundary of the task to incorporate visual signals, leading to many variants, including multi-modal NER (MNER) or grounded MNER (GMNER). A key challenge to these tasks is that the model should be able to generalize to entities unseen during training, and should be able to handle training samples with noisy annotations. To address this obstacle, we propose SCANNER (Span CANdidate detection and recognition for NER), a model capable of effectively handling all three NER variants. SCANNER has a two-stage structure; we extract entity candidates in the first stage and use them as queries to retrieve knowledge, effectively pulling knowledge from various sources. We can boost our performance by utilizing this entity-centric extracted knowledge to address unseen entities. Furthermore, to tackle the challenges arising from noisy annotations in NER datasets, we introduce a novel self-distillation method, enhancing the robustness and accuracy of our model in processing training data with inherent uncertainties. Our approach demonstrates competitive performance on the NER benchmark and surpasses existing methods on both MNER and GMNER benchmarks. Further analysis shows that the proposed distillation and knowledge utilization methods improve the performance of our model on various benchmarks.
1 Introduction
Named entity recognition (NER) is a fundamental task in natural language processing to identify textual spans that correspond to named entities in the given text, and classify them into pre-defined categories, such as persons, locations, and organizations (Li et al., 2020). The extracted information can be utilized for various downstream tasks, including entity linking and relation extraction.
Figure 1: Illustrations of the NER, MNER, and GMNER tasks. The NER task aims to identify named entities from the given text. MNER extends this task to utilize additional image information. GMNER additionally requires the model to predict entity bounding boxes in the given image, if they are present.
The rapid growth of the amount of multi-modal contents on social media platforms has given rise to the multi-modal variants of NER. The most prominent example is multi-modal NER (MNER; Zhang et al. (2018)), which extends traditional NER to identifying named entities in the text based on additional image input paired with the text (Fig. 1b). Another recent example is the grounded MNER (GMNER; Yu et al. (2023)); here, one additionally aims to predict the bounding boxes of named entities appearing in the given image (Fig. 1c).
A major challenge in NER, MNER, and GMNER tasks is the presence of unseen entities in the test datasets, which are not found in the training datasets. Traditional models often struggle with low performance on these unseen entities (see Table 1). To tackle this problem effectively, it is important to use knowledge about unseen entities in a way that boosts the ability of the model to generalize
| Datasets | Methods | Seen entities | Unseen entities |
|---|---|---|---|
| CoNLL2003 | BERT-base | 93.78 | 80.90 |
| CoNLL2003 | Ours (w/o Knowledge) | 96.29 | 89.68 |
| Twitter-2015 | BERT-base | 79.81 | 57.81 |
| Twitter-2015 | Ours (w/o Knowledge) | 87.18 | 73.84 |
| Twitter-2017 | BERT-base | 93.81 | 67.76 |
| Twitter-2017 | Ours (w/o Knowledge) | 95.68 | 82.96 |

Table 1: A comparison of test F1 scores for the named entities that have appeared at least once in the training dataset, versus the entities that have not appeared.
Figure 2: 'Kroger' is an unseen entity that is hard to recognize as an Organization or Location. With the knowledge retrieved by our model, it is predicted correctly.
and perform well across different types of data. In this paper, we introduce SCANNER, which stands for Span CANdidate detection and recognition for Named Entity Recognition. Our approach is designed to effectively use knowledge about unseen entities, addressing the NER, MNER, and GMNER tasks with improved robustness. SCANNER adopts a two-stage structure, comprising a span candidate detection module and an entity recognition module. The span candidate detection module identifies named entity candidates within sentences. Following this, the entity recognition module uses these candidates as queries to extract relevant knowledge from various sources, effectively recognizing the class of each entity candidate. As illustrated in Fig. 2, we were able to accurately identify 'Kroger' as an 'organization' by utilizing object knowledge. SCANNER effectively gathers and uses knowledge from various sources, boosting its performance on the challenging NER, MNER, and GMNER benchmarks. Notably, the GMNER challenge involves the intricate process of identifying entities and determining their bounding boxes within images. The architecture of SCANNER, leveraging its comprehensive knowledge, is effective in addressing the GMNER task. The effectiveness of SCANNER in the GMNER task is highlighted by establishing a new baseline that is substantially higher than the previous
Table 2: Examples of gold annotations and potential alternatives. The gold annotations are marked in blue brackets, whereas the alternative annotations are in red brackets.
standard, as measured by the F1 score. Additionally, we introduce a novel self-distillation method called Trust Your Teacher. The NER task faces challenges with noisy annotations (Wang et al., 2019; Zhu and Li, 2022), particularly at entity boundaries, where exact span matching is crucial and ambiguity often leads to increased noise (see Table 2). Our distillation method, which softly utilizes both the prediction of the teacher model and the ground truth (GT) labels, addresses the challenges of noisy annotations.
Our approach demonstrates competitive performance on NER and surpasses existing methods on both MNER and GMNER. Further analysis shows that the proposed distillation and knowledge utilization methods improve the performance of our model on various benchmarks.
The contributions of SCANNER are summarized in three key aspects:
We propose a new distillation method that softly blends the predictions of the teacher model with ground truth annotations to enhance data quality and model training.
We develop SCANNER, a two-stage structured model that effectively utilizes knowledge to improve performance, particularly in recognizing unseen entities.
The SCANNER model shows competitive performance in NER benchmarks and demonstrates higher performance than existing methods in MNER and GMNER benchmarks.
2 Related work
Prior works on MNER typically operate by first extracting NER-related features from the image, and then combining these features with text features to recognize named entities. Roughly, existing
works fall into two categories according to how they extract image features.
Textual features. Several works extract textual metadata from the given image and utilize it as features for the subsequent NER task (Wang et al., 2022b,a; Li et al., 2023b). For instance, ITA (Wang et al., 2022b) extracts object tags, image captions, and OCR results from the given image. Similarly, Li et al. (2023b) also extract image captions, but additionally utilize a large language model as an implicit knowledge source to further refine the features. MoRe (Wang et al., 2022a) takes a slightly different approach, using an image-based retrieval system to retrieve textual descriptions of the closest images in the database.
Visual encoders. Another line of work attempts to extract the image features using a visual encoder, such as pre-trained ResNets, ViTs, or CLIP vision encoder (Wang et al., 2022e; Zhang et al., 2023; Chen et al., 2023). The extracted features are then combined with the text features extracted from a separate text encoder, which often involves additional alignment via cross-modal attention (Chen et al., 2022; Lu et al., 2022; Wang et al., 2022e; Zhang et al., 2023; Chen et al., 2023). Notably, PromptMNER (Wang et al., 2022d) calculates the similarity between visual features and various text prompts to extract visual cues that are loosely related to the input text.
In this paper, we take a different path and extract the image features conditioned on the information extracted from the given text. To the best of our knowledge, although related attempts exist in NER (Wang et al., 2021, 2022c; Tan et al., 2023), this is the first such attempt in the context of MNER, which is a more challenging task.
In addition, a new task has been introduced, which not only incorporates image inputs but also actively addresses the task of grounding entity locations within images (Yu et al., 2023).
3 Method
In this section, we first introduce the architecture of the proposed method, which comprises the span candidate detection module and the named entity recognition module (Sec. 3.1). Then, we describe the named entity recognition module, which performs entity recognition and visual grounding in the image for each entity candidate (Sec. 3.2). Finally, we explain a novel distillation method, named Trust Your Teacher, which is designed to
Figure 3: The overall architecture of the proposed SCANNER method. The two-stage structure allows for efficient extraction and utilization of knowledge, as knowledge is extracted only for those entity candidates that were filtered through in stage 1.
robustly train our model even in the presence of noisy dataset annotations (Sec. 3.3).
3.1 SCANNER Architecture
The primary focus of this paper is to perform MNER using both knowledge extracted from within images and external knowledge, even for entities not encountered during training. To achieve this, as illustrated in Fig. 3, we propose a two-stage architecture, known for its efficiency in extracting and searching for knowledge from various sources. In the first stage, we extract named entity candidates, and in the second stage, we efficiently search and extract only knowledge relevant to these candidates. This acquired knowledge is then utilized for entity recognition.
Stage 1: Span Candidate Detection Module. In the first stage of SCANNER, a transformer encoder (Liu et al., 2019) is employed to detect entity candidates from the input text. During this phase, we utilize BIO (Beginning, Inside, Outside) tagging to classify each token in the input text, determining whether it corresponds to the beginning, inside, or outside of an entity span. The classification process is guided by cross-entropy loss.
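To make the stage 1 procedure concrete, below is a minimal sketch of span candidate detection as token-level BIO classification; the encoder checkpoint, label set, and offset-based span decoding are illustrative assumptions rather than our exact training code.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative label set: stage 1 only decides span boundaries, not entity types.
LABELS = ["O", "B", "I"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large")
# In practice this head is fine-tuned with token-level cross-entropy on BIO labels.
model = AutoModelForTokenClassification.from_pretrained(
    "xlm-roberta-large", num_labels=len(LABELS)
)

def detect_span_candidates(sentence: str):
    """Return text spans tagged as entity candidates under greedy BIO decoding."""
    enc = tokenizer(sentence, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0].tolist()
    with torch.no_grad():
        logits = model(**enc).logits[0]                 # (num_tokens, 3)
    tags = [LABELS[i] for i in logits.argmax(-1).tolist()]

    spans, start, prev_end = [], None, None
    for (s, e), tag in zip(offsets, tags):
        if s == e:                                      # special tokens
            continue
        if tag == "B":
            if start is not None:
                spans.append((start, prev_end))
            start = s
        elif tag == "O" and start is not None:
            spans.append((start, prev_end))
            start = None
        prev_end = e
    if start is not None:
        spans.append((start, prev_end))
    return [sentence[s:e] for s, e in spans]
```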
Stage 2: Entity Recognition Module. In Stage 2, SCANNER performs named entity recognition and visual grounding for each entity candidate detected in Stage 1. It utilizes each entity candidate as a query to extract and leverage the necessary knowledge for the tasks. During this process, SCANNER efficiently searches and extracts knowledge by focusing on the initially detected entity candidates rather than the entire input text. SCANNER utilizes both internal (image-based) and external (e.g., Wikipedia) knowledge sources to perform
Figure 4: An illustration of the entity recognition module (stage 2). Based on the entity candidates (extracted in stage 1), SCANNER utilizes various knowledge sources such as Wikipedia, an image captioner, and an object knowledge extractor. The knowledge collected from these sources is then processed by RoBERTa to give the final prediction.
MNER on unseen entities not encountered in training. Detailed information about these modules is provided in Section 3.2.
3.2 Entity Recognition Module
For each entity candidate identified by the span candidate detection module, the entity recognition module processes a text prompt that includes both the entity candidate and associated knowledge. This knowledge, extracted from images and external knowledge sources, allows for performing MNER on unseen entities that were not encountered during training. Our methodology involves extracting this knowledge from a variety of sources, utilizing the identified entity candidates as the basis for the extraction process. Then, this module classifies the class of each entity candidate and performs grounding to determine which object in the image corresponds to the entity. A detailed illustration is shown in Fig. 4.
3.2.1 Prompt construction with knowledge
The entity recognition module extracts and utilizes useful knowledge from various sources when constructing the text prompt corresponding to the input. The knowledge applied for constructing text prompts in our method includes the following.
Wikipedia knowledge. Initially, information is retrieved by using the entity candidate as a query against an external knowledge source, Wikipedia. This information can be valuable for classifying the type of each candidate entity and, moreover, enables the model to classify unseen entities that were not encountered during training. As illustrated in Fig. 4, for entity candidates like 'Steve Kerr', it enhances entity recognition performance by providing valuable information for classification, such as his being an American basketball player and coach.
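As a rough illustration of this retrieval step, the sketch below queries Wikipedia with the entity candidate using the public `wikipedia` package; the choice of search backend and the number of retained sentences are assumptions, since the paper does not prescribe a specific retrieval interface.

```python
import wikipedia  # pip install wikipedia

def fetch_wikipedia_knowledge(entity_candidate: str, max_sentences: int = 2) -> str:
    """Use the entity candidate as a query and return a short summary as knowledge."""
    try:
        # Take the top-ranked page for the query and keep a short summary.
        titles = wikipedia.search(entity_candidate, results=1)
        if not titles:
            return ""
        return wikipedia.summary(titles[0], sentences=max_sentences, auto_suggest=False)
    except Exception:
        return ""  # disambiguation errors, missing pages, network issues, etc.

# fetch_wikipedia_knowledge("Steve Kerr") would return a short biography
# that can be appended to the stage-2 text prompt as external knowledge.
```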
Image caption. To effectively utilize visual information, image captioning results are also used. We use BLIP-2 (Li et al., 2023a) to extract synthetic captions for the whole image.
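A hedged sketch of this caption extraction with BLIP-2 via Hugging Face Transformers follows; the checkpoint name, device placement, and generation length are illustrative choices rather than our exact configuration.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
).to("cuda")

def caption_image(image_path: str) -> str:
    """Generate a synthetic caption for the whole image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```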
Object knowledge. In addition to global information about the image, object-level information is also beneficial for entity recognition. To achieve this, results obtained from the object detector are employed as knowledge. Initially, object classes are converted into text format and used as knowledge. Then, synthetic captions for each object region are also utilized in conjunction with the class names. This information is structured as details corresponding to each object, along with a special token denoted as [obj], as shown in Fig. 4. Additionally, during this process, the visual-language similarity between each object and the entity candidate is calculated, and objects are arranged in order of high similarity, which is then included in the text prompt. One of the problems with existing
methods for the MNER task is that the model sometimes references objects in the image that are irrelevant to the entity, leading to incorrect recognition. By arranging the object details in the text prompt according to the visual-language similarity order with the entity, our model can focus more on the object regions that are highly related to the entity. In this paper, CLIP (Radford et al., 2021) is employed for visual-language similarity, specifically calculating the similarity between the text representation of the entity candidate and the visual representation of each Region of Interest (RoI).
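The similarity-based ordering of object knowledge can be sketched as follows; the CLIP checkpoint and the crop-based RoI encoding are assumptions made purely for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
clip_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_objects_by_entity(image: Image.Image, boxes, entity_candidate: str):
    """Sort detected object boxes by CLIP similarity to the entity candidate text."""
    crops = [image.crop(box) for box in boxes]           # RoIs from the object detector
    inputs = clip_proc(
        text=[entity_candidate], images=crops, return_tensors="pt", padding=True
    )
    with torch.no_grad():
        text_emb = clip.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
        img_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    sims = (img_emb @ text_emb.T).squeeze(-1)            # one similarity per RoI
    order = sims.argsort(descending=True).tolist()
    return [boxes[i] for i in order], sims[order].tolist()
```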
All such knowledge mentioned above is converted into a textual format and integrated with the text prompt for entity recognition and visual grounding.
The text prompt, structured to include the entity candidate, the entire input text sentence, and the extracted knowledge, is presented as "The entity is [mask] for {entity} in this sentence. {original sentence} {Wikipedia} {image caption} [obj] {object 1} [obj] {object 2} ..."
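Putting the pieces together, the prompt for a single entity candidate can be assembled roughly as below; the helper name, argument types, and joining behavior are our own assumptions, while the field order follows the template quoted above.

```python
def build_prompt(entity: str, sentence: str, wiki: str, caption: str, objects: list) -> str:
    """Assemble the stage-2 text prompt for a single entity candidate.

    `objects` holds textual object knowledge (class name plus region caption),
    already sorted by CLIP similarity to the entity candidate.
    """
    parts = [
        f"The entity is [mask] for {entity} in this sentence.",
        sentence,
        wiki,
        caption,
    ]
    parts += [f"[obj] {obj}" for obj in objects]
    return " ".join(p for p in parts if p)

# Example with hypothetical values:
# build_prompt("Steve Kerr", "Steve Kerr talks with his players.",
#              "Stephen Kerr is an American basketball coach ...",
#              "a man talking to basketball players",
#              ["person: a man in a suit", "person: a basketball player"])
```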
3.2.2 Encoder and Objective
The prompts constructed for each entity candidate are input into a transformer encoder model (Liu et al., 2019). For entity recognition, the output token representation of the [mask] token in the text prompt $t_i$ for the $i$-th entity candidate is fed into a linear layer to predict the probability distribution $p_i$. Given the ground truth $y_i$, the objective function is to minimize the cross-entropy loss between the predicted entity class distribution and the ground truth label:

$$\mathcal{L}_{\mathrm{cls}} = -\frac{1}{N}\sum_{i=1}^{N} y_i \log p_i,$$

where $N$ is the total number of entity candidates.
Additionally, visual grounding is performed by feeding the output token representation of the $j$-th [obj] token from the text prompt $t_i$ into a linear layer. This is followed by a sigmoid function, which aids in predicting the overlap score $s_{ij}$ between the ground truth image region grounding entity candidate $i$ and object $j$. The objective function of visual grounding is calculated based on the binary cross-entropy loss between the overlap score and the ground truth Intersection over Union (IoU):

$$\mathcal{L}_{\mathrm{vg}} = -\frac{1}{NM}\sum_{i=1}^{N}\sum_{j=1}^{M}\Big(u_{ij}\log s_{ij} + (1-u_{ij})\log\big(1-s_{ij}\big)\Big),$$

where $u_{ij}$ is the ground truth IoU between the ground truth image region of entity $i$ and object region $j$, and $M$ is the number of detected objects.
In the training stage, we combine the two losses as the final loss of our model:

$$\mathcal{L} = \mathcal{L}_{\mathrm{cls}} + \lambda\,\mathcal{L}_{\mathrm{vg}},$$

where $\lambda$ is a weighting coefficient; we set $\lambda$ to 1 for the GMNER task and to 0 for the NER and MNER tasks in this paper.
3.3 Trust Your Teacher
We introduce a novel self-distillation method called Trust Your Teacher (TYT). Our distillation method, which softly utilizes both the prediction of the teacher model and the ground truth (GT) label, addresses the challenges of noisy annotations. First, we train the teacher model with the classification objective $\mathcal{L}_{\mathrm{cls}}$, and then train the final student model using both the predictions of the teacher model and the ground truth labels. The most significant feature of our proposed method is that it assesses the reliability of each sample by utilizing the prediction of the teacher model to determine whether the sample is trustworthy or noisy. Based on this assessment, the method sets the weights between the model prediction and the GT label, which are then reflected in the loss calculation. The objective of our proposed distillation method composes a cross-entropy loss with the ground truth and a Kullback-Leibler Divergence (KLD) loss with the teacher predictions:

$$\mathcal{L}_{\mathrm{TYT}} = \alpha\,\mathrm{CE}\big(p_s(x;\theta_s),\,y\big) + (1-\alpha)\,\mathrm{KL}\big(p_t(x;\theta_t)\,\big\|\,p_s(x;\theta_s)\big),$$

where $x$ is the input sample, $\theta_s$ and $\theta_t$ are the model parameters of the student and teacher, $p_s$ and $p_t$ are the prediction distributions of the student and teacher, and $\alpha$ is a balancing factor proposed in this paper. In detail, $\alpha$ determines whether to trust the teacher model prediction or the ground truth, and it represents the prediction score of the teacher model for the ground truth class index, i.e., $\alpha = p_t(x;\theta_t)[y]$. This implies that, since the teacher model is well-trained, if the score for the ground
Figure 5: Experiments on the text classification task on the MNLI dataset. 'matched' is in-domain, and 'mismatched' is out-of-domain.
| Methods | Twitter-2015 Pre. | Twitter-2015 Rec. | Twitter-2015 F1 | Twitter-2017 Pre. | Twitter-2017 Rec. | Twitter-2017 F1 | Twitter-GMNER Pre. | Twitter-GMNER Rec. | Twitter-GMNER F1 |
|---|---|---|---|---|---|---|---|---|---|
| Base | 83.28 | 87.68 | 85.43 | | 93.23 | 92.08 | 86.60 | 87.59 | 87.09 |
| Half | 83.36 | 87.69 | 85.47 | 90.31 | 92.92 | 91.60 | | 87.96 | 87.47 |
| Full | | 87.72 | 85.63 | 90.53 | 92.95 | 91.72 | 86.90 | 87.80 | 87.35 |
| TYT | 83.59 | | | 90.94 | | | 86.82 | | |
Table 3: Ablation study on the MNER and GMNER datasets in the first stage. 'Half' is when $\alpha$ is fixed at 0.5 and 'Full' is when $\alpha$ is 0. In 'TYT', $\alpha$ is adjusted through the Trust Your Teacher method.
truth class is high, then the sample is considered reliable and more weight is given to the cross-entropy with the ground truth label. Conversely, if the score is low, the sample is assumed to be an unreliable, noisy sample, and more weight is placed on the KLD loss with the prediction of the teacher model, rather than the ground truth label.
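Below is a minimal PyTorch sketch of the TYT objective described above: the weight on the ground-truth cross-entropy is the teacher's probability for the gold class, and the remaining weight goes to a KL term toward the teacher distribution. The function and variable names, and the mean reduction, are our own assumptions.

```python
import torch
import torch.nn.functional as F

def trust_your_teacher_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            labels: torch.Tensor) -> torch.Tensor:
    """TYT loss: per-sample blend of CE with the label and KLD with the teacher."""
    log_p_s = F.log_softmax(student_logits, dim=-1)
    p_t = F.softmax(teacher_logits, dim=-1).detach()      # teacher is frozen

    # alpha_i = teacher's confidence in the ground-truth class of sample i
    alpha = p_t.gather(-1, labels.unsqueeze(-1)).squeeze(-1)

    ce = F.nll_loss(log_p_s, labels, reduction="none")            # trust the label
    kld = F.kl_div(log_p_s, p_t, reduction="none").sum(-1)        # trust the teacher

    return (alpha * ce + (1.0 - alpha) * kld).mean()
```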
To demonstrate the significant impact of our TYT approach, we carried out additional experiments. Fig. 5 illustrates our experiments on a text classification task on the MNLI dataset. We extract a subset of the train set for experimental efficiency and intentionally add label noise at several rates to this subset. We then compare the performance of the model trained with our TYT method on the train set with added label noise against the baseline that does not use distillation. Fig. 5 indicates that using TYT demonstrates relatively robust performance under moderate noise conditions. Additionally, we compare our method with conventional soft distillation methods that do not dynamically vary the parameter $\alpha$ in the entity detection task, stage 1 of MNER and GMNER. Table 3 shows that our method has better performance on the MNER and GMNER benchmarks, and adaptively varying $\alpha$ is more effective than keeping it fixed.
We apply TYT to both stages 1 and 2, but in
NER, we only use it in stage 1. The TYT loss is applied only to the classification loss and not to the visual grounding loss.
4 Experiment
4.1 Dataset
Our methodology's efficacy was assessed using widely used datasets for each task. We utilize CoNLL2003 (Tjong Kim Sang and De Meulder, 2003) for NER, Twitter-2015 (Zhang et al., 2018) and Twitter-2017 (Lu et al., 2018) for MNER, and Twitter-GMNER (Yu et al., 2023) for GMNER. Details are in appendix B.
4.2 Experimental Setups
Evaluation metrics. To evaluate our method, we use entity-wise F1, precision, and recall scores for the NER and MNER tasks. For the GMNER task, there is an additional evaluation of visual grounding. For instances where the entity is ungroundable, a prediction is correct if it is classified as 'None.' For the others, correctness hinges on the IoU metric: a prediction is considered correct if the IoU score between the predicted visual region and the ground truth bounding boxes exceeds a threshold of 0.5. We use F1, precision, and recall scores, which are calculated based on the aggregate correctness across entity, type, and visual region predictions. Our primary focus is on the F1 score, in line with numerous preceding studies.
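For clarity, the correctness rule used for GMNER evaluation can be written as the following check; the box format, the handling of a single gold box, and the helper names are simplifying assumptions.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def gmner_prediction_correct(pred, gold, iou_threshold=0.5):
    """A prediction counts as correct only if entity span, type, and region all match."""
    if pred["span"] != gold["span"] or pred["type"] != gold["type"]:
        return False
    if gold["box"] is None:                     # ungroundable entity
        return pred["box"] is None              # must be predicted as 'None'
    if pred["box"] is None:
        return False
    return iou(pred["box"], gold["box"]) > iou_threshold
```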
Implementation details. Following the most recent works, we implement our model utilizing RoBERTa-large for NER and XLM-RoBERTa-large (Conneau et al., 2020) for MNER and GMNER, in both stage 1 and stage 2. For the object detector, we use VinVL (Zhang et al., 2021b), following the settings of ITA (Wang et al., 2022b). To address the requirements of visual-language similarity and image captioning, we use the CLIP and BLIP-2 models, respectively. Detailed hyper-parameter settings are shown in the appendix. All experiments were done on a single GeForce RTX 4090 GPU or NVIDIA H100 GPU, and we report the average score from 5 runs with different random seeds for each setting.
We also applied several minor methods to enhance performance. In the second stage, we incorporated a 'non-entity' label to account for instances
| Methods | 2015 PER | 2015 LOC | 2015 ORG | 2015 OTH. | 2015 Pre. | 2015 Rec. | 2015 F1 | 2017 PER | 2017 LOC | 2017 ORG | 2017 OTH. | 2017 Pre. | 2017 Rec. | 2017 F1 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Text | | | | | | | | | | | | | | |
| BERT-CRF | 85.37 | 81.82 | 63.26 | 44.13 | 75.56 | 73.88 | 74.71 | 90.66 | 84.89 | 83.71 | 66.86 | 86.10 | 83.85 | 84.96 |
| BERT-SPAN (Yamada et al., 2020) | 85.35 | 81.88 | 62.06 | 43.23 | 75.52 | 73.83 | 74.76 | 90.84 | 85.55 | 81.99 | 69.77 | 85.68 | 84.60 | 85.14 |
| RoBERTa-SPAN (Yamada et al., 2020) | 87.20 | 83.58 | 66.33 | 50.66 | 77.48 | 77.43 | 77.45 | 94.27 | 86.23 | 87.22 | 74.94 | 88.71 | 89.44 | 89.06 |
| Vision-LLM (w/ zero-shot) | | | | | | | | | | | | | | |
| Gemini | 73.12 | 65.53 | 35.80 | 20.72 | 48.24 | 64.88 | 55.34 | 84.36 | 71.65 | 61.24 | 22.02 | 64.02 | 69.90 | 66.83 |
| GPT4-V | 80.00 | 75.26 | 40.53 | 25.26 | 51.46 | 70.42 | 59.46 | 85.63 | 78.62 | 73.68 | 36.63 | 67.63 | 74.90 | 71.08 |
| Text+Image | | | | | | | | | | | | | | |
| UMT | 85.24 | 81.58 | 63.03 | 39.45 | 71.67 | 75.23 | 73.41 | 91.56 | 84.73 | 82.24 | 70.10 | 85.28 | 85.34 | 85.31 |
| UMGF (Zhang et al., 2021a) | 84.26 | 83.17 | 62.45 | 42.42 | 74.49 | 75.21 | 74.85 | 91.92 | 85.22 | 83.13 | 69.83 | 86.54 | 84.50 | 85.51 |
| MNER-QG (Jia et al., 2023) | 85.68 | 81.42 | 63.62 | 41.53 | 77.76 | 72.31 | 74.94 | 93.17 | 86.02 | 84.64 | 71.83 | 88.57 | 85.96 | 87.25 |
| R-GCN (Zhao et al., 2022) | 86.36 | 82.08 | 60.78 | 41.56 | 73.95 | 76.18 | 75.00 | 92.86 | 86.10 | 84.05 | 72.38 | 86.72 | 87.53 | 87.11 |
| ITA (Wang et al., 2022b) | - | - | - | - | - | - | 78.03 | - | - | - | - | - | - | 89.75 |
| PromptMNER (Wang et al., 2022d) | - | - | - | - | 78.03 | 79.17 | 78.60 | - | - | - | - | 89.93 | 90.60 | 90.27 |
| CAT-MNER (Wang et al., 2022e) | 88.04 | 84.70 | 68.04 | 52.33 | 78.75 | 78.69 | 78.72 | 94.61 | 88.40 | 88.14 | 80.50 | 90.27 | 90.67 | 90.47 |
| MoRe (Wang et al., 2022a) | - | - | - | - | - | - | 79.21 | - | - | - | - | - | - | 90.67 |
| PGIM (Li et al., 2023b) | 88.34 | 84.22 | 70.15 | 52.34 | 79.21 | 79.45 | 79.33 | 96.46 | 89.89 | 89.03 | 79.62 | 90.86 | 92.01 | 91.43 |
| SCANNER (Ours) | 88.24 | 85.16 | 69.86 | 52.23 | 79.72 | 79.03 | 79.38 | 95.18 | 88.52 | 88.45 | 79.71 | 90.40 | 90.67 | 90.54 |
Table 4: Experiment results on Twitter-2015 and Twitter-2017. The PER, LOC, ORG, and OTH. columns report single-type F1; the Pre., Rec., and F1 columns report overall scores. The results for some baseline methods are taken from Wang et al. (2022e); methods such as PGIM utilize LLMs (of ChatGPT scale) as knowledge sources.
| Methods | CoNLL2003 Pre. | CoNLL2003 Rec. | CoNLL2003 F1 |
|---|---|---|---|
| W2NER (Li et al., 2022) | 92.71 | | 93.07 |
| DiffusionNER (Shen et al., 2023a) | 92.99 | 92.56 | 92.78 |
| PromptNER (Shen et al., 2023b) | 92.96 | 93.18 | 93.08 |
| SCANNER (Ours) | | | |
Table 5: Experiment results on the CoNLL2003.
where the model erroneously predicts entity candidates not present in the dataset. This allowed for more accurate handling of such cases. We augmented the training data with non-entity examples by dividing the training set into four folds in stage 1 and validating each fold. Secondly, we employed adversarial weight perturbation (AWP) (Wu et al., 2020) in stage 1, which enhances the robustness and generalization capabilities of the model. We initiated AWP from an intermediate stage of our training process.
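A compact sketch of AWP as used in stage 1 is given below: weights are perturbed along the gradient direction within a small relative radius, an adversarial loss is backpropagated, and the original weights are restored. This follows the commonly used formulation of Wu et al. (2020); the hyper-parameter values and the choice of which parameters to perturb are illustrative.

```python
import torch

class AWP:
    """Adversarial Weight Perturbation for more robust fine-tuning (sketch)."""

    def __init__(self, model, adv_lr=1e-4, adv_eps=1e-2, target="weight"):
        self.model, self.adv_lr, self.adv_eps, self.target = model, adv_lr, adv_eps, target
        self.backup = {}

    def perturb(self):
        # Save weights, then push them along the gradient direction (relative step).
        for name, p in self.model.named_parameters():
            if p.requires_grad and p.grad is not None and self.target in name:
                self.backup[name] = p.data.clone()
                norm_grad, norm_w = p.grad.norm(), p.data.norm()
                if norm_grad != 0 and not torch.isnan(norm_grad):
                    step = self.adv_lr * p.grad / (norm_grad + 1e-12) * (norm_w + 1e-12)
                    p.data.add_(step)
                    # Keep the perturbation inside a small ball around the original weights.
                    limit = self.adv_eps * (self.backup[name].abs() + 1e-12)
                    p.data = torch.min(torch.max(p.data, self.backup[name] - limit),
                                       self.backup[name] + limit)

    def restore(self):
        for name, p in self.model.named_parameters():
            if name in self.backup:
                p.data = self.backup[name]
        self.backup = {}

# Typical loop (after the normal backward pass on a batch):
#   awp.perturb(); adv_loss = compute_loss(batch); adv_loss.backward(); awp.restore()
```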
4.3 Experimental results in various NER tasks
Experimental results in NER. To evaluate the effectiveness of our approach in NER, we primarily compare our model against the existing methods in Table 5. It shows that SCANNER exhibits competitive performance compared to the existing NER methods.
Experimental results in MNER. In assessing the effectiveness of SCANNER in MNER, we conducted comparative analyses against various leading models in this task. The results, detailed in
| Methods | Pre. | Rec. | F1 |
|---|---|---|---|
| Text | | | |
| HBiLSTM-CRF-None (Lu et al., 2018) | 43.56 | 40.69 | 42.07 |
| BERT-None (Devlin et al., 2019) | 42.18 | 43.76 | 42.96 |
| BERT-CRF-None | 42.73 | 44.88 | 43.78 |
| BARTNER-None (Yan et al., 2021a) | 44.61 | 45.04 | 44.82 |
| Text+Image | | | |
| GVATT-RCNN-EVG (Lu et al., 2018) | 49.36 | 47.80 | 48.57 |
| UMT-RCNN-EVG (Yu et al., 2020) | 49.16 | 51.48 | 50.29 |
| UMT-VinVL-EVG (Yu et al., 2020) | 50.15 | 52.52 | 51.31 |
| UMGF-VinVL-EVG (Zhang et al., 2021a) | 51.62 | 51.72 | 51.67 |
| ITA-VinVL-EVG (Wang et al., 2022b) | 52.37 | 50.77 | 51.56 |
| BARTMNER-VinVL-EVG (Yu et al., 2023) | 52.47 | 52.43 | 52.45 |
| H-Index (Yu et al., 2023) | 56.16 | 56.67 | 56.41 |
| SCANNER (Ours) | | | |
Table 6: Experiment results on the Twitter-GMNER. The reported figures for the baseline models are taken from Yu et al. (2023).
Table 4, reveal that our model achieves superior performance on Twitter-2015 and exhibits markedly impressive results on Twitter-2017. Notably, while PGIM shows outstanding performance on Twitter-2017, it utilizes large language models (LLMs) like ChatGPT, which incurs API costs, a notable drawback. In contrast, our model does not rely on LLM knowledge, freeing it from such disadvantages and demonstrating better performance on Twitter-2015. Additionally, we conduct an experiment using the same LLM knowledge as PGIM, which is presented in Appendix C.
Experimental results in GMNER. To show our effectiveness in GMNER, we make broad
Figure 6: Visualization results showing how various types of knowledge are brought in and utilized differently to perform the MNER task. Knowledge highlighted in blue positively influences correct predictions.
| Methods | Twitter-2015 Pre. | Twitter-2015 Rec. | Twitter-2015 F1 | Twitter-2017 Pre. | Twitter-2017 Rec. | Twitter-2017 F1 |
|---|---|---|---|---|---|---|
| SCANNER | 79.72 | 79.03 | 79.38 | 90.40 | 90.67 | 90.54 |
| - TYT | -0.26 | -0.17 | -0.21 | -0.24 | -0.11 | -0.18 |
| - OBK | +0.11 | -0.60 | -0.26 | -0.23 | -0.22 | -0.22 |
| - WKK | -1.12 | -0.14 | -0.64 | -0.51 | -0.44 | -0.48 |
| - ICK | -0.08 | -0.54 | -0.31 | -0.29 | -0.31 | -0.29 |
Table 7: Ablation studies on the MNER datasets. '- TYT' is without the Trust Your Teacher method. '- OBK' is without object knowledge. '- WKK' is without Wikipedia knowledge. '- ICK' is without image caption knowledge.
comparisons with all existing methods. Text-only models are made to predict all visual groundings as 'None'. Table 6 shows that our model achieves significant performance improvements over prior research and establishes a new, strong baseline for future GMNER studies.
4.4 Ablation study
Ablation study in MNER. We conduct ablation experiments on the MNER task to evaluate the effectiveness of the proposed method. The results are shown in Table 7. We observe that removing the Trust Your Teacher method leads to a decrease in performance. Our proposed distillation method effectively alleviates the dataset noise issue, making our model more robust when learning from noisy datasets. Additionally, to verify the effectiveness of the various types of knowledge used in our study, we compare the results with experiments where each type of knowledge is removed. We confirm that the object knowledge, Wikipedia knowledge, and image caption knowledge used in our paper all contribute to the performance improvement on the MNER task.
Case study. As shown in Fig. 6, all three types of knowledge can be utilized as useful information for named entity recognition. In the case of the first image, knowledge from Wikipedia such as "American retail company" and object knowledge containing the logo information of "Kroger" both help in predicting the "Kroger" entity as an organization. For the image on the bottom left, image caption and object knowledge aided in named entity recognition. Moreover, in the image on the bottom right, vision information like image caption and object knowledge led to incorrect entity recognition results, but it was corrected through external knowledge from Wikipedia. Thus, the three types of knowledge proposed in this paper complement each other, enabling accurate MNER performance.
Effectiveness on unseen entities. Table 8 shows the effectiveness of knowledge on unseen entities. As
| Datasets | w/o Knowledge: Seen | w/o Knowledge: Unseen | w/ Knowledge: Seen | w/ Knowledge: Unseen |
|---|---|---|---|---|
| CoNLL2003 | 96.29 | 89.68 | 96.35 | 89.70 |
| Twitter-2015 | 87.18 | 73.84 | 87.50 | 75.45 |
| Twitter-2017 | 95.68 | 82.96 | 95.90 | 83.71 |
Table 8: Test F1 scores on seen and unseen entities, comparing the model with extracted knowledge against the baseline without it.
SCANNER utilizes various knowledge sources in MNER, it greatly increases performance on unseen entities. In NER, where no image is available, less of this knowledge can be leveraged, so the performance improves only slightly.
5 Conclusions
We introduce SCANNER, a novel approach for performing NER tasks by utilizing knowledge from various sources. To efficiently fetch diverse knowledge, SCANNER employs a two-stage structure, which first detects entity candidates and then performs named entity recognition and visual grounding on these candidates. Additionally, we propose a novel distillation method, which robustly trains the model against dataset noise, demonstrating superior performance in various NER benchmarks. We believe that our method can be easily extended to utilize knowledge from multiple sources that were not covered in this paper.
Limitations
In this study, we extract knowledge from various sources and utilize it to perform MNER tasks. By leveraging several vision experts such as CLIP, and also fetching external knowledge, our method takes relatively longer inference time compared to approaches that do not use knowledge. However, the use of vision experts and knowledge is essential for an MNER model that functions well even with unseen entities, and we efficiently extract information through a two-stage structure.
Ethics statement
All experimental results we provide in this paper are based on publicly available datasets and open-source models.
Acknowledgments
This work was partly supported by the Artificial Intelligence Industrial Convergence Cluster Development project funded by the Ministry of Science and ICT (MSIT, Korea) & Gwangju Metropolitan City, the National Research Foundation of Korea (NRF) grant (RS-2023-00213710, Neural Network Optimization with Minimal Optimization Costs), and the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01906, Artificial Intelligence Graduate School Program (POSTECH)).
References
F. Chen, J. Liu, K. Ji, W. Ren, J. Wang, and J. Chen. 2023. Learning implicit entity-object relations by bidirectional generative alignment for multimodal NER.
X. Chen, N. Zhang, L. Li, S. Deng, C. Tan, C. Xu, F. Huang, L. Si, and H. Chen. 2022. Hybrid transformer with multi-level fusion for multimodal knowledge graph completion. In SIGIR.
A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, and V. Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In ACL.
J. Devlin, M. Chang, K. Lee, and K. Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.
M. Jia, L. Shen, X. Shen, L. Liao, M. Chen, X. He, Z. Chen, and J. Li. 2023. MNER-QG: An end-to-end MRC framework for multimodal named entity recognition with query grounding. In AAAI.
J. Li, H. Fei, J. Liu, S. Wu, M. Zhang, C. Teng, D. Ji, and F. Li. 2022. Unified named entity recognition as word-word relation classification. In AAAI.
J. Li, D. Li, S. Savarese, and S. Hoi. 2023a. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
J. Li, H. Li, Z. Pan, and G. Pan. 2023b. Prompt ChatGPT in MNER: Improved multimodal named entity recognition method based on auxiliary refining knowledge from ChatGPT. arXiv preprint arXiv:2305.12212.
J. Li, A. Sun, J. Han, and C. Li. 2020. A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 34.
Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
I. Loshchilov and F. Hutter. 2019. Decoupled weight decay regularization. In ICLR.
D. Lu, L. Neves, V. Carvalho, N. Zhang, and H. Ji. 2018. Visual attention model for name tagging in multimodal social media. In ACL.
J. Lu, D. Zhang, J. Zhang, and P. Zhang. 2022. Flat multi-modal interaction transformer for named entity recognition. In COLING.
M. E. Peters, W. Ammar, C. Bhagavatula, and R. Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In ACL.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. 2021. Learning transferable visual models from natural language supervision. In ICML.
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. 2017. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV.
Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang. 2023a. DiffusionNER: Boundary diffusion for named entity recognition. In ACL.