Published online: 8 February 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
Multimodal named entity recognition (MNER) aims to identify entity spans and recognize their categories in social media posts with the aid of images. Previous work on MNER often relies on an attention mechanism to model the interactions between image and text representations. However, the inconsistency between the feature representations of different modalities makes it difficult to model image-text interactions. To address this issue, we propose multi-granularity visual contexts that align image features into the textual space for text-text interactions, so that the attention mechanism in pre-trained textual embeddings can be better utilized. Multi-granularity visual information helps establish more accurate and thorough connections between image pixels and linguistic semantics. Specifically, we first extract a global image caption and dense image captions as the coarse-grained and fine-grained visual contexts, respectively. We then treat images as signals with sparse semantic density for image-text interactions and image captions as dense semantic signals for text-text interactions. To alleviate the bias caused by visual noise and inaccurate alignment, we further design a dynamic filter network to filter visual noise and dynamically allocate visual information for modality fusion. Meanwhile, we propose a novel multi-granularity visual prompt-guided fusion network for more robust modality fusion. Extensive experiments on three MNER datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
1 Introduction

Named entity recognition (NER) is a fundamental task in the field of information extraction [1-7], which involves determining entity boundaries in free text and classifying them into pre-defined categories, such as person (PER), location (LOC), organization (ORG), and other types (MISC) [8]. As an important research direction of NER, multimodal named entity recognition (MNER) extends conventional text-based NER by taking (sentence, image) pairs as input [9-13, 23]. It has attracted increasing attention because of its research significance for multimodal deep learning and its wide applications, such as social media posts. The main reason is that texts in social media are usually short and informal, lack context, and are full of ambiguous expressions. Since the visual contexts associated with text content have been shown to help resolve ambiguous multi-sense words and out-of-vocabulary words, MNER plays an important role in extracting entities from user-generated content on social media platforms such as Twitter [14], especially when text semantics are ambiguous.
For example, in Fig. 1(a), given the sentence "I think Mickey is happy to be home after his 2-week vacation!", it is difficult to infer the type of the named entity "Mickey", which could be a person name or an animal name. With the help of the accompanying image, we can easily determine that its type is MISC. Current work in MNER mainly focuses on aligning words with image regions and fusing textual information with visual contexts. Early successful architectures for MNER rely on attention mechanisms combined with different fusion techniques.
However, there are three critical but often neglected aspects in prior work. Issue 1: the semantic richness of the information coming from images; Issue 2: the distribution differences between images and text; and Issue 3: noise interference from irrelevant image information.
With regard to issue (1), an image related to a sentence can contain different visual objects related to different entities in the sentence. For example, in Fig. 1(b), the sentence contains two entities of two different types: one PER entity and one MISC entity. The two visual objects with the label "person" are probably related to the PER entity "Grace", while the object "gold medal", which is more relevant to awards, corresponds to the MISC entity "Olympic". Recent work on multimodal NER that extracts features of the whole image only reflects the relation between the whole image (rather than its objects) and a single entity; the correspondences between multiple visual objects and different entities are ignored. As a result, the visual features of the whole image, carrying only one semantic label, may mislead these models into identifying entities of different types as the same type. It is therefore necessary to leverage the object-level features of different visual objects to assist in extracting entities of different types.
(a) I think [Mickey MISC] is happy to be home after his 2-week vacation!
(b) [Grace PER] very happy to be holding an [Olympic MISC] gold medal and posing with its owner.
(c) This is [Jeb PER] from my prayer group after confession. Be honest and experience the freedom.
Fig. 1 Three examples of multimodal named entity recognition in social media
Regarding issue (2), previous work on MNER often relies on an attention mechanism to model the interactions between image and text representations. However, it is difficult to model such interactions because image and text representations are not aligned in the same semantic space.
As for issue (3), the relatedness between named entities and images is usually ignored, which can lead to incorrect visual context clues being extracted when the images and text are not relevant. For example, in Fig. 1(c), one would expect to find some "people" ("PER" type) in the image to align with "Jeb" ("PER" type) in the text, but the image instead shows a "cat" ("MISC" type). As in this example, when there is no precise correspondence between text and image, difficulties arise both for graph-based methods that establish relationships between entities and visual objects [17] and for methods that take visual objects as semantic representations of the image [18]. On the other hand, methods that explicitly align visual objects with entities also suffer from the bias introduced by visual objects when they do not match the entities in number or type. For example, in Fig. 1(b), three objects are detected in the image, which makes it difficult to explicitly align them with the two entities "Grace" and "Olympic" in the text.
To handle the aforementioned issues, we propose several effective blocks in our model.
(1) To fully exploit the rich semantic information of images, we collect multiple visual clues for multimodal entity extraction, taking the regional images as the vital information and the global images as the supplement. Specifically, we extract pyramidal features of these images, which produces multi-scale feature representations with strong semantic information for feature maps at all levels.
(2) To handle the distribution differences between images and text, we construct multi-granularity visual contexts for text-text interactions and preserve the image-text interactions, since images contain abstract information that cannot be summarized in words. To be specific, we first extract a global image caption and dense image captions as the coarse-grained and fine-grained visual contexts, respectively. Then, we treat images as signals with sparse semantic density for image-text interaction and image captions as dense semantic signals for text-text interaction.
(3) To address the visual noise caused by irrelevant images, we propose a dynamic filter network that filters such noise. To alleviate the bias in quantity and entity type, instead of explicit alignment, the dynamic filter network dynamically allocates visual information for modality interaction, which is softer and more flexible.
Synthesizing the above blocks, we finally propose a novel multi-granularity visual prompt-guided fusion network for visual-enhanced entity extraction. In detail, we construct a multi-granularity visual prompt for each self-attention layer in the fusion network, leveraging the visual information allocated by the dynamic filter network.
Overall, we summarize the major contributions of our paper as follows:
We propose a novel multi-granularity visual prompt-guided fusion network, which is a more flexible and robust attention module for visual-enhanced entity extraction.
We construct multi-granularity visual contexts to supplement visual semantic information, and build a dynamic filter network that filters the visual noise caused by irrelevant images and allocates visual information independently for each self-attention layer of the fusion network.
We conduct experiments on two multimodal social media NER datasets and a multimodal NER dataset from Wikinews. The experimental results demonstrate that our model outperforms previous state-of-the-art (SOTA) methods.
2 Related work
As multimodal data becomes increasingly popular on social media platforms, NER in the social media domain has attracted broad attention. Since visual information in images can help classify the entity types in text, especially when text semantics are ambiguous, effective use of visual information is essential for performance on MNER tasks. Depending on how visual information is used, previous approaches to MNER can be broadly classified into the following two types:
Exploiting the coarse-grained information of images The multimodal NER task was first explored by Zhang et al. [12], Moon et al. [15], and Lu et al. [9] in the same period. They take the approach of encoding the entire image, which implicitly models the interaction between the two modalities. Specifically, Zhang et al. [12] extended an adaptive co-attention network to model the interaction between images and text. In the same year, Moon et al. [15] proposed a new model built upon SOTA Bi-LSTM word/character-based NER models, with a deep image network and a generic modality-attention module to leverage the provided visual contexts. Lu et al. [9] then proposed a visual-attention-based model to provide deeper visual understanding. These earlier efforts, which capture visual attention using only a single word, were inadequate in their use of visual features. To address this problem, Arshad et al. [16] extended the self-attention mechanism to capture the relationship between two words and image regions, and introduced a gated fusion module to dynamically select information from the text and visual features. However, the noise caused by uncorrelated images means that images are not always beneficial when fusing text and image information in MNER. To alleviate the visual bias problem, Yu et al. [10] added an entity span detection module to guide the final prediction. Sun et al. [19] proposed a text-image relationship propagation model for predicting image-text relevance, which helps eliminate the effect of modal noise. Asgari-Chenaghlu et al. [20] extended the design of multimodal BERT to learn the relationship between images and text.
Recently, Tian et al. [21] proposed a bi-directional manner to adaptively assign fusion weights to output features, handling the situation where images are missing or mismatched with the text. After that, Xu et al. [22] designed a cross-modal matching (CM) module to calculate the similarity score between text and its associated image, and used the score to determine the proportion of visual information that should be retained.
Exploiting the fine-grained information of images In addition to the above studies, some works focus on exploiting the fine-grained information of images. For example, Wu et al. [18] leveraged object labels as embeddings to bridge vision and language, and proposed a dense co-attention mechanism for fine-grained interactions. Chen et al. [23] considered leveraging the image attribute modality as well as the image conceptual knowledge modality. In addition, Zhang et al. [11] first proposed exploring a multimodal graph neural network (GNN) for MNER. Lu et al. [24] transformed the fine-grained semantic representations of vision and text into a unified lattice structure and designed a novel relative position encoding. After that, Wang et al. [25] first aligned the image into regional object tags, image-level captions and optical characters as visual contexts. To obtain the fine-grained semantic correspondence between objects in images and words in the text, Liu et al. [17] performed cross-modality semantic interaction between text and vision at different visual granularities.
Based on the granularity of image information, previous studies can be mainly categorized into two types, coarse-grained and fine-grained. Coarse-grained image information is more focused, while fine-grained image information captures more detailed features of the image. In addition, the original image and the image caption are different forms of image information, each with its own advantages and limitations. On the one hand, image captions align image features into the textual space, which helps solve the problem of inconsistent representations of different modal features. On the other hand, the original image is more abstract than the image caption, but it may contain information that is difficult to describe in text. Therefore, this paper argues that coarse-grained and fine-grained information can complement each other, and the original image and the image caption can also refine each other. However, most previous work focuses on a single granularity or representation, so the utilization of image information is rarely comprehensive and thorough.
To compensate for this deficiency, this paper proposes a new approach: introducing multiple image captions with different granularities while preserving the original image to obtain more comprehensive and detailed information. The differences between our approach and previous works are shown in Table 1.
3 Task formulation
NER aims to identify named entities (usually the name or symbol of a specific type of thing, typically a noun or phrase) in a text and classify the identified entities into pre-defined entity categories. Its input is the text to be recognized, and its output is the recognized entities and their corresponding types. For example, in Fig. 2, the words "Michael Jeffrey Jordan" are recognized as a Person, and the words "Brooklyn" and "New York" are recognized as Locations.
Table 1 The differences between our approach and previous works

Methods | Coarse-grained | Fine-grained | Original image | Image caption
Zhang et al. [12]
Moon et al. [15]
Lu et al. [9]
Arshad et al. [16]
Yu et al. [10]
Sun et al. [19]
Asgari-Chenaghlu et al. [20]
Tian et al. [21]
Xu et al. [22]
Wu et al. [18]
Chen et al. [23]
Zhang et al. [11]
Lu et al. [24]
Wang et al. [25]
Liu et al. [17]
Fig. 2 An example of the NER task
MNER aims at identifying entity spans and recognizing their categories in social media posts with the aid of images. In this section, we describe the task formulation of MNER.
Given a sentence $S$ and its associated image $I$ as input, the goal of MNER is to extract a set of entities from $S$ and classify each extracted entity into one of the pre-defined categories. As with most existing work in MNER, we formulate the task as a sequence labeling problem. Formally, let $S = \{s_1, s_2, \ldots, s_n\}$ denote the sequence of input words, where $s_i$ (with $1 \le i \le n$) denotes the $i$-th word in the sentence and $n$ represents the length of the sentence, and let $y = \{y_1, y_2, \ldots, y_n\}$ be the corresponding entity labels for all words, where $y_i \in \mathcal{Y}$ and $\mathcal{Y}$ is the pre-defined label set with the standard BIO schema [26].
In this paper, we use Conditional Random Fields (CRF) [27] for sequence labeling. It has been shown that a CRF can produce higher tagging accuracy in sequence labeling tasks because it considers the correlations between neighboring labels. For example, in POS tagging an adjective is more likely to be followed by a noun than by a verb, and in NER with the standard BIO2 annotation I-PER cannot follow B-LOC. Therefore, instead of decoding each label independently, we model the labels jointly using a CRF. We also use $O = \{o_1, o_2, \ldots, o_m\}$ to denote the set of $m$ input visual objects.
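For concreteness, the following is a minimal sketch of what BIO-schema labels look like for a sentence in the spirit of the Fig. 2 example; the connecting words and the exact tag strings are illustrative assumptions rather than any dataset's actual annotation.

```python
# Minimal illustration of BIO sequence labels for a Fig. 2-style example.
# The connecting words and exact tag set are assumptions for illustration only.
words = ["Michael", "Jeffrey", "Jordan", "was", "born", "in",
         "Brooklyn", ",", "New", "York", "."]
labels = ["B-PER", "I-PER", "I-PER", "O", "O", "O",
          "B-LOC", "O", "B-LOC", "I-LOC", "O"]

# Each word receives exactly one label; "B-" opens an entity span,
# "I-" continues it, and "O" marks tokens outside any entity.
for word, label in zip(words, labels):
    print(f"{word}\t{label}")
```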
To better present the notations used in the manuscript, we summarize the most essential mathematical notations and their descriptions in Table 2.
4 Method
In this section, we present a novel multi-granularity visual prompt-guided fusion network (MVPN). The overall architecture is shown in Fig. 3. Specifically, the MVPN framework consists of four main components: (1) Visual information embedding, (2) Dynamic filter network, (3) Visual prompt-guided fusion, and (4) Classifier. We first introduce the process of visual information embedding and then detail the other components of our model.
4.1 Visual information embedding
To effectively exploit the rich semantic information of images, we construct multi-granularity visual contexts and collect multiple visual clues for MNER. Specifically, we extract the pyramidal features of these images and then obtain the final visual features through feature aggregation. In the following subsections, we detail each stage.
Table 2 Key notations and descriptions used in this paper

Notations | Descriptions
$S$ | The sentence
$I$ | The image
$y$ | The corresponding entity labels for all words
A conventional caption describes the most prominent semantics of the image in a well-formed sentence, while dense captions describe all the objects in the image with several short sentences, as shown in Fig. 4. Since captions summarize image information into text through the image captioning model and thus help filter out the noise introduced by image features, we use the conventional image caption as the global (coarse-grained) visual context and the dense image captions as the fine-grained visual contexts.
Global image caption as coarse-grained context Image captioning is the task of describing the contents of an image in words. Although local alignment can localize the image into objects, the objects cannot fully describe the global information of the entire image. Therefore, we use the global image caption, which focuses on the most salient part of the image. The global image caption $C_g$ is generated by the global caption generator $\mathrm{GIC}$:

$$C_g = \mathrm{GIC}(I).$$
Fig. 3 The overall architecture of our MVPN
Dense image captions as fine-grained contexts A single caption only describes a prominent part of the image and may ignore other less obvious but vital information. Previous work [25] uses captions generated from beam search with multiple beams; however, these captions are similar and redundant, and cannot provide more effective information than a single caption. Therefore, we use dense image captions, which contain $m$ captions focusing on different areas of the image. These dense image captions $\{c_1, c_2, \ldots, c_m\}$ are generated by a dense caption generator $\mathrm{DIC}$:

$$\{c_1, c_2, \ldots, c_m\} = \mathrm{DIC}(I),$$
where $c_j$ is the $j$-th dense caption and $m$ is the number of objects contained in the image. We concatenate the $m$ captions with a special separator token "[X]" to form the fine-grained visual contexts $C_f$:

$$C_f = c_1 \; [\mathrm{X}] \; c_2 \; [\mathrm{X}] \; \cdots \; [\mathrm{X}] \; c_m.$$
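As a rough sketch of how such captions could be produced in practice: the paper does not name its caption generators, so the BLIP checkpoint below stands in for the global generator, the dense captions are mocked, and "[SEP]" is only one possible instantiation of the special token "[X]".

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Stand-in for the global caption generator GIC: BLIP produces one well-formed
# sentence describing the most salient content of the image (checkpoint is an assumption).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.new("RGB", (384, 384))  # placeholder for the post's associated image
inputs = processor(images=image, return_tensors="pt")
out = captioner.generate(**inputs, max_new_tokens=20)
global_caption = processor.decode(out[0], skip_special_tokens=True)

# Stand-in for the dense caption generator DIC: a region-level captioner would
# return one short sentence per detected object; mocked here for illustration.
dense_captions = ["a cat laying on a couch",
                  "two dogs on a couch",
                  "white wall in the background"]

# Fine-grained visual context: concatenate the m dense captions with the
# separator token "[X]", instantiated here as BERT's "[SEP]".
fine_grained_context = " [SEP] ".join(dense_captions)
print(global_caption)
print(fine_grained_context)
```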
Fig. 4 An example of the conventional caption and the dense captions of an image: (a) the conventional caption ("A black and white dog laying on a couch"); (b) the dense captions ("A cat laying on a couch", "Two dogs on a couch", "White wall in the background")
The exact label of the special token "[X]" (e.g., "[SEP]" in BERT [28]) depends on the choice of embeddings. Because of its ability to give different representations for the same word in different contexts, we employ the contextualized representations from BERT [28] as our sentence encoder. Following [28], each input sentence is preprocessed by inserting two special tokens, i.e., appending "[CLS]" to the beginning and "[SEP]" to the end. The preprocessed sequences are then fed into the BERT encoder to obtain the vectorized representations.
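A minimal sketch of this encoding step with the Hugging Face BERT implementation; the bert-base-uncased checkpoint is an assumption, and the tokenizer adds the "[CLS]" and "[SEP]" tokens described above.

```python
import torch
from transformers import BertTokenizer, BertModel

# Checkpoint choice is an assumption; any BERT-style encoder would follow the same pattern.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

sentence = "I think Mickey is happy to be home after his 2-week vacation!"
# The tokenizer prepends "[CLS]" and appends "[SEP]", matching the preprocessing above.
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# Contextualized token representations used as the textual embeddings.
token_representations = outputs.last_hidden_state  # shape: (1, seq_len, 768)
print(token_representations.shape)
```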
4.1.2 Pyramidal visual feature
Global image features may express abstract concepts, which act as weak learning signals. The image associated with a sentence contains multiple visual objects related to the entities in the sentence, which provides additional semantic knowledge for information extraction. Thus, we take the regional images as the vital information and the global images as the supplement, and collect them as multiple visual clues for multimodal entity extraction.
Given an image, we adopt the visual grounding toolkit [29] to extract the local visual objects with the highest salience. Then, we rescale the global image and the object images to the same resolution to obtain the global image and the visual objects.
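A small sketch of the rescaling step using torchvision transforms; the 224 x 224 target resolution is an assumed value, since the exact size is not stated above.

```python
from PIL import Image
from torchvision import transforms

# Rescale the global image and each grounded object crop to a common resolution.
# 224 x 224 is an assumed value; the paper's exact target size is not given here.
rescale = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

global_image = Image.new("RGB", (640, 480))    # placeholder for the full image
object_crops = [Image.new("RGB", (120, 200))]  # placeholders for grounded object crops

global_tensor = rescale(global_image)                      # shape: (3, 224, 224)
object_tensors = [rescale(crop) for crop in object_crops]  # one tensor per object
```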
In the area of computer vision (CV), leveraging features from different blocks of pre-trained models and fusing them is widely applied to improve model performance. Inspired by such practices, we focus on the application of pyramidal features in the multimodal setting. We propose to extract multi-scale hierarchical features of an image and fuse the image features into each fusion layer. Typically, given an image and its object images, we separately encode them with a backbone model and generate a list of pyramidal feature maps:

$$\{F_1, F_2, \ldots, F_B\} = \mathrm{Backbone}(V),$$

which are of different scales. Then we obtain the scaled features $\hat{F}_i$ by projecting them as follows:

$$\hat{F}_i = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Pool}(F_i)\big), \quad i = 1, \ldots, B,$$

where $F_i$ denotes the output of the $i$-th block of the backbone model, $B$ denotes the number of blocks in the visual backbone model (here $B$ is 4 for ResNet [32]), and $\mathrm{Pool}(\cdot)$ represents the pooling operation that aggregates the features to the same spatial size. The $1 \times 1$ convolutional layer is leveraged to map the visual feature pyramid to match the embedding size of the fusion layer.
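A minimal PyTorch sketch of this pyramidal extraction with a ResNet-50 backbone; the pooled spatial size and the fusion embedding dimension used below are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class PyramidalVisualFeatures(nn.Module):
    """Extracts one feature map per ResNet block, pools them to a common
    spatial size, and projects each with a 1x1 convolution to the fusion
    embedding size (the pooled size and embed_dim below are assumptions)."""

    def __init__(self, embed_dim=768, pooled_size=7):
        super().__init__()
        backbone = resnet50()
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.blocks = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.pool = nn.AdaptiveAvgPool2d((pooled_size, pooled_size))
        # One 1x1 projection per block to match the fusion layer embedding size.
        self.projs = nn.ModuleList([
            nn.Conv2d(c, embed_dim, kernel_size=1)
            for c in (256, 512, 1024, 2048)  # ResNet-50 block output channels
        ])

    def forward(self, images):
        x = self.stem(images)
        pyramid = []
        for block, proj in zip(self.blocks, self.projs):
            x = block(x)                        # F_i from the i-th block
            pyramid.append(proj(self.pool(x)))  # pooled, then 1x1 projected
        return pyramid                          # B maps with identical shapes


features = PyramidalVisualFeatures()(torch.randn(1, 3, 224, 224))
print([f.shape for f in features])  # 4 maps of shape (1, 768, 7, 7)
```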
4.1.3 Visual feature aggregation
We aggregate the global image caption features and the global image features into the global visual features $V_{\mathrm{global}}$, and aggregate the dense caption features and the object visual features into the fine-grained local features $V_{\mathrm{local}}$. Formally, we derive the final aggregated visual features $V$ by concatenating the global visual features and the local visual features:

$$V_{\mathrm{global}} = H_{g} \oplus \hat{F}_{g}, \qquad V_{\mathrm{local}} = H_{d} \oplus \hat{F}_{o}, \qquad V = [\,V_{\mathrm{global}};\, V_{\mathrm{local}}\,],$$

where $H_g$ and $H_d$ denote the encoded global and dense caption features, $\hat{F}_g$ and $\hat{F}_o$ denote the pyramidal features of the global image and the visual objects, $\oplus$ is the element-wise summation, and $[\cdot;\cdot]$ denotes the concatenation operation.
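A small sketch of the aggregation, assuming the caption features and image features within each granularity have already been projected to matching shapes; the token counts and the embedding dimension below are placeholders.

```python
import torch

# Assume all features are already projected to a common embedding size d and
# flattened to token sequences (shapes are illustrative placeholders).
d = 768
global_caption_feats = torch.randn(1, 16, d)  # BERT-encoded global caption
global_image_feats = torch.randn(1, 16, d)    # pooled global-image pyramid
dense_caption_feats = torch.randn(1, 48, d)   # BERT-encoded dense captions
object_image_feats = torch.randn(1, 48, d)    # pooled object pyramids

# Element-wise summation within each granularity ...
global_visual = global_caption_feats + global_image_feats
local_visual = dense_caption_feats + object_image_feats

# ... then concatenation of the global and local parts along the token axis.
aggregated_visual = torch.cat([global_visual, local_visual], dim=1)
print(aggregated_visual.shape)  # (1, 64, 768)
```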
4.2 Dynamic filter network
To reduce the visual noise caused by irrelevant image information, we use a visual feature filter to filter out irrelevant image information and a dynamic inference gate to dynamically allocate visual representations for modality fusion. In the following subsections, we introduce the details of each stage.
4.2.1 Visual feature filter
As we know, MNER is a text-oriented task, and image features are auxiliary signals for better tagging. Although visual information can help reduce ambiguity, when the images are mismatched with the associated text, the visual features will mislead the model and degrade performance. Therefore, if an image is missing or mismatched with the text, the text features should be the primary contribution. To address this challenge, we construct a visual feature filter in which irrelevant image features are filtered out depending on their relevance to the text.
Considering the excellent ability of CLIP [34] to map images and text into the same feature space, we construct an image-text relation discriminator based on its text encoder and image encoder to separately extract the features of the input sentence $S$ and its associated image $I$. The text-image relevance score $r$ is defined as the cosine similarity between the text features and the image features. To build a soft filter mechanism, we use the relevance score $r$ to construct a visual mask matrix $M$. The text-image relation is propagated to the filter gate to produce the filtered visual features $\tilde{V}$ as follows:

$$\tilde{V} = M \odot V,$$

where $\odot$ is the element-wise multiplication. For example, if $r = 0$, then all visual features are discarded.
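A minimal sketch of the filter built on the Hugging Face CLIP implementation; the checkpoint, the clamping of the cosine score to be non-negative, and the broadcasting of the score as the mask are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Image-text relation discriminator built on CLIP's two encoders.
# The checkpoint and the score-to-mask rule below are illustrative assumptions.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

sentence = "This is Jeb from my prayer group after confession."
image = Image.new("RGB", (224, 224))  # placeholder for the associated image

inputs = processor(text=[sentence], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    text_feat = clip.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])

# Text-image relevance score r: cosine similarity of the two feature vectors
# (clamped to be non-negative here, which is an assumed design choice).
r = torch.cosine_similarity(text_feat, image_feat).clamp(min=0.0)

# Soft filter: scale the aggregated visual features by the relevance score.
# If r is 0, all visual features are discarded.
aggregated_visual = torch.randn(1, 64, 768)  # from the aggregation step above
mask = r.view(1, 1, 1)
filtered_visual = mask * aggregated_visual
```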
4.2.2 Dynamic inference gate
Since objects of different sizes have appropriate feature representations at the corresponding scales, it is vital to decide which block of the visual backbone is assigned as the visual prompt for each layer of the fusion network. To address this challenge, we construct a densely connected reasoning route in which the filtered visual features are connected with each fusion layer.
Dynamic inference gate module We conduct the route reasoning process through a dynamic inference gate module, which can be viewed as a path-decision procedure. The motivation of the dynamic inference gate is to predict a normalized vector that represents how much the visual feature of each block should be used.
In the dynamic gate, $g_i^l$ denotes the path probability from the $i$-th block of the visual backbone to the $l$-th layer of the fusion network. It is calculated by a gating function $\mathcal{G}^l$ associated with the $l$-th layer of the fusion network, where $B$ represents the number of blocks in the backbone. We first produce the logits $z^l$ of the gate signals:

$$z^l = \mathrm{MLP}^l\Big(\delta\big(\textstyle\sum_{i=1}^{B} \mathrm{GAP}(\tilde{F}_i)\big)\Big),$$

where $\delta$ denotes the activation function LeakyReLU [35] and $\mathrm{GAP}(\cdot)$ represents the global average pooling layer.
Specifically, we first squeeze the input feature from the $i$-th block by an average pooling operation, then add the squeezed features of the multiple blocks to generate the averaged vector. We further reduce the feature dimension with the MLP layer and adopt a soft gate that generates continuous values as path probabilities. Afterward, we generate the probability vector $g^l$ for the $l$-th layer of the fusion network as follows:

$$g^l = \mathrm{softmax}(z^l).$$
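A minimal PyTorch sketch of such a gate for a single fusion layer; the reduction ratio, the two-layer MLP shape, and the softmax normalization are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn


class DynamicInferenceGate(nn.Module):
    """Predicts, for one fusion layer, a probability over the B backbone blocks
    saying how much each block's visual feature should be used. The reduction
    ratio and the softmax normalization are illustrative assumptions."""

    def __init__(self, channels=768, num_blocks=4, reduction=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # squeeze each block's feature map
        self.act = nn.LeakyReLU()
        self.mlp = nn.Sequential(           # dimension reduction, then B logits
            nn.Linear(channels, channels // reduction),
            nn.LeakyReLU(),
            nn.Linear(channels // reduction, num_blocks),
        )

    def forward(self, block_features):
        # block_features: list of B filtered maps, each of shape (N, C, H, W).
        pooled = [self.gap(f).flatten(1) for f in block_features]
        summed = self.act(torch.stack(pooled, dim=0).sum(dim=0))  # (N, C)
        logits = self.mlp(summed)                                 # (N, B)
        return torch.softmax(logits, dim=-1)                      # path probabilities


gate = DynamicInferenceGate()
blocks = [torch.randn(1, 768, 7, 7) for _ in range(4)]
probs = gate(blocks)
print(probs)  # one probability per backbone block for this fusion layer
```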
Aggregated hierarchical prompt Based on the above dynamic inference gate, we derive the aggregated hierarchical visual prompt that matches the $l$-th layer of the fusion network as:
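The display formula for the prompt is not reproduced above; as one plausible reading, the sketch below aggregates the filtered block features with a gate-weighted sum, which is an assumption rather than the authors' exact definition.

```python
import torch

# One plausible instantiation of the aggregated hierarchical prompt for layer l:
# a sum of the filtered block features weighted by the gate probabilities.
# This weighted-sum form is an assumption inferred from the text, not a quote.
block_features = [torch.randn(1, 768, 7, 7) for _ in range(4)]  # filtered pyramid
gate_probs = torch.softmax(torch.randn(1, 4), dim=-1)           # from the gate above

prompt_l = sum(gate_probs[:, i].view(1, 1, 1, 1) * f
               for i, f in enumerate(block_features))
print(prompt_l.shape)  # (1, 768, 7, 7): the visual prompt fed to fusion layer l
```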