Published online: 8 February 2024
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024
Abstract
Multimodal named entity recognition (MNER) aims to identify entity spans and recognize their categories in social media posts with the aid of images. Previous work on MNER often relies on an attention mechanism to model the interactions between image and text representations. However, the inconsistency between the feature representations of different modalities makes it difficult to model image-text interactions. To address this issue, we propose multi-granularity visual contexts that align image features into the textual space for text-text interactions, so that the attention mechanism in pre-trained textual embeddings can be better utilized. Multi-granularity visual information helps establish more accurate and thorough connections between image pixels and linguistic semantics. Specifically, we first extract a global image caption and dense image captions as the coarse-grained and fine-grained visual contexts, respectively. We then treat images as signals with sparse semantic density for image-text interactions and image captions as dense semantic signals for text-text interactions. To alleviate the bias caused by visual noise and inaccurate alignment, we further design a dynamic filter network to filter visual noise and dynamically allocate visual information for modality fusion. Meanwhile, we propose a novel multi-granularity visual prompt-guided fusion network for more robust modality fusion. Extensive experiments on three MNER datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
1 Introduction

Named entity recognition (NER) is a fundamental task in the field of information extraction [1-7], which involves determining entity boundaries in free text and classifying them into pre-defined categories, such as person (PER), location (LOC), organization (ORG), and other types (MISC) [8]. As an important research direction of NER, multimodal named entity recognition (MNER) extends conventional text-based NER by taking (sentence, image) pairs as input [9-13, 23]. It has attracted increasing attention because of its research significance for multimodal deep learning and its wide applications, such as social media posts. The main reason is that texts in social media are usually short and informal, lack context, and are full of ambiguous expressions. Since the visual contexts associated with text content have been shown to help resolve ambiguous multi-sense words and out-of-vocabulary words, MNER plays an important role in extracting entities from user-generated content on social media platforms such as Twitter [14], especially when text semantics are ambiguous.
For example, in Fig. 1(a), given the sentence "I think Mickey is happy to be home after his 2-week vacation!", it is difficult to infer the type of the named entity "Mickey", which could be a person name or an animal name. With the help of the accompanying image, we can easily determine that its type is MISC. Current work in MNER mainly focuses on aligning words with image regions and fusing textual information with visual contexts. Early successful architectures for MNER rely on attention mechanisms combined with different fusion techniques.
However, there are three critical but often neglected aspects in prior work. Issue 1: the semantic richness of the information coming from images; Issue 2: the distribution differences between images and text; and Issue 3: noise interference from irrelevant image information.
With regard to issue (1), an image related to a sentence can contain different visual objects related to different entities in the sentence. For example, in Fig. 1(b), the sentence contains two entities of two different types: one PER entity and one MISC entity. The two visual objects with the label "person" are probably related to the PER entity "Grace", while the object "gold medal", which is more relevant to awards, corresponds to the MISC entity "Olympic". Recent work on multimodal NER that extracts features of the whole image only reflects the relation between the whole image (rather than its objects) and a single entity; the correspondences between multiple visual objects and different entities are ignored. As a result, the visual features of the whole image, carrying only one semantic label, may mislead these models into identifying entities of different types as the same type. It is therefore necessary to leverage the object-level features of different visual objects to assist in extracting entities of different types.
(a) I think [Mickey MISC] is happy to be home after his 2-week vacation!
(b) [Grace PER] very happy to be holding an [Olympic MISC] gold medal and posing with its owner.
(c) This is [Jeb PER] from my prayer group after confession. Be honest and experience the freedom.
Fig. 1 Three examples of multimodal named entity recognition in social media
Regarding issue (2), previous work on MNER often relies on an attention mechanism to model the interactions between image and text representations. However, it is difficult to model such interactions because image and text representations are not aligned in the same semantic space.
As for issue (3), the relatedness between named entities and images is usually ignored, which can lead to incorrect visual context clues being extracted when the images and text are not relevant. For example, in Fig. 1(c), one would expect to find some "people" ("PER" type) in the image to align with "Jeb" ("PER" type) in the text, but the image instead shows a "cat" ("MISC" type). As in this example, when there is no precise correspondence between text and image, difficulties arise both for graph-based methods that establish relationships between entities and visual objects [17] and for methods that take visual objects as semantic representations of the image [18]. On the other hand, methods that explicitly align visual objects with entities also suffer from the bias introduced by visual objects when they do not match the entities in number or type. For example, in Fig. 1(b), three objects are detected in the image, which makes it difficult to explicitly align them with the two entities "Grace" and "Olympic" in the text.
To handle the aforementioned issues, we propose several effective blocks in our model.
(1) To fully exploit the rich semantic information of images, we collect multiple visual clues for multimodal entity extraction, taking the regional images as the vital information and the global images as the supplement. Specifically, we extract pyramidal features of these images, which produces multi-scale feature representations with strong semantic information for feature maps at all levels.
(2) To handle the distribution differences between images and text, we construct multi-granularity visual contexts for text-text interactions and preserve the image-text interactions, since images contain abstract information that cannot be summarized in words. To be specific, we first extract a global image caption and dense image captions as the coarse-grained and fine-grained visual contexts, respectively. Then, we treat images as signals with sparse semantic density for image-text interaction and image captions as dense semantic signals for text-text interaction.
(3) To address the visual noise caused by irrelevant images, we propose a dynamic filter network that filters such noise. To alleviate the bias in quantity and entity type, instead of explicit alignment, the dynamic filter network dynamically allocates visual information for modality interaction, which is softer and more flexible.
Synthesizing the above blocks, we finally propose a novel multi-granularity visual prompt-guided fusion network for visual-enhanced entity extraction. In detail, we construct a multi-granularity visual prompt for each self-attention layer in the fusion network, leveraging the visual information allocated by the dynamic filter network.
Overall, we summarize the major contributions of our paper as follows:
We propose a novel multi-granularity visual prompt-guided fusion network, which is a more flexible and robust attention module for visual-enhanced entity extraction.
We construct multi-granularity visual contexts to supplement visual semantic information, and build a dynamic filter network that filters the visual noise caused by irrelevant images and allocates visual information independently for each self-attention layer of the fusion network.
We conduct experiments on two multimodal social media NER datasets and a multimodal NER dataset from Wikinews. The experimental results demonstrate that our model outperforms previous state-of-the-art (SOTA) methods.
2 Related work
As multimodal data becomes increasingly popular on social media platforms, NER in the social media domain has attracted broad attention. Since visual information in images can help classify the entity types in text, especially when text semantics are ambiguous, effective use of visual information is essential for performance on MNER tasks. Depending on how visual information is used, previous approaches to MNER can be broadly classified into the following two types:
Exploiting the coarse-grained information of images The multimodal NER task was first explored by Zhang et al. [12], Moon et al. [15], and Lu et al. [9] in the same period. They take the approach of encoding the entire image, which implicitly models the interaction between the two modalities. Specifically, Zhang et al. [12] extended an adaptive co-attention network to model the interaction between images and text. In the same year, Moon et al. [15] proposed a new model built upon SOTA Bi-LSTM word/character-based NER models, with a deep image network and a generic modality-attention module to leverage the provided visual contexts. Lu et al. [9] then proposed a visual-attention-based model to provide deeper visual understanding. These earlier efforts, which capture visual attention using only a single word, were inadequate in their use of visual features. To address this problem, Arshad et al. [16] extended the self-attention mechanism to capture the relationship between two words and image regions, and introduced a gated fusion module to dynamically select information from the text and visual features. However, the noise caused by uncorrelated images means that images are not always beneficial when fusing text and image information in MNER. To alleviate the visual bias problem, Yu et al. [10] added an entity span detection module to guide the final prediction. Sun et al. [19] proposed a text-image relationship propagation model for predicting image-text relevance, which helps eliminate the effect of modal noise. Asgari-Chenaghlu et al. [20] extended the design of multimodal BERT to learn the relationship between images and text.
Recently, Tian et al. [21] proposed a bi-directional manner to adaptively assign fusion weights to output features, handling the situation where images are missing or mismatched with the text. After that, Xu et al. [22] designed a cross-modal matching (CM) module to calculate the similarity score between text and its associated image, and used the score to determine the proportion of visual information that should be retained.
Exploiting the fine-grained information of images In addition to the above studies, some works focus on exploiting the fine-grained information of images. For example, Wu et al. [18] leveraged object labels as embeddings to bridge vision and language, and proposed a dense co-attention mechanism for fine-grained interactions. Chen et al. [23] considered leveraging the image attribute modality as well as the image conceptual knowledge modality. In addition, Zhang et al. [11] first proposed exploring a multimodal graph neural network (GNN) for MNER. Lu et al. [24] transformed the fine-grained semantic representations of vision and text into a unified lattice structure and designed a novel relative position encoding. After that, Wang et al. [25] first aligned the image into regional object tags, image-level captions and optical characters as visual contexts. To obtain the fine-grained semantic correspondence between objects in images and words in the text, Liu et al. [17] performed cross-modality semantic interaction between text and vision at different visual granularities.
Based on the granularity of image information, previous studies can be mainly categorized into two types, coarse-grained and fine-grained. Coarse-grained image information is more focused, while fine-grained image information captures more detailed features of the image. In addition, the original image and the image caption are different forms of image information, each with its own advantages and limitations. On the one hand, image captions align image features into the textual space, which helps solve the problem of inconsistent representations of different modal features. On the other hand, the original image is more abstract than the image caption, but it may contain information that is difficult to describe in text. Therefore, this paper argues that coarse-grained and fine-grained information can complement each other, and the original image and the image caption can also refine each other. However, most previous work focuses on a single granularity or representation, so the utilization of image information is rarely comprehensive and thorough.
To compensate for this deficiency, this paper proposes a new approach: introducing multiple image captions with different granularities while preserving the original image to obtain more comprehensive and detailed information. The differences between our approach and previous works are shown in Table 1.
3 Task formulation
NER aims to identify named entities (usually the name or symbol of a specific type of thing, typically a noun or phrase) in a text and classify the identified entities into pre-defined entity categories. Its input is the text to be recognized, and its output is the recognized entities and their corresponding types. For example, in Fig. 2, the words "Michael Jeffrey Jordan" are recognized as a Person, and the words "Brooklyn" and "New York" are recognized as Locations.
Table 1 The differences between our approach and previous works

Methods | Coarse-grained | Fine-grained | Original image | Image caption
Zhang et al. [12]
Moon et al. [15]
Lu et al. [9]
Arshad et al. [16]
Yu et al. [10]
Sun et al. [19]
Asgari-Chenaghlu et al. [20]
Tian et al. [21]
Xu et al. [22]
Wu et al. [18]
Chen et al. [23]
Zhang et al. [11]
Lu et al. [24]
Wang et al. [25]
Liu et al. [17]
Fig. 2 An example of the NER task
MNER aims at identifying entity spans and recognizing their categories in social media posts with the aid of images. In this section, we describe the task formulation of MNER.
Given a sentence $S$ and its associated image $I$ as input, the goal of MNER is to extract a set of entities from $S$ and classify each extracted entity into one of the pre-defined categories. As with most existing work in MNER, we formulate the task as a sequence labeling problem. Formally, let $S = \{s_1, s_2, \ldots, s_n\}$ denote the sequence of input words, where $s_i$ (with $1 \le i \le n$) denotes the $i$-th word in the sentence and $n$ represents the length of the sentence, and let $y = \{y_1, y_2, \ldots, y_n\}$ be the corresponding entity labels for all words, where $y_i \in \mathcal{Y}$ and $\mathcal{Y}$ is the pre-defined label set with the standard BIO schema [26].
In this paper, we use Conditional Random Fields (CRF) [27] for sequence labeling. It has been shown that a CRF can produce higher tagging accuracy in sequence labeling tasks because it considers the correlations between neighboring labels. For example, in POS tagging an adjective is more likely to be followed by a noun than by a verb, and in NER with the standard BIO2 annotation I-PER cannot follow B-LOC. Therefore, instead of decoding each label independently, we model the labels jointly using a CRF. We also use $O = \{o_1, o_2, \ldots, o_m\}$ to denote the set of $m$ input visual objects.
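For concreteness, the following is a minimal sketch of what BIO-schema labels look like for a sentence in the spirit of the Fig. 2 example; the connecting words and the exact tag strings are illustrative assumptions rather than any dataset's actual annotation.

```python
# Minimal illustration of BIO sequence labels for a Fig. 2-style example.
# The connecting words and exact tag set are assumptions for illustration only.
words = ["Michael", "Jeffrey", "Jordan", "was", "born", "in",
         "Brooklyn", ",", "New", "York", "."]
labels = ["B-PER", "I-PER", "I-PER", "O", "O", "O",
          "B-LOC", "O", "B-LOC", "I-LOC", "O"]

# Each word receives exactly one label; "B-" opens an entity span,
# "I-" continues it, and "O" marks tokens outside any entity.
for word, label in zip(words, labels):
    print(f"{word}\t{label}")
```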
To better present the notations used in the manuscript, we summarize the most essential mathematical notations and their descriptions in Table 2.
4 Method
In this section, we present a novel multi-granularity visual prompt-guided fusion network (MVPN). The overall architecture is shown in Fig. 3. Specifically, the MVPN framework consists of four main components: (1) Visual information embedding, (2) Dynamic filter network, (3) Visual prompt-guided fusion, and (4) Classifier. We first introduce the process of visual information embedding and then detail the other components of our model.
4.1 Visual information embedding
To effectively exploit the rich semantic information of images, we construct multi-granularity visual contexts and collect multiple visual clues for MNER. Specifically, we extract the pyramidal features of these images and then obtain the final visual features through feature aggregation. In the following subsections, we detail each stage.
Table 2 Key notations and descriptions used in this paper

Notations | Descriptions
$S$ | The sentence
$I$ | The image
$y$ | The corresponding entity labels for all words
A conventional caption describes the most prominent semantics of the image in a well-formed sentence, while dense captions describe all the objects in the image with several short sentences, as shown in Fig. 4. Since captions summarize image information into text through the image captioning model and thus help filter out the noise introduced by image features, we use the conventional image caption as the global (coarse-grained) visual context and the dense image captions as the fine-grained visual contexts.
Global image caption as coarse-grained context Image captioning is the task of describing the contents of an image in words. Although local alignment can localize the image into objects, the objects cannot fully describe the global information of the entire image. Therefore, we use the global image caption, which focuses on the most salient part of the image. The global image caption $C_g$ is generated by the global caption generator $\mathrm{GIC}$:

$$C_g = \mathrm{GIC}(I).$$
Fig. 3 The overall architecture of our MVPN
Dense image captions as fine-grained contexts A single caption only describes a prominent part of the image and may ignore other less obvious but vital information. Previous work [25] uses captions generated from beam search with multiple beams; however, these captions are similar and redundant, and cannot provide more effective information than a single caption. Therefore, we use dense image captions, which contain $m$ captions focusing on different areas of the image. These dense image captions $\{c_1, c_2, \ldots, c_m\}$ are generated by a dense caption generator $\mathrm{DIC}$:

$$\{c_1, c_2, \ldots, c_m\} = \mathrm{DIC}(I),$$
where $c_j$ is the $j$-th dense caption and $m$ is the number of objects contained in the image. We concatenate the $m$ captions with a special separator token "[X]" to form the fine-grained visual contexts $C_f$:

$$C_f = c_1 \; [\mathrm{X}] \; c_2 \; [\mathrm{X}] \; \cdots \; [\mathrm{X}] \; c_m.$$
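As a rough sketch of how such captions could be produced in practice: the paper does not name its caption generators, so the BLIP checkpoint below stands in for the global generator, the dense captions are mocked, and "[SEP]" is only one possible instantiation of the special token "[X]".

```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# Stand-in for the global caption generator GIC: BLIP produces one well-formed
# sentence describing the most salient content of the image (checkpoint is an assumption).
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

image = Image.new("RGB", (384, 384))  # placeholder for the post's associated image
inputs = processor(images=image, return_tensors="pt")
out = captioner.generate(**inputs, max_new_tokens=20)
global_caption = processor.decode(out[0], skip_special_tokens=True)

# Stand-in for the dense caption generator DIC: a region-level captioner would
# return one short sentence per detected object; mocked here for illustration.
dense_captions = ["a cat laying on a couch",
                  "two dogs on a couch",
                  "white wall in the background"]

# Fine-grained visual context: concatenate the m dense captions with the
# separator token "[X]", instantiated here as BERT's "[SEP]".
fine_grained_context = " [SEP] ".join(dense_captions)
print(global_caption)
print(fine_grained_context)
```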
Fig. 4 An example of the conventional caption and the dense captions of an image: (a) the conventional caption ("A black and white dog laying on a couch"); (b) the dense captions ("A cat laying on a couch", "Two dogs on a couch", "White wall in the background")
The exact label of the special token "[X]" (e.g., "[SEP]" in BERT [28]) depends on the choice of embeddings. Because of its ability to give different representations for the same word in different contexts, we employ the contextualized representations from BERT [28] as our sentence encoder. Following [28], each input sentence is preprocessed by inserting two special tokens, i.e., appending "[CLS]" to the beginning and "[SEP]" to the end. The preprocessed sequences are then fed into the BERT encoder to obtain the vectorized representations.
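A minimal sketch of this encoding step with the Hugging Face BERT implementation; the bert-base-uncased checkpoint is an assumption, and the tokenizer adds the "[CLS]" and "[SEP]" tokens described above.

```python
import torch
from transformers import BertTokenizer, BertModel

# Checkpoint choice is an assumption; any BERT-style encoder would follow the same pattern.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
encoder = BertModel.from_pretrained("bert-base-uncased")

sentence = "I think Mickey is happy to be home after his 2-week vacation!"
# The tokenizer prepends "[CLS]" and appends "[SEP]", matching the preprocessing above.
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)

# Contextualized token representations used as the textual embeddings.
token_representations = outputs.last_hidden_state  # shape: (1, seq_len, 768)
print(token_representations.shape)
```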
4.1.2 Pyramidal visual feature
Global image features may express abstract concepts, which act as weak learning signals. The image associated with a sentence contains multiple visual objects related to the entities in the sentence, which provides additional semantic knowledge for information extraction. Thus, we take the regional images as the vital information and the global images as the supplement, and collect them as multiple visual clues for multimodal entity extraction.
Given an image, we adopt the visual grounding toolkit [29] to extract the local visual objects with the highest salience. Then, we rescale the global image and the object images to the same resolution to obtain the global image and the visual objects.
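A small sketch of the rescaling step using torchvision transforms; the 224 x 224 target resolution is an assumed value, since the exact size is not stated above.

```python
from PIL import Image
from torchvision import transforms

# Rescale the global image and each grounded object crop to a common resolution.
# 224 x 224 is an assumed value; the paper's exact target size is not given here.
rescale = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

global_image = Image.new("RGB", (640, 480))    # placeholder for the full image
object_crops = [Image.new("RGB", (120, 200))]  # placeholders for grounded object crops

global_tensor = rescale(global_image)                      # shape: (3, 224, 224)
object_tensors = [rescale(crop) for crop in object_crops]  # one tensor per object
```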
In the area of computer vision (CV), leveraging features from different blocks of pre-trained models and fusing them is widely applied to improve model performance. Inspired by such practices, we focus on the application of pyramidal features in the multimodal setting. We propose to extract multi-scale hierarchical features of an image and fuse the image features into each fusion layer. Typically, given an image and its object images, we separately encode them with a backbone model and generate a list of pyramidal feature maps:

$$\{F_1, F_2, \ldots, F_B\} = \mathrm{Backbone}(V),$$

which are of different scales. Then we obtain the scaled features $\hat{F}_i$ by projecting them as follows:

$$\hat{F}_i = \mathrm{Conv}_{1 \times 1}\big(\mathrm{Pool}(F_i)\big), \quad i = 1, \ldots, B,$$

where $F_i$ denotes the output of the $i$-th block of the backbone model, $B$ denotes the number of blocks in the visual backbone model (here $B$ is 4 for ResNet [32]), and $\mathrm{Pool}(\cdot)$ represents the pooling operation that aggregates the features to the same spatial size. The $1 \times 1$ convolutional layer is leveraged to map the visual feature pyramid to match the embedding size of the fusion layer.
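A minimal PyTorch sketch of this pyramidal extraction with a ResNet-50 backbone; the pooled spatial size and the fusion embedding dimension used below are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50


class PyramidalVisualFeatures(nn.Module):
    """Extracts one feature map per ResNet block, pools them to a common
    spatial size, and projects each with a 1x1 convolution to the fusion
    embedding size (the pooled size and embed_dim below are assumptions)."""

    def __init__(self, embed_dim=768, pooled_size=7):
        super().__init__()
        backbone = resnet50()
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1,
                                  backbone.relu, backbone.maxpool)
        self.blocks = nn.ModuleList([backbone.layer1, backbone.layer2,
                                     backbone.layer3, backbone.layer4])
        self.pool = nn.AdaptiveAvgPool2d((pooled_size, pooled_size))
        # One 1x1 projection per block to match the fusion layer embedding size.
        self.projs = nn.ModuleList([
            nn.Conv2d(c, embed_dim, kernel_size=1)
            for c in (256, 512, 1024, 2048)  # ResNet-50 block output channels
        ])

    def forward(self, images):
        x = self.stem(images)
        pyramid = []
        for block, proj in zip(self.blocks, self.projs):
            x = block(x)                        # F_i from the i-th block
            pyramid.append(proj(self.pool(x)))  # pooled, then 1x1 projected
        return pyramid                          # B maps with identical shapes


features = PyramidalVisualFeatures()(torch.randn(1, 3, 224, 224))
print([f.shape for f in features])  # 4 maps of shape (1, 768, 7, 7)
```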
4.1.3 Visual feature aggregation
We aggregate the global image caption features and the global image features into the global visual features $V_{\mathrm{global}}$, and aggregate the dense caption features and the object visual features into the fine-grained local features $V_{\mathrm{local}}$. Formally, we derive the final aggregated visual features $V$ by concatenating the global visual features and the local visual features:

$$V_{\mathrm{global}} = H_{g} \oplus \hat{F}_{g}, \qquad V_{\mathrm{local}} = H_{d} \oplus \hat{F}_{o}, \qquad V = [\,V_{\mathrm{global}};\, V_{\mathrm{local}}\,],$$

where $H_g$ and $H_d$ denote the encoded global and dense caption features, $\hat{F}_g$ and $\hat{F}_o$ denote the pyramidal features of the global image and the visual objects, $\oplus$ is the element-wise summation, and $[\cdot;\cdot]$ denotes the concatenation operation.
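A small sketch of the aggregation, assuming the caption features and image features within each granularity have already been projected to matching shapes; the token counts and the embedding dimension below are placeholders.

```python
import torch

# Assume all features are already projected to a common embedding size d and
# flattened to token sequences (shapes are illustrative placeholders).
d = 768
global_caption_feats = torch.randn(1, 16, d)  # BERT-encoded global caption
global_image_feats = torch.randn(1, 16, d)    # pooled global-image pyramid
dense_caption_feats = torch.randn(1, 48, d)   # BERT-encoded dense captions
object_image_feats = torch.randn(1, 48, d)    # pooled object pyramids

# Element-wise summation within each granularity ...
global_visual = global_caption_feats + global_image_feats
local_visual = dense_caption_feats + object_image_feats

# ... then concatenation of the global and local parts along the token axis.
aggregated_visual = torch.cat([global_visual, local_visual], dim=1)
print(aggregated_visual.shape)  # (1, 64, 768)
```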
4.2 Dynamic filter network
To reduce the visual noise caused by irrelevant image information, we use a visual feature filter to filter out irrelevant image information and a dynamic inference gate to dynamically allocate visual representations for modality fusion. In the following subsections, we introduce the details of each stage.
4.2.1 Visual feature filter
As we know, MNER is a text-oriented task, and image features are auxiliary signals for better tagging. Although visual information can help reduce ambiguity, when the images are mismatched with the associated text, the visual features will mislead the model and degrade performance. Therefore, if an image is missing or mismatched with the text, the text features should be the primary contribution. To address this challenge, we construct a visual feature filter in which irrelevant image features are filtered out depending on their relevance to the text.
Considering the excellent ability of CLIP [34] to map images and text into the same feature space, we construct an image-text relation discriminator based on its text encoder and image encoder to separately extract the features of the input sentence $S$ and its associated image $I$. The text-image relevance score $r$ is defined as the cosine similarity between the text features and the image features. To build a soft filter mechanism, we use the relevance score $r$ to construct a visual mask matrix $M$. The text-image relation is propagated to the filter gate to produce the filtered visual features $\tilde{V}$ as follows:

$$\tilde{V} = M \odot V,$$

where $\odot$ is the element-wise multiplication. For example, if $r = 0$, then all visual features are discarded.
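A minimal sketch of the filter built on the Hugging Face CLIP implementation; the checkpoint, the clamping of the cosine score to be non-negative, and the broadcasting of the score as the mask are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Image-text relation discriminator built on CLIP's two encoders.
# The checkpoint and the score-to-mask rule below are illustrative assumptions.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

sentence = "This is Jeb from my prayer group after confession."
image = Image.new("RGB", (224, 224))  # placeholder for the associated image

inputs = processor(text=[sentence], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    text_feat = clip.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_feat = clip.get_image_features(pixel_values=inputs["pixel_values"])

# Text-image relevance score r: cosine similarity of the two feature vectors
# (clamped to be non-negative here, which is an assumed design choice).
r = torch.cosine_similarity(text_feat, image_feat).clamp(min=0.0)

# Soft filter: scale the aggregated visual features by the relevance score.
# If r is 0, all visual features are discarded.
aggregated_visual = torch.randn(1, 64, 768)  # from the aggregation step above
mask = r.view(1, 1, 1)
filtered_visual = mask * aggregated_visual
```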
4.2.2 Dynamic inference gate
Since objects of different sizes have appropriate feature representations at the corresponding scales, it is vital to decide which block of the visual backbone is assigned as the visual prompt for each layer of the fusion network. To address this challenge, we construct a densely connected reasoning route in which the filtered visual features are connected with each fusion layer.
Dynamic inference gate module We conduct the route reasoning process through a dynamic inference gate module, which can be viewed as a path-decision procedure. The motivation of the dynamic inference gate is to predict a normalized vector that represents how much the visual feature of each block should be used.
In the dynamic gate, $g_i^l$ denotes the path probability from the $i$-th block of the visual backbone to the $l$-th layer of the fusion network. It is calculated by a gating function $\mathcal{G}^l$ associated with the $l$-th layer of the fusion network, where $B$ represents the number of blocks in the backbone. We first produce the logits $z^l$ of the gate signals:

$$z^l = \mathrm{MLP}^l\Big(\delta\big(\textstyle\sum_{i=1}^{B} \mathrm{GAP}(\tilde{F}_i)\big)\Big),$$

where $\delta$ denotes the activation function LeakyReLU [35] and $\mathrm{GAP}(\cdot)$ represents the global average pooling layer.
Specifically, we first squeeze the input feature from the $i$-th block by an average pooling operation, then add the squeezed features of the multiple blocks to generate the averaged vector. We further reduce the feature dimension with the MLP layer and adopt a soft gate that generates continuous values as path probabilities. Afterward, we generate the probability vector $g^l$ for the $l$-th layer of the fusion network as follows:

$$g^l = \mathrm{softmax}(z^l).$$
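A minimal PyTorch sketch of such a gate for a single fusion layer; the reduction ratio, the two-layer MLP shape, and the softmax normalization are assumptions consistent with the description above.

```python
import torch
import torch.nn as nn


class DynamicInferenceGate(nn.Module):
    """Predicts, for one fusion layer, a probability over the B backbone blocks
    saying how much each block's visual feature should be used. The reduction
    ratio and the softmax normalization are illustrative assumptions."""

    def __init__(self, channels=768, num_blocks=4, reduction=4):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)  # squeeze each block's feature map
        self.act = nn.LeakyReLU()
        self.mlp = nn.Sequential(           # dimension reduction, then B logits
            nn.Linear(channels, channels // reduction),
            nn.LeakyReLU(),
            nn.Linear(channels // reduction, num_blocks),
        )

    def forward(self, block_features):
        # block_features: list of B filtered maps, each of shape (N, C, H, W).
        pooled = [self.gap(f).flatten(1) for f in block_features]
        summed = self.act(torch.stack(pooled, dim=0).sum(dim=0))  # (N, C)
        logits = self.mlp(summed)                                 # (N, B)
        return torch.softmax(logits, dim=-1)                      # path probabilities


gate = DynamicInferenceGate()
blocks = [torch.randn(1, 768, 7, 7) for _ in range(4)]
probs = gate(blocks)
print(probs)  # one probability per backbone block for this fusion layer
```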
Aggregated hierarchical prompt Based on the above dynamic inference gate, we derive the aggregated hierarchical visual prompt that matches the $l$-th layer of the fusion network as:
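The display formula for the prompt is not reproduced above; as one plausible reading, the sketch below aggregates the filtered block features with a gate-weighted sum, which is an assumption rather than the authors' exact definition.

```python
import torch

# One plausible instantiation of the aggregated hierarchical prompt for layer l:
# a sum of the filtered block features weighted by the gate probabilities.
# This weighted-sum form is an assumption inferred from the text, not a quote.
block_features = [torch.randn(1, 768, 7, 7) for _ in range(4)]  # filtered pyramid
gate_probs = torch.softmax(torch.randn(1, 4), dim=-1)           # from the gate above

prompt_l = sum(gate_probs[:, i].view(1, 1, 1, 1) * f
               for i, f in enumerate(block_features))
print(prompt_l.shape)  # (1, 768, 7, 7): the visual prompt fed to fusion layer l
```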