
VEC-MNER: Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-level Interaction for Multimodal NER

Pengfei Wei, Guangdong University of Technology, Guangzhou, China, wpf@gdut.edu.cn
Hongjun Ouyang, Guangdong University of Technology, Guangzhou, China, 3120005742@mail2.gdut.edu.cn
Qintai Hu*, Guangdong University of Technology, Guangzhou, China, huqt8@gdut.edu.cn
Bi Zeng, Guangdong University of Technology, Guangzhou, China, zb9215@gdut.edu.cn
Guang Feng, Guangdong University of Technology, Guangzhou, China, von@gdut.edu.cn
Qingpeng Wen, Guangdong University of Technology, Guangzhou, China, wqp@mail2.gdut.edu.cn

Abstract

Multimodal Named Entity Recognition (MNER) aims to leverage visual information to identify entity boundaries and categories in social media posts. Existing methods mainly adopt heterogeneous architectures, with ResNet (CNN-based) and BERT (Transformer-based) dedicated to modeling visual and textual features, respectively. However, current approaches still face the following issues: (1) weak cross-modal correlations and poor semantic consistency; (2) suboptimal fusion results when visual objects and textual entities are inconsistent. To this end, we propose a Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-level Interaction (VEC-MNER) model for MNER. Specifically, in contrast to heterogeneous architectures, we propose a new homogeneous Hybrid Transformer Architecture, which naturally reduces the heterogeneity. Moreover, we design the Correlation-Aware Alignment (CAA-Encoder) layer and the Correlation-Aware Deep Fusion (CADF-Encoder) layer, combined with contrastive learning, to achieve more effective implicit alignment and deep semantic fusion between modalities, respectively. We also construct a Correlation-Aware (CA) module that can effectively reduce heterogeneity between modalities and alleviate visual deviation. Experimental results demonstrate that our approach achieves SOTA performance, with F1 scores of 74.89% and 87.51% on Twitter-2015 and Twitter-2017, respectively.

CCS CONCEPTS

  • Information systems → Multimedia information systems;
  • Computing methodologies → Artificial intelligence.

KEYWORDS

Multimodal Named Entity Recognition, Dual-Stream Transformer, Cross-Modal Fusion, Contrastive learning, Conditional Random Field, Sequence labeling

ACM Reference Format:

Pengfei Wei, Hongjun Ouyang, Qintai Hu, Bi Zeng, Guang Feng, and Qingpeng Wen. 2024. VEC-MNER: Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-level Interaction for Multimodal NER. In Proceedings of the 2024 International Conference on Multimedia Retrieval (ICMR '24), June 10-14, 2024, Phuket, Thailand. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3652583.3658097

1 INTRODUCTION

Traditional NER tasks extract keyword information from text only. In today's information age, with the surge of social media, multimodal NER brings unique challenges. The vast amount of content created by users on social media platforms covers multiple modalities such as text and images. However, information on social media is highly unstructured, short, and noisy [38], which makes traditional NER methods perform poorly in this field. The task of multimodal NER is to accurately locate and classify named entities by simultaneously utilizing text and image information. As shown in Figure 1a, when classifying the named entity "Kevin Durant" as "PER" instead of "MISC", the addition of visual information can alleviate the semantic deficiency and ambiguity caused by using textual information alone.

Figure 1: Three examples of Multimodal Named Entity Recognition in Social Media. (a) [Kevin Durant PER]; (b) a post mentioning [Texas Tech ORG] and [Oracle Arena LOC]; (c) a post mentioning [Donald Trump PER].
The core of existing MNER methods is to achieve alignment and fusion of textual and visual information through heterogeneous architectures built on ResNet (CNN-based) and BERT (Transformer-based). These methods are mainly divided into the following four categories: (1) Arshad et al. [2], Asgari-Chenaghlu et al. [3], Lu et al. [16], and Moon et al. [20, 21] adopt a pre-trained CNN model such as ResNet [10] to encode the whole image into a global feature vector, and then enhance the representation of each word with the global image vector through an attention mechanism. (2) Sun et al. [25], Wang et al. [26], and Yu et al. [36] evenly divide the feature map obtained from the image into multiple blocks, and then learn the most valuable visual-aware word representations by modeling the interaction between the text sequence and the visual regions using a Transformer or a gating mechanism. (3) Some researchers employ object detection models such as Mask R-CNN [9] to obtain visual objects from the associated images, and then combine object-level visual information and textual word information based on GNNs or cross-modal attention. (4) There are also works exploring derivative knowledge of the image content, including OCR, image descriptions, and other image attributes [12, 28, 29], which are used to guide words to obtain helpful visual semantic information.
Although existing methods have achieved impressive results, they still have some obvious limitations. It is widely believed that explicit alignment can mine fine-grained correspondences between text and image. As shown in Figure 1a, by observing an image containing two people (visual objects), the type of "Kevin Durant" (entity) in the text can easily be classified as "PER". However, this explicit alignment inevitably causes problems when visual objects and entities are inconsistent in number or type. For example, in Figure 1b, there are many detected objects (n visual objects) in the image, which makes it difficult to explicitly align them with "Texas Tech" in the text. In Figure 1c, one would expect to find some "people" ("PER" type) in the image aligned with "Donald Trump" ("PER" type) in the text, but the image instead contains a "panda" ("MISC" type). When there is no such precise correspondence between text and image, graph-based methods that establish explicit relationships between entities and visual objects run into difficulties [31]. As we can see, the correlation between image and text exists in various situations: fully relevant, partially relevant, and irrelevant. Therefore, irrelevant visual information can lead to misleading modality alignment and fusion, and further affect the performance of MNER.
Towards this end, we propose VEC-MNER, a Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-Level Interaction, which is a Transformer-based homogeneous architecture. The goal of our work is to explore the potential of a Transformer-based homogeneous architecture on MNER tasks while overcoming the following challenges: (1) weak cross-modal correlation and poor semantic consistency; (2) suboptimal fusion results when visual objects and text entities are inconsistent.
Our main contributions can be summarized as follows:
(1) To our knowledge, our work is the first to propose a novel Vision-Enhanced Cross-Modal Multi-Level Interactive Hybrid Transformer model for the MNER task; using a unified Transformer architecture to encode both text and visual objects naturally reduces heterogeneity and better models the relationships between modalities.
(2) We design the CAA-Encoder layer and the CADF-Encoder layer to achieve more effective implicit alignment between image-text and deep semantic fusion between modalities, respectively. At the same time, we combine contrastive learning to further enhance the correlation between modalities, which is necessary for the MNER task.
(3) We also construct a Correlation-Aware (CA) module that can effectively reduce heterogeneity between modalities and alleviate visual deviation.
(4) Our proposed method is evaluated on the widely used multimodal entity recognition benchmark datasets Twitter-2015 and Twitter-2017, and achieves SOTA performance.

2 RELATED WORK

2.1 Multimodal Named Entity Recognition

In recent years, multiple modalities of data have been widely used to improve the effectiveness of named entity recognition. Because social media texts contain grammatical errors, excessive noise, and missing information, some works use the accompanying images to provide effective entity information that assists the model in prediction. Moon et al. [21] proposed using a bidirectional long short-term memory network (LSTM) to extract text features and a convolutional neural network (CNN) to extract image features, combining them with a modal attention module to predict sequence labels. In order to extract the image regions most relevant to the text, Lu et al. [16] used an attention-based model to fuse text and image features. To avoid the impact of mismatches between text and images, Xu et al. [32] proposed a cross-modal alignment and matching module that consistently integrates text and image representations. Different from the above methods, we design the CAA-Encoder layer and the CADF-Encoder layer, combined with contrastive learning, to achieve more effective implicit alignment and deep semantic fusion between modalities, respectively.

2.2 Dual-stream Transformer

Early MNER approaches primarily focused on simple dual-stream structures concatenating text (BiLSTM, BERT) and image (ResNet) features, yet this significantly increased the heterogeneity of the modal representation layer and made modeling difficult. With the rapid development of Transformer models in computer vision, researchers shifted their attention to more complex dual-stream Transformers, which use the Transformer architecture to process the data streams of the two modalities separately. Yu et al. [36] introduce a multi-modal interaction module in the interaction layer of a dual-stream Transformer to capture the dynamic relationships between modalities. However, this method still adopts the heterogeneous ResNet representation, mainly focuses on explicit alignment, and does not fully consider the issue of visual bias. We propose a new homogeneous Hybrid Transformer Architecture, which naturally reduces the heterogeneity, and we also construct a CA module that can effectively alleviate visual deviation.
Figure 2: Overall Architecture of VEC-MNER. OT denotes the original text, OI denotes the original image, FRI denotes the set of images processed by Fast R-CNN, and VGI denotes the set of images processed by Visual Grounding.

2.3 Contrastive Learning

Contrastive learning has been widely used in fields such as CV and NLP, including image retrieval, text classification and recommendation systems. The core idea of contrastive learning is to bring positive samples closer and push away negative samples in the feature space by using a contrastive loss (e.g., the InfoNCE loss [22]), which brings significant improvements to downstream tasks. Several researchers proposed making different augmented representations of images consistent with each other and showed positive results. At the same time, researchers in the field of NLP have also begun to work on finding suitable augmentations for text. However, a major limitation of the above methods is that they only perform single-modal contrastive learning. In recent years, with the rise of multimodal pre-training models, many studies have incorporated multi-modal contrastive learning into their methods [10, 27, 39]. This paper introduces multi-modal contrastive learning to alleviate the problem of weak cross-modal correlation.

3 METHOD

This section introduces our proposed VEC-MNER model, as shown in Figure 2, which mainly consists of five stacked modules. The first is the embedding layer, which is used to vectorize the text and images, followed by the Hybrid Transformer architecture, which consists of three stacked layers: the modality encoder layer (T-Encoder and V-Encoder), the modality alignment layer (CAA-Encoder) and the modality fusion layer (CADF-Encoder); finally, a CRF framework is used to capture the label dependencies of the entity recognition task. The T-Encoder, V-Encoder, CAA-Encoder, and CADF-Encoder each consist of a fixed number of stacked layers (the sensitivity to the CAA-Encoder depth is analyzed in Section 4.7).
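For orientation, a structural sketch of how these five modules could compose is shown below; the class name, constructor arguments, and layer interfaces are illustrative assumptions rather than the exact implementation.

```python
import torch.nn as nn

class VECMNERSketch(nn.Module):
    # Structural sketch: embeddings -> T/V encoders -> CAA -> CADF -> CRF.
    def __init__(self, t_encoder, v_encoder, caa_layers, cadf_layers, crf):
        super().__init__()
        self.t_encoder = t_encoder              # lower BERT layers (text stream)
        self.v_encoder = v_encoder              # lower CLIP-ViT layers (visual stream)
        self.caa = nn.ModuleList(caa_layers)    # Correlation-Aware Alignment layers
        self.cadf = nn.ModuleList(cadf_layers)  # Correlation-Aware Deep Fusion layers
        self.crf = crf                          # CRF decoder over fused text states

    def forward(self, text_emb, vis_emb):
        h_t = self.t_encoder(text_emb)          # modality encoder layer (text)
        h_v = self.v_encoder(vis_emb)           # modality encoder layer (vision)
        for layer in self.caa:                  # implicit cross-modal alignment
            h_t, h_v = layer(h_t, h_v)
        for layer in self.cadf:                 # deep semantic fusion
            h_t, h_v = layer(h_t, h_v)
        return self.crf(h_t)                    # sequence labeling on text representations
```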
Task Definition: Given a sentence and its associated images as input, the goal of MNER is to extract entities from them and classify each extracted entity into one of the predefined types, including Person (PER), Organization (ORG), Location (LOC) and Others (MISC). Similar to most existing work in MNER, we formulate the task as a sequence labeling problem: each word of the input sentence is assigned a tag from a predefined tag set following the standard BIO scheme, and a set of visual objects accompanies the sentence as additional input.
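As a small illustration of this sequence-labeling formulation, the tags below are hypothetical and simply follow the standard BIO convention for the sentence in Figure 1a:

```python
# One BIO tag per word; entity types are drawn from {PER, LOC, ORG, MISC}.
words = ["Kevin", "Durant", "enters", "Oracle", "Arena"]
tags  = ["B-PER", "I-PER", "O",      "B-LOC",  "I-LOC"]
assert len(words) == len(tags)
```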

3.1 Text and Visual Embedding

Text Embedding. Similar to prior work, since BERT can provide different representations of the same word in different contexts, we also adopt BERT to obtain contextualized representations. The BERT embedding consists of three components: Token Embedding, Segment Embedding and Position Embedding. We first convert each text word into a contextual vector, and preprocess the input sentence by inserting the special tokens [CLS] and [SEP] at the beginning and ending positions, respectively. The transformed text sequence has length N.
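A minimal sketch of this step, assuming the HuggingFace transformers package and a bert-base checkpoint (both assumptions for illustration, not necessarily the exact setup used in our experiments):

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "Kevin Durant enters Oracle Arena"
# The tokenizer adds the special [CLS] / [SEP] tokens described above.
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = bert(**inputs)
text_seq = outputs.last_hidden_state  # (1, N, 768): one contextual vector per token
```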
Visual Embedding. Visual objects can complement missing entity information in a sentence. In contrast to prior work [12, 28, 29, 37], in order to capture the visual objects in the context, in addition to using the original image we also introduce visual local objects from Fast R-CNN [7] and Visual Grounding as inputs. For Visual Grounding, similar to [35], we employ the parsing tool from the Stanford Grammar Parser to identify all noun phrases in the input sentence, and then apply the visual grounding toolkit [34] to detect the boundary objects for each noun phrase. Since it is difficult to fully detect all potential visual objects in an image using only noun phrases, we further introduce four predefined entity categories (i.e., person, location, organization, miscellaneous) to discover more relevant visual objects.
In contrast to previous work [24, 29], we choose CLIP as our visual feature extractor to obtain contextual representations. We use the same patch size for all inputs, which splits each of the three input images described above into a sequence of visual blocks. Since both types of extracted visual objects are multiple, we select three images as input, respectively. In addition, we insert a special token at the beginning position and, as with the BERT embedding, the segmented image blocks are processed through the Visual Embedding of CLIP to obtain the final image sequence, where M is the length of the transformed image sequence.
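A minimal sketch of the visual side, assuming the HuggingFace transformers CLIP vision classes and pre-cropped object images from Fast R-CNN / Visual Grounding (the file names are hypothetical placeholders):

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
clip_vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")

paths = ["original.jpg", "frcnn_object.jpg", "grounding_object.jpg"]  # hypothetical inputs
images = [Image.open(p).convert("RGB") for p in paths]
pixel_values = processor(images=images, return_tensors="pt").pixel_values
with torch.no_grad():
    out = clip_vision(pixel_values=pixel_values)
# (num_images, 1 + num_patches, hidden): a leading special token followed by patch tokens
visual_seq = out.last_hidden_state
```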

3.2 Multimodal Hybrid Transformer

This section introduces our proposed Multimodal Hybrid Transformer architecture, which is a novel homogeneous architecture that has not been explored by previous MNER methods. This is a major highlight of this paper, and it mainly includes the following three stacked modules.
3.2.1 Information Encoder Layer. The information encoder layer mainly consists of a text encoder (T-Encoder) and a visual encoder (V-Encoder). Its core is the Transformer block. The Transformer has been widely used in the two major research fields of CV and NLP, is currently the dominant architecture in the academic community, and consists of stacked blocks. Each block mainly contains two sublayers: a multi-head self-attention layer (MHSA) and a fully connected feed-forward network layer (FFN). Layer normalization (LN) and residual connections are also applied to each layer. Given an input vector x, learned affine transformations convert x into a query Q and key-value pairs (K, V), from which a scaled dot-product attention map is computed.
The attention weights are computed over the sequence positions and scaled by the dimension of the query vector. MHSA performs the attention function on multiple heads in parallel, and each head maps the input to queries, keys, and values with its own projection parameters. The role of MHSA is to calculate the weighted hidden state of each head and then concatenate them.
The concatenated heads are projected back to the dimension of the hidden vector, with each head operating on a corresponding fraction of that dimension. FFN is another important component of the Transformer. It usually consists of two linear transformations with a ReLU activation function in between.
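These blocks follow the standard Transformer formulation; a reconstruction sketch in conventional notation ($d_k$, $h$, and the projection matrices $W^{Q}, W^{K}, W^{V}, W^{O}, W_{1}, W_{2}$ are generic Transformer symbols and may differ from the paper's original notation) is:

```latex
\mathrm{Att}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,
\qquad Q = xW^{Q},\; K = xW^{K},\; V = xW^{V}

\mathrm{MHSA}(x) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O},
\qquad \mathrm{head}_i = \mathrm{Att}\big(xW_i^{Q},\, xW_i^{K},\, xW_i^{V}\big)

\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)\,W_2 + b_2
```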
T-Encoder. We use the first layers of BERT as the text encoder, aiming to obtain primary lexical and syntactic information; each layer contains the MHSA and FFN blocks described above. Specifically, given the sentence sequence, each T-Encoder layer takes the hidden state of the previous layer as input and produces the hidden state of the current layer of the text encoder.
V-Encoder. We adopt the first layers of the CLIP model, pre-trained on 400 million image-text pairs, as the visual encoder, aiming to obtain primary edge, texture and spatial features. Similar to the text encoder, each layer contains MHSA and FFN blocks. Specifically, given the embedding sequence of image patches, each V-Encoder layer takes the hidden state of the previous layer as input and produces the hidden state of the current layer of the visual encoder.
3.2.2 Correlation-Aware Alignment Layer. This is the first innovation of this paper. Our proposed CAA-Encoder layer can model the heterogeneity and irrelevance problems between different modalities. This module uses intermediate sub-layers of BERT and CLIP. Similar to previous work [27], we employ co-attention to compute attention weights between all text positions and all visual regions to capture the fine-grained correlations between modalities. From the text perspective, the query is the hidden state of the text, and the key and value come from the visual hidden state; the goal is to use the visual features to adjust the attention weight of each word in the text and obtain the text-related visual information. From the visual perspective, the query is the hidden state of the image, and the key and value come from the hidden state of the text; the goal is to use the text content to adjust the attention weight of each position in the image and obtain the text information related to the image.
Both directions take the form of a cross-modal multi-head attention in which the query comes from one modality and the keys and values come from the other.
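A minimal sketch of this co-attention pattern, assuming the standard torch.nn.MultiheadAttention module (an illustrative stand-in, not necessarily the exact cross-attention used in our implementation):

```python
import torch.nn as nn

class CoAttentionSketch(nn.Module):
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.t2v = nn.MultiheadAttention(dim, heads, batch_first=True)  # text queries, visual keys/values
        self.v2t = nn.MultiheadAttention(dim, heads, batch_first=True)  # visual queries, text keys/values
        self.ln_t = nn.LayerNorm(dim)
        self.ln_v = nn.LayerNorm(dim)

    def forward(self, h_text, h_vis):
        # Text-side queries attend over visual keys/values: text-related visual information.
        t_out, _ = self.t2v(query=h_text, key=h_vis, value=h_vis)
        # Visual-side queries attend over textual keys/values: image-related text information.
        v_out, _ = self.v2t(query=h_vis, key=h_text, value=h_text)
        return self.ln_t(h_text + t_out), self.ln_v(h_vis + v_out)
```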
Correlation-Aware module. This is the second innovation of this paper. In the CAA-Encoder and CADF-Encoder layers, we design a CA module to mitigate the impact of noise caused by visual elements. It is a plug-and-play module that maps an arbitrary input sequence to a recalibrated feature map. In the first step, we perform feature compression along the MHSA sequence dimension, converting each 2D sequence feature into a real number. In the second step, based on the correlation between the sequence feature tokens, a significance score is generated for each feature token to represent its importance. Finally, the significance scores obtained in the second step are used to weight the original sequence features, achieving recalibration along the sequence dimension.
Here, soft_pool denotes softmax-weighted pooling [23], a ReLU activation is applied between the two fully connected layers that produce the significance scores, and for convenience we refer to the whole computation simply as the CA module in what follows.
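A hedged sketch of this recalibration idea is given below; the pooling axis, the reduction ratio, and the sigmoid gating are illustrative assumptions rather than the exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CASketch(nn.Module):
    def __init__(self, seq_len, reduction=4):
        super().__init__()
        # Two fully connected layers operating along the (fixed) sequence dimension.
        self.fc1 = nn.Linear(seq_len, seq_len // reduction)
        self.fc2 = nn.Linear(seq_len // reduction, seq_len)

    def forward(self, x):                       # x: (batch, seq_len, dim)
        w = F.softmax(x, dim=-1)                # soft-pool weights over the feature dimension
        s = (w * x).sum(dim=-1)                 # (batch, seq_len): one real number per token
        score = torch.sigmoid(self.fc2(F.relu(self.fc1(s))))  # significance score per token
        return x * score.unsqueeze(-1)          # recalibrate along the sequence dimension
```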
Combining the CA module with the co-attention described above yields the final computation of the CAA-Encoder layer.
3.2.3 Correlation-Aware Deep Fusion Layer. This is the third innovation of this paper. Our proposed CADF-Encoder layer achieves deeper semantic fusion between modalities, while introducing the CA module to further alleviate the impact of visual noise and obtain more valuable information for the subsequent decoding layers. This module makes use of the last sub-layers of BERT and CLIP.

3.3 Vision-Text Contrastive Learning

In order to further bridge the heterogeneity and gap between modalities and improve the quality of the multimodal alignment representation, we adopt a bidirectional contrastive loss: a vision-to-text contrastive loss and a text-to-vision contrastive loss. Given a batch of (text, vision) pairs, the matched pair is considered positive, while samples from all other pairs in the batch are considered negative samples. Our goal is to learn good discriminative representations by mapping similar samples to adjacent positions in the feature space, while mapping dissimilar samples to relatively distant positions. For example, in Figure 1a, our goal is to pull the representation of the person in the image and the text "Kevin Durant" close together, while pushing them away from irrelevant representations.
Here N is the number of samples in the batch, sim is the similarity function, τ is the temperature parameter, and the two encoders provide the visual and textual representations. The text-to-vision contrastive loss is defined symmetrically.
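Both losses follow the standard InfoNCE form; a reconstruction sketch in conventional notation ($v_i$, $t_i$, $\mathrm{sim}$, and $\tau$ follow common contrastive-learning usage and may differ from the paper's original symbols) is:

```latex
\mathcal{L}_{v2t} = -\frac{1}{N}\sum_{i=1}^{N}
\log\frac{\exp\big(\mathrm{sim}(v_i, t_i)/\tau\big)}
         {\sum_{j=1}^{N}\exp\big(\mathrm{sim}(v_i, t_j)/\tau\big)},
\qquad
\mathcal{L}_{t2v} = -\frac{1}{N}\sum_{i=1}^{N}
\log\frac{\exp\big(\mathrm{sim}(t_i, v_i)/\tau\big)}
         {\sum_{j=1}^{N}\exp\big(\mathrm{sim}(t_i, v_j)/\tau\big)}
```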

3.4 CRF Decoder

Since visual information has been incorporated into the textual semantic representation through the CAA-Encoder and the CADF-Encoder, we adopt a CRF decoder to perform conditional sequence labeling on the textual semantic representation. Studies have shown that a Conditional Random Field (CRF) takes into account the correlation between neighboring labels and scores the entire label sequence, which is beneficial in many MNER tasks [12, 30, 36]. Therefore, we feed the fused textual representation to the CRF layer to generate the probability of the predicted label sequence.
In this formulation, a potential function scores each position together with the transition from the previous label to the current one, and the normalization runs over the set of all possible label sequences. We use the negative log-likelihood over the training set as the primary loss function.
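A reconstruction sketch of this decoder in generic CRF notation (ψ denotes the potential function over adjacent labels and the hidden state, the denominator sums over all possible label sequences for the input, and the loss sums over the training set; the symbols may differ from the paper's original ones):

```latex
p(y \mid x) = \frac{\prod_{i=1}^{n} \psi\big(y_{i-1}, y_i, h_i\big)}
                   {\sum_{y' \in \mathcal{Y}(x)} \prod_{i=1}^{n} \psi\big(y'_{i-1}, y'_i, h_i\big)},
\qquad
\mathcal{L}_{\mathrm{CRF}} = -\sum_{(x, y) \in \mathcal{D}} \log p(y \mid x)
```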
The overall training loss combines the CRF loss with the two contrastive losses, weighted by two trade-off parameters.
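A sketch of this combination, with the trade-off weights written generically as λ1 and λ2:

```latex
\mathcal{L} = \mathcal{L}_{\mathrm{CRF}} + \lambda_{1}\,\mathcal{L}_{v2t} + \lambda_{2}\,\mathcal{L}_{t2v}
```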

4 EXPERIMENT

4.1 Datasets

We conduct experiments on two widely used MNER datasets, Twitter-2015 and Twitter-2017, which include user posts on Twitter from 2014-2015 and 2016-2017, respectively. Table 1 shows the number of entities of each type and the number of tweets in the train, validation and test sets. Each tweet contains a text-image pair, where the text content may not appear in the image and the text may contain zero or more named entities. There are four types of entities: Person (PER), Location (LOC), Organization (ORG), and others (MISC).
Table 1: Statistics of the two datasets
Entity Twitter-2015 Twitter-2017
Train Dev Test Train Dev Test
PER 2217 552 1816 2943 626 621
LOC 2091 522 1697 731 173 178
ORG 928 247 839 1674 375 395
MISC 940 225 726 701 150 157
Total 6176 1546 5078 6049 1324 1351
#Samples 4000 1000 3257 3373 723 723

4.2 Experimental settings

In this experiment, the code is written using Facebook's PyTorch deep learning framework (version 2.1.0, https://pytorch.org/), and the language is Python. The experimental environment uses the GPU provided by Google Colab. We use the validation dataset for model selection. The hyper-parameters are adjusted manually based on the accuracy and loss observed during the experiments; after extensive tuning, the final hyper-parameters are shown in Table 2.
Table 2: Experimental parameters.
Parameters Twitter-2015 Twitter-2017
Mini-batch size 32 32
Epochs 40 40
Learning rate
CRF learning rate
Weight decay
Max sequence length 40 40
Gradient clipping 1.0 2.0
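A hedged sketch of how these settings could be wired up in training code: separate learning rates for the CRF layer and the backbone, weight decay, and gradient clipping. The optimizer choice (AdamW) and the numeric learning-rate values are placeholders rather than reported settings.

```python
import torch

def build_optimizer(model, base_lr=3e-5, crf_lr=1e-3, weight_decay=1e-2):  # placeholder values
    crf_params = [p for n, p in model.named_parameters() if n.startswith("crf")]
    other_params = [p for n, p in model.named_parameters() if not n.startswith("crf")]
    return torch.optim.AdamW(
        [{"params": other_params, "lr": base_lr},   # backbone learning rate
         {"params": crf_params, "lr": crf_lr}],     # separate CRF learning rate
        weight_decay=weight_decay)

# After each backward pass, gradients are clipped (1.0 on Twitter-2015, 2.0 on Twitter-2017):
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
```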

4.3 Evaluation metrics

In order to evaluate the effectiveness of our proposed VEC-MNER approach, we report the F1 score for each entity type, together with the overall precision (P), recall (R) and F1 score. The F1 score is the harmonic mean of precision and recall.
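For completeness, the standard definitions of these metrics are:

```latex
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F_{1} = \frac{2 \times P \times R}{P + R}
```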
To make a fair comparison, we perform evaluation using the evaluation code provided by [27].

4.4 Baseline Models

To conduct a better experimental analysis, we refer to the literature [27] and compare with the following text-only and text+vision baseline models.
For unimodal methods, we consider: BiLSTM-CRF [11] and CNN-BiLSTM-CRF [19], classic NER models that combine (CNN-)BiLSTM encoders with a CRF. HBiLSTM-CRF [14] is a variant of CNN-BiLSTM-CRF that obtains char-level word representations by replacing the CNN layers with BiLSTM layers. BERT [13] and its variant BERT-CRF, where BERT is a pre-trained language model stacked from multiple layers of bidirectional Transformers.
For multimodal methods, we consider: GVATT-HBiLSTM-CRF [16], which uses HBiLSTM-CRF as the encoder and an attention mechanism to combine image and text information. AdaCAN-CNN-BiLSTM-CRF [38] is based on CNN-BiLSTM-CRF and uses an adaptive co-attention network to decide whether to focus on an image. GVATT-BERT-CRF [16] is a variant of GVATT-HBiLSTM-CRF that replaces the text encoder with BERT. AdaCAN-BERT-CRF [38] is a variant of AdaCAN-CNN-BiLSTM-CRF that replaces the text encoder with BERT. ATTR-MMKG-MNER [4] is a multimodal NER model that introduces image attributes and image knowledge to help improve NER. UMGF [37] obtains a visual representation of the text based on a graph model. MAF [32] is a matching and alignment framework for MNER. BFCL [27] uses a bottleneck network to fuse fine-grained modality information.

4.5 Experimental results and analysis

To fairly verify the advantages of our model in multimodal NER, our proposed approach is compared with some benchmark models under the same experimental dataset distribution. The experimental results are shown in Table 3.
Analysis of text-based NER methods. We can see from the table that the BERT-based methods are significantly better than the LSTM-based methods: on both datasets, the F1 score of BERT is clearly higher than that of CNN-BiLSTM-CRF. This shows that the pre-trained BERT model is more suitable for scenarios with insufficient text information and noisy data than CNN- and LSTM-based models. On the other hand, the F1 score of BERT-CRF is higher than that of BERT, which indicates that CRF can effectively model the dependencies between tags in the sequence and predict the best label chain.
Analysis of multimodal NER methods. We can see that models with added visual modality information perform better than text-only models. On the Twitter-2017 dataset, GVATT achieves higher F1 scores than BiLSTM-CRF and CNN-BiLSTM-CRF, and AdaCAN likewise outperforms both. This shows that visual information can make up for the ambiguity and vagueness present in the text, enrich the context of social media posts, and contribute to named entity recognition.
Analysis compared with all other MNER methods. As can be seen from Table 3, our model achieves SOTA performance on both public datasets, proving its effectiveness. On the Twitter-2015 dataset, our model improves Precision, Recall, and F1 compared with MAF, and on the Twitter-2017 dataset it improves Precision, Recall, and F1 compared with BFCL.
Table 3: Performance comparison on two public datasets, Twitter-2015 and Twitter-2017, our model results are the average of three random runs.
Modality Model Twitter-2015 Twitter-2017
Single Type (F1) Overall Single Type (F1) Overall
PER LOC ORG MISC P R F1 PER LOC ORG MISC P R F1
Text BiLSTM-CRF 76.77 72.56 41.33 26.80 68.14 61.49 64.42 85.12 72.68 72.50 52.56 79.42 73.43 76.31
CNN-BiLSTM-CRF 80.86 75.39 47.77 32.61 66.24 68.09 67.15 87.99 77.44 74.02 60.82 80.00 78.76 79.37
HBiLSTM-CRF 82.34 76.83 51.59 32.52 70.32 68.05 69.17 87.91 78.57 76.67 59.32 82.69 78.16 80.37
BERT 84.72 79.91 58.26 38.81 68.30 74.61 71.32 90.88 84.00 79.25 61.63 82.19 83.72 82.95
BERT-CRF 84.74 80.51 60.27 37.29 69.22 74.59 71.81 90.25 83.05 81.13 62.21 83.32 83.57 83.44
Text+Vision GVATT-HBiLSTM-CRF 82.66 77.21 55.06 35.25 73.96 67.90 70.80 89.34 78.53 79.12 62.21 83.41 80.38 81.87
AdaCAN-CNN-BiLSTM-CRF 81.98 78.95 53.07 34.02 72.75 68.74 70.69 89.63 77.46 79.24 62.77 84.16 80.24 82.15
GVATT-BERT-CRF 84.43 80.87 59.02 38.14 69.15 74.46 71.70 90.94 83.52 81.91 62.75 83.64 84.38 84.01
AdaCAN-BERT-CRF 85.28 80.64 59.39 38.88 69.87 74.59 72.15 90.20 82.97 82.67 64.83 85.13 83.20 84.10
ATTR-MMKG-MNER 84.28 79.43 58.97 41.47 74.78 71.82 73.27 - - - - - - -
UMGF 84.26 83.17 62.45 42.42 74.49 75.21 74.85 91.92 85.22 83.13 69.83 86.54 84.50 85.51
MAF 84.67 81.18 63.35 41.82 71.86 75.10 73.42 91.51 85.80 85.10 68.79 86.13 86.38 86.25
BFCL 85.60 81.77 63.81 40.30 74.02 75.07 74.54 91.17 86.43 83.97 66.67 85.99 85.42 85.70
ViLBERT 84.46 80.32 65.10 39.66 73.00 74.37 73.68 90.75 84.07 82.18 67.97 83.63 85.86 84.73
VEC-MNER 86.11 81.03 62.86 40.60 74.56 75.23 74.89 93.88 81.27 85.49 73.40 87.42 87.61 87.51
This indicates that, compared with the aforementioned heterogeneous methods, our proposed hybrid Transformer homogeneous architecture does not require additional complex cross-modal alignment and interaction structures, and can achieve better performance.
Analysis of the dual-stream Transformer. To further evaluate the effectiveness of our Hybrid Transformer Architecture, we introduce ViLBERT [17], a visual-language pre-trained model that interacts through co-attention Transformer layers in two streams. We observe that our model outperforms ViLBERT in F1 score on both the Twitter-2015 and Twitter-2017 datasets. This observation shows that our model is more effective in utilizing both image and text information for cross-modal alignment and fusion, thereby improving task performance.

4.6 Ablation Study

To further investigate whether the different components of our model contribute to entity recognition, we carry out ablation experiments; the results are shown in Table 4.
Table 4: Ablation Study
Models Twitter-2015 Twitter-2017
P R F1 P R F1
VEC-MNER 74.56 75.23 74.89 87.42 87.61 87.51
- FRI& VGI 73.62 74.50 74.06 86.09 86.93 86.51
- CL 73.76 74.87 74.31 87.61 86.48 87.04
- CAA-Encoder 73.47 75.56 74.50 88.05 86.68 87.36
- CADF-Encoder 72.38 74.08 73.22 85.32 87.34 86.32
- CA 74.76 73.49 74.12 86.14 87.76 86.94
Importance of visual objects. We remove the visual local objects and use only the original image as the input of the visual modality. The F1 score of our model drops on both Twitter-2015 and Twitter-2017. This observation shows that visual objects can make up for the contextual information missing from the text, thereby helping to improve model performance.
Importance of the CAA-Encoder component. We remove the CAA-Encoder layer and use only the computation of the CADF-Encoder layer. The F1 score of our model drops on both Twitter-2015 and Twitter-2017. This observation shows that modality alignment is very important, as it facilitates the subsequent learning of fine-grained multimodal semantic fusion information.
Importance of the CADF-Encoder component. We remove the CADF-Encoder layer and adopt the computation of the CAA-Encoder layer instead. The F1 score of our model drops on both Twitter-2015 and Twitter-2017. This observation shows that modality fusion is indispensable, as it helps to fully exploit the effective semantic associations between the two modalities.
Importance of the CA module. We remove the CA module and directly feed the visual modality information to the alignment layer and the deep fusion layer. The F1 score of our model drops on both Twitter-2015 and Twitter-2017. This observation shows that the CA module can effectively reduce heterogeneity between modalities and alleviate visual deviation.
Importance of contrastive learning. We remove the contrastive loss during model training. The F1 score of our model drops on both Twitter-2015 and Twitter-2017. This observation suggests that contrastive learning helps modality alignment and bridges the gap between multimodal representations.

4.7 Sensitivity Analysis of CAA-Encoder layers

In Section 3, the CAA-Encoder is defined as a stack of layers. As shown in Figure 3, the number of CAA-Encoder layers affects the performance of our model, so it is worth paying attention to. We vary the number of CAA-Encoder layers and conduct an experimental analysis.
Figure 3: Parameter sensitivity analysis experiment for the number of CAA-Encoder layers.
Figure 3 shows the F1 scores on the Twitter-2015 and Twitter-2017 datasets for different numbers of CAA-Encoder layers. We can observe that as the number of layers increases, the F1 score shows an overall downward trend, and our model achieves its optimal performance with a small number of CAA-Encoder layers on both datasets. The above results show that: (1) Designing an alignment module is necessary, which proves the rationality of the multimodal structure paradigm. (2) The fusion module is more important than the alignment module; it is closer to the decoding layer, and letting the fusion module dominate is more conducive to task performance. (3) Our model further explores the potential of the BERT and CLIP pre-trained models on MNER tasks.

4.8 Case study

To more intuitively demonstrate the capabilities of our model, we select two typical sample cases for a case study, as shown in Figure 4.
Case 4a shows that visual information is very important and can help determine the entity type. The text "Kolo loves the sun and is so pretty, too" is highly ambiguous and lacks sufficient evidence to identify the type of the entity "Kolo"; humans easily misidentify "Kolo" as PER based on common sense. Although the BERT-CRF model, which only uses text, has been pre-trained on a large amount of data, it still identifies "Kolo" as PER. However, both multimodal models, VEC-MNER and BFCL, correctly identify the entity "Kolo" as MISC, because they learn the connection between vision and text.
Case 4b shows that our model has a stronger ability to model fine-grained semantic information between modalities and filter visual noise. The text means "Thanks Andrew for the wonderful Tesla trip speech", and there is a semantic correspondence with the person holding the microphone in the image. The BFCL model incorrectly identified "Tesla" as PER, while our model mined the deep semantic association between visual objects and entities and correctly identified "Tesla" as ORG.
Figure 4: Multimodal NER cases. (a) "[Kolo MISC] loves the sun and is so pretty, too." Predictions for entity 1: BERT-CRF: PER; BFCL: MISC; VEC-MNER: MISC. (b) "Thanks [Andrew PER] for the great [Tesla ORG] road trip presentation." Predictions for entities 1 and 2: BERT-CRF: PER, None; BFCL: PER, PER; VEC-MNER: PER, ORG.
to the BFCL heterogeneous method, our proposed hybrid Transformer isomorphic architecture does not require additional complex cross-modal alignment and interaction structures, and can achieve better performance. This also proves that our exploration idea is correct.
与 BFCL 异构方法相比,我们提出的混合 Transformer 同构架构不需要额外复杂的跨模态配准和交互结构,可以实现更好的性能。这也证明了我们的探索思路是正确的。

5 CONCLUSION AND FUTURE WORK

This paper proposes a novel Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-level Interaction for MNER, which has a simpler architecture and further unlocks the potential of the dual-stream Transformer for MNER; this is a major highlight of this paper. Specifically, we design the CAA-Encoder layer and the CADF-Encoder layer, combined with contrastive learning, to achieve more effective implicit alignment and deep semantic fusion between modalities, respectively. We also construct a CA module that can effectively reduce heterogeneity between modalities and alleviate visual deviation. We believe that this research, which uses a Hybrid Transformer Architecture to model MNER tasks, can provide a new perspective for subsequent researchers.
In future work, we plan to explore the ability of a single-stream Transformer architecture to model multimodal NER tasks, which is another way to mine implicit associative representations between vision and language.

ACKNOWLEDGMENTS

This work is supported by the Key Program of the National Natural Science Foundation of China (Grant No.62237001).

REFERENCES

[1] Aviad Aberdam, Ron Litman, Shahar Tsiper, Oron Anschel, Ron Slossberg, Shai Mazor, R Manmatha, and Pietro Perona. 2021. Sequence-to-sequence contrastive learning for text recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 15302-15312.
[1] Aviad Aberdam、Ron Litman、Shahar Tsiper、Oron Anschel、Ron Slossberg、Shai Mazor、R Manmatha 和 Pietro Perona。2021.用于文本识别的序列到序列对比学习。IEEE/CVF 计算机视觉与模式识别会议论文集》。15302-15312.
[2] Omer Arshad, Ignazio Gallo, Shah Nawaz, and Alessandro Calefati. 2019. Aiding intra-text representations with visual context for multimodal named entity recognition. In 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 337-342
[2] Omer Arshad、Ignazio Gallo、Shah Nawaz 和 Alessandro Calefati。2019.多模态命名实体识别中的视觉上下文辅助文本内表征。2019年国际文档分析与识别大会(ICDAR)。IEEE,337-342
[3] Meysam Asgari-Chenaghlu, M Reza Feizi-Derakhshi, Leili Farzinvash, MA Balafar, and Cina Motamed. 2022. CWI: A multimodal deep learning approach for
[3] Meysam Asgari-Chenaghlu, M Reza Feizi-Derakhshi, Leili Farzinvash, MA Balafar, and Cina Motamed.2022.CWI:多模态深度学习方法

named entity recognition from social media using character, word and image features. Neural Computing and Applications (2022), 1-18.
《利用字符、单词和图像特征识别社交媒体中的命名实体。神经计算与应用》(2022 年),1-18 页。
[4] Dawei Chen, Zhixu Li, Binbin Gu, and Zhigang Chen. 2021. Multimodal named entity recognition with image attributes and image knowledge. In Database Systems for Advanced Applications: 26th International Conference, DASFAA 2021, Taipei, Taiwan, April 11-14, 2021, Proceedings, Part II 26. Springer, 186-201.
[4] Dawei Chen,Zhixu Li,Binbin Gu,and Zhigang Chen.2021.利用图像属性和图像知识的多模态命名实体识别。在高级应用数据库系统:26th International Conference, DASFAA 2021, Taipei, Taiwan, April 11-14, 2021, Proceedings, Part II 26.Springer, 186-201.
[5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020 A simple framework for contrastive learning of visual representations. In Inter national conference on machine learning. PMLR, 1597-1607.
[5] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton.2020 视觉表征对比学习的简单框架。In Inter National Conference on Machine Learning.PMLR,1597-1607。
[6] Xiang Chen, Ningyu Zhang, Lei Li, Yunzhi Yao, Shumin Deng, Chuanqi Tan Fei Huang, Luo Si, and Huajun Chen. 2022. Good Visual Guidance Make A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction. In Findings of the Association for Computational Linguistics: NAACL 2022. Association for Computational Linguistics, Seattle, United States, 16071618. https://doi.org/10.18653/v1/2022.findings-naacl. 121
[6] 陈翔、张宁宇、李磊、姚云志、邓淑敏、谭传奇、黄飞、司珞、陈华军。2022.良好的视觉引导造就更好的提取器:多模态实体和关系提取的分层视觉前缀。计算语言学协会论文集:NAACL 2022。计算语言学协会,美国西雅图,16071618。https://doi.org/10.18653/v1/2022.findings-naacl.121
[7] Ross Girshick. 2015. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision. 1440-1448.
[7] Ross Girshick.2015.Fast r-cnn。 In Proceedings of the IEEE international conference on computer vision.1440-1448.
[8] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momen tum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9729-9738.
[8] He Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick.2020.用于无监督视觉表征学习的门瘤对比度。In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition.9729-9738.
[9] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask r-cnn In Proceedings of the IEEE international conference on computer vision. 2961-2969
[9] 何开明、Georgia Gkioxari、Piotr Dollár 和 Ross Girshick。2017.Mask r-cnn In Proceedings of the IEEE international conference on computer vision.2961-2969
10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770-778.
10]Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.2016.图像识别的深度残差学习。In Proceedings of the IEEE conference on computer vision and pattern recognition.770-778.
[11] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991 (2015)
[11] Zhiheng Huang,Wei Xu,and Kai Yu.2015.用于序列标记的双向 LSTM-CRF 模型。arXiv preprint arXiv:1508.01991 (2015)
12] Meihuizi Jia, Xin Shen, Lei Shen, Jinhui Pang, Lejian Liao, Yang Song, Meng Chen, and Xiaodong He. 2022. Query prior matters: a MRC framework for mul timodal named entity recognition. In Proceedings of the 30th ACM International Conference on Multimedia. 3549-3558.
12]Meihuizi Jia, Xin Shen, Lei Shen, Jinhui Pang, Lejian Liao, Yang Song, Meng Chen, and Xiaodong He.2022.查询先验事项:用于多模态命名实体识别的 MRC 框架。第 30 届 ACM 国际多媒体会议论文集》。3549-3558.
[13] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT Pre-training of Deep Bidirectional Transformers for Language Understanding In Proceedings of NAACL-HLT. 4171-4186
[13] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova.2019.用于语言理解的深度双向变换器的 BERT 预训练 在 NAACL-HLT 会议录中。4171-4186
[14] Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural Architectures for Named Entity Recognition. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 260-270.
[14] Guillaume Lample、Miguel Ballesteros、Sandeep Subramanian、Kazuya Kawakami 和 Chris Dyer。2016.命名实体识别的神经架构》。计算语言学协会北美分会 2016 年会议论文集:人类语言技术。260-270.
[15] Yixin Liu and Pengfei Liu. 2021. SimCLS: A Simple Framework for Contrastive Learning of Abstractive Summarization. In Proceedings of the 59th Annual Meet ing of the Association for Computational Linguistics and the 11th International foint Conference on Natural Language Processing (Volume 2: Short Papers). 10651072.
[15] Yixin Liu and Pengfei Liu.2021.SimCLS:抽象总结对比学习的简单框架。In Proceedings of the 59th Annual Meet ing of the Association for Computational Linguistics and the 11th International foint Conference on Natural Language Processing (Volume 2: Short Papers).10651072.
[16] Di Lu, Leonardo Neves, Vitor Carvalho, Ning Zhang, and Heng Ji. 2018. Visual attention model for name tagging in multimodal social media. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1990-1999
[16] Di Lu、Leonardo Neves、Vitor Carvalho、Ning Zhang 和 Heng Ji。2018.多模态社交媒体中姓名标记的视觉注意力模型。第 56 届计算语言学协会年会论文集(第 1 卷:长篇论文)。1990-1999
[17] Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Ad vances in neural information processing systems 32 (2019)
[17] Jiasen Lu、Dhruv Batra、Devi Parikh 和 Stefan Lee。2019.Vilbert:针对视觉-语言任务的任务识别视觉语言表征预训练。神经信息处理系统的发展 32 (2019)
[18] Junyu Lu, Dixiang Zhang, Jiaxing Zhang, and Pingjian Zhang. 2022. Flat Multi modal Interaction Transformer for Named Entity Recognition. In Proceedings of the 29th International Conference on Computational Linguistics. 2055-2064.
[18] Junyu Lu, Dixiang Zhang, Jiaxing Zhang, and Pingjian Zhang.2022.用于命名实体识别的扁平多模态交互变换器。第 29 届国际计算语言学大会论文集》。2055-2064.
[19] Xuezhe Ma and Eduard Hovy. 2016. End-to-end Sequence Labeling via Bidirectional LSTM-CNNs-CRF. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 1064-1074.
[19] Xuezhe Ma 和 Eduard Hovy.2016.通过双向 LSTM-CNNs-CRF 进行端到端序列标注。In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).1064-1074.
[20] Seungwhan Moon, Leonardo Neves, and Vitor Carvalho. 2018. Multimodal named entity disambiguation for noisy social media posts. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2000-2008.
[21] Seungwhan Moon, Leonardo Neves, and Vitor Carvalho. 2018. Multimodal named entity recognition for short social media posts. arXiv preprint arXiv:1802.07862 (2018).
[22] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748 (2018).
[23] Alexandros Stergiou, Ronald Poppe, and Grigorios Kalliatakis. 2021. Refining activation downsampling with SoftPool. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 10357-10366.
[24] Dianbo Sui, Zhengkun Tian, Yubo Chen, Kang Liu, and Jun Zhao. 2021. A large-scale Chinese multimodal NER dataset with speech clues. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers).
[25] Lin Sun, Jiquan Wang, Yindu Su, Fangsheng Weng, Yuxuan Sun, Zengwei Zheng, and Yuanyi Chen. 2020. RIVA: a pre-trained tweet multimodal model based on text-image relation for multimodal NER. In Proceedings of the 28th International Conference on Computational Linguistics. 1852-1862.
[26] Fuying Wang, Yuyin Zhou, Shujun Wang, Varut Vardhanabhuti, and Lequan Yu. 2022. Multi-granularity cross-modal alignment for generalized medical visual representation learning. Advances in Neural Information Processing Systems 35 (2022), 33536-33549.
[27] Peng Wang, Xiaohang Chen, Ziyu Shang, and Wenjun Ke. 2023. Multimodal Named Entity Recognition with Bottleneck Fusion and Contrastive Learning. IEICE TRANSACTIONS on Information and Systems 106, 4 (2023), 545-555.
[28] Xinyu Wang, Min Gui, Yong Jiang, Zixia Jia, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, and Kewei Tu. 2021. Ita: image-text alignments for multi-modal named entity recognition. arXiv preprint arXiv:2112.06482 (2021).
[29] Xuwu Wang, Junfeng Tian, Min Gui, Zhixu Li, Jiabo Ye, Ming Yan, and Yanghua Xiao. 2022. PromptMNER: prompt-based entity-related visual clue extraction and integration for multimodal named entity recognition. In International Conference on Database Systems for Advanced Applications. Springer, 297-305.
[30] Shuang Wu, Xiaoning Song, and Zhenhua Feng. 2021. MECT: Multi-Metadata Embedding based Cross-Transformer for Chinese Named Entity Recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). 1529-1539.
[31] Zhiwei Wu, Changmeng Zheng, Yi Cai, Junying Chen, Ho-fung Leung, and Qing Li. 2020. Multimodal representation with embedded visual guiding objects for named entity recognition in social media posts. In Proceedings of the 28th ACM International Conference on Multimedia. 1038-1046.
[32] Bo Xu, Shizhou Huang, Chaofeng Sha, and Hongya Wang. 2022. MAF: a general matching and alignment framework for multimodal named entity recognition. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1215-1223.
[33] Shusheng Xu, Xingxing Zhang, Yi Wu, and Furu Wei. 2022. Sequence level contrastive learning for text summarization. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36. 11556-11565.
[34] Zhengyuan Yang, Boqing Gong, Liwei Wang, Wenbing Huang, Dong Yu, and Jiebo Luo. 2019. A fast and accurate one-stage approach to visual grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4683-4693.
[35] Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang, Jie Zhou, and Jiebo Luo. 2020. A Novel Graph-based Multi-modal Fusion Encoder for Neural Machine Translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 3025-3035.
[36] Jianfei Yu, Jing Jiang, Li Yang, and Rui Xia. 2020. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.
[37] Dong Zhang, Suzhong Wei, Shoushan Li, Hanqian Wu, Qiaoming Zhu, and Guodong Zhou. 2021. Multi-modal graph fusion for named entity recognition with targeted visual guidance. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 14347-14355.
[38] Qi Zhang, Jinlan Fu, Xiaoyu Liu, and Xuanjing Huang. 2018. Adaptive co-attention network for named entity recognition in tweets. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
[39] Xin Zhang, Jingling Yuan, Lin Li, and Jianquan Liu. 2023. Reducing the Bias of Visual Objects in Multimodal Named Entity Recognition. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining. 958-966.
[40] Changmeng Zheng, Zhiwei Wu, Tao Wang, Yi Cai, and Qing Li. 2020. Object-aware multimodal named entity recognition in social media posts with adversarial learning. IEEE Transactions on Multimedia 23 (2020), 2520-2532.

* Corresponding author.