VEC-MNER: Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-level Interaction for Multimodal NER
Pengfei Wei, Guangdong University of Technology, Guangzhou, China, wpf@gdut.edu.cn
Hongjun Ouyang, Guangdong University of Technology, Guangzhou, China, 3120005742@mail2.gdut.edu.cn
Qintai Hu*, Guangdong University of Technology, Guangzhou, China, huqt8@gdut.edu.cn
Bi Zeng, Guangdong University of Technology, Guangzhou, China, zb9215@gdut.edu.cn
Guang Feng, Guangdong University of Technology, Guangzhou, China, von@gdut.edu.cn
Qingpeng Wen, Guangdong University of Technology, Guangzhou, China, wqp@mail2.gdut.edu.cn
Abstract
Multimodal Named Entity Recognition (MNER) aims to leverage visual information to identify entity boundaries and categories in social media posts. Existing methods mainly adopt heterogeneous architectures, with ResNet (CNN-based) and BERT (Transformer-based) dedicated to modeling visual and textual features, respectively. However, current approaches still face the following issues: (1) weak cross-modal correlations and poor semantic consistency; (2) suboptimal fusion results when visual objects and textual entities are inconsistent. To this end, we propose a Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-level Interaction (VEC-MNER) model for MNER. Specifically, in contrast to heterogeneous architectures, we propose a new homogeneous Hybrid Transformer architecture, which naturally reduces heterogeneity. Moreover, we design the Correlation-Aware Alignment (CAA-Encoder) layer and the Correlation-Aware Deep Fusion (CADF-Encoder) layer, combined with contrastive learning, to achieve more effective implicit alignment and deep semantic fusion between modalities, respectively. We also construct a Correlation-Aware (CA) module that can effectively reduce the heterogeneity between modalities and alleviate visual deviation. Experimental results demonstrate that our approach achieves state-of-the-art F1-scores on Twitter-2015 and Twitter-2017.
CCS CONCEPTS
• Information systems → Multimedia information systems;
• Computing methodologies → Artificial intelligence.
KEYWORDS
Multimodal Named Entity Recognition, Dual-Stream Transformer, Cross-Modal Fusion, Contrastive Learning, Conditional Random Field, Sequence Labeling
ACM Reference Format:
Pengfei Wei, Hongjun Ouyang, Qintai Hu, Bi Zeng, Guang Feng, and Qingpeng Wen. 2024. VEC-MNER: Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-level Interaction for Multimodal NER. In Proceedings of the 2024 International Conference on Multimedia Retrieval (ICMR '24), June 10-14, 2024, Phuket, Thailand. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3652583.3658097
1 INTRODUCTION
Traditional NER tasks only extract keyword information from text. In today's information age, with the surge of social media, multimodal NER brings unique challenges. The vast amount of content created by users on social media platforms covers multiple modalities such as text and images. However, information on social media is highly unstructured, short, and noisy [38], which makes traditional NER methods perform poorly in this domain. The task of multimodal NER is to accurately locate and classify named entities by simultaneously utilizing text and image information. As shown in Figure 1a, when classifying the named entity "Kevin Durant" as "PER" rather than "MISC", the added visual information can alleviate the semantic deficiency and ambiguity caused by using textual information alone.
Figure 1: Three examples of multimodal named entity recognition in social media: (a) Kevin Durant [PER]; (b) Texas Tech [ORG] and Oracle Arena [LOC]; (c) Donald Trump [PER].
The core of existing MNER methods is to achieve alignment and fusion of textual and visual information through heterogeneous architectures of ResNet (CNN-based) and BERT (Transformer-based). These methods are mainly divided into the following four categories: (1) Arshad et al. [2], Asgari-Chenaghlu et al. [3], Lu et al. [16], and Moon et al. [20, 21] adopt a pre-trained CNN model, such as ResNet [10], to encode the whole image into a global feature vector, and then enhance the representation of each word with the global image vector through an attention mechanism. (2) Sun et al. [25], Wang et al. [26], and Yu et al. [36] divide the feature map obtained from the image evenly into multiple blocks, and then learn the most valuable visual-aware word representations by modeling the interaction between the text sequence and the visual regions with a Transformer or gating mechanism. (3) Some researchers employ object detection models such as Mask R-CNN [9] to obtain visual objects from associated images, and then combine object-level visual information and textual word information based on GNNs or cross-modal attention. (4) Other works explore knowledge derived from image content, including OCR, image descriptions, and other image attributes [12, 28, 29], which are used to guide words toward helpful visual semantic information.
Although existing methods have achieved impressive results, they still have some obvious limitations. It is widely believed that explicit alignment can mine fine-grained correspondences between text and image. As shown in Figure 1a, by observing an image containing two people (visual objects), the type of "Kevin Durant" (entity) in the text can easily be classified as "PER". However, this explicit alignment inevitably causes problems when visual objects and entities are inconsistent in number or type. For example, in Figure 1b, there are many detected objects ($n$ visual objects) in the image, which makes it difficult to explicitly align them with "Texas Tech" in the text. In Figure 1c, one would expect to find some "people" ("PER" type) in the image aligned with "Donald Trump" ("PER" type) in the text, but the image instead contains a "panda" ("MISC" type). When there is no such precise correspondence between text and image, establishing explicit relationships between entities and visual objects becomes difficult for graph-based methods [31]. As we can see, the correlation between image and text varies across situations: fully relevant, partially relevant, or irrelevant. Therefore, irrelevant visual information can mislead modality alignment and fusion and further degrade the performance of MNER.
Towards this end, we propose VEC-MNER, a Hybrid Transformer with Visual-Enhanced Cross-Modal Multi-Level Interaction, which is a Transformer-based homogeneous architecture. The goal of our work is to explore the potential of a Transformer-based homogeneous architecture on MNER tasks while overcoming the following challenges: (1) weak cross-modal correlation and poor semantic consistency; (2) suboptimal fusion results when visual objects and text entities are inconsistent.
Our main contributions can be summarized as follows:
(1) To our knowledge, our work is the first to propose a novel Vision-Enhanced Cross-Modal Multi-Level Interactive Hybrid Transformer model for the MNER task, leveraging a unified Transformer architecture to encode text and visual objects, which naturally reduces heterogeneity and better models the relationships between modalities.
(2) We design the CAA-Encoder layer and the CADF-Encoder layer to achieve more effective implicit image-text alignment and deep semantic fusion between modalities, respectively. At the same time, we combine contrastive learning to further enhance the correlation between modalities, which is necessary for the MNER task.
(3) We also construct a Correlation-Aware (CA) module that can effectively reduce heterogeneity between modalities and alleviate visual deviation.
(4) Our proposed method is evaluated on the standard multimodal named entity recognition benchmarks Twitter-2015 and Twitter-2017, and achieves SOTA performance.
2 RELATED WORK
2.1 Multimodal Named Entity Recognition
In recent years, multiple modalities of data have been widely used to improve the effectiveness of named entity recognition. Due to grammar errors, excessive noise, and missing information in social media texts, some works use relevant illustrations to provide effective entity information to assist the model in prediction. Moon et al. [21] proposed using a bidirectional long short-term memory network (LSTM) to extract text features and a convolutional neural network (CNN) to extract image features, combining them with a modal attention module to predict sequence labels. In order to extract the image regions most relevant to the text, Lu et al. [16] used an attention-based model to fuse text and image features. To avoid the impact of mismatches between text and images, Xu et al. [32] proposed a cross-modal alignment and matching module that consistently integrates text and image representations. Different from the above methods, we design the CAA-Encoder layer and the CADF-Encoder layer, combined with contrastive learning, to achieve more effective implicit alignment and deep semantic fusion between modalities, respectively.
2.2 Dual-stream Transformer
Early MNER approaches primarily focused on simple dual-stream structures concatenating text (BiLSTM, BERT) and image (ResNet) features, yet this approach significantly increased the heterogeneity of the modal representation layer, making modeling difficult. With the rapid development of Transformer models in computer vision tasks, researchers shifted their attention to more complex dual-stream Transformers, which use the Transformer architecture to process the data streams of two different modalities separately. Yu et al. [36] introduce a multimodal interaction module into the interaction layer of a dual-stream Transformer to capture the dynamic relationships between modalities. However, this method still adopts the heterogeneous representation of ResNet, focuses mainly on explicit alignment, and does not fully consider the issue of visual bias. We propose a new homogeneous Hybrid Transformer architecture, which naturally reduces heterogeneity, and we also construct a CA module that can effectively alleviate visual deviation.
Figure 2: Overall architecture of VEC-MNER. OT denotes the original text, OI denotes the original image, FRI denotes the set of images processed by Fast R-CNN, and VGI denotes the set of images processed by Visual Grounding.
2.3 Contrastive Learning
Contrastive learning has been widely used in fields such as CV and NLP, e.g., image retrieval, text classification, and recommendation systems. The core idea of contrastive learning is to pull positive samples closer and push negative samples apart in the feature space by using a contrastive loss (e.g., the InfoNCE loss [22]), which brings significant improvements to downstream tasks. Several researchers proposed making different augmented representations of the same image consistent with each other and showed positive results. At the same time, researchers in the field of NLP have also begun to work on finding suitable augmentations for text. However, a major limitation of the above methods is that they only perform single-modal contrastive learning. In recent years, with the rise of multimodal pre-training models, many studies have incorporated multimodal contrastive learning into their methods [10, 27, 39]. This paper introduces multimodal contrastive learning to alleviate the problem of weak cross-modal correlation.
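To make the image-text contrastive objective concrete, the following is a minimal sketch of a symmetric InfoNCE loss over a batch of matched image-text pairs; the temperature value and tensor shapes are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE: matched pairs sit on the diagonal of the similarity
    matrix and are pulled together; other in-batch pairs act as negatives."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature           # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_t2i = F.cross_entropy(logits, targets)               # text -> image direction
    loss_i2t = F.cross_entropy(logits.t(), targets)           # image -> text direction
    return (loss_t2i + loss_i2t) / 2

loss = info_nce_loss(torch.randn(8, 768), torch.randn(8, 768))
```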
3 METHOD
This section introduces our proposed VEC-MNER model, shown in Figure 2, which mainly consists of five stacked modules. The first is the embedding layer, which vectorizes text and images. Next is the Hybrid Transformer architecture, which consists of three stacked layers: the modality encoder layer (T-Encoder and V-Encoder), the modality alignment layer (CAA-Encoder), and the modality fusion layer (CADF-Encoder). Finally, a CRF framework is used to capture the label dependencies of the entity recognition task. In addition, we denote the number of T-Encoder layers by $L_t$, the number of V-Encoder layers by $L_v$, the number of CAA-Encoder layers by $L_a$, and the number of CADF-Encoder layers by $L_f$.
Task Definition: Given a sentence and its associated images as input, the goal of MNER is to extract entities from them and classify each extracted entity into one of the predefined types, including Person (PER), Organization (ORG), Location (LOC), and Others (MISC). Similar to most existing work in MNER, we formulate the task as a sequence labeling problem. Let $S = \{w_1, w_2, \dots, w_n\}$ denote the sequence of $n$ input words, where $w_i$ denotes the $i$-th word in the sentence. $Y = \{y_1, y_2, \dots, y_n\}$ is the corresponding tag sequence, where each $y_i$ belongs to a set of predefined tags in the standard BIO scheme. We also use $O = \{o_1, o_2, \dots, o_m\}$ to represent the set of input visual objects.
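As a concrete illustration of the BIO scheme, here is a tiny example built from the entities in Figure 1 (the tokenization is hypothetical):

```python
# Words of a sentence paired with BIO tags: B- marks the beginning of an entity,
# I- marks its continuation, and O marks tokens outside any entity.
words  = ["Kevin", "Durant", "enters", "Oracle", "Arena", "!"]
labels = ["B-PER", "I-PER",  "O",      "B-LOC",  "I-LOC", "O"]
```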
3.1 Text and Visual Embedding
Text Embedding. Similar to prior work, since BERT can provide different representations of the same word in different contexts, we also adopt BERT to obtain contextualized representations. The BERT embedding consists of three components: token embedding, segment embedding, and position embedding. We first convert each text word into a contextual vector and preprocess the input sentence by inserting the special tokens [CLS] and [SEP] at the beginning and ending positions, respectively. The word sequence is represented as $T = (t_0, t_1, \dots, t_{N-1})$, where $N$ is the length of the transformed text sequence.
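A minimal sketch of obtaining these contextualized token vectors with the HuggingFace transformers library; the bert-base-uncased checkpoint is an assumption, since the extraction does not show which BERT variant the paper uses.

```python
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "Kevin Durant enters Oracle Arena"
enc = tokenizer(sentence, return_tensors="pt")   # [CLS] and [SEP] are inserted automatically
with torch.no_grad():
    out = bert(**enc)
token_states = out.last_hidden_state             # (1, N, 768): one contextual vector per token
```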
Visual Embedding. Visual objects can complement missing entity information in a sentence. In contrast to prior work [12, 28, 29, 37], in order to capture the visual objects in the context, in addition to using the original image, we also introduce visual local objects from Fast R-CNN [7] and Visual Grounding as inputs. For Visual Grounding, similar to [35], we employ the parsing tool from the Stanford Grammar Parser to identify all noun phrases in the input sentence, and then apply the visual grounding toolkit [34] to detect the bounding object for each noun phrase. Since it is difficult to fully detect all potential visual objects in an image using only noun phrases, we further introduce four predefined entity categories (i.e., person, location, organization, miscellaneous) to discover more relevant visual objects.
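For illustration only, a sketch of extracting candidate visual objects with torchvision's off-the-shelf Faster R-CNN detector; the input file name, the top-3 cutoff, and this particular detector variant are assumptions rather than the authors' exact pipeline.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor
from PIL import Image

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
detector.eval()

image = Image.open("tweet_image.jpg").convert("RGB")        # hypothetical input image
with torch.no_grad():
    pred = detector([to_tensor(image)])[0]                  # dict with boxes, labels, scores

# Keep the top-3 highest-scoring regions as candidate visual objects.
object_crops = []
for idx in pred["scores"].argsort(descending=True)[:3]:
    x1, y1, x2, y2 = pred["boxes"][idx].round().int().tolist()
    object_crops.append(image.crop((x1, y1, x2, y2)))
```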
In contrast to previous work [24, 29], we choose CLIP as our visual feature extractor to obtain contextual representations. We use the same patch size for all inputs, which splits each of the three input images into a sequence of visual patches. Since both types of extracted visual objects are multiple, we select three images as input, respectively. In addition, we insert a special classification token at the beginning position, and, as with the BERT embedding, the segmented image patches are processed through the visual embedding of CLIP to obtain the final image sequence $V = (v_0, v_1, \dots, v_{M-1})$, where $M$ is the length of the transformed image sequence.
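A minimal sketch of the ViT/CLIP-style patch split described above, assuming a 224x224 input and a patch size of 32 (illustrative values; the actual patch size used by the paper is not shown in this extraction).

```python
import torch
import torch.nn as nn

# A 224x224 image with patch size P=32 yields (224/32)**2 = 49 patch tokens,
# plus one prepended class token, giving an image sequence of length M = 50.
P, d_model = 32, 768
patch_embed = nn.Conv2d(3, d_model, kernel_size=P, stride=P)   # one embedding per patch
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))

image = torch.randn(1, 3, 224, 224)                     # stand-in for a preprocessed image
patches = patch_embed(image).flatten(2).transpose(1, 2) # (1, 49, d_model)
visual_seq = torch.cat([cls_token, patches], dim=1)     # (1, M, d_model)
```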
3.2 Multimodal Hybrid Transformer
This section introduces our proposed Multimodal Hybrid Transformer architecture, a novel homogeneous architecture that has not been explored by previous MNER methods. This is a major highlight of this paper, and it mainly includes the following three stacked modules.
3.2.1 Information Encoder Layer. The information encoder layer mainly consists of a text encoder (T-Encoder) and a visual encoder (V-Encoder). Its core is the Transformer block. The Transformer has been widely used in the two major research fields of CV and NLP, is currently the dominant architecture in the community, and consists of stacked blocks. Each block mainly contains two sub-layers: a multi-head self-attention layer (MHSA) and a fully connected feed-forward network layer (FFN). Layer normalization (LN) and residual connections are also applied to each layer. Given an input vector $x \in \mathbb{R}^{n \times d}$, affine transformations convert $x$ into a query $Q$ and key-value pairs $(K, V)$, resulting in an attention map:

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $n$ is the sequence length and $d_k$ is the dimension of the query vector. MHSA performs the attention function on $h$ heads in parallel, and each head maps the input to queries, keys, and values with parameters $W_i^Q$, $W_i^K$, $W_i^V$. The role of MHSA is to compute the weighted hidden state of each head and then concatenate them:

$$\mathrm{head}_i = \mathrm{Attn}(xW_i^Q,\, xW_i^K,\, xW_i^V), \qquad \mathrm{MHSA}(x) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^O$$
where $W_i^Q, W_i^K, W_i^V \in \mathbb{R}^{d \times d_h}$ and $W^O \in \mathbb{R}^{d \times d}$, $d$ is the dimension of the hidden vector, and $d_h = d/h$ is the dimension of each head of MHSA. FFN is another important component of the Transformer. It usually consists of two linear transformation sub-layers with a ReLU activation function:

$$\mathrm{FFN}(x) = \max(0,\, xW_1 + b_1)\,W_2 + b_2$$

where $W_1 \in \mathbb{R}^{d \times d_{ff}}$, $W_2 \in \mathbb{R}^{d_{ff} \times d}$, and $b_1$, $b_2$ are learnable parameters.
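A compact PyTorch sketch of one such block, combining MHSA and FFN with residual connections and layer normalization as described above; the hidden sizes follow common BERT-base defaults and are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder block: multi-head self-attention followed by a feed-forward
    network, each wrapped in a residual connection and layer normalization."""

    def __init__(self, d_model=768, n_heads=12, d_ff=3072):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.mhsa(x, x, x)    # Q, K, V all come from the same input
        x = self.ln1(x + attn_out)          # residual + LN around self-attention
        x = self.ln2(x + self.ffn(x))       # residual + LN around the FFN
        return x

block = TransformerBlock()
h = block(torch.randn(1, 32, 768))          # (batch, sequence length, hidden size)
```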
T-Encoder. We use the first $L_t$ layers of BERT as the text encoder, aiming to obtain primary lexical and syntactic information; each layer contains MHSA and FFN blocks. Specifically, given the sentence sequence $T$, the text representation is calculated as follows:

$$H_t^{l} = \mathrm{FFN}\big(\mathrm{MHSA}(H_t^{l-1})\big), \quad l = 1, \dots, L_t, \quad H_t^{0} = T$$

(with LN and residual connections applied as above), where $H_t^{l}$ is the hidden state of the $l$-th layer of the text encoder.
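A sketch of reusing only the first $L_t$ blocks of a pre-trained BERT model as the T-Encoder, with $L_t = 8$ purely as an illustrative value (the paper's exact depth is not shown in this extraction).

```python
import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
Lt = 8                                             # hypothetical T-Encoder depth

input_ids = tokenizer("Kevin Durant enters Oracle Arena", return_tensors="pt").input_ids
with torch.no_grad():
    hidden = bert.embeddings(input_ids)            # token + segment + position embeddings
    for layer in bert.encoder.layer[:Lt]:          # only the first Lt Transformer blocks
        hidden = layer(hidden)[0]                  # each block applies MHSA + FFN with LN/residuals
```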
V-Encoder. We adopt the first $L_v$ layers of the CLIP model, pre-trained on 400 million image-text pairs, as the visual encoder, aiming to obtain primary edge, texture, and spatial features. Similar to the text encoder, each layer contains MHSA and FFN blocks. Specifically, given the embedding sequence $V$ of image patches, the visual representation is calculated as follows:

$$H_v^{l} = \mathrm{FFN}\big(\mathrm{MHSA}(H_v^{l-1})\big), \quad l = 1, \dots, L_v, \quad H_v^{0} = V$$

(with LN and residual connections applied as above), where $H_v^{l}$ is the hidden state of the $l$-th layer of the visual encoder.
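Analogously, a sketch of keeping only the first $L_v$ Transformer blocks of CLIP's vision tower through the HuggingFace implementation; the module paths, checkpoint, input file name, and $L_v = 8$ are assumptions.

```python
import torch
from transformers import CLIPImageProcessor, CLIPVisionModel
from PIL import Image

vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
Lv = 8                                             # hypothetical V-Encoder depth
# nn.ModuleList supports slice indexing, so the tower can be truncated in place.
vision.vision_model.encoder.layers = vision.vision_model.encoder.layers[:Lv]

processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("tweet_image.jpg").convert("RGB")            # hypothetical input image
pixel_values = processor(images=image, return_tensors="pt").pixel_values
with torch.no_grad():
    H_v = vision(pixel_values=pixel_values).last_hidden_state   # (1, M, hidden)
```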
3.2.2 Correlation-Aware Alignment Layer. This is the first innovation of this paper. Our proposed CAA-Encoder layer can model the heterogeneity and irrelevance problems between different modalities. This module is built from sub-layers of BERT and CLIP. Similar to previous work [27], we employ co-attention to compute attention weights between all text positions and all visual regions to capture the fine-grained correlations between modalities. From the text perspective, the query $Q_t$ is the hidden state of the text, while the key $K_v$ and value $V_v$ come from the visual hidden state; the aim is to use the visual features to adjust the attention weight of each word in the text and obtain text-related visual information. From the visual perspective, the query $Q_v$ is the hidden state of the image, while the key $K_t$ and value $V_t$ come from the hidden state of the text; the aim is to use the textual content to adjust the attention weight of each position in the image and obtain image-related textual information. The detailed formulas are as follows:

$$H_{t \to v} = \mathrm{softmax}\!\left(\frac{Q_t K_v^{\top}}{\sqrt{d_k}}\right)V_v \quad (7)$$

$$H_{v \to t} = \mathrm{softmax}\!\left(\frac{Q_v K_t^{\top}}{\sqrt{d_k}}\right)V_t \quad (8)$$

where $Q_t = H_t W^Q$, $K_v = H_v W^K$, $V_v = H_v W^V$, and $Q_v$, $K_t$, $V_t$ are obtained analogously. Equation (7) can be simplified as:

$$H_{t \to v} = \mathrm{CoAttn}(H_t, H_v)$$

Equation (8) can be simplified as:

$$H_{v \to t} = \mathrm{CoAttn}(H_v, H_t)$$
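A minimal sketch of the two co-attention directions using PyTorch's nn.MultiheadAttention; the hidden size, head count, and sequence lengths are illustrative.

```python
import torch
import torch.nn as nn

d_model, n_heads = 768, 12
t2v_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)   # text queries visual
v2t_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)   # visual queries text

H_t = torch.randn(1, 32, d_model)   # text hidden states   (batch, N, d)
H_v = torch.randn(1, 50, d_model)   # visual hidden states (batch, M, d)

# Text-related visual information: each word attends over all visual positions.
H_t2v, _ = t2v_attn(query=H_t, key=H_v, value=H_v)
# Image-related textual information: each visual position attends over all words.
H_v2t, _ = v2t_attn(query=H_v, key=H_t, value=H_t)
```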
Correlation-Aware module. This is the second innovation of this paper. In the CAA-Encoder and CADF-Encoder layers, we design a CA module to mitigate the impact of noise caused by visual elements. This is a plug-and-play module that maps arbitrary input