
2M-NER: Contrastive Learning for Multilingual and Multimodal NER with Language and Modal Fusion

Dongsheng Wang, Xiaoqin Feng, Zeming Liu, Chuan Wang
The Department of Science and Technology Teaching, China University of Political Science and Law, Beijing, 102249, China.
Mobvoi AI Lab, Beijing, 100044, China.
School of Computer Science and Engineering, Beihang University, Beijing, 100191, China.
Institute of Information Engineering, CAS, Beijing, 100085, China.

*Corresponding author(s). E-mail(s): zmliu@buaa.edu.cn;
Contributing authors: wangdsh@cupl.edu.cn; xiaoqin.feng@mobvoi.com; wangchuan@iie.ac.cn;

Abstract

Named entity recognition (NER) is a fundamental task in natural language processing that involves identifying and classifying entities in sentences into pre-defined types. It plays a crucial role in various research fields, including entity linking, question answering, and online product recommendation. Recent studies have shown that incorporating multilingual and multimodal datasets can enhance the effectiveness of NER, owing to language transfer learning and the presence of shared implicit features across different modalities. However, the lack of a dataset that combines multilingualism and multimodality has hindered research exploring the combination of these two aspects, even though multimodality can help NER in multiple languages simultaneously. In this paper, we address a more challenging task: multilingual and multimodal named entity recognition (MMNER), considering its potential value and influence. Specifically, we construct a large-scale MMNER dataset with four languages (English, French, German and Spanish) and two modalities (text and image). To tackle this challenging MMNER task on the dataset, we introduce a new model called 2M-NER, which aligns the text and image representations using contrastive learning and integrates a multimodal collaboration module to effectively depict the interactions between the two modalities. Extensive experimental results demonstrate that our model achieves the highest F1 score in multilingual and multimodal NER tasks compared with several representative baselines. Additionally, our analysis reveals that sentence-level alignment interferes considerably with NER models, indicating the higher level of difficulty of our dataset.

Keywords: Multilingual NER, Multimodal NER, Contrastive learning, Multimodal interaction

1 Introduction

Named entity recognition (NER) is one of the basic tasks in natural language processing, which aims to locate and classify specific things into pre-defined types, such as diseases, products, monetary values, etc. It is an important contributor to different research fields, including question answering, automatic text-video retrieval [3] and machine translation, as it enhances the understanding and interpretation of textual information in these tasks. Early studies recognize entities with long short-term memory (LSTM), attention-based techniques, or convolutional neural networks (CNN). However, recent research has demonstrated that Transformer-based structures surpass these earlier methods in NER tasks. Despite these advancements, most early studies solely utilized monolingual and unimodal text to recognize entities, which is insufficient [10-12]. For example, text-only methods ignore the helpfulness of corresponding images, leading to lower accuracy in entity recognition.
Some other researchers leverage multilingual and multimodal information to solve the aforementioned issues [13-15]. On the one hand, some studies on multilingual NER have found that knowledge transfer from one language to another is useful for zero-resource NER and cross-lingual NER. Hence, they have designed models to take advantage of the multilingual features of multiple languages. On the other hand, some works on multimodal NER [18-20] have combined the textual modality with the visual modality so that both explicit and implicit visual information can be exploited to help improve the performance of their models. However, these studies have a limitation in that they often focus on either multilingual or multimodal NER, with many public datasets available for multilingual NER [21-23] and multimodal NER [18, 24, 25] respectively. In the real world, multilingualism and multimodality coexist, making it valuable for models to utilize the multilingual features of multilingual text and to integrate related visual information with textual information. Thus, it is highly beneficial to develop a multilingual and multimodal NER dataset to facilitate future research in this area.
To address the above issue and advance MMNER, we make the following attempts and efforts. We first build a large human-annotated Multilingual and Multimodal NER Dataset (MMNERD), which is constructed from a large multilingual dataset and an English multimodal dataset via transformation, translation and human annotation. Concretely, we mark all appearances of entities belonging to four categories (person, location, organization, and miscellaneous) in a widely used multilingual dataset called mBART50 and a famous multimodal dataset Twitter-2017 [24]. As a result, MMNERD has become the first public dataset which supports both multilingual and multimodal NER. Its total number of sentences across four languages (English, French, Spanish and German) is 42,908, and an in-depth statistical analysis of MMNERD is shown in Table 2. Besides, the detailed comparison between MMNERD and previous NER datasets in Section 3.3 highlights the multilingual and multimodal characteristics of our dataset.
Based on MMNERD and its new MMNER task, we propose a novel Multilingual and Multimodal NER model named 2M-NER. Concretely, considering the different architectures of various image encoders, 2M-NER leverages ViT [26] and ResNet [27] to extract patch features and convolution features, respectively. Meanwhile, instead of directly blending textual and visual modalities as in many previous works, contrastive learning is applied for modal alignment, implemented by the contrastive loss in the multimodal alignment module. For instance, in Fig. 1, the entity Albert Pujols in the English sentence aligns with the area of a baseball player within the image. In the representation space, the entity Albert Pujols and its corresponding area should have similar embeddings, so that they have a higher similarity score. Besides, 2M-NER utilizes two cross-attention Transformer [28] layers to build modal interactions between text and images.
Moreover, we conduct extensive experiments on MMNERD. A set of classical and competitive NER methods are selected for comparison. Specifically, we first utilize some representative text-based NER models such as BiLSTM-CRF [6], HBiLSTM-CRF [29], BERT [30], etc. After that, to illustrate the help of visual representations to entity recognition, we evaluate several representative multimodal NER models, such as UMT [8], UMGF [9], and MKGFormer [19], on our dataset. In terms of language, all models are tested on the four languages combined as well as on each language individually. These experiments prove the effectiveness of 2M-NER on the new MMNER task.
Our work can be primarily characterized by the following major contributions:

English: Baseball world reacts to Albert Pujols [PER]' 600th career home run
French: La réaction du monde du baseball à la 600e home run de la carrière d'Albert Pujols [PER]
Spanish: La reacción del mundo del béisbol al jonrón número 600 de la carrera de Albert pryors [PER]
German: Baseball Welt reagiert auf Albert Pujols [PER]

Fig. 1 An instance of multilingual and multimodal NER. The types of all entities are enclosed in brackets.
(1) We construct MMNERD, a large-scale human-annotated MMNER dataset, which has four languages and two modalities. So far as we know, this is the first public dataset that supports both multilingual and multimodal NER.
(2) To facilitate research on this dataset, our paper introduces a novel model called 2M-NER for MMNER, which leverages contrastive learning to align the language and vision representations.
(3) Based on 2M-NER, comparative experiments illustrate the effectiveness of our method, while further experiments indicate that the assistance of multilingual data and the visual modality can enhance the performance of MMNER.
The following sections give the organization of this paper. In Section 2, we summarize the relevant literature on multilingual NER and multimodal NER. Section 3 describes the procedure of dataset construction in detail. A summary of our proposed model and an illustration of each component are presented in Section 4. The comparative experiments and ablation study conducted on MMNERD are presented in Section 5. Finally, Section 6 provides a comprehensive overview of our work and lists some future work.
2 Related Work

In this section, we review two groups of relevant studies closely associated with MMNER: multilingual named entity recognition and multimodal named entity recognition.

2.1 Multilingual Named Entity Recognition

Named entity recognition (NER) is a general task that provides foundational support for a range of natural language processing (NLP) tasks. As research on multilingual models deepens, monolingual NER gradually migrates to multilingual NER, which provides a solid foundation for multilingual NLP tasks such as question answering [31, 32], information retrieval [33, 34], relation extraction, etc. In previous studies, NER relied on statistical modeling of annotated data, which is extremely arduous and expensive. For example, Joel et al. [37] proposed a data annotation method utilizing the textual content and structural framework of Wikipedia. This enormous, accessible, and multilingual data helps the model surpass domain-specific models in terms of performance. Recently, many researchers have presented various methods to obtain the most advanced outcomes in NER. Specifically, Shervin et al. [38] presented MultiCoNER, a vast multilingual corpus for NER that encompasses three domains and eleven different languages, including sub-corpora featuring multilingual and code-mixed content. They applied two NER models to showcase the difficulty and validity of the dataset. Emanuela et al. [17], through their analysis of the MultiCoNER dataset, discovered that incorporating supplementary contextual information from the training data can enhance the performance of NER on shorter text samples. Additionally, Shervin et al. [39] fused external knowledge into transformer models to achieve the best performance. Moreover, CoNLL2002 [21], CoNLL2003 [22], and WikiAnn [23] are also commonly used multilingual NER datasets. In comparison to them, the dataset we constructed incorporates multimodal characteristics.
Owing to the advancements in pre-trained language models (PLM), multilingual NER has started to utilize these models as embeddings to improve its performance [40, 41]. Genta et al. [42] and Wu et al. [43] proposed different meta-embedding methods to represent multilingual text. In the meantime, several researchers have employed multilingual models that support over one hundred languages as a foundation for transfer learning to other language models, which enables them to achieve superior performance compared to many neural methods and multilingual pre-trained models [13, 41]. In summary, many multilingual methods for NER have been proposed and validated on rare languages and in industrial areas, but further in-depth research is still required. Different from these methods, our approach incorporates additional visual information to assist multilingual NER.

2.2 Multimodal Named Entity Recognition

Multimodal NER, similar to multilingual NER, has recently drawn the attention of academics due to the abundance of user-generated graphic and textual data on social media platforms like Twitter. Based on Twitter data, researchers have built two widely used datasets, Twitter-2015 [18] and Twitter-2017 [24], which are text-image corpora. However, our dataset stands out in terms of its larger scale and multilingual nature. Employing these multimodal NER datasets, numerous studies have leveraged the corresponding images to help recognize entities within the text. During the initial phases of these studies, Zhang et al. [18], Lu et al. [24], and Moon et al. [44] adopted long short-term memory (LSTM) networks to obtain text representations and convolutional neural networks (CNN) to obtain features from images. The methods used for text and image feature extraction and fusion in these works are coarse-grained and simple. Recently, pre-trained models like BERT have been employed in multimodal NER to generate better text representations. Specifically, Yu et al. [8] employed a BERT encoder to obtain contextualized representations for input sentences. They also proposed a multimodal interaction module that generates word representations with knowledge of images, and visual representations with knowledge of words. Similarly, Zhao et al. [45] used the pre-trained model BERT for text encoding, and they proposed a relation-based graph convolutional architecture to examine the extrinsic matching relationships among pairs of texts and images. To establish a text-image relation, Sun et al. [46] built a propagation-based BERT model for multimodal NER by integrating soft or hard gates.
In the meantime, several researchers have found it useful for multimodal NER to find fine-grained visual objects and filter out irrelevant ones with an object detection model. To this end, Zheng et al. [47] built an adversarial attention-based network with Mask RCNN [48] to extract visual object features, and constructed a common subspace for text and image modalities by adversarial learning. Similarly, Wu et al. [49] introduced a neural network that employs a pre-trained Mask RCNN network to recognize objects and utilizes a dense co-attention mechanism to learn the relationships between visual objects and textual entities. To exploit semantic relevance between different modalities, Zhang et al. [9] designed a multimodal graph that is constructed from words and visual objects. Besides, Wang et al. [20] applied optical character recognition (OCR) techniques to obtain object tags from images and mixed text and image modalities via transformer-based embeddings. The above methods require pre-trained object detectors. To reduce the dependence on them, Chen et al. [19] introduced a unified model for multimodal knowledge graph completion tasks, which adopts ViT [26] for image embedding. Similarly, our model also does not rely on pre-trained object detectors. Moreover, we align text and visual representations before integrating multimodal information, which is effective and distinguishes our model from previous approaches.

3 Dataset Construction

Although building a multilingual and multimodal NER dataset can greatly facilitate the development of the MMNER task, no research has yet undertaken this endeavor. Hence, we plan to build one. In this part, we first introduce our complete process of dataset annotation. Afterwards, we present the statistics of our MMNER dataset. Finally, we compare the MMNER dataset with other NER datasets.

3.1 Dataset Annotation

During the data annotation phase, we first select and process the data. Then, data commissioners are trained to improve the quality of annotations. Finally, these commissioners spend three months completing all the annotation work. Obviously, it is easy to find a standalone multilingual or multimodal dataset. Therefore, we can transform a multilingual dataset into a MMNER dataset, or translate a single-language multimodal dataset into a MMNER dataset. The former is more reliable as it reduces the burden of translation. After conducting thorough investigations and careful consideration, we have opted for a combination approach. Firstly, we choose a large multilingual dataset called mBART50, which is released by Hugging Face and contains 2.5 million pairs of images and texts in four languages (English, French, German, and Spanish), translated via mBART-50 [50]. In this dataset, each image is associated with its respective URL. By downloading the images using these URLs, we get a basic multilingual and multimodal dataset. Besides, considering the cost of NER annotation, we choose more English data. Concretely, the selected numbers of image-text pairs in English, French, German and Spanish are 30,000, 5,000, 5,000, and 5,000 respectively. To ensure the quality and efficiency of dataset annotation, we first utilize spaCy to recognize entities and remove image-text pairs with fewer than two entities. All entities, except person, location, and organization, are marked as MISC. Secondly, note that person entities in mBART50 are replaced with the token PERSON. Therefore, Twitter-2017 [24] is also added into our MMNER dataset to complement person entity annotations. Moreover, since Twitter-2017 is an English multimodal dataset, we have translated it into French, German, and Spanish with the help of Baidu Translate. Specifically, 2,982 tweets have been translated into these three languages.
Before annotating the MMNER dataset, all data commissioners are initially trained on a subset corpus multiple times in advance, aiming to ensure consistency and reliability in intra- and inter-rater annotation. The accuracy of dataset annotation among all annotators improved notably after the training process.
During the data annotation process, 13 annotators help us annotate the MMNER data manually with detailed annotation instructions and gold examples preprocessed by spaCy. Concretely, for each sentence, we ask two annotators to perform annotation and one inspector to check it. Following the evaluation method in previous work [25], we evaluate the reliability between annotators via Cohen's kappa coefficient [51]; its value is 0.96, which indicates that the data quality is relatively high. In the end, the processed mBART50 and Twitter-2017 data are combined and partitioned into training, validation, and test sets. Some examples of our MMNER dataset are listed in Table 1. In those examples, the first image, from Twitter-2017, depicts a distinct person entity which is the core part of the corresponding sentences. The other images come from mBART50, and each of them corresponds to some entities and many other words in those sentences, which makes the MMNER task on mBART50 extremely challenging.
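For reference, Cohen's kappa is a standard agreement statistic (not specific to our dataset), computed from the observed agreement $p_o$ between two annotators and the expected chance agreement $p_e$:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

A value of 0.96 therefore means the annotators agree almost perfectly even after discounting chance agreement.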

3.2 Dataset Statistics and Evaluation

The details of the text in our MMNER dataset are summarized in Table 2. The proportion of the training, development and test sets is 8:1:1. We have counted the number of entities in each of the four classes (person, location, organization and miscellaneous) for each language. Additionally, we have also conducted statistics on the number of samples in the training, validation, and test sets, as well as a summary for each entity type. The total number of entities for the four classes is 89,019.
Each sentence has a corresponding image, and we store all 33,965 images in a dedicated folder. Besides, a default image is added to replace a few broken images in Twitter-2017. The MMNER dataset is released for further studies via this link: https://github.com/wangdsh/MMNERD.
Table 1 Examples of MMNERD. Note that the first text-image pair comes from Twitter-2017, and the other text-image pairs come from mBART50. (The paired images are omitted here.)

Example 1:
English: Baseball world reacts to Albert Pujols [PER]' 600th career home run
French: La réaction du monde du baseball à la 600e home run de la carrière d'Albert Pujols [PER]
Spanish: La reacción del mundo del béisbol al jonrón número 600 de la carrera de Albert pryors [PER]
German: Baseball Welt reagiert auf Albert Pujols [PER]

Example 2:
English: An elephant observes zebras and other animals in Namibia [LOC]'s Etosha National Park [LOC]
French: Un éléphant observe des zèbres et d'autres animaux dans le parc national d'Etosha [LOC] en Namibie [LOC]
Spanish: Un elefante observa cebras y otros animales en el Parque nacional Etosha [LOC] de Namibia [LOC]
German: Ein Elefant beobachtet Zebras und andere Tiere im Etosha Nationalpark [LOC] Namibias [LOC]

Example 3:
English: Fireworks Over The Thames [LOC] For New Year [MISC]
French: Feux d'artifice sur la Tamise [LOC] pour le nouvel an [MISC]
Spanish: Fuegos artificiales en el Támesis [LOC] de año nuevo [MISC]
German: Feuerwerk über der Themse [LOC] für Neujahr [MISC]
After the data annotation, we randomly selected 300 sentences to test the quality of the dataset. Through manual review of each entity annotation, we found that 626 out of 631 entities were correctly annotated, indicating that the dataset's quality is well guaranteed.

3.3 Dataset Comparison

In Table 3, we conduct a comparison between our MMNER dataset and other representative NER datasets. On the one hand, multilingual NER datasets, including CoNLL2002 [21], CoNLL2003 [22], and WikiAnn [23], are summarized and compared with our dataset. CoNLL2002 and CoNLL2003 contain two languages each, while WikiAnn includes four languages. On the other hand, we compare our dataset with three widely used multimodal NER datasets: Twitter-2015 [18], Twitter-2017 [24], and CNERTA [25]. Twitter-2015 and Twitter-2017 are text-image corpora, while CNERTA is a text-speech corpus.
Compared with all other existing datasets, our MMNER dataset is a large-scale corpus with the characteristics of two modalities and four languages. According to our investigation, this is the first publicly available dataset specifically designed for the multilingual and multimodal NER task.

4 The Proposed Method

To address the NER challenge through the utilization of multimodal and multilingual information, we introduce a new MMNER task and design an innovative framework that validates the amalgamation of these different types of information. In this part, we first provide a summary of the proposed model and a task definition for our MMNER task. Then, we utilize a feature extraction module to obtain representations of both text and images, which are aligned using a multimodal alignment module. Additionally, the mechanism of multimodal interaction is described in detail, followed by a conditional random field (CRF) decoder to predict the named entities.
Table 2 Statistics of the MMNER dataset.

Entity Class | English (Train/Dev/Test)  | French (Train/Dev/Test)   | Spanish (Train/Dev/Test)  | German (Train/Dev/Test)   | Total
PER.         | 3,340 / 419 / 431         | 3,237 / 396 / 362         | 3,186 / 393 / 362         | 3,213 / 395 / 380         | 16,114
LOC.         | 4,605 / 541 / 504         | 2,984 / 376 / 310         | 2,023 / 254 / 267         | 2,560 / 350 / 345         | 15,119
ORG.         | 7,342 / 927 / 968         | 1,511 / 204 / 183         | 2,115 / 277 / 247         | 1,556 / 181 / 205         | 15,716
MISC.        | 23,621 / 2,920 / 3,013    | 3,331 / 438 / 390         | 3,184 / 396 / 430         | 3,392 / 530 / 425         | 42,070
Total        | 38,908 / 4,807 / 4,916    | 11,063 / 1,414 / 1,245    | 10,508 / 1,320 / 1,306    | 10,721 / 1,456 / 1,355    | 89,019
Sent Num     | 19,966 / 2,463 / 2,533    | 4,812 / 609 / 561         | 4,798 / 591 / 593         | 4,750 / 628 / 604         | 42,908
Table 3 Dataset comparison between our MMNER dataset and other NER datasets. AR, DE, DL, EN, ES, FR, HI, ZH are the abbreviations of Arabic, German, Dutch, English, Spanish, French, Hindi and Chinese.

Dataset           | Multilingual          | Multimodal         | Train  | Dev    | Test   | Total
CoNLL2002 [21]    | Yes (ES, DL)          | No (Text)          | 24,129 | 4,810  | 6,712  | 35,651
CoNLL2003 [22]    | Yes (EN, DE)          | No (Text)          | 27,692 | 6,534  | 6,844  | 41,070
WikiAnn [23]      | Yes (EN, AR, HI, ZH)  | No (Text)          | 65,000 | 31,000 | 31,000 | 127,000
Twitter-2015 [18] | No (EN)               | Yes (Text, Image)  | 4,000  | 1,000  | 3,257  | 8,257
Twitter-2017 [24] | No (EN)               | Yes (Text, Image)  | 3,373  | 723    | 723    | 4,819
CNERTA [25]       | No (ZH)               | Yes (Text, Speech) | 34,102 | 4,440  | 4,445  | 42,987
MMNERD (Ours)     | Yes (EN, FR, ES, DE)  | Yes (Text, Image)  | 34,326 | 4,291  | 4,291  | 42,908
Task Definition. In our multilingual and multimodal NER (MMNER) task, we use a NER dataset that includes multilingual text in four languages: English, French, Spanish, and German. Each sentence $S$ in the corpus has a corresponding image $I$, and each token in the sentence is associated with a predefined label $y \in Y$. MMNER aims to assign a predefined label $y$ to every token in a sentence, where $Y$ refers to the collection of all predefined labels. There are four categories of named entities (person, location, organization and miscellaneous) in this task, and we adopt the IOB2 tagging scheme for entity annotation. Note that if a sentence has multiple images, it can be converted into multiple sentence-image pairs.
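As an illustration of the IOB2 scheme, consider the English example from Fig. 1 (a minimal sketch; the tokenization shown here is simplified):

```python
# IOB2 tagging: "B-X" opens an entity of type X, "I-X" continues it,
# and "O" marks tokens outside any entity.
tokens = ["Baseball", "world", "reacts", "to", "Albert", "Pujols",
          "'s", "600th", "career", "home", "run"]
labels = ["O", "O", "O", "O", "B-PER", "I-PER",
          "O", "O", "O", "O", "O"]
```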

4.1 Overview of the Proposed Model

The general framework of our proposed model is presented in Fig. 2. To begin with, we employ a multilingual BERT [30] encoder to acquire contextualized representations for the input sentence. The corresponding image of the sentence is embedded using a ViT [26] encoder and a ResNet [52] encoder to capture patch features and convolution features, respectively. To align the text and visual embeddings in the same subspace, we incorporate a multimodal alignment module that operates on top of these encoders. Meanwhile, the text representation and image representation interact with each other through two multimodal collaboration modules, which are based on the attention mechanism. Finally, at the top of the model, we include a CRF layer to predict entities with specific predefined labels.
Fig. 2 The overall architecture of 2M-NER.
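To make the data flow concrete, below is a minimal structural sketch under our reading of Fig. 2. All module and argument names are illustrative rather than the authors' released code, and combining the two fused branches by summation is an assumption made only for this sketch:

```python
import torch.nn as nn

class TwoMNER(nn.Module):
    """Illustrative skeleton of 2M-NER's components (names are hypothetical)."""
    def __init__(self, text_enc, vit_enc, resnet_enc, collab_vit, collab_res, crf):
        super().__init__()
        self.text_enc = text_enc        # multilingual BERT encoder
        self.vit_enc = vit_enc          # ViT: patch features
        self.resnet_enc = resnet_enc    # ResNet: convolution features
        self.collab_vit = collab_vit    # cross-attention over ViT features
        self.collab_res = collab_res    # cross-attention over ResNet features
        self.crf = crf                  # CRF decoder over fused token states

    def forward(self, tokens, image):
        h = self.text_enc(tokens)                      # (B, n, d) token states
        v_patch = self.vit_enc(image)                  # (B, N, d) patch features
        v_conv = self.resnet_enc(image)                # (B, 49, d) conv features
        # During training, a contrastive alignment loss over text and image
        # representations is also applied (Section 4.3).
        fused = self.collab_vit(h, v_patch) + self.collab_res(h, v_conv)
        return self.crf(fused)                         # predicted label sequence
```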

4.2 Feature Extraction Module

To illustrate the process of feature acquisition with our input data, the details of the feature extraction module are specified in this subsection, including multilingual text representation and image representation.

4.2.1 Multilingual Text Representation

As depicted in Fig. 2, a Transformer [28] encoder is employed to learn text representations. Specifically, we use the first $L$ layers of a BERT multilingual base model, which has been pretrained on 104 languages using a masked language model objective. Each sentence adds the special tokens [CLS] and [SEP] to suit the model. $S = (s_0, s_1, \ldots, s_{n+1})$ is the processed sentence, in which $s_0$ represents [CLS] and $s_{n+1}$ represents [SEP]. Besides, as calculated in Eq. (1), position embeddings $E_{pos}$ are added to provide position information for all tokens.

$$H^0 = E_S + E_{pos} \tag{1}$$
where represents the token embedding matrix. To obtain contextualized representations, layers of multi-headed self-attention (MHSA) and multilayer perceptron (MLP) followed by layernorm (LN) in Transformer are adopted to calculate token embedding iteratively. The whole process is calculated by Eq. (2) and Eq. (3).
其中 表示标记嵌入矩阵。为了获得上下文化表示,Transformer 中采用了多头自注意(MHSA)和多层感知器(MLP)的 层,然后采用层规范(LN)迭代计算标记嵌入 。整个过程由公式 (2) 和公式 (3) 计算得出。

4.2.2 Image Representation

Based on extensive research, ViT is capable of capturing both patch features and salient features without reducing image resolution or sacrificing local information. ResNet, on the other hand, learns convolutional features through its convolutional structure and has the ability to capture the general structure of image data, making it widely applicable. To enhance the extraction of image features across various dimensions, our model utilizes a ViT encoder to capture patch features and a ResNet encoder to extract convolution features. First, similar to the text representation process described above, a Transformer encoder is used to obtain image representations. Concretely, the first $L_v$ layers of ViT are used to encode the images, allowing the model to find salient features of the image at multiple levels of abstraction. To adapt the standard Transformer, a reshaped 2D image $x \in \mathbb{R}^{H \times W \times C}$ is split into $N$ flattened 2D patches, where $C$ refers to the quantity of channels, $H$ and $W$ stand for the height and width of the image, $P \times P$ is the resolution of every patch, and $N = HW/P^2$. Then, based on Eq. (4), the flattened 2D patches are projected to $D$ dimensions. Note that absolute position embeddings $E_{pos}$ are included to enable our image encoder to obtain the position information of all patches. This enables the model to capture the spatial relationships between different patches in the image and ensures that the relative position of each patch is preserved during the encoding process.

$$V^0 = [x_p^1 E;\, x_p^2 E;\, \ldots;\, x_p^N E] + E_{pos}, \qquad E \in \mathbb{R}^{(P^2 C) \times D} \tag{4}$$
Finally, as calculated in Eq. (5) and Eq. (6), $L_v$ layers of multi-headed self-attention (MHSA) and multilayer perceptron (MLP) in the Transformer model are applied to the image embedding $V^0$. Besides, layer normalization (LN) and residual connections are also applied in every layer of the ViT model.

$$\tilde{V}^l = \mathrm{MHSA}\big(\mathrm{LN}(V^{l-1})\big) + V^{l-1}, \quad l = 1, \ldots, L_v \tag{5}$$

$$V^l = \mathrm{MLP}\big(\mathrm{LN}(\tilde{V}^l)\big) + \tilde{V}^l, \quad l = 1, \ldots, L_v \tag{6}$$
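A minimal sketch of the patch embedding in Eq. (4), assuming a 224x224 RGB input and 16x16 patches so that N = (224/16)^2 = 196; these sizes are illustrative assumptions, not values taken from our configuration:

```python
import torch
import torch.nn as nn

P, C, D = 16, 3, 768                                 # patch size, channels, dims
proj = nn.Linear(P * P * C, D)                       # patch projection E
pos = nn.Parameter(torch.zeros(1, 196, D))           # position embeddings E_pos

img = torch.randn(1, C, 224, 224)
patches = img.unfold(2, P, P).unfold(3, P, P)        # (1, C, 14, 14, P, P)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, P * P * C)
v0 = proj(patches) + pos                             # V^0: initial patch embeddings
```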
As indicated in the lower right corner of Fig. 2, we utilize a ResNet encoder with deep layers to extract convolution features. To employ a pre-trained 152-layer ResNet, each image is transformed to a fixed size of $224 \times 224$ pixels. The encoder splits each image into $7 \times 7$ small blocks and transforms each block into a 2048-dimensional vector. Formally, let $V_r$ represent the overall representation of an image. It is mapped to the text representation space by using the formula $V_c = W_c V_r + b_c$, where $W_c$ is the transition matrix and $b_c$ is the bias.
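A sketch of this convolution-feature branch, assuming torchvision's pretrained ResNet-152 as the backbone (the weight identifier and projection names are illustrative):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet152

# Dropping the final avgpool and fc layers leaves a 7x7 grid of 2048-d
# vectors for a 224x224 image, i.e. 49 blocks per image.
backbone = nn.Sequential(*list(resnet152(weights="IMAGENET1K_V1").children())[:-2])
proj = nn.Linear(2048, 768)                      # W_c * V_r + b_c in the text

img = torch.randn(1, 3, 224, 224)
feat = backbone(img)                             # (1, 2048, 7, 7)
blocks = feat.flatten(2).transpose(1, 2)         # (1, 49, 2048)
v_conv = proj(blocks)                            # (1, 49, 768), in text space
```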

4.3 Multimodal Alignment Module

Most previous studies learn text and image representations separately, which poses challenges for the interaction between the text and image encoders due to their distinct subspaces. To alleviate this problem, we align these two types of representations through contrastive learning (CL). Drawing inspiration from NT-Xent [53] and MoCo [54], our contrastive loss is transformed from visual representation learning to multimodal (text and image) representation learning. Specifically, the text and image representations in a batch are indicated as $T$ and $V$ respectively. To bring them into alignment, we initially resize them to the same dimensions using Eq. (7) and Eq. (8). Here, the hidden size we utilize is 768, which is the same as the text hidden size.

$$T' = T W_t + b_t \tag{7}$$

$$V' = V W_v + b_v \tag{8}$$
Different from contrastive learning that aligns two augmented views of the same image in visual models, the text representation $t_j$ and image representation $v_j$ are aligned by the contrastive loss in our model. Specifically, assume that there are $N$ text-image pairs in one batch; then we can get $2N$ data pairs. Each data pair contains a text embedding and an image embedding. Only the original text-image pair $(t_j, v_j)$ and the image-text pair $(v_j, t_j)$ are positive, while the other pairs are treated as negative examples. Hence, the contrastive loss can be formulated as follows:

$$\ell(t_j, v_j) = -\log \frac{\exp\big(\mathrm{sim}(t_j, v_j)/\tau\big)}{\sum_{k=1}^{N} \mathbb{1}_{[k \neq j]} \exp\big(\mathrm{sim}(t_j, v_k)/\tau\big)} \tag{9}$$

In Eq. (9), $\mathrm{sim}(\cdot, \cdot)$ represents the cosine similarity between the text vector representation and its image vector representation, $\tau$ is a temperature parameter, and $\mathbb{1}_{[k \neq j]}$ represents an indicator function whose value is 1 iff $k \neq j$, otherwise it is 0. The total contrastive loss for all text-image pairs is defined in Eq. (10).

$$\mathcal{L}_{cl} = \frac{1}{2N} \sum_{j=1}^{N} \big[\ell(t_j, v_j) + \ell(v_j, t_j)\big] \tag{10}$$
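A minimal PyTorch sketch of this loss, assuming t and v are the batch's projected 768-d text and image vectors; it uses the common InfoNCE-style cross-entropy formulation, which keeps the positive pair in the denominator (a slight simplification of Eq. (9)), and the temperature default is illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(t, v, tau=0.07):
    """Symmetric text-image contrastive loss over a batch of (N, 768) pairs.
    Matched (t_j, v_j) pairs are positives; all other in-batch pairs are
    negatives, as in Eq. (9)-(10)."""
    t = F.normalize(t, dim=-1)             # unit vectors -> dot product is
    v = F.normalize(v, dim=-1)             # cosine similarity sim(., .)
    logits = t @ v.T / tau                 # (N, N) similarity / temperature
    targets = torch.arange(t.size(0), device=t.device)
    # Average the text->image and image->text directions, cf. Eq. (10).
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```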

4.4 Multimodal Collaboration Module

To create a multimodal interaction between text and image, we employ a cross-modal attention layer to fuse multimodal information. As shown in Fig. 2, two multimodal collaboration modules are connected to the ViT encoder and the ResNet encoder respectively. Each module consists of a multi-headed self-attention layer and a multilayer perceptron. Besides, a residual connection and layer normalization are applied after each sub-layer. To be specific, we use the token embedding calculated by the text encoder to represent the query; it is referred to as $H^L$ in Section 4.2.1. Meanwhile, we use the image representation obtained from the image encoder as the key-value pairs; it is referred to as $V^{L_v}$ or $V_c$ in Section 4.2.2. In Fig. 2, "Q", "K", and "V" represent the query, key, and value respectively. Formally,

$$\mathrm{CA}_i(H^L, V) = \mathrm{softmax}\!\left(\frac{(H^L W_i^Q)(V W_i^K)^{\top}}{\sqrt{d/m}}\right)(V W_i^V) \tag{11}$$

$$\mathrm{MHCA}(H^L, V) = [\mathrm{CA}_1; \ldots; \mathrm{CA}_m]\, W^O \tag{12}$$

where $\mathrm{CA}_i$ denotes the $i$-th attention head between text and image, $W_i^Q$, $W_i^K$, and $W_i^V$ refer to the learnable parameters for the query, key, and value of each head, and $W^O$ in Eq. (12) stands for the weight matrix for multi-head attention.
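A minimal sketch of one collaboration block, assuming text states attend to image features; nn.MultiheadAttention bundles the per-head projections $W_i^Q$, $W_i^K$, $W_i^V$ and the output matrix $W^O$ of Eq. (11) and Eq. (12), and the dimensions are illustrative:

```python
import torch.nn as nn

class MultimodalCollaboration(nn.Module):
    """Cross-modal attention block: text queries over image keys/values."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, h, v):
        # h: (B, n, d) text states (queries); v: (B, m, d) image features (K, V).
        h = self.ln1(h + self.cross_attn(h, v, v)[0])  # text attends to image
        return self.ln2(h + self.mlp(h))               # position-wise MLP
```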
The multimodal representation