2M-NER：多语言和多模态 NER 的对比学习与语言和模态融合

2024_08_05_b84a6238d9e53d299976g

Dongsheng Wang , Xiaoqin Feng , Zeming Liu , Chuan Wang
王东升、冯晓琴、刘泽明、王川 The Department of Science and Technology Teaching, China University of Political
中国政治大学科技教学部Science and Law, Beijing, 102249,China.
中国北京，科学与法律，102249。 Mobvoi AI Lab, Beijing, 100044,China.
Mobvoi人工智能实验室，中国北京，100044。 School of Computer Science and Engineering, Beihang University, Beijing, 100191,China.
北京航空航天大学计算机科学与工程学院，中国北京，100191。 Institute of Information Engineering, CAS, Beijing, 100085,China.
中国科学院信息工程研究所，北京，100085。

*Corresponding author(s). E-mail(s): zmliu@buaa.edu.cn;
*通讯作者。电子邮件：zmliu@buaa.edu.cn；Contributing authors: wangdsh@cupl.edu.cn; xiaoqin.feng@mobvoi.com;
供稿作者：wangdsh@cupl.edu.cn; xiaoqin.feng@mobvoi.com；wangchuan@iie.ac.cn; wangchuan@iie.ac.cn；

Abstract 摘要

Named entity recognition (NER) is a fundamental task in natural language processing that involves identifying and classifying entities in sentences into pre-defined types. It plays a crucial role in various research fields, including entity linking, question answering, and online product recommendation. Recent studies have shown that incorporating multilingual and multimodal datasets can enhance the effectiveness of NER. This is due to language transfer learning and the presence of shared implicit features across different modalities. However, the lack of a dataset that combines multilingualism and multimodality has hindered research exploring the combination of these two aspects, as multimodality can help NER in multiple languages simultaneously. In this paper, we aim to address a more challenging task: multilingual and multimodal named entity recognition (MMNER), considering its potential value and influence. Specifically, we construct a large-scale MMNER dataset with four languages (English, French, German and Spanish) and two modalities (text and image). To tackle this challenging MMNER task on the dataset, we introduce a new model called 2M-NER, which aligns the text and image representations using contrastive learning and integrates a multimodal collaboration module to effectively depict the interactions between the two modalities. Extensive experimental results demonstrate that our model achieves the highest F1 score in multilingual and multimodal NER tasks compared to some comparative and representative baselines. Additionally, in a challenging analysis, we discovered that sentence-level alignment interferes a lot with NER models, indicating the higher level of difficulty in our dataset.
命名实体识别（NER）是自然语言处理中的一项基本任务，涉及将句子中的实体识别并分类为预定义的类型。它在实体链接、问题解答和在线产品推荐等多个研究领域发挥着至关重要的作用。最近的研究表明，结合多语言和多模态数据集可以提高 NER 的效率。这是由于语言迁移学习和不同模态之间存在共享的隐含特征。然而，由于缺乏将多语言和多模态结合起来的数据集，阻碍了对这两方面结合的研究，因为多模态可以同时帮助多种语言的 NER。在本文中，考虑到多语言和多模态命名实体识别（MMNER）的潜在价值和影响，我们旨在解决一项更具挑战性的任务：多语言和多模态命名实体识别。具体来说，我们构建了一个包含四种语言（英语、法语、德语和西班牙语）和两种模式（文本和图像）的大规模 MMNER 数据集。为了在该数据集上完成这项具有挑战性的 MMNER 任务，我们引入了一个名为 2M-NER 的新模型，该模型利用对比学习对齐文本和图像表征，并集成了一个多模态协作模块，以有效描述两种模态之间的交互。广泛的实验结果表明，与一些具有代表性的比较基线相比，我们的模型在多语言和多模态 NER 任务中取得了最高的 F1 分数。此外，在一项具有挑战性的分析中，我们发现句子级对齐对 NER 模型干扰很大，这表明我们的数据集难度更高。

Keywords: Multilingual NER, Multimodal NER, Contrastive learning, Multimodal interaction
关键词多语言 NER、多模态 NER、对比学习、多模态交互

1 Introduction 1 引言

Named entity recognition (NER) is one of the basic tasks in natural language processing which aims to locate and classify specific things into pre-defined types, such as diseases, products, monetary values, etc. It is an important contributor to different research fields, including question answering

, automatic text-video retrieval
命名实体识别（NER）是自然语言处理的基本任务之一，旨在将特定事物定位并分类为预定义的类型，如疾病、产品、货币价值等。它对不同的研究领域都有重要贡献，包括问题解答

、自动文本-视频检索

等。

[3] and machine translation

, as it enhances the understanding and interpretation of textual information in these tasks. Early studies recognize entities with long short-term memory (LSTM), attention-based technique, or convolution neural network (CNN)

. owever, recent research has demonstrated that Transformer-based structures surpass these earlier methods in NER tasks

. Despite these advancements, most of early studies solely utilized monolingual and unimodal text to recognize entities, which is insufficient [10-12]. For example, text-only methods ignore the helpfulness of corresponding images, leading to lower accuracy in entity recognition.
[3] 和机器翻译

，因为在这些任务中，它能增强对文本信息的理解和解释。早期的研究使用长短期记忆（LSTM）、基于注意力的技术或卷积神经网络（CNN）来识别实体

。然而，最近的研究表明，基于 Transformer 的结构在 NER 任务中超越了这些早期的方法

。尽管取得了这些进步，但大多数早期研究仅仅利用单语和单模态文本来识别实体，这是不够的 [10-12]。例如，纯文本方法忽略了相应图像的帮助，导致实体识别的准确率较低。

Some other researchers leverage multilingual and multimodal information to solve the aforementioned issues [13-15]. On the one hand, some studies on multilingual NER

have found that knowledge transfer from one language to another language is useful for zero-resource NER and cross-lingual NER. Hence, they have designed models to take advantage of multilingual features of multiple languages. On the other hand, some works on multimodal NER [18-20] have combined the textual modality with the visual modality so that both explicit and implicit visual information can be exploited to help improve the performance of their models. However, these studies have a limitation that they often focus on multilingual or multimodal NER, with many public datasets availiable for multilingual NER [21-23] and multimodal NER [18, 24, 25] respectively. However, in the real world, multilingual and multimodal coexist, making it valuable for models to utilize the multilingual features of the multilingual text and integrate the related visual information with the textual information. Thus, it is highly beneficial to develop a multilingual and multimodal NER dataset to facilitate future research in this area.
还有一些研究人员利用多语言和多模态信息来解决上述问题 [13-15]。一方面，一些关于多语言 NER

的研究发现，从一种语言到另一种语言的知识转移对于零资源 NER 和跨语言 NER 非常有用。因此，他们设计了一些模型来利用多种语言的多语言特点。另一方面，一些关于多模态 NER 的研究[18-20]将文本模态与视觉模态相结合，这样就可以利用显性和隐性视觉信息来帮助提高模型的性能。然而，这些研究有一个局限性，即它们通常侧重于多语言或多模态 NER，有许多公共数据集可分别用于多语言 NER [21-23] 和多模态 NER [18, 24, 25]。然而，在现实世界中，多语言和多模态并存，这使得利用多语言文本的多语言特点并将相关视觉信息与文本信息整合在一起的模型变得非常有价值。因此，开发多语种和多模态 NER 数据集对促进该领域的未来研究大有裨益。

To address the above issue and advance MMNER, we make the following attempts and efforts. We firstly build a large human-annotated Multilingual and Multimodal NER Dataset (MMNERD), which is constructed from a large multilingual dataset and an English multimodal dataset via transformation, translation and human annotation. Concretely, we mark all appearances of entities belonging to four categories (person, location, organization, and miscellaneous) in a widely used multilingual dataset called mBART50

and a famous multimodal dataset Twitter-2017 [24]. As a result, MMNERD has become the first public dataset which supports both multilingual and multimodal. Its total number of sentences across four languages (English, French, Spanish and German) is 42,908 and the in-depth statistical analysis of MMNERD is shown in Table 2. Besides, the detailed comparison between MMNERD and previous NER datasets in 3.3 highlights the multilingual and multimodal characteristics of our dataset.
为了解决上述问题，推进多语言和多模态 NER 的发展，我们做了以下尝试和努力。首先，我们建立了一个由人工标注的大型多语言和多模态 NER 数据集（MMNERD），该数据集由一个大型多语言数据集和一个英语多模态数据集通过转换、翻译和人工标注构建而成。具体来说，我们标注了被广泛使用的多语言数据集 mBART50

和著名的多模态数据集 Twitter-2017 [24] 中属于四个类别（人物、地点、组织和杂项）的实体的所有出现。因此，MMNERD 成为首个同时支持多语言和多模态的公开数据集。其四种语言（英语、法语、西班牙语和德语）的句子总数为 42 908 个，MMNERD 的深入统计分析如表 2 所示。此外，3.3 中对 MMNERD 和以前的 NER 数据集进行了详细比较，突出了我们数据集的多语言和多模态特点。

Based on MMNERD and its new MMNER task, we propose a novel Multilingual and Multimodal NER model named 2M-NER. Concretely, considering the different architectures of various image encoders, 2M-NER leverages ViT [26] and ResNet [27] to extract patch features and convolution features, respectively. Meanwhile, instead of directly blending textual and visual modalities in many previous works, contrastive learning is applied in

for modal alignment, which is implemented by the contrastive loss in the multimodal alignment module. For instance, in Fig. 1, the entity Albert Pujols in the English sentence aligns with the area of a baseball player within the image. In the representation space, the entity Albert Pujols and its corresponding area should have similar embeddings, so that they will have bigger similarity score. Besides, 2M-NER utilizes two cross-attention Transformer [28] layers to build modal interactions between text and images.
基于 MMNERD 及其新的 MMNER 任务，我们提出了一种名为 2M-NER 的新型多语言和多模态 NER 模型。具体来说，考虑到各种图像编码器的不同架构，2M-NER 利用 ViT [26] 和 ResNet [27] 分别提取斑块特征和卷积特征。同时，

中应用了对比学习来进行模态对齐，而不是像之前的许多作品那样直接将文本模态和视觉模态融合在一起，而是通过多模态对齐模块中的对比损失来实现。例如，在图 1 中，英语句子中的实体 Albert Pujols 与图像中的棒球运动员区域对齐。在表示空间中，实体 Albert Pujols 和其对应的区域应该有相似的嵌入，这样它们的相似度得分就会更高。此外，2M-NER 利用两个交叉注意变换器层 [28] 来建立文本和图像之间的模态交互。

Moreover, we conduct extensive experiments on MMNERD. A set of classical and competitive NER methods are selected for comparison. Specifically, we first utilize some representative textbased NER models such as BiLSTM-CRF [6], HBiLSTM-CRF [29], BERT [30], etc. After that, to illustrate the help of visual representation to entity recognition, we pick out several representative multimodal NER models, such as UMT [8], UMGF [9], and MKGFormer [19], on our dataset. In terms of linguistic aspect, all the models we use are tested on all four languages and also each of them. These experiments prove the effectiveness of 2M-NER in the new MMNER task.
此外，我们还对 MMNERD 进行了广泛的实验。我们选择了一系列经典和有竞争力的 NER 方法进行比较。具体来说，我们首先使用了一些有代表性的基于文本的 NER 模型，如 BiLSTM-CRF [6]、HBiLSTM-CRF [29]、BERT [30] 等。之后，为了说明视觉表示对实体识别的帮助，我们在数据集上挑选了几个具有代表性的多模态 NER 模型，如 UMT [8]、UMGF [9] 和 MKGFormer [19]。在语言方面，我们使用的所有模型都在所有四种语言以及每种语言上进行了测试。这些实验证明了 2M-NER 在新的 MMNER 任务中的有效性。

Our work can be primarily characterized by the following major contributions:
我们的工作主要体现在以下几个方面：

English: Baseball world reacts to Albert Pujols [PER] ' 600 th career home run French: La réaction du monde du baseball à la 600e home run de la carrière d ' Albert Pujols [PER] Spanish: La reacción del mundo del béisbol al jonrón número 600 de la carrera de Albert pryors [PER] German: Baseball Welt reagiert auf Albert Pujols [PER]
英语棒球世界对 Albert Pujols [PER] 职业生涯第 600 次本垒打的反应法文：La réaction du monde du baseball à la 600e home run de la carrière d ' Albert Pujols [PER] 西班牙语：西班牙文：La reacción del mundo del béisbol al jonrón número 600 de la carrera de Albert pryors [PER] 德文：Baseball Welt reagiert auf Albert Pujols [PER] （世界棒球联盟对阿尔伯特-普约尔斯的回应

Fig. 1 An instance of multilingual and multimodal NER. The types of all entities are enclosed in brackets.
图 1 多语言和多模态 NER 的一个实例。括号内为所有实体的类型。

(1) We construct MMNERD, a large-scale humanannotated MMNER dataset, which has four languages and two modalities. So far as we know, this is the first public dataset that supports both multilingual and multimodal NER.
(1) 我们构建了一个由人类注释的大规模 MMNER 数据集 MMNERD，该数据集包含四种语言和两种模态。据我们所知，这是第一个同时支持多语言和多模态 NER 的公开数据集。

(2) To facilitate research on this dataset, our paper introduces a novel model called 2M-NER for MMNER，which leverages contrastive learning to align the language and vision representations.
(2) 为了促进对该数据集的研究，我们的论文为 MMNER 引入了一个名为 2M-NER 的新模型，该模型利用对比学习来调整语言和视觉表征。

(3) Based on 2M-NER, comparative experiments illustrate the efficiency of our methodology, while more relevant experiments indicate that the assistance of multilingual and visual modal can enhance the performance of MMNER.
(3) 基于 2M-NER 的对比实验说明了我们的方法的效率，而更相关的实验表明，多语言和视觉模态的辅助可以提高 MMNER 的性能。

The following sections give the organization of this paper. In Section 2, we summarize the relevant literature on multilingual NER and multimodal NER. Section 3 describes the procedure of dataset construction in detail. A summary of our proposed model and the illustration of each component are presented in Section 4. The comparative experiments and ablation study conducted on MMNERD are presented in Section 5. To summarize our work, Section 6 provides a comprehensive overview and lists some future work.
下文将介绍本文的结构。在第 2 节中，我们总结了多语言 NER 和多模态 NER 的相关文献。第 3 节详细介绍了数据集的构建过程。第 4 节总结了我们提出的模型，并对每个组件进行了说明。第 5 节介绍了在 MMNERD 上进行的对比实验和消解研究。为了总结我们的工作，第 6 节提供了一个全面的概述，并列出了一些未来的工作。

In this section, we review two groups of relevant studies closely associated with MMNER: multilingual named entity recognition and multimodal named entity recognition.
在本节中，我们将回顾与 MMNER 密切相关的两组相关研究：多语言命名实体识别和多模态命名实体识别。

2.1 Multilingual Named Entity Recognition
2.1 多语言命名实体识别

Named entity recognition (NER) is a general task that provides foundational support for a range of natural language processing (NLP) tasks. As the research on multilingual models deepens, monolingual NER gradually migrates to multilingual NER, which provides a solid foundation for multilingual NLP tasks such as question answering [31, 32], information retrieval [33, 34], relation extraction

, etc. In previous studies, NER relied on statistical modeling of annotated data, which is extremely arduous and expensive. For example, Joel et al. [37] proposed a data annotations method by utilizing the textual content and structural framework of Wikipedia. This enormous, accessible, and multilingual data helps the model to surpass domain-specific models in terms of performance. Recently, many researchers have presented various methods to obtain the most advanced outcomes in NER. Specifically, Shervin et al. [38] presented MultiCoNER, a vast multilingual corpus for NER that encompasses three domains and eleven different languages, including sub-corpora featuring multilingual and codemixed content. They applied two NER models to showcase the difficulty and validity of the dataset. Emanuela et al. [17], through their analysis of the MultiCoNER dataset, discovered that incorporating supplementary contextual information from the training data can enhance the performance of NER on shorter text samples. Additionally, Shervin et al. [39] fused external knowledge into transformer models to achieve the best performance. Moreover, CoNLL2002 [21], CoNLL2003 [22], and WikiAnn [23] are also commonly used
命名实体识别（NER）是一项通用任务，为一系列自然语言处理（NLP）任务提供基础支持。随着多语言模型研究的不断深入，单语言 NER 逐渐向多语言 NER 迁移，为问题解答 [31,32]、信息检索 [33,34]、关系提取

等多语言 NLP 任务奠定了坚实的基础。在以往的研究中，NER 依赖于对注释数据进行统计建模，而这是极其艰巨和昂贵的。例如，Joel 等人[37] 利用维基百科的文本内容和结构框架提出了一种数据注释方法。这种庞大、可访问和多语言的数据有助于该模型在性能上超越特定领域模型。最近，许多研究人员提出了各种方法，以获得最先进的 NER 成果。具体来说，Shervin 等人[38] 提出了 MultiCoNER，这是一个用于 NER 的庞大的多语言语料库，包含三个领域和 11 种不同的语言，其中包括以多语言和混合编码内容为特色的子语料库。他们应用了两种 NER 模型来展示该数据集的难度和有效性。Emanuela 等人[17]通过对 MultiCoNER 数据集的分析发现，从训练数据中加入补充上下文信息可以提高较短文本样本的 NER 性能。此外，Shervin 等人[39] 将外部知识融合到转换器模型中，以达到最佳性能。此外，CoNLL2002 [21]、CoNLL2003 [22] 和 WikiAnn [23] 也是常用的 NER 方法。
multilingual NER datasets. In comparison to them, the dataset we constructed incorporates multimodal characteristics.
多语言 NER 数据集。与它们相比，我们构建的数据集具有多模态特征。

Owing to the advancements in pre-trained language models (PLM), multilingual NER has started to utilize these models as embeddings to improve its performance [40, 41]. Genta et al. [42] and Wu et al. [43] proposed the different meta-embedding methods to represent multilingual text. In the meantime, several researchers have employed multilingual models that support over one hundred languages as a foundation for transfer learning to other language models, which enables them to achieve superior performance compared to many neural methods and multilingual pre-trained models [13, 41]. In summary, many multilingual methods for NER have been proposed and validated in rare languages and industrial areas, but further in-depth research is still required. Different from these methods, our approach incorporates additional visual information to assist in multilingual NER.
由于预训练语言模型（PLM）的进步，多语言 NER 已开始利用这些模型作为嵌入来提高其性能 [40，41]。Genta 等人[42] 和 Wu 等人[43] 提出了不同的元嵌入方法来表示多语言文本。与此同时，一些研究人员采用了支持一百多种语言的多语言模型，作为向其他语言模型迁移学习的基础，这使他们能够取得比许多神经方法和多语言预训练模型更优越的性能[13, 41]。总之，目前已经提出了许多多语种 NER 方法，并在稀有语言和工业领域得到了验证，但仍需进一步深入研究。与这些方法不同的是，我们的方法结合了额外的视觉信息来辅助多语种 NER。

2.2 Multimodal Named Entity Recognition
2.2 多模态命名实体识别

Multimodal NER, similar to multilingual NER, has recently drawn the attention of academics due to the abundance of user-generated graphic and textual data on social media platforms like Twitter. Base on Twitter data, researchers have built two widely used datasets, Twitter2015 [18] and Twitter-2017 [24], which are textimage corpora. However, our dataset stands out in terms of its larger scale and multilingual nature. Through employing these multimodal NER datasets, numerous studies have leveraged the corresponding images to help recognize entities within the text. During the initial phases of studies, Zhang et al. [18], Lu et al. [24], and Moon et al. [44] adopted long short-term memory (LSTM) networks to obtain text representations and convolution neural network (CNN) to obtain features from images. The methods used for text and image feature extraction and fusion in these works are coarse-grained and simple. Recently, pre-trained models like BERT have been employed in multimodal NER to generate better text representations. Specifically, Yu et al. [8] employed a BERT encoder to obtain contextualized representations for input sentences. They also proposed a multimodal interaction module that generates word representations with knowledge of images, and visual representations with knowledge of words. Similarly, Zhao et al. [45] used the pre-trained model BERT for text encoding, and they proposed a relation-based graph convolutional architecture to examine the extrinsic matching relationships among pairs of texts and images. To establish a text-image relation, Sun et al. [46] built a propagation-based BERT model for multimodal NER by integrating soft or hard gates.
与多语言 NER 类似，多模态 NER 最近也引起了学术界的关注，原因是 Twitter 等社交媒体平台上有大量用户生成的图形和文本数据。在 Twitter 数据的基础上，研究人员建立了两个广泛使用的数据集：Twitter2015 [18] 和 Twitter-2017 [24]，它们都是文本图像语料库。不过，我们的数据集在规模更大和多语言性方面更胜一筹。通过使用这些多模态 NER 数据集，许多研究利用相应的图像来帮助识别文本中的实体。在最初的研究阶段，Zhang 等人[18]、Lu 等人[24]和 Moon 等人[44]采用长短期记忆（LSTM）网络获取文本表征，采用卷积神经网络（CNN）获取图像特征。在这些研究中，用于文本和图像特征提取与融合的方法都是粗粒度和简单的。最近，在多模态 NER 中采用了像 BERT 这样的预训练模型，以生成更好的文本表示。具体来说，Yu 等人[8] 使用 BERT 编码器来获取输入句子的上下文化表示。他们还提出了一种多模态交互模块，利用图像知识生成文字表征，利用文字知识生成视觉表征。同样，Zhao 等人[45] 使用预训练模型 BERT 进行文本编码，他们提出了一种基于关系的图卷积架构，用于检查文本和图像对之间的外在匹配关系。为了建立文本与图像的关系，Sun 等人[46] 通过整合软门或硬门，为多模态 NER 建立了基于传播的 BERT 模型。

In the meantime, several researchers have found that it was useful for multimodal NER to find fine-grained visual objects and filter out irrelevant visual objects with an object detection model. To this end, Zheng et al. [47] built an adversarial attention-based network with Mask RCNN [48] to extract visual object features, and constructed a common subspace for text and image modalities by adversarial learning. Similarly, Wu et al. [49] introduced a neural network that employs a pre-trained Mask RCNN network to recognize objects and utilizes a dense co-attention mechanism to learn the relationships between visual objects and textual entities. To exploit semantic relevance between different modalities, Zhang et al. [9] designed a multimodal graph that is constructed from words and visual objects. Besides, Wang et al. [20] applied optical character recognition (OCR) techniques to obtain object tags from images and mixed text and image modalities via transformer-based embeddings. The above methods require pre-trained object detectors. To reduce the dependence on them, Chen et al. [19] introduced a unified model for multimodal knowledge graph completion tasks, which adopts ViT [26] for image embedding. Similarly, our model also does not rely on pre-trained object detectors. Moreover, we align text and visual representations before integrating multimodal information, which is effective and distinguishes our model from previous approaches.
与此同时，一些研究人员发现，利用对象检测模型找到细粒度的视觉对象并过滤掉不相关的视觉对象对多模态 NER 非常有用。为此，Zheng 等人[47] 利用 Mask RCNN[48] 建立了一个基于对抗注意力的网络来提取视觉对象特征，并通过对抗学习为文本和图像模态构建了一个共同的子空间。同样，Wu 等人[49] 推出了一种神经网络，利用预先训练好的 Mask RCNN 网络来识别物体，并利用密集协同注意机制来学习视觉物体和文本实体之间的关系。为了利用不同模态之间的语义相关性，Zhang 等人[9]设计了一种由文字和视觉对象构建的多模态图。此外，Wang 等人[20]应用光学字符识别（OCR）技术从图像中获取对象标签，并通过基于变换器的嵌入来混合文本和图像模态。上述方法都需要预先训练对象检测器。为了减少对它们的依赖，Chen 等人[19] 为多模态知识图谱完成任务引入了一个统一模型，该模型采用 ViT [26] 进行图像嵌入。同样，我们的模型也不依赖预先训练好的对象检测器。此外，我们在整合多模态信息之前对齐了文本和视觉表征，这一点非常有效，使我们的模型有别于以往的方法。

3 Dataset Construction 3 数据集构建

Although building a multilingual and multimodal NER dataset can greatly facilitate the development of the MMNER task, no research has yet undertaken this endeavor. Hence, we plan to build one. In this part, we will first introduce our complete process of dataset annotation. Afterwards,
尽管建立一个多语言和多模态 NER 数据集能极大地促进 MMNER 任务的发展，但目前还没有任何研究开展这项工作。因此，我们计划建立一个数据集。在这一部分，我们将首先介绍数据集标注的完整过程。之后、
we will present the statistics of our MMNER dataset. Finally, we will compare the MMNER dataset with other NER datasets.
我们将介绍 MMNER 数据集的统计数据。最后，我们将把 MMNER 数据集与其他 NER 数据集进行比较。

3.1 Dataset Annotation 3.1 数据集注释

During the data annotation phase, we first select and process the data. Then, some data commissioners are trained to improve the quality of annotations. Finally, these commissioners spend three months completing all the annotation work. Obviously, it is easy to find a standalone multilingual or multimodal dataset. Therefore, we can transform a multilingual dataset to a MMNER dataset, or translate a single-language multimodal dataset into a MMNER dataset. The former is more reliable as it reduces the burden of translation. After conducting thorough investigations and careful consideration, we have opted for a combination approach. Firstly, we choose a large multilingual dataset called mBART50, which is released by Hugging Face

and contains 2.5 million pairs of images and texts in four languages (English, French, German, and Spanish), translated via mBART-50 [50]. In this dataset, each image is associated with its respective URL. Through downloading the images using these URLs, we get a basic multilingual and multimodal dataset. Besides, considering the cost of NER annotation, we choose more English data. Concretely, the selected number of image-text pairs in English, French, German and Spanish is 30000 , 5000, 5000, and 5000 respectively. To ensure the quality and efficiency of dataset annotation, we firstly utilize

to recognize entities and remove imagetext pairs whose entities number is less than two. All entities, except person, location, and organization, are marked as MISC. Secondly, note that person entities in mBART50 are replaced with the token

PERSON

. Therefore, Twitter-2017 [24] is also added into our MMNER dataset to complement person entity annotations. Moreover, since Twitter-2017 is an English multimodal dataset, we have translated it into French, German, and Spanish with the help of Baidu Translate

. Specifically, 2982 twitters have been translated into these three languages.
在数据标注阶段，我们首先选择和处理数据。然后，对一些数据专员进行培训，以提高注释质量。最后，这些专员用三个月的时间完成所有注释工作。显然，要找到一个独立的多语言或多模态数据集很容易。因此，我们可以将多语言数据集转换成 MMNER 数据集，或者将单语言多模态数据集翻译成 MMNER 数据集。前者更可靠，因为它减轻了翻译的负担。经过深入调查和慎重考虑，我们选择了一种组合方法。首先，我们选择了一个名为 mBART50 的大型多语言数据集，该数据集由 Hugging Face

发布，包含 250 万对图像和文本，使用四种语言（英语、法语、德语和西班牙语），通过 mBART-50 [50] 进行翻译。在该数据集中，每张图片都与各自的 URL 相关联。通过使用这些 URL 下载图像，我们就得到了一个基本的多语言和多模态数据集。此外，考虑到 NER 注释的成本，我们选择了更多的英文数据。具体来说，我们选择的英语、法语、德语和西班牙语图像-文本对数量分别为 30000、5000、5000 和 5000。为了保证数据集注释的质量和效率，我们首先利用

来识别实体，并删除实体数量少于两个的图像文本对。除人物、地点和组织外，所有实体都标记为 MISC。其次，请注意 mBART50 中的人物实体是用标记

PERSON

替换的。因此，Twitter-2017 [24] 也被添加到我们的 MMNER 数据集中，以补充人物实体注释。此外，由于 Twitter-2017 是一个英语多模态数据集，我们在百度翻译

的帮助下将其翻译成了法语、德语和西班牙语。具体来说，我们将 2982 条推特翻译成了这三种语言。

Before annotating the MMNER dataset, all data commissioners are initially trained on a subset corpus multiple times in advance, aiming to ensure consistency and reliability in intra- and inter-rater annotation. The accuracy of dataset annotation among all annotators improved from

after the training process.
在对 MMNER 数据集进行标注之前，所有数据专员都事先在一个子集语料库上进行了多次初步训练，旨在确保标注者内部和标注者之间标注的一致性和可靠性。经过培训后，所有标注者的数据集标注准确率从

提高到

。

During the data annotation process, 13 annotators help us to annotate the MMNER data manually with a detailed annotation instruction and the gold examples preprocessed by spaCy. Concretely, for each sentence, we ask for two annotators to perform annotation and one inspector for checking. Following the evaluation method in previous work [25], we evaluate the reliability between annotators via Cohen's kappa coefficient [51] and its value is 0.96 , which indicates the data quality is relatively high. In the end, the processed mBART50 and Twitter-2017 data are combined and partitioned into training, validation, and test sets. Some examples of our MMNER dataset are listed in Table 1. In those examples, the first image from Twitter-2017 depicts a distinct person entity which is the core part of the corresponding sentences. The other images come from mBART50, and each of them corresponds to some entities and many other words in those sentences, which makes the MMNER task on mBART50 is extremely challenging.
在数据标注过程中，13 名标注员通过详细的标注说明和 spaCy 预处理的金范例帮助我们对 MMNER 数据进行人工标注。具体来说，对于每个句子，我们要求两名注释员进行注释，一名检查员进行检查。按照前人的评估方法[25]，我们通过科恩卡帕系数[51]来评估注释者之间的可靠性，其值为 0.96，表明数据质量相对较高。最后，我们将处理过的 mBART50 和 Twitter-2017 数据合并并划分为训练集、验证集和测试集。表 1 列出了 MMNER 数据集的一些示例。在这些示例中，Twitter-2017 中的第一张图片描绘了一个独特的人物实体，该实体是相应句子的核心部分。其他图片来自 mBART50，每张图片都与这些句子中的一些实体和许多其他单词相对应，这使得 mBART50 上的 MMNER 任务极具挑战性。

3.2 Dataset Statistics and Evaluation
3.2 数据集统计与评估

The details of the text in our MMNER dataset are summarized in Table 2. The proportion of training, development and test sets is

. We have counted the number of 4 classes of entities (person, location, organization and miscellaneous) for each language. Additionally, we have also conducted statistics on the number of samples in the training, validation, and test sets, as well as a summary for each entity type. The total number of entities for the four classes is 89,019 .
表 2 总结了 MMNER 数据集中文本的详细信息。训练集、开发集和测试集的比例为

。我们统计了每种语言中 4 类实体（人物、地点、组织和杂项）的数量。此外，我们还对训练集、验证集和测试集中的样本数量进行了统计，并对每种实体类型进行了汇总。四个类别的实体总数为 89 019 个。

Each sentence has a corresponding image, and we store all 33,965 images in a dedicated folder. Besides, a default image

is added to replace a few broken images in Twitter-2017. The MMNER
每个句子都有相应的图片，我们将所有 33965 张图片存储在一个专用文件夹中。此外，我们还添加了默认图片

，以替换 Twitter-2017 中的一些破损图片。MMNER

Table 1 Examples of MMNERD. Note that the first text-image pair comes from Twitter-2017, and the other text-image pairs come from mBART50
表 1 MMNERD 示例。请注意，第一个文本-图像对来自 Twitter-2017，其他文本-图像对来自 mBART50

Text 文本

Image 图片

English: Baseball world reacts to Albert Pujols [PER] ' 600 th career home run
英语棒球世界对阿尔伯特-普霍尔斯[PER]职业生涯第 600 个本垒打的反应

French: La réaction du monde du baseball à la 600e home run de la carrière d' Albert Pujols

PER

法文：棒球界对阿尔伯特-普霍尔斯职业生涯第 600 个本垒打的反应

PER

Spanish: La reacción del mundo del béisbol al jonrón número 600 de la carrera de Albert pryors [PER]
西班牙文：棒球界对阿尔伯特-佩里职业生涯第 600 个本垒打的反应 [PER]。

German: Baseball Welt reagiert auf Albert Pujols [PER]
德文：《棒球世界》关注阿尔伯特-普霍尔斯 [PER]

English: An elephant observes zebras and other animals in Namibia [LOC] 's Etosha National Park [LOC]
英语一头大象在纳米比亚[当地语言]埃托沙国家公园观察斑马和其他动物[当地语言］

French: Un éléphant observe des zèbres et d'autres animaux dans le parc national d'Etosha [LOC] en Namibie [LOC]
法语：一头大象在纳米比亚埃托沙国家公园[当地语言]观察斑马和其他动物[当地语言]。

Spanish: Un elefante observa cebras y otros animales en el Parque nacional Etosha [LOC] de Namibia [LOC]
英语：一头大象在纳米比亚埃托沙国家公园观察斑马和其他动物 [LOC] [LOC].

German: Ein Elefant beobachtet Zebras und andere Tiere im Etosha
德语：Ein Elefant beobachtet Zebras und andere Tiere im Etosha（一只大象在埃托沙观察斑马和其他动物

Nationalpark [LOC] Namibias [LOC]
国家公园 [LOC] 纳米比亚 [LOC］

English: Fireworks Over The Thames [LOC] For New Year [MISC]
英语泰晤士河上的焰火 [LOC] 新年焰火 [MISC]

French: Feux d'artifice sur la Tamise [LOC] pour le nouvel an [MISC]
法语：新一年[MISC][LOC]塔米塞上的假象

Spanish: Fuegos artificiales en el Támesis [LOC] de año nuevo [MISC]
英语：泰晤士河上的新年烟花 [LOC] [MISC].

German: Feuerwerk über der Themse [LOC] für Neujahr [MISC]
德国：泰晤士河上的新年焰火[本地][杂项］

dataset is released for further studies via this link: https://github.com/wangdsh/MMNERD.
数据集通过此链接发布，供进一步研究：https://github.com/wangdsh/MMNERD。

After the data annotation, we randomly selected 300 sentences to test the quality of the dataset. By manual review of each entity annotation, we find that 626 out of 631 entities are correctly annotated, indicating that the dataset's quality is well guaranteed.
数据注释完成后，我们随机抽取了 300 个句子来测试数据集的质量。通过人工审核每个实体注释，我们发现 631 个实体中有 626 个注释正确，这表明数据集的质量得到了很好的保证。

3.3 Dataset Comparison 3.3 数据集比较

In Table 3, we conduct a comparison bwtween our MMNER dataset and other representative NER datasets. On the one hand, multilingual NER datasets, including CoNLL2002 [21], CoNLL2003 [22], and WikiAnn [23], are statisticed and compared with our dataset. CoNLL2002 and CoNLL2003 contain two languages each, while WikiAnn includes four languages. On the other hand, we compare our dataset with three widely used multimodal NER datasets: Twitter-2015 [18], Twitter-2017 [24], and CNERTA [25]. Twitter2015 and Twitter-2017 are text-image corpora, while CNERTA is a text-speech corpus.
在表 3 中，我们对 MMNER 数据集和其他具有代表性的 NER 数据集进行了比较。一方面，我们对包括 CoNLL2002 [21]、CoNLL2003 [22] 和 WikiAnn [23] 在内的多语言 NER 数据集进行了统计，并与我们的数据集进行了比较。CoNLL2002 和 CoNLL2003 各包含两种语言，而 WikiAnn 包含四种语言。另一方面，我们将我们的数据集与三个广泛使用的多模态 NER 数据集进行了比较：Twitter-2015 [18]、Twitter-2017 [24] 和 CNERTA [25]。Twitter-2015 和 Twitter-2017 是文本-图像语料库，而 CNERTA 是文本-语音语料库。

Compared with all other existing datasets, we discover that our MMNER dataset is a large-scale corpus with the characters of two modalities and four languages. According to our investigation, this is the first publicly available dataset specifically designed for the multilingual and multimodal NER task.
与所有其他现有数据集相比，我们发现 MMNER 数据集是一个包含两种模态和四种语言字符的大型语料库。根据我们的调查，这是第一个专门为多语言和多模态 NER 任务设计的公开可用数据集。

4 The Proposed Method 4 建议的方法

To address the NER challenge through the utilization of multimodal and multilingual information, we introduce a new MMNER task and design an innovative framework that validates the amalgamation of these different types of information. As for this part, we firstly provide a summary of
为了通过利用多模态和多语言信息来应对 NER 挑战，我们引入了一项新的 MMNER 任务，并设计了一个创新框架来验证这些不同类型信息的融合。关于这一部分，我们首先概述了

Table 2 Statistics of the MMNER dataset.
表 2 MMNER 数据集的统计数据。

English Data 英语数据				French Data 法国数据				Spanish Data 西班牙数据			German Data 德国数据
Entity Train 实体列车	Dev 开发	Test 测试	Train 火车	Dev 开发	Test 测试	Train 火车	Dev 开发	Test 测试	Train 火车	Dev 开发	Test 测试	Total 总计
Class 班级
PER.	3,340	419	431	3,237	396	362	3,186	393	362	3,213	395	380	16,114
LOC.	4,605	541	504	2,984	376	310	2,023	254	267	2,560	350	345	15,119
ORG.	7,342	927	968	1,511	204	183	2,115	277	247	1,556	181	205	15,716
MISC.	23,621	2,920	3,013	3,331	438	390	3,184	396	430	3,392	530	425	42,070
Total 总计	38,908	4,807	4,916	11,063	1,414	1,245	10,508	1,320	1,306	10,721	1,456	1,355	89,019
Sent 发送	19,966	2,463	2,533	4,812	609	561	4,798	591	593	4,750	628	604	42,908
Num 编号

Table 3 Dataset comparison bwtween our MMNER dataset and other NER datasets. AR, DE, DL, HI, EN, ES, ZH are the abbreviations of Arabic, German, Dutch, Hindi, English, Spanish and Chinese.
表 3 我们的 MMNER 数据集与其他 NER 数据集的比较。AR、DE、DL、HI、EN、ES、ZH 是阿拉伯语、德语、荷兰语、印地语、英语、西班牙语和汉语的缩写。

Dataset 数据集	Multilingual 多种语言	Multimodal 多式联运	Train 火车	Dev 开发	Test 测试	Total 总计
CoNLL2002 [21]	(ES, DL) （ES、DL）	(Text) （文本）	24,129	4,810	6,712	35,651
CoNLL2003 [22]	(EN, DE)	(Text) （文本）	27,692	6,534	6,844	41,070
WikiAnn [23] 维基百科[23］	(EN, AR, HI, ZH)	(Text) （文本）	65,000	31,000	31,000	127,000
Twitter-2015 [18]	(EN)	(Text, Image) （文本、图像）	4,000	1,000	3,257	8,257
Twitter-2017 [24] 推特-2017 [24]	(EN)	(Text, Image) （文本、图像）	3,373	723	723	4,819
CNERTA [25]	(ZH) （ZH）	(Text, Speech) （文本、语音）	34,102	4,440	4,445	42,987
MMNERD (Ours) MMNERD（我们的）	(EN, FR, ES, DE)	(Text, Image) （文本、图像）	34,326	4,291	4,291	42,908

the proposed model and a task definition for our MMNER task. Then, we utilize a feature extraction module to obtain representations of both text and images, which are aligned using multimodal alignment module. Additionally, the mechanism of multimodal interaction is described in detail, followed by a conditional random field (CRF) decoder to predict the named entities.
在此基础上，我们介绍了所提出的模型以及 MMNER 任务的任务定义。然后，我们利用特征提取模块获得文本和图像的表征，并利用多模态对齐模块对其进行对齐。此外，我们还详细介绍了多模态交互的机制，然后使用条件随机场（CRF）解码器预测命名实体。

Task Definition. In our multilingual and multimodal NER (MMNER) task, we use a NER dataset that includes multilingual text in four languages: English, French, Spanish, and German. Each sentence

in the corpus has a corresponding image

, and each token in the sentence

is associated with a predefined label

. MMNER aims to assign a predefined label

to every token in a sentence, and

refers to the collection of all predefined labels. There are four categories of named entities (person, location, organization and miscellaneous) in this task and we adopt IOB2 tagging scheme for entity annotation. Note that if a sentence has multiple images, it can be converted into multiple sentence-image pairs.
任务定义。在多语言和多模态 NER（MMNER）任务中，我们使用的 NER 数据集包含四种语言的多语言文本：英语、法语、西班牙语和德语。语料库中的每个句子

都有相应的图像

，句子

中的每个标记都与预定义标签

相关联。MMNER 的目的是为句子中的每个标记分配一个预定义标签

，而

指的是所有预定义标签的集合。本任务中有四类命名实体（人物、地点、组织和杂项），我们采用 IOB2 标签方案进行实体标注。需要注意的是，如果一个句子有多个图像，则可以将其转换为多个句子-图像对。

4.1 Overview of the Proposed Model
4.1 拟议模型概述

The general framework of our proposed model is presented in Fig. 2. To begin with, we employ a multilingual BERT [30] encoder to acquire contextualized representation for the input sentence. The corresponding image of the sentence is embedded using a ViT [26] encoder and a ResNet [52] encoder to capture patch features and convolution features, respectively. To align the text and visual embeddings in the same subspace, we incorporate a multimodal alignment module that operates on top of these encoders. Meanwhile, the text representation and image representation interact with each other through two multimodal collaboration modules, which are based on the attention mechanism. Finally, at the top of the model, we include a CRF layer to predict entities with specific predefined labels.
图 2 展示了我们提出的模型的总体框架。首先，我们使用多语言 BERT [30] 编码器获取输入句子的上下文表示。使用 ViT [26] 编码器和 ResNet [52] 编码器嵌入句子的相应图像，分别捕捉补丁特征和卷积特征。为了在同一子空间中对齐文本和视觉嵌入，我们在这些编码器之上加入了一个多模态对齐模块。同时，文本表示和图像表示通过两个基于注意力机制的多模态协作模块进行交互。最后，在模型的顶层，我们加入了一个 CRF 层，用于预测带有特定预定义标签的实体。

Fig. 2 The overall architecture of 2M-NER.
图 2 2M-NER 的整体结构。

4.2 Feature Extraction Module
4.2 特征提取模块

To illustrate the process of feature acquisition with our input data, the details of the feature extraction module are specified in this subsection, including multilingual text representation and image representation.
为了说明利用输入数据获取特征的过程，本小节将详细介绍特征提取模块，包括多语言文本表示和图像表示。

4.2.1 Multilingual Text Representation
4.2.1 多语言文本表示法

As depicted in Fig. 2, a Transformer [28] encoder is employed to learn text representations. Specifically, we use the first

layers of a BERT multilingual base model

, which has been pretrained on 104 languages using a masked language model objective. Each sentence

adds special tokens [CLS] and [SEP] to suit the model.

is the processed sentence in which

represents [CLS] and

represents [SEP]. Besides, as calculated in Eq. (1) position embeddings

are added to provide position information for all tokens.
如图 2 所示，我们使用 Transformer [28] 编码器来学习文本表示。具体来说，我们使用了 BERT 多语言基础模型

的第一层

，该模型已使用屏蔽语言模型目标对 104 种语言进行了预训练。每个句子

都添加了特殊标记词 [CLS] 和 [SEP]，以适应该模型。

是经过处理的句子，其中

表示 [CLS]，

表示 [SEP]。此外，根据公式（1）的计算，还添加了位置嵌入

，以提供所有标记的位置信息。

\footnotetext{ \脚注文本{

https://huggingface.co/bert-base-multilingual-cased

where

represents the token embedding matrix. To obtain contextualized representations,

layers of multi-headed self-attention (MHSA) and multilayer perceptron (MLP) followed by layernorm (LN) in Transformer are adopted to calculate token embedding

iteratively. The whole process is calculated by Eq. (2) and Eq. (3).
其中

表示标记嵌入矩阵。为了获得上下文化表示，Transformer 中采用了多头自注意（MHSA）和多层感知器（MLP）的

层，然后采用层规范（LN）迭代计算标记嵌入

。整个过程由公式 (2) 和公式 (3) 计算得出。

4.2.2 Image Representation
4.2.2 图像表示法

Based on extensive research, ViT is capable of capturing both patch features and salient features without reducing image resolution or sacrificing local information. ResNet, on the other hand, learns convolutional features through its convolutional structure and has the ability to capture the general structure of image data, making it widely applicable. To enhance the extraction of image features across various dimensions, our model utilizes a ViT encoder to capture patch features and a ResNet encoder to extract convolution features. First, similar to the text representation process
在广泛研究的基础上，ViT 能够在不降低图像分辨率或牺牲局部信息的情况下捕捉斑块特征和突出特征。ResNet 则通过其卷积结构学习卷积特征，能够捕捉图像数据的一般结构，因此具有广泛的适用性。为了加强对不同维度图像特征的提取，我们的模型利用 ViT 编码器捕捉补丁特征，利用 ResNet 编码器提取卷积特征。首先，与文本表示过程类似
described above, a Transformer encoder is used to obtain image representations. Concretely, the first

layers of ViT are used to encode the images, allowing the model to find out some salient features of the image at multiple levels of abstraction. For adapting the standard Transformer, a reshaped 2D image

is split into

flattened 2D patches, where

refers to the quantity of channels,

and

stand for the height and width of the image,

is the resolution of every patch. Then, based on Eq. (4), the flattened 2D patches are projected to

dimensions. Note that absolute position embeddings

are included to enable our image encoder to obtain position information of all patches. This enables the model to capture the spatial relationships between different patches in the image and ensure that the relative position of each patch is preserved during the encoding process.
如上所述，变换器编码器用于获取图像表征。具体来说，ViT 的前

层用于对图像进行编码，使模型能够在多个抽象层次上找出图像的一些突出特征。为了适应标准变换器，重塑的二维图像

被分割成

个扁平化的二维补丁，其中

指通道数量，

和

分别代表图像的高度和宽度，

是每个补丁的分辨率。然后，根据公式 (4)，将扁平化的 2D 补丁投影到

维度。请注意，为了使我们的图像编码器能够获取所有斑块的位置信息，我们还加入了绝对位置嵌入

。这使模型能够捕捉图像中不同斑块之间的空间关系，并确保在编码过程中保留每个斑块的相对位置。

Finally, as calculated in Eq. (5) and Eq. (6),

layers of multi-headed self-attention (MHSA) and multilayer perceptron (MLP) in the Transformer model are applied to the image embedding

. Besides, layernorm (LN) and residual connections are also applied in every layer of the ViT model.
最后，根据公式 (5) 和公式 (6) 的计算，Transformer 模型中的

层多头自注意（MHSA）和多层感知器（MLP）被应用于图像嵌入

。此外，在 ViT 模型的每一层中还应用了 layernorm（LN）和残差连接。

As indicated in the lower right corner of Fig. 2, we utilize a ResNet encoder with deep layers to extract convolution features. To employ a pre-trained 152 layers ResNet, each image is transformed to a fixed size with

pixels. The encoder splits each image into multiple

small blocks and transforms each block into a 2048 dimensional vector. Formally, let

represent the overall representation of an image. It is mapped to the text representation space by using the formula

, where

is the transition matrix and

is the bias.
如图 2 右下角所示，我们利用具有深度层的 ResNet 编码器来提取卷积特征。为了使用预先训练好的 152 层 ResNet，我们将每幅图像转换为固定大小的

像素。编码器将每幅图像分割成多个

小块，并将每个块转换成 2048 维向量。形式上，让

代表图像的整体表示。使用公式

将其映射到文本表示空间，其中

是转换矩阵，

是偏置。

4.3 Multimodal Alignment Module
4.3 多模式对齐模块

Most previous studies learn text and image representations separately, which poses challenges for the interaction between the text and image encoders due to their distinct subspaces. To alleviate this problem, we align these two types of representations through contrastive learning (CL). Drawing inspiration from NT-Xent [53] and MoCo [54], our contrastive loss is transformed from visual representation learning to multimodal (text and image) representation learning. Specifically, the text and image representations in a batch are indicated as

and

respectively. To bring them into alignment, we initially resize them to the same dimensions using Eq. (7) and Eq. (8). Here, the hidden size we utilize is 768 , which is the same as the text hidden size.
以往的大多数研究都是将文本和图像表征分开学习，这给文本编码器和图像编码器之间的交互带来了挑战，因为它们的子空间截然不同。为了缓解这一问题，我们通过对比学习（CL）将这两类表征统一起来。受 NT-Xent [53] 和 MoCo [54] 的启发，我们的对比损失从视觉表征学习转变为多模态（文本和图像）表征学习。具体来说，一个批次中的文本和图像表征分别表示为

和

。为了使它们保持一致，我们首先使用公式 (7) 和公式 (8) 将它们调整到相同的尺寸。这里，我们使用的隐藏尺寸是 768，与文本的隐藏尺寸相同。

Different from contrastive learning that aligns two augmented views of the same image in visual models, the text representation

and image representation

are aligned by contrastive loss in our model. Specifically, assume that there are

text-image pairs in one batch, and we can get

data pairs. Each data pair contains a text embedding and an image embedding. Only the original text-image pair and the image-text pair are positive, while the other

pairs are treated as negative examples. Hence, the contrastive loss can be formulated as follows:
与视觉模型中对同一图像的两个增强视图进行对齐的对比学习不同，在我们的模型中，文本表示

和图像表示

是通过对比损失进行对齐的。具体来说，假设一批中有

文本图像对，我们可以得到

数据对。每个数据对包含一个文本嵌入和一个图像嵌入。只有原始文本-图像对和图像-文本对是正例，而其他

对被视为负例。因此，对比度损失可表述如下：

In Eq. (9),

represents the cosine similarity between the text vector representation and its image vector representation,

is a temperature parameter, and

represents an indicator function whose value is 1 iff

, otherwise it is 0 . The total contrastive loss for all text-image pairs is defined in Eq. (10).
在公式 (9) 中，

表示文本矢量表示与其图像矢量表示之间的余弦相似度，

是一个温度参数，

表示一个指标函数，其值在

时为 1，否则为 0。所有文本-图像对的总对比度损失在公式 (10) 中定义。

4.4 Multimodal Collaboration Module
4.4 多模式协作模块

To create a multimodal interaction between text and image, we employ a crossmodal attention
为了在文字和图像之间创建多模态交互，我们采用了跨模态注意力
layer to fuse multimodal information. As shown in Fig. 2, two multimodal collaboration modules are connected to the ViT encoder and the ResNet encoder respectively. Each module consists of a multi-headed self-attention layer and a multilayer perceptron. Besides, a residual connection and a layer normalization are used subsequent to each sub-layer. To be specific, we use the token embedding

calculated by the text encoder to represent the query. It is referred to as

in Section 4.2.1. Meanwhile, we use the image representation

obtained from the image encoder as the key-value pairs. It is referred to as

in Section 4.2.2. In Fig. 2, "Q", "K", and "V" represent the query, key, and value respectively. Formally,
层来融合多模态信息。如图 2 所示，两个多模态协作模块分别与 ViT 编码器和 ResNet 编码器相连。每个模块由多头自注意层和多层感知器组成。此外，在每个子层之后还使用了残差连接和层归一化。具体来说，我们使用文本编码器计算出的标记嵌入

来表示查询。它在第 4.2.1 节中被称为

。同时，我们使用从图像编码器获得的图像表示

作为键值对。在第 4.2.2 节中，它被称为

或

。在图 2 中，"Q"、"K "和 "V "分别代表查询、键和值。形式上

where

denotes the

-th attention head between text and image,

refer to all of the learnable parameters for each query, key, and value, and

in Eq. (12) stands for the weight matrix for multi-head attention. The multimodal representation

is calculated by the multi-headed self-attention layer. Afterwards, we calculate the final multimodal representation

using the following formulas:
其中

表示文本和图像之间的

注意头，

指每个查询、键和值的所有可学习参数，公式 (12) 中的

代表多头注意的权重矩阵。多模态表示

由多头自我注意层计算得出。之后，我们利用以下公式计算出最终的多模态表征

：

4.5 CRF Decoder 4.5 CRF 解码器

Once we have fused the multimodal information, we obtain two multimodal representations from the ViT encoder and the ResNet encoder, respectively. These two final multimodal representations are concatenated together, and then fed into a standard CRF decoder to perform the MMNER task. The decoder predicts labels for each sentence

and its corresponding image

. The probability of the label result

is calculated using the following formulas:
融合多模态信息后，我们将分别从 ViT 编码器和 ResNet 编码器中获得两个多模态表征。这两个最终的多模态表征被连接在一起，然后输入标准 CRF 解码器以执行 MMNER 任务。解码器会预测每个句子

及其对应图像

的标签。标签结果

的概率使用以下公式计算：

where

represents the emission value of tag

for the

-th multimodal representation of

represents the transition value from tag

to tag

. The conditional probability of the tag result

is derived by a normalization with the sum of all emission and transition scores over all possible label sequences, as defined in Eq. (16). At the training stage, a negative log-likelihood loss is used to maximize the probability of the true label sequence, and the likelihood is formulated in Eq.

其中，

表示标签

对于

的

-th 多模态表示的发射值，

表示从标签

到标签

的过渡值。标签结果

的条件概率是通过对所有可能标签序列的所有发射和转换分数之和进行归一化得出的，如公式 (16) 所定义。在训练阶段，使用负对数似然损失来最大化真实标签序列的概率，似然值用公式

表示

where

is the true label sequence of a sentence, and the conditional probability is calculated by Eq. (16)
其中，

是句子的真实标签序列，条件概率由公式 (16) 计算得出

4.6 Model Training 4.6 模型训练

According to the overall architecture depicted in Fig. 2, our MMNER model has one multimodal alignment module and a CRF decoder which generate losses. Overall, we aim to train

via a weighted summation of two contrastive losses and one negative log-likelihood loss as follows:
根据图 2 所示的整体架构，我们的 MMNER 模型包含一个多模态配准模块和一个 CRF 解码器，它们会产生损失。总的来说，我们的目标是通过两个对比度损失和一个负对数似然损失的加权求和来训练

，具体如下：

where

and

are the contrastive losses calculated by Eq. (10), corresponding to ViT and ResNet respectively. The hyperparameter

is used to control the loss ratio.
其中，

和

是根据公式 (10) 计算出的对比损失，分别对应于 ViT 和 ResNet。超参数

用于控制损失率。

5 Experiments 5 项实验

In our experiments, we select some comparative and representative baselines and compare our model with them on MMNERD. Based on various experimental settings, we have done numerous multilingual and multimodal experiments. The following research questions (RQs) guide our experiments. We firstly introduce the implementation details and the compared baselines, then our experiments are conducted through these RQs.
在实验中，我们选择了一些具有代表性的比较基准，并将我们的模型与它们在 MMNERD 上进行比较。基于各种实验设置，我们进行了大量多语言和多模态实验。以下研究问题（RQs）为我们的实验提供了指导。我们首先介绍了实现细节和比较基线，然后通过这些研究问题进行了实验。

RQ1: How does 2M-NER model compared with existing NER models?
问题 1： 2M-NER 模型与现有的 NER 模型相比如何？
RQ2: What is the impact of multilingualism?
问题 2：使用多种语言有什么影响？
RQ3: What is the impact of multimodality?
问题 3：多模态有何影响？
RQ4: How challenging is the data from different sources?
问题 4：不同来源的数据有多大挑战性？
RQ5: How consistent is the annotation of MMNERD?
问题 5：MMNERD 注释的一致性如何？

5.1 Implementation Details
5.1 实施细节

To efficiently verify our model

, we conduct all the experiments on an Ubuntu server with PyTorch 1.12.1 and a NVIDIA GeForce RTX 3090 GPU. For the comparison methods, we adopt the original or reproduced public code, and the hyperparameters are consistent with the original papers. To adapt to multilingual scenarios, these comparison methods utilize

or multilingual BERT for encoding the multilingual text.
为了有效验证我们的模型

，我们在装有 PyTorch 1.12.1 和 NVIDIA GeForce RTX 3090 GPU 的 Ubuntu 服务器上进行了所有实验。对于比较方法，我们采用了原始或转载的公开代码，超参数与原始论文一致。为了适应多语言场景，这些比较方法使用

或多语言 BERT 对多语言文本进行编码。

In our 2M-NER framework, we use a BERT multilingual base model with 12 transformer blocks for multilingual text representation. Besides, a pre-trained 12-layer CLIP [55] and a pre-trained 152-layer

are used as image feature extractors. Both the text encoder and the image feature extractors are fine-tuned during the training phase. Specifically, some important hyperparameters of

are presented in Table 4 . The loss weight

in Eq. (18) and the learning rate are set to 0.8 and

respectively, which obatin the best results on our validation dataset through a grid search across various combinations of

and

. Like the dropout in the BERT model, the parameters of all dropout layers in

are set to 0.1 . To improve the model's performance, we adopt a linear decay scheme that decays the learning rate from a fixed value to zero by the end of the training epochs.
在我们的 2M-NER 框架中，我们使用带有 12 个转换器块的 BERT 多语种基础模型来表示多语种文本。此外，我们还使用预先训练好的 12 层 CLIP [55] 和 152 层

作为图像特征提取器。文本编码器和图像特征提取器都在训练阶段进行了微调。具体来说，表 4 列出了

的一些重要超参数。公式 (18) 中的损失权重

和学习率分别设置为 0.8 和

，通过对

、

和

的不同组合进行网格搜索，在验证数据集上获得了最佳结果。与 BERT 模型中的滤除一样，

中所有滤除层的参数均设置为 0.1。为了提高模型的性能，我们采用了线性衰减方案，即在训练历时结束时，学习率从固定值衰减为零。

5.2 Comparison Models 5.2 比较模式

To demonstrate the better performance of our approach, we first select some representative textbased NER methods for comparison (the first four models). Text-based NER models only consider text modal information, which lacks modal diversity. Therefore, several existing multimodal NER models incorporate visual information to overcome the above barrier. Here, we also pick out several classical and competitive multimodal models to illustrate the superiority of

(the last four models).
为了证明我们的方法具有更好的性能，我们首先选择了一些具有代表性的基于文本的 NER 方法（前四个模型）进行比较。基于文本的 NER 模型只考虑文本模态信息，缺乏模态多样性。因此，现有的一些多模态 NER 模型结合了视觉信息，以克服上述障碍。在此，我们还挑选了几个经典的、有竞争力的多模态模型来说明

的优越性（后四个模型）。

BiLSTM-CRF. Huang et al. [6] provided an architecture of bidirectional LSTM combined with a CRF module for the sequence tagging task. This model automatically extracts past and future features, which gets rid of heavy dependence on hand-crafted features.
BiLSTM-CRF。Huang 等人[6]为序列标记任务提供了一种双向 LSTM 与 CRF 模块相结合的架构。该模型能自动提取过去和未来的特征，从而摆脱了对手工特征的严重依赖。

HBiLSTM-CRF. Numerous studies show that character features can provide more morphological information. Lample et al. [29] improved the BiLSTM-CRF model by combining word embeddings and character-level word representations, which are obtained by an LSTM layer.
HBiLSTM-CRF。大量研究表明，字符特征可以提供更多形态学信息。Lample 等人[29]将单词嵌入和字符级单词表示（由 LSTM 层获得）结合起来，改进了 BiLSTM-CRF 模型。

BERT. By using a deep bidirectional Transformer [28], Devlin et al. [30] designed a novel language model called BERT, which learns contextual vector representations for each word. When it is used for the NER task, BERT adds a softmax layer to predict labels based on its contextual embeddings.
BERT。Devlin 等人[30]通过使用深度双向变换器[28]，设计了一种名为 BERT 的新型语言模型，该模型可学习每个单词的上下文向量表示。当用于 NER 任务时，BERT 会添加一个 softmax 层，根据上下文嵌入来预测标签。

BERT-CRF. The BERT-CRF model is a variant of BERT, which stacks a multi-layer bidirectional Transformer at the bottom and a CRF decoder at the top.
BERT-CRFBERT-CRF 模型是 BERT 的一种变体，它在底部堆叠了一个多层双向变压器，在顶部堆叠了一个 CRF 解码器。

AdaCAN. To leverage textual and visual information, Zhang et al. [18] presented an adaptive co-attention network to obtain token-aware image representations and image-aware token representations through token-based visual attention and image-based textual attention, respectively. This work is based on CNN-BiLSTM-CRF [7], which utilizes a CNN layer to learn character-level word representations.
AdaCAN。为了充分利用文本和视觉信息，Zhang 等人[18] 提出了一种自适应协同注意网络，通过基于标记的视觉注意和基于图像的文本注意，分别获得标记感知图像表征和图像感知标记表征。这项工作基于 CNN-BiLSTM-CRF[7]，利用 CNN 层学习字符级单词表征。

UMT To tackle the issue of word representation's sensitivity to visual context, Yu et al. [8] designed a multimodal interaction module to generate word-related visual representations and image-related word representations. They also utilized an entity span detection task to help predict entities with a unified multimodal transformer.
UMT 为了解决单词表征对视觉上下文的敏感性问题，Yu 等人[8]设计了一个多模态交互模块，用于生成与单词相关的视觉表征和与图像相关的单词表征。他们还利用实体跨度检测任务，通过统一的多模态转换器帮助预测实体。

UMGF Zhang et al. [9] introduced a multimodal graph to build different types of semantic associations between words and visual objects. Then, they designed a multimodal fusion module to learn node representations, and finally predicted entity annotation using the attention-based multimodal representations with a CRF layer.
UMGF Zhang 等人[9]引入了一种多模态图，用于在词语和视觉对象之间建立不同类型的语义关联。然后，他们设计了一个多模态融合模块来学习节点表征，最后利用基于注意力的多模态表征和 CRF 层来预测实体注释。

MKGFormer To build a unified model architecture for various entity and relation extraction tasks like multimodal entity extraction and link prediction, Chen et al. [19] utilized a unified input-output in a hybrid transformer architecture.
MKGFormer 为了为多模态实体提取和链接预测等各种实体和关系提取任务建立统一的模型架构，Chen 等人[19]在混合变换器架构中使用了统一的输入输出。

Table 4 Some important hyperparameters of our model
表 4 我们模型的一些重要超参数

Hyperparam. 超级参数	Value 价值	Hyperparam. 超级参数	Value 价值
	0.8	Learning rate 学习率
Batch size 批量大小	16	Dropout 辍学	0.1
Learning scheduler 学习调度程序	linear 线形	Warmup steps 热身步骤	0
Text hidden size 文本隐藏尺寸	768	Image hidden size 图像隐藏尺寸	768
ViT encoder ViT 编码器	clip-vit-base-patch32	Layers of multilingual BERT 多语种 BERT 的层级	12
Resolution of each image in ViT 每幅图像的 ViT 分辨率	224	Resolution of each image in ResNet ResNet 中每幅图像的分辨率	224
Optimizer 优化器	Adam 亚当

The architecture contains coarse-grained prefixguided interaction to mitigate modal heterogeneity and fine-grained correlation-aware integration to reduce error sensitivity.
该架构包含粗粒度的前导交互和细粒度的相关感知集成，前者可减轻模态异质性，后者可降低误差敏感性。

5.3 Metrics 5.3 衡量标准

The most commonly used evaluation metrics for NER tasks are precision, recall, and F1. Precision measures the accuracy of positive prediction made by a model. Recall represents the ratio of correctly predicted positive instances to the total number of actual positive instances. The F1 score is a combined metric that balances both precision and recall. Formulas 19, 20, and 21 provide the calculation methods for precision, recall, and F1 score respectively.

represents the number of entities correctly predicted as positive,

denotes the number of entities incorrectly predicted as positive, and

represents the number of entities incorrectly predicted as negative.
NER 任务最常用的评估指标是精确度、召回率和 F1。精确度衡量的是模型所做正向预测的准确性。召回率表示正确预测的正向实例与实际正向实例总数的比率。F1 分数是平衡精度和召回率的综合指标。公式 19、20 和 21 分别提供了精确度、召回率和 F1 分数的计算方法。

表示被正确预测为阳性的实体数量，

表示被错误预测为阳性的实体数量，

表示被错误预测为阴性的实体数量。

5.4 Results 5.4 结果

Based on the research questions RQ1: to RQ5:, we conducted numerous experiments, and the following are the experimental results.
根据研究问题 RQ1：至 RQ5：，我们进行了大量实验，以下是实验结果。

5.4.1 RQ1: Overall Comparison with Existing Models
5.4.1 问题 1：与现有模型的总体比较

We compare 2M-NER model with all the compared baselines in the overall dataset and each individual language of it. Table 5 represents the F1 values for each individual type, as well as the precision, recall and F1 values across five different datasets. The maximum value in each column of the result tables is indicated in bold. By analyzing these large numbers of experiments, we obtain the following findings:
我们将 2M-NER 模型与所有比较过的基线模型在整个数据集和其中的每种语言中进行了比较。表 5 列出了每种语言的 F1 值，以及五个不同数据集的精确度、召回率和 F1 值。结果表中每列的最大值均以粗体标出。通过分析这些大量实验，我们得出了以下结论：

First, comparing all the text-based NER methods, the table shows that BERT-based models have better performance than LSTM-based models across all datasets. This indicates that the fine-tuning mechanism of the pre-trained model has a huge advantage in the NER task. Moreover, the combination of BERT and CRF can improve the F1 score except the German data on which drops a little, indicating that CRF can establish good constraints for tag sequence.
首先，比较所有基于文本的 NER 方法，表格显示基于 BERT 的模型在所有数据集上的性能都优于基于 LSTM 的模型。这表明预训练模型的微调机制在 NER 任务中具有巨大优势。此外，BERT 和 CRF 的组合可以提高 F1 分数，只有德国数据的 F1 分数略有下降，这表明 CRF 可以为标签序列建立良好的约束。

Second, when comparing four classical and competitive multimodal approaches with the above text-based NER models, we can observe that AdaCAN has the lowest overall F1 score and UMT has the highest overall F1 score across all datasets. Therefore, multimodal approaches are not always better than unimodal methods, and how to properly embed images within the model is a significant challenge for researchers.
其次，在将四种经典的、具有竞争力的多模态方法与上述基于文本的 NER 模型进行比较时，我们可以发现，在所有数据集中，AdaCAN 的总体 F1 分数最低，而 UMT 的总体 F1 分数最高。因此，多模态方法并不总是优于单模态方法，如何在模型中正确嵌入图像是研究人员面临的一大挑战。

Third, our proposed architecture (

) obtains the best results in comparison with all the text-based NER models and all other multimodal NER approaches. We judge that the convolution features provided by our ResNet module and the patch features from the ViT encoder provides a rich vein of helpful information, which is aligned with corresponding text by contrastive loss. Besides, our multimodal collaboration module fuses this helpful information via multi-head attention to boost the performance.
第三，与所有基于文本的 NER 模型和所有其他多模态 NER 方法相比，我们提出的架构（

）获得了最佳结果。我们认为，我们的 ResNet 模块提供的卷积特征和 ViT 编码器提供的补丁特征提供了丰富的有用信息，这些信息通过对比损失与相应的文本对齐。此外，我们的多模态协作模块通过多头关注融合了这些有用信息，从而提高了性能。

Table 5 Experiments in all languages and four single languages on MMNERD
表 5 所有语言和四种单一语言在 MMNERD 上的实验

Language 语言	Methods 方法	Modality 模式	Single Type (F1)(%) 单一类型（F1）(%)				Overall(%) 总体(%)
Language 语言	Methods 方法	Modality 模式	PER.	LOC.	ORG.	MISC.	Precision 精确度	Recall 回顾	F1
All 全部	BiLSTM-CRF [	Text 文本	91.12	55.79	55.52	56.32	65.07	59.57	62.20
	HBiLSTM-CRF [29]	Text 文本	91.76	57.64	58.98	58.57	67.24	61.71	64.36
	BERT [30]	Text 文本	94.48	63.36	63.96	62.17	66.54	69.98	68.22
	BERT-CRF	Text 文本	94.87	64.25	66.08	63.24	69.04	69.48	69.26
	AdaCAN [18]	Text + 文本 +	89.24	52.13	53.86	53.62	63.50	56.30	59.69
	UMT [8]	Text + Image 文字 + 图片	95.50	63.82	65.97	65.27	69.27	71.06	70.15
	UMGF [9]	Text + Image 文字 + 图片	93.49	62.11	63.61	62.20	67.52	68.12	67.82
	MKGFormer	Text + Image 文字 + 图片	94.58	60.03	61.15	59.42	64.96	66.47	65.71
	2M-NER (Ours) 2M-NER （我们的）	Text + 文本 +	96.34	65.00	67.14	64.75	69.64	71.35	70.49
English 英语	BiLSTM-CRF [6]	Text 文本	86.23	50.15	56.63	64.18	64.73	60.76	62.68
	HBiLSTM-CRF [29]	Text 文本	88.86	47.25	54.12	65.94	64.78	3.87	64.32
	BERT [30]	Text 文本	92.73	54.70	63.94	68.78	66.15	70.46	68.24
	BERT-CRF	Text 文本	92.94	55.74	64.36	70.46	68.77	70.53	69.64
	AdaCAN [18]	Text + Image 文字 + 图片	85.21	46.04	50.44	62.12	61.87	58.76	60.28
	UMT [8]	Text + 文本 +	93.53	55.79	62.30	72.60	69.46	71.59	70.51
	UMGF [9]	Text + 文本 +	88.69	53.90	60.87	68.16	65.64	. 79	67.18
	MKGFormer [19]	Text + 文本 +	88.45	53.10	55.92	67.00	63.79	66.53	65.13
	2M-NER (Ours) 2M-NER （我们的）	Text + 文本 +	96.90	58.33	64.45	71.40	70.05	71.51	70.77
French 法语	S	T	88.39	65.13	54.67	25.57	71.55		60.98
	HBiLSTM-CRF [29]	Text 文本	89.32	65.64	54.48	34.10	68.92	57.59	62.74
	BERT [30]	Text 文本	90.96	67.74	62.79	35.08	59.59	66.44	62.83
	BERT-CRF	Text 文本	92.24	62.33	59.67	37.34	63.10	62.75	62.92
	AdaCAN [18]	Text + 文本 +	86.82	58.58	50.53	21.00	66.77	49.01	56.53
	UMT	Text 文本	94.07	67.56	61.71	37.34	62.08		64.13
	UMGF [9]	Text + 文本 +	91.38	65.58	55.53	29.40	58.80	62.40	60.54
	MKGFormer	Text + 文本 +	89.32	61.52	49.86	27.45	56.73	58.29	57.50
	2M-NER (Ours) 2M-NER （我们的）	Text + Image 文字 + 图片	95.34	67.04	63.39	36.10	62.24	67.43	64.73
Spanish 西班牙语	(	Te	83.73	62.28	69.60	46.15	72.29		65.30
	HBiLSTM-CRF [29]	Text 文本	84.60	66.54	68.10	50.20	71.64	63.05	67.07
	BERT [30]	Text 文本	89.93	70.61	68.64	52.67	68.83	71.35	70.07
	BERT-CRF	Text 文本	90.01	72.03	67.98	54.92	71.75	70.38	71.06
	AdaCAN [18	Text + 文本 +	85.44	56.97	66.67	47.50	70.51	9.69	64.65
	UMT [8]	Text + 文本 +	90.88	71.24	73.79	55.33	70.15	72.82	71.46
	UMGF [9]	Text + 文本 +	88.40	68.85	70.18	50.30	66.26	71.09	68.59
	MKGFormer [19]	Text + 文本 +	88.52	60.47	69.73	47.52	64.39	66.87	65.61
	2M-NER (Ours) 2M-NER （我们的）	Text + Image 文字 + 图片	93.08	73.40	70.70	55.12	71.33	73.13	72.22
German 德国	BiLSTM-CRF [6]	Text 文本	83.27	52.85	45.97	32.09	64.20	48.58	55.31
	HBiLSTM-CRF [29]	Text 文本	85.04	57.94	45.16	37.97	63.39	53.97	58.30
	BERT [30]	Text 文本	89.83	63.76	53.63	39.81	60.84	63.79	62.28
	BERT-CRF	Text 文本	88.69	61.89	55.85	39.32	63.55	60.31	61.88
	AdaCAN [18]	Text 文本	82.37	49.13	47.09	32.57	57.34	50.33	53.61
	UMT [8]	Text + Image 文字 + 图片	89.47	64.02	52.76	44.99	62.37	64.09	63.22
	UMGF [9]	Text + Image 文字 + 图片	87.42	58.09	51.95	33.49	55.78	60.49	58.04
	MKGFormer [19]	Text + Image 文字 + 图片	87.86	54.13	45.99	35.01	54.83	57.29	56.04
	2M-NER (Ours) 2M-NER （我们的）	Text + Image 文字 + 图片	93.47	63.85	56.87	41.61	62.27	65.40	63.80

5.4.2 RQ2: The Effect of Multilingualism
5.4.2 问题 2：使用多种语言的影响

To study the effect of multilingualism, we add another language on the basis of a single language. For instance, English, French, and Spanish are combined with German respectively, and we use these datasets to recognize entities in any of the assembled languages. Table 6 shows all the possible combinations and experimental results of our method on MMNERD. From the table, we can find that: (1) In the four languages we use, English and German belong to the Germanic language family, while French and Spanish belong to the Romance languages. Moreover, adding English data on top of German data to test German, the F1 score increases by

. The combination of French training data and Spanish training data can improve the F1 score by

on the French test set and

on the Spanish test set, respectively. Hence, although the F1 score calculated on our English test set does not improve much after adding German data to English data due to the relatively small scale of German data, we can still conclude that combining languages from the same language family can promote the MMNER task. (2) The combination of languages from different language families makes the MMNER task more challenging. For instance, adding the German training set to the Spanish training set does not improve the model's ability to predict Spanish test set. This is because the syntax and grammar of different language families will bring different degrees of confusion and interference to the model. (3) For German, French, and Spanish, increasing the training dataset of the same language can greatly improve the F1 scores of entity

and entity MISC. We consider that these two entity types are more sensitive to data size, so their corresponding F1 scores will increase rapidly when the amount of data increases.
为了研究多语言的效果，我们在单一语言的基础上增加了另一种语言。例如，英语、法语和西班牙语分别与德语相结合，我们使用这些数据集来识别任何一种组合语言的实体。表 6 显示了我们的方法在 MMNERD 上的所有可能组合和实验结果。从表中我们可以发现(1）在我们使用的四种语言中，英语和德语属于日耳曼语系，而法语和西班牙语属于罗曼语系。此外，在德语数据的基础上增加英语数据来测试德语，F1 分数会增加

。结合法语训练数据和西班牙语训练数据，在法语测试集上的 F1 分数可分别提高

，在西班牙语测试集上的 F1 分数可分别提高

。因此，虽然由于德语数据的规模相对较小，在英语测试集上添加德语数据后计算出的 F1 分数并没有提高多少，但我们仍然可以得出结论：将同一语系的语言组合在一起可以促进 MMNER 任务的完成。(2）不同语系语言的组合增加了 MMNER 任务的挑战性。例如，将德语训练集加入西班牙语训练集并不能提高模型预测西班牙语测试集的能力。这是因为不同语系的句法和语法会给模型带来不同程度的混淆和干扰。(3) 对于德语、法语和西班牙语，增加同一语言的训练数据集可以大大提高实体

和实体 MISC 的 F1 分数。我们认为，这两种实体类型对数据量更为敏感，因此当数据量增加时，其相应的 F1 分数也会迅速增加。

5.4.3 RQ3: The Effect of Multimodality
5.4.3 问题 3：多模态的影响

For all the multimodal models including our model, we remove the image encoders from each model architecture to verify whether adding images can improve model performance. From Table 7, we find that: (1) All the models except AdaCAN and UMGF have either major or minor improvement after adding images over single or multiple languages, which shows that the image features are indeed helpful to recognize entities, and our model achieves better F1 score across languages by adding images. (2) On the Spanish and German datasets, our model performs

and

lower in terms of the best F1 score compared to the baseline models, respectively. Across all language datasets, our model has a slight

lower best F1 score compared to the baseline models. However, after incorporating multimodal features, our model consistently outperforms the baseline models in terms of F1 score, indicating that our model excels in the extraction and fusion of multimodal features. (3) For our model

, incorporating multimodal features has improved the overall F1 score by

, and

for English, French, Spanish, and German, respectively. This indicates that the fusion of multimodal features is beneficial for entity recognition in each language, especially for French and German.
对于包括我们的模型在内的所有多模态模型，我们从每个模型架构中移除了图像编码器，以验证添加图像是否能提高模型性能。从表 7 中，我们发现(1) 除了 AdaCAN 和 UMGF 之外，所有模型在添加图像后在单语言或多语言上都有或大或小的改进，这说明图像特征确实有助于识别实体，而我们的模型通过添加图像在不同语言上取得了更好的 F1 分数。(2) 在西班牙语和德语数据集上，与基线模型相比，我们的模型在最佳 F1 分数方面的表现分别为

和

。在所有语言数据集上，与基线模型相比，我们的模型的最佳 F1 分数略低

。然而，在结合多模态特征后，我们的模型在 F1 分数方面始终优于基线模型，这表明我们的模型在提取和融合多模态特征方面表现出色。(3) 对于我们的模型

，在英语、法语、西班牙语和德语中，融合多模态特征后的整体 F1 分数分别提高了

、

和

。这表明多模态特征的融合有利于每种语言的实体识别，尤其是法语和德语。

5.4.4 RQ4: Challenging Analysis of Data from Different Sources
5.4.4 问题 4：对不同来源数据的分析具有挑战性

In Section 3, we construct our dataset MMNERD from mBART50 and Twitter-2017, which are two different datasets in some aspects. First, person entities are replaced with

token in mBART50 due to privacy, so all the person entities in MMNERD are collected from Twitter-2017. Second, the data scale of mBART50 is larger than that of Twitter-2017, and most of the entities (LOC, ORG and MISC) come from mBART50. To evaluate the difficulty of different data sources, we analyse the F1 scores of four different entities from Table 5, Table 6, and Table 7. The common feature of these tables is that the F1 score of

is much higher than other entities. For example, the F1 score of

is almost

higher compared to the F1 value of

in Table 5. The reason is that different datasets bring challenges of varying degrees of difficulty. On the one hand, as shown in Table 1, an image in Twitter-2017 usually corresponds to a distinct person entity, so it is easy for this image to align with entities in sentences. On the other hand, an image in mBART50 usually corresponds to a whole sentence which has entities and numerous other words. This sentencelevel alignment interferes a lot with the NER model, hence the MMNER task on mBART50 is extremely more challenging.
在第 3 节中，我们从 mBART50 和 Twitter-2017 这两个在某些方面不同的数据集中构建了我们的数据集 MMNERD。首先，出于保护隐私的考虑，mBART50 中的人物实体被替换为

标记，因此 MMNERD 中的人物实体全部来自 Twitter-2017。其次，mBART50 的数据规模大于 Twitter-2017，大部分实体（LOC、ORG 和 MISC）来自 mBART50。为了评估不同数据源的难度，我们分析了表 5、表 6 和表 7 中四个不同实体的 F1 分数。这些表格的共同特点是

的 F1 分数远高于其他实体。例如，与表 5 中

的 F1 值相比，

的 F1 值几乎

更高。原因在于不同的数据集带来了不同难度的挑战。一方面，如表 1 所示，Twitter-2017 中的图像通常对应一个独立的人物实体，因此该图像很容易与句子中的实体对齐。另一方面，mBART50 中的图像通常对应一整个句子，其中包含实体和许多其他词语。这种句子级对齐对 NER 模型干扰很大，因此 mBART50 上的 MMNER 任务极具挑战性。

Table 6 Experimental results of multilingual effects
表 6 多语言效果的实验结果

		Single Type (F1)(%) 单一类型（F1）(%)				Overall(%) 总体(%)
Language of training 培训语言	Language of testing 测试语言	PER.	LOC.	ORG.	MISC.	Precision 精确度	Recall 回顾	F1
English 英语	English 英语	96.90	58.33	64.45	71.40	70.05	71.51	70.77
English & German 英语和德语	English 英语	95.16	55.61	64.26	71.93	70.38	71.18
English & French 英语和法语	English 英语	94.34	57.38	65.07	70.69	69.41	70.99	70.19
English & Spanish 英语和西班牙语	English 英语	93.50	57.50	63.98	71.61	69.16	71.82	70.47
German 德国	German 德国	93.47	63.85	56.87	41.61	62.27	65.40	63.80
German & English 德语和英语	German 德国	92.87	61.95	62.76	46.09	63.91	66.42
German & French 德语和法语	German 德国	90.63	63.19	59.13	44.98	62.25	65.70	63.93
German & Spanish 德语和西班牙语	German 德国	91.81	64.99	53.77	43.01	62.03	64.97	63.46
French 法语	French 法语	95.34	67.04	63.39	36.10	62.24	67.43	64.73
French & English 法语和英语	French 法语	93.04	65.53	61.28	36.96	61.54	66.08	63.73
French & German 法语和德语	French 法语	94.13	69.96	61.87	36.23	62.37	67.67	64.91
French & Spanish 法语和西班牙语	French 法语	96.02	68.18	66.11	39.32	64.97	68.07
Spanish 西班牙语	Spanish 西班牙语	93.08	73.40	70.70	55.12	71.33	73.13	72.22
Spanish & English 西班牙语和英语	Spanish 西班牙语	92.69	70.92	73.09	53.79	70.64	72.37	71.49
Spanish & German 西班牙语和德语	Spanish 西班牙语	92.54	69.84	72.48	55.95	70.09	73.51	71.76
Spanish & French 西班牙语和法语	Spanish 西班牙语	93.78	73.06	76.27	57.14	72.53	75.19

5.4.5 RQ5: Model Validation for the Consistency of Dataset Annotation
5.4.5 问题 5：数据集注释一致性的模型验证

We use k-fold validation to measure the effectiveness of the

-NER model on the All Data described in Table 2. Specifically, we divide the training dataset into five subsets of roughly equal size without any intersection and keep the development and test set as same as the main model. Then, we train the

-NER model ten times with image, every two times using a different subset as the training set. Table 8 reports our average results, including the overall and individual type performance of 2M-NER across our different validation sets. From the table, those results indicate that the performance of each k -fold model converges and is reduced only a little compared to our ALL model, which indicates that the annotation of our dataset is consistent and efficient.
我们使用 k 倍验证来衡量

-NER 模型在表 2 所述的所有数据上的有效性。具体来说，我们将训练数据集分为五个大小大致相同、没有任何交集的子集，并保持开发集和测试集与主模型相同。然后，我们用图像训练

-NER 模型十次，每两次使用不同的子集作为训练集。表 8 报告了我们的平均结果，包括 2M-NER 在不同验证集上的整体性能和单个类型性能。从表中可以看出，每个 k 倍模型的性能都在收敛，与我们的 ALL 模型相比，性能仅略有下降，这表明我们的数据集标注是一致且高效的。

5.5 Ablation Study 5.5 消融研究

In order to analyze the contributions of image features in our model, we conducted some ablation experiments. In Fig. 2, the model utilizes ResNet to learn convolutional features and leverages ViT to capture patch features. Both types of features interact with textual features through multimodal fusion. To understand the contributions of ResNet and ViT, we designed two sets of ablation experiments. One set removes only ResNet from the original model, while the other set removes only ViT. The experimental results are shown in Table 9.
为了分析图像特征对我们模型的贡献，我们进行了一些消融实验。在图 2 中，模型利用 ResNet 学习卷积特征，并利用 ViT 捕捉斑块特征。这两类特征通过多模态融合与文本特征相互作用。为了了解 ResNet 和 ViT 的贡献，我们设计了两组消融实验。其中一组仅从原始模型中移除 ResNet，而另一组仅移除 ViT。实验结果如表 9 所示。

The ablation experiments were conducted on all language datasets, using the same data and parameters, with only the model architecture being different. From Table 9, it can be observed that removing either ResNet or ViT leads to a decrease in overall performance, while combining both yields better results. Additionally, removing ViT results in a smaller drop in F1 score compared to removing ResNet, indicating that ResNet contributes more significantly to the model's features.
消减实验在所有语言数据集上进行，使用相同的数据和参数，只有模型架构不同。从表 9 中可以看出，去除 ResNet 或 ViT 都会导致整体性能下降，而将两者结合则会产生更好的结果。此外，与移除 ResNet 相比，移除 ViT 导致的 F1 分数下降幅度较小，这表明 ResNet 对模型特征的贡献更大。

5.6 Error Analysis 5.6 误差分析

We observed two main types of errors in our model. One is related to the richness of image information, and the other is attributed to the diversity of entity semantics.
在我们的模型中，我们发现了两大类错误。一种与图像信息的丰富性有关，另一种则归因于实体语义的多样性。

Regarding the image information, we have discovered that some errors occur when the images do not contain relevant entity information. Without the assistance of image-related cues, the model can only predict entity positions and types based on
在图像信息方面，我们发现当图像不包含相关实体信息时会出现一些错误。如果没有图像相关线索的帮助，模型只能根据以下信息预测实体的位置和类型

Table 7 Experimental results of different models without images
表 7 不同模型在无图像情况下的实验结果

Language 语言	Methods 方法	Modality 模式	Single Type (F1)(%) 单一类型（F1）(%)				Overall(%) 总体(%)
Language 语言	Methods 方法	Modality 模式	PER.	LOC.	ORG.	MISC.	Precision 精确度	Recall 回顾	F1
All 全部	AdaCAN [18]	Only Text 仅文本	88.18	51.79	53.14	52.73	63.39	55.31	59.07
	UMT [8]	Only Text 仅文本	95.09	63.82	66.06	64.64	68.75	70.82	69.77
	UMGF [9]	Only Text 仅文本	93.64	62.04	62.44	60.89	66.03	67.85	66.93
	MKGFormer [19]	Only Text 仅文本	93.53	59.71	61.06	59.38	64.71	66.19	65.44
	2M-NER (Ours) 2M-NER （我们的）	Only Text 仅文本	94.53	63.73	65.33	64.86	68.89	70.62	69.75
English 英语	AdaCAN [18]	Only Text 仅文本	85.09	42.22	50.03	61.12	59.78	59.03	59.41
	UMT [8]	Only Text 仅文本	92.52	56.05	63.91	71.36	68.83	71.11	69.95
	UMGF [9]	Only Text 仅文本	88.29	55.31	61.60	68.26	65.24	69.55	67.32
	MKGFormer [19]	Only Text 仅文本	87.99	51.94	57.10	66.32	63.71	65.73	64.70
	2M-NER (Ours) 2M-NER （我们的）	Only Text 仅文本	96.74	53.61	63.47	71.27	68.85	71.16	69.99
French 法语	AdaCAN [18]	Only Text 仅文本	86.31	61.05	52.85	23.42	62.54	52.10	56.85
	UMT [8]	Only Text 仅文本	91.89	67.06	55.05	37.49	59.68	66.08	62.72
	UMGF [9]	Only Text 仅文本	91.46	66.18	52.11	30.30	57.79	63.54	60.53
	MKGFormer [19]	Only Text 仅文本	88.65	61.61	52.17	26.14	56.46	57.96	57.20
	2M-NER (Ours) 2M-NER （我们的）	Only Text 仅文本	92.73	67.88	58.60	35.19	61.54	64.81	63.13
Spanish 西班牙语	AdaCAN [18]	Only Text 仅文本	83.88	56.97	68.64	45.78	68.31	59.39	63.54
	UMT [8]	Only Text 仅文本	90.91	70.21	69.03	57.71	70.27	72.52	71.37
	UMGF [9]	Only Text 仅文本	88.73	69.15	68.35	50.85	65.40	71.78	68.44
	MKGFormer [19]	Only Text 仅文本	88.10	60.90	66.53	46.86	63.10	66.56	64.79
	2M-NER (Ours) 2M-NER （我们的）	Only Text 仅文本	92.32	71.22	70.89	54.91	68.85	73.74	71.21
German 德国	AdaCAN [18]	Only Text 仅文本	82.14	52.91	49.86	29.63	57.77	50.91	54.12
	UMT [8]	Only Text 仅文本	89.27	64.56	50.74	44.03	61.52	63.22	62.36
	UMGF [9]	Only Text 仅文本	89.04	60.11	50.12	31.18	56.74	59.19	57.94
	MKGFormer [	Only Text 仅文本	87.02	53.97	44.92	33.33	53.98	56.25	55.09
	2M-NER (Ours) 2M-NER （我们的）	Only Text 仅文本	92.61	61.85	55.90	38.64	62.34	61.98	62.16

Table 8 K-fold experimental results on the all language dataset
表 8 所有语言数据集的 K 倍实验结果

		Single Type (F1)(%) 单一类型（F1）(%)				Overall(%) 总体(%)
Data Type 数据类型	PER.	LOC.	ORG.	MISC.	Precision 精确度	Recall 回顾	F1
K-fold-1 K 倍-1	94.97	63.89	65.15	64.54	69.14	70.14	69.64
K-fold-2 K 折-2	95.82	63.82	63.98	64.15	68.90	69.91	69.40
K-fold-3 K 折-3	94.75	62.61	63.38	64.05	68.33	69.38	68.85
K-fold-4	94.91	62.50	63.67	63.49	68.26	69.01	68.63
K-fold-5	95.02	63.15	63.21	63.31	68.41	68.82	68.62
ALL	96.34	65.00	67.14	64.75	69.64	71.35

Table 9 Ablation experiments on the feature contributions of ResNet and ViT
表 9 ResNet 和 ViT 特征贡献的消融实验

		Single Type (F1)(%) 单一类型（F1）(%)				Overall(%) 总体(%)
Model 模型	PER.	LOC.	ORG.	MISC.	Precision 精确度	Recall 回顾	F1
2M-NER	96.34	65.00	67.14	64.75	69.64	71.35
-ResNet	95.02	62.32	64.86	63.96	67.80	70.35	69.05
-ViT	95.09	63.76	65.56	63.65	68.28	70.21	69.23

the textual context. For example, in the case of the sentence "00:30 An Evening with Glen Campbell - Musicians play arrangements of Glen Campbell's hits at the Royal Festival Hall in 1977" on Twitter, the corresponding image shows a scene of musicians and a band performing in a dark background. Due to the inability to clearly see the music hall, the model incorrectly classifies the entity type of "Royal Festival Hall" as MISC. Future work should focus on investigating how to enhance the scene knowledge and semantic information of images, in order to better determine the entity types mentioned in sentences.
文本语境。例如，在 Twitter 上的句子 "00:30 与格伦-坎贝尔共度的夜晚--1977 年在皇家节日音乐厅，音乐家们演奏格伦-坎贝尔的名曲改编曲 "中，相应的图片显示了在黑暗背景下音乐家和乐队表演的场景。由于无法清楚地看到音乐厅，模型错误地将 "皇家节日音乐厅 "的实体类型归类为 MISC。今后的工作应侧重于研究如何增强图像的场景知识和语义信息，以便更好地确定句子中提到的实体类型。

Regarding entity semantics, some entities themselves can represent multiple meanings. Therefore, if the model lacks domain knowledge, it may misclassify the entity type. For example, in the sentence "Jeep tour on the red rocks via Pink Adventure Tours," the word "Jeep" can refer to both an American automotive brand and a type of off-road vehicle. Due to this ambiguity, the model may incorrectly predict the MISC type as ORG or vice versa. One future research direction to address this issue is to incorporate multidimensional domain knowledge to better understand the semantic context of sentences.
关于实体语义，有些实体本身可以代表多种含义。因此，如果模型缺乏领域知识，就可能会对实体类型进行错误分类。例如，在句子 "Pink Adventure Tours 红色岩石上的吉普车之旅 "中，"吉普车 "一词既可以指美国的一个汽车品牌，也可以指一种越野车。由于这种模糊性，模型可能会错误地将 MISC 类型预测为 ORG，反之亦然。解决这一问题的一个未来研究方向是结合多维领域知识，更好地理解句子的语义上下文。

6 Conclusion and Future Work
6 结论和未来工作

To sum up, we have constructed a large-scale and high-value MMNER dataset named MMNERD, which is the first public dataset that supports both multilingual and multimodal. To facilitate research on MMNERD, we design a general MMNER framework, named 2M-NER, that achieves the highest F1 score on this valuable dataset. Sepcifically, 2M-NER leverages contrastive learning to align language and vision representations, and utilizes a Transformer based multimodal collaboration module to effectively fuse multilingual texts and images. The results from our experimental tables show that 2M-NER achieves better performance than numerous baseline models. Further analysis indicates the positive effects of both multilingualism and multimodality.
总之，我们构建了一个名为 MMNERD 的大规模、高价值 MMNER 数据集，这是首个同时支持多语言和多模态的公开数据集。为了促进对 MMNERD 的研究，我们设计了一个名为 2M-NER 的通用 MMNER 框架，它能在这个宝贵的数据集上获得最高的 F1 分数。具体来说，2M-NER 利用对比学习来调整语言和视觉表征，并利用基于 Transformer 的多模态协作模块来有效融合多语言文本和图像。实验结果表明，2M-NER 比众多基线模型取得了更好的性能。进一步的分析表明了多语言和多模态的积极作用。

In the future, we would like to conduct further studies in two directions. Firstly, as analyzed in our experiments, language combination of the same language family can promote the MMNER task, so we would like to test our model on more other languages. Secondly, according to our experimental results and previous works, more modalities can provide valuable information for the NER task. Therefore, we plan to introduce the acoustic modality, and propose an innovative framework that supports combined input of text, image, and sound.
未来，我们希望从两个方向开展进一步研究。首先，根据我们的实验分析，同一语系的语言组合可以促进 MMNER 任务，因此我们希望在更多其他语言上测试我们的模型。其次，根据我们的实验结果和之前的工作，更多的模态可以为 NER 任务提供有价值的信息。因此，我们计划引入声学模态，并提出一个支持文本、图像和声音组合输入的创新框架。

Acknowledgements. This work was supported by the National Key R&D Program of China (No. 2023YFF0725600) and Major Special Funds of the Changsha Scientific and Technological Project (Grant No. kh2202006).
致谢。本研究得到了国家重点研发计划（编号：2023YFF0725600）和长沙市科技计划重大专项资金（编号：kh2202006）的支持。

References 参考资料

[1] Cui H, Peng T, Xiao F, Han J, Han R, Liu L. Incorporating anticipation embedding into reinforcement learning framework for multihop knowledge graph question answering. Inf Sci. 2023;619:745-761.
[1] Cui H, Peng T, Xiao F, Han J, Han R, Liu L. 将预期嵌入纳入多跳知识图谱问题解答的强化学习框架.信息科学，2023；619：745-761.

[2] Du Y, Jin X, Yan R, Yan J. Sentiment enhanced answer generation and information fusing for product-related question answering. Inf Sci. 2023;627:205-219.
[2] Du Y，Jin X，Yan R，Yan J. 用于产品相关问题解答的情感增强型答案生成与信息融合。Inf Sci.

[3] Li D, Li J, Li H, Niebles JC, Hoi SCH. Align and Prompt: Video-and-Language Pretraining with Entity Prompts. In: CVPR. IEEE; 2022. p. 4943-4953.
[3] Li D，Li J，Li H，Niebles JC，Hoi SCH.对齐与提示：使用实体提示进行视频和语言预训练。In：CVPR.p. 4943-4953.

[4] Yang J, Yin Y, Ma S, Yang L, Guo H, Huang H, et al. HanoiT: Enhancing Contextaware Translation via Selective Context. In: DASFAA (3). vol. 13945 of Lecture Notes in Computer Science. Springer; 2023. p.

486.
[4] Yang J, Yin Y, Ma S, Yang L, Guo H, Huang H, et al. HanoiT: Enhancing Contextaware Translation via Selective Context.In：DASFAA (3)。《计算机科学讲座笔记》第 13945 卷。p.

486.

[5] Guerreiro NM, Voita E, Martins AFT. Looking for a Needle in a Haystack: A Comprehensive Study of Hallucinations in Neural Machine Translation. In: EACL. Association for Computational Linguistics; 2023. p.

.
[5] Guerreiro NM, Voita E, Martins AFT.大海捞针：神经机器翻译中的幻觉综合研究》。In：EACL.p.

[6] Huang Z, Xu W, Yu K. Bidirectional LSTMCRF Models for Sequence Tagging. CoRR. 2015;abs/1508.01991.
[6] Huang Z, Xu W, Yu K. Bidirectional LSTMCRF Models for Sequence Tagging.CoRR.2015;abs/1508.01991.

[7] Ma X, Hovy EH. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. In: ACL (1). The Association for Computer Linguistics; 2016. .
[7] Ma X, Hovy EH.通过双向 LSTM-CNNs-CRF 进行端到端序列标注。In：ACL (1).计算机语言学协会；2016..

[8] Yu J, Jiang J, Yang L, Xia R. Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer. In: ACL. Association for Computational Linguistics; 2020. p. 33423352 .
[8] Yu J, Jiang J, Yang L, Xia R. 通过统一多模态变换器的实体跨度检测改进多模态命名实体识别。In：ACL.p. 33423352 .

[9] Zhang D, Wei S, Li S, Wu H, Zhu Q, Zhou G. Multi-modal Graph Fusion for Named Entity Recognition with Targeted Visual Guidance. In: AAAI. AAAI Press; 2021. p. 14347-14355.
[9] Zhang D、Wei S、Li S、Wu H、Zhu Q、Zhou G. 利用目标视觉引导进行命名实体识别的多模态图融合。In：AAAI.AAAI Press; 2021. p. 14347-14355.

[10] Li J, Chiu B, Feng S, Wang H. FewShot Named Entity Recognition via MetaLearning. IEEE Trans Knowl Data Eng. 2022;34(9):4245-4256.
[10] Li J, Chiu B, Feng S, Wang H. FewShot Named Entity Recognition via MetaLearning.IEEE Trans Knowl Data Eng.

[11] Agarwal O. Towards Robust Named Entity Recognition via Temporal Domain Adaptation and Entity Context Understanding. In: AAAI. AAAI Press; 2022. p. 12866-12867.
[11] Agarwal O. Towards Robust Named Entity Recognition via Temporal Domain Adaptation and Entity Context Understanding.In：AAAI.AAAI Press; 2022. p. 12866-12867.

[12] Shen Y, Wang X, Tan Z, Xu G, Xie P, Huang F, et al. Parallel Instance Query Network for Named Entity Recognition. In: ACL (1). Association for Computational Linguistics; 2022. p.

.
[12] Shen Y、Wang X、Tan Z、Xu G、Xie P、Huang F 等：用于命名实体识别的并行实例查询网络。In：ACL (1).p.

[13] Schmidt FD, Vulic I, Glavas G. SLICER: Sliced Fine-Tuning for Low-Resource CrossLingual Transfer for Named Entity Recognition. In: EMNLP. Association for Computational Linguistics; 2022. p. 10775-10785.
[13] Schmidt FD, Vulic I, Glavas G. SLICER: Sliced Fine-Tuning for Low-Resource CrossLingual Transfer for Named Entity Recognition.In：EMNLP.p. 10775-10785.

[14] Zhang X, Yuan J, Li L, Liu J. Reducing the Bias of Visual Objects in Multimodal Named Entity Recognition. In: WSDM. ACM; 2023. p. 958-966.
[14] Zhang X, Yuan J, Li L, Liu J. Reducing the Bias of Visual Objects in Multimodal Named Entity Recognition.In. WSDM：WSDM.ACM; 2023. p. 958-966.

[15] Kulkarni M, Preotiuc-Pietro D, Radhakrishnan K, Winata G, Wu S, Xie L, et al. Towards a Unified Multi-Domain Multilingual Named Entity Recognition Model. In: EACL. Association for Computational Linguistics; 2023. p. 2202-2211.
[15] Kulkarni M, Preotiuc-Pietro D, Radhakrishnan K, Winata G, Wu S, Xie L, et al.In：EACL.p. 2202-2211.
[16] Zhang Y, Meng F, Chen Y, Xu J, Zhou J. Target-oriented Fine-tuning for ZeroResource Named Entity Recognition. In: ACL/IJCNLP (Findings). vol. ACL/IJCNLP 2021 of Findings of ACL. Association for Computational Linguistics; 2021. p. 16031615 .
[16] Zhang Y, Meng F, Chen Y, Xu J, Zhou J. Target-oriented Fine-tuning for ZeroResource Named Entity Recognition.In：ACL/IJCNLP (Findings). vol. ACL/IJCNLP 2021 of Findings of ACL.p. 16031615 .

[17] Boros E, González-Gallardo C, Moreno JG, Doucet A. L3i at SemEval-2022 Task 11: Straightforward Additional Context for Multilingual Named Entity Recognition. In: SemEval@NAACL. Association for Computational Linguistics; 2022. p. 1630-1638.
[17] Boros E, González-Gallardo C, Moreno JG, Doucet A. L3i at SemEval-2022 Task 11: Straforward Additional Context for Multilingual Named Entity Recognition.In：SemEval@NAACL.计算语言学协会；2022 年。第 1630-1638 页。

[18] Zhang Q, Fu J, Liu X, Huang X. Adaptive Co-attention Network for Named Entity Recognition in Tweets. In: AAAI. AAAI Press; 2018. p. 5674-5681.
[18] Zhang Q, Fu J, Liu X, Huang X. Adaptive Co-attention Network for Named Entity Recognition in Tweets.In：AAAI.AAAI Press; 2018. p. 5674-5681.

[19] Chen X, Zhang N, Li L, Deng S, Tan C, Xu C, et al. Hybrid Transformer with Multi-level Fusion for Multimodal Knowledge Graph Completion. In: SIGIR. ACM; 2022. p. 904-915.
[19] Chen X, Zhang N, Li L, Deng S, Tan C, Xu C, et al.In：SIGIR.ACM; 2022. p. 904-915.

[20] Wang X, Gui M, Jiang Y, Jia Z, Bach N, Wang T, et al. ITA: Image-Text Alignments for Multi-Modal Named Entity Recognition. In: NAACL-HLT. Association for Computational Linguistics; 2022. p. 3176-3189.
[20] Wang X、Gui M、Jiang Y、Jia Z、Bach N、Wang T 等：《ITA：用于多模态命名实体识别的图像-文本对齐》。In：NAACL-HLT.In: NAACL-HLT. Association for Computational Linguistics; 2022. p. 3176-3189.

[21] Sang EFTK. Introduction to the CoNLL2002 Shared Task: Language-Independent Named Entity Recognition. In: CoNLL. ACL; 2002. .
[21] Sang EFTK.CoNLL2002 共享任务简介：与语言无关的命名实体识别。In：CoNLL.ACL; 2002..

[22] Sang EFTK, Meulder FD. Introduction to the CoNLL-2003 Shared Task: LanguageIndependent Named Entity Recognition. In: CoNLL. ACL; 2003. p. 142-147.
[22] Sang EFTK, Meulder FD.CoNLL-2003 共享任务简介：LanguageIndependent Named Entity Recognition.In：CoNLL.p. 142-147.

[23] Pan X, Zhang B, May J, Nothman J, Knight K, Ji H. Cross-lingual Name Tagging and Linking for 282 Languages. In: ACL (1). Association for Computational Linguistics; 2017. p. 1946-1958.
[23] Pan X, Zhang B, May J, Nothman J, Knight K, Ji H.282 种语言的跨语言名称标记和链接。In：ACL (1).计算语言学协会；2017 年。第 1946-1958 页。

[24] Lu D, Neves L, Carvalho V, Zhang N, Ji H. Visual Attention Model for Name Tagging in Multimodal Social Media. In: ACL (1). Association for Computational Linguistics; 2018.
[24] Lu D, Neves L, Carvalho V, Zhang N, Ji H..多模态社交媒体中姓名标记的视觉注意力模型。In：ACL (1).计算语言学协会；2018 年。
p.

. p.

。

[25] Sui D, Tian Z, Chen Y, Liu K, Zhao J. A Large-Scale Chinese Multimodal NER Dataset with Speech Clues. In: ACL/IJCNLP (1). Association for Computational Linguistics; 2021. p. 2807-2818.
[25] Sui D, Tian Z, Chen Y, Liu K, Zhao J. A Large-Scale Chinese Multimodal NER Dataset with Speech Clues.In：acl/ijcnlp (1).计算语言学协会；2021 年。第 2807-2818 页。

[26] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An Image is Worth 16 x 16 Words: Transformers for Image Recognition at Scale. In: ICLR. OpenReview.net; 2021.
[26] Dosovitskiy A, Beyer L, Kolesnikov A, Weissenborn D, Zhai X, Unterthiner T, et al. An Image is Worth 16 x 16 Words. Transformers for Image Recognition at Scale：规模图像识别变换器。In：ICLR.OpenReview.net; 2021.

[27] He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: CVPR. IEEE Computer Society; 2016. p. 770-778.
[27] He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition.In：CVPR.IEEE Computer Society; 2016. p. 770-778.

[28] Sun E, Zhou D, Tian Y, Xu Z, Wang X. Transformer-based few-shot object detection in traffic scenarios. Appl Intell. 2024;54(1):947-958
[28] Sun E、Zhou D、Tian Y、Xu Z、Wang X. 基于变压器的交通场景中少量物体检测。Appl Intell.2024;54(1):947-958

[29] Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural Architectures for Named Entity Recognition. In: HLTNAACL. The Association for Computational Linguistics; 2016. p. 260-270.
[29] Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. 命名实体识别的神经架构。In：HLTNAACL.The Association for Computational Linguistics; 2016. p. 260-270.

[30] Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: NAACL-HLT (1). Association for Computational Linguistics; 2019. p. 4171-4186.
[30] Devlin J、Chang M、Lee K、Toutanova K. BERT：用于语言理解的深度双向变换器预训练。In：NAACL-HLT (1).Association for Computational Linguistics; 2019. p. 4171-4186.

[31] Sen P, Aji AF, Saffari A. Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering. In: COLING. International Committee on Computational Linguistics; 2022. p. 1604-1619.
[31] Sen P, Aji AF, Saffari A. Mintaka: A Complex, Natural, and Multilingual Dataset for End-to-End Question Answering.In. COLING：COLING.国际计算语言学委员会；2022 年。第 1604-1619 页。

[32] Perevalov A, Both A, Diefenbach D, Ngomo AN. Can Machine Translation be a Reasonable Alternative for Multilingual Question Answering Systems over Knowledge Graphs? In: WWW. ACM; 2022. p. 977-986.
[32] Perevalov A, Both A, Diefenbach D, Ngomo AN.机器翻译能否成为知识图谱多语言问题解答系统的合理替代方案？In. WWW：WWW.ACM; 2022. p. 977-986.

[33] Wang R, Zhang Z, Zhuang F, Gao D, Wei Y, He Q. Adversarial Domain Adaptation for Cross-lingual Information Retrieval with Multilingual BERT. In: CIKM. ACM; 2021. p.

[33] Wang R, Zhang Z, Zhuang F, Gao D, Wei Y, He Q. Adversarial Domain Adaptation for Cross-lingual Information Retrieval with Multilingual BERT.In：CIKM.ACM; 2021.

p.
[34] Sun S, Duh K. CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval. In: EMNLP (1). Association for Computational Linguistics; 2020. p. 4160 4170.
[34] Sun S, Duh K. CLIRMatrix：用于跨语言信息检索的大型双语和多语数据集。In：EMNLP (1).计算语言学协会；2020 年。第 4160 4170 页。

[35] Bhartiya A, Badola K, Mausam. DiS-ReX: A Multilingual Dataset for Distantly Supervised Relation Extraction. In: ACL (2). Association for Computational Linguistics; 2022. p. 849-863.
[35] Bhartiya A, Badola K, Mausam.DiS-ReX：用于远距离监督关系提取的多语言数据集。In. ACL (2)：ACL (2).计算语言学协会；2022 年。第 849-863 页。

[36] Rathore V, Badola K, Singla P, Mausam. PARE: A Simple and Strong Baseline for Monolingual and Multilingual Distantly Supervised Relation Extraction. In: ACL (2). Association for Computational Linguistics; 2022. p. 340-354.
[36] Rathore V, Badola K, Singla P, Mausam.PARE：用于单语和多语远程监督关系提取的简单而强大的基线。In. ACL (2)：ACL (2).第 340-354 页。

[37] Nothman J, Ringland N, Radford W, Murphy T, Curran JR. Learning multilingual named entity recognition from Wikipedia. Artif Intell. 2013;194:151-175.
[37] Nothman J, Ringland N, Radford W, Murphy T, Curran JR.从维基百科学习多语言命名实体识别。Artif Intell.2013;194:151-175.

[38] Malmasi S, Fang A, Fetahu B, Kar S, Rokhlenko O. MultiCoNER: A Large-scale Multilingual Dataset for Complex Named Entity Recognition. In: COLING. International Committee on Computational Linguistics; 2022. p. 3798-3809.
[38] Malmasi S, Fang A, Fetahu B, Kar S, Rokhlenko O. MultiCoNER: A Large-scale Multilingual Dataset for Complex Named Entity Recognition.In：COLING.p. 3798-3809.

[39] Malmasi S, Fang A, Fetahu B, Kar S, Rokhlenko O. SemEval-2022 Task 11: Multilingual Complex Named Entity Recognition (MultiCoNER). In: SemEval@NAACL. Association for Computational Linguistics; 2022. p.

.
[39] Malmasi S, Fang A, Fetahu B, Kar S, Rokhlenko O. SemEval-2022 Task 11: Multiilingual Complex Named Entity Recognition (MultiCoNER).In：SemEval@NAACL.p.

[40] Emelyanov AA, Artemova E. Multilingual Named Entity Recognition Using Pretrained Embeddings, Attention Mechanism and NCRF. In: BSNLP@ACL. Association for Computational Linguistics; 2019. p. 94 99 .
[40] Emelyanov AA, Artemova E. 使用预训练嵌入、注意机制和 NCRF 的多语言命名实体识别。In：BSNLP@ACL.计算语言学协会；2019 年。第 94 99 页。

[41] Arkhipov MY, Trofimova M, Kuratov Y, Sorokin A. Tuning Multilingual Transformers for Language-Specific Named Entity Recognition. In: BSNLP@ACL. Association for Computational Linguistics; 2019. p. 89-93.
[41] Arkhipov MY, Trofimova M, Kuratov Y, Sorokin A. Tuning Multilingual Transformers for Language-Specific Named Entity Recognition.In：BSNLP@ACL.Association for Computational Linguistics; 2019. p. 89-93.

[42] Winata GI, Lin Z, Fung P. Learning Multilingual Meta-Embeddings for CodeSwitching Named Entity Recognition. In: RepL4NLP@ACL. Association for Computational Linguistics; 2019. p. 181-186.
[42] Winata GI, Lin Z, Fung P. Learning Multilingual Meta-Embeddings for CodeSwitching Named Entity Recognition.In：RepL4NLP@ACL.Association for Computational Linguistics; 2019. p. 181-186.

[43] Wu Q, Lin Z, Wang G, Chen H, Karlsson BF, Huang B, et al. Enhanced Meta-Learning for Cross-Lingual Named Entity Recognition with Minimal Resources. In: AAAI. AAAI Press; 2020. p. 9274-9281.
[43] Wu Q, Lin Z, Wang G, Chen H, Karlsson BF, Huang B, et al.In：AAAI.AAAI Press; 2020. p. 9274-9281.

[44] Moon S, Neves L, Carvalho V. Multimodal Named Entity Recognition for Short Social Media Posts. In: NAACL-HLT. Association for Computational Linguistics; 2018. p. 852860 .
[44] Moon S, Neves L, Carvalho V. 社交媒体短文的多模态命名实体识别。In：NAACL-HLT.计算语言学协会；2018 年。第 852860 页。

[45] Zhao F, Li C, Wu Z, Xing S, Dai X. Learning from Different text-image Pairs: A Relationenhanced Graph Convolutional Network for Multimodal NER. In: ACM Multimedia. ACM; 2022. p. 3983-3992.
[45] Zhao F, Li C, Wu Z, Xing S, Dai X.从不同的文本图像对中学习：用于多模态 NER 的关系增强图卷积网络。In：ACM Multimedia.ACM; 2022. p. 3983-3992.

[46] Sun L, Wang J, Zhang K, Su Y, Weng F. RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER. In: AAAI. AAAI Press; 2021. p.

[46] Sun L, Wang J, Zhang K, Su Y, Weng F. RpBERT: A Text-image Relation Propagation-based BERT Model for Multimodal NER.In：AAAI.P.

[47] Zheng C, Wu Z, Wang T, Cai Y, Li Q. Object-Aware Multimodal Named Entity Recognition in Social Media Posts With Adversarial Learning. IEEE Trans Multim. 2021;23:2520-2532.
[47] Zheng C, Wu Z, Wang T, Cai Y, Li Q.利用对抗学习实现社交媒体帖子中的物体感知多模态命名实体识别。IEEE Trans Multim.2021;23:2520-2532.

[48] Li X, Kong D. SRIF-RCNN: Sparsely represented inputs fusion of different sensors for 3D object detection. Appl Intell. 2023;53(5):5532-5553.
[48] Li X, Kong D. SRIF-RCNN：用于三维物体检测的不同传感器稀疏表示输入融合。Appl Intell.2023;53(5):5532-5553.

[49] Wu Z, Zheng C, Cai Y, Chen J, Leung H, Li Q. Multimodal Representation with Embedded Visual Guiding Objects for Named Entity Recognition in Social Media Posts. In: ACM Multimedia. ACM; 2020. p. 1038-1046.
[49] Wu Z, Zheng C, Cai Y, Chen J, Leung H, Li Q.嵌入视觉引导对象的多模态表示法用于社交媒体帖子中的命名实体识别。In：ACM Multimedia.ACM; 2020. p. 1038-1046.

[50] Liu Y, Gu J, Goyal N, Li X, Edunov S, Ghazvininejad M, et al. Multilingual Denoising Pre-training for Neural Machine Translation. Trans Assoc Comput Linguistics. 2020;8:726-742.
[50] Liu Y, Gu J, Goyal N, Li X, Edunov S, Ghazvininejad M, et al.Trans Assoc Comput Linguistics.2020;8:726-742.
[51] Cohen J. A Coefficient of Agreement for Nominal Scales. Educational and psychological measurement. 1960;20(1):37-46.
[51] Cohen J. A Coefficient of Agreement for Nominal Scales.教育与心理测量。1960;20(1):37-46.

[52] Oskouei AG, Balafar MA, Motamed C. RDEIC-LFW-DSS: ResNet-based deep embedded image clustering using local feature weighting and dynamic sample selection mechanism. Inf Sci. 2023;646:119374.
[52] Oskouei AG, Balafar MA, Motamed C. RDEIC-LFW-DSS: ResNet-based deep embedded image clustering using local feature weighting and dynamic sample selection mechanism.Inf Sci.

[53] Chen T, Kornblith S, Norouzi M, Hinton GE. A Simple Framework for Contrastive Learning of Visual Representations. In: ICML. vol. 119 of Proceedings of Machine Learning Research. PMLR; 2020. p. 1597-1607.
[53] Chen T, Kornblith S, Norouzi M, Hinton GE.视觉表征对比学习的简单框架。ICML：119 of Proceedings of Machine Learning Research. PMLR; 2020.PMLR; 2020. p. 1597-1607.

[54] He K, Fan H, Wu Y, Xie S, Girshick RB. Momentum Contrast for Unsupervised Visual Representation Learning. In: CVPR. Computer Vision Foundation / IEEE; 2020. p. 9726-9735.
[54] He K, Fan H, Wu Y, Xie S, Girshick RB.无监督视觉表征学习的动量对比。In：CVPR.p. 9726-9735.

[55] Radford A, Kim JW, Hallacy C, Ramesh A, Goh G, Agarwal S, et al. Learning Transferable Visual Models From Natural Language Supervision. In: ICML. vol. 139 of Proceedings of Machine Learning Research. PMLR; 2021. p. 8748-8763.
[55] Radford A、Kim JW、Hallacy C、Ramesh A、Goh G、Agarwal S 等：《从自然语言监督中学习可转移的视觉模型》。In：ICML。《机器学习研究论文集》第 139 卷。PMLR; 2021. p. 8748-8763.

https://huggingface.co/datasets/flaxcommunity/conceptual-12m-mbart-50-multilingual/tree/main
https://huggingface.co

https://spacy.io/

https://fanyi-api.baidu.com
An image named 17_06_4705.jpg with the words "image not found".
一张名为 17_06_4705.jpg 的图片，上面写着 "未找到图片"。
https://nlp.johnsnowlabs.com/2020/01/22/glove_6B_300.html

https://download.pytorch.org/models/resnet152b121ed2d.pth

2M-NER: Contrastive Learning for Multilingual and Multimodal NER with Language and Modal Fusion 2M-NER：多语言和多模态 NER 的对比学习与语言和模态融合

Abstract 摘要

1 Introduction 1 引言

2 Related Work 2 相关工作

2.1 Multilingual Named Entity Recognition2.1 多语言命名实体识别

2.2 Multimodal Named Entity Recognition2.2 多模态命名实体识别

3 Dataset Construction 3 数据集构建

3.1 Dataset Annotation 3.1 数据集注释

3.2 Dataset Statistics and Evaluation3.2 数据集统计与评估

3.3 Dataset Comparison 3.3 数据集比较

4 The Proposed Method 4 建议的方法

4.1 Overview of the Proposed Model4.1 拟议模型概述

4.2 Feature Extraction Module4.2 特征提取模块

4.2.1 Multilingual Text Representation4.2.1 多语言文本表示法

4.2.2 Image Representation4.2.2 图像表示法

4.3 Multimodal Alignment Module4.3 多模式对齐模块

4.4 Multimodal Collaboration Module4.4 多模式协作模块

4.5 CRF Decoder 4.5 CRF 解码器

4.6 Model Training 4.6 模型训练

5 Experiments 5 项实验

5.1 Implementation Details5.1 实施细节

5.2 Comparison Models 5.2 比较模式

5.3 Metrics 5.3 衡量标准

5.4 Results 5.4 结果

5.4.1 RQ1: Overall Comparison with Existing Models5.4.1 问题 1：与现有模型的总体比较

5.4.2 RQ2: The Effect of Multilingualism5.4.2 问题 2：使用多种语言的影响

5.4.3 RQ3: The Effect of Multimodality5.4.3 问题 3：多模态的影响

5.4.4 RQ4: Challenging Analysis of Data from Different Sources5.4.4 问题 4：对不同来源数据的分析具有挑战性

5.4.5 RQ5: Model Validation for the Consistency of Dataset Annotation5.4.5 问题 5：数据集注释一致性的模型验证

5.5 Ablation Study 5.5 消融研究

5.6 Error Analysis 5.6 误差分析

6 Conclusion and Future Work6 结论和未来工作

References 参考资料

2M-NER: Contrastive Learning for Multilingual and Multimodal NER with Language and Modal Fusion
2M-NER：多语言和多模态 NER 的对比学习与语言和模态融合

2.1 Multilingual Named Entity Recognition
2.1 多语言命名实体识别

2.2 Multimodal Named Entity Recognition
2.2 多模态命名实体识别

3.2 Dataset Statistics and Evaluation
3.2 数据集统计与评估

4.1 Overview of the Proposed Model
4.1 拟议模型概述

4.2 Feature Extraction Module
4.2 特征提取模块

4.2.1 Multilingual Text Representation
4.2.1 多语言文本表示法

4.2.2 Image Representation
4.2.2 图像表示法

4.3 Multimodal Alignment Module
4.3 多模式对齐模块

4.4 Multimodal Collaboration Module
4.4 多模式协作模块

5.1 Implementation Details
5.1 实施细节

5.4.1 RQ1: Overall Comparison with Existing Models
5.4.1 问题 1：与现有模型的总体比较

5.4.2 RQ2: The Effect of Multilingualism
5.4.2 问题 2：使用多种语言的影响

5.4.3 RQ3: The Effect of Multimodality
5.4.3 问题 3：多模态的影响

5.4.4 RQ4: Challenging Analysis of Data from Different Sources
5.4.4 问题 4：对不同来源数据的分析具有挑战性

5.4.5 RQ5: Model Validation for the Consistency of Dataset Annotation
5.4.5 问题 5：数据集注释一致性的模型验证

6 Conclusion and Future Work
6 结论和未来工作