
An Effective Span-based Multimodal Named Entity Recognition with Consistent Cross-Modal Alignment

Yongxiu Xu, Hao Xu, Heyan Huang, Shiyao Cui
Minghao Tang, Longzheng Wang, Hongbo Xu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
hhy63@bit.edu.cn, {xuyongxiu, xuhao, tangminghao, wanglongzheng, hbxu}@iie.ac.cn

Abstract

With the increasing availability of multimodal content on social media, consisting primarily of text and images, multimodal named entity recognition (MNER) has gained widespread attention. A fundamental challenge of MNER lies in effectively aligning the different modalities. However, the majority of current approaches rely on a word-based sequence labeling framework and align the image and text at inconsistent semantic levels (whole image-words or regions-words). This misalignment may lead to inferior entity recognition performance. To address this issue, we propose an effective span-based method, named SMNER, which achieves a more consistent multimodal alignment from the perspectives of information theory and cross-modal interaction, respectively. Specifically, we first introduce a cross-modal information bottleneck module for global-level multimodal alignment (whole image-whole text). This module encourages the semantic distribution of the image to move closer to the semantic distribution of the text, which enables the filtering out of visual noise. Next, we introduce a cross-modal attention module for local-level multimodal alignment (regions-spans), which captures the correlations between regions in the image and spans in the text, enabling a more precise alignment of the two modalities. Extensive experiments conducted on two benchmark datasets demonstrate that SMNER outperforms the state-of-the-art baselines.

Keywords: Multimodal named entity recognition, Multimodal alignment, Multimodal fusion

1. Introduction

Named Entity Recognition (NER) involves identifying named entities within a given sentence and categorizing them into pre-defined types (Li et al., 2020). NER is a critical natural language processing task and serves as a key component in information retrieval (Dietz, 2019), question answering (Min et al., 2021), knowledge graphs (Zhao et al., 2022), etc. However, in practical scenarios such as social media platforms, the text is often short, informal, and accompanied by images, which presents a significant challenge for traditional text-based NER. Multimodal named entity recognition (MNER) has become a new direction and has attracted widespread attention owing to its excellent performance in entity recognition for social media posts.
MNER extends traditional text-based NER by incorporating images as additional input (Zhang et al., 2018), which can offer complementary benefits to alleviate ambiguity in natural language. However, MNER poses a fundamental challenge: effectively aligning information across the two modalities, text and image. Existing MNER methods primarily utilize various attention networks (such as self-attention or cross-attention) to address this challenge, and they can be categorized into two strategies, coarse-grained alignment and fine-grained alignment, as shown in Figure 1.
• Corresponding author

Figure 1: An example of coarse-grained alignment and fine-grained alignment. Both of these strategies align text with the image at inconsistent semantic levels, leading to misalignment noise.
In the early stages, some efforts (Zhang et al., 2018; Moon et al., 2018) directly considered the entire image as a global-level visual cue, which guides the words in the text to learn a vision-aware representation of the whole image, as shown in Figure 1(a). However, this coarse-grained alignment inevitably introduces image noise (e.g., background) and simultaneously results in the loss of some representative information. Subsequently, a growing number of studies (Zheng et al., 2020; Yu et al., 2020; Zhang et al., 2021; Xu et al., 2022; Jia et al., 2023) focused on fine-grained semantic alignment between text and images. These methods typically capture the interactions between words in the text and regions in the image in a unified semantic space, as shown in Figure 1(b). Actually, the regions of the objects in the image should align with the corresponding entity spans in the text rather than with individual words, as individual words may not adequately capture the overall semantics of an entity span. As shown in Figure 1, for the semantic representations of the two modalities, the regions of the person object Adam Levine in the image should have a higher similarity to the span "Adam Levine" in the text than to the word "Adam" or the word "Levine". Given that neither of the aforementioned alignment strategies achieves consistent semantic alignment between text and images, resulting in the introduction of noise and subsequently inferior performance, we collectively refer to these issues as "misalignment noise".
Taking the considerations above, we propose an effective Span-based Multimodal Named Entity Recognition method, named SMNER, which regards MNER as a span-based classification task rather than a word-based sequence labeling task. SMNER is specifically designed to learn informative multimodal span representations by effectively aligning and fusing the information contained in the text and image. SMNER consists of two key modules: a cross-modal information bottleneck (CMIB) module for global multimodal alignment and denoising, and a cross-modal attention (CMA) module for local multimodal alignment and interaction.
More specifically, motivated by the multi-view information bottleneck principle (Federici et al., 2020), we consider the text and image as two different views of the same post. First, we formulate cross-modal global semantic alignment from an information-theoretic perspective by maximizing the mutual information and minimizing the distributional divergence between the two modalities. This module brings the visual semantic distribution closer to the textual semantic distribution and filters out irrelevant information from visual representations. Second, for fine-grained multimodal alignment, we feed the contextual unimodal representations into a cross-modal attention module that captures the correlations between spans in the text and regions in the image. This module enables a more precise alignment between the two modalities and acquires informative cross-modal features. Finally, the obtained cross-modal features are aggregated effectively to enhance the representation of spans, thereby improving the performance of entity classification.
In summary, the main contributions of this paper are as follows:
• We propose an effective span-based classification method for MNER, aiming to reduce the impact of misalignment and achieve more consistent multimodal alignment at two levels (image-text and regions-spans, respectively). To the best of our knowledge, we are the first to explore a span-based MNER model for the issue of misalignment.
• We introduce two modules (CMIB and CMA) from the perspectives of the information-theoretic principle and cross-modal interaction, respectively. These modules work in synergy to generate more expressive cross-modal representations, enhancing the final entity classification performance.
• We conduct extensive experiments on two widely used MNER datasets to prove the effectiveness of our method. Experimental results show that SMNER outperforms the state-of-the-art models on both datasets.
2. Related Work

In this section, we review the work related to our method from two aspects: multimodal named entity recognition and the information bottleneck.

2.1. Multimodal Named Entity Recognition

As multimodal data become increasingly popular on social media platforms, MNER has attracted broad attention in named entity recognition, starting with Moon et al. (2018), Lu et al. (2018), and Zhang et al. (2018).
From the perspective of multimodal alignment and fusion, some studies (Moon et al., 2018; Zhang et al., 2018) tried to encode the entire image and implicitly interact the information of the two modalities using an attention mechanism. For example, Moon et al. (2018) proposed an LSTM-CNN architecture that combines text with image information via a general modality attention, and Zhang et al. (2018) proposed an adaptive co-attention network to dynamically control the fusion of the two modalities. Different from the above works using the whole image, subsequent works (Lu et al., 2018; Yu et al., 2020; Wu et al., 2020; Zheng et al., 2020; Zhang et al., 2021) primarily focused on combining fine-grained regional visual information with word information in the text to boost MNER performance. Lu et al. (2018) extracted the image regions most related to the text and utilized an attention-based model to implicitly interact the information of the two modalities. Yu et al. (2020) introduced a multimodal interaction module designed to capture both image-aware word representations and word-aware visual representations. Zhang et al. (2021) exploited a unified multimodal graph to capture the interactions between words in the text and regions in the image.
Figure 2: Model architecture overview of SMNER. The cross-modal information bottleneck module performs global multimodal alignment and denoising, and the cross-modal attention module performs local multimodal alignment and interaction.
Although the studies above have achieved promising results, most of these methods ignore the problem of visual noise caused by irrelevant images. More recently, Xu et al. (2022), Chen et al. (2022), and Zhang et al. (2023) alleviated this problem by text-image matching, a hierarchical visual prefix, and contrastive learning, respectively.
Different from the aforementioned methods, we focus on the noise caused by misalignment. Additionally, the above studies adopt the word-based sequence labeling framework, whereas we utilize a span-based classification framework, ensuring alignment and interaction between text and images at consistent semantic levels. It is worth noting that while Zhou et al. (2022a) also employed a span-based framework for MNER, it is more concerned with multimodal representations, overlooking multimodal alignment and interaction.

2.2. Information Bottleneck

The Information Bottleneck (IB) principle (Tishby and Zaslavsky, 2015) provides a theoretical framework for analyzing deep neural networks, formulating the goal of representation learning as an information trade-off between predictive power and representation compression. Later, the variational information bottleneck (VIB) (Alemi et al., 2016) bridged the gap between IB and deep learning with variational inference. More recently, Federici et al. (2020) provided a variant of IB that extends it to the multi-view unsupervised setting, enabling the identification of superfluous information that is not shared by both views. Nowadays, owing to its capacity for learning minimal informative representations, IB has been extensively applied in computer vision (Peng et al., 2018), sentiment analysis (Mai et al., 2022), and natural language processing (Zhou et al., 2022b). Motivated by this, instead of directly applying the IB principle to the MNER task, we adopt the multi-view IB principle to enhance the distribution consistency between the two modalities and filter out irrelevant information from the images.
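For reference, this trade-off is commonly written as the IB Lagrangian; in standard notation (ours, not reproduced from this paper), a representation $Z$ of the input $X$ is learned for a target $Y$ by minimizing:

```latex
% Standard IB objective: compress X into Z while keeping Z predictive of Y;
% beta > 0 controls the trade-off between compression and prediction.
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```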

3. Method

3.1. Overview

Task Definition. Given an input pair $(T, V)$ containing a text sentence $T$ and its associated image $V$, the goal of MNER is to detect entity spans from $T$ and classify them into the corresponding entity types. Unlike existing MNER models that regard MNER as a sequence labeling task, we regard MNER as a span classification task. Let $T=\{t_1, t_2, \ldots, t_n\}$ denote the input sentence with $n$ words; the label for the text $T$ is formulated as a set $Y=\{(s_k, c_k)\}_{k=1}^{m}$, where $m$ is the number of named entities, $s_k=(b_k, e_k)$ is the span of an entity corresponding to the phrase $\{t_{b_k}, \ldots, t_{e_k}\}$, and $c_k$ represents the corresponding entity type, which belongs to a predefined entity type set.
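To make the span-classification formulation concrete, the following minimal Python sketch enumerates candidate spans and pairs them with gold types; the maximum span length and the "O" label for non-entity spans are illustrative assumptions rather than details taken from the paper.

```python
from typing import Dict, List, Tuple

def enumerate_spans(n_words: int, max_span_len: int = 8) -> List[Tuple[int, int]]:
    """Enumerate every candidate span (start, end), with end inclusive."""
    return [(b, e)
            for b in range(n_words)
            for e in range(b, min(b + max_span_len, n_words))]

def label_spans(spans: List[Tuple[int, int]],
                gold: Dict[Tuple[int, int], str]) -> List[str]:
    """Assign each candidate span its gold entity type, or 'O' if it is not an entity."""
    return [gold.get(span, "O") for span in spans]

# Toy post from Figure 1: the span (0, 1) covers "Adam Levine" and is a PER entity.
words = ["Adam", "Levine", "is", "performing", "tonight"]
gold = {(0, 1): "PER"}
candidates = enumerate_spans(len(words))
labels = label_spans(candidates, gold)
```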
Model Architecture. The overall architecture of SMNER is illustrated in Figure 2. Given image-text pairs, we first obtain the unimodal representations with modal-specific encoders. Then, the representations of both text and image primarily flow into two modules: 1) the cross-modal information bottleneck module for global multimodal alignment and denoising, and 2) the cross-modal attention module for local multimodal alignment and interaction. Finally, we fuse the representations of the two modalities to obtain the multimodal span representation, and feed it into an entity classification layer to get the final predictions. These modules are trained simultaneously in an end-to-end framework.
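As a rough illustration of this data flow, the PyTorch-style skeleton below wires the components together; every class name, argument, and the way the losses are combined are our assumptions for illustration, not the paper's released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SMNERSketch(nn.Module):
    """Illustrative forward pass only: encoders -> CMIB + CMA -> fusion -> span classifier."""

    def __init__(self, text_encoder, image_encoder, cmib, cma, fusion, classifier):
        super().__init__()
        self.text_encoder = text_encoder    # e.g., BERT-base (Section 3.2)
        self.image_encoder = image_encoder  # e.g., ResNet-152 + linear projection (Section 3.2)
        self.cmib = cmib                    # global alignment/denoising; returns a regularization loss
        self.cma = cma                      # local region-span alignment; returns span-level visual features
        self.fusion = fusion                # fuses textual and visual span features
        self.classifier = classifier        # maps fused span features to entity-type logits

    def forward(self, text_inputs, image, spans, labels=None):
        h_text = self.text_encoder(text_inputs)          # token-level representations H^t
        h_image = self.image_encoder(image)              # whole-image + region representations H^v
        cmib_loss = self.cmib(h_text, h_image)           # global-level alignment objective
        span_visual = self.cma(h_text, h_image, spans)   # region-aware span features
        span_repr = self.fusion(h_text, span_visual, spans)
        logits = self.classifier(span_repr)              # one score vector per candidate span
        if labels is None:
            return logits
        return logits, F.cross_entropy(logits, labels) + cmib_loss
```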

3.2. Modal-specific Encoder

Given a multimodal dataset $\mathcal{D}$ with $N$ samples, it is formulated as $\mathcal{D}=\{(X_i, Y_i)\}_{i=1}^{N}$. Each example contains the multimodal post $X_i=(T_i, V_i)$ and the task-defined label $Y_i$, where $T_i$ and $V_i$ are the text and the image, respectively. For each post $X_i$, we first utilize pre-trained models to obtain its unimodal representations $H^t$ and $H^v$, respectively.
Text Encoder. To precisely capture both global and contextual representations, we adopt the pre-trained BERT-base-uncased model (Kenton and Toutanova, 2019) as our textual encoder. Given a text $T$ with $n$ words, a [CLS] token is added at the beginning and a [SEP] token at the end. We denote the text input as $T'=\{t_{cls}, t_1, \ldots, t_n, t_{sep}\}$, where $t_{cls}$ is the [CLS] token and $t_{sep}$ is the [SEP] token. We feed the input $T'$ into BERT to obtain the output $H^t=\{h_{cls}, h_1, \ldots, h_n, h_{sep}\} \in \mathbb{R}^{(n+2) \times d}$, where $h_{cls}$ represents the global text representation, $\{h_1, \ldots, h_n\}$ are the contextual word representations for $T$, and $d$ is the dimension of the textual representations.
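A minimal sketch of this encoding step with the Hugging Face transformers library is shown below (our illustration; note that BERT actually operates on sub-word tokens, so word-level representations would additionally require pooling over word pieces).

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "Adam Levine is performing tonight"
inputs = tokenizer(sentence, return_tensors="pt")   # [CLS] and [SEP] are added automatically
with torch.no_grad():
    out = bert(**inputs)

h = out.last_hidden_state    # shape (1, n + 2, 768)
h_cls = h[:, 0]              # global text representation (the [CLS] token)
h_tokens = h[:, 1:-1]        # contextual representations of the remaining tokens
```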
Image Encoder. To extract meaningful feature representations from images, we leverage a pre-trained 152-layer ResNet (He et al., 2016) as the image encoder, which essentially splits each input image into a grid of visual blocks. Specifically, we first rescale the whole image to a fixed resolution in pixels and then feed it into ResNet to obtain the visual representation $R$. To project the visual representations into the same dimension as the textual representations, we further convert $R$ with a linear transformation: $H^v = R W^v$, where $W^v$ is a weight matrix. Finally, we obtain $H^v=\{v_0, v_1, \ldots, v_k\}$, where $v_0$ is the representation of the whole image and $\{v_1, \ldots, v_k\}$ are the representations of the regional objects.
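One possible realization with torchvision is sketched below; the 224x224 input resolution, the 768-dimensional projection, and mean pooling for the whole-image representation are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

# Keep ResNet-152 up to its last convolutional stage so the spatial grid of visual blocks survives.
resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
backbone = nn.Sequential(*list(resnet.children())[:-2]).eval()   # drop avgpool and fc

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # assumed rescaling resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("post.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    fmap = backbone(image)                    # (1, 2048, 7, 7): a 7x7 grid of block features
regions = fmap.flatten(2).transpose(1, 2)     # (1, 49, 2048)

proj = nn.Linear(2048, 768)                   # linear transformation W^v to the text dimension d
v_regions = proj(regions)                     # (1, 49, 768) regional representations
v_global = v_regions.mean(dim=1)              # (1, 768) whole-image representation v_0 (assumed pooling)
```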

3.3. Cross-Modal Information Bottleneck

One challenge of multimodal alignment is how to establish a unified semantic representation space to bridge the semantic gap between two different modalities. Additionally, we should consider that text representations play a predominant role in the MNER task, as all the entities to be recognized originate from the text. To achieve these objectives, we present a Cross-Modal Information Bottleneck (CMIB) module from an information-theoretic perspective, which aims to bring the visual semantic distribution closer to the textual semantic distribution while filtering noise from the images.
Given $T$ and $V$ that are derived from the same post, they share the same predictive task for a target $Y$. Therefore, in this paper, we consider $T$ and $V$ to be two views of the same object and suppose that $T$ is sufficient for $Y$. Motivated by the multi-view IB (Federici et al., 2020), we can subdivide $I(z_v; V)$ into two components by using the chain rule of mutual information (MI):
$$I(z_v; V) = I(z_v; T) + I(z_v; V \mid T) \qquad (1)$$
where $z_t$ and $z_v$ are the representations of the entire text $T$ and image $V$, respectively. $I(z_v; T)$ denotes the information that is consistent between the two modalities, and $I(z_v; V \mid T)$ denotes the information in $z_v$ which is unique to $V$ but is not predictable by observing $T$, i.e., irrelevant information in the image.
We would like to define an objective function for the representation $z_v$ of $V$ that discards as much information as possible without losing any entity information. For this purpose, we should ensure that the representation $z_v$ is sufficient for $T$ (maximizing $I(z_v; z_t)$), and that the irrelevant information is discarded (minimizing $I(z_v; V \mid T)$). So the loss function of the cross-modal information bottleneck in our model is defined as:
$$\mathcal{L}_{cmib} = -I(z_v; z_t) + \beta \, I(z_v; V \mid T) \qquad (2)$$
where $\beta$ represents the Lagrangian multiplier introduced by the constrained optimization. With the gradients from back-propagation, this semantic regularization can automatically enforce semantic agreement among heterogeneous representations.
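A minimal PyTorch-style sketch of how this regularizer can be realized in practice, anticipating the variational treatment described next: each modality gets an MLP that parameterizes a diagonal Gaussian posterior, samples are drawn with the reparameterization trick, and the superfluous-information term is penalized through a KL divergence between the two posteriors. All layer names and dimensions are assumptions, and the sufficiency term $-I(z_v; z_t)$, which needs a separate mutual-information estimator, is omitted here.

```python
import torch
import torch.nn as nn

class GaussianPosterior(nn.Module):
    """Modal-specific MLP that parameterizes a diagonal Gaussian q(z | x)."""

    def __init__(self, dim_in: int, dim_z: int):
        super().__init__()
        self.mu = nn.Linear(dim_in, dim_z)
        self.logvar = nn.Linear(dim_in, dim_z)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return z, mu, logvar

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, averaged over the batch."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()

text_head = GaussianPosterior(768, 256)    # consumes the pooled text representation (e.g., h_cls)
image_head = GaussianPosterior(768, 256)   # consumes the pooled image representation (e.g., v_0)

def cmib_regularizer(h_text_global, h_image_global, beta: float = 1e-3):
    z_t, mu_t, logvar_t = text_head(h_text_global)
    z_v, mu_v, logvar_v = image_head(h_image_global)
    # Penalize the image-specific (superfluous) information via the divergence between posteriors.
    return beta * kl_diag_gaussians(mu_v, logvar_v, mu_t, logvar_t), (z_t, z_v)
```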
It is challenging to compute the mutual information $I(z_v; z_t)$ and $I(z_v; V \mid T)$ directly. Following Federici et al. (2020), we use variational inference to compute a variational upper bound for $I(z_v; V \mid T)$ as follows:
$$I(z_v; V \mid T) \leq \mathbb{E}_{V,T}\left[D_{KL}\left(p(z_v \mid V) \,\|\, p(z_t \mid T)\right)\right] \qquad (3)$$
Therefore, we replace the corresponding term in (2) with this upper bound, which can be optimized by evaluating the Kullback-Leibler (KL) divergence between the unimodal distributions approximated by two modal-specific Variational AutoEncoders (VAEs). Mathematically, the posterior distribution of each unimodal representation is estimated as follows:
$$p(z_t \mid T) = \mathcal{N}(\mu_t, \sigma_t^2), \qquad p(z_v \mid V) = \mathcal{N}(\mu_v, \sigma_v^2) \qquad (4)$$
where the means and variances of the Gaussian distributions are obtained from the modal-specific multilayer perceptron (MLP) layers. Then, we use the reparameterization trick to sample $z_t$ and $z_v$: