
An Effective Span-based Multimodal Named Entity Recognition with Consistent Cross-Modal Alignment

Yongxiu Xu, Hao Xu, Heyan Huang, Shiyao Cui
Minghao Tang, Longzheng Wang, Hongbo Xu
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China
School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
hhy63@bit.edu.cn, {xuyongxiu, xuhao, tangminghao, wanglongzheng, hbxu}@iie.ac.cn

Abstract

With the increasing availability of multimodal content on social media, consisting primarily of text and images, multimodal named entity recognition (MNER) has gained widespread attention. A fundamental challenge of MNER lies in effectively aligning the different modalities. However, the majority of current approaches rely on a word-based sequence labeling framework and align the image and text at inconsistent semantic levels (whole image-words or regions-words). This misalignment may lead to inferior entity recognition performance. To address this issue, we propose an effective span-based method, named SMNER, which achieves a more consistent multimodal alignment from the perspectives of information theory and cross-modal interaction, respectively. Specifically, we first introduce a cross-modal information bottleneck module for global-level multimodal alignment (whole image-whole text). This module encourages the semantic distribution of the image to move closer to the semantic distribution of the text, which enables the filtering out of visual noise. Next, we introduce a cross-modal attention module for local-level multimodal alignment (regions-spans), which captures the correlations between regions in the image and spans in the text, enabling a more precise alignment of the two modalities. Extensive experiments conducted on two benchmark datasets demonstrate that SMNER outperforms the state-of-the-art baselines.

Keywords: Multimodal named entity recognition, Multimodal alignment, Multimodal fusion

1. Introduction

Named Entity Recognition (NER) involves identifying named entities within a given sentence and categorizing them into pre-defined types (Li et al., 2020). NER is a critical natural language processing task and serves as a key component in information retrieval (Dietz, 2019), question answering (Min et al., 2021), knowledge graphs (Zhao et al., 2022), etc. However, in practical scenarios such as social media platforms, the text is often short, informal, and accompanied by images, which presents a significant challenge for traditional text-based NER. Multimodal named entity recognition (MNER) has become a new direction and has attracted widespread attention owing to its excellent performance in entity recognition for social media posts.
MNER extends traditional text-based NER by incorporating images as additional input (Zhang et al., 2018), which can offer complementary benefits to alleviate ambiguity in natural language. However, MNER poses a fundamental challenge: effectively aligning information across the two modalities, text and image. Existing MNER methods primarily utilize various attention networks (such as self-attention or cross-attention) to address this challenge, and they can be categorized into two strategies, coarse-grained alignment and fine-grained alignment, as shown in Figure 1.
• Corresponding author

Figure 1: An example of coarse-grained alignment and fine-grained alignment. Both of these strategies align text with the image at inconsistent semantic levels, leading to misalignment noise.
In the early stages, some efforts (Zhang et al., 2018; Moon et al., 2018) directly considered the entire image as a global-level visual cue, which guides the words in the text to learn a vision-aware representation of the whole image, as shown in Figure 1(a). However, this coarse-grained alignment inevitably introduces image noise (e.g., background) and simultaneously results in the loss of some representative information. Subsequently, a growing number of studies (Zheng et al., 2020; Yu et al., 2020; Zhang et al., 2021; Xu et al., 2022; Jia et al., 2023) focused on fine-grained semantic alignment between text and images. These methods typically capture the interactions between words in the text and regions in the image in a unified semantic space, as shown in Figure 1(b). Actually, the regions of the objects in the image should align with the corresponding entity spans in the text rather than with individual words, as individual words may not adequately capture the overall semantics of an entity span. As shown in Figure 1, for the semantic representations of the two modalities, the regions of the person object Adam Levine in the image should have a higher similarity to the span "Adam Levine" in the text than to the word "Adam" or the word "Levine". Given that neither of the aforementioned alignment strategies achieves consistent semantic alignment between text and images, resulting in the introduction of noise and subsequently inferior performance, we collectively refer to these issues as "misalignment noise".
Taking the considerations above, we propose an effective Span-based Multimodal Named Entity Recognition method, named SMNER, which regards MNER as a span-based classification task rather than a word-based sequence labeling task. SMNER is specifically designed to learn informative multimodal span representations by effectively aligning and fusing the information contained in the text and image. SMNER consists of two key modules: a cross-modal information bottleneck (CMIB) module for global multimodal alignment and denoising, and a cross-modal attention (CMA) module for local multimodal alignment and interaction.
More specifically, motivated by the multi-view information bottleneck principle (Federici et al., 2020), we consider the text and image as two different views of the same post. First, we formulate cross-modal global semantic alignment from an information-theoretic perspective by maximizing the mutual information and minimizing the distributional divergence between the two modalities. This module brings the visual semantic distribution closer to the textual semantic distribution and filters out irrelevant information from visual representations. Second, for fine-grained multimodal alignment, we feed the contextual unimodal representations into a cross-modal attention module that captures the correlations between spans in the text and regions in the image. This module enables a more precise alignment between the two modalities and acquires informative cross-modal features. Finally, the obtained cross-modal features are aggregated effectively to enhance the representation of spans, thereby improving the performance of entity classification.
In summary, the main contributions of this paper are as follows:
• We propose an effective span-based classification method for MNER, aiming to reduce the impact of misalignment and achieve more consistent multimodal alignment at two levels (image-text and regions-spans, respectively). To the best of our knowledge, we are the first to explore a span-based MNER model for the issue of misalignment.
• We introduce two modules (CMIB and CMA) from the perspectives of the information-theoretic principle and cross-modal interaction, respectively. These modules work in synergy to generate more expressive cross-modal representations, enhancing the final entity classification performance.
• We conduct extensive experiments on two widely used MNER datasets to prove the effectiveness of our method. Experimental results show that SMNER outperforms the state-of-the-art models on both datasets.
2. Related Work

In this section, we review the work related to our method from two aspects: multimodal named entity recognition and the information bottleneck.

2.1. Multimodal Named Entity Recognition

As multimodal data become increasingly popular on social media platforms, MNER has attracted broad attention in named entity recognition, starting with Moon et al. (2018), Lu et al. (2018), and Zhang et al. (2018).
From the perspective of multimodal alignment and fusion, some studies (Moon et al., 2018; Zhang et al., 2018) tried to encode the entire image and implicitly interact the information of the two modalities using an attention mechanism. For example, Moon et al. (2018) proposed an LSTM-CNN architecture that combines text with image information via a general modality attention, and Zhang et al. (2018) proposed an adaptive co-attention network to dynamically control the fusion of the two modalities. Different from the above works using the whole image, subsequent works (Lu et al., 2018; Yu et al., 2020; Wu et al., 2020; Zheng et al., 2020; Zhang et al., 2021) primarily focused on combining fine-grained regional visual information with word information in the text to boost MNER performance. Lu et al. (2018) extracted the image regions most related to the text and utilized an attention-based model to implicitly interact the information of the two modalities. Yu et al. (2020) introduced a multimodal interaction module designed to capture both image-aware word representations and word-aware visual representations. Zhang et al. (2021) exploited a unified multimodal graph to capture the interactions between words in the text and regions in the image.
Figure 2: Model architecture overview of SMNER. The cross-modal information bottleneck module performs global multimodal alignment and denoising, and the cross-modal attention module performs local multimodal alignment and interaction.
Although the studies above have achieved promising results, most of these methods ignore the problem of visual noise caused by irrelevant images. More recently, Xu et al. (2022), Chen et al. (2022), and Zhang et al. (2023) alleviated this problem by text-image matching, a hierarchical visual prefix, and contrastive learning, respectively.
Different from the aforementioned methods, we focus on the noise caused by misalignment. Additionally, the above studies adopt the word-based sequence labeling framework, whereas we utilize a span-based classification framework, ensuring alignment and interaction between text and images at consistent semantic levels. It is worth noting that while Zhou et al. (2022a) also employed a span-based framework for MNER, it is more concerned with multimodal representations, overlooking multimodal alignment and interaction.

2.2. Information Bottleneck

The Information Bottleneck (IB) principle (Tishby and Zaslavsky, 2015) provides a theoretical framework for analyzing deep neural networks, formulating the goal of representation learning as an information trade-off between predictive power and representation compression. Later, the variational information bottleneck (VIB) (Alemi et al., 2016) bridged the gap between IB and deep learning with variational inference. More recently, Federici et al. (2020) provided a variant of IB that extends it to the multi-view unsupervised setting, enabling the identification of superfluous information that is not shared by both views. Nowadays, owing to its capacity for learning minimal informative representations, IB has been extensively applied in computer vision (Peng et al., 2018), sentiment analysis (Mai et al., 2022), and natural language processing (Zhou et al., 2022b). Motivated by this, instead of directly applying the IB principle to the MNER task, we adopt the multi-view IB principle to enhance the distribution consistency between the two modalities and filter out irrelevant information from the images.
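For reference, this trade-off is commonly written as the IB Lagrangian; in standard notation (ours, not reproduced from this paper), a representation $Z$ of the input $X$ is learned for a target $Y$ by minimizing:

```latex
% Standard IB objective: compress X into Z while keeping Z predictive of Y;
% beta > 0 controls the trade-off between compression and prediction.
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```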

3. Method

3.1. Overview

Task Definition. Given an input pair $(T, V)$ containing a text sentence $T$ and its associated image $V$, the goal of MNER is to detect entity spans from $T$ and classify them into the corresponding entity types. Unlike existing MNER models that regard MNER as a sequence labeling task, we regard MNER as a span classification task. Let $T=\{t_1, t_2, \ldots, t_n\}$ denote the input sentence with $n$ words; the label for the text $T$ is formulated as a set $Y=\{(s_k, c_k)\}_{k=1}^{m}$, where $m$ is the number of named entities, $s_k=(b_k, e_k)$ is the span of an entity corresponding to the phrase $\{t_{b_k}, \ldots, t_{e_k}\}$, and $c_k$ represents the corresponding entity type, which belongs to a predefined entity type set.
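To make the span-classification formulation concrete, the following minimal Python sketch enumerates candidate spans and pairs them with gold types; the maximum span length and the "O" label for non-entity spans are illustrative assumptions rather than details taken from the paper.

```python
from typing import Dict, List, Tuple

def enumerate_spans(n_words: int, max_span_len: int = 8) -> List[Tuple[int, int]]:
    """Enumerate every candidate span (start, end), with end inclusive."""
    return [(b, e)
            for b in range(n_words)
            for e in range(b, min(b + max_span_len, n_words))]

def label_spans(spans: List[Tuple[int, int]],
                gold: Dict[Tuple[int, int], str]) -> List[str]:
    """Assign each candidate span its gold entity type, or 'O' if it is not an entity."""
    return [gold.get(span, "O") for span in spans]

# Toy post from Figure 1: the span (0, 1) covers "Adam Levine" and is a PER entity.
words = ["Adam", "Levine", "is", "performing", "tonight"]
gold = {(0, 1): "PER"}
candidates = enumerate_spans(len(words))
labels = label_spans(candidates, gold)
```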
Model Architecture. The overall architecture of SMNER is illustrated in Figure 2. Given image-text pairs, we first obtain the unimodal representations with modal-specific encoders. Then, the representations of both text and image primarily flow into two modules: 1) the cross-modal information bottleneck module for global multimodal alignment and denoising, and 2) the cross-modal attention module for local multimodal alignment and interaction. Finally, we fuse the representations of the two modalities to obtain the multimodal span representation, and feed it into an entity classification layer to get the final predictions. These modules are trained simultaneously in an end-to-end framework.
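As a rough illustration of this data flow, the PyTorch-style skeleton below wires the components together; every class name, argument, and the way the losses are combined are our assumptions for illustration, not the paper's released implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SMNERSketch(nn.Module):
    """Illustrative forward pass only: encoders -> CMIB + CMA -> fusion -> span classifier."""

    def __init__(self, text_encoder, image_encoder, cmib, cma, fusion, classifier):
        super().__init__()
        self.text_encoder = text_encoder    # e.g., BERT-base (Section 3.2)
        self.image_encoder = image_encoder  # e.g., ResNet-152 + linear projection (Section 3.2)
        self.cmib = cmib                    # global alignment/denoising; returns a regularization loss
        self.cma = cma                      # local region-span alignment; returns span-level visual features
        self.fusion = fusion                # fuses textual and visual span features
        self.classifier = classifier        # maps fused span features to entity-type logits

    def forward(self, text_inputs, image, spans, labels=None):
        h_text = self.text_encoder(text_inputs)          # token-level representations H^t
        h_image = self.image_encoder(image)              # whole-image + region representations H^v
        cmib_loss = self.cmib(h_text, h_image)           # global-level alignment objective
        span_visual = self.cma(h_text, h_image, spans)   # region-aware span features
        span_repr = self.fusion(h_text, span_visual, spans)
        logits = self.classifier(span_repr)              # one score vector per candidate span
        if labels is None:
            return logits
        return logits, F.cross_entropy(logits, labels) + cmib_loss
```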

3.2. Modal-specific Encoder

Given a multimodal dataset $\mathcal{D}$ with $N$ samples, it is formulated as $\mathcal{D}=\{(X_i, Y_i)\}_{i=1}^{N}$. Each example contains the multimodal post $X_i=(T_i, V_i)$ and the task-defined label $Y_i$, where $T_i$ and $V_i$ are the text and the image, respectively. For each post $X_i$, we first utilize pre-trained models to obtain its unimodal representations $H^t$ and $H^v$, respectively.
Text Encoder. To precisely capture both global and contextual representations, we adopt the pre-trained BERT-base-uncased model (Kenton and Toutanova, 2019) as our textual encoder. Given a text $T$ with $n$ words, a [CLS] token is added at the beginning and a [SEP] token at the end. We denote the text input as $T'=\{t_{cls}, t_1, \ldots, t_n, t_{sep}\}$, where $t_{cls}$ is the [CLS] token and $t_{sep}$ is the [SEP] token. We feed the input $T'$ into BERT to obtain the output $H^t=\{h_{cls}, h_1, \ldots, h_n, h_{sep}\} \in \mathbb{R}^{(n+2) \times d}$, where $h_{cls}$ represents the global text representation, $\{h_1, \ldots, h_n\}$ are the contextual word representations for $T$, and $d$ is the dimension of the textual representations.
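A minimal sketch of this encoding step with the Hugging Face transformers library is shown below (our illustration; note that BERT actually operates on sub-word tokens, so word-level representations would additionally require pooling over word pieces).

```python
import torch
from transformers import BertModel, BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

sentence = "Adam Levine is performing tonight"
inputs = tokenizer(sentence, return_tensors="pt")   # [CLS] and [SEP] are added automatically
with torch.no_grad():
    out = bert(**inputs)

h = out.last_hidden_state    # shape (1, n + 2, 768)
h_cls = h[:, 0]              # global text representation (the [CLS] token)
h_tokens = h[:, 1:-1]        # contextual representations of the remaining tokens
```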
Image Encoder. To extract meaningful feature representations from images, we leverage a pre-trained 152-layer ResNet (He et al., 2016) as the image encoder, which essentially splits each input image into a grid of visual blocks. Specifically, we first rescale the whole image to a fixed resolution in pixels and then feed it into ResNet to obtain the visual representation $R$. To project the visual representations into the same dimension as the textual representations, we further convert $R$ with a linear transformation: $H^v = R W^v$, where $W^v$ is a weight matrix. Finally, we obtain $H^v=\{v_0, v_1, \ldots, v_k\}$, where $v_0$ is the representation of the whole image and $\{v_1, \ldots, v_k\}$ are the representations of the regional objects.
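One possible realization with torchvision is sketched below; the 224x224 input resolution, the 768-dimensional projection, and mean pooling for the whole-image representation are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from PIL import Image
from torchvision import models, transforms

# Keep ResNet-152 up to its last convolutional stage so the spatial grid of visual blocks survives.
resnet = models.resnet152(weights=models.ResNet152_Weights.DEFAULT)
backbone = nn.Sequential(*list(resnet.children())[:-2]).eval()   # drop avgpool and fc

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),   # assumed rescaling resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = preprocess(Image.open("post.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    fmap = backbone(image)                    # (1, 2048, 7, 7): a 7x7 grid of block features
regions = fmap.flatten(2).transpose(1, 2)     # (1, 49, 2048)

proj = nn.Linear(2048, 768)                   # linear transformation W^v to the text dimension d
v_regions = proj(regions)                     # (1, 49, 768) regional representations
v_global = v_regions.mean(dim=1)              # (1, 768) whole-image representation v_0 (assumed pooling)
```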

3.3. Cross-Modal Information Bottleneck

One challenge of multimodal alignment is how to establish a unified semantic representation space to bridge the semantic gap between two different modalities. Additionally, we should consider that text representations play a predominant role in the MNER task, as all the entities to be recognized originate from the text. To achieve these objectives, we present a Cross-Modal Information Bottleneck (CMIB) module from an information-theoretic perspective, which aims to bring the visual semantic distribution closer to the textual semantic distribution while filtering noise from the images.
Given $T$ and $V$ that are derived from the same post, they share the same predictive task for a target $Y$. Therefore, in this paper, we consider $T$ and $V$ to be two views of the same object and suppose that $T$ is sufficient for $Y$. Motivated by the multi-view IB (Federici et al., 2020), we can subdivide $I(z_v; V)$ into two components by using the chain rule of mutual information (MI):
$$I(z_v; V) = I(z_v; T) + I(z_v; V \mid T) \qquad (1)$$
where $z_t$ and $z_v$ are the representations of the entire text $T$ and image $V$, respectively. $I(z_v; T)$ denotes the information that is consistent between the two modalities, and $I(z_v; V \mid T)$ denotes the information in $z_v$ which is unique to $V$ but is not predictable by observing $T$, i.e., irrelevant information in the image.
We would like to define an objective function for the representation $z_v$ of $V$ that discards as much information as possible without losing any entity information. For this purpose, we should ensure that the representation $z_v$ is sufficient for $T$ (maximizing $I(z_v; z_t)$), and that the irrelevant information is discarded (minimizing $I(z_v; V \mid T)$). So the loss function of the cross-modal information bottleneck in our model is defined as:
$$\mathcal{L}_{cmib} = -I(z_v; z_t) + \beta \, I(z_v; V \mid T) \qquad (2)$$
where $\beta$ represents the Lagrangian multiplier introduced by the constrained optimization. With the gradients from back-propagation, this semantic regularization can automatically enforce semantic agreement among heterogeneous representations.
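A minimal PyTorch-style sketch of how this regularizer can be realized in practice, anticipating the variational treatment described next: each modality gets an MLP that parameterizes a diagonal Gaussian posterior, samples are drawn with the reparameterization trick, and the superfluous-information term is penalized through a KL divergence between the two posteriors. All layer names and dimensions are assumptions, and the sufficiency term $-I(z_v; z_t)$, which needs a separate mutual-information estimator, is omitted here.

```python
import torch
import torch.nn as nn

class GaussianPosterior(nn.Module):
    """Modal-specific MLP that parameterizes a diagonal Gaussian q(z | x)."""

    def __init__(self, dim_in: int, dim_z: int):
        super().__init__()
        self.mu = nn.Linear(dim_in, dim_z)
        self.logvar = nn.Linear(dim_in, dim_z)

    def forward(self, x):
        mu, logvar = self.mu(x), self.logvar(x)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)   # reparameterization trick
        return z, mu, logvar

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, averaged over the batch."""
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()

text_head = GaussianPosterior(768, 256)    # consumes the pooled text representation (e.g., h_cls)
image_head = GaussianPosterior(768, 256)   # consumes the pooled image representation (e.g., v_0)

def cmib_regularizer(h_text_global, h_image_global, beta: float = 1e-3):
    z_t, mu_t, logvar_t = text_head(h_text_global)
    z_v, mu_v, logvar_v = image_head(h_image_global)
    # Penalize the image-specific (superfluous) information via the divergence between posteriors.
    return beta * kl_diag_gaussians(mu_v, logvar_v, mu_t, logvar_t), (z_t, z_v)
```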
It is challenging to compute the mutual information $I(z_v; z_t)$ and $I(z_v; V \mid T)$ directly. Following Federici et al. (2020), we use variational inference to compute a variational upper bound for $I(z_v; V \mid T)$ as follows:
$$I(z_v; V \mid T) \leq \mathbb{E}_{V,T}\left[D_{KL}\left(p(z_v \mid V) \,\|\, p(z_t \mid T)\right)\right] \qquad (3)$$
Therefore, we replace the corresponding term in (2) with this upper bound, which can be optimized by evaluating the Kullback-Leibler (KL) divergence between the unimodal distributions approximated by two modal-specific Variational AutoEncoders (VAEs). Mathematically, the posterior distribution of each unimodal representation is estimated as follows:
$$p(z_t \mid T) = \mathcal{N}(\mu_t, \sigma_t^2), \qquad p(z_v \mid V) = \mathcal{N}(\mu_v, \sigma_v^2) \qquad (4)$$
where the means and variances of the Gaussian distributions are obtained from the modal-specific multilayer perceptron (MLP) layers. Then, we use the reparameterization trick to sample $z_t$ and $z_v$: