
Modeling Multi-Task Joint Training of Aggregate Networks for Multi-Modal Sarcasm Detection

Lisong Ou
Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, Guangxi, China
Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi, China
School of Mathematics and Statistics, Guilin University of Technology, Guilin, Guangxi, China
ouls@stu.gxnu.edu.cn

Zhixin Li
Key Lab of Education Blockchain and Intelligent Technology, Ministry of Education, Guangxi Normal University, Guilin, Guangxi, China
Guangxi Key Lab of Multi-source Information Mining and Security, Guangxi Normal University, Guilin, Guangxi, China
lizx@gxnu.edu.cn

Abstract

With the continuous emergence of various types of social media, which people often use to express their emotions in daily life, the multi-modal sarcasm detection (MSD) task has attracted more and more attention. However, due to the unique nature of sarcasm itself, there are still two main challenges on the way to achieving robust MSD: 1) existing mainstream methods often fail to take into account the problem of weak multi-modal correlation, thus ignoring the important sarcasm information carried by each uni-modality itself; and 2) inefficiency in modeling cross-modal interactions in unaligned multi-modal data. Therefore, this paper proposes a multi-task jointly trained aggregation network (MTAN), which adopts networks adapted to different modalities according to the different modality processing tasks. Specifically, we design a multi-task CLIP framework that includes a uni-modal text task, a uni-modal image task, and a cross-modal interaction task, which can utilize sentiment cues from multiple tasks for multi-modal sarcasm detection. In addition, we design a global-local cross-modal interaction learning method that utilizes discourse-level representations from each modality as the global multi-modal context to interact with local uni-modal features. This not only avoids the quadratic scaling cost of previous local-local cross-modal interaction methods but also allows the global multi-modal context and local uni-modal features to reinforce each other and improve progressively through multi-layer stacking. Extensive experimental results and in-depth analysis show that our model achieves state-of-the-art performance in multi-modal sarcasm detection.

CCS CONCEPTS

  • Computing methodologies → Natural language generation; Neural networks; • Theory of computation → Fixed parameter tractability.

KEYWORDS

multi-modal sarcasm detection, aggregation network, multi-task CLIP framework, cross-modal interaction

ACM Reference Format:

Lisong Ou and Zhixin Li. 2024. Modeling Multi-Task Joint Training of Aggregate Networks for Multi-Modal Sarcasm Detection. In Proceedings of the 2024 International Conference on Multimedia Retrieval (ICMR '24), June 10-14, 2024, Phuket, Thailand. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3652583.3658015

1 INTRODUCTION

Sarcasm is a form of communication that conveys implied emotions and expresses a person's emotional attitude in a specific situation through the use of antithesis or superficially contradictory expressions. In increasingly developed social-software communication, sarcasm is frequently used, usually by combining multi-modal information such as text and pictures to express the sarcastic meaning [20, 39]. As in the sarcastic example shown in Fig. 1(a), a text-only approach may incorrectly recognize a burger described as "huge and stuffy" as a positive expression of emotion. However, the image expresses sarcasm with a negative emotion because it shows a "small and flat" burger, and the inconsistency revealed by fusing text and image happens to be a key cue for capturing sarcasm. Given this, how to fully mine and utilize the data in social networks, especially social comment data with emotional tendencies, for sarcasm detection has become a current research hotspot [41].
With the rapid development of computer technology, multi-modal sarcasm detection methods based on deep learning have made significant progress [17, 28, 35, 40]. Schifanella et al. [26] first defined the multi-modal sarcasm detection task in 2016, which attracted extensive attention from researchers and drove the emergence of various deep learning models. Cai et al. [2] created a Twitter-based multi-modal sarcasm detection dataset and proposed a multi-modal hierarchical fusion model to solve this task.
Figure 1: Two multi-modal sarcastic examples: (a) prior work; (b) our work. Comparison between prior work and our proposed MTAN.
Some of these research efforts have focused on methods based on attention mechanisms that capture the interactions between the various modalities to implicitly learn inconsistencies between images and text [20, 43-45]. Another line of research is based on graph-based approaches that identify essential cues for sarcasm detection; these can better capture relationships across different modalities and thus dominate the performance reported in the literature [13-15].
Although current multi-modal sarcasm detection methods have made significant progress, effectively fusing heterogeneous sequence features and exploiting the important sarcasm cues of each uni-modality in practical applications is still an essential challenge in this field [24, 38]. The first issue is the inefficiency of modeling cross-modal interactions in unaligned multi-modal data and the resulting excessive space and time complexity. In addition, we found that when using only the textual modality model RoBERTa for multi-modal sarcasm detection, its performance significantly outperforms the state-of-the-art multi-modal model HKE [15]. This suggests that the performance of current models may rely heavily on sarcastic cues in uni-modal data. For example, in the sarcastic example in Fig. 1(b), there is a weak correlation between the text and the image, which may negatively impact the multi-modal fusion process. In summary, it is clear from Fig. 1 that prior work has often relied on text-image fusion for multi-modal sarcasm detection. In our study, we focus not only on the interactions between modalities but also on the sarcastic cues of the uni-modalities themselves.
To this end, in this work, we propose a multi-task co-training aggregation network approach (MTAN) for multi-modal sarcasm detection, which improves detection performance by setting up three tasks, namely text, image, and text-image interaction, to mine the sentiment features embedded in the multi-modal information of the different modalities themselves. Specifically, we use the CLIP framework for the text and image processing tasks, which can utilize multi-granularity cues for multi-modal sarcasm detection. For the text-image interaction task, we utilize discourse-level representations from each modality as a global multi-modal context to interact with local uni-modal features. On the one hand, the global multi-modal context can effectively augment the local uni-modal features through cross-modal attention. In turn, the global multi-modal context can update itself by extracting useful information from the local uni-modal features through symmetric cross-modal attention. Through multi-layer stacking, the global multi-modal context and local uni-modal features can enhance each other and gradually improve. In order to verify the effectiveness of the proposed method, we conduct comprehensive experiments on the widely used Twitter benchmark dataset. The results show that our proposed method exhibits excellent performance. To summarize, the main contributions of this paper are as follows:
  • We design a novel multi-task CLIP framework to capture multi-granularity clues of each modality and of inter-modal data from three tasks: text, image, and text-image interaction.
  • We propose a global-local cross-modal interaction learning approach for effective and efficient fusion of unaligned multi-modal data. It avoids the quadratic scaling cost of previous dense cross-modal interaction methods and achieves better performance than those methods.
  • The results of a series of experiments on a publicly available multi-modal sarcasm detection benchmark dataset show that our proposed method achieves state-of-the-art performance.

2 RELATED WORK

2.1 Multi-modal Sarcasm Detection

Sarcasm detection belongs to one of the research directions of sentiment analysis and was first treated as a text classification task [37]. Traditional machine learning algorithms, including naive Bayes, decision tree, and SVM models, were initially employed for single-modal text sarcasm detection, delivering strong experimental results and establishing the foundation for future research in the field [3, 23]. Then, some research efforts focused on using deep learning models such as GRU [5], RNN [12], and LSTM [36] to capture the syntactic and semantic information of tweet texts for sarcasm detection [7, 42]. Although these efforts have made promising progress, with the rapid growth of modern social media, multi-modal sarcasm tasks have become a new research hotspot in response to the times. A new concept of multi-modal sarcasm detection was first defined by Schifanella et al. [26]. Subsequently, a series of works focused on capturing sarcastic emotions using attention mechanisms. Pan [20] and Zhang et al. [43] designed inter-channel attention and co-attention mechanisms to capture inter-modal inconsistencies and simulate internal contradictions within the text, respectively. Similarly, Zhao et al. [45] proposed coupled attention networks, which can effectively integrate text and image information into a unified framework. Zhang et al. [44] proposed a cross-modal target attention mechanism to automatically learn the alignment between text and image/speech. In recent years, good progress has been made in graph structure-based approaches for sarcasm detection. Liang et al. [13] deployed a heterogeneous graph structure to learn sarcasm features from intra- and inter-modal perspectives. Yue et al. [38] improved the graph structure to incorporate sample-level and word-level cross-modal semantic similarity detection for determining image-text relevance. Xu et al. [34] used a hierarchical graph attention network to model each subgraph of a fine-grained text-image graph and employed a global feature alignment module to fuse the entire image and text. However, there remains the problem of inefficiency in modeling cross-modal interactions in unaligned multi-modal data. To address this issue, we propose the global-local interaction learning method, which avoids the quadratic scaling cost of previous local-local cross-modal interaction methods and leads to better performance.

2.2 Multi-Task Sarcasm Detection

Multi-task sarcasm detection aims to comprehensively understand the sarcastic meanings in language and contexts [4, 19, 27]. Although there is relatively little research on multi-task sarcasm detection at present, some work has begun to explore this area. Schifanella et al. [26] combined manually designed features of text and images with deep learning-based features for prediction in two tasks. Wu et al. [31] constructed a multi-task model of densely connected LSTMs using embeddings, sentiment features, and syntactic features. Additionally, Majumder et al. [18] used deep neural networks to simulate the correlation between sentiment classification and sarcasm detection to improve performance in a multi-task learning environment. Savini et al. [25] treated emotion classification as an auxiliary task to assist the main task of sarcasm detection. Recently, Zhang et al. [44] proposed a multi-modal interaction learning task, including a dual-gate network and three independent fully connected layers, to capture the commonalities and uniqueness of sarcastic information. However, these tasks overlook the significant sarcasm clues in the single modality itself. To address this issue, we design emotional clues for multiple tasks (i.e., text, image, and text-image interaction tasks) for multi-modal sarcasm detection.

3 METHODOLOGY

Our proposed multi-task jointly trained aggregation network (MTAN) consists of three processing tasks that utilize multi-granularity cues for better multi-modal sarcasm detection. As shown in Fig. 2, our framework contains three processing tasks: a text processing task, an image processing task, and a text-image fusion task, which explicitly utilize different cues from the different tasks to better capture rich sarcastic cues for multi-modal sarcasm detection. For the text- and image-processing tasks, we use the text and image encoders of the CLIP framework, which naturally inherit multi-modal knowledge from pre-trained CLIP models. For the text-image interaction task, we utilize discourse-level representations from each modality as a global multi-modal context that interacts with local uni-modal features to achieve an efficient interactive fusion of unaligned multi-modal sequences.
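To make the joint training of the three tasks concrete, the sketch below shows one way the per-task predictions could be combined under a shared objective. The module names and the equal loss weights are illustrative assumptions for this sketch, not the paper's exact formulation.

```python
import torch.nn as nn
import torch.nn.functional as F

class MTANJointLoss(nn.Module):
    """Hypothetical joint objective over the three MTAN tasks (sketch)."""
    def __init__(self, w_text=1.0, w_image=1.0, w_fusion=1.0):
        super().__init__()
        self.w_text, self.w_image, self.w_fusion = w_text, w_image, w_fusion

    def forward(self, logits_text, logits_image, logits_fusion, labels):
        # Each head solves the same binary sarcasm classification problem,
        # so the three cross-entropy terms share one set of labels.
        loss_t = F.cross_entropy(logits_text, labels)
        loss_v = F.cross_entropy(logits_image, labels)
        loss_f = F.cross_entropy(logits_fusion, labels)
        return self.w_text * loss_t + self.w_image * loss_v + self.w_fusion * loss_f
```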

3.1 CLIP Model

CLIP [22] is a vision-and-language pre-training model based on a contrastive learning and pre-training approach, which enables the model to be trained on large-scale unlabeled data and to learn feature representations with good generalization ability, achieving excellent performance in several visual and linguistic tasks [9, 16, 32]. Briefly, CLIP internally contains two models, a Text Encoder and an Image Encoder, as shown in Fig. 3. For each image-text pair, the image and text encoders map the text-image pair into the same feature space. For a given batch of N image-text pairs, the training objective of CLIP is to maximize the cosine similarity of the paired image and text feature encodings while minimizing the cosine similarity of the unpaired image and text feature encodings. In this article, we design a multi-task joint training aggregation network for multi-modal sarcasm detection based on the CLIP architecture.
Figure 3: The CLIP model, a vision-and-language pre-training model based on contrastive learning and pre-training methods.
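For reference, the following is a minimal sketch of the symmetric contrastive objective described above, computed over a batch of N image-text pairs. The function name and the temperature value are assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive loss over N paired image/text embeddings.

    image_feats, text_feats: (N, d) tensors from the image and text encoders.
    Paired rows (i, i) should have high cosine similarity; unpaired rows low.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature        # (N, N) scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)                # match each image to its text
    loss_t2i = F.cross_entropy(logits.t(), targets)            # match each text to its image
    return (loss_i2t + loss_t2i) / 2
```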

3.2 Text Processing Task

In practical sarcasm detection applications, it cannot be ignored that there are weak correlations between some images and texts. At the same time, some research results have fully demonstrated the feasibility of using textual information for sarcasm detection. Therefore, we introduce a uni-modal text task to determine sarcasm from the perspective of text using the architecture of the powerful CLIP. In this section, the Text Encoder is used to extract the features of the text with its internal text Transformer model, which is commonly used in NLP. The Transformer captures the semantic information in the text sequence using a multi-head self-attention mechanism and a feed-forward neural network layer, so the text representation contains rich semantic information accumulated through multiple levels of the abstraction hierarchy. Finally, the representation of the text $T$ is defined as:

$$H_t = E_{\text{text}}(T), \quad H_t = \{h_1^t, h_2^t, \ldots, h_n^t\}$$

where $T = \{w_1, w_2, \ldots, w_n\}$ denotes the words or phrases that make up the original text, $n$ is the sequence length of the original text, and $E_{\text{text}}$ denotes encoding with the text encoder of the CLIP model. $H_t$ is the output contextual representation of the text, and $h_i^t$ is the contextual representation of the $i$-th word $w_i$ in the text. Finally, the text view decoder directly uses $H_t$ for sarcasm detection:

$$p_t = \mathrm{softmax}(W_t H_t + b_t)$$

where $p_t$ is the text feature output distribution, and $W_t$ and $b_t$ are trainable parameters.
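A minimal sketch of this text branch is shown below, assuming the Hugging Face CLIP text encoder ("openai/clip-vit-base-patch32") as a stand-in for the paper's text encoder; the linear classifier corresponds to the trainable parameters $W_t$ and $b_t$, and the pooled output is used as the discourse-level text representation.

```python
import torch
import torch.nn as nn
from transformers import CLIPTokenizer, CLIPTextModel

class TextSarcasmHead(nn.Module):
    """Uni-modal text task: CLIP text encoder followed by a linear sarcasm classifier (sketch)."""
    def __init__(self, clip_name="openai/clip-vit-base-patch32", num_classes=2):
        super().__init__()
        self.tokenizer = CLIPTokenizer.from_pretrained(clip_name)
        self.encoder = CLIPTextModel.from_pretrained(clip_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, texts):
        batch = self.tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
        out = self.encoder(**batch)
        h_t = out.pooler_output                               # discourse-level text representation
        return torch.softmax(self.classifier(h_t), dim=-1)    # p_t = softmax(W_t h_t + b_t)
```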

3.3 Image Processing Task

Image information is an important feature that can give the text more emotional expression and visual elements. In addition, the weak correlation between images and text motivated us to design an image-processing task module to determine sarcasm from an image perspective using the architecture of the powerful CLIP.
Figure 2: The architecture of the proposed MTAN, which involves three processing tasks: a text-processing task, an image-processing task, and a text-image fusion task. The concatenation symbol in the figure represents matrix concatenation. The architecture leverages multi-granularity clues from multiple tasks (i.e., text, image, and text-image interaction tasks) for multi-modal sarcasm detection.
In this section, the Image Encoder is used to extract the features of the image; internally it relies on a vision Transformer. First, the image is cut into a series of patch blocks and then encoded through the Transformer model. Finally, the local and global features of the image are captured by stacking convolution or Transformer blocks. Specifically, for image $V$, the output of the CLIP model is a semantic representation of the image:

$$H_v = E_{\text{image}}(V)$$

where $V$ is the incoming image information, $E_{\text{image}}$ denotes encoding with the image encoder of the CLIP model, and $H_v$ is the information feature obtained after encoding. Finally, the image view decoder directly uses $H_v$ for sarcasm detection:
$$p_v = \mathrm{softmax}(W_v H_v + b_v)$$

where $p_v$ is the image feature output distribution, and $W_v$ and $b_v$ are trainable parameters.
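Analogously, the image branch can be sketched as follows, again assuming the Hugging Face CLIP vision encoder as a stand-in; the linear classifier plays the role of $W_v$ and $b_v$, and the inputs are assumed to be PIL images.

```python
import torch
import torch.nn as nn
from transformers import CLIPImageProcessor, CLIPVisionModel

class ImageSarcasmHead(nn.Module):
    """Uni-modal image task: CLIP vision encoder (ViT over patches) plus a linear classifier (sketch)."""
    def __init__(self, clip_name="openai/clip-vit-base-patch32", num_classes=2):
        super().__init__()
        self.processor = CLIPImageProcessor.from_pretrained(clip_name)
        self.encoder = CLIPVisionModel.from_pretrained(clip_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_classes)

    def forward(self, pil_images):
        pixels = self.processor(images=pil_images, return_tensors="pt").pixel_values
        out = self.encoder(pixel_values=pixels)
        h_v = out.pooler_output                               # global image representation
        return torch.softmax(self.classifier(h_v), dim=-1)    # p_v = softmax(W_v h_v + b_v)
```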

3.4 Text-image Fusion Task

In recent years, some attention-based methods have mainly used cross-attention for multi-modal fusion, strengthening the target modality by learning directed pairwise attention between the target and source modalities [15, 20]. A good fusion scheme extracts and integrates adequate information from multi-modal sequences while maintaining the mutual independence of the modalities. Based on this, we design an efficient cross-modal attention promotion mechanism (CAP), which mainly exploits symmetric cross-modal attention to explore the intrinsic correlation between two input feature sequences and realizes the exchange of beneficial information between the two sequences so that they can promote each other. Self-attention is used to model the temporal dependencies in each feature sequence to allow further information integration.
3.4.1 Attention Architecture. Attention mechanisms have been studied since as early as the 1990s; later, following the Transformer structure proposed by Vaswani et al. [30] in 2017, attention mechanisms have been widely used in network designs for NLP- and CV-related problems. In this paper, we mainly combine the self-attention mechanism and the cross-attention mechanism for use in the sarcasm detection task. The self-attention mechanism is mainly used to compute the attention weights between elements in a sequence to capture the dependencies between components. The multi-head self-attention mechanism utilizes multiple heads that apply the same computation with different parameters to express features from multiple subspaces, and can capture richer feature information.

As shown in Fig. 4(a), we take the text modality as an example and define the input of the multi-head self-attention mechanism as $X_t \in \mathbb{R}^{n \times d_t}$, where $d_t$ denotes the encoding dimension of the text modality. The whole process can be expressed as:

$$Q_i = X_t W_i^Q, \quad K_i = X_t W_i^K, \quad V_i = X_t W_i^V$$
$$\text{head}_i = \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$
$$H_{\mathrm{MSA}} = \mathrm{concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)\, W^O$$

where $Q_i$, $K_i$, and $V_i$ are the results of linear transformations of the input vectors for the $i$-th head; $W_i^Q$, $W_i^K$, and $W_i^V$ are the weight parameters of the Query, Key, and Value mappings, respectively; concat denotes the splicing operation; $W^O$ is the weight matrix of the final linear transformation; and $H_{\mathrm{MSA}}$ is the result of the multi-head self-attention computation.
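As an illustration, the multi-head self-attention step can be reproduced with PyTorch's built-in attention module; the dimensions below are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn

# Multi-head self-attention over a text sequence X_t of shape (batch, n, d_t):
# query, key, and value all come from the same sequence.
d_t, heads = 512, 8                        # illustrative sizes
msa = nn.MultiheadAttention(embed_dim=d_t, num_heads=heads, batch_first=True)

x_t = torch.randn(2, 20, d_t)              # a batch of 2 text sequences of length 20
h_msa, attn_weights = msa(query=x_t, key=x_t, value=x_t)
print(h_msa.shape)                          # (2, 20, 512): per-head outputs concatenated and projected by W^O
```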
The self-attention mechanism acts on a single sequence, while the cross-attention mechanism deals with dependencies between two sequences. As shown in Fig. 4(b), the cross-attention mechanism provides interaction from modality $\beta$ to modality $\alpha$ by learning directed pairwise attention between the source and target modalities, i.e., the query comes from the target modality $\alpha$, while the key and the value come from the source modality $\beta$, and the result of the linear transformation becomes $Q_i^{\alpha}$, $K_i^{\beta}$, and $V_i^{\beta}$:

$$\text{head}_i = \mathrm{softmax}\!\left(\frac{Q_i^{\alpha} (K_i^{\beta})^{\top}}{\sqrt{d_k}}\right) V_i^{\beta}$$
$$H_{\mathrm{MCA}}^{\beta \to \alpha} = \mathrm{concat}(\text{head}_1, \text{head}_2, \ldots, \text{head}_h)\, W^O$$

where $\alpha$ and $\beta$ denote the different modalities, and $H_{\mathrm{MCA}}^{\beta \to \alpha}$ is the result of the multi-head cross-attention computation.
Figure 4: Two mechanisms of attention: (a) the self-attention mechanism; (b) the cross-attention mechanism.
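For comparison with the self-attention example above, the cross-attention variant simply draws the query from the target modality and the key/value from the source modality; the sketch below again uses illustrative dimensions.

```python
import torch
import torch.nn as nn

# Multi-head cross-attention from source modality beta (image) to target modality alpha (text):
# queries come from the target sequence, keys and values from the source sequence.
d_model, heads = 512, 8                     # illustrative sizes
mca = nn.MultiheadAttention(embed_dim=d_model, num_heads=heads, batch_first=True)

x_text = torch.randn(2, 20, d_model)        # target modality alpha
x_image = torch.randn(2, 49, d_model)       # source modality beta (e.g., 7x7 patch tokens)
h_mca, _ = mca(query=x_text, key=x_image, value=x_image)
print(h_mca.shape)                           # (2, 20, 512): text tokens enhanced by image information
```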
3.4.2 Cross-modal Attention Promotion Mechanism. To achieve better inter- and intra-modal interactions, we design an efficient cross-modal attention promotion mechanism (CAP). It mainly utilizes symmetric cross-modal attention to explore the intrinsic correlation between two input feature sequences, enabling the exchange of beneficial information between the two sequences so that they can promote each other. Self-attention is used to model the temporal dependencies in each feature sequence to allow further information integration. As shown in Fig. 5, CAP takes sequences $X_t$ and $X_v$ as inputs and outputs their mutually reinforced versions $\hat{X}_t$ and $\hat{X}_v$. Specifically, $\hat{X}_t$ is computed as follows:

$$Z_t = \mathrm{LN}\!\big(X_t + \mathrm{MCA}(X_t, X_v)\big), \qquad \hat{X}_t = \mathrm{LN}\!\big(Z_t + \mathrm{FFN}(\mathrm{MSA}(Z_t))\big)$$

where $\mathrm{LN}$ denotes layer normalization and $\mathrm{FFN}$ is the feed-forward neural network in the Transformer. Similarly, we can obtain $\hat{X}_v$. Considering that the computational time complexity of MCA is $O(n_t n_v)$ and the complexity of MSA is $O(n_t^2)$ (resp. $O(n_v^2)$), the total time complexity of a CAP is $O(n_t n_v + n_t^2 + n_v^2)$, where $n_t$ denotes the length of the textual modality sequence and $n_v$ denotes the length of the image modality sequence.
Figure 5: The architecture of cross-modal attention promotion mechanism (CAP).
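The following sketch outlines a CAP-style block under the assumption of a standard residual/layer-norm Transformer composition (symmetric cross-attention, then self-attention and a feed-forward layer per sequence); the exact ordering and normalization used in the paper's CAP may differ.

```python
import torch
import torch.nn as nn

class CAP(nn.Module):
    """Cross-modal Attention Promotion (sketch): symmetric cross-attention followed by
    self-attention and an FFN on each of the two sequences."""
    def __init__(self, d_model=512, heads=8):
        super().__init__()
        self.cross_ab = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.cross_ba = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.self_a = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.self_b = nn.MultiheadAttention(d_model, heads, batch_first=True)
        self.ffn_a = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.ffn_b = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
        self.norms = nn.ModuleList([nn.LayerNorm(d_model) for _ in range(6)])

    def forward(self, x_a, x_b):
        # 1) Symmetric cross-modal attention: each sequence queries the other.
        a = self.norms[0](x_a + self.cross_ab(x_a, x_b, x_b)[0])
        b = self.norms[1](x_b + self.cross_ba(x_b, x_a, x_a)[0])
        # 2) Self-attention models temporal dependencies within each sequence.
        a = self.norms[2](a + self.self_a(a, a, a)[0])
        b = self.norms[3](b + self.self_b(b, b, b)[0])
        # 3) Feed-forward refinement with residual connections.
        a = self.norms[4](a + self.ffn_a(a))
        b = self.norms[5](b + self.ffn_b(b))
        return a, b
```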
3.4.3 Global-Local Interactive Learning Model. Cross-attention-based interactions must be updated twice during a modal interaction to achieve modal enhancement, i.e., both text-to-image and image-to-text interactions need to be executed, which is inefficient and introduces redundant features into the sequences. The experience from large-scale pre-training, in which a single token can represent a whole sequence, can further enhance the efficiency of modal interaction. Based on this, we design a global-local interactive model (GLIM) with linear computational cost to further improve parameter efficiency. The discourse-level representation of each modality can stand in for the full sequence information and interact with the local uni-modal features as a global multi-modal context. We set the global multi-modal context information $G^{(l)} = \mathrm{concat}\big(g_t^{(l)}, g_v^{(l)}\big)$, where $l$ denotes the layer of global-local interaction. By interacting the global information with the local modal information, modal consistency and specificity are learned. The whole interaction process is as follows:

$$\big(G_t^{(l+1)}, \hat{X}_t^{(l+1)}\big) = \mathrm{CAP}\big(G^{(l)}, X_t^{(l)}\big), \qquad \big(G_v^{(l+1)}, \hat{X}_v^{(l+1)}\big) = \mathrm{CAP}\big(G^{(l)}, X_v^{(l)}\big)$$
In this way, this strategy can capture one-to-many global-local cross-modal interactions in both. By stacking multiple layers, global multi-modal contexts and local uni-modal features can reinforce each other and progressively refine themselves. This operation requires Min each layer CAPS. Due to the small length of the global multi-modal context, the overall time complexity reduces to (practically, wehave , which degenerates to in the modality-aligned case. Thus, the default global-local fusion strategy has linear spatial complexity and enjoys linear computation over the modalities involved. Meanwhile, each global-local fusion is followed by a pooling layer to aggregate the boosting information of different modalities to facilitate subsequent fusions. We utilize a fully-connected layer based on Tanh nonlinear activation to implement this operation. We define 2 lifting global multi-modal contexts, i.e., and The new global multi-modal contexts can be obtained as follows:
这样,这种策略就能捕捉到这两种模式中一对多的全局-局部跨模式互动。通过多层堆叠,全局多模态语境和局部单模态特征可以相互促进,逐步完善。这一操作需要对每一层进行 CAPS。由于全局多模态上下文的长度较小,因此整体时间复杂度会降低到 (实际上,我们有 ,在模态对齐的情况下会退化为 。因此,默认的全局-局部融合策略具有线性空间复杂性,并可对所涉及的模态进行线性计算。同时,每次全局-局部融合后都会有一个汇集层来汇集不同模态的增强信息,以促进后续融合。我们利用基于 Tanh 非线性激活的全连接层来实现这一操作。我们定义了 2 个提升的全局多模态上下文,即 新的全局多模态上下文可按如下方式获得: