针对多模态讽刺检测的聚合网络多任务联合训练建模

2024_07_24_1565dbdb820472818fd3g

Lisong Ou 欧力松Key Lab Education Blockchain and Intelligent Technology,
区块链与智能技术重点实验室、Ministry of Education, Guangxi Normal University,
教育部、广西师范大学、Guilin, Guangxi, China 中国广西桂林Guangxi Key Lab of Multi-source Information Mining and
广西多源信息挖掘与分析重点实验室Security, Guangxi Normal University,
安全，广西师范大学、Guilin, Guangxi, China 中国广西桂林School of Mathematics and Statistics, Guilin University of
桂林电子科技大学数学与统计学院Technology, 技术、Guilin, Guangxi, China 中国广西桂林ouls@stu.gxnu.edu.cn

Zhixin Key Lab Education Blockchain and Intelligent Technology,
区块链与智能技术重点实验室、Ministry of Education, Guangxi Normal University,
教育部、广西师范大学、Guilin, Guangxi, China 中国广西桂林Guangxi Key Lab of Multi-source Information Mining and
广西多源信息挖掘与分析重点实验室Security, Guangxi Normal University,
安全，广西师范大学、Guilin, Guangxi, China 中国广西桂林lizx@gxnu.edu.cn

Abstract 摘要

With the continuous emergence of various types of social media, which people often use to express their emotions in daily life, the multi-modal sarcasm detection (MSD) task has attracted more and more attention. However, due to the unique nature of sarcasm itself, there are still two main challenges on the way to achieving robust MSD: 1) existing mainstream methods often fail to take into account the problem of multi-modal weak correlation, thus ignoring the important sarcasm information of the uni-modal itself; 2) inefficiency in modeling cross-modal interactions in unaligned multi-modal data. Therefore, this paper proposes a multi-task jointly trained aggregation network (MTAN), which mainly adopts networks adapted to different modalities according to different modality processing tasks. Specifically, we design a multi-task CLIP framework that includes an uni-modal text task, an uni-modal image task, and a cross-modal interaction task, which can utilize sentiment cues from multiple tasks for multi-modal sarcasm detection. In addition, we design a global-local cross-modal interaction learning method that utilizes discourse-level representations from each modality as the global multi-modal context to interact with local uni-modal features which not only avoids the secondary scaling cost of previous locallocal cross-modal interaction methods but also allows the global multi-modal context and local uni-modal features to be mutually reinforcing and progressively improved through multi-layer superposition. After extensive experimental results and in-depth analysis, our model achieves state-of-the-art performance in multi-modal sarcasm detection.
随着日常生活中人们常用来表达情绪的各类社交媒体的不断涌现，多模态讽刺语言检测（MSD）任务受到越来越多的关注。然而，由于讽刺本身的特殊性，在实现鲁棒性 MSD 的道路上还存在两大挑战：1）现有的主流方法往往没有考虑到多模态弱相关性的问题，从而忽略了单模态本身重要的讽刺信息；2）在未对齐的多模态数据中，跨模态交互建模效率低下。因此，本文提出了一种多任务联合训练聚合网络（MTAN），主要是根据不同的模态处理任务，采用适应不同模态的网络。具体来说，我们设计了一个多任务 CLIP 框架，其中包括一个单模态文本任务、一个单模态图像任务和一个跨模态交互任务，可以利用多个任务的情感线索进行多模态讽刺检测。此外，我们还设计了一种全局-局部跨模态交互学习方法，利用各模态的话语级表征作为全局多模态上下文，与局部单模态特征进行交互，不仅避免了以往局部跨模态交互方法的二次缩放成本，而且通过多层叠加，使全局多模态上下文与局部单模态特征相互促进，逐步提高。经过广泛的实验结果和深入分析，我们的模型在多模态讽刺语言检测方面达到了最先进的性能。

CCS CONCEPTS 综合传播战略概念

Computing methodologies Natural language generation; Neural networks; Theory of computation Fixed parameter tractability.
计算方法自然语言生成；神经网络；计算理论固定参数可操作性。

KEYWORDS 关键词

multi-modal sarcasm detection, aggregation network, multi-task CLIP framework, cross-modal interaction
多模态讽刺检测、聚合网络、多任务 CLIP 框架、跨模态交互

ACM Reference Format: ACM 参考格式：

Lisong Ou and Zhixin Li. 2024. Modeling Multi-Task Joint Training of Aggregate Networks for Multi-Modal Sarcasm Detection. In Proceedings of the 2024 International Conference on Multimedia Retrieval (ICMR '24), Fune 10-14, 2024, Phuket, Thailand. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3652583.3658015
Lisong Ou 和 Zhixin Li.2024.用于多模态讽刺检测的聚合网络多任务联合训练建模。2024 年国际多媒体检索会议（ICMR '24）论文集》，2024 年 6 月 10-14 日，泰国普吉岛。ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3652583.3658015

1 INTRODUCTION 1 引言

Sarcasm is a form of communication that conveys implied emotions and expresses a person's emotional attitude in a specific situation through the use of antithesis or superficially contradictory expressions

. In increasingly developed social software communication, sarcasm is frequently used, usually by combining multi-modal information such as text and pictures to express the meaning of sarcasm [20, 39]. As in the satirical example shown in Fig. 1(a), a text-only approach may incorrectly recognize a burger described as "huge and stuffy" as a positive expression of emotion. However, the image expresses sarcasm with a negative emotion because it shows an image of a "small and flat" burger, and the inconsistency expressed by the fusion of text and image happens to be a key cue for capturing sarcasm. Given this, how to fully mine and utilize the data in social networks, especially the social comment data with the emotional tendency for sarcasm detection, has become a current research hotspot [41].
讽刺是一种通过使用对立面或表面上矛盾的表达方式

来传递隐含情感和表达一个人在特定情况下的情感态度的交流形式。在日益发达的社交软件交流中，讽刺被频繁使用，通常是通过结合文字和图片等多模态信息来表达讽刺的含义 [20, 39]。如图 1(a)所示的讽刺例子，纯文本方法可能会错误地将一个被描述为 "巨大而闷热 "的汉堡识别为正面的情绪表达。然而，图片表达的是带有负面情绪的讽刺，因为它显示的是一个 "小而扁 "的汉堡的图片，而文本和图片融合所表达的不一致性恰好是捕捉讽刺的关键线索。有鉴于此，如何充分挖掘和利用社交网络中的数据，尤其是带有情感倾向的社交评论数据进行讽刺检测，已成为当前的研究热点[41]。

With the rapid development of computer technology, multimodal sarcasm detection methods based on deep learning have made significant progress [17, 28, 35, 40]. Schifanella et al.[26] first defined the multi-modal sarcasm detection task in 2016, which attracted extensive attention from researchers and drove the emergence of various deep learning models. Cai et al.[2] created a Twitter-based multi-modal sarcasm detection dataset and proposed a multi-modal hierarchical fusion model to solve this task. Some of
随着计算机技术的飞速发展，基于深度学习的多模态讽刺检测方法取得了长足的进步[17, 28, 35, 40]。Schifanella 等人[26] 于 2016 年首次定义了多模态讽刺检测任务，引起了研究者的广泛关注，并推动了各种深度学习模型的出现。Cai 等人[2] 创建了一个基于 Twitter 的多模态讽刺检测数据集，并提出了一种多模态分层融合模型来解决这一任务。其中一些

(a) Prior Work (a) 以前的工作

(b) Our Work (b) 我们的工作
Figure 1: Two multi-modal sarcastic examples. Comparison between prior work and our proposed MTAN.
图 1：两个多模态讽刺例子。之前的工作与我们提出的 MTAN 之间的比较。

these research efforts have focused on methods based on attentional mechanisms to capture the interactions between various modalities for implicit learning of inconsistencies between images and text [20, 43-45]. Another research is based on graphical approaches to identify essential cues in sarcasm detection, which can better capture relationships across different modalities and thus dominate the performance in the literature [13-15].
这些研究工作的重点是基于注意机制的方法，以捕捉各种模态之间的相互作用，从而隐式学习图像与文本之间的不一致之处[20, 43-45]。另一项研究基于图形方法来识别讽刺检测中的基本线索，这种方法能更好地捕捉不同模态之间的关系，因此在文献中的表现占主导地位[13-15]。

Although current multi-modal sarcasm detection methods have made significant progress, effectively fusing heterogeneous sequence features and utilizing each unimodally important sarcasm cue in practical applications is still an essential challenge in this field [24, 38]. The first is the inefficiency in modeling cross-modal interactions in unaligned multi-modal data and the problem of excessive space and time complexity. In addition, we found that when using only the textual modality model RoBERTa for multi-modal sarcasm detection, its performance significantly outperforms the state-of-the-art multi-modal model HKE [15] by

. This suggests that the performance of current models may rely heavily on satirical cues in uni-modal data. For example, in the satirical example in Fig. 1(b), there is a weak correlation between the text and the image, which may negatively impact the multi-modal fusion process.In summary, it is clear from Fig. 1 that prior work has often relied on text-image fusion for multi-modal sarcasm detection. In our study we focus not only on the interactions between modalities but also on the satirical cues of the unimodality themselves.
尽管目前的多模态讽刺检测方法已经取得了重大进展，但在实际应用中有效融合异构序列特征并利用每一个单模态重要讽刺线索仍然是这一领域的基本挑战[24, 38]。首先是在未对齐的多模态数据中建立跨模态交互模型的效率低下，以及空间和时间复杂度过高的问题。此外，我们还发现，当仅使用文本模态模型 RoBERTa 进行多模态讽刺检测时，其性能明显优于最先进的多模态模型 HKE [15]

。这表明，当前模型的性能可能在很大程度上依赖于单模态数据中的讽刺线索。例如，在图 1(b) 中的讽刺示例中，文本和图像之间的相关性很弱，这可能会对多模态融合过程产生负面影响。在我们的研究中，我们不仅关注模态之间的交互，还关注单模态本身的讽刺线索。

To this end, in this work, we propose a multi-task co-training aggregation network approach (MTAN) for multi-modal detection , which improves the performance of sarcasm by setting up three tasks, namely text, image, and text-image interaction, to mine the sentiment features embedded in the multi-modal information of different modalities themselves. Specifically, we use the CLIP framework for the text and image processing tasks, which can utilize multi-granularity cues for multi-modal sarcasm detection. We utilize discourse-level representations from each modality as a global multi-modal context for text-image interaction tasks to interact with local single-peak features. On the one hand, the global multimodal context can effectively augment the local uni-modal features through cross-modal attention. In turn, the global multi-modal context can update itself by extracting useful information from the local single-peak features through symmetric cross-modal attention. Through multi-layer superposition, global multi-modal context and local single-peak features can enhance each other and gradually improve. In order to verify the effectiveness of the proposed method, we conduct comprehensive experiments on the widely used Twitter benchmark dataset. The results show that our proposed method exhibits excellent performance. To summarize, the main contributions of this paper are as follows:
为此，我们在这项工作中提出了一种用于多模态检测的多任务协同训练聚合网络方法（MTAN），通过设置文本、图像和文本图像交互三个任务来挖掘不同模态本身的多模态信息中蕴含的情感特征，从而提高讽刺信息的检测性能。具体来说，我们在文本和图像处理任务中使用了 CLIP 框架，它可以利用多粒度线索进行多模态讽刺检测。我们利用每种模态的话语级表征作为文本-图像交互任务的全局多模态上下文，与局部单峰特征进行交互。一方面，全局多模态语境可以通过跨模态注意有效增强局部单模态特征。反过来，全局多模态上下文也可以通过对称的跨模态注意从局部单峰特征中提取有用信息，从而更新自身。通过多层叠加，全局多模态上下文和局部单峰值特征可以相互促进，逐步完善。为了验证所提方法的有效性，我们在广泛使用的 Twitter 基准数据集上进行了综合实验。结果表明，我们提出的方法表现出了卓越的性能。综上所述，本文的主要贡献如下：

We design a novel multi-task CLIP framework to capture multi-granularity clues of each modality and inter-modal data from three tasks: text, image, and graphic interaction.
我们设计了一个新颖的多任务 CLIP 框架，以捕捉每种模态的多粒度线索以及来自文本、图像和图形交互这三种任务的跨模态数据。
We propose a global-local cross-modal interaction learning approach for effective and efficient fusion of non-aligned multi-modal data. It avoids the quadratic scaling cost of previous dense cross-modal interaction methods and obtains better performance than them.
我们提出了一种全局-局部跨模态交互学习方法，用于高效融合非对齐的多模态数据。它避免了以往密集跨模态交互方法的二次缩放成本，并获得了比它们更好的性能。
The results of a series of experiments on a publicly available multi-modal sarcasm detection benchmark dataset show that our proposed method achieves state-of-the-art performance.
在一个公开的多模态讽刺检测基准数据集上进行的一系列实验结果表明，我们提出的方法达到了最先进的性能。

Sarcasm detection belongs to one of the research directions of sentiment analysis and was first seen as a text classification task [37]. Traditional machine learning algorithms, including simple Bayesian, decision tree, and SVM models, were initially employed for singlemodal text sarcasm detection tasks, which delivered strong experimental results, establishing the foundation for future research in the field [3, 23]. Then, some research efforts have focused on using deep learning such as GRU [5], RNN [12], and LSTM [36] to capture tweet text's syntactic and semantic information for sarcasm detection [7, 42]. Although these efforts have made promising progress, with the rapid growth of modern social media, multi-modal sarcasm tasks have become a new research hotspot in response to the times. A new concept of multi-modal sarcasm detection was first defined by Schifanella et al. [26]. Subsequently, a series of works focused on capturing satirical emotions using attentional mechanisms. Pan [20] and Zhang et al. [43] designed inter-channel attention and co-attention mechanisms to capture inter-modal inconsistencies and simulate internal contradictions within the text, respectively. Similarly, Zhao et al. [45] proposed coupled attention networks, which can effectively integrate text and image information into a unified framework. Zhang et al. [44] proposed a cross-modal target attention mechanism to automatically learn the alignment between text and image/speech. In recent years, good progress has been made in graph structure-based approaches for sarcasm detection. Liang et al. [13] deployed a heterogeneous graph structure to learn sarcasm features from intra- and inter-modal perspectives. Yue et al. [38] improved graph structure to incorporate sample-level and word-level cross-modal semantic similarity detection for determining image-text relevance. Xu et al. [34] used a hierarchical graph attention network to model each subgraph of a fine-grained text-image graph and employed a global feature alignment module to fuse the entire image and text. However, there is the problem of inefficiency in modeling cross-modal interactions in unaligned multi-modal data. To address this issue, we propose the Global-Local Interaction Learning method, which avoids the quadratic scaling cost of previous local-local cross-modal interaction methods and leads to better performance.
讽刺检测属于情感分析的研究方向之一，最早被视为一种文本分类任务[37]。传统的机器学习算法，包括简单的贝叶斯模型、决策树模型和 SVM 模型，最初被用于单模态文本讽刺检测任务，取得了很好的实验结果，为该领域未来的研究奠定了基础[3, 23]。随后，一些研究工作侧重于使用深度学习，如 GRU [5]、RNN [12] 和 LSTM [36]，来捕捉鸣叫文本的句法和语义信息，以进行讽刺检测 [7，42]。尽管这些努力取得了可喜的进展，但随着现代社交媒体的快速发展，多模态讽刺任务已成为顺应时代的新研究热点。Schifanella 等人[26] 首次提出了多模态讽刺检测的新概念。随后，一系列研究集中于利用注意机制捕捉讽刺情绪。潘[20]和张等人[43]设计了通道间注意和共同注意机制，分别用于捕捉模式间不一致和模拟文本内部矛盾。同样，Zhao 等人[45] 提出了耦合注意力网络，可以有效地将文本和图像信息整合到一个统一的框架中。Zhang 等人[44]提出了一种跨模态目标注意机制，可自动学习文本与图像/语音之间的对齐。近年来，基于图结构的讽刺语言检测方法取得了良好的进展。Liang 等人[13] 利用异构图结构从模态内和模态间的角度学习讽刺特征。Yue等人[14]从模式内和模式间的角度学习挖苦特征。 [38]改进了图结构，纳入了样本级和词级跨模态语义相似性检测，以确定图像与文本的相关性。Xu 等人[34]使用分层图注意网络对细粒度文本-图像图的每个子图进行建模，并使用全局特征对齐模块对整个图像和文本进行融合。然而，在未对齐的多模态数据中，存在跨模态交互建模效率低下的问题。为解决这一问题，我们提出了全局-局部交互学习方法，该方法避免了以往局部-局部跨模态交互方法的二次缩放成本，从而获得了更好的性能。

2.2 Multi-Task Sarcasm Detection
2.2 多任务讽刺检测

Multi-task sarcasm detection aims to comprehensively understand the sarcastic meanings in language and contexts [4, 19, 27]. Although there is relatively less research on multi-task sarcasm detection at present, some work has begun to explore this area. Schifanella et al. [26] combined manually designed features of text and images with deep learning-based features for prediction in two tasks. Wu et al. [31]constructed a multi-task model of densely connected LSTM using embeddings, sentiment features, and syntactic features. Additionally, Majumder et al. [18] used deep neural networks to simulate the correlation between sentiment classification and sarcasm detection to improve performance in a multi-task learning environment.Savini et al. [25] treated emotion classification as an auxiliary task to assist the main task of sarcasm detection. Recently, Zhang et al. [44] proposed a multi-modal interaction learning task, including a dual-gate network and three independent fully connected layers, to capture the commonalities and uniqueness of sarcastic information. However, these tasks overlook the significant sarcasm clues in the single modality itself. To address this issue, we design emotional clues for multiple tasks (i.e., text, image, and text-image interaction tasks) for multi-modal sarcasm detection.
多任务讽刺检测旨在全面理解语言和语境中的讽刺含义 [4, 19, 27]。虽然目前关于多任务讽刺检测的研究相对较少，但已有一些工作开始探索这一领域。Schifanella 等人[26]将人工设计的文本和图像特征与基于深度学习的特征相结合，在两个任务中进行预测。Wu 等人[31]利用嵌入、情感特征和句法特征构建了一个由密集连接的 LSTM 组成的多任务模型。此外，Majumder 等人[18]使用深度神经网络模拟情感分类和讽刺检测之间的相关性，以提高多任务学习环境下的性能。最近，Zhang 等人[44] 提出了一种多模态交互学习任务，包括一个双门网络和三个独立的全连接层，以捕捉讽刺信息的共性和独特性。然而，这些任务忽略了单一模态本身的重要讽刺线索。为了解决这个问题，我们为多任务（即文本、图像和文本-图像交互任务）设计了情感线索，用于多模态讽刺检测。

3 METHODOLOGY 3 方法

Our propose multi-task jointly trained aggregation network (MTAN) consists of three processing tasks to utilize multi-granularity cues for better multi-modal sarcasm detection. As shown in Fig. 2, our framework contains three processing tasks: text processing task, image processing task, and text-image fusion task, which explicitly utilizes different cues from different tasks better to capture rich satirical cues for multi-modal satire detection. For text- and imageprocessing tasks, we use the text and image editors of the CLIP framework, which naturally inherit multi-modal knowledge from pre-trained CLIP models. For the text-image interaction task, we utilize discourse-level representations from each modality as a global multi-modal context to interact with local single-peak features to achieve an efficient interactive fusion of unaligned multi-modal sequences.
我们提出的多任务联合训练聚合网络（MTAN）由三个处理任务组成，以利用多粒度线索更好地进行多模态讽刺检测。如图 2 所示，我们的框架包含三个处理任务：文本处理任务、图像处理任务和文本图像融合任务。在文本和图像处理任务中，我们使用了 CLIP 框架的文本和图像编辑器，它们自然地继承了预先训练好的 CLIP 模型的多模态知识。在文本-图像交互任务中，我们利用每种模态的话语级表征作为全局多模态上下文，与局部单峰特征进行交互，从而实现未对齐多模态序列的高效交互式融合。

3.1 CLIP Model 3.1 CLIP 模型

CLIP [22] is a visual and linguistic interrelated pre-training model based on a Contrastive Learning and Pre-Training approach that enables the model to be trained on large-scale unlabeled data and learn feature representations with good generalization ability. It uses Contrastive Learning and Pre-Training to train the model on large-scale unlabeled data and learn feature representations with good generalization ability, achieving excellent performance in several visual and linguistic tasks [9, 16, 32]. Briefly, CLIP internally contains two models, a Text Encoder and an Image Encoder, as shown in Fig. 3. For each image-text pair, the image and text encoders map the text-image pairs to the same feature space. For a given batch of N image-text pairs, the training objective of CLIP is to maximize the cosine similarity of the paired image and text feature encodings while minimizing the cosine similarity of the unpaired image and text feature encodings. In this article, we design a multi-task joint training aggregation network for multi-modal satire detection based on CLIP architecture.
CLIP[22]是一个视觉和语言相互关联的预训练模型，它基于对比学习和预训练方法，能够在大规模无标记数据上训练模型，并学习具有良好泛化能力的特征表征。CLIP 是一种基于对比学习和预训练方法的关联训练模型，可在大规模无标记数据上训练模型，并学习具有良好泛化能力的特征表征，在多项视觉和语言任务中取得了优异的表现 [9、16、32]。简而言之，CLIP 内部包含两个模型，即文本编码器和图像编码器，如图 3 所示。对于每个图像-文本对，图像编码器和文本编码器将文本-图像对映射到相同的特征空间。对于给定的 N 组图像-文本对，CLIP 的训练目标是最大化配对图像和文本特征编码的余弦相似度，同时最小化未配对图像和文本特征编码的余弦相似度。本文设计了一种基于 CLIP 架构的多任务联合训练聚合网络，用于多模态讽刺画检测。

Figure 3: CLIP model, which is a visual and language-related pre-training model based on contrastive learning and pretraining methods.
图 3：CLIP 模型，这是一个基于对比学习和预训练方法的视觉和语言相关预训练模型。

3.2 Text Processing Task
3.2 文本处理任务

In practical sarcasm detection applications, it cannot be ignored that there are weak correlations between some graphs and texts. At the same time, some research results have fully demonstrated the feasibility of using textual information for sarcasm detection

. Therefore, we introduce an unimodal text task to determine sarcasm from the perspective of text using the architecture of a mighty CLIP. In this section, the Text Encoder is used to extract the features of the text with its internal Text transformer model, which is commonly used in NLP. The transformer captures the semantic information in the text sequences utilizing a multi-head self-attention mechanism and a feed-forward neural network layer. The text representation contains rich semantic information through multiple levels of abstraction hierarchy. Finally, the representation of text

is defined as:
在实际的讽刺信息检测应用中，不可忽视的是某些图形和文本之间存在着微弱的相关性。同时，一些研究成果已经充分证明了利用文本信息进行讽刺检测

的可行性。因此，我们引入了一个单模态文本任务，利用强大的 CLIP 架构从文本的角度来判断讽刺。在本节中，文本编码器利用其内部的文本转换器模型提取文本特征，该模型在 NLP 中非常常用。转换器利用多头自注意机制和前馈神经网络层捕捉文本序列中的语义信息。通过多层次的抽象层次结构，文本表示包含了丰富的语义信息。最后，文本

的表示法定义为：

Where

is the words or phrases that make up the original text,

is the sequence length of the original text, and

ext is encoded using the text encoder of the CLIP model.

is the output as a contextual representation of the text, and

is the contextual representation of the

-th word

in the text. Finally, the text view decoder directly uses

for sarcasm detection.
其中，

是组成原始文本的单词或短语，

是原始文本的序列长度，

ext 使用 CLIP 模型的文本编码器进行编码。

是作为文本上下文表示的输出，

是文本中

-th 字

的上下文表示。最后，文本视图解码器直接使用

进行讽刺检测。

Where

is the image feature output distribution, and

and

are trainable parameters.
其中，

为图像特征输出分布，

和

为可训练参数。

3.3 Image Processing Task
3.3 图像处理任务

Image information is an important feature that can give the text more emotional expression and visual elements. Also, the weak correlation between graphics and text motivated us to design an image-processing task module to determine sarcasm from an image perspective using the architecture of a mighty CLIP. In this section,
图像信息是一个重要的特征，可以赋予文本更多的情感表达和视觉元素。同时，图形与文本之间的弱相关性促使我们设计了一个图像处理任务模块，利用强大的 CLIP 架构从图像角度判断讽刺。在本节中

Figure 2: The architecture of the proposed MTAN, which involves three processing tasks: a text-processing task, an imageprocessing task, and a text-image fusion task.

represents matrix concatenation. The architecture leverages multi-granularity clues from multiple tasks (i.e., text, image, and text-image interaction tasks) for multi-modal sarcasm detection.
图 2：拟议 MTAN 的架构，包括三个处理任务：文本处理任务、图像处理任务和文本图像融合任务。

表示矩阵连接。该架构利用来自多个任务（即文本、图像和文本图像交互任务）的多粒度线索进行多模态讽刺检测。

the Image Encoder is used to extract the features of the image, which internally focuses on the vision Transformer. First, the image is cut into a series of Patch blocks and then encoded through the Transformer model. Finally, the local and global features of the image are captured by stacking layers of convolution or Transformer blocks. Specifically, for image

, the output of the CLIP model is a semantic representation of the image:
图像编码器用于提取图像的特征，其内部侧重于视觉变换器。首先，图像被切割成一系列 Patch 块，然后通过变换器模型进行编码。最后，通过堆叠卷积层或变换器块来捕捉图像的局部和全局特征。具体来说，对于图像

，CLIP 模型的输出是图像的语义表示：

Where

is the incoming image information,

mage denotes the encoding using the text encoder of the CLIP model, and

is the information feature obtained after encoding. Finally, the image view decoder directly uses

for sarcasm detection.
其中，

是输入的图像信息，

mage 表示使用 CLIP 模型的文本编码器进行的编码，

是编码后得到的信息特征。最后，图像视图解码器直接使用

进行讽刺信息检测。

Where

is the image feature output distribution, and

and

are trainable parameters.
其中，

为图像特征输出分布，

和

为可训练参数。

3.4 Text-image Fusion Task
3.4 文本图像融合任务

In recent years, some attention-based methods have mainly used cross-attention for multi-modal fusion, strengthening the target modality by learning directed pairwise attention between the target and source modalities [15, 20]. A good fusion scheme extracts and integrates adequate information from multi-modal sequences and maintains the mutual independence of modalities. Based on this, we design an efficient cross-modal attention promotion mechanism (CAP), which mainly exploits symmetric cross-modal attention to explore the intrinsic correlation between two input feature sequences and realizes the exchange of beneficial information between the two sequences so that they can promote each other. Self-attention is used to model the temporal dependencies in each feature sequence to allow further information integration.
近年来，一些基于注意力的方法主要利用交叉注意力进行多模态融合，通过学习目标模态和源模态之间的定向成对注意力来强化目标模态[15, 20]。好的融合方案能从多模态序列中提取和整合足够的信息，并保持模态之间的相互独立性。在此基础上，我们设计了一种高效的跨模态注意力促进机制（CAP），它主要利用对称跨模态注意力来探索两个输入特征序列之间的内在相关性，并实现两个序列之间有益信息的交换，使它们能够相互促进。自注意用于模拟每个特征序列的时间依赖性，以便进一步整合信息。

3.4.1 Attention Architecture. Attention mechanism has been studied as early as the 1990s, and later, accompanied by the proposal of Vaswani et al. [30] in 2017 in the Transformer structure, attention mechanisms have been widely used in network design for NLP and CV-related problems. In this paper, we will mainly combine the self-attention mechanism and the cross-attention mechanism for use in sarcasm detection tasks. The self-attention mechanism is mainly used to compute the attentional weights between elements in a sequence to capture the dependencies between components. The multi-head self-attention mechanism utilizes multiple heads using the same computation method, given different parameters to express features from multiple subspaces, and can capture richer
3.4.1 注意力架构。早在 20 世纪 90 年代，人们就已经开始研究注意力机制，后来伴随着 Vaswani 等人[30] 在 2017 年提出的 Transformer 结构，注意力机制被广泛应用于 NLP 和 CV 相关问题的网络设计中。在本文中，我们将主要结合自注意机制和交叉注意机制，用于讽刺检测任务中。自注意机制主要用于计算序列中元素之间的注意权重，以捕捉组件之间的依赖关系。多头自注意机制利用多个头使用相同的计算方法，给定不同的参数来表达来自多个子空间的特征，可以捕捉到更丰富的特征。
feature information. As shown in Fig.4(a), we take the text modality as an example to define the input of the multi-head self-attention mechanism as

, where

denotes the encoding dimension of the text modality and the whole process can be expressed as:
特征信息。如图 4（a）所示，我们以文本模态为例，将多头自注意机制的输入定义为

，其中

表示文本模态的编码维度，整个过程可表示为：

MSA

concat

head

, head

,
MSA

连接

head

, head

、

Where

, and

are the results of a linear transformation of the input vectors by the

-th head,

, and

are the weight parameters of Query, Key, and Value mapping, respectively, and concat denotes the splicing operation,

is the weight matrix of the final linear transformation, and

is the result of the computation of the multi-head self-attention mechanism.
其中，

和

是第

个头对输入向量进行线性变换的结果，

和

分别是查询、键和值映射的权重参数，concat 表示拼接操作，

是最终线性变换的权重矩阵，

是多头自关注机制的计算结果。

The self-attention mechanism acts on the same sequence, while the cross-attention mechanism deals with dependencies between two sequences. As shown in Fig. 4(b), the cross-attention mechanism provides interaction from modality

to modality

in this way by learning directed pairwise attention between the source and target modalities, i.e., the query comes from the target modality

, while the key and the value come from the source modality

, and the result of the linear transformation becomes

.
自我注意机制作用于同一序列，而交叉注意机制则处理两个序列之间的依赖关系。如图 4(b) 所示，交叉注意机制通过学习源模态和目标模态之间的定向配对注意，以这种方式提供从模态

到模态

的交互，即查询来自目标模态

，而键和值来自源模态

，线性变换的结果变成

。

concat

head

, head

连接

head

, head

、

Where

and

denote the different modalities, and

is the result of calculating the multiple cross-attention mechanism.
其中，

和

表示不同的模式，

是多重交叉注意机制的计算结果。

(a) Self-attention Mechanism
(a) 自我关注机制

(b) Cross-attention Mechanism
(b) 交叉关注机制
Figure 4: Two mechanisms of attention.
图 4：两种注意力机制。

3.4.2 Cross-modal Attention Promotion Mechanism . To achieve better inter- and intra-modal interactions, we designed an efficient cross-modal attention promotion mechanism (CAP) . It mainly utilizes symmetric cross-modal attention to explore the intrinsic correlation between two input feature sequences, enabling the exchange of beneficial information between the two sequences so that they can promote each other. Self-attention is used to model the temporal dependencies in each feature sequence to allow further information integration. As shown in Fig. 5, CAP takes sequences

and

as inputs and outputs their mutually reinforcing information

and

. Specifically,

is computed as follows:
3.4.2 跨模态注意力促进机制 .为了更好地实现模态间和模态内的交互，我们设计了一种高效的跨模态注意力促进机制（CAP）。它主要利用对称跨模态注意力来探索两个输入特征序列之间的内在关联性，使两个序列之间交换有益信息，从而相互促进。自我注意用于对每个特征序列的时间依赖性进行建模，以便进一步整合信息。如图 5 所示，CAP 将序列

和

作为输入，并输出其相互促进的信息

和

。具体来说，

的计算过程如下：

Where

denotes layer normalization and

is the feedforward neural network in Transformer. Similarly, we can obtain

. Considering that the computational time complexity of MCA

and the complexity of MSA

, the total time complexity of a CAP is

, where

denotes the length of the textual modality, and

denotes the size of the sequence of the imaging modality, respectively.
其中，

表示层归一化，

是变换器中的前馈神经网络。类似地，我们可以得到

。考虑到 MCA 的计算时间复杂度

为

，MSA 的复杂度

为

，CAP 的总时间复杂度为

，其中

分别表示文本模态的长度，

表示成像模态序列的大小。

Figure 5: The architecture of cross-modal attention promotion mechanism (CAP).
图 5：跨模态注意力促进机制（CAP）的结构。

3.4.3 Global-Local Interactive Learning Model. Cross-attentionbased interactions are updated twice during a modal interaction to achieve modal enhancement, i.e., text-image interactions, and image-text interactions need to be executed twice, which is inefficient and introduces redundant features to the sequence. The historical experience from large-scale pre-modeling in which a certain Token can represent the whole sequence can further enhance the efficiency of modal interaction. Based on this, we design a globallocal Interactive model (GLIM) based on linear computational cost to improve the parameter efficiency further. The discourse-level representation of each modality can replace the standard information and interact with the local uni-modal features in a global
3.4.3 全局-局部交互学习模型。在模态交互过程中，为了实现模态增强，基于交叉注意力的交互需要更新两次，即文本-图像交互、图像-文本交互需要执行两次，这不仅效率低下，而且会给序列带来冗余特征。在大规模预建模中，某个 Token 可以代表整个序列，这一历史经验可以进一步提高模态交互的效率。在此基础上，我们设计了一种基于线性计算成本的全局交互模型（GLIM），以进一步提高参数效率。每种模态的话语级表征可以取代标准信息，并在全局中与本地的单模态特征交互。
multi-modal context. We set the global multi-modal context information

concat

, where

denotes the number of layers of global-local interaction. Interacting global information with local modal information, modal consistency, and specificity are learned. The whole interaction process is as follows
多模态上下文。我们设置全局多模态上下文信息

连接

，其中

表示全局与本地交互的层数。学习全局信息与本地模态信息的交互、模态一致性和特异性。整个交互过程如下

In this way, this strategy can capture one-to-many global-local cross-modal interactions in both. By stacking multiple layers, global multi-modal contexts and local uni-modal features can reinforce each other and progressively refine themselves. This operation requires Min each layer CAPS. Due to the small length of the global multi-modal context, the overall time complexity reduces to

(practically, wehave

, which degenerates to

in the modality-aligned case. Thus, the default global-local fusion strategy has linear spatial complexity and enjoys linear computation over the modalities involved. Meanwhile, each global-local fusion is followed by a pooling layer to aggregate the boosting information of different modalities to facilitate subsequent fusions. We utilize a fully-connected layer based on Tanh nonlinear activation to implement this operation. We define 2 lifting global multi-modal contexts, i.e.,

and

The new global multi-modal contexts can be obtained as follows:
这样，这种策略就能捕捉到这两种模式中一对多的全局-局部跨模式互动。通过多层堆叠，全局多模态语境和局部单模态特征可以相互促进，逐步完善。这一操作需要对每一层进行 CAPS。由于全局多模态上下文的长度较小，因此整体时间复杂度会降低到

（实际上，我们有

，在模态对齐的情况下会退化为

。因此，默认的全局-局部融合策略具有线性空间复杂性，并可对所涉及的模态进行线性计算。同时，每次全局-局部融合后都会有一个汇集层来汇集不同模态的增强信息，以促进后续融合。我们利用基于 Tanh 非线性激活的全连接层来实现这一操作。我们定义了 2 个提升的全局多模态上下文，即

和

新的全局多模态上下文可按如下方式获得：

Where

. We treat the whole learning process in layers, with features at different stages of the model's main at each layer. We consider that the beginning of the model learns shallow interaction features, and the later stages of the model learn higher-order semantic features. This hierarchical learning approach successfully fuses information from different modalities through the clever design of the aggregation block, providing the model with a more comprehensive and rich representation of multimodal features. Through the interaction of the model, the information from different modalities can be combined more deeply and effectively, and then more advanced and abstract feature representations can be obtained in the subsequent hierarchical learning. After that, we aggregate the features from uni-modal and multi-modal modalities and pass them through a layer of Transformer to obtain the corresponding sarcasm detection category

.
其中

。我们将整个学习过程分层处理，每一层都有模型主要不同阶段的特征。我们认为，模型初期学习的是浅层交互特征，而模型后期学习的是高阶语义特征。这种分层学习方法通过聚合块的巧妙设计，成功地融合了来自不同模态的信息，为模型提供了更全面、更丰富的多模态特征表征。通过模型的交互作用，不同模态的信息可以得到更深入、更有效的结合，进而在后续的分层学习中获得更高级、更抽象的特征表征。之后，我们将来自单模态和多模态的特征进行聚合，并通过一层转换器（Transformer）得到相应的讽刺检测类别

。

Where

is the feature output distribution after multi-modal fusion,

and

are trainable parameters.
其中，

是多模态融合后的特征输出分布，

和

是可训练参数。

3.5 Multi-task Aggregation
3.5 多任务聚合

Given the obtained

, and

, we use late fusion to produce the final prediction

.
鉴于得到的

和

，我们使用后期融合生成最终预测

。

Where

can be considered as a rich set of features utilized for differ ent tasks from text, image, and image-text, in order to eliminate the uncertainty of ironic information, our proposed model is better able to be trained on the dataset with distributions as close as possible between the predicted data and the actual data. Therefore, in this paper, the cross-entropy loss function is used as the classification loss.
其中，

可视为用于文本、图像和图像-文本等不同任务的丰富特征集，为了消除讽刺信息的不确定性，我们提出的模型最好能够在预测数据和实际数据分布尽可能接近的数据集上进行训练。因此，本文采用交叉熵损失函数作为分类损失。

Where

is the classification loss function,

is the true label of the sample ( 0 for non-sarcastic and 1 for sarcastic), and

is the model's prediction of the sample.
其中，

是分类损失函数，

是样本的真实标签（0 表示非讽刺，1 表示讽刺），

是模型对样本的预测。

4 EXPERIMENTS 4 实验

This section focuses on a series of experiments to evaluate the performance of the proposed MTAN method. A comparison with competing methods is made to validate the effectiveness of MTAN and its constituent elements.Subsequently, the effectiveness and efficiency of multi-modal fusion method is analyzed to evaluate their impact on model performance, and an ablation study is performed. We further conducte several visualization analyses and case studies to prove MTAN's ability to capture sarcasm . The main research questions to be validated in this section are highlighted here:
本节主要通过一系列实验来评估所提出的 MTAN 方法的性能。随后，我们分析了多模态融合方法的有效性和效率，以评估它们对模型性能的影响，并进行了消融研究。我们还进行了多项可视化分析和案例研究，以证明 MTAN 能够捕捉讽刺语言。本节将重点讨论有待验证的主要研究问题：

Q1: How does the performance of our proposed MTAN model compare to state-of-the-art models?
问题 1：与最先进的模型相比，我们提出的 MTAN 模型性能如何？
Q2:How efficient is our proposed MTAN model compared to other multi-modal fusion methods?
问题 2：与其他多模态融合方法相比，我们提出的 MTAN 模型效率如何？
Q3: How do the various subtasks designed contribute to the performance of MTAN?
问题 3：设计的各种子任务如何促进 MTAN 的性能？
Q4: Why can MTAN use multi-granularity clues to detect potential sarcasm?
问题 4：为什么 MTAN 可以使用多粒度线索来检测潜在的讽刺？
Q5: How does MTAN perform in different test cases compared to other baseline methods?
问题 5：与其他基线方法相比，MTAN 在不同测试案例中的表现如何？

These questions aim to evaluate our proposed MTAN model's efficacy, reliability, and ability to identify multi-modal sarcasm compared to other models.
这些问题旨在评估我们提出的 MTAN 模型的有效性、可靠性以及与其他模型相比识别多模态讽刺的能力。

4.1 Datasets and Experimental Settings
4.1 数据集和实验设置

This experiment utilizes a multi-modal sarcasm detection dataset sourced from the Twitter social media platform [2]. The statistical details of the experimental data are presented in Table 1. Our framework is based on the Huggingface library, and we adopt clipvit-base-patch 32 as the backbone. The word embeddings for text are set to 768 dimensions, with a length limit of 100. Any length exceeding this limit is truncated and padded with zeros. We set the cross-attention layers to 8 during the interaction between text and images. The optimizer chosen is Adam, with a learning rate of

for CLIP and

for the remaining components. The batch size for training is set to 32 , and a dropout rate of 0.3 is implemented to prevent overfitting.
本实验使用的多模态讽刺检测数据集来自 Twitter 社交媒体平台[2]。实验数据的统计详情见表 1。我们的框架基于 Huggingface 库，以 clipvit-base-patch 32 为骨干。文本的词嵌入设置为 768 维，长度限制为 100。任何超过此限制的长度都会被截断并填充零。在文本和图像的交互过程中，我们将交叉注意层设置为 8。选择的优化器是 Adam，CLIP 的学习率为

，其余组件的学习率为

。训练的批量大小设置为 32，并采用 0.3 的辍学率以防止过度拟合。

Table 1: Statistics of the dataset
表 1：数据集的统计数据

	Train 火车	Val 瓦尔	Test 测试
Positive 积极的	8642	959	959
Negative 阴性	11174	1451	1450
All 全部	19816	2410	2409

4.2 Baselines 4.2 基线

To verify that the proposed model in this paper can effectively improve the accuracy of sarcastic sentiment detection, we choose the following two categories of benchmark models as comparison models: uni-modal methods and multi-modal methods. we com pared it with single-modal models including TextCNN [11], SMSD [33], BERT [10], Image [2], and ViT [6]. Additionally, we evaluated it against multi-modal models such as HFM [2], D&R Net [35], Att-BERT [20], CMGCN [14], HKE [15], and SEF [46], considering their performance in capturing sarcastic cues from both textual and visual information.
为了验证本文提出的模型能否有效提高讽刺情感检测的准确性，我们选择了以下两类基准模型作为对比模型：单模态方法和多模态方法。我们将其与单模态模型进行了对比，包括 TextCNN [11]、SMSD [33]、BERT [10]、Image [2] 和 ViT [6]。此外，我们还将其与 HFM [2]、D&R Net [35]、Att-BERT [20]、CMGCN [14]、HKE [15] 和 SEF [46] 等多模态模型进行了评估，考虑了它们从文本和视觉信息中捕捉讽刺线索的性能。

4.3 Experimental Results(Q1)
4.3 实验结果（Q1）

In our experimental evaluation, we use four widely recognized evaluation metrics: Acc. (accuracy), Pre (precision), Rec (recall), and F1 (F1 score). The results in Table 2 demonstrate that our model performs best in the evaluation metrics. Additionally, more insights can be gained from these results.
在实验评估中，我们使用了四个广受认可的评估指标：Acc（准确度）、Pre（精确度）、Rec（召回率）和 F1（F1 分数）。表 2 中的结果表明，我们的模型在这些评价指标中表现最佳。此外，我们还可以从这些结果中获得更多启示。

(1) Whether it is the text modality or the image modality, the accuracy of sarcasm detection on the same dataset exceeds

, and in some cases, the accuracy of text modality methods even exceeds that of multi-modal methods, indicating the feasibility of sarcasm detection models relying on text features or image features, supporting our research motivation. At the same time, methods based on the text modality are superior to those based on the image modality, indicating that the expression of ironic/non-ironic information mainly exists in the text modality.
(1）无论是文本模态还是图像模态，在同一数据集上讽刺检测的准确率都超过了

，在某些情况下，文本模态方法的准确率甚至超过了多模态方法，这说明依赖文本特征或图像特征的讽刺检测模型是可行的，支持了我们的研究动机。同时，基于文本模态的方法优于基于图像模态的方法，这说明讽刺/非讽刺信息的表达主要存在于文本模态中。

(2) Multi-modal methods are superior to uni-modal baselines, indicating that aggregating information from image and text modalities is more effective for multi-modal sarcasm detection. Compared to models that only aggregate content from different modalities (i.e., HFM, DRNet, and Att-BERT), models that consider complex interactions between modalities (i.e., HKE and CMGCN) demon strate better performance, indicating that capturing multi-level interaction information is advantageous for detecting sarcasm in multi-modal contexts.
(2) 多模态方法优于单模态基线，这表明聚合图像和文本模态的信息对多模态讽刺检测更为有效。与仅聚合不同模态内容的模型（即 HFM、DRNet 和 Att-BERT）相比，考虑模态间复杂交互的模型（即 HKE 和 CMGCN）性能更佳，这表明捕捉多层次交互信息有利于检测多模态语境中的讽刺语言。

(3) Our proposed model outperforms existing baselines on all metrics. Compared to the latest SEF method, our proposed method has improved accuracy and F1 score, validating the effectiveness of our proposed model in multi-modal sarcasm detection.
(3) 我们提出的模型在所有指标上都优于现有基线。与最新的 SEF 方法相比，我们提出的方法提高了准确率和 F1 分数，验证了我们提出的模型在多模态讽刺检测中的有效性。

Table 2: Comparison results for sarcasm detection.
表 2：讽刺语言检测的比较结果。

	Model 模型	Acc(%)
Text 文本	TextCNN [11]	80.03	74.29	76.39	75.32
	SMSD[33]	80.90	76.46	75.18	75.82
	BERT[10]	83.85	78.72	82.27	80.22
Image 图片	Image[2] 图片[2］	64.76	54.41	70.80	61.53
Image 图片	ViT[6]	67.83	57.93	70.07	63.43
Multi-Modal 多种模式
	HFM [2]	83.44	76.57	84.15	80.18
	D&R Net [35]	84.02	77.97	83.42	80.60
	Att-BERT[20]	86.05	80.87	85.08	82.92
	CMGCN [14]	86.54	-	-	82.73
	HKE[15] HKE[15］	87.36	81.84	86.48	84.09
	SEF[46]	88.45	82.53	86.58	85.96
Ours 我们的	Ours 我们的

To further validate the effectiveness and efficiency of our proposed multi-modal fusion strategy, we compared the default global-local interaction method (GLIM) used by CAP with the local-local interaction method (LLIM)[29]. Unlike our approach, the core of LLIM is to explore the one-to-one local-local cross-modal interaction across all modalities by pairing each modality. We show the performance of the two fusion strategies in Fig.6 The performance of GLIM is superior to LLIM, validating the feasibility and effectiveness of using discourse-level representations of each modality as a global multi-modal context to interact with local uni-modal features.
为了进一步验证我们提出的多模态融合策略的有效性和效率，我们比较了 CAP 默认使用的全局-局部交互方法（GLIM）和局部-局部交互方法（LLIM）[29]。与我们的方法不同，LLIM 的核心是通过将每种模态配对，探索所有模态之间一对一的局部-局部跨模态交互。我们在图 6 中展示了两种融合策略的性能。GLIM 的性能优于 LLIM，这验证了使用每种模态的话语级表征作为全局多模态语境与局部单模态特征交互的可行性和有效性。

Figure 6: Comparison of two fusion strategies.
图 6：两种融合策略的比较。

To further illustrate the efficiency of GLIM relative to LLIM, we provide a practical complexity comparison. Since there are two input modalities, and both strategies have the same spatial complexity (each requiring 2 CAP in each fusion layer), this paper mainly evaluates the time complexity. In Table 3, we compare the computation measured in Multiply-Accumulate operations (MAC), the number of parameters, training time per epoch, and GPU memory usage during model training for both fusion strategies. We can observe that GLIM outperforms LLIM in all aspects. This is expected, as GLIM enjoys linear computational complexity on the involved modalities in practice, while LLIM incurs quadratic scaling costs.
为了进一步说明 GLIM 相对于 LLIM 的效率，我们进行了实际复杂度比较。由于有两种输入模式，而两种策略的空间复杂度相同（每个融合层都需要 2 个 CAP），因此本文主要评估时间复杂度。在表 3 中，我们比较了两种融合策略的乘累加运算（MAC）计算量、参数数量、每个历元的训练时间以及模型训练过程中 GPU 内存的使用情况。我们可以看到，GLIM 在所有方面都优于 LLIM。这在意料之中，因为在实践中，GLIM 在相关模态上具有线性计算复杂性，而 LLIM 则会产生二次扩展成本。

Table 3: Computational complexity analysis.
表 3：计算复杂性分析

Fusion Strategy 融合战略	Params(M) 参数(M)	Trainging Times(s) 培训时间	GPU (GB) 图形处理器（GB）
GLIM(Ours) GLIM（我们的）
LLIM	102	15.4	12.3

4.5 Ablation Study(Q3) 4.5 消融研究（第三季度）

An ablation experiment was conducted to demonstrate the contribution of each component in our model. Three variants of our model were used for comparison: (1) removing the text-task

; (2) removing the image-task

; (3) removing the text-image interaction task

. We use the same parameter settings for each model to ensure fair comparison. The results of the ablative study are shown in Table 3. When

, and

are removed, the accuracy decreases by

, and

, respectively. Then, our framework performs best when combining all these tasks, which indicates that all tasks in our framework contribute to the final
我们进行了一次消融实验，以证明我们模型中每个组成部分的贡献。我们使用了模型的三个变体进行比较：(1) 删除文本任务

；(2) 删除图像任务

；(3) 删除文本图像交互任务

。我们对每个模型使用相同的参数设置，以确保比较的公平性。表 3 显示了消减研究的结果。当删除

和

时，准确率分别下降了

和

。然后，我们的框架在结合所有这些任务时表现最佳，这表明我们框架中的所有任务都对最终结果做出了贡献。
performance. It is worth noting that removing

results in a significant performance drop, demonstrating the importance of modeling multi-modal interaction features for multi-modal sarcasm detection.
性能。值得注意的是，移除

会导致性能显著下降，这说明了多模态交互特征建模对于多模态讽刺检测的重要性。

Table 4: Experimental results of ablation study.
表 4：烧蚀研究的实验结果。

Model 模型	Acc(%)	F1(%)
Ours 我们的
	84.27	81.93
	86.41	82.92
	83.26	81.09

4.6 Visual Interpretation (Q4)
4.6 视觉解读（Q4）

To comprehensively demonstrate the effectiveness of our framework, we visualize the attention distribution of the visual encoder to illustrate why our framework is effective for multi-modal sarcasm detection. As illustrated in Fig.7, our model demonstrates effective capability in directing attention to pertinent regions within an image that contain cues indicative of sarcasm. For instance, in the first scenario, our model focuses on the image snippet depicting "a small hamburger," revealing the contradiction with the accompanying text's "huge and stuffy" sentiment. Similarly, in the subsequent scenario, the model considers the importance of the "bumpy road" as a critical cue, contradicting the concept of "spacious and smooth" conveyed in the text. These observations confirm the model's proficiency in discerning and highlighting the appropriate elements of the image, which form the basis of sarcasm instances in the accompanying text.
为了全面展示我们框架的有效性，我们将视觉编码器的注意力分布可视化，以说明我们的框架为何能有效地进行多模态讽刺检测。如图 7 所示，我们的模型能够有效地将注意力引导到图像中包含讽刺线索的相关区域。例如，在第一个场景中，我们的模型将注意力集中在描述 "一个小汉堡包 "的图像片段上，从而揭示了与所附文字 "巨大而闷热 "的情绪之间的矛盾。同样，在随后的场景中，模型将 "崎岖不平的道路 "视为重要线索，这与文本中传达的 "宽敞平坦 "的概念相矛盾。这些观察结果证实了模型在辨别和突出图像中适当元素方面的熟练程度，而这些元素构成了随附文本中讽刺事例的基础。

(a) Look at the size of this corn & spinach (b) How spacious and smooth the road is! burger, it 's so huge and stuffy. A long journey back home.
(a) 看看这玉米和菠菜的大小 (b) 道路多么宽敞平坦啊！汉堡包，又大又闷。漫漫回家路

Figure 7: Visualization of two sarcasm examples.Our model demonstrates effective capability in directing attention to pertinent regions within an image that contain cues indicative of sarcasm.
图 7：两个讽刺示例的可视化效果。我们的模型证明，它能有效地将注意力引导到图像中包含讽刺线索的相关区域。

4.7 Case study (Q5)
4.7 案例研究（问题 5）

To further validate the effectiveness of our approach, we selected two typical cases to verify the efficacy of multi-task training in the model. The sarcasm detection results of HKE [15], SEF [46], MTAN (ours), and two variants are shown in Fig. 8. The first case emphasizes integrating image and text features to capture inconsistency for sarcasm. The sarcasm implied by the inconsistency between the worn-out package in the image and the word "take care" in the text can be accurately predicted by multi-modal models (such as HFM and SEF) and our proposed method. On the other hand, without the interaction between image and text,

loses accuracy, demonstrating the importance of modeling multi-modal interaction features for multi-modal sarcasm detection. Additionally, In addition, the second scenario emphasizes the weak correlation in multi-modal data, where the "pleasant holiday" conveyed in the text aligns perfectly with the image description of "blue sky and white clouds," enforcing multi-modal fusion would lead to incorrect judgments. Our model exhibits excellent performance in addressing such scenarios, thanks to its ability to utilize multi-granularity cues from multiple tasks (i.e., text, image, and text-image interaction tasks) for multi-modal sarcasm detection.
为了进一步验证我们方法的有效性，我们选择了两个典型案例来验证模型中多任务训练的功效。图 8 显示了 HKE [15]、SEF [46]、MTAN（我们的方法）和两个变体的讽刺检测结果。第一种情况强调整合图像和文本特征来捕捉讽刺的不一致性。多模态模型（如 HFM 和 SEF）和我们提出的方法可以准确预测图像中的破旧包装与文本中的 "小心 "一词不一致所隐含的讽刺。另一方面，如果没有图像和文本之间的交互，

就会失去准确性，这说明了多模态交互特征建模对于多模态讽刺检测的重要性。此外，第二种情况强调了多模态数据中的弱相关性，即文本中传达的 "愉快的假期 "与图像中描述的 "蓝天白云 "完全一致，强制进行多模态融合会导致错误的判断。我们的模型能够利用来自多个任务（即文本、图像和文本-图像交互任务）的多粒度线索来进行多模态讽刺检测，因此在处理此类情况时表现出色。

Text and Image 文本和图像

Sarcasm 讽刺

感谢你们对我包裹的精心照料。

I would like to thank for

taking such good care of

my package.

今天我的脚扭了，这真是个愉快的假期！

It twisted my foot today,

it's such a enjoyable

holiday!

HFM

Sarcasm 讽刺

Non-Sarcasm 非讽刺

SEF

Sarcasm 讽刺

Non-Sarcasm 非讽刺

ours(MTAN) 我们的（MTAN）

Sarcasm 讽刺

MTAN w/o

不含

的 MTAN

Non-Sarcasm 非讽刺

MTAN w/o Tti MTAN 无 Tti

Sarcasm 讽刺

Figure 8: Representative sarcasm examples with their predictions in the case study.
图 8：案例研究中具有代表性的讽刺例子及其预测结果。

5 CONCLUSION 5 结束语

In this paper, we propose a multi-task joint training aggregation network for multi-modal sarcasm detection framework. Specifically, we design a multi-task CLIP framework that leverages multigranularity clues from multiple tasks (i.e., text, image, and textimage interaction tasks) for multi-modal sarcasm detection. Additionally, we designed a global-local visual-text interaction learning module, which effectively explores cross-modal interactions between global and local aspects, thereby improving performance. Finally, extensive experiments on the dataset show that this method achieves state-of-the-art performance in multi-modal sarcasm detection. bOur work offers valuable insights and explores fresh paradigms for cross- modal fusion in visual linguistic tasks. Future research should focus on further exploration of constructing multi-modal datasets and continuous refinement and optimization of network models to enhance performance while reducing complexity.
本文提出了一种用于多模态讽刺检测框架的多任务联合训练聚合网络。具体来说，我们设计了一个多任务 CLIP 框架，利用来自多个任务（即文本、图像和文本图像交互任务）的多粒度线索进行多模态讽刺检测。此外，我们还设计了一个全局-局部视觉-文本交互学习模块，该模块可有效探索全局和局部之间的跨模态交互，从而提高性能。最后，在数据集上进行的大量实验表明，这种方法在多模态挖苦检测方面达到了最先进的性能。 b 我们的工作为视觉语言任务中的跨模态融合提供了宝贵的见解并探索了新的范式。未来的研究重点应是进一步探索构建多模态数据集，并不断完善和优化网络模型，以在提高性能的同时降低复杂性。

ACKNOWLEDGMENTS 致谢

This work is supported by National Natural Science Foundation of China (Nos. 62276073, 61966004), Guangxi Natural Science Foundation, China (No. 2019GXNSFDA245018), Guangxi "Bagui Scholar" Teams for Innovation and Research Project, and Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing.
本研究得到国家自然科学基金（编号：62276073、61966004）、广西自然科学基金（编号：2019GXNSFDA245018）、广西 "八桂学者 "团队创新研究项目、广西多源信息集成与智能处理协同创新中心的资助。

REFERENCES 参考文献

[1] Nastaran Babanejad, Heidar Davoudi, Aijun An, and Manos Papagelis. 2020 Affective and contextual embedding for sarcasm detection. In Proceedings of the 28th international conference on computational linguistics. 225-243.
[1] Nastaran Babanejad, Heidar Davoudi, Aijun An, and Manos Papagelis.2020 用于讽刺检测的情感和语境嵌入。第 28 届计算语言学国际会议论文集》。225-243.

[2] Yitao Cai, Huiyu Cai, and Xiaojun Wan. 2019. Multi-modal sarcasm detection in twitter with hierarchical fusion model. In Proceedings of the 57th annual meeting of the association for computational linguistics. 2506-2515
[2] 蔡一涛、蔡慧玉、万晓军。2019.利用分层融合模型检测微博中的多模态讽刺。第 57 届计算语言学协会年会论文集》。2506-2515

[3] Bahzad Charbuty and Adnan Abdulazeez. 2021. Classification based on decision tree algorithm for machine learning. Journal of Applied Science and Technology Trends 2 (2021), 20-28
[3] Bahzad Charbuty and Adnan Abdulazeez.2021.基于决策树算法的机器学习分类。应用科学与技术趋势期刊2（2021），20-28

[4] Dushyant Singh Chauhan, SR Dhanush, Asif Ekbal, and Pushpak Bhattacharyya 2020. Sentiment and emotion help sarcasm? A multi-task learning framework for multi-modal sarcasm, sentiment and emotion analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 4351-4360.
[4] Dushyant Singh Chauhan, SR Dhanush, Asif Ekbal, and Pushpak Bhattacharyya 2020.情感和情绪有助于讽刺？用于多模态讽刺、情感和情绪分析的多任务学习框架。第 58 届计算语言学协会年会论文集》。4351-4360.

[5] Rahul Dey and Fathi M Salem. 2017. Gate-variants of gated recurrent unit (GRU) neural networks. In Proceedings of the 2017 IEEE 60th international midwest symposium on circuits and systems. 1597-1600.
[5] Rahul Dey and Fathi M Salem.2017.门控递归单元（GRU）神经网络的门控变体。In Proceedings of the 2017 IEEE 60th international midwest symposium on circuits and systems.1597-1600.

[6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xi aohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
[6] Alexey Dosovitskiy、Lucas Beyer、Alexander Kolesnikov、Dirk Weissenborn、Xi aohua Zhai、Thomas Unterthiner、Mostafa Dehghani、Matthias Minderer、Georg Heigold、Sylvain Gelly 等，2020 年。一幅图像胜过 16x16 个单词：ArXiv preprint arXiv:2010.11929 (2020).

[7] Aniruddha Ghosh and Tony Veale. 2017. Magnets for sarcasm: Making sarcasm detection timely, contextual and very personal. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 482-491.
[7] Aniruddha Ghosh and Tony Veale.2017.讽刺磁铁：让讽刺检测变得及时、语境化和个性化。自然语言处理经验方法 2017 年会议论文集》。482-491.

[8] Raymond W Gibbs. 1986. On the psycholinguistics of sarcasm. Fournal of experimental psychology: general 115 (1986), 3.
[8] Raymond W Gibbs.1986.论讽刺的心理语言学》。Fournal of experimental psychology: general 115 (1986), 3.

[9] Xueting Hu, Ce Zhang, Yi Zhang, Bowen Hai, Ke Yu, and Zhihai He. 2024. Learn ing to Adapt CLIP for Few-Shot Monocular Depth Estimation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 5594-5603.
[9] Xueting Hu，Ce Zhang，Yi Zhang，Bowen Hai，Ke Yu，and Zhihai He.2024.Learn ing to Adapt CLIP for Few-Shot Monocular Depth Estimation.IEEE/CVF 计算机视觉应用冬季会议论文集》。5594-5603.

[10] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. Bert Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 naacL-HLT, Vol. 1. 2.
[10] Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova.2019.Bert Pre-training of deep bidirectional transformers for language understanding.In Proceedings of the 2019 naacL-HLT, Vol.2.

[11] Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 1746-1751.
[11] Yoon Kim.2014.用于句子分类的卷积神经网络。2014年自然语言处理实证方法会议论文集》。1746-1751.

[12] Jan Koutnik, Klaus Greff, Faustino Gomez, and Juergen Schmidhuber. 2014. A clockwork rnn. In Proceedings of the 2014 International conference on machine learning. 1863-1871
[12] Jan Koutnik、Klaus Greff、Faustino Gomez 和 Juergen Schmidhuber。2014.钟表 RNN。2014年机器学习国际会议论文集》。1863-1871

[13] Bin Liang, Chenwei Lou, Xiang Li, Lin Gui, Min Yang, and Ruifeng Xu. 2021 Multi-modal sarcasm detection with interactive in-modal and cross-modal graphs. In Proceedings of the 29th ACM international conference on multimedia. 4707-4715.
[13] Bin Liang, Chenwei Lou, Xiang Li, Lin Gui, Min Yang, and Ruifeng Xu.2021 使用交互式模态和跨模态图的多模态讽刺检测。第 29 届 ACM 国际多媒体会议论文集》。4707-4715.

[14] Bin Liang, Chenwei Lou, Xiang Li, Min Yang, Lin Gui, Yulan He, Wenjie Pei, and Ruifeng Xu. 2022. Multi-modal sarcasm detection via cross-modal graph convo lutional network. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics. 1767-1777
[14] 梁斌、娄晨薇、李翔、杨敏、桂林、何玉兰、裴文杰、徐瑞峰。2022.通过跨模态图对话网络进行多模态讽刺检测。计算语言学协会第 60 届年会论文集》。1767-1777

[15] Hui Liu, Wenya Wang, and Haoliang Li. 2022. Towards Multi-Modal Sarcasm Detection via Hierarchical Congruity Modeling with Knowledge Enhancement. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. 4995-5006.
[15] Hui Liu, Wenya Wang, and Haoliang Li.2022.通过分层一致性建模与知识增强实现多模态挖苦检测。自然语言处理经验方法 2022 年会议论文集》。4995-5006.

[16] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2022. Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning. Neurocomputing 508 (2022), 293-304
[16] 罗怀绍、季磊、钟明、陈阳、雷雯、段楠、李天瑞。2022.Clip4clip：用于端到端视频片段检索和字幕的片段实证研究。神经计算 508 (2022)，293-304

[17] Krishanu Maity, Prince Jha, Sriparna Saha, and Pushpak Bhattacharyya. 2022 A multitask framework for sentiment, emotion and sarcasm aware cyberbully ing detection from multi-modal code-mixed memes. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval.

.
[17] Krishanu Maity、Prince Jha、Sriparna Saha 和 Pushpak Bhattacharyya。2022 从多模态混合代码备忘录中检测情感、情绪和讽刺意识网络欺凌的多任务框架。第 45 届 ACM SIGIR 信息检索研究与发展国际会议论文集》。

。

[18] Navonil Majumder, Soujanya Poria, Haiyun Peng, Niyati Chhaya, Erik Cam bria, and Alexander Gelbukh. 2019. Sentiment and sarcasm classification with multitask learning. IEEE Intelligent Systems 34 (2019), 38-43.
[18] Navonil Majumder、Soujanya Poria、Haiyun Peng、Niyati Chhaya、Erik Cam bria 和 Alexander Gelbukh。2019.多任务学习的情感和讽刺分类。IEEE Intelligent Systems 34 (2019)，38-43。

[19] Abhijit Mishra, Kuntal Dey, and Pushpak Bhattacharyya. 2017. Learning cog nitive features from gaze data for sentiment and sarcasm classification using convolutional neural network. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. 377-387.
[19] Abhijit Mishra、Kuntal Dey 和 Pushpak Bhattacharyya。2017.使用卷积神经网络从凝视数据中学习情感和讽刺分类的语义特征。第 55 届计算语言学协会年会论文集》。377-387.

[20] Hongliang Pan, Zheng Lin, Peng Fu, Yatao Qi, and Weiping Wang. 2020. Model ing intra and inter-modality incongruity for multi-modal sarcasm detection. In Findings of the Association for Computational Linguistics: EMNLP 2020. 1383-1392.
[20] Hongliang Pan, Zheng Lin, Peng Fu, Yatao Qi, and Weiping Wang.2020.多模态讽刺检测中的模态内和模态间不一致性。计算语言学协会论文集：EMNLP 2020。1383-1392.

[21] Libo Qin, Shijue Huang, Qiguang Chen, Chenran Cai, Yudi Zhang, Bin Liang, Wanxiang Che, and Ruifeng Xu. 2023. MMSD2. 0: Towards a Reliable Multimodal Sarcasm Detection System. In Findings of the Association for Computational Linguistics: ACL 2023. 10834-10845.
[21] Libo Qin，Shijue Huang，Qiguang Chen，Chenran Cai，Yudi Zhang，Bin Liang，Wanxiang Che，and Ruifeng Xu.2023.MMSD2.0：迈向可靠的多模态讽刺检测系统》。计算语言学协会论文集：ACL 2023.10834-10845.

[22] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In Proceedings of the 2021 International conference on machine learning. 8748-8763.
[22] Alec Radford、Jong Wook Kim、Chris Hallacy、Aditya Ramesh、Gabriel Goh、Sandhini Agarwal、Girish Sastry、Amanda Askell、Pamela Mishkin、Jack Clark 等，2021 年。从自然语言监督中学习可转移的视觉模型。2021 年国际机器学习大会论文集》。8748-8763.

[23] Ellen Riloff, Ashequl Qadir, Prafulla Surve, Lalindra De Silva, Nathan Gilbert, and Ruihong Huang. 2013. Sarcasm as contrast between a positive sentiment and negative situation. In Proceedings of the 2013 conference on empirical methods in natural language processing. 704-714.
[23] Ellen Riloff、Ashequl Qadir、Prafulla Surve、Lalindra De Silva、Nathan Gilbert 和 Ruihong Huang。2013.讽刺是积极情绪与消极情绪的对比。2013年自然语言处理经验方法会议论文集》。704-714.

[24] Suyash Sangwan, Md Shad Akhtar, Pranati Behera, and Asif Ekbal. 2020. I didn't mean what I wrote! Exploring Multimodality for Sarcasm Detection. In Proceedings of the 2020 International Foint Conference on Neural Networks. 1-8.
[24] Suyash Sangwan、Md Shad Akhtar、Pranati Behera 和 Asif Ekbal。2020.我不是故意的！探索多模态嘲讽检测。2020 年国际神经网络会议论文集》。1-8.

[25] Edoardo Savini and Cornelia Caragea. [n. d.]. A multi-task learning approach to sarcasm detection (student abstract). In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34.
[25] Edoardo Savini and Cornelia Caragea.[n. d.].讽刺检测的多任务学习方法（学生摘要）。In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34.

[26] Rossano Schifanella, Paloma De Juan, Joel Tetreault, and Liangliang Cao. 2016. Detecting sarcasm in multimodal social platforms. In Proceedings of the 24th ACM international conference on Multimedia. 1136-1145.
[26] Rossano Schifanella、Paloma De Juan、Joel Tetreault 和曹亮亮。2016.多模态社交平台中的讽刺检测》。第 24 届 ACM 国际多媒体会议论文集》。1136-1145.

[27] Yik Yang Tan, Chee-Onn Chow, Jeevan Kanesan, Joon Huang Chuah, and YongLiang Lim. 2023. Sentiment Analysis and Sarcasm Detection using Deep Multi-Task Learning. Wireless personal communications 129 (2023), 2213-2237.
[27] Yik Yang Tan, Chee-Onn Chow, Jeevan Kanesan, Joon Huang Chuah, and YongLiang Lim.2023.使用深度多任务学习进行情感分析和讽刺检测。无线个人通信 129（2023），2213-2237。

[28] Yi Tay, Anh Tuan Luu, Siu Cheung Hui, and Jian Su. 2018. Reasoning with Sarcasm by Reading In-Between. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. 1010-1020.
[28] Yi Tay、Anh Tuan Luu、Siu Cheung Hui 和 Jian Su.2018.通过夹缝阅读推理讽刺语言》。第 56 届计算语言学协会年会论文集》。1010-1020.

[29] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov. 2019. Multimodal transformer for unaligned multimodal language sequences. In Proceedings of the conference. Association for Computational Linguistics. Meeting, Vol. 2019. 6558.
[29] Yao-Hung Hubert Tsai, Shaojie Bai, Paul Pu Liang, J Zico Kolter, Louis-Philippe Morency, and Ruslan Salakhutdinov.2019.未对齐多模态语言序列的多模态转换器。In Proceedings of the conference.计算语言学协会。会议，2019 卷。6558.

[30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017).
[30] Ashish Vaswani、Noam Shazeer、Niki Parmar、Jakob Uszkoreit、Llion Jones、Aidan N Gomez、Łukasz Kaiser 和 Illia Polosukhin。2017.注意力就是你所需要的一切。神经信息处理系统进展 30 (2017).

[31] Chuhan Wu, Fangzhao Wu, Sixing Wu, Junxin Liu, Zhigang Yuan, and Yongfeng Huang. 2018. Thu_ngn at semeval-2018 task 3: Tweet irony detection with densely connected Istm and multi-task learning. In Proceedings of The 12th International Workshop on Semantic Evaluation. 51-56.
[31] Chuhan Wu、Fangzhao Wu、Sixing Wu、Junxin Liu、Zhigang Yuan 和 Yongfeng Huang.2018.Thu_ngn at semeval-2018 task 3: Tweet irony detection with densely connected Istm and multi-task learning.第 12 届语义评估国际研讨会论文集》。51-56.

[32] Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello. 2022. Wav2clip: Learning robust audio representations from clip. In Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing.

[32] Ho-Hsiang Wu, Prem Seetharaman, Kundan Kumar, and Juan Pablo Bello.2022.Wav2clip：从片段中学习稳健的音频表示。2022 年电气和电子工程师学会声学、语音和信号处理国际会议论文集》。

[33] Tao Xiong, Peiran Zhang, Hongbo Zhu, and Yihui Yang. 2019. Sarcasm detection with self-matching networks and low-rank bilinear pooling. In Proceedings of the 2019 world wide web conference. 2115-2124.
[33] Tao Xiong, Peiran Zhang, Hongbo Zhu, and Yihui Yang.2019.利用自匹配网络和低阶双线性集合进行讽刺检测。2019年世界互联网大会论文集。2115-2124.

[34] Fan Xu, Lei Zeng, Qi Huang, Keyu Yan, Mingwen Wang, and Victor S Sheng. 2024. Hierarchical graph attention networks for multi-modal rumor detection on social media. Neurocomputing 569 (2024), 127112.
[34] Fan Xu, Lei Zeng, Qi Huang, Keyu Yan, Mingwen Wang, and Victor S Sheng.2024.用于社交媒体多模态谣言检测的层次图注意力网络。Neurocomputing 569 (2024), 127112.

[35] Nan Xu, Zhixiong Zeng, and Wenji Mao. 2020. Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association. In Proceedings of the 58th annual meeting of the association for computational linguistics. 3777-3786.
[35] Nan Xu, Zhixiong Zeng, and Wenji Mao.2020.通过跨模态对比和语义关联建模推理多模态讽刺推文。第 58 届计算语言学协会年会论文集》。3777-3786.

[36] Yong Yu, Xiaosheng Si, Changhua Hu, and Jianxun Zhang. 2019. A review of recurrent neural networks: LSTM cells and network architectures. Neural computation 31 (2019), 1235-1270.
[36] Yong Yu, Xiaosheng Si, Changhua Hu, and Jianxun Zhang.2019.递归神经网络综述：LSTM 单元与网络架构.神经计算 31 (2019)，1235-1270.

[37] Lin Yue, Weitong Chen, Xue Li, Wanli Zuo, and Minghao Yin. 2019. A survey of sentiment analysis in social media. Knowledge and Information Systems 60 (2019),

.
[37] Lin Yue、Weitong Chen、Xue Li、Wanli Zuo 和 Minghao Yin。2019.社交媒体中的情感分析调查.知识与信息系统 60 (2019),

[38] Tan Yue, Rui Mao, Heng Wang, Zonghai Hu, and Erik Cambria. 2023. KnowleNet: Knowledge fusion network for multimodal sarcasm detection. Information Fusion 100 (2023), 101921
[38] Tan Yue, Rui Mao, Heng Wang, Zonghai Hu, and Erik Cambria.2023.KnowleNet：用于多模态讽刺检测的知识融合网络。信息融合 100 (2023)，101921

[39] Yufei Zeng, Zhixin Li, Zhenbin Chen, and Huifang Ma. 2023. Aspect-level sentiment analysis based on semantic heterogeneous graph convolutional network. Frontiers of Computer Science 17 (2023), 176340.
[39] Yufei Zeng，Zhixin Li，Zhenbin Chen，and Huifang Ma.2023.基于语义异构图卷积网络的方面级情感分析。计算机科学前沿 17 (2023)，176340。

[40] Yufei Zeng, Zhixin Li, Zhenbin Chen, and Huifang Ma. 2024. A feature-based restoration dynamic interaction network for multimodal sentiment analysis. Engineering Applications of Artificial Intelligence 127 (2024), 107335.
[40] Yufei Zeng，Zhixin Li，Zhenbin Chen，and Huifang Ma.2024.基于特征的多模态情感分析恢复动态交互网络.人工智能的工程应用 127 (2024)，107335.

[41] Yufei Zeng, Zhixin Li, Zhenjun Tang, Zhenbin Chen, and Huifang Ma. 2023. Heterogeneous graph convolution based on in-domain self-supervision for multimodal sentiment analysis. Expert Systems with Applications 213 (2023), 119240
[41] Yufei Zeng，Zhixin Li，Zhenjun Tang，Zhenbin Chen，and Huifang Ma.2023.基于域内自监督的异构图卷积用于多模态情感分析。专家系统与应用 213 (2023)，119240

[42] Meishan Zhang, Yue Zhang, and Guohong Fu. 2016. Tweet sarcasm detection using deep neural network. In Proceedings of the 26th International Conference on Computational Linguistics. 2449-2460.
[42] Meishan Zhang，Yue Zhang，and Guohong Fu.2016.使用深度神经网络检测鸣叫讽刺。第 26 届国际计算语言学大会论文集》。2449-2460.

[43] Xiaoqiang Zhang, Ying Chen, and Guangyuan Li. 2021. Multi-modal sarcasm detection based on contrastive attention mechanism. In Proceedings of the 10th Natural Language Processing and Chinese Computing. 822-833.
[43] Xiaoqiang Zhang, Ying Chen, and Guangyuan Li.2021.基于对比注意机制的多模态挖苦检测。第十届自然语言处理与中文计算大会论文集》。822-833.

[44] Yazhou Zhang, Yang Yu, Dongming Zhao, Zuhe Li, Bo Wang, Yuexian Hou, Prayag Tiwari, and Jing Qin. 2023. Learning multi-task commonness and uniqueness for multi-modal sarcasm detection and sentiment analysis in conversation. IEEE Transactions on Artificial Intelligence 5 (2023), 1349-1361.
[44] Yazhou Zhang, Yang Yu, Dongming Zhao, Zuhe Li, Bo Wang, Yuexian Hou, Prayag Tiwari, and Jing Qin.2023.对话中多模态讽刺检测和情感分析的多任务共性和独特性学习.IEEE Transactions on Artificial Intelligence 5 (2023)，1349-1361。

[45] Xuan Zhao, Jimmy Huang, and Haitian Yang. 2021. CANs: coupled-attention networks for sarcasm detection on social media. In Proceedings of the 2021 International foint Conference on Neural Networks. 1-8.
[45] Xuan Zhao, Jimmy Huang, and Haitian Yang.2021.CANs：用于社交媒体讽刺检测的耦合注意力网络。2021 年国际神经网络会议论文集》。1-8.

[46] Weiyu Zhong, Zhengxuan Zhang, Qiaofeng Wu, Yun Xue, and Qianhua Cai. 2024. A Semantic Enhancement Framework for Multimodal Sarcasm Detection. Mathematics 12 (2024), 317.
[46] 钟伟玉、张正轩、吴巧凤、薛云、蔡倩华。2024.多模态讽刺检测的语义增强框架。数学 12 (2024)，317.

Zhixin Li is the corresponding author
李志新为通讯作者

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org
允许为个人或课堂用途免费复制本作品的全部或部分内容的电子版或印刷版，但不得以营利或商业利益为目的制作或传播，且复制件的首页须注明本声明和完整的引文。必须尊重作者以外的其他人对本作品组成部分所拥有的版权。允许摘录并注明出处。如需复制、再版、在服务器上发布或在列表中重新发布，需事先获得特别许可和/或付费。请向 permissions@acm.org 申请许可。

ICMR '24, June 10-14, 2024, Phuket, Thailand
ICMR '24，2024 年 6 月 10-14 日，泰国普吉岛

(c) 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM ACM ISBN 979-8-4007-0619-6/24/06
(c) 2024 版权归所有者/作者所有。出版权授权给 ACM ACM ISBN 979-8-4007-0619-6/24/06

https://doi.org/10.1145/3652583.3658015

Modeling Multi-Task Joint Training of Aggregate Networks for Multi-Modal Sarcasm Detection 针对多模态讽刺检测的聚合网络多任务联合训练建模

Abstract 摘要

CCS CONCEPTS 综合传播战略概念

KEYWORDS 关键词

ACM Reference Format: ACM 参考格式：

1 INTRODUCTION 1 引言

2 RELATED WORK 2 相关工作

2.1 Multi-modal Sarcasm Detection2.1 多模态讽刺检测

2.2 Multi-Task Sarcasm Detection2.2 多任务讽刺检测

3 METHODOLOGY 3 方法

3.1 CLIP Model 3.1 CLIP 模型

3.2 Text Processing Task3.2 文本处理任务

3.3 Image Processing Task3.3 图像处理任务

3.4 Text-image Fusion Task3.4 文本图像融合任务

3.5 Multi-task Aggregation3.5 多任务聚合

4 EXPERIMENTS 4 实验

4.1 Datasets and Experimental Settings4.1 数据集和实验设置

4.2 Baselines 4.2 基线

4.3 Experimental Results(Q1)4.3 实验结果（Q1）

4.4 Multi-modal Fusion Strategy(Q2)4.4 多种模式融合战略（第二季度）

4.5 Ablation Study(Q3) 4.5 消融研究（第三季度）

4.6 Visual Interpretation (Q4)4.6 视觉解读（Q4）

4.7 Case study (Q5)4.7 案例研究（问题 5）

5 CONCLUSION 5 结束语

ACKNOWLEDGMENTS 致谢

REFERENCES 参考文献

Modeling Multi-Task Joint Training of Aggregate Networks for Multi-Modal Sarcasm Detection
针对多模态讽刺检测的聚合网络多任务联合训练建模

2.1 Multi-modal Sarcasm Detection
2.1 多模态讽刺检测

2.2 Multi-Task Sarcasm Detection
2.2 多任务讽刺检测

3.2 Text Processing Task
3.2 文本处理任务

3.3 Image Processing Task
3.3 图像处理任务

3.4 Text-image Fusion Task
3.4 文本图像融合任务

3.5 Multi-task Aggregation
3.5 多任务聚合

4.1 Datasets and Experimental Settings
4.1 数据集和实验设置

4.3 Experimental Results(Q1)
4.3 实验结果（Q1）

4.4 Multi-modal Fusion Strategy(Q2)
4.4 多种模式融合战略（第二季度）

4.6 Visual Interpretation (Q4)
4.6 视觉解读（Q4）

4.7 Case study (Q5)
4.7 案例研究（问题 5）