A Multi-task Framework based on Decomposition for Multimodal Named Entity Recognition
Chenran Cai, Qianlong Wang, Bing Qin and Ruifeng Xu
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), China
Peng Cheng Laboratory, China
Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, China
ARTICLE INFO
Keywords:
Multimodal Named Entity Recognition
Multi-task Framework
Entity Boundary Detection
Entity Category Classification
Abstract
Given a text-image pair, Multimodal Named Entity Recognition (MNER) is the task of identifying and categorizing entities in the text. Most existing work performs named entity labeling directly using final token representations derived by fusing image and text representations. Although these studies achieve promising results, they may fail to effectively exploit the text and image modalities, because they neglect the difference in the roles of the two modalities: the text modality can detect the boundary of an entity, while the image modality is introduced to disambiguate the category of the entity. Based on these findings, we construct two auxiliary tasks based on a decomposition strategy and propose a multi-task framework for MNER. Specifically, we first decompose MNER into two auxiliary tasks: an entity boundary detection task and an entity category classification task. The former takes only the text modality as input and outputs boundary labels, since text alone can achieve satisfactory boundary results. The latter uses both modalities to yield category labels, with the image modality dedicated to disambiguating categories. These two auxiliary tasks allow effective exploitation of the text and image modalities and put them back into their respective roles. Then, we vectorize their results so that label clues from the auxiliary tasks can improve entity recognition. Finally, we fuse features from the text and image modalities with label embeddings from the auxiliary tasks to perform MNER. Experimental results on two widely used MNER datasets show that our framework achieves new state-of-the-art (SOTA) performance.
1. Introduction
Multimodal Named Entity Recognition (MNER) aims to recognize entities of various categories (such as person and location, denoted as PER and LOC, respectively) in unstructured text with the help of additional image information (Zhang et al., 2018; Lu et al., 2018; Tian et al., 2021). Taking Figure 1a as an example, MNER requires identifying two entities in the text, "Lee Brice" (PER) and "Hard Rock Tulsa" (LOC), with the help of the corresponding image.
Unlike conventional NER (Konkol et al., 2015; Long et al., 2016; Augenstein et al., 2017; Goyal et al., 2018; Li et al., 2020; Suman et al., 2021; Hosseini et al., 2022; Liu et al., 2023; Li et al., 2024a; Mao et al., 2024), which relies solely on the text modality to carry out entity recognition, MNER typically first leverages the text and image modalities to derive the final token representations and then identifies entities. Existing MNER studies can be divided into two main groups according to how they obtain token representations. The first group consists of cross-modal interaction-based methods (Zhang et al., 2021a; Chen et al., 2022; Zhang et al., 2023; Wang et al., 2023; Ren et al., 2023; Liu et al., 2024). They often first employ the attention mechanism for cross-modal interaction to overcome the semantic discrepancy between different modalities. Then, they concatenate the interaction results with the textual representation to produce the final token representations for entity labeling. The second group consists of image conversion-based methods (Chen et al., 2021; Wang et al., 2022). Such methods enrich the token representations by incorporating textualized information derived from images, which is treated as an additional input to the text.
Despite their promising results, these studies neglect the difference in the roles of the text and image modalities in MNER, which may cause inefficient utilization of both modalities. For MNER, the text modality itself can detect the boundary of an entity and can even determine its category with weak confidence, while the image modality is usually introduced to disambiguate the entity category. As shown in Figure 1a, the text itself can give satisfactory results even without the image. In contrast, in Figure 1b, the text can only determine the entity boundary "Alibaba", but not whether its category is ORG or PER. In this case, the image modality can be used to help filter out irrelevant categories. Therefore, we believe that the role difference between the two modalities should be taken into account, rather than directly fusing the two by cross-modal interaction or image conversion to obtain the final token representations.
In this paper, we construct two auxiliary tasks based on the decomposition strategy and propose a multi-task framework for MNER. Specifically, we first decompose MNER into two auxiliary tasks, an entity boundary detection task and an entity category classification task, based on the characteristics of entity sequence labels. For entity boundary detection, we use only the text modality as input and output boundary labels, since text alone can achieve satisfactory boundary results. For entity category classification, we exploit the interactive features of the two modalities to derive the category labels, where the image modality helps recognize ambiguous entities that are insufficiently described by the text modality. Constructing these two auxiliary tasks allows the effective exploitation of the text and image modalities and puts them back into their respective roles. In addition, their labels can be treated as clues to improve the performance of MNER. Based on this, we then vectorize the outputs of the two auxiliary tasks via label embedding to apply their label clues in entity recognition. Finally, to better leverage the label clues for entity recognition, we adopt a multi-layer transformer that integrates features from the text and image modalities with label embeddings from the two auxiliary tasks. Through this fusion, we can inject the label clues into the interactive features to enhance the final token representations, thereby achieving MNER.
(a) [Lee Brice]_PER in concert Monday at [Hard Rock Tulsa]_LOC!
(b) I love [Alibaba].
Figure 1: Illustration of Multimodal Named Entity Recognition (MNER) on two examples. Entities are labeled with brackets, and the entity categories are indicated by the subscripts of the brackets.
This paper makes the following contributions:
We elaborate on the roles of the text and image modalities in MNER, which offers insights for subsequent studies to utilize both modalities effectively.
Based on the respective roles of the two modalities, we design two auxiliary tasks by decomposing labels to effectively leverage the two modalities and further introduce a multi-task framework to solve MNER.
We present a comprehensive evaluation of our framework on two popular MNER datasets. The experimental results show that our framework can achieve SOTA performance, verifying its effectiveness and superiority.
2. Related Work
2.1. Multimodal Named Entity Recognition
With the growing prevalence of multimodal social media data, MNER has attracted more research attention. Like other multimodal tasks, such as multimodal sentiment analysis and visual question answering, MNER requires designing approaches to fuse latent information from different modalities (Zhang et al., 2020, 2021b; Abdu et al., 2021; Zhu et al., 2023; Kim and Park, 2023; Nguyen et al., 2023; Li et al., 2024b). Most existing research can be broadly categorized into two groups: cross-modal interaction-based approaches and image conversion-based approaches. The first group focuses on cross-modal interaction and fusion of text and image representations, using the attention mechanism or a transformer. For instance, some studies (Zhang et al., 2018; Moon et al., 2018; Lu et al., 2018) utilize a CNN and an LSTM to encode the image and text modalities, respectively, and then use attention to fuse both modalities and generate multimodal representations for entity labeling. Yu et al. (2020) model the interaction between the two modalities via a transformer to directly handle MNER. To enhance MNER performance, some studies (Zhang et al., 2021a; Chen et al., 2022; Wang et al., 2023) not only model the global image-text relationship but also exploit the local semantic alignment between visual objects and textual tokens to obtain fine-grained token representations. To reduce visual object bias, Zhang et al. (2023) employ a de-bias method that implicitly aligns multimodal information and mitigates the bias induced by visual objects in the image. The second group (Chen et al., 2021; Wang et al., 2022) first converts images into textualized information, such as captions, that bridges the gap between the image and text modalities. These methods then concatenate the input text with the textualized information from the image and feed them into a pre-trained language model to perform MNER.
For MNER, we find that the text modality itself can effectively identify the entity boundaries, while the image modality is typically introduced to disambiguate the entity category. Based on this finding, we construct two auxiliary tasks (i.e., entity boundary detection and entity category classification), which leverage the features of different modalities. We then utilize the results of the two auxiliary tasks to enhance the performance of the final MNER entity labeling.
2.2. Pre-trained Models
Pre-trained models are trained on large unlabeled datasets using self-supervised objectives, and they can be categorized into two groups according to their inputs: pre-trained language models (PLMs) and pre-trained vision models (PVMs). PLMs, such as BERT (Devlin et al., 2019) and BERTweet (Nguyen et al., 2020), excel at various NLP tasks, e.g., named entity recognition. Inspired by PLMs, researchers have proposed a series of vision transformers (e.g., ViT (Dosovitskiy et al., 2021) and Swin (Liu et al., 2021)). These PVMs achieve significant improvements on a range of computer vision tasks, such as object detection (Zhao et al., 2019). Due to their powerful representation capabilities, we use BERT and ViT to encode the text and image modalities, respectively.
Figure 2: Overview of our proposed framework MFD. The text and image representation modules first extract features from both modalities. Then, two auxiliary tasks respectively predict the boundary and category of each entity. To utilize the results of the two auxiliary tasks, we vectorize the predicted boundary and category labels into label clues via label embedding. Finally, the MNER classification module fuses information from both modalities and the label clues to perform MNER.
3. Methodology
3.1. Task Definition
MNER aims to identify and categorize the entities in a given sentence with the help of an additional image. The entities are detected from the sentence, and each is assigned to one of the pre-defined categories. Given an input token sequence, we assign it a label sequence of the same length, where each label belongs to the pre-defined label set {O, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, B-MISC, I-MISC} under the BIO tagging schema (Sang and Veenstra, 1999).
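As a concrete illustration of this labeling scheme, the sketch below converts entity spans into BIO labels for the sentence of Figure 1a. The whitespace tokenization, the span format `(start, end_exclusive, category)`, and the helper name `spans_to_bio` are assumptions made for this example only.

```python
from typing import List, Tuple

# Hypothetical span format: (start index, exclusive end index, category).
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Turn entity spans into one BIO label per token."""
    labels = ["O"] * num_tokens
    for start, end, category in spans:
        labels[start] = f"B-{category}"      # first token of the entity
        for i in range(start + 1, end):
            labels[i] = f"I-{category}"      # remaining tokens of the entity
    return labels


# Assumed tokenization of the Figure 1a sentence.
tokens = ["Lee", "Brice", "in", "concert", "Monday", "at",
          "Hard", "Rock", "Tulsa", "!"]
spans = [(0, 2, "PER"), (6, 9, "LOC")]
labels = spans_to_bio(len(tokens), spans)

for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```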
3.2. Overall Architecture
As shown in Figure 2, we propose a Multi-task Framework based on the Decomposition strategy for MNER, which we denote as MFD. The framework contains four components: (1) the text representation module, which extracts a feature for each token in the text; (2) the image representation module, which consists of an object detection part and an image representation part and obtains object features from the image; (3) the auxiliary module, which includes the entity boundary detection task and the entity category classification task and utilizes label embedding to vectorize the results of the two tasks as label clues; (4) the MNER classification module, which fuses information from both the textual and visual modalities with the two label embeddings to perform MNER.
3.3. Text Representation Module
Given the input token sequence, we first utilize BERT to encode each token as a feature embedding, which yields an embedding matrix with one row per token. To align the dimensions of the representations across the text and image modalities, we then apply a trainable linear layer to this text representation, mapping each token vector into the hidden-state dimension shared by the text and image modalities.
3.4. Image Representation Module
Given an image, we first apply a trained bottom-up attention model (Anderson et al., 2018) to obtain object-level bounding boxes. We resize each object region from its original resolution to a fixed resolution of 224 × 224, keeping the channel dimension unchanged. As in Dosovitskiy et al. (2021), we convert each object region into a series of patches and prepend a learnable embedding token to the sequence of embedded patches. Then, we apply ViT and take the representation of this prepended token as the feature of the object region. We define the representation of the image as the aggregation of all object region features, combined with a position embedding matrix, where the object regions are sorted by ascending L2 distance from the center of the object box to the upper-left corner of the image. We then apply a trainable linear projection to transform the image representation into the hidden-state dimension shared with the text modality, which yields one vector per object region.
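The ordering of object regions described above can be sketched as follows. The box format `(x_min, y_min, x_max, y_max)` with the origin at the upper-left corner of the image, and the function name `sort_regions`, are assumptions made for illustration.

```python
import math
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max), origin at the image's upper-left corner.
Box = Tuple[float, float, float, float]


def center_distance(box: Box) -> float:
    """L2 distance from the box center to the upper-left corner (0, 0)."""
    x_min, y_min, x_max, y_max = box
    center_x = (x_min + x_max) / 2
    center_y = (y_min + y_max) / 2
    return math.hypot(center_x, center_y)


def sort_regions(boxes: List[Box]) -> List[Box]:
    """Order object regions by ascending distance of their centers from the corner."""
    return sorted(boxes, key=center_distance)


boxes = [(100, 100, 200, 200), (0, 0, 50, 50), (10, 300, 60, 400)]
ordered = sort_regions(boxes)
print(ordered)
```

A fixed ordering like this gives each region a stable index, which is what makes a position embedding over regions meaningful.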
3.5. Auxiliary Module
Most previous studies perform named entity labeling using final token representations, which are derived by fusing image and text representations. Here, they generally adopt the joint tagging schema of entity boundary and entity category (e.g., B-PER, I-PER). However, we observe that the text modality itself can provide satisfactory boundary results, and the image modality is typically introduced to disambiguate the entity category. Based on this finding, we decompose the joint tagging schema into two auxiliary tasks: entity boundary detection and entity category classification.
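The decomposition of a joint label sequence can be sketched as below, using the BIOES boundary schema and the category label set introduced in the two task descriptions that follow. The function name `decompose` and the exact conversion rules are an illustrative reading of the text, not code from the paper.

```python
from typing import List, Tuple


def decompose(joint: List[str]) -> Tuple[List[str], List[str]]:
    """Split joint BIO labels (e.g. B-PER) into BIOES boundary labels and category labels."""
    boundary: List[str] = []
    category: List[str] = []
    for i, label in enumerate(joint):
        if label == "O":
            boundary.append("O")
            category.append("O")
            continue
        prefix, cat = label.split("-", 1)
        # The entity continues if the next token is an inside token of the same category.
        continues = i + 1 < len(joint) and joint[i + 1] == f"I-{cat}"
        if prefix == "B":
            boundary.append("B" if continues else "S")  # S marks a single-token entity
        else:
            boundary.append("I" if continues else "E")  # E marks the last token
        category.append(cat)
    return boundary, category


joint_labels = ["B-PER", "I-PER", "O", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]
boundary_labels, category_labels = decompose(joint_labels)
print(boundary_labels)
print(category_labels)
```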
Entity Boundary Detection. This task aims to detect the boundaries of entities in the input text. We formulate it as a sequence labeling problem and adopt the BIOES tagging schema, which emphasizes the concepts of start and end. Each token receives a boundary label from {B, I, O, E, S}: an entity of length greater than one starts with B and ends with E, I marks a token inside the entity, O is the label of non-entity tokens, and S labels an entity consisting of a single token. Specifically, as shown in the Entity Boundary Detection section of Figure 2, we apply a linear layer to the text features to predict boundary labels and calculate the boundary loss using the cross-entropy loss function over the boundary label set.
Entity Category Classification. The entity category classification task aims to classify the category of each token in the input text. Each token receives a category label from {PER, LOC, ORG, MISC, O}. Since the text modality alone can cause category ambiguity, which needs to be eliminated using the image modality, we exploit both the text and image modalities for this task.
However, some object regions in images are irrelevant to the text, and we should ignore them to reduce noise. To address this, we design a Multimodal Interaction part, which adopts a multi-head attention mechanism (Vaswani et al., 2017) to enhance the image representations with the guidance of the associated text. As shown in the Entity Category Classification section of Figure 2, given the text representation and the image representation, we apply a multi-head cross-modal attention mechanism, with the attention scores divided by a scaling factor, to derive text-aware image representations. Then, we feed the text-aware image representations and the text representation into a linear layer to predict category labels and calculate the category loss using the cross-entropy loss function over the category label set.
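A minimal single-head sketch of scaled dot-product cross-modal attention is given below. It assumes that queries come from text tokens and that keys and values come from object regions, so that each token receives a weighted mix of region features. It omits the learned projections and the multiple heads of the full mechanism; all names are illustrative.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(text: List[Vector], regions: List[Vector]) -> List[Vector]:
    """For each text token (query), mix region vectors (keys and values) by attention weight."""
    dim = len(text[0])
    scale = math.sqrt(dim)  # scaling factor of scaled dot-product attention
    outputs = []
    for query in text:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in regions]
        weights = softmax(scores)
        mixed = [sum(w * region[j] for w, region in zip(weights, regions))
                 for j in range(dim)]
        outputs.append(mixed)
    return outputs


text_tokens = [[1.0, 0.0], [0.0, 1.0]]
object_regions = [[10.0, 0.0], [0.0, 10.0]]
text_aware = cross_modal_attention(text_tokens, object_regions)
print(text_aware)
```

Regions whose features do not match a token receive small attention weights, which is how irrelevant object regions are down-weighted.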
After obtaining the results of the two auxiliary tasks, we could obtain joint entity labels by simply combining the two label sequences to solve MNER. However, this combination fails to consider the correlation between labels and can produce invalid labels such as O-PER, which hurts the performance of the framework.
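The problem with naive combination can be seen in a small sketch. The helper `combine` below is hypothetical: it pairs a boundary label with a category label and returns `None` when the pair is inconsistent, such as O-PER.

```python
from typing import List, Optional


def combine(boundary: str, category: str) -> Optional[str]:
    """Pair a BIOES boundary label with a category label; None marks an invalid pair."""
    if boundary == "O" and category == "O":
        return "O"
    if boundary == "O" or category == "O":
        return None  # e.g. O-PER: a category without a boundary, or a boundary without a category
    return f"{boundary}-{category}"


def combine_sequences(boundaries: List[str], categories: List[str]) -> List[Optional[str]]:
    """Combine two independently predicted label sequences token by token."""
    return [combine(b, c) for b, c in zip(boundaries, categories)]


# The third token is predicted as a non-entity by the boundary task
# but as PER by the category task, so the combined label is invalid.
combined = combine_sequences(["B", "E", "O"], ["PER", "PER", "PER"])
print(combined)
```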
Label Embedding. To utilize the results of the two auxiliary tasks to enhance the final entity recognition performance, we vectorize the boundary labels and category labels via label embedding. For boundary label embedding, we first assign one trainable embedding to each label in the boundary label set. Then, we initialize these embeddings with word embeddings from the BERT vocabulary that have the same semantics as the boundary labels. For instance, we utilize the word embeddings of begin and end to initialize the B and E labels, respectively. Finally, we select the corresponding boundary label embedding for each token to obtain the boundary label embedding sequence, which contains one vector per token. For category label embedding, similar to boundary label embedding, we set trainable embeddings and initialize the PER, LOC, ORG, MISC, and O labels with the word embeddings of person, location, organization, other, and none, respectively. Then, we obtain the category label embedding sequence by selecting the corresponding category label embedding for each token. It is worth noting that we construct both label embedding sequences using the ground-truth labels during the training phase and apply the predicted labels of the two auxiliary tasks during the inference phase.

Table 1
Statistics of two MNER datasets.

| Entity Category | Twitter-2015 Train | Twitter-2015 Dev | Twitter-2015 Test | Twitter-2017 Train | Twitter-2017 Dev | Twitter-2017 Test |
|---|---|---|---|---|---|---|
| Organization | 928 | 247 | 839 | 1,674 | 375 | 395 |
| Person | 2,217 | 552 | 1,816 | 2,943 | 626 | 621 |
| Location | 2,091 | 522 | 1,697 | 731 | 173 | 178 |
| Miscellaneous | 940 | 225 | 726 | 701 | 150 | 157 |
| Total | 6,176 | 1,546 | 5,078 | 6,049 | 1,324 | 1,351 |
| Number of Samples | 4,000 | 1,000 | 3,257 | 3,373 | 723 | 723 |
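The label embedding step can be sketched with plain dictionaries. The two-dimensional `toy_word_vectors` stand in for BERT word embeddings and are made-up values; the function names are likewise illustrative. The label-to-word mapping for categories follows the text above.

```python
from typing import Dict, List, Sequence

Vector = List[float]

# Words whose embeddings initialise the category label embeddings, as stated in the text.
CATEGORY_INIT_WORDS = {
    "PER": "person",
    "LOC": "location",
    "ORG": "organization",
    "MISC": "other",
    "O": "none",
}


def init_label_embeddings(word_vectors: Dict[str, Vector]) -> Dict[str, Vector]:
    """Copy word vectors into a label embedding table (copies can be trained separately)."""
    return {label: list(word_vectors[word]) for label, word in CATEGORY_INIT_WORDS.items()}


def label_embedding_sequence(table: Dict[str, Vector],
                             gold_labels: Sequence[str],
                             predicted_labels: Sequence[str],
                             training: bool) -> List[Vector]:
    """Look up one embedding per token: gold labels in training, predictions at inference."""
    labels = gold_labels if training else predicted_labels
    return [table[label] for label in labels]


# Toy stand-ins for BERT word embeddings (assumed values, for illustration only).
toy_word_vectors = {
    "person": [1.0, 0.0],
    "location": [0.0, 1.0],
    "organization": [1.0, 1.0],
    "other": [0.5, 0.5],
    "none": [0.0, 0.0],
}
table = init_label_embeddings(toy_word_vectors)
gold = ["PER", "PER", "O"]
predicted = ["ORG", "ORG", "O"]
train_seq = label_embedding_sequence(table, gold, predicted, training=True)
infer_seq = label_embedding_sequence(table, gold, predicted, training=False)
print(train_seq)
print(infer_seq)
```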
3.6. MNER Classification Module
To inject the label clues from both auxiliary tasks into the interactive features and improve the final token representations, we utilize a multi-layer transformer to fuse, by concatenation, the multimodal interaction representation, the text representation, the boundary label embeddings, and the category label embeddings. We then pass the fused feature representation to a linear layer that predicts MNER labels. Finally, we calculate the MNER loss by applying the cross-entropy loss function over the MNER joint label set. We train the entire framework in an end-to-end manner by combining the loss functions of the three tasks: MNER classification, entity boundary detection, and entity category classification. Two weighting hyper-parameters control the contributions of the boundary loss and the category loss towards MNER, respectively.
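One plausible form of the combined objective is a weighted sum of the three cross-entropy losses, sketched below. The weighted-sum form, the averaging over tokens, and the names `cross_entropy` and `total_loss` are assumptions. The default weights of 1 follow the experimental settings.

```python
import math
from typing import List


def cross_entropy(probabilities: List[List[float]], gold_indices: List[int]) -> float:
    """Mean negative log-likelihood of the gold label at each token (averaging is assumed)."""
    losses = [-math.log(token_probs[gold])
              for token_probs, gold in zip(probabilities, gold_indices)]
    return sum(losses) / len(losses)


def total_loss(mner_loss: float,
               boundary_loss: float,
               category_loss: float,
               boundary_weight: float = 1.0,
               category_weight: float = 1.0) -> float:
    """Weighted sum of the main MNER loss and the two auxiliary losses."""
    return mner_loss + boundary_weight * boundary_loss + category_weight * category_loss


# Two tokens with a two-label toy distribution each.
toy_probs = [[0.5, 0.5], [0.25, 0.75]]
toy_gold = [0, 1]
toy_ce = cross_entropy(toy_probs, toy_gold)
print(toy_ce)
print(total_loss(1.0, 0.5, 0.25))
```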
4. Experiments
4.1. Dataset
We evaluate our framework on two public MNER benchmark datasets: Twitter-2015 and Twitter-2017, constructed by Zhang et al. (2018) and Lu et al. (2018), respectively. Each sample in these datasets comprises a text and a corresponding image. Table 1 shows the statistics of these two MNER datasets.
4.2. Experimental Settings
The maximum number of object regions is 8, and objects with a confidence greater than 0.5 are identified during the object detection process. We apply the BERT-base-uncased model and a ViT model to embed each token and each object region as a 768-dimensional embedding, respectively. We set the hyper-parameters as follows: the two auxiliary loss weights are both 1, and the number of transformer layers is 5. We apply different learning rates for different datasets and framework components: on each dataset, BERT uses one learning rate while the rest of the framework uses another. The model is trained for 15 epochs on both datasets and uses AdamW as the optimizer. Following previous work (Zhang et al., 2018, 2023), we calculate three metrics, Precision (P), Recall (R), and F1, with F1 being the primary measure of model performance.
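The three metrics can be computed at the entity level as in the sketch below. It assumes exact-match scoring over BIO-labeled sequences, where a predicted entity counts as correct only if both its span and its category match a gold entity. This is the usual convention for NER benchmarks, but the text does not spell it out. Function names are illustrative.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, exclusive end, category)


def extract_spans(labels: List[str]) -> Set[Span]:
    """Collect entity spans from BIO labels; I- tags without a matching B- are ignored."""
    spans: Set[Span] = set()
    start, category = None, None
    for i, label in enumerate(labels + ["O"]):  # sentinel closes a trailing entity
        inside = label.startswith("I-") and category == label[2:]
        if start is not None and not inside:
            spans.add((start, i, category))
            start, category = None, None
        if label.startswith("B-"):
            start, category = i, label[2:]
    return spans


def precision_recall_f1(gold: List[List[str]],
                        predicted: List[List[str]]) -> Tuple[float, float, float]:
    """Entity-level precision, recall, and F1 over a list of sentences."""
    n_gold = n_pred = n_correct = 0
    for gold_labels, pred_labels in zip(gold, predicted):
        gold_spans = extract_spans(gold_labels)
        pred_spans = extract_spans(pred_labels)
        n_gold += len(gold_spans)
        n_pred += len(pred_spans)
        n_correct += len(gold_spans & pred_spans)
    precision = n_correct / n_pred if n_pred else 0.0
    recall = n_correct / n_gold if n_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


gold = [["B-PER", "I-PER", "O", "B-LOC"]]
predicted = [["B-PER", "I-PER", "O", "B-ORG"]]
scores = precision_recall_f1(gold, predicted)
print(scores)
```

Since F1 is the harmonic mean of precision and recall, reported rows can be cross-checked: the MFD precision and recall on Twitter-2015 in Table 2 (77.59 and 76.52) give an F1 of 77.05, the value reported in the same row.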
4.3. Comparison Models
We conduct an extensive comparison of our framework with several competitive methods that use either the text modality alone or multiple modalities. The methods are as follows:
1) Text-modality Methods:
BiLSTM-CRF (Huang et al., 2015), which predicts token labels based on the combination of a BiLSTM and a CRF layer.
HBiLSTM-CRF (Lample et al., 2016), which adopts a hierarchical and bidirectional LSTM and a CRF to obtain token representations.
CNN-BiLSTM-CRF (Ma and Hovy, 2016), which enhances BiLSTM-CRF by introducing a CNN-based character representation of each token.
BERT (Devlin et al., 2019), which employs the BERT model for token representation.
Table 2
Main experimental results. Some baseline results are obtained from Yu et al. (2020), and the rest are from the respective papers. We use boldface to indicate the best score. Significance tests indicate that our framework MFD outperforms the baseline approaches in terms of F1. A dash marks a score that is not reported, and an empty cell marks a missing value.

| Modality | Model | Twitter-2015 P | Twitter-2015 R | Twitter-2015 F1 | Twitter-2017 P | Twitter-2017 R | Twitter-2017 F1 |
|---|---|---|---|---|---|---|---|
| Text-modality | BiLSTM-CRF (Huang et al., 2015) | 68.14 | 61.09 | 64.42 | 79.42 | 73.43 | 76.31 |
| Text-modality | HBiLSTM-CRF (Lample et al., 2016) | 70.32 | 68.05 | 69.17 | 82.69 | 78.16 | 80.37 |
| Text-modality | CNN-BiLSTM-CRF (Ma and Hovy, 2016) | 66.24 | 68.09 | 67.15 | 80.00 | 78.76 | 79.37 |
| Text-modality | BERT (Devlin et al., 2019) | 68.30 | 74.61 | 71.32 | 82.19 | 83.72 | 82.95 |
| Text-modality | BERT-CRF | 69.22 | 74.59 | 71.81 | 83.32 | 83.57 | 83.44 |
| Multi-modality | VG-ATT (Lu et al., 2018) | 73.96 | 67.90 | 70.80 | 83.41 | 80.38 | 81.87 |
| Multi-modality | Ada-Co-ATT (Zhang et al., 2018) | 69.87 | 74.59 | 72.15 | 85.13 | 83.20 | 84.10 |
| Multi-modality | UMT (Yu et al., 2020) | 71.67 | 75.23 | 73.41 | 85.28 | 85.34 | 85.31 |
| Multi-modality | UMGF (Zhang et al., 2021a) | 74.49 | 75.21 | 74.85 | 86.54 | 84.50 | 85.51 |
| Multi-modality | ITA (Wang et al., 2022) | - | - | | - | - | 86.45 |
| Multi-modality | HVPNeT (Chen et al., 2022) | 73.87 | **76.82** | 75.32 | 85.84 | **87.93** | |
| Multi-modality | M3S (Wang et al., 2023) | | 75.14 | 75.03 | 86.93 | 85.21 | 86.06 |
| Multi-modality | DebiasCL (Zhang et al., 2023) | 74.45 | 76.13 | 75.28 | **87.59** | 86.11 | 86.84 |
| Multi-modality | MFD (Ours) | **77.59** | 76.52 | **77.05** | 87.28 | 87.34 | **87.31** |
BERT-CRF, which is a variant of BERT that adds a CRF layer for decoding.
2) Multi-modality Methods:
VG-ATT (Lu et al., 2018), which applies visual attention to obtain word-aware visual representations and utilizes the HBiLSTM-CRF for MNER.
Ada-Co-ATT (Zhang et al., 2018), which combines text and image information to identify entities using an adaptive co-attention network.
UMT (Yu et al., 2020), which employs a multimodal interaction transformer to obtain the relationship between the text and image modalities.
UMGF (Zhang et al., 2021a), which applies a multimodal graph framework to obtain the relationships between tokens and visual objects to complete MNER.
ITA (Wang et al., 2022), which extracts image captions, optical characters, and object names from images as auxiliary visual context to enhance the entity labeling performance.
HVPNeT (Chen et al., 2022), which uses a visual prefix prompt method for visual-enhanced entity recognition and error-insensitive forecasting decisions.
M3S (Wang et al., 2023), which applies a scene graph-driven multi-task and multi-granularity method to effectively capture text and image information for MNER.
DebiasCL (Zhang et al., 2023), which employs a de-bias method to implicitly align multimodal information and mitigate the bias induced by visual objects in the image.
4.4. Main Results
Table 2 summarizes the main experimental results and reveals the following observations:
(1) In terms of F1, MFD obtains the best performance on both MNER datasets, with F1 scores of 77.05 on Twitter-2015 and 87.31 on Twitter-2017, surpassing ITA, HVPNeT, M3S, and DebiasCL on both. The reason is that MFD puts the text and image modalities into their respective roles and explicitly models the boundary advantage of texts and the disambiguation effect of images via label embedding, thus effectively enhancing MNER performance.
(2) Among all text-modality methods, BERT demonstrates superior performance over the other approaches, achieving a significant improvement on both datasets. For instance, BERT-CRF surpasses BiLSTM-CRF by 7.39 and 7.13 F1 points on Twitter-2015 and Twitter-2017, respectively. Similarly, recent multi-modality approaches (i.e., ITA, HVPNeT, M3S, and DebiasCL) all employ BERT as the text encoder. Compared with Ada-Co-ATT, the best-performing LSTM-based method, the BERT-based HVPNeT model achieves a significant F1 improvement on both datasets (e.g., 75.32 versus 72.15 on Twitter-2015). These results indicate that token representations produced by a stronger pre-trained encoder are very beneficial for both NER and MNER tasks, as they can effectively capture the context information.
Table 3
Ablation study. The ablated components are the training objective of the entity boundary detection task, of the entity category classification task, and of both tasks; the initialization of label embeddings from word embeddings; the training objective of the MNER classification task; and the object detection step.
Model
Twitter-2015
Twitter-2017
P
R
F 1
P
R
F 1
MFD
87.34
76.58
76.17
76.37
85.98
86.68
86.33
76.23
75.78
76.00
85.76
86.45
86.10
73.92
74.57
74.24
87.07
86.23
86.65
74.87
76.04
75.45
86.15
86.08
86.12
76.14
75.29
75.71
86.45
86.90
86.67
77.39
75.61
76.49
86.92
87.24
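Table 4
Boundary and category analysis experiments. For each dataset, the three columns report the boundary F1 score, the category F1 score, and the MNER F1 score, respectively. Results come from re-running open-source code.

| Modality | Model | Twitter-2015 Boundary F1 | Twitter-2015 Category F1 | Twitter-2015 MNER F1 | Twitter-2017 Boundary F1 | Twitter-2017 Category F1 | Twitter-2017 MNER F1 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Text-modality | BERT | 82.42 | 91.49 | 75.40 | 90.40 | 93.91 | 84.89 |
| Multi-modality | UMT | 80.83 | 89.45 | 72.30 | 89.63 | 94.73 | 84.91 |
| Multi-modality | UMGF | 81.16 | 89.43 | 72.58 | 88.45 | 94.46 | 83.55 |
| Multi-modality | HVPNeT | 81.64 | 91.66 | 74.84 | 91.32 | 94.51 | 86.31 |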
(3) The image can effectively enhance model performance on MNER. For instance, vG-ATT, which has the same text encoder as HBiLSTM-CRF, outperforms the latter in F1 on both datasets. Similarly, Ada-Co-ATT shares its text encoder with CNN-BiLSTM-CRF but achieves a higher F1 on both datasets. The performance gain from the image varies across models. These results indicate that image information is beneficial for MNER, and that incorporating it in the right way to improve performance remains a challenge.
(4) The results of the significance tests show that MFD outperforms the baseline methods by a statistically significant margin on both datasets.
4.5. Ablation Study
To examine the influence of the auxiliary tasks on MNER performance, we individually remove the training objective of the entity boundary detection task, of the entity category classification task, and of both tasks. Table 3 shows that MFD obtains the best performance when both objectives are kept, which indicates that both tasks contribute to the final performance. Removing the category classification objective degrades performance more than removing the boundary detection objective, which highlights the usefulness of entity category classification for MNER.
Furthermore, we perform ablation studies on how label embeddings are initialized and how the MNER joint labels are obtained. Compared with random initialization, initializing label embeddings from the original word embeddings contributes more to performance. This is because the prior knowledge in word embeddings facilitates the transfer of label clues from the two auxiliary tasks to the final task. The variant without the MNER classification objective solves MNER by combining the two label sequences from the auxiliary tasks, which degrades performance because the correlation between joint labels is not considered. To analyze the impact of object detection, we remove the object detection step and feed the raw image directly into the vision model. This variant yields lower F1 scores on both datasets than the full framework (Table 3), which indicates that using the objects in the image rather than the entire image is advantageous. The rationale is that restricting the input to detected objects eliminates redundant image information.
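The ablation that combines the two auxiliary label sequences can be illustrated with a minimal sketch. It assumes BIO-style boundary tags and per-token category tags; the function name `merge_labels` and the tag inventory are hypothetical and are not taken from the framework's code.

```python
from typing import List


def merge_labels(boundary: List[str], category: List[str]) -> List[str]:
    """Combine per-token boundary tags (B/I/O) with per-token category tags."""
    if len(boundary) != len(category):
        raise ValueError("label sequences must have the same length")
    joint = []
    for b, c in zip(boundary, category):
        # A token that either task marks as outside an entity stays "O".
        if b == "O" or c == "O":
            joint.append("O")
        else:
            joint.append(f"{b}-{c}")
    return joint


# Consistent auxiliary predictions merge cleanly.
print(merge_labels(["O", "B", "I", "O"], ["O", "PER", "PER", "O"]))
# Inconsistent predictions yield an ill-formed entity, because each
# position is merged independently of its neighbours.
print(merge_labels(["B", "I"], ["PER", "LOC"]))
```

The second call returns `B-PER` followed by `I-LOC`, an entity whose tokens disagree on the category. This is one way to read the remark that direct combination ignores the correlation between joint labels.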
4.6. Analysis
4.6.1. Boundary and Category Analysis
We conduct boundary and category analysis on both MNER datasets by evaluating text-modality and multi-modality methods. For boundary performance, we disregard the category predictions of the models and calculate the F1 score over entity boundaries. For category performance, we only consider predicted entities with completely correct boundaries when calculating the F1 score. Table 4 shows that the text-only model (i.e., BERT) achieves satisfactory results for boundary detection and even outperforms some multimodal models on the boundary score. The advantage of multimodal models lies in their ability to disambiguate entity categories, which is reflected in the category score. For instance, on the Twitter-2017 dataset, BERT exhibits a higher boundary score than UMT (90.40 vs. 89.63) but a lower category score (93.91 vs. 94.73). These results demonstrate that the image information primarily helps to disambiguate the entity category, while the text itself can provide satisfactory boundary detection results.
Figure 3: F1 scores for different values of the two hyperparameters and different transformer layer numbers.
Table 5
Impact of different PVMs. The MFD variant replaces the vision encoder (i.e., ViT) with ResNet-50.
Model
Twitter-2015
Twitter-2017
P
R
F1
P
R
F1
MFD
75.01
76.04
86.44
86.82
86.63
MFD
76.52
Table 6
Impact of different PLMs. The MFD variant replaces the language encoder (i.e., BERT) with BERTweet.
Model
Twitter-2015
Twitter-2017
P
R
F1
P
R
F1
MFD
77.59
76.52
77.05
87.28
87.34
87.31
MFD
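The boundary and category scores of Section 4.6.1 can be sketched as follows. This is one plausible reading, not the authors' evaluation script: entities are assumed to be `(start, end, category)` triples, and the category score is computed as the share of boundary-correct predictions whose category is also correct (under this reading precision and recall coincide, so it equals the F1). The function names are hypothetical.

```python
from typing import Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, category), token offsets


def boundary_f1(pred: Set[Entity], gold: Set[Entity]) -> float:
    """Span-level F1 that ignores the predicted category."""
    pred_spans = {(s, e) for s, e, _ in pred}
    gold_spans = {(s, e) for s, e, _ in gold}
    tp = len(pred_spans & gold_spans)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_spans)
    recall = tp / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


def category_score(pred: Set[Entity], gold: Set[Entity]) -> float:
    """Category correctness restricted to predictions with exact boundaries."""
    gold_type = {(s, e): t for s, e, t in gold}
    matched = [(s, e, t) for s, e, t in pred if (s, e) in gold_type]
    if not matched:
        return 0.0
    correct = sum(1 for s, e, t in matched if gold_type[(s, e)] == t)
    return correct / len(matched)


gold = {(0, 1, "PER"), (3, 4, "LOC"), (6, 8, "ORG")}
pred = {(0, 1, "PER"), (3, 4, "ORG"), (6, 7, "ORG")}
print(round(boundary_f1(pred, gold), 3))  # two of three spans match
print(category_score(pred, gold))         # one of the two matched spans is typed correctly
```

In this toy example the span `(6, 7)` counts against the boundary score only, while the mistyped span `(3, 4)` counts against the category score only, which is the separation the analysis relies on.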
4.6.2. Impact of Different Pre-trained Models
We compare the performance of different PVMs and PLMs to examine the impact of pre-trained models on MFD. Experimental results with different PVMs and PLMs are presented in Tables 5 and 6, respectively. For pre-trained vision models, we introduce a variant of our framework that replaces the vision encoder (i.e., ViT) with ResNet-50 while keeping the other modules unchanged. We observe that MFD achieves a higher F1 than this variant on both MNER datasets. This finding implies that the ViT model provides better image feature representations for MNER.
For pre-trained language models, we propose a variant that replaces the language encoder with BERTweet, a language model pre-trained on English tweets. We observe that this variant surpasses MFD on both MNER datasets. This result suggests that a pre-trained language encoder with suitable prior knowledge can further enhance the performance of our framework.
4.6.3. Impact of the Hyperparameters in Eq. 17
We perform experiments with different values of the two hyperparameters in Eq. 17 to explore their impact on MFD. The results are presented in Figures 3a and 3b. We vary one hyperparameter at a time: when one of them is treated as the variable, the other is fixed at 1.0. In Figure 3a, the F1 generally increases as the value of the first hyperparameter increases, and MFD obtains the highest performance on both MNER datasets when the value equals 1.0. In Figure 3b, the second hyperparameter displays a similar trend. This trend indicates that the two auxiliary tasks have a positive effect on the final MNER task. We also attempt to increase either value beyond 1.0, which results in a decline in F1.
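The one-at-a-time sweep described above can be sketched as follows. Eq. 17 is not reproduced in this section, so the sketch assumes it is a weighted sum of the MNER loss and the two auxiliary losses; the names `combined_loss`, `w_boundary`, and `w_category` and the grid values are hypothetical.

```python
from typing import Dict, List


def combined_loss(l_mner: float, l_boundary: float, l_category: float,
                  w_boundary: float = 1.0, w_category: float = 1.0) -> float:
    """Assumed form of the joint objective: main loss plus weighted auxiliary losses."""
    return l_mner + w_boundary * l_boundary + w_category * l_category


def one_at_a_time(grid: List[float], fixed: float = 1.0) -> List[Dict[str, float]]:
    """Vary one weight over the grid while the other stays at the fixed value."""
    configs = [{"w_boundary": v, "w_category": fixed} for v in grid]
    configs += [{"w_boundary": fixed, "w_category": v} for v in grid]
    return configs


# Assumed grid; the values actually swept in Figure 3 are not listed in the text.
for cfg in one_at_a_time([0.2, 0.6, 1.0]):
    print(cfg, combined_loss(1.0, 2.0, 3.0, **cfg))
```

Each configuration would correspond to one training run, with the resulting F1 plotted against the varied weight.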
4.6.4. Impact of Transformer Layer Number
To examine the effect of the transformer layer number on MFD, we conduct experiments with values of this hyperparameter ranging from 1 to 10. Figure 3c shows that the performance of MFD first increases and then decreases as the transformer layer number grows, with the best performance at 5 layers. Notably, deeper layers do not guarantee better performance. Rather, they introduce more parameters, which increase the complexity and difficulty of model training.
4.6.5. Boundary and Category Errors Analysis
To evaluate the effectiveness of MFD on the boundary and category dimensions, we conduct a fine-grained analysis of the erroneous predictions produced by MFD and the baseline methods (i.e., UMT, UMGF, HVPNeT, and ITA). We classify the erroneous predictions into two types: boundary errors and category errors. A predicted entity is a boundary error when its boundary is incorrect.
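Table 7
Analysis of the wrong predictions of different MNER approaches. For each dataset, the columns give the number of predictions, the boundary error percentage, the number of predictions with correct boundaries, the category error percentage, and the MNER F1 score, respectively. We re-run the baseline models according to their public code.

| Dataset | Model | Predictions | Boundary errors (%) | Correct boundaries | Category errors (%) | F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Twitter-2015 | UMT | 5,579 | 21.87 | 4,359 | 10.55 | 72.30 |
| Twitter-2015 | UMGF | 5,346 | 19.64 | 4,296 | 10.57 | 72.58 |
| Twitter-2015 | HVPNeT | 5,314 | 18.95 | 4,309 | 8.34 | 74.84 |
| Twitter-2015 | ITA | 5,253 | 17.36 | 4,341 | 10.07 | 74.65 |
| Twitter-2015 | MFD | 5,046 | | 4,255 | | |
| Twitter-2017 | UMT | 1,360 | 10.66 | 1,215 | 5.27 | 84.91 |
| Twitter-2017 | UMGF | 1,384 | 12.57 | 1,210 | 5.54 | 83.55 |
| Twitter-2017 | HVPNeT | 1,402 | 10.34 | 1,257 | 5.49 | 86.31 |
| Twitter-2017 | ITA | 1,354 | | 1,239 | 6.38 | 85.77 |
| Twitter-2017 | MFD | 1,352 | 8.73 | 1,234 | | |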
Figure 4: Visualization of an example.
When the entity boundary is right but the category is misidentified, we treat the prediction as a category error. Then, we compute the statistics of these two error types and present them in Table 7.
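The two error types can be counted with a short sketch. It assumes entities are `(start, end, category)` triples and that the category error percentage is taken over predictions with correct boundaries; that denominator is an assumption, and the function names are hypothetical.

```python
from typing import Dict, Set, Tuple

Entity = Tuple[int, int, str]  # (start, end, category), token offsets


def boundary_error_pct(n_pred: int, n_boundary_ok: int) -> float:
    """Share of predictions whose boundary is wrong, in percent."""
    return 100.0 * (n_pred - n_boundary_ok) / n_pred if n_pred else 0.0


def error_breakdown(pred: Set[Entity], gold: Set[Entity]) -> Dict[str, float]:
    gold_type = {(s, e): t for s, e, t in gold}
    # Predictions whose span exactly matches a gold entity.
    boundary_ok = [(s, e, t) for s, e, t in pred if (s, e) in gold_type]
    # Among those, predictions with the wrong category.
    cat_err = sum(1 for s, e, t in boundary_ok if gold_type[(s, e)] != t)
    n_ok = len(boundary_ok)
    return {
        "predictions": len(pred),
        "boundary_error_pct": boundary_error_pct(len(pred), n_ok),
        "correct_boundaries": n_ok,
        "category_error_pct": 100.0 * cat_err / n_ok if n_ok else 0.0,
    }


gold = {(0, 1, "PER"), (3, 4, "LOC"), (6, 8, "ORG"), (10, 10, "MISC")}
pred = {(0, 1, "PER"), (3, 4, "ORG"), (6, 7, "ORG"), (10, 10, "MISC")}
print(error_breakdown(pred, gold))

# Consistency check against the UMT counts reported in Table 7 (Twitter-2015).
print(round(boundary_error_pct(5579, 4359), 2))
```

With the UMT counts on Twitter-2015 (5,579 predictions, 4,359 with correct boundaries), the boundary formula reproduces the 21.87% listed in Table 7, which supports this reading of the boundary column.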
Based on how they obtain token representations, previous multi-modality methods fall into two groups: cross-modal interaction-based methods (i.e., UMT, UMGF, and HVPNeT) and image conversion-based methods (i.e., ITA). As shown in Table 7, MFD reduces both boundary errors and category errors compared with the cross-modal interaction-based methods. In particular, MFD has a smaller proportion of boundary errors than the best model in this group (i.e., HVPNeT) on both MNER datasets. Compared with the image conversion-based method ITA, MFD reduces the proportion of category errors on both MNER datasets. These results indicate that MFD effectively reduces the proportion of both boundary errors and category errors on MNER. Moreover, we observe an interesting phenomenon: comparing the two groups, the cross-modal interaction-based methods have lower category errors, while the image conversion-based method has lower boundary errors. This is consistent with our motivation, which rests on the difference between the roles of the text and image modalities in MNER.
4.6.6. Visualization
To gain more insight into how MFD operates on MNER, we visualize the attention values in the Multimodal Interaction part (Eq. 9). Figure 4 illustrates that entities of different categories attend to objects that correspond to their semantic meaning. For instance, the entity "Germany" (LOC) pays more attention to signs and sky, while the entity "Rock am Ring Festival" (MISC) concentrates more on signs and billboards with "Rock Ring". This indicates that our framework can learn appropriate cues from images and improve the accuracy of the entity category.
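The kind of attention values that such a visualization plots can be sketched as follows. Eq. 9 is not reproduced in this section, so the sketch assumes a standard scaled dot-product attention from one token vector to a set of object vectors; the vectors, object names, and the function `attention_weights` are illustrative only.

```python
import math
from typing import List


def attention_weights(query: List[float], keys: List[List[float]]) -> List[float]:
    """Softmax over scaled dot products between one query and several keys."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-d features: a token vector and two detected-object vectors.
token = [1.0, 0.0]
objects = {"sign": [2.0, 0.0], "sky": [0.0, 2.0]}
weights = attention_weights(token, list(objects.values()))
for name, w in zip(objects, weights):
    print(f"{name}: {w:.3f}")
```

The weights sum to one over the objects, so a heat map of them shows which objects a given token relies on most.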
4.6.7. Case Study
To illustrate the superiority of MFD intuitively, we choose three test samples and compare the prediction results of different methods. Figure 5 illustrates the results. Regarding the entity boundary, we observe that BERT and ITA perform satisfactorily, which can be attributed to the fact that the text modality itself can precisely identify the entity boundary. The predictions of HVPNeT tend to incorporate more tokens into the entity. Regarding the entity category, HVPNeT can disambiguate the category with the assistance of the image, such as classifying "560 WQAM" as ORG rather than LOC. BERT and ITA may be confused by the preposition "on" preceding the entity and misclassify it as LOC. In contrast, MFD predicts the correct results in all three samples. These observations suggest that MFD fully leverages the advantages of the different modalities and achieves satisfactory performance on MNER.
Text: stickers I got at the [Bayou Art Festival] MISC last Friday, made by the . more at
Text: Don 't miss everyone 's favorite radio pair ' s # NHLdraft coverage on @ [560 WQAM] tonight
Text: Coach with the infield # pepsibaseball # baberut
Figure 5: The top row shows several samples, in which entity boundaries and categories are denoted in the text by brackets and subscripts, respectively. The bottom row presents the predicted results.
5. Conclusion
This paper proposes a Multi-task Framework based on the Decomposition strategy for Multimodal Named Entity Recognition. The framework constructs two auxiliary tasks based on the decomposition strategy and assigns the text and image modalities to their respective roles. Furthermore, it employs label embedding to vectorize the outputs of the two auxiliary tasks and uses their label clues to improve the final MNER performance. We perform experiments on two public MNER benchmark datasets and demonstrate that our framework outperforms existing approaches. The ablation study confirms the effectiveness of both the auxiliary tasks and the proposed modules. The analysis indicates that our framework reduces both boundary errors and category errors, which is the key to its superior performance.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China 62176076, Natural Science Foundation of Guangdong 2023A1515012922, Shenzhen Foundational Research Funding JCYJ20220818102415032, Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies 2022B1212010005, and the Major Key Project of PCL No. PCL2023A09.
References
Abdu, S.A., Yousef, A.H., Salem, A., 2021. Multimodal video sentiment analysis using deep learning approaches, a survey. Information Fusion 76, 204-226.
Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L., 2018. Bottom-up and top-down attention for image captioning and visual question answering, in: Proceedings of CVPR, pp. 6077-6086.
Augenstein, I., Derczynski, L., Bontcheva, K., 2017. Generalisation in named entity recognition: A quantitative analysis. Computer Speech & Language 44, 61-83.
Chen, S., Aguilar, G., Neves, L., Solorio, T., 2021. Can images help recognize entities? a study of the role of images for multimodal ner, in: Proceedings of W-NUT, pp. 87-96.
Chen, X., Zhang, N., Li, L., Yao, Y., Deng, S., Tan, C., Huang, F., Si, L., Chen, H., 2022. Good visual guidance make a better extractor: Hierarchical visual prefix for multimodal entity and relation extraction, in: Findings of NAACL, pp. 1607-1618.
Devlin, J., Chang, M.W., Lee, K., Toutanova, K., 2019. BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of NAACL, pp. 4171-4186.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An image is worth 16x16 words: Transformers for image recognition at scale, in: Proceedings of ICLR.
Goyal, A., Gupta, V., Kumar, M., 2018. Recent named entity recognition and classification techniques: a systematic review. Computer Science Review 29, 21-43.
Hosseini, H., Mansouri, M., Bagheri, E., 2022. A systemic functional linguistics approach to implicit entity recognition in tweets. Information Processing & Management 59, 102957.
Huang, Z., Xu, W., Yu, K., 2015. Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.
Kim, K., Park, S., 2023. Aobert: All-modalities-in-one bert for multimodal sentiment analysis. Information Fusion 92, 37-45.
Konkol, M., Brychcín, T., Konopík, M., 2015. Latent semantics in named entity recognition. Expert Systems with Applications 42, 3470-3479.
Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., Dyer, C., 2016. Neural architectures for named entity recognition, in: Proceedings of NAACL, pp. 260-270.
Li, P., Zhou, G., Guo, Y., Zhang, S., Jiang, Y., Tang, Y., 2024a. Epic: An epidemiological investigation of covid-19 dataset for chinese named entity recognition. Information Processing & Management 61, 103541.
Liu, Y., Wei, S., Huang, H., Lai, Q., Li, M., Guan, L., 2023. Naming entity recognition of citrus pests and diseases based on the bert-bilstm-crf model. Expert Systems with Applications 234, 121103.
Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of ICCV, pp. 10012-10022.
Long, Y., Xiong, D., Lu, Q., Li, M., Huang, C.R., 2016. Named entity recognition for chinese novels in the ming-qing dynasties, in: Proceedings of CLSW, Springer. pp. 362-375.
Lu, D., Neves, L., Carvalho, V., Zhang, N., Ji, H., 2018. Visual attention model for name tagging in multimodal social media, in: Proceedings of ACL, pp. 1990-1999.
Ma, X., Hovy, E., 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf, in: Proceedings of ACL, pp. 1064-1074.
Mao, R., He, K., Zhang, X., Chen, G., Ni, J., Yang, Z., Cambria, E., 2024. A survey on semantic processing techniques. Information Fusion 101, 101988.
Moon, S., Neves, L., Carvalho, V., 2018. Multimodal named entity recognition for short social media posts, in: Proceedings of NAACL, pp. 852-860.
Nguyen, D.Q., Vu, T., Tuan Nguyen, A., 2020. BERTweet: A pre-trained language model for English tweets, in: Proceedings of EMNLP: System Demonstrations, pp. 9-14.
Nguyen, N.H., Vo, D.T., Van Nguyen, K., Nguyen, N.L.T., 2023. Openvivqa: Task, dataset, and multimodal fusion models for visual question answering in vietnamese. Information Fusion 100, 101868.
Ren, Y., Li, H., Liu, P., Liu, J., Zhu, H., Sun, L., 2023. Owner name entity recognition in websites based on multiscale features and multimodal co-attention. Expert Systems with Applications 224, 120014.
Sang, E.T.K., Veenstra, J., 1999. Representing text chunks, in: Proceedings of EACL, pp. 173-179.
Suman, C., Reddy, S.M., Saha, S., Bhattacharyya, P., 2021. Why pay more? a simple and efficient named entity recognition system for tweets. Expert Systems with Applications 167, 114101.
Tian, Y., Sun, X., Yu, H., Li, Y., Fu, K., 2021. Hierarchical self-adaptation network for multimodal named entity recognition in social media. Neurocomputing 439, 12-21.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. Advances in Neural Information Processing Systems 30.
Wang, J., Yang, Y., Liu, K., Zhu, Z., Liu, X., 2023. M3s: Scene graph driven multi-granularity multi-task learning for multi-modal ner. IEEE/ACM Transactions on Audio, Speech, and Language Processing 31, 111-120.
Wang, X., Gui, M., Jiang, Y., Jia, Z., Bach, N., Wang, T., Huang, Z., Tu, K., 2022. Ita: Image-text alignments for multi-modal named entity recognition, in: Proceedings of NAACL, pp. 3176-3189.
Yu, J., Jiang, J., Yang, L., Xia, R., 2020. Improving multimodal named entity recognition via entity span detection with unified multimodal transformer, in: Proceedings of ACL, pp. 3342-3352.
Zhang, D., Wei, S., Li, S., Wu, H., Zhu, Q., Zhou, G., 2021a. Multi-modal graph fusion for named entity recognition with targeted visual guidance, in: Proceedings of AAAI, pp. 14347-14355.
Zhang, J., Yin, Z., Chen, P., Nichele, S., 2020. Emotion recognition using multi-modal data and machine learning techniques: A tutorial and review. Information Fusion 59, 103-126.
Zhang, Q., Fu, J., Liu, X., Huang, X., 2018. Adaptive co-attention network for named entity recognition in tweets, in: Proceedings of AAAI, pp.
Zhang, W., Yu, J., Zhao, W., Ran, C., 2021b. Dmrfnet: deep multimodal reasoning and fusion for visual question answering and explanation generation. Information Fusion 72, 70-79.
Zhang, X., Yuan, J., Li, L., Liu, J., 2023. Reducing the bias of visual objects in multimodal named entity recognition, in: Proceedings of WSDM, p. .
Zhao, Z.Q., Zheng, P., Xu, S.t., Wu, X., 2019. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems 30, 3212-3232.
Zhu, L., Zhu, Z., Zhang, C., Xu, Y., Kong, X., 2023. Multimodal sentiment analysis based on fusion methods: A survey. Information Fusion 95,