
A Multi-task Framework based on Decomposition for Multimodal Named Entity Recognition

Chenran Cai, Qianlong Wang, Bing Qin and Ruifeng Xu
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), China
Peng Cheng Laboratory, China
Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, China

ARTICLE INFO

Keywords:

Multimodal Named Entity Recognition
Multi-task Framework
Entity Boundary Detection
Entity Category Classification

Abstract

Given a text-image pair, Multimodal Named Entity Recognition (MNER) is the task of identifying and categorizing entities in the text. Most existing work performs named entity labeling directly using final token representations derived by fusing image and text representations. Although these works achieve promising results, they may fail to effectively exploit the text and image modalities, because they neglect the difference in the roles of the two modalities: the text modality can detect the boundary of an entity, while the image modality is introduced to disambiguate the category of the entity. Based on these findings, in this paper, we construct two auxiliary tasks based on a decomposition strategy and propose a multi-task framework for MNER. Specifically, we first decompose MNER into two auxiliary tasks: an entity boundary detection task and an entity category classification task. The former takes only the text modality as input and outputs boundary labels, since the text alone can achieve satisfactory boundary results. The latter uses both modalities to yield category labels, where the image modality is dedicated to disambiguating categories. These two auxiliary tasks allow the effective exploitation of the text and image modalities and put them back into their respective roles. Then, we vectorize their results to improve entity recognition using label clues from the auxiliary tasks. Finally, we fuse features from the text and image modalities with label embeddings from the auxiliary tasks to fulfill MNER. Experimental results on two widely used MNER datasets show that our framework yields new state-of-the-art (SOTA) performance.

1. Introduction

Multimodal Named Entity Recognition (MNER) aims to recognize entities of various categories (such as person and location, denoted as PER and LOC, respectively) in unstructured text with the help of additional image information (Zhang et al., 2018; Lu et al., 2018; Tian et al., 2021). Take Figure 1a as an example: with the help of the corresponding image, we need to identify two entities in the text, "Lee Brice" (PER) and "Hard Rock Tulsa" (LOC), to complete MNER.
Unlike conventional NER (Konkol et al., 2015; Long et al., 2016; Augenstein et al., 2017; Goyal et al., 2018; Li et al., 2020; Suman et al., 2021; Hosseini et al., 2022; Liu et al., 2023; Li et al., 2024a; Mao et al., 2024), which relies solely on the text modality to carry out entity recognition, MNER tends to first leverage the text and image modalities to derive the final token representations and then identify entities. Existing MNER studies can be divided into two main groups according to how they obtain token representations. The first group consists of cross-modal interaction-based methods (Zhang et al., 2021a; Chen et al., 2022; Zhang et al., 2023; Wang et al., 2023; Ren et al., 2023; Liu et al., 2024). They often first employ the attention mechanism for cross-modal interaction to overcome the semantic discrepancy between different modalities. Then, concatenating the interaction results with the textual representation yields the final token representations for entity labeling. The second group consists of image conversion-based methods (Chen et al., 2021; Wang et al., 2022). Such methods enrich the token representations by incorporating textualized information derived from images, which is treated as additional input to the text.
Despite their promising results, these studies neglect the difference in the roles of the text and image modalities in MNER, which may cause inefficient utilization of both modalities. For MNER, the text modality by itself can detect the boundary of an entity and can even determine its category, albeit with weak confidence, while the image modality is usually introduced to disambiguate the entity category. As shown in Figure 1a, the text itself can give satisfactory results even without the image. In contrast, in Figure 1b, the text can only determine the entity boundary "Alibaba", but not whether its category is ORG or PER. In this case, the image modality can be used to help filter out irrelevant categories. Therefore, we believe that the difference in the roles of the two modalities should be taken into account, rather than directly fusing the two via cross-modal interaction or image conversion to obtain the final token representations.
In this paper, we construct two auxiliary tasks based on the decomposition strategy and propose a multi-task framework for MNER. Specifically, we first decompose MNER into two auxiliary tasks, an entity boundary detection task and an entity category classification task, based on the characteristics of entity sequence labels. For entity boundary detection, we use only the text modality as input and output boundary labels, since the text alone can achieve satisfactory boundary results. For entity category classification, we exploit the interactive features of the two modalities to derive the category labels, where the image modality helps recognize ambiguous entities that are insufficiently described by the text modality.
(a) [Lee Brice]_PER in concert Monday at [Hard Rock Tulsa]_LOC!
(b) I love [Alibaba].
Figure 1: Illustration of Multimodal Named Entity Recognition (MNER) on two examples. Entities are labeled with brackets, and the entity categories are indicated by the subscripts of the brackets.
Constructing these two auxiliary tasks allows the effective exploitation of the text and image modalities and puts them back into their respective roles. In addition, their labels can be treated as clues to improve the performance of MNER. Based on this, we then vectorize the output results of the two auxiliary tasks via label embedding so as to apply their label clues in entity recognition. Finally, to better leverage the label clues for entity recognition, we adopt a multi-layer transformer that integrates features from the text and image modalities and label embeddings from the two auxiliary tasks. Through this fusion, we can interpose the label clues into the interactive features to enhance the final token representations, thereby achieving MNER.
This paper makes the following contributions:
  • We elaborate on the roles of the text and image modalities in MNER, which offers insights for future studies to utilize both modalities effectively.
  • Based on the respective roles of the two modalities, we design two auxiliary tasks by decomposing labels to effectively leverage the two modalities, and we further introduce a multi-task framework to solve MNER.
  • We present a comprehensive evaluation of our framework on two popular MNER datasets. The experimental results show that our framework achieves SOTA performance, verifying its effectiveness and superiority.

2. Related Work

2.1. Multimodal Named Entity Recognition

With the growing prevalence of multimodal social media data, MNER has attracted increasing research attention. Like other multimodal tasks, such as multimodal sentiment analysis and visual question answering, MNER also requires designing approaches to fuse latent information from different modalities (Zhang et al., 2020, 2021b; Abdu et al., 2021; Zhu et al., 2023; Kim and Park, 2023; Nguyen et al., 2023; Li et al., 2024b). Most existing research can be broadly categorized into two groups: cross-modal interaction-based approaches and image conversion-based approaches. The first group focuses on cross-modal interaction and fusion of text and image representations using the attention mechanism or the transformer. For instance, some works (Zhang et al., 2018; Moon et al., 2018; Lu et al., 2018) utilize CNNs and LSTMs to encode the image and text modalities, respectively, and then use attention to fuse both modalities and generate multimodal representations for entity labeling. Yu et al. (2020) model the interaction between the two modalities via a transformer to directly handle MNER. To enhance MNER performance, some studies (Zhang et al., 2021a; Chen et al., 2022; Wang et al., 2023) not only model the global image-text relationship but also exploit the local semantic alignment between visual objects and textual tokens to obtain fine-grained token representations. To reduce the issue of visual object bias, Zhang et al. (2023) employ a de-biasing method that implicitly aligns multimodal information and mitigates the bias induced by visual objects in the image. The second group (Chen et al., 2021; Wang et al., 2022) first converts images into textualized information, such as captions, which bridges the gap between the image and text modalities. Then, they concatenate the input text with the textualized information from the image and feed them into a pre-trained language model to finish MNER.
For MNER, we find that the text modality by itself can effectively identify entity boundaries, while the image modality is typically introduced to disambiguate the entity category. Based on this finding, we construct two auxiliary tasks (i.e., entity boundary detection and entity category classification) that leverage the features of the different modalities. We then utilize the results of the two auxiliary tasks to enhance the performance of the final MNER entity labeling.

2.2. Pre-trained Models

Pre-trained models are trained on large unlabeled datasets using self-supervised objectives, and they can be categorized into two groups according to their inputs: pre-trained language models (PLMs) and pre-trained vision models (PVMs). PLMs, such as BERT (Devlin et al., 2019) and BERTweet (Nguyen et al., 2020), excel at various NLP tasks, e.g., named entity recognition. Inspired by PLMs, researchers have proposed a series of vision transformers (e.g., ViT (Dosovitskiy et al., 2021) and Swin (Liu et al., 2021)). These PVMs achieve significant improvements on a range of computer vision tasks, such as object detection (Zhao et al., 2019). Due to their powerful representation capabilities, we use the BERT and ViT models to encode the text and image modalities, respectively.
Figure 2: Overview of our proposed framework MFD. The text and image representation modules first extract features from both modalities. Then, two auxiliary tasks respectively predict the boundary and category of each entity. To utilize the results of two auxiliary tasks, we vectorize predicted boundary and category labels into label clues via label embedding. Finally, the MNER classification module fuses information from both modalities and label clues to perform MNER.

3. Methodology

3.1. Task Definition

MNER aims to identify and categorize entities in a given sentence $S$ with the help of an additional image $V$. The entities are detected from $S$ and assigned to one of the pre-defined categories. Given an input token sequence $S=\{s_{1}, s_{2}, \ldots, s_{n}\}$, we assign a label sequence $Y=\{y_{1}, y_{2}, \ldots, y_{n}\}$ to $S$, where $n$ is the length of $S$ and each $y_{i}$ belongs to the pre-defined label set $\{\mathrm{O}$, B-PER, I-PER, B-LOC, I-LOC, B-ORG, I-ORG, B-MISC, I-MISC$\}$ of the BIO tagging schema (Sang and Veenstra, 1999).
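For concreteness, the following minimal sketch (using a hypothetical whitespace tokenization of the Figure 1a sentence, not part of our framework) shows how a BIO label sequence maps tokens to typed entity spans:

```python
# Illustrative BIO labeling for the Figure 1a example (hypothetical tokenization).
tokens = ["Lee", "Brice", "in", "concert", "Monday", "at", "Hard", "Rock", "Tulsa", "!"]
labels = ["B-PER", "I-PER", "O", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]

# Decode (token, label) pairs back into typed entity spans.
entities, current = [], None
for tok, lab in zip(tokens, labels):
    if lab.startswith("B-"):
        current = [tok, lab[2:]]
        entities.append(current)
    elif lab.startswith("I-") and current is not None:
        current[0] += " " + tok
    else:
        current = None
print(entities)  # [['Lee Brice', 'PER'], ['Hard Rock Tulsa', 'LOC']]
```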

3.2. Overall Architecture

As shown in Figure 2, we propose a Multi-task Framework based on the Decomposition strategy for MNER, which we denote as MFD. The framework contains four components: (1) the text representation module, which extracts the feature of each token from the text; (2) the image representation module, which consists of an object detection part and an image representation part to obtain object features from the image; (3) the auxiliary module, which includes the entity boundary detection task and the entity category classification task and utilizes label embedding to vectorize the results of the two tasks as label clues; and (4) the MNER classification module, which fuses information from the textual and visual modalities with the two label embeddings to perform MNER.

3.3. Text Representation Module

Given a sequence of tokens $S=\{s_{1}, s_{2}, \ldots, s_{n}\}$, we first utilize BERT to encode each token as a feature embedding:
$$E=\operatorname{BERT}(S) \in \mathbb{R}^{n \times d_{t}},$$
where $E$ is the embedding matrix of $S$ and $d_{t}$ indicates the dimension of each token feature representation. To align the dimensions of representations across the text and image modalities, we apply a trainable linear layer to the text representation $E$:
$$H^{T}=\{h_{1}^{T}, h_{2}^{T}, \ldots, h_{n}^{T}\}=E W_{t}+b_{t} \in \mathbb{R}^{n \times d},$$
where $h_{i}^{T}$ is the $i$-th token vector and $d$ indicates the dimension of the hidden state representation shared by the text and image modalities.
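A minimal PyTorch sketch of this module, assuming the HuggingFace transformers BERT implementation (the class name TextEncoder and the default dimension are illustrative assumptions, not the exact implementation):

```python
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class TextEncoder(nn.Module):
    """Encode tokens with BERT (E), then project into the shared d-dim space (H^T)."""
    def __init__(self, d: int = 768, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        # Trainable linear layer aligning text features with the image modality.
        self.proj = nn.Linear(self.bert.config.hidden_size, d)

    def forward(self, input_ids, attention_mask):
        E = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.proj(E)  # H^T: (batch, n, d)

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
batch = tokenizer(["Lee Brice in concert Monday at Hard Rock Tulsa !"], return_tensors="pt")
H_T = TextEncoder()(batch["input_ids"], batch["attention_mask"])  # (1, n, 768)
```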

3.4. Image Representation Module

Given an image $V$, we first apply a trained bottom-up-attention model (Anderson et al., 2018) to obtain object-level bounding boxes. For each object region $v_{j}$, we first resize it to a fixed size of $P \times P \times C$, where $(W_{j}, H_{j})$ denotes the original resolution of the object region, $C$ represents the channel dimension, and $P$ is set to 224. As in Dosovitskiy et al. (2021), we convert each object region $v_{j}$ into a series of patches and prepend a learnable embedding token to the sequence of embedded patches. Then, we apply ViT to get the token representation $u_{j}$ that represents the object region $v_{j}$:
$$u_{j}=\operatorname{ViT}(v_{j}) \in \mathbb{R}^{d_{v}},$$
where $u_{j}$ is the feature representation of $v_{j}$, taken from the prepended embedding token over the $k$ patches, and $d_{v}$ indicates the image-modality hidden state representation dimension. We define the representation of the image as the aggregation of all object region features:
$$U=\{u_{1}, u_{2}, \ldots, u_{m}\}+E_{pos},$$
where $m$ is the number of object regions, the $u_{j}$ are sorted by the ascending L2 distance from the center position of the object box to the image's upper-left corner, and $E_{pos}$ is the position embedding matrix. We then apply a trainable linear projection to transform $U$ into the $d$-dimension space:
$$H^{V}=\{h_{1}^{V}, h_{2}^{V}, \ldots, h_{m}^{V}\}=U W_{v}+b_{v} \in \mathbb{R}^{m \times d},$$
where $h_{j}^{V}$ is the $j$-th object region vector.
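A minimal sketch of the object encoding step, assuming pre-detected bounding boxes (the bottom-up-attention detector itself is external) and the HuggingFace ViTModel; the class name, checkpoint, and max_objects default are illustrative assumptions:

```python
import torch
import torch.nn as nn
from transformers import ViTModel

class ObjectEncoder(nn.Module):
    """Encode cropped 224x224 object regions with ViT and project into the shared space."""
    def __init__(self, d: int = 768, max_objects: int = 10,
                 vit_name: str = "google/vit-base-patch16-224-in21k"):
        super().__init__()
        self.vit = ViTModel.from_pretrained(vit_name)
        self.pos_emb = nn.Embedding(max_objects, self.vit.config.hidden_size)
        self.proj = nn.Linear(self.vit.config.hidden_size, d)

    def forward(self, regions, centers):
        # regions: (m, 3, 224, 224) resized object crops; centers: (m, 2) box centers.
        # Sort by ascending L2 distance from the image's upper-left corner (origin).
        order = torch.linalg.norm(centers, dim=-1).argsort()
        # u_j: the prepended [class] token output of ViT for each region.
        u = self.vit(pixel_values=regions[order]).last_hidden_state[:, 0]
        u = u + self.pos_emb(torch.arange(u.size(0), device=u.device))  # + E_pos
        return self.proj(u)  # H^V: (m, d)

enc = ObjectEncoder()
H_V = enc(torch.randn(4, 3, 224, 224), centers=torch.rand(4, 2))  # (4, 768)
```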

3.5. Auxiliary Module

Most previous studies perform named entity labeling using final token representations derived by fusing image and text representations. They generally adopt a joint tagging schema of entity boundary and entity category (e.g., B-PER, I-PER). However, we observe that the text modality itself can provide satisfactory boundary results, and the image modality is typically introduced to disambiguate the entity category. Based on this finding, we decompose the joint tagging schema into two auxiliary tasks: entity boundary detection and entity category classification.
Entity Boundary Detection. This task aims to detect the boundaries of entities in the input text. We first formulate this task as a sequence labeling problem and adopt the BIOES tagging schema, which emphasizes the concepts of start and end. Then, we use $Y^{b}=\{y_{1}^{b}, y_{2}^{b}, \ldots, y_{n}^{b}\}$ to denote the boundary sequence labels, where $y_{i}^{b} \in \{\mathrm{B}, \mathrm{I}, \mathrm{O}, \mathrm{E}, \mathrm{S}\}$: an entity of length greater than one starts with B and ends with E, I represents an inside token of the entity, O is the label of non-entity tokens, and S is used to label an entity consisting of a single token. Specifically, as shown in the Entity Boundary Detection section of Figure 2, we apply a linear layer to the text features $H^{T}$ to predict boundary labels and calculate the boundary loss using the cross-entropy loss function:
$$\mathcal{L}_{b}=-\sum_{i=1}^{n} \sum_{j=1}^{\left|C_{b}\right|} \hat{y}_{i j}^{b} \log p_{i j}^{b},$$
where $\hat{y}^{b}$ is the gold boundary label and $\left|C_{b}\right|$ denotes the size of the boundary label set.
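A minimal sketch of the boundary detection head; masking padded positions with the index -100 is an implementation assumption, not stated in the paper:

```python
import torch
import torch.nn as nn

BIOES = ["B", "I", "O", "E", "S"]  # boundary label set C_b

class BoundaryHead(nn.Module):
    """Linear layer over text features H^T predicting BIOES boundary labels."""
    def __init__(self, d: int = 768):
        super().__init__()
        self.classifier = nn.Linear(d, len(BIOES))
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 masks padding

    def forward(self, H_T, gold=None):
        logits = self.classifier(H_T)                 # (batch, n, |C_b|)
        if gold is None:
            return logits.argmax(-1)                  # predicted boundary labels
        return self.loss_fn(logits.reshape(-1, len(BIOES)), gold.reshape(-1))

head = BoundaryHead()
# e.g. "Lee Brice in concert ..." -> B E O O O O (indices into BIOES)
loss = head(torch.randn(1, 6, 768), gold=torch.tensor([[0, 3, 2, 2, 2, 2]]))  # L_b
```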

Entity Category Classification. The entity category classification task aims to classify the category of each token in the input text. We use $Y^{c}=\{y_{1}^{c}, y_{2}^{c}, \ldots, y_{n}^{c}\}$ to denote the category sequence labels, where $y_{i}^{c} \in \{\mathrm{PER}, \mathrm{LOC}, \mathrm{ORG}, \mathrm{MISC}, \mathrm{O}\}$. Since the text modality alone can cause category ambiguity, which needs to be eliminated using the image modality, we exploit both text and image modalities for the entity category classification task.

However, some object regions in images are irrelevant to the text, and we should ignore these object regions to reduce noise. To solve this challenge, we design a multimodal interaction part, which adopts a multi-head attention mechanism (Vaswani et al., 2017) to enhance the image representations with the guidance of the associated text. As shown in the Entity Category Classification section of Figure 2, given the text representation $H^{T}$ and the image representation $H^{V}$, we apply multi-head cross-modal attention to derive text-aware image representations $A$:
$$A=\operatorname{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_{k}}}\right) V^{\prime}, \quad Q=H^{T} W_{Q}, \; K=H^{V} W_{K}, \; V^{\prime}=H^{V} W_{V},$$
where $\sqrt{d_{k}}$ is the scaling factor. Then, we feed $H^{T}$ and $A$ into a linear layer to predict category labels and calculate the category loss using the cross-entropy loss function:
$$\mathcal{L}_{c}=-\sum_{i=1}^{n} \sum_{j=1}^{\left|C_{c}\right|} \hat{y}_{i j}^{c} \log p_{i j}^{c},$$
where $\hat{y}^{c}$ is the gold category label and $\left|C_{c}\right|$ denotes the size of the category label set.
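A minimal sketch of this part using PyTorch's built-in multi-head attention; concatenating $H^{T}$ and $A$ along the feature dimension before the linear classifier is one plausible reading of "feed $H^{T}$ and $A$ into a linear layer", not a confirmed detail:

```python
import torch
import torch.nn as nn

CATEGORIES = ["PER", "LOC", "ORG", "MISC", "O"]  # category label set C_c

class CategoryHead(nn.Module):
    """Text-guided cross-modal attention over object features, then a category classifier."""
    def __init__(self, d: int = 768, heads: int = 8):
        super().__init__()
        # Query = text H^T; Key/Value = object features H^V -> text-aware image features A.
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.classifier = nn.Linear(2 * d, len(CATEGORIES))

    def forward(self, H_T, H_V):
        A, _ = self.cross_attn(query=H_T, key=H_V, value=H_V)   # (batch, n, d)
        logits = self.classifier(torch.cat([H_T, A], dim=-1))   # per-token category logits
        return logits, A

head = CategoryHead()
logits, A = head(torch.randn(1, 6, 768), torch.randn(1, 4, 768))  # logits: (1, 6, 5)
```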
After obtaining the results of the two auxiliary tasks, we could obtain joint entity labels by combining the two sequence labels to solve MNER. However, this direct combination fails to consider the correlation between labels, for example producing the invalid label O-PER, which hurts the performance of the framework.
Label Embedding. To utilize the results of the two auxiliary tasks to enhance the final entity recognition performance, we vectorize boundary labels and category labels into embeddings via label embedding. For boundary label embedding, we first assign trainable embeddings $E^{b} \in \mathbb{R}^{\left|C_{b}\right| \times d}$, where $\left|C_{b}\right|$ is the size of the boundary label set. Then, we initialize $E^{b}$ with word embeddings from the BERT vocabulary that have the same semantics as the boundary labels. For instance, we utilize the word embeddings of begin and end to initialize the B and E labels, respectively. Finally, we select the corresponding boundary label embedding for each token from $E^{b}$ to obtain the boundary label embedding sequence $B$:
$$B=\{b_{1}, b_{2}, \ldots, b_{n}\} \in \mathbb{R}^{n \times d},$$
where $b_{i}$ is the boundary label embedding of token $s_{i}$ and $n$ is the number of tokens. For category label embedding, similar to boundary label embedding, we set the trainable embeddings $E^{c} \in \mathbb{R}^{\left|C_{c}\right| \times d}$ and
Table 1
Statistics of two MNER datasets.

Entity Category      Twitter-2015              Twitter-2017
                     Train    Dev    Test      Train    Dev    Test
Organization           928    247     839      1,674    375    395
Person               2,217    552   1,816      2,943    626    621
Location             2,091    522   1,697        731    173    178
Miscellaneous          940    225     726        701    150    157
Total                6,176  1,546   5,078      6,049  1,324  1,351
Number of Samples    4,000  1,000   3,257      3,373    723    723
initialize the PER, LOC, ORG, MISC, and O labels with the word embeddings of person, location, organization, other, and none, respectively. Then, we obtain the category label embedding sequence $C$ by selecting the corresponding category label embedding from $E^{c}$ for each token:
$$C=\{c_{1}, c_{2}, \ldots, c_{n}\} \in \mathbb{R}^{n \times d},$$
where $c_{i}$ is the category label embedding of token $s_{i}$. It is worth noting that we construct $B$ and $C$ using the ground-truth labels during the training phase and apply the predicted labels of the two auxiliary tasks during the inference phase.
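A minimal sketch of the label embedding initialization; the seed words for the I, O, and S boundary labels ("inside", "outside", "single") are our illustrative assumptions, as the text only specifies begin and end:

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

def init_label_embeddings(label_words):
    """Build a trainable label embedding table seeded from the BERT word
    embeddings of semantically matching words (e.g. B -> "begin", E -> "end")."""
    tok = BertTokenizerFast.from_pretrained("bert-base-uncased")
    word_emb = BertModel.from_pretrained("bert-base-uncased").get_input_embeddings()
    ids = torch.tensor([tok.convert_tokens_to_ids(w) for w in label_words])
    table = nn.Embedding(len(label_words), word_emb.embedding_dim)
    with torch.no_grad():
        table.weight.copy_(word_emb.weight[ids])
    return table

# Boundary labels B/I/O/E/S; the I/O/S seed words are illustrative assumptions.
boundary_emb = init_label_embeddings(["begin", "inside", "outside", "end", "single"])
# Gold label indices during training, predicted indices at inference.
B = boundary_emb(torch.tensor([[0, 3, 2, 2, 4, 2]]))  # (1, n, 768)
```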

3.6. MNER Classification Module

To interpose the label clues from both auxiliary tasks into the interactive features and thereby improve the final token representations, we utilize a multi-layer transformer to fuse the multimodal interaction representation $A$, the text representation $H^{T}$, the boundary label embeddings $B$, and the category label embeddings $C$ by concatenation. We then pass the fused feature representation to a linear layer that predicts MNER labels. Finally, we calculate the MNER loss by applying the cross-entropy loss function:
$$\mathcal{L}_{m}=-\sum_{i=1}^{n} \sum_{j=1}^{\left|C_{m}\right|} \hat{y}_{i j} \log p_{i j},$$
where $\hat{y}$ is the gold MNER label and $\left|C_{m}\right|$ denotes the size of the MNER label set.
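A minimal sketch of the fusion and classification step; compressing the feature-wise concatenation back to $d$ dimensions with a linear layer before the transformer is an illustrative design choice, not a confirmed detail of the paper:

```python
import torch
import torch.nn as nn

class MNERClassifier(nn.Module):
    """Fuse H^T, text-aware image features A, and the two label embedding
    sequences B and C, then predict the final 9-way BIO labels."""
    def __init__(self, d: int = 768, num_labels: int = 9, layers: int = 2):
        super().__init__()
        # Concatenate the four streams feature-wise, compress back to d (illustrative).
        self.fuse = nn.Linear(4 * d, d)
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=layers)
        self.classifier = nn.Linear(d, num_labels)
        self.loss_fn = nn.CrossEntropyLoss(ignore_index=-100)

    def forward(self, H_T, A, B, C, gold=None):
        fused = self.fuse(torch.cat([H_T, A, B, C], dim=-1))  # (batch, n, d)
        logits = self.classifier(self.transformer(fused))     # (batch, n, num_labels)
        if gold is None:
            return logits.argmax(-1)                          # predicted MNER labels
        return self.loss_fn(logits.reshape(-1, logits.size(-1)), gold.reshape(-1))

model = MNERClassifier()
x = [torch.randn(1, 6, 768) for _ in range(4)]
loss = model(*x, gold=torch.zeros(1, 6, dtype=torch.long))  # scalar L_m
```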