Simplify the Usage of Lexicon in Chinese NER
Ruotian Ma^{1*}, Minlong Peng^{1*}, Qi Zhang^{1,3}, Zhongyu Wei^{2,3}, Xuanjing Huang^{1}
^{1}Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University
^{2}School of Data Science, Fudan University
^{3}Research Institute of Intelligent and Complex Systems, Fudan University
{rtma19,mlpeng16,qz,zywei,xjhuang}@fudan.edu.cn
Abstract
Recently, many works have tried to augment the performance of Chinese named entity recognition (NER) using word lexicons. As a representative, Lattice-LSTM (Zhang and Yang, 2018) has achieved new benchmark results on several public Chinese NER datasets. However, Lattice-LSTM has a complex model architecture. This limits its application in many industrial areas where real-time NER responses are needed. In this work, we propose a simple but effective method for incorporating the word lexicon into the character representations. This method avoids designing a complicated sequence modeling architecture, and for any neural NER model, it requires only subtle adjustment of the character representation layer to introduce the lexicon information. Experimental studies on four benchmark Chinese NER datasets show that our method achieves an inference speed up to 6.15 times faster than those of state-of-the-art methods, along with a better performance. The experimental results also show that the proposed method can be easily incorporated with pre-trained models like BERT.^{1}
1 Introduction
Named Entity Recognition (NER) is concerned with the identification of named entities, such as persons, locations, and organizations, in unstructured text. NER plays an important role in many downstream tasks, including knowledge base construction (Riedel et al., 2013), information retrieval (Chen et al., 2015), and question answering (Diefenbach et al., 2018). In languages where words are naturally separated (e.g., English), NER has been conventionally formulated as a sequence
labeling problem, and the state-of-the-art results have been achieved using neural-network-based models (Huang et al., 2015; Chiu and Nichols, 2016; Liu et al., 2018).
Compared with NER in English, Chinese NER is more difficult since sentences in Chinese are not naturally segmented. Thus, a common practice for Chinese NER is to first perform Chinese word segmentation (CWS) using an existing CWS system and then apply a word-level sequence labeling model to the segmented sentence (Yang et al., 2016; He and Sun, 2017b). However, it is inevitable that the CWS system will incorrectly segment query sentences. This will result in errors in the detection of entity boundaries and the prediction of entity categories in NER. Therefore, some approaches resort to performing Chinese NER directly at the character level, which has been empirically proven to be effective (He and Wang, 2008; Liu et al., 2010; Li et al., 2014; Liu et al., 2019; Sui et al., 2019; Gui et al., 2019b; Ding et al., 2019).
A drawback of the purely character-based NER method is that word information is not fully exploited. With this consideration, Zhang and Yang (2018) proposed Lattice-LSTM for incorporating word lexicons into the character-based NER model. Moreover, rather than heuristically choosing a word for a character when it matches multiple words in the lexicon, the authors proposed to preserve all words that match the character, leaving the subsequent NER model to determine which word to apply. To realize this idea, they introduced an elaborate modification to the sequence modeling layer of the LSTM-CRF model (Huang et al., 2015). Experimental studies on four Chinese NER datasets have verified the effectiveness of Lattice-LSTM.
However, the model architecture of Lattice-LSTM is quite complicated. In order to introduce lexicon information, Lattice-LSTM adds several additional edges between nonadjacent characters
in the input sequence, which significantly slows its training and inference speeds. In addition, it is difficult to transfer the structure of Lattice-LSTM to other neural-network architectures (e.g., convolutional neural networks and transformers) that may be more suitable for some specific tasks.
In this work, we propose a simpler method to realize the idea of Lattice-LSTM, i.e., incorporating all the matched words for each character to a character-based NER model. The first principle of our model design is to achieve a fast inference speed. To this end, we propose to encode lexicon information in the character representations, and we design the encoding scheme to preserve as much of the lexicon matching results as possible. Compared with Lattice-LSTM, our method avoids the need for a complicated model architecture, is easier to implement, and can be quickly adapted to any appropriate neural NER model by adjusting the character representation layer. In addition, ablation studies show the superiority of our method in incorporating more complete and distinct lexicon information, as well as introducing a more effective word-weighting strategy. The contributions of this work can be summarized as follows:
We propose a simple but effective method for incorporating word lexicons into the character representations for Chinese NER.
The proposed method is transferable to different sequence-labeling architectures and can be easily incorporated with pre-trained models like BERT (Devlin et al., 2018).
We performed experiments on four public Chinese NER datasets. The experimental results show that when implementing the sequence modeling layer with a single-layer Bi-LSTM, our method achieves considerable improvements over the state-of-the-art methods in both inference speed and sequence labeling performance.
2 Background
In this section, we introduce several previous works that influenced our work, including the Softword technique and Lattice-LSTM.
2.1 Softword Feature
The Softword technique was originally used for incorporating word segmentation information into downstream tasks (Zhao and Kit, 2008; Peng and Dredze, 2016). It augments the character representation with the embedding of its corresponding segmentation label:

\boldsymbol{x}_{j}^{c} \leftarrow\left[\boldsymbol{x}_{j}^{c} ; \boldsymbol{e}^{seg}\left(\operatorname{seg}\left(c_{j}\right)\right)\right].
Here, \operatorname{seg}(c_{j}) \in \mathcal{Y}_{\text{seg}} denotes the segmentation label of the character c_{j} predicted by the word segmentor, \boldsymbol{e}^{seg} denotes the segmentation label embedding lookup table, and typically \mathcal{Y}_{\text{seg}} = \{B, M, E, S\}.
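To make the Softword augmentation concrete, a minimal sketch in Python follows. The toy embedding tables, their dimensions, and the hard-coded segmenter output are illustrative assumptions of ours, not part of the original method description.

```python
import numpy as np

# Toy embedding tables; dimensions and the segmenter output are illustrative assumptions.
SEG_LABELS = ["B", "M", "E", "S"]
seg_emb = {lab: np.random.randn(8) for lab in SEG_LABELS}   # e^seg lookup table
char_emb = {}                                                # e^c lookup table (filled lazily)

def char_vector(ch, dim=50):
    if ch not in char_emb:
        char_emb[ch] = np.random.randn(dim)
    return char_emb[ch]

def softword_features(chars, seg_labels):
    """Concatenate each character embedding with the embedding of its
    predicted segmentation label (the Softword technique)."""
    assert len(chars) == len(seg_labels)
    return [np.concatenate([char_vector(c), seg_emb[lab]])
            for c, lab in zip(chars, seg_labels)]

# Example: a segmenter that treats "中山西路" as one word predicts B M M E.
feats = softword_features(list("中山西路"), ["B", "M", "M", "E"])
print(len(feats), feats[0].shape)   # 4 (58,)
```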
However, gold segmentation is not provided in most datasets, and segmentation results obtained by a segmenter can be incorrect. Therefore, segmentation errors will inevitably be introduced through this approach.
2.2 Lattice-LSTM
Lattice-LSTM was designed to incorporate lexicon information into the character-based neural NER model. To achieve this purpose, lexicon matching is first performed on the input sentence. If the sub-sequence \{c_{i}, \cdots, c_{j}\} of the sentence matches a word in the lexicon for i < j, a directed edge is added from c_{i} to c_{j}. All lexicon matching results related to a character are preserved by allowing the character to be connected with multiple other characters. Intrinsically, this practice converts the input form of a sentence from a chain into a graph.
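For illustration, a minimal sketch of this lexicon matching step is given below. A real implementation would typically use a trie over the lexicon; the toy lexicon, the plain set lookup, and the max_word_len cap are our assumptions.

```python
def lattice_edges(sentence, lexicon, max_word_len=4):
    """Return directed edges (i, j) such that sentence[i:j+1] is a lexicon word.
    Each edge connects the first and last character of a matched word."""
    chars = list(sentence)
    edges = []
    for i in range(len(chars)):
        for j in range(i + 1, min(i + max_word_len, len(chars))):
            if "".join(chars[i:j + 1]) in lexicon:
                edges.append((i, j))
    return edges

# Toy lexicon following the example in Figure 2.
lexicon = {"中山", "山西", "中山西路"}
print(lattice_edges("中山西路", lexicon))
# [(0, 1), (0, 3), (1, 2)]
```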
In a normal LSTM layer, the hidden state \boldsymbol{h}_{j} and the memory cell \boldsymbol{c}_{j} of each time step are updated by:

\boldsymbol{h}_{j}, \boldsymbol{c}_{j}=f\left(\boldsymbol{h}_{j-1}, \boldsymbol{c}_{j-1}, \boldsymbol{x}_{j}^{c}\right).

However, in order to model the graph-based input, Lattice-LSTM introduces an elaborate modification to the normal LSTM. Specifically, let s_{<*, j>} denote the list of sub-sequences of sentence s that match the lexicon and end with c_{j}, h_{<*, j>} denote the corresponding hidden state list \{h_{i}, \forall s_{<i, j>} \in s_{<*, j>}\}, and c_{<*, j>} denote the corresponding memory cell list \{c_{i}, \forall s_{<i, j>} \in s_{<*, j>}\}. In Lattice-LSTM, the hidden state h_{j} and memory cell c_{j} of c_{j} are now updated as follows:

\boldsymbol{h}_{j}, \boldsymbol{c}_{j}=f\left(\boldsymbol{h}_{j-1}, \boldsymbol{c}_{j-1}, \boldsymbol{x}_{j}^{c}, s_{<*, j>}, h_{<*, j>}, c_{<*, j>}\right),
where f is a simplified representation of the function used by Lattice-LSTM to perform the memory update.
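For intuition only, the toy sketch below shows the general idea of fusing the memory cells of matched words into the character's cell. It is not the actual Lattice-LSTM formulation: the gate computation is omitted, and the gate logits are passed in as arbitrary numbers purely to demonstrate the shapes involved.

```python
import numpy as np

def fuse_word_memories(cand_cell, word_cells, gate_logits):
    """Highly simplified illustration: combine the character's candidate cell with
    the memory cells of all matched words using normalized gates. The real model
    computes each gate from the character input and the word cell."""
    cells = np.stack([cand_cell] + list(word_cells))       # (1 + num_words, dim)
    weights = np.exp(gate_logits) / np.exp(gate_logits).sum()
    return weights @ cells                                  # fused memory cell, (dim,)

dim = 8
c_j = fuse_word_memories(np.random.randn(dim),
                         [np.random.randn(dim), np.random.randn(dim)],  # two matched words
                         np.array([0.5, 1.0, -0.2]))
print(c_j.shape)   # (8,)
```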
From our perspective, there are two main advantages to Lattice-LSTM. First, it preserves all the possible lexicon matching results that are related to
a character, which helps avoid the error propagation problem introduced by heuristically choosing a single matching result for each character. Second, it introduces pre-trained word embeddings to the system, which greatly enhances its performance.
However, efficiency problems exist in Lattice-LSTM. Compared with normal LSTM, Lattice-LSTM needs to additionally model s_{<*, j>}, h_{<*, j>}, and c_{<*, j>} for the memory update, which slows the training and inference speeds. Additionally, due to the complicated implementation of f, it is difficult for Lattice-LSTM to process multiple sentences in parallel (in the published implementation of Lattice-LSTM, the batch size was set to 1). These problems limit its application in some industrial areas where real-time NER responses are needed.
3 Approach
In this work, we sought to retain the merits of Lattice-LSTM while overcoming its drawbacks. To this end, we propose a novel method in which lexicon information is introduced by simply adjusting the character representation layer of an NER model. We refer to this method as SoftLexicon. As shown in Figure 1, the overall architecture of the proposed method is as follows. First, each character of the input sequence is mapped into a dense vector. Next, the SoftLexicon feature is constructed and added to the representation of each character. Then, these augmented character representations are put into the sequence modeling layer and the CRF layer to obtain the final predictions.
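As a rough illustration of this pipeline, a PyTorch-style skeleton is sketched below. The layer sizes, the single-layer Bi-LSTM encoder, and the replacement of the CRF layer with a plain linear classifier are simplifications of ours, not the exact configuration used in the paper; the SoftLexicon features are assumed to be precomputed and passed in as a tensor.

```python
import torch
import torch.nn as nn

class SoftLexiconNER(nn.Module):
    """Skeleton of the pipeline in Figure 1: character embeddings, concatenated
    lexicon features, a sequence modeling layer, and a per-character tag scorer.
    (A CRF layer would normally sit on top of the emission scores.)"""
    def __init__(self, vocab_size, char_dim=50, lex_dim=200, hidden=100, num_tags=13):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, char_dim)
        self.encoder = nn.LSTM(char_dim + lex_dim, hidden,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_tags)

    def forward(self, char_ids, lex_feats):
        # char_ids: (batch, seq_len); lex_feats: (batch, seq_len, lex_dim)
        x = torch.cat([self.char_emb(char_ids), lex_feats], dim=-1)
        h, _ = self.encoder(x)
        return self.classifier(h)   # emission scores per tag

model = SoftLexiconNER(vocab_size=5000)
scores = model(torch.randint(0, 5000, (2, 10)), torch.randn(2, 10, 200))
print(scores.shape)   # torch.Size([2, 10, 13])
```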
3.1 Character Representation Layer
For a character-based Chinese NER model, the input sentence is seen as a character sequence s=\{c_{1}, c_{2}, \cdots, c_{n}\} \in \mathcal{V}_{c}, where \mathcal{V}_{c} is the character vocabulary. Each character c_{i} is represented using a dense vector (embedding):

\boldsymbol{x}_{i}^{c}=\boldsymbol{e}^{c}\left(c_{i}\right),

where \boldsymbol{e}^{c} denotes the character embedding lookup table.
Char+bichar. In addition, Zhang and Yang (2018) showed that character bigrams are useful for representing characters, especially for methods that do not use word information. Therefore, it is common to augment the character representation with bigram embeddings:

\boldsymbol{x}_{i}^{c}=\left[\boldsymbol{e}^{c}\left(c_{i}\right) ; \boldsymbol{e}^{b}\left(c_{i}, c_{i+1}\right)\right],

where \boldsymbol{e}^{b} denotes the bigram embedding lookup table.

Figure 1: The overall architecture of the proposed method.
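A minimal sketch of this char+bichar representation follows. The end-of-sentence placeholder for the last bigram and the embedding dimensions are our assumptions.

```python
import numpy as np

# Toy lookup tables; dimensions are illustrative.
char_emb = {}    # e^c
bigram_emb = {}  # e^b

def lookup(table, key, dim):
    if key not in table:
        table[key] = np.random.randn(dim)
    return table[key]

def char_bichar_repr(chars, char_dim=50, bigram_dim=50):
    """x_i^c = [e^c(c_i); e^b(c_i, c_{i+1})]; the last character is paired with
    an end-of-sentence placeholder."""
    reps = []
    for i, c in enumerate(chars):
        nxt = chars[i + 1] if i + 1 < len(chars) else "</s>"
        reps.append(np.concatenate([lookup(char_emb, c, char_dim),
                                    lookup(bigram_emb, c + nxt, bigram_dim)]))
    return reps

reps = char_bichar_repr(list("中山西路"))
print(len(reps), reps[0].shape)   # 4 (100,)
```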
3.2 Incorporating Lexicon Information
The problem with the purely character-based NER model is that it fails to exploit word information. To address this issue, we propose two methods, as described below, to introduce word information into the character representations. In the following, for any input sequence s=\{c_{1}, c_{2}, \cdots, c_{n}\}, w_{i, j} denotes its sub-sequence \{c_{i}, c_{i+1}, \cdots, c_{j}\}.
ExSoftword Feature
The first method is an intuitive extension of the Softword method, called ExSoftword. Instead of choosing one segmentation result for each character, it retains all possible segmentation results obtained using the lexicon:

\boldsymbol{x}_{j}^{c} \leftarrow\left[\boldsymbol{x}_{j}^{c} ; \boldsymbol{e}^{seg}\left(\operatorname{segs}\left(c_{j}\right)\right)\right],
where \operatorname{segs}(c_{j}) denotes all segmentation labels related to c_{j}, and \boldsymbol{e}^{seg}(\operatorname{segs}(c_{j})) is a 5-dimensional multi-hot vector with each dimension corresponding to an item of \{B, M, E, S, O\}.
As an example presented in Figure 2, the character c_{7} ("西") occurs in two words, w_{5,8} ("中山西路") and w_{6,7} ("山西"), that match the lexicon, and it occurs in the middle of "中山西路" and at the end of "山西". Therefore, its corresponding segmentation result is \{M, E\}, and its character representation is enriched as follows:
\boldsymbol{x}_{7}^{c} \leftarrow\left[\boldsymbol{x}_{7}^{c} ; \boldsymbol{e}^{seg}(\{M, E\})\right].
Figure 2: The ExSoftword method.
Here, the second and third dimensions of \boldsymbol{e}^{seg}(\cdot) are set to 1, and the rest of the dimensions are set to 0.
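To make the multi-hot encoding concrete, the sketch below reproduces this example. The label-to-index order is an assumption of ours (any fixed order works), and the lexicon lookup mirrors the toy matching routine used earlier.

```python
import numpy as np

LABEL_INDEX = {"B": 0, "M": 1, "E": 2, "S": 3, "O": 4}

def exsoftword_vector(seg_labels):
    """5-dimensional multi-hot vector over {B, M, E, S, O}."""
    vec = np.zeros(5)
    for lab in seg_labels:
        vec[LABEL_INDEX[lab]] = 1.0
    return vec

def exsoftword_labels(sentence, lexicon, max_word_len=4):
    """For each character, collect all segmentation labels induced by lexicon matches."""
    chars = list(sentence)
    labels = [set() for _ in chars]
    for i in range(len(chars)):
        for j in range(i + 1, min(i + max_word_len, len(chars))):
            if "".join(chars[i:j + 1]) in lexicon:
                labels[i].add("B")
                labels[j].add("E")
                for k in range(i + 1, j):
                    labels[k].add("M")
    return [lab if lab else {"O"} for lab in labels]

lexicon = {"中山", "山西", "中山西路"}
labs = exsoftword_labels("中山西路", lexicon)
print(labs)                        # [{'B'}, {'B', 'M', 'E'}, {'M', 'E'}, {'E'}] (set order may vary)
print(exsoftword_vector(labs[2]))  # [0. 1. 1. 0. 0.]
```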
The problem with this approach is that it cannot fully inherit the two merits of Lattice-LSTM. First, it fails to introduce pre-trained word embeddings. Second, it still loses information about the matching results. As shown in Figure 2, the constructed ExSoftword features for characters \{c_{5}, c_{6}, c_{7}, c_{8}\} are \{\{B\},\{B, M, E\},\{M, E\},\{E\}\}