Simplify the Usage of Lexicon in Chinese NER
Ruotian Ma^{1*}, Minlong Peng^{1*}, Qi Zhang^{1,3}, Zhongyu Wei^{2,3}, Xuanjing Huang^{1}
^{1}Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University
^{2}School of Data Science, Fudan University
^{3}Research Institute of Intelligent and Complex Systems, Fudan University
{rtma19,mlpeng16,qz,zywei,xjhuang}@fudan.edu.cn
Abstract
Recently, many works have tried to augment the performance of Chinese named entity recognition (NER) using word lexicons. As a representative, Lattice-LSTM (Zhang and Yang, 2018) has achieved new benchmark results on several public Chinese NER datasets. However, Lattice-LSTM has a complex model architecture. This limits its application in many industrial areas where real-time NER responses are needed. In this work, we propose a simple but effective method for incorporating the word lexicon into the character representations. This method avoids designing a complicated sequence modeling architecture, and for any neural NER model, it requires only subtle adjustment of the character representation layer to introduce the lexicon information. Experimental studies on four benchmark Chinese NER datasets show that our method achieves an inference speed up to 6.15 times faster than those of state-of-the-art methods, along with a better performance. The experimental results also show that the proposed method can be easily incorporated with pre-trained models like BERT.^{1}
1 Introduction
Named Entity Recognition (NER) is concerned with the identification of named entities, such as persons, locations, and organizations, in unstructured text. NER plays an important role in many downstream tasks, including knowledge base construction (Riedel et al., 2013), information retrieval (Chen et al., 2015), and question answering (Diefenbach et al., 2018). In languages where words are naturally separated (e.g., English), NER has been conventionally formulated as a sequence
labeling problem, and the state-of-the-art results have been achieved using neural-network-based models (Huang et al., 2015; Chiu and Nichols, 2016; Liu et al., 2018).
Compared with NER in English, Chinese NER is more difficult since sentences in Chinese are not naturally segmented. Thus, a common practice for Chinese NER is to first perform Chinese word segmentation (CWS) using an existing CWS system and then apply a word-level sequence labeling model to the segmented sentence (Yang et al., 2016; He and Sun, 2017b). However, it is inevitable that the CWS system will incorrectly segment query sentences. This will result in errors in the detection of entity boundaries and the prediction of entity categories in NER. Therefore, some approaches resort to performing Chinese NER directly at the character level, which has been empirically proven to be effective (He and Wang, 2008; Liu et al., 2010; Li et al., 2014; Liu et al., 2019; Sui et al., 2019; Gui et al., 2019b; Ding et al., 2019).
A drawback of the purely character-based NER method is that word information is not fully exploited. With this consideration, Zhang and Yang (2018) proposed Lattice-LSTM for incorporating word lexicons into the character-based NER model. Moreover, rather than heuristically choosing a word for a character when it matches multiple words in the lexicon, the authors proposed to preserve all words that match the character, leaving the subsequent NER model to determine which word to apply. To realize this idea, they introduced an elaborate modification to the sequence modeling layer of the LSTM-CRF model (Huang et al., 2015). Experimental studies on four Chinese NER datasets have verified the effectiveness of Lattice-LSTM.
However, the model architecture of Lattice-LSTM is quite complicated. In order to introduce lexicon information, Lattice-LSTM adds several additional edges between nonadjacent characters
in the input sequence, which significantly slows its training and inference speeds. In addition, it is difficult to transfer the structure of Lattice-LSTM to other neural-network architectures (e.g., convolutional neural networks and transformers) that may be more suitable for some specific tasks.
In this work, we propose a simpler method to realize the idea of Lattice-LSTM, i.e., incorporating all the matched words for each character to a character-based NER model. The first principle of our model design is to achieve a fast inference speed. To this end, we propose to encode lexicon information in the character representations, and we design the encoding scheme to preserve as much of the lexicon matching results as possible. Compared with Lattice-LSTM, our method avoids the need for a complicated model architecture, is easier to implement, and can be quickly adapted to any appropriate neural NER model by adjusting the character representation layer. In addition, ablation studies show the superiority of our method in incorporating more complete and distinct lexicon information, as well as introducing a more effective word-weighting strategy. The contributions of this work can be summarized as follows:
We propose a simple but effective method for incorporating word lexicons into the character representations for Chinese NER.
The proposed method is transferable to different sequence-labeling architectures and can be easily incorporated with pre-trained models like BERT (Devlin et al., 2018).
We performed experiments on four public Chinese NER datasets. The experimental results show that when implementing the sequence modeling layer with a single-layer Bi-LSTM, our method achieves considerable improvements over the state-of-the-art methods in both inference speed and sequence labeling performance.
2 Background
In this section, we introduce several previous works that influenced our work, including the Softword technique and Lattice-LSTM.
2.1 Softword Feature
The Softword technique was originally used for incorporating word segmentation information into downstream tasks (Zhao and Kit, 2008; Peng and Dredze, 2016). It augments the character representation with the embedding of its corresponding segmentation label:

\boldsymbol{x}_{j}^{c} \leftarrow\left[\boldsymbol{x}_{j}^{c} ; \boldsymbol{e}^{seg}\left(\operatorname{seg}\left(c_{j}\right)\right)\right].
Here, \operatorname{seg}(c_{j}) \in \mathcal{Y}_{\text{seg}} denotes the segmentation label of the character c_{j} predicted by the word segmentor, \boldsymbol{e}^{seg} denotes the segmentation label embedding lookup table, and typically \mathcal{Y}_{\text{seg}} = \{B, M, E, S\}.
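To make the Softword augmentation concrete, a minimal sketch in Python follows. The toy embedding tables, their dimensions, and the hard-coded segmenter output are illustrative assumptions of ours, not part of the original method description.

```python
import numpy as np

# Toy embedding tables; dimensions and the segmenter output are illustrative assumptions.
SEG_LABELS = ["B", "M", "E", "S"]
seg_emb = {lab: np.random.randn(8) for lab in SEG_LABELS}   # e^seg lookup table
char_emb = {}                                                # e^c lookup table (filled lazily)

def char_vector(ch, dim=50):
    if ch not in char_emb:
        char_emb[ch] = np.random.randn(dim)
    return char_emb[ch]

def softword_features(chars, seg_labels):
    """Concatenate each character embedding with the embedding of its
    predicted segmentation label (the Softword technique)."""
    assert len(chars) == len(seg_labels)
    return [np.concatenate([char_vector(c), seg_emb[lab]])
            for c, lab in zip(chars, seg_labels)]

# Example: a segmenter that treats "中山西路" as one word predicts B M M E.
feats = softword_features(list("中山西路"), ["B", "M", "M", "E"])
print(len(feats), feats[0].shape)   # 4 (58,)
```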
However, gold segmentation is not provided in most datasets, and segmentation results obtained by a segmenter can be incorrect. Therefore, segmentation errors will inevitably be introduced through this approach.
2.2 Lattice-LSTM
Lattice-LSTM was designed to incorporate lexicon information into the character-based neural NER model. To achieve this purpose, lexicon matching is first performed on the input sentence. If the sub-sequence \{c_{i}, \cdots, c_{j}\} of the sentence matches a word in the lexicon for i < j, a directed edge is added from c_{i} to c_{j}. All lexicon matching results related to a character are preserved by allowing the character to be connected with multiple other characters. Intrinsically, this practice converts the input form of a sentence from a chain into a graph.
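For illustration, a minimal sketch of this lexicon matching step is given below. A real implementation would typically use a trie over the lexicon; the toy lexicon, the plain set lookup, and the max_word_len cap are our assumptions.

```python
def lattice_edges(sentence, lexicon, max_word_len=4):
    """Return directed edges (i, j) such that sentence[i:j+1] is a lexicon word.
    Each edge connects the first and last character of a matched word."""
    chars = list(sentence)
    edges = []
    for i in range(len(chars)):
        for j in range(i + 1, min(i + max_word_len, len(chars))):
            if "".join(chars[i:j + 1]) in lexicon:
                edges.append((i, j))
    return edges

# Toy lexicon following the example in Figure 2.
lexicon = {"中山", "山西", "中山西路"}
print(lattice_edges("中山西路", lexicon))
# [(0, 1), (0, 3), (1, 2)]
```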
In a normal LSTM layer, the hidden state \boldsymbol{h}_{j} and the memory cell \boldsymbol{c}_{j} of each time step are updated by:

\boldsymbol{h}_{j}, \boldsymbol{c}_{j}=f\left(\boldsymbol{h}_{j-1}, \boldsymbol{c}_{j-1}, \boldsymbol{x}_{j}^{c}\right).

However, in order to model the graph-based input, Lattice-LSTM introduces an elaborate modification to the normal LSTM. Specifically, let s_{<*, j>} denote the list of sub-sequences of sentence s that match the lexicon and end with c_{j}, h_{<*, j>} denote the corresponding hidden state list \{h_{i}, \forall s_{<i, j>} \in s_{<*, j>}\}, and c_{<*, j>} denote the corresponding memory cell list \{c_{i}, \forall s_{<i, j>} \in s_{<*, j>}\}. In Lattice-LSTM, the hidden state h_{j} and memory cell c_{j} of c_{j} are now updated as follows:

\boldsymbol{h}_{j}, \boldsymbol{c}_{j}=f\left(\boldsymbol{h}_{j-1}, \boldsymbol{c}_{j-1}, \boldsymbol{x}_{j}^{c}, s_{<*, j>}, h_{<*, j>}, c_{<*, j>}\right),
where f is a simplified representation of the function used by Lattice-LSTM to perform the memory update.
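For intuition only, the toy sketch below shows the general idea of fusing the memory cells of matched words into the character's cell. It is not the actual Lattice-LSTM formulation: the gate computation is omitted, and the gate logits are passed in as arbitrary numbers purely to demonstrate the shapes involved.

```python
import numpy as np

def fuse_word_memories(cand_cell, word_cells, gate_logits):
    """Highly simplified illustration: combine the character's candidate cell with
    the memory cells of all matched words using normalized gates. The real model
    computes each gate from the character input and the word cell."""
    cells = np.stack([cand_cell] + list(word_cells))       # (1 + num_words, dim)
    weights = np.exp(gate_logits) / np.exp(gate_logits).sum()
    return weights @ cells                                  # fused memory cell, (dim,)

dim = 8
c_j = fuse_word_memories(np.random.randn(dim),
                         [np.random.randn(dim), np.random.randn(dim)],  # two matched words
                         np.array([0.5, 1.0, -0.2]))
print(c_j.shape)   # (8,)
```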
From our perspective, there are two main advantages to Lattice-LSTM. First, it preserves all the possible lexicon matching results that are related to
a character, which helps avoid the error propagation problem introduced by heuristically choosing a single matching result for each character. Second, it introduces pre-trained word embeddings to the system, which greatly enhances its performance.
However, efficiency problems exist in Lattice-LSTM. Compared with normal LSTM, Lattice-LSTM needs to additionally model s_{<*, j>}, h_{<*, j>}, and c_{<*, j>} for the memory update, which slows the training and inference speeds. Additionally, due to the complicated implementation of f, it is difficult for Lattice-LSTM to process multiple sentences in parallel (in the published implementation of Lattice-LSTM, the batch size was set to 1). These problems limit its application in some industrial areas where real-time NER responses are needed.
3 Approach
In this work, we sought to retain the merits of Lattice-LSTM while overcoming its drawbacks. To this end, we propose a novel method in which lexicon information is introduced by simply adjusting the character representation layer of an NER model. We refer to this method as SoftLexicon. As shown in Figure 1, the overall architecture of the proposed method is as follows. First, each character of the input sequence is mapped into a dense vector. Next, the SoftLexicon feature is constructed and added to the representation of each character. Then, these augmented character representations are put into the sequence modeling layer and the CRF layer to obtain the final predictions.
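As a rough illustration of this pipeline, a PyTorch-style skeleton is sketched below. The layer sizes, the single-layer Bi-LSTM encoder, and the replacement of the CRF layer with a plain linear classifier are simplifications of ours, not the exact configuration used in the paper; the SoftLexicon features are assumed to be precomputed and passed in as a tensor.

```python
import torch
import torch.nn as nn

class SoftLexiconNER(nn.Module):
    """Skeleton of the pipeline in Figure 1: character embeddings, concatenated
    lexicon features, a sequence modeling layer, and a per-character tag scorer.
    (A CRF layer would normally sit on top of the emission scores.)"""
    def __init__(self, vocab_size, char_dim=50, lex_dim=200, hidden=100, num_tags=13):
        super().__init__()
        self.char_emb = nn.Embedding(vocab_size, char_dim)
        self.encoder = nn.LSTM(char_dim + lex_dim, hidden,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden, num_tags)

    def forward(self, char_ids, lex_feats):
        # char_ids: (batch, seq_len); lex_feats: (batch, seq_len, lex_dim)
        x = torch.cat([self.char_emb(char_ids), lex_feats], dim=-1)
        h, _ = self.encoder(x)
        return self.classifier(h)   # emission scores per tag

model = SoftLexiconNER(vocab_size=5000)
scores = model(torch.randint(0, 5000, (2, 10)), torch.randn(2, 10, 200))
print(scores.shape)   # torch.Size([2, 10, 13])
```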
3.1 Character Representation Layer
For a character-based Chinese NER model, the input sentence is seen as a character sequence s=\{c_{1}, c_{2}, \cdots, c_{n}\} \in \mathcal{V}_{c}, where \mathcal{V}_{c} is the character vocabulary. Each character c_{i} is represented using a dense vector (embedding):

\boldsymbol{x}_{i}^{c}=\boldsymbol{e}^{c}\left(c_{i}\right),

where \boldsymbol{e}^{c} denotes the character embedding lookup table.
Char+bichar. In addition, Zhang and Yang (2018) showed that character bigrams are useful for representing characters, especially for methods that do not use word information. Therefore, it is common to augment the character representation with bigram embeddings:

\boldsymbol{x}_{i}^{c}=\left[\boldsymbol{e}^{c}\left(c_{i}\right) ; \boldsymbol{e}^{b}\left(c_{i}, c_{i+1}\right)\right],

where \boldsymbol{e}^{b} denotes the bigram embedding lookup table.

Figure 1: The overall architecture of the proposed method.
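A minimal sketch of this char+bichar representation follows. The end-of-sentence placeholder for the last bigram and the embedding dimensions are our assumptions.

```python
import numpy as np

# Toy lookup tables; dimensions are illustrative.
char_emb = {}    # e^c
bigram_emb = {}  # e^b

def lookup(table, key, dim):
    if key not in table:
        table[key] = np.random.randn(dim)
    return table[key]

def char_bichar_repr(chars, char_dim=50, bigram_dim=50):
    """x_i^c = [e^c(c_i); e^b(c_i, c_{i+1})]; the last character is paired with
    an end-of-sentence placeholder."""
    reps = []
    for i, c in enumerate(chars):
        nxt = chars[i + 1] if i + 1 < len(chars) else "</s>"
        reps.append(np.concatenate([lookup(char_emb, c, char_dim),
                                    lookup(bigram_emb, c + nxt, bigram_dim)]))
    return reps

reps = char_bichar_repr(list("中山西路"))
print(len(reps), reps[0].shape)   # 4 (100,)
```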
3.2 Incorporating Lexicon Information
The problem with the purely character-based NER model is that it fails to exploit word information. To address this issue, we propose two methods, as described below, to introduce word information into the character representations. In the following, for any input sequence s=\{c_{1}, c_{2}, \cdots, c_{n}\}, w_{i, j} denotes its sub-sequence \{c_{i}, c_{i+1}, \cdots, c_{j}\}.
ExSoftword Feature
The first method is an intuitive extension of the Softword method, called ExSoftword. Instead of choosing one segmentation result for each character, it retains all possible segmentation results obtained using the lexicon:

\boldsymbol{x}_{j}^{c} \leftarrow\left[\boldsymbol{x}_{j}^{c} ; \boldsymbol{e}^{seg}\left(\operatorname{segs}\left(c_{j}\right)\right)\right],
where \operatorname{segs}(c_{j}) denotes all segmentation labels related to c_{j}, and \boldsymbol{e}^{seg}(\operatorname{segs}(c_{j})) is a 5-dimensional multi-hot vector with each dimension corresponding to an item of \{B, M, E, S, O\}.
As an example presented in Figure 2, the character c_{7} ("西") occurs in two words, w_{5,8} ("中山西路") and w_{6,7} ("山西"), that match the lexicon, and it occurs in the middle of "中山西路" and at the end of "山西". Therefore, its corresponding segmentation result is \{M, E\}, and its character representation is enriched as follows:
\boldsymbol{x}_{7}^{c} \leftarrow\left[\boldsymbol{x}_{7}^{c} ; \boldsymbol{e}^{seg}(\{M, E\})\right].
Figure 2: The ExSoftword method.
Here, the second and third dimensions of \boldsymbol{e}^{seg}(\cdot) are set to 1, and the rest of the dimensions are set to 0.
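To make the multi-hot encoding concrete, the sketch below reproduces this example. The label-to-index order is an assumption of ours (any fixed order works), and the lexicon lookup mirrors the toy matching routine used earlier.

```python
import numpy as np

LABEL_INDEX = {"B": 0, "M": 1, "E": 2, "S": 3, "O": 4}

def exsoftword_vector(seg_labels):
    """5-dimensional multi-hot vector over {B, M, E, S, O}."""
    vec = np.zeros(5)
    for lab in seg_labels:
        vec[LABEL_INDEX[lab]] = 1.0
    return vec

def exsoftword_labels(sentence, lexicon, max_word_len=4):
    """For each character, collect all segmentation labels induced by lexicon matches."""
    chars = list(sentence)
    labels = [set() for _ in chars]
    for i in range(len(chars)):
        for j in range(i + 1, min(i + max_word_len, len(chars))):
            if "".join(chars[i:j + 1]) in lexicon:
                labels[i].add("B")
                labels[j].add("E")
                for k in range(i + 1, j):
                    labels[k].add("M")
    return [lab if lab else {"O"} for lab in labels]

lexicon = {"中山", "山西", "中山西路"}
labs = exsoftword_labels("中山西路", lexicon)
print(labs)                        # [{'B'}, {'B', 'M', 'E'}, {'M', 'E'}, {'E'}] (set order may vary)
print(exsoftword_vector(labs[2]))  # [0. 1. 1. 0. 0.]
```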
The problem with this approach is that it cannot fully inherit the two merits of Lattice-LSTM. First, it fails to introduce pre-trained word embeddings. Second, it still loses information about the matching results. As shown in Figure 2, the constructed ExSoftword features for characters \{c_{5}, c_{6}, c_{7}, c_{8}\} are \{\{B\},\{B, M, E\},\{M, E\},\{E\}\}