Simple Semantic-based Data Augmentation for Named Entity Recognition in Biomedical Texts
Uyen T.P. Phan, Nhung T.H. Nguyen
Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
Vietnam National University, Ho Chi Minh City, Vietnam
Department of Computer Science, University of Manchester, UK
ptpuyen@fit.hcmus.edu.vn, nhung.nguyen@manchester.ac.uk
Abstract
Data augmentation is important in addressing data sparsity and low resources in NLP. Unlike data augmentation for other tasks, such as sentence-level and sentence-pair ones, data augmentation for named entity recognition (NER) requires preserving the semantics of entities. To that end, in this paper we propose a simple semantic-based data augmentation method for biomedical NER. Our method leverages semantic information from pre-trained language models at both the entity and sentence levels. Experimental results on two datasets, i2b2-2010 (English) and VietBioNER (Vietnamese), show that the proposed method can improve NER performance.
1 Introduction
In machine learning, and especially in deep learning approaches, the performance of trained models is often proportional to the size of the training data. Consequently, for a model to achieve acceptable performance, we need a certain amount of labelled data. This is an issue for low-resource domains and low-resource languages, since annotating data is time-consuming and expensive. To address the issue, data augmentation has been proposed to increase the variety of training data without directly collecting or annotating additional data (Feng et al., 2021).
Intuitively, data augmentation for the named entity recognition (NER) task is more difficult to perform than for sentence-level and sentence-pair tasks. Simple operations used to augment a sentence, such as token swap, token deletion, and token insertion (Wei and Zou, 2019), may not work well in the case of NER, especially in the biomedical domain. One of the reasons is that a named entity can be composed of multiple tokens, and we have to preserve the semantics of entities after applying those operations. For example, consider the following sentence from the i2b2-2010 corpus (Uzuner et al., 2011) with its entities:
She can be given prn [lasix]Treatment for [weight gain]Problem or [shortness of breath]Problem.

If we randomly swap the 'lasix' token with 'weight', the sentence is no longer semantically correct. Similarly, when the 'weight' token is deleted, the remaining 'gain' token is no longer suitable for an entity of type Problem. For the insertion operation, if we randomly insert a token into the sentence, the semantics of the sentence change and we cannot assign a suitable entity label to the new token. As a result, it is necessary to have augmentation methods specified for NER.
There are several model-based data augmentation methods for NER. Chen et al. (2020) proposed Local Additivity-based Data Augmentation (LADA), which creates virtual samples using an interpolation technique. Their experimental results showed that LADA could produce state-of-the-art (SOTA) results on two NER benchmarks, CoNLL-2003 (Tjong Kim Sang and De Meulder, 2003) and GermEval 2014 (Benikova et al., 2014). Meanwhile, Nie et al. (2020) took advantage of the rich semantic information in pre-trained word embeddings to create a semantic augmentation module for NER models. They also reported SOTA performance on some social media corpora.
Obviously, model-based methods can help improve NER performance, but they are often complicated and difficult to implement. In contrast, rule-based methods are simpler and more interpretable than model-based ones, yet still effective. Dai and Adel (2020) adjusted simple operations such as replacement and shuffle to preserve the semantics of both entities and sentences. Specifically, they proposed Synonym Replacement (SR) and Mention Replacement (MR). SR replaces a word in a sentence with a synonym taken from WordNet. MR replaces a whole entity with another random entity of the same entity type taken from the training data; the replacement action for each entity is decided based on a binomial distribution.
As a result, they could improve NER performance on both the MaSciP and i2b2-2010 corpora.
We find two limitations in Dai and Adel (2020)'s approach. Firstly, although the SR operation takes the semantics of tokens into account, it does not consider semantics at the entity level. Secondly, the MR operation replaces entities randomly, which may produce semantically incorrect sentences. We hypothesise that if we control the semantics at the entity and sentence levels during augmentation, we can create meaningful augmented data, hence improving NER performance. To that end, we propose Semantic Neighbour Replacement (SNR), a simple data augmentation method for biomedical NER that considers semantics at both the entity and sentence levels.
Specifically, at the entity level, unlike MR (Dai and Adel, 2020), we only replace a source entity with a target entity if the target entity is of the same entity type and semantically related to the source one. At the sentence level, we only retain augmented sentences that are semantically related to the original sentence. Semantically related entities and sentences are identified using pre-trained language models.
We conducted experiments on two biomedical datasets: i2b2-2010 (Uzuner et al., 2011), an English corpus of clinical records, and VietBioNER (Phan et al., 2022), a Vietnamese corpus of biomedical texts. Experimental results indicate that with SNR, we can improve NER performance in low-resource settings as well as on the full training data. In particular, F1-scores increased on both i2b2-2010 and VietBioNER.
2 Methodology
The core idea of SNR is to replace entities and to filter augmented sentences based on semantic similarity. The method can be divided into three consecutive phases: semantic neighbour extraction, entity replacement, and sentence evaluation.
Semantic Neighbour Extraction: Initially, we perform feature extraction for entities using pre-trained language models. An entity embedding is calculated by averaging the word embeddings it contains. Next, we generate sets of semantic neighbours based on cosine similarity. An entity is a semantic neighbour of another entity if both belong to the same entity type and their cosine similarity is greater than or equal to an entity-level threshold.
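A minimal sketch of this phase is shown below, assuming English clinical text and ClinicalBERT from Hugging Face (cf. Section 3.2); the helper names and the threshold value of 0.8 are illustrative, not the paper's exact settings.

```python
import torch
from collections import defaultdict
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
model.eval()

def entity_embedding(mention):
    """Embed an entity mention as the average of its subword embeddings."""
    inputs = tokenizer(mention, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, dim)
    return hidden[1:-1].mean(dim=0)                    # drop [CLS] and [SEP]

def semantic_neighbours(entities, threshold=0.8):
    """entities: list of (mention, entity_type) pairs.
    Returns mention -> neighbours of the same type with cosine sim >= threshold."""
    embeddings = {m: entity_embedding(m) for m, _ in entities}
    neighbours = defaultdict(list)
    for m1, t1 in entities:
        for m2, t2 in entities:
            if m1 == m2 or t1 != t2:
                continue  # neighbours must share the entity type
            sim = torch.cosine_similarity(embeddings[m1], embeddings[m2], dim=0)
            if sim.item() >= threshold:
                neighbours[m1].append(m2)
    return neighbours
```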
Entity Replacement: During this phase, we generate new sentences by replacing an entity with another entity chosen at random from its semantic neighbour set. For each entity type, we randomly replace only one entity of that type in a sentence. As a result, we obtain a set of augmented sentences derived from the original ones.
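A sketch of this phase under a simple tokens-plus-spans representation (our assumption; the paper does not prescribe a data structure):

```python
import random
from collections import defaultdict

def replace_one_entity_per_type(tokens, spans, neighbours):
    """tokens: list of words; spans: list of (start, end, entity_type, mention);
    neighbours: mention -> semantically related mentions of the same type."""
    by_type = defaultdict(list)
    for span in spans:
        by_type[span[2]].append(span)
    # Randomly pick at most one entity per type that actually has neighbours.
    chosen = []
    for etype, group in by_type.items():
        candidates = [s for s in group if neighbours.get(s[3])]
        if candidates:
            chosen.append(random.choice(candidates))
    new_tokens = list(tokens)
    # Apply replacements right-to-left so earlier offsets stay valid.
    for start, end, _, mention in sorted(chosen, key=lambda s: -s[0]):
        new_tokens[start:end] = random.choice(neighbours[mention]).split()
    return " ".join(new_tokens)
```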
Sentence Evaluation: Augmented sentences generated in the previous phase may be semantically incorrect, which could harm the training process. To alleviate the issue, we perform an automatic evaluation to remove augmented sentences that are semantically different from their originals. To that end, we first represent both original and augmented sentences as vectors using a pre-trained sentence-level language model. We then use cosine similarity to estimate the semantic similarity between the two sentences. If the cosine similarity of an augmented sentence and its original sentence is below a sentence-level threshold, the augmented sentence is discarded.
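The sketch below illustrates this phase with the sentence-transformers package; the checkpoint name, function name, and the threshold value of 0.7 are illustrative assumptions rather than the paper's exact settings.

```python
from sentence_transformers import SentenceTransformer, util

sbert = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")

def filter_augmented(original, augmented, threshold=0.7):
    """Keep only augmented sentences whose SBERT cosine similarity
    with the original sentence is at least the threshold."""
    emb_orig = sbert.encode(original, convert_to_tensor=True)
    emb_aug = sbert.encode(augmented, convert_to_tensor=True)
    sims = util.cos_sim(emb_orig, emb_aug)[0]  # one score per augmented sentence
    return [s for s, sim in zip(augmented, sims) if sim.item() >= threshold]
```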
In this paper, both thresholds range over [0, 1]. The larger the entity-level threshold, the greater the semantic similarity between entities, but the smaller the number of neighbours. The sentence-level threshold represents the degree of rigour of the automatic evaluation phase. When it approaches 1, only sentences that are very close to the meaning of the original sentence are retained, so we keep only a few of the augmented sentences. In contrast, we keep more sentences as it approaches 0. When it is set to 0, the sentence evaluation phase is disabled; at this point, we do not discard any augmented sentences from the second phase. Both thresholds can be fine-tuned to generate suitable augmented data.
3 Experiments
3.1 Datasets
We conduct experiments on two datasets: i2b2-2010 (English) (Uzuner et al., 2011) and VietBioNER (Vietnamese) (Phan et al., 2022). The i2b2 corpus consists of patient records annotated with three named entity categories: Medical Problem, Test, and Treatment. Meanwhile, VietBioNER consists of biomedical grey literature on tuberculosis, annotated with five named entity categories: Organisation, Location, Date and Time, Symptom and Disease, and Diagnostic Procedure. Statistics of both corpora are reported in Table 1.
                      i2b2-2010   VietBioNER
#Sentence             32894       1706
  Training set        9558        706
  Development set     2389        300
  Test set            20947       700
Avg. len. of sent.    13          31
#Entity type          3           5
Vocab size            24321       3548

Table 1: Summary statistics of the two datasets.
Following Dai and Adel (2020), to simulate a low-resource setting, we create small, medium, and large sets with 50, 150, and 500 sentences, respectively. These sentences are randomly selected from the training part of each dataset. Note that our small, medium, and large splits of the i2b2 dataset are different from those of Dai and Adel (2020). Augmentation methods are applied only to the training set; we use the same development and test sets for all experiments. A sketch of how such splits can be drawn is given below.
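This sketch assumes the splits are nested random samples of an already-loaded sentence list; the paper only states that sentences are randomly selected, so nesting and the seed are our assumptions.

```python
import random

def make_splits(train_sentences, seed=42):
    """Draw simulated low-resource splits from the full training set."""
    rng = random.Random(seed)
    shuffled = rng.sample(train_sentences, len(train_sentences))
    return {"small": shuffled[:50], "medium": shuffled[:150], "large": shuffled[:500]}
```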
3.2 Language Models
For semantic neighbour extraction, we use ClinicalBERT (Alsentzer et al., 2019), a language model pre-trained on clinical text, for the i2b2-2010 dataset, and PhoBERT (Nguyen and Nguyen, 2020), a language model pre-trained on Vietnamese Wikipedia and news, for VietBioNER.
In sentence evaluation, we employ Sentence-BERT (SBERT) (Reimers and Gurevych, 2019), a sentence-level language model for sentence embeddings, to represent both original and augmented sentences.
We use all the mentioned models with the initialised weights provided by Hugging Face.
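For reference, the models could be loaded as follows; the checkpoint identifiers are our assumptions of typical Hugging Face names for these models, as the paper does not list exact checkpoints.

```python
from transformers import AutoModel
from sentence_transformers import SentenceTransformer

clinicalbert = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")  # English clinical text
phobert = AutoModel.from_pretrained("vinai/phobert-base")                    # Vietnamese
sbert = SentenceTransformer("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
```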
Regarding NER training, we also fine-tune the aforementioned language models on the two corpora.
3.3 Experiment Settings
To show the effectiveness of the proposed method, we conducted the following experiments:
- Baseline: We trained NER models on the original training data only.
- Baseline combined with augmented data: We trained NER models on the original training set and its augmented data, created by the following three methods:
                 MR     ER     SNR
i2b2-2010
  S              17     19     12
  M              67     90     61
  L              242    347    239
  F              4462   7308   4626
VietBioNER
  S              21     9      7
  M              76     13     13
  L              256    86     84
  F              347    550    459

Table 2: Number of augmented sentences in each training set. The Small (S), Medium (M), Large (L), and Full (F) sets contain 50 sentences, 150 sentences, 500 sentences, and the complete training set, respectively.
Mention Replacement (MR): We followed the MR method proposed by Dai and Adel (2020).
Entity Replacement (ER): We performed only the first two phases of our proposed method. The last phase, Sentence Evaluation, was disabled by setting the sentence-level threshold to 0.
Semantic Neighbour Replacement (SNR): We performed all three phases of our proposed method.
Note that since this paper focuses on biomedical entities, we only created augmented data for the Symptom_and_Disease and DiagnosticProcedure entities in the case of VietBioNER. We nevertheless report NER performance on all five NE categories.
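Tying the earlier sketches together, ER corresponds to running SNR with the sentence-evaluation threshold set to 0. The hypothetical driver below reuses the helper functions from the Section 2 sketches:

```python
def augment(tokens, spans, neighbours, original_text, sent_threshold=0.7):
    """ER keeps every replacement (threshold 0); SNR additionally filters."""
    candidate = replace_one_entity_per_type(tokens, spans, neighbours)
    if sent_threshold == 0:  # ER: sentence evaluation disabled
        return [candidate]
    return filter_augmented(original_text, [candidate], threshold=sent_threshold)
```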
3.4 Experimental Results
Based on fine-tuning results on the development sets, we selected one entity-level threshold for all sets of i2b2-2010; for VietBioNER, one value for the full set and another for the other sets; and a single sentence-level threshold for all cases across the corpora. The number of augmented sentences generated in each setting is reported in Table 2. Since SNR discards augmented sentences that are not semantically related to the original ones, it is expected that the number of augmented sentences produced by SNR is less than or equal to that produced by ER.
We trained NER models on a combination of augmented and original sentences and applied them to the corresponding test sets.
Method              i2b2-2010                     VietBioNER
                    S      M      L      F        S      M      L      F
Baseline            37.13  67.58  75.53  87.21    59.21  70.78  79.48  79.60
+ MR                –      67.21  76.35  87.54    71.19  79.31  79.00  –
+ ER (our method)   39.42  68.36  76.33  87.37    59.31  71.94  –      80.09
+ SNR (our method)  38.75  –      –      –        59.83  –      79.34  –

Table 3: NER performance of different augmentation methods in terms of F1-score (– : value missing).
            i2b2-2010
Original    Her speech was fluent with no [...].
MR          Her speech was fluent with no [...].
SNR         Her speech was fluent with no [...].

            VietBioNER
Original    Tuy nhiên, các xét nghiệm tế bào và vi trùng trong chẩn đoán [...] có độ nhạy còn thấp.
            (However, cytology and bacteria tests in the diagnosis of [...] have low sensitivity.)
MR          Tuy nhiên, các xét nghiệm tế bào và vi trùng trong chẩn đoán [ho khan] có độ nhạy còn thấp.
            (However, cytology and bacteria tests in the diagnosis of [dry cough] have low sensitivity.)
SNR         [...]

Table 4: Original sentences and their augmented counterparts produced by different methods. Bracketed text indicates entity replacement; [...] marks an entity mention missing from this version.
The NER performance in terms of F1-score on those sets is reported in Table 3. Generally, we can see that NER performance improved when using data augmentation methods on both the English and Vietnamese corpora. Detailed precision and recall results can be found in Appendix A.
Among the four data sizes, MR (Dai and Adel, 2020) obtained the best performance in the small setting across the two corpora. This can be explained by the fact that, with only 50 training sentences, adding more sentences helps the model overcome overfitting. With the medium-sized sets, MR improved performance on VietBioNER but not on i2b2-2010. In contrast, MR boosted F1-scores on the large and full sets of i2b2-2010, but not of VietBioNER.
Regarding SNR, we obtained better F1-scores in most settings of the medium, large, and full sets on both the English and Vietnamese corpora. On the i2b2 English corpus, the proposed methods achieved larger average F1-score improvements (for both SNR and ER) than MR; the same trend holds for the average improvements of SNR, ER, and MR on VietBioNER. It is worth noting that even with a full training set, augmenting the training data with SNR could still boost NER performance; in particular, F1-scores increased on both i2b2-2010 and VietBioNER.
Interestingly, while the number of augmented sentences produced by SNR is lower than that produced by ER (as shown in Table 2), the NER performance of SNR is better than that of ER in most cases across the corpora. This indicates that having augmented sentences semantically related to the original ones in the training data really does improve NER performance, even though the total number of sentences is smaller. For instance, in the case of i2b2-2010, SNR generated roughly one-third fewer sentences than ER, but the NER performance of SNR was still better than that of ER.
3.5 Analysis
Although using MR can help improve NER performance (as illustrated in Table 3), MR inevitably produces some meaningless sentences. We collected such examples and show them in Table 4. It can be seen that although MR replaces entities with ones of the same type, the resulting sentence can be meaningless. Meanwhile, SNR controls semantics at both the entity level and the sentence level, hence producing sentences that are closer to the original meaning than those produced by MR.
Moreover, we observed that most of the sentences discarded by the sentence evaluation were semantically incorrect. We report some of the discarded sentences in Table 5.
            i2b2-2010
Original    He did not sleep at night before and was [extremely fatigued].
Augmented   He did not sleep at night before and was [some shortness of breath].

            VietBioNER
Original    Hình ảnh [X-quang phổi]DiagnosticProcedure chủ yếu là thâm nhiễm ...
            (The [chest X-ray] image is mainly infiltrative ...)
Augmented   Hình ảnh [chọc dò màng phổi] chủ yếu là thâm nhiễm ...
            (The [thoracentesis] image is mainly infiltrative ...)

Table 5: Examples of augmented sentences discarded by the Sentence Evaluation phase of SNR. Bracketed text indicates entity replacement.
It is obvious that the entity replacement altered the meaning of those sentences and made them meaningless. As aforementioned, by discarding such sentences, SNR could produce better NER performance, indicating that it is useful to filter augmented sentences based on their semantic relatedness to the originals.
4 Conclusion
In this paper, we proposed a semantic-based data augmentation method for the named entity recognition task in the biomedical domain. Our method, namely Semantic Neighbour Replacement (SNR), simply generates more training sentences based on the semantics of entities and sentences. Experiments on simulated low-resource settings show that with the proposed method, we can improve F1-scores on both the English (i2b2-2010) and Vietnamese (VietBioNER) corpora, even in the full training setting. Such results again confirm the importance of semantics in data augmentation. We believe that SNR can be applied to other domains and other languages as long as corresponding pre-trained language models are available.
Similar to previous work, our proposed method only augments in-domain data. Therefore, a follow-up work would be to study cross-domain augmentation methods (Chen et al., 2021), in which we can leverage rich-resource data to enrich low-resource data.
Acknowledgements
We would like to thank the anonymous reviewers for their useful comments. This research was partially funded by the University of Science, VNU-HCM, Vietnam under grant number CNTT202004.
References
Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72-78, Minneapolis, Minnesota, USA. Association for Computational Linguistics.
Darina Benikova, Chris Biemann, Max Kisselew, and Sebastian Padó. 2014. GermEval 2014 named entity recognition shared task: Companion paper. In Proceedings of the KONVENS GermEval Workshop, pages 104-112, Hildesheim, Germany.
Jiaao Chen, Zhenghui Wang, Ran Tian, Zichao Yang, and Diyi Yang. 2020. Local additivity based data augmentation for semi-supervised NER. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.
Shuguang Chen, Gustavo Aguilar, Leonardo Neves, and Thamar Solorio. 2021. Data augmentation for cross-domain named entity recognition. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5346-5356, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
Xiang Dai and Heike Adel. 2020. An analysis of simple data augmentation for named entity recognition. CoRR, abs/2010.11683.
Steven Y. Feng, Varun Gangal, Jason Wei, Sarath Chandar, Soroush Vosoughi, Teruko Mitamura, and Eduard Hovy. 2021. A survey of data augmentation approaches for NLP. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 968-988, Online. Association for Computational Linguistics.
Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1037-1042.
Yuyang Nie, Yuanhe Tian, Xiang Wan, Yan Song, and Bo Dai. 2020. Named entity recognition for social media texts with semantic augmentation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.
Uyen Phan, Phuong Nguyen, and Nhung Nguyen. 2022. A named entity recognition corpus for Vietnamese biomedical texts to support tuberculosis treatment. In Proceedings of the 13th Language Resources and Evaluation Conference. European Language Resources Association.
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
Erik F. Tjong Kim Sang and Fien De Meulder. 2003. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 142-147.
Ozlem Uzuner, Brett South, Shuying Shen, and Scott DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18:552-556.
Jason Wei and Kai Zou. 2019. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6382-6388, Hong Kong, China. Association for Computational Linguistics.
A Detailed Results
We report detailed precision, recall, and F1-score results on i2b2-2010 in Table 6 and on VietBioNER in Table 7.
As expected, NER performance in terms of recall mostly improved when using the data augmentation methods. In terms of precision, the increase or decrease in NER performance depended on the data augmentation method as well as the size of the training data. Nevertheless, in the case of full training data, using the SNR method improved NER performance in both recall and precision across the corpora.
            Small                  Medium                 Large                  Full
Method      P      R      F1      P      R      F1      P      R      F1      P      R      F1
Baseline    43.39  32.45  37.13   66.54  68.65  67.58   74.00  77.13  75.53   86.24  88.20  87.21
+ MR        35.59  –      –       63.36  71.55  67.21   73.56  79.35  76.35   86.67  88.42  87.54
+ ER        42.44  –      39.42   69.52  –      68.36   72.97  –      76.33   86.47  88.29  87.37
+ SNR       42.49  35.62  38.75   67.11  –      –       –      79.51  –       –      –      –

Table 6: NER performance on i2b2-2010 of different augmentation methods in terms of precision (P), recall (R), and F1-score (– : value missing).
            Small                  Medium                 Large                  Full
Method      P      R      F1      P      R      F1      P      R      F1      P      R      F1
Baseline    56.92  61.69  59.21   67.88  73.93  70.78   –      81.99  79.48   77.49  81.83  79.60
+ MR        67.79  74.96  71.19   76.60  82.23  79.31   76.85  81.28  79.00   –      –      –
+ ER        57.39  61.37  59.31   74.33  –      71.94   76.50  77.57  –       –      –      80.09
+ SNR       58.87  60.82  59.83   68.92  –      –       76.93  81.91  79.34   –      –      –

Table 7: NER performance on VietBioNER of different augmentation methods in terms of precision (P), recall (R), and F1-score (– : value missing).