
GPT-NER: Named Entity Recognition via Large Language Models

Shuhe Wang, Xiaofei Sun, Xiaoya Li, Rongbin Ouyang

Fei Wu, Tianwei Zhang, Jiwei Li, Guoyin Wang

Abstract

Despite the fact that large-scale language models (LLMs) have achieved SOTA performances on a variety of NLP tasks, their performance on NER is still significantly below supervised baselines. This is due to the gap between the two tasks: NER is a sequence labeling task in nature, while LLMs are text-generation models.

In this paper, we propose GPT-NER to resolve this issue. GPT-NER bridges the gap by transforming the sequence labeling task into a generation task that can be easily adapted by LLMs, e.g., the task of finding location entities in the input text Columbus is a city is transformed to generating the text sequence @@Columbus## is a city, where the special tokens @@ and ## mark the entity to extract. To efficiently address the hallucination issue of LLMs, where LLMs have a strong inclination to over-confidently label NULL inputs as entities, we propose a self-verification strategy: prompting the LLM to ask itself whether an extracted entity belongs to a labeled entity tag.

We conduct experiments on five widely adopted NER datasets, and GPT-NER achieves performances comparable to fully supervised baselines, which, to the best of our knowledge, is the first time this has been achieved. More importantly, we find that GPT-NER exhibits a greater ability in the low-resource and few-shot setups: when the amount of training data is extremely scarce, GPT-NER performs significantly better than supervised models. This demonstrates the capabilities of GPT-NER in real-world NER applications where the number of labeled examples is limited.

Affiliations: Peking University, Shannon.AI, Zhejiang University, Nanyang Technological University, Amazon.
wangshuhe@stu.pku.edu.cn, xiaoya_li@shannonai.com,
{xiaofei_sun, wufei, jiwei_li}@zju.edu.cn,
ouyang@pku.edu.cn, tianwei.zhang@ntu.edu.sg,
guoyiwan@amazon.com
Code is available at https://github.com/ShuheWang1998/GPT-NER.

1 Introduction

Large-scale language models (LLMs) Brown et al. (2020); Smith et al. (2022); Du et al. (2022); Rae et al. (2021); Thoppilan et al. (2022); Hoffmann et al. (2022); Chowdhery et al. (2022); Touvron et al. (2023) have shown an impressive ability for in-context learning: with only a few task-specific examples as demonstrations, LLMs are able to generate results for a new test input. Under the framework of in-context learning, LLMs have achieved promising results in a variety of NLP tasks, including machine translation (MT) Vilar et al. (2022); Vidal et al. (2022); Moslem et al. (2023), question answering (QA) Robinson et al. (2022); Li et al. (2022); Lazaridou et al. (2022) and named entity extraction (NEE) Chowdhery et al. (2022); Brown et al. (2020).

Despite the progress, LLMs' performances on the NER task are still well below supervised baselines. This is because of the intrinsic gap between the two tasks: NER is a sequence labeling task in nature, where the model needs to assign an entity-type label to each token within a sentence, while LLMs are formalized as text generation models. The gap between the sequence labeling task and the text generation model leads to inferior performance when applying LLMs to resolve the NER task.

In this paper, we propose GPT-NER to resolve this issue. GPT-NER transforms the NER task into a text-generation task that can be easily adapted by LLMs. Specifically, the task of finding location entities in the input text Columbus is a city is transformed to generating the text sequence @@Columbus## is a city, where the special tokens @@ and ## mark the entity. We find that, compared with other formalizations, the proposed strategy significantly decreases the difficulty of generating text that fully encodes the label information of the input sequence, as the model only needs to mark the positions of entities and copy all the remaining tokens. Experiments show that the proposed strategy significantly improves performance.

Another big problem with LLMs for NER is the hallucination issue, where LLMs have a strong inclination to over-confidently label NULL inputs as entities. To address this issue, we propose a self-verification strategy, placed right after the entity extraction stage, which prompts the LLM to ask itself whether an extracted entity belongs to a labeled entity tag. The self-verification strategy acts as a regulating function to counteract the excessive confidence of LLMs, which we find effective in addressing the hallucination issue, leading to a significant performance boost.

We conduct experiments on five widely-adopted NER datasets, covering both flat NER and nested NER. GPT-NER achieves performances comparable to fully supervised baselines, which, to the best of our knowledge, is the first time this has been achieved. Additionally, we find that the performance has not plateaued when we reach the GPT-3 token limit with respect to the number of demonstrations. This means that there is still room for improvement once the 4,096-token limit of GPT-3 is lifted, e.g., using GPT-4, whose token limit is more than 20K. What is particularly noteworthy is that GPT-NER exhibits impressive proficiency in low-resource and few-shot NER setups: when the amount of training data is extremely scarce, GPT-NER performs significantly better than supervised models. This illustrates the potential of GPT-NER to be employed in real-world NER applications even when the quantity of labeled samples is scant.

2 Related Work

2.1 Named Entity Recognition

Named Entity Recognition (NER) is a task to identify key information in text and classify it into a set of predefined categories. A common approach to resolving NER is to formulate it as a sequence labeling task. Hammerton (2003) used unidirectional LSTMs to obtain token-level representations and fed them to a softmax classifier to obtain the results. Collobert et al. (2011) used a CNN to embed each input word and leveraged a CRF to decode each embedding into an entity label. Chiu and Nichols (2016) used a character-level CNN and Devlin et al. (2018) used BERT to obtain token-level representations for classification. Lample et al. (2016) combined bidirectional LSTMs with CRFs to augment the prediction. Sarzynska-Wawer et al. (2021) improved the quality of each word representation via a large-scale pre-trained model. Li et al. (2019a, b) formulated the NER task as an MRC task and further leveraged dice loss to improve the performance of the MRC model, and Wang et al. (2022) proposed the GNN-SL model to allow a general NER model to refer to training examples at test time.

2.2 Large Language Models and In-context Learning

Large language models (LLMs) Brown et al. (2020); Rae et al. (2021); Smith et al. (2022); Hoffmann et al. (2022); Chowdhery et al. (2022) have obtained significant performance boosts on a variety of natural language processing tasks Hegselmann et al. (2022); Vilar et al. (2022); Perez et al. (2021); Pietrzak et al. (2021); Wei et al. (2021). Strategies to use LLMs for downstream tasks can be divided into two categories: fine-tuning and in-context learning. The fine-tuning strategy takes a pre-trained model as initialization and runs additional epochs on the downstream supervised data Raffel et al. (2020); Gururangan et al. (2018); Roberts et al. (2020); Guu et al. (2020).

Different from the fine-tuning strategy, in-context learning (ICL) prompts LLMs to generate texts under few-shot demonstrations. Radford et al. (2019) first reformulated downstream tasks using prompts containing demonstrations. Brown et al. (2020) perform a systematic analysis of in-context learning and conduct multiple experiments on various tasks with their GPT-3 model. Chowdhery et al. (2022) perform an analysis of the NMT task on PaLM. Perez et al. (2021); Lu et al. (2021); Rubin et al. (2021) show that better prompts and demonstrations lead to a performance boost for in-context learning.

3 Background

Named entity recognition (NER) is a typical sequence labeling task that assigns an entity type $y \in Y$ to each word $x$ in a given sentence $X = \{x_1, ..., x_n\}$, where $Y$ denotes the set of entity labels and $n$ denotes the length of the given sentence.

3.1 NER as Sequence Labeling

A common approach to resolve NER is to formulate it as a sequence labeling task, which can be decomposed into the following two steps: (1) representation extraction and (2) classification.

Representation Extraction

aims to obtain a high-dimensional representation for each token within the input sequence. To embed each input word $x$, the input sentence $X$ is first fed into an encoder model, e.g., BERT Devlin et al. (2018). The output of the last layer of the encoder is then used as the high-dimensional representation $h_i \in \mathbb{R}^{m \times 1}$ for $i = 1, ..., n$, where $n$ denotes the length of the input sentence and $m$ denotes the dimensionality of the representation.

Classification.

For classification, each high-dimensional vector $h$ is fed to a multi-layer perceptron, and a distribution over the named entity vocabulary is then generated using the softmax function:

$p_{\text{NER}} = \text{softmax}(\text{MLP}(h)), \quad h \in \mathbb{R}^{m \times 1}$   (1)
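For concreteness, below is a minimal sketch of this two-step formulation, assuming a BERT encoder from the HuggingFace transformers library; the label set and checkpoint name are illustrative assumptions, not the code of any specific baseline.

import torch
from transformers import AutoTokenizer, AutoModel

# A minimal sketch of NER as sequence labeling: encode tokens with BERT,
# then classify each token representation with an MLP + softmax (Eq. 1).
LABELS = ["O", "B-LOC", "I-LOC", "B-PER", "I-PER"]  # illustrative label set

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
encoder = AutoModel.from_pretrained("bert-base-cased")
mlp = torch.nn.Linear(encoder.config.hidden_size, len(LABELS))

inputs = tokenizer("Columbus is a city", return_tensors="pt")
with torch.no_grad():
    h = encoder(**inputs).last_hidden_state       # (1, n, m) token representations
    p_ner = torch.softmax(mlp(h), dim=-1)         # distribution over entity labels
predictions = [LABELS[i] for i in p_ner.argmax(dim=-1)[0].tolist()]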

4 GPT-NER

In this work, we propose GPT-NER, which uses large language models to resolve the NER task. GPT-NER follows the general paradigm of in-context learning and can be decomposed into three steps: (1) Prompt Construction: for a given input sentence $X$, we construct a prompt (denoted by $\text{Prompt}(X)$) for $X$; (2) feeding the constructed prompt to the large language model (LLM) to obtain the generated text sequence $W = \{w_1, ..., w_n\}$; (3) transforming the text sequence $W$ into a sequence of entity labels to obtain the final results.

Straightforward as the in-context learning paradigm is, the NER task does not readily fit it, as NER is a sequence labeling task in nature rather than a generation task. Below we describe in detail the different strategies we propose to adapt LLMs to the NER task.
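As a roadmap, a minimal sketch of the three-step pipeline is given below; build_prompt, call_llm, and parse_marked_output are hypothetical helper names, each sketched in the corresponding sections that follow.

def gpt_ner(sentence, entity_type, demonstrations):
    # Step 1: construct the prompt Prompt(X) for this sentence and entity type (Sec. 4.1).
    prompt = build_prompt(sentence, entity_type, demonstrations)
    # Step 2: feed the prompt to the LLM to obtain the generated sequence W (Sec. 5).
    marked_output = call_llm(prompt)
    # Step 3: transform W into a sequence of entity labels (Sec. 4.1.2).
    return parse_marked_output(marked_output, entity_type)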

Figure 1: An example of the GPT-NER prompt. Suppose that we need to recognize Location entities in the given sentence: China says Taiwan spoils atmosphere for talks. The prompt consists of three parts: (1) Task Description: surrounded by a red rectangle, it instructs the GPT-3 model that the current task is to recognize Location entities using linguistic knowledge. (2) Few-shot Demonstrations: surrounded by a yellow rectangle, they give the GPT-3 model few-shot examples for reference. (3) Input Sentence: surrounded by a blue rectangle, it indicates the input sentence; the output of the GPT-3 model is colored green.

4.1 Prompt Construction

Figure 1 is an example of the prompt used in GPT-NER, which consists of three parts:
1是GPT-NER中使用的提示符示例,它由三部分组成:

4.1.1 Task Description

Task Description gives an overview of the task, which can be further decomposed into three components:

(1) the first sentence of the task description,

I am an excellent linguist

is a constant telling LLMs to produce the output using linguistic knowledge;

(2) The second sentence

The task is to label [Entity Type] entities in the given sentence

is a variable sentence indicating the category of entities to be extracted; [Entity Type] represents the type of entity to extract, e.g., Location in the example of Figure 1. It is worth noting that, in this way, for each input sentence we need to iterate over all entity labels, which is equivalent to transforming an N-class classification task into N binary classification tasks. The reason behind this is as follows: for most current LLMs, e.g., GPT-3, there is a hard limit on the length of the prompt (e.g., 4,096 tokens for GPT-3, where each English word corresponds to 1.3 tokens on average) due to hardware restrictions. Given this limited number of tokens, it is impossible to include descriptions and demonstrations for all entity types in a single prompt. Therefore, for each input sentence, we construct the prompt $N$ times, each corresponding to one entity type;

(3) the third sentence

Below are some examples

marks the end of the description and points out the position of few-shot demonstrations.

4.1.2 Few-shot Demonstration

The few-shot demonstrations are appended to the prompt. They serve the following two purposes: (1) they regulate the format of the LLM outputs for each test input, as LLMs will (very likely) generate outputs that mimic the format of the demonstrations. This is vital for the NER task, as we need the output format to be consistent so that we can parse the natural-language output into NER results; (2) they provide the LLM with direct evidence about the task and references for making predictions.

The demonstration sequentially packs a list of examples, where each example consists of both the input sequence $X$ and the output sequence $W$:

Input: [Example Sentence]_1
Output: [Labeled Sentence]_1
...
Input: [Example Sentence]_k
Output: [Labeled Sentence]_k

where $k$ denotes the number of demonstrations.

The Format of LLM Output.

The format of each labeled sentence $W$, which is a text sequence, is of vital importance and should satisfy the following conditions: (1) it needs to contain the information for each word's label and be easily transformed into the entity-type sequence; (2) it needs to be smoothly and easily generated by LLMs to boost the model's final accuracy.

For illustration purposes, here we first give a few bad examples of the form of $W$: for a given input sequence "Columbus is a city", "LOC O O O" is an intuitive format for $W$ that satisfies condition (1). But regarding condition (2), to generate "LOC O O O", the LLM first needs to learn the alignment between each position in the input sequence "Columbus is a city" and each position in $W$: Columbus to LOC, is to O, a to O, city to O, which naturally adds to the difficulty of the generation task. Indeed, we find that it is difficult for GPT-3 to generate output of the same length as the input sentence, especially when the input sentence is long.

To resolve this issue, we propose that the LLM output take the following format: if the input sequence does not contain any entity, $W$ simply copies the input $X$; for an entity (or entities) in the input sequence, we use the special tokens "@@" and "##" to surround it (or them). The following is an example of extracting LOC entities:

Input: Columbus is a sailor
Output: Columbus is a sailor
Input: Columbus is a city
Output: @@Columbus## is a city

The proposed strategy significantly bridges the gap between the format of the sequence labeling task and the generation model: it significantly decreases the difficulty of generating text that fully encodes the label information, as the LLM only needs to mark the positions of entities and copy all the rest. As we will show in the ablation study (Section 6.1), the proposed strategy yields significant performance boosts over other formats.
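To make the transformation from $W$ back to labels concrete, below is a minimal regex-based parser; it is our illustration of step (3) of the pipeline, not code released with the paper.

import re

MARKED = re.compile(r"@@(.+?)##")

def parse_marked_output(marked: str, entity_type: str):
    # Map an @@...##-marked sequence back to (entity, start_word, type) spans.
    spans = []
    for m in MARKED.finditer(marked):
        # Count the words before this marker, with earlier markers stripped,
        # to recover the word offset in the original sentence.
        prefix = MARKED.sub(lambda g: g.group(1), marked[:m.start()])
        spans.append((m.group(1), len(prefix.split()), entity_type))
    return spans

# parse_marked_output("@@China## says @@Taiwan## spoils atmosphere", "LOC")
# -> [("China", 0, "LOC"), ("Taiwan", 2, "LOC")]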

4.1.3 Input Sentence

This part feeds the current input sentence into the LLM and expects the LLM to generate the output sequence according to the defined format in Sec 4.1.2, which is:

Input: [The Input Sentence]
Output:

where "Output:" denotes the flag after which the LLM begins to generate the labeled sequence.

Shown at the bottom of Figure 1, given the input sentence "China says Taiwan spoils atmosphere for talks", the LLM (GPT-3 in this example) generates the labeled sentence "@@China## says @@Taiwan## spoils atmosphere for talks" in the same format as the preceding demonstrations, where the two words "China" and "Taiwan" are the recognized Location entities.

This completes the prompt construction.
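Putting the three parts together, a sketch of prompt assembly is shown below; the template strings follow the paper's wording, while the function itself is our illustration.

def build_prompt(sentence, entity_type, demonstrations):
    # Part 1: task description (constant sentence + entity type + examples flag).
    parts = [
        "I am an excellent linguist. "
        f"The task is to label {entity_type} entities in the given sentence. "
        "Below are some examples."
    ]
    # Part 2: few-shot demonstrations, each an (example, labeled) pair.
    for example, labeled in demonstrations:
        parts.append(f"Input: {example}\nOutput: {labeled}")
    # Part 3: the input sentence, ending with the "Output:" flag.
    parts.append(f"Input: {sentence}\nOutput:")
    return "\n".join(parts)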

Figure 2: An example of the entity-level embedding approach to retrieving few-shot demonstrations. Suppose that we need to retrieve few-shot demonstrations for the input sentence "Obama lives in Washington" with the LOC entity defined in the prompt. Step 1, Datastore Construction: we first use the fine-tuned NER model to extract entities for each sentence in the training set and formulate them as (key, value) pairs, where key is the extracted entity and value is the corresponding sentence; we then concatenate all formulated (key, value) pairs to construct the datastore. Step 2, Representation Extraction: the fine-tuned NER model is utilized to embed the input sentence into a sequence of high-dimensional vectors, which are then classified into labels by a softmax layer, where "Obama" and "Washington" are the two recognized entities. Step 3, $k$NN Search: the embedding of the extracted LOC entity "Washington" is used as the query to find the $k$ nearest neighbors in the datastore, and the retrieved sentences are used as the $k$ few-shot demonstrations.

4.2 Few-shot Demonstrations Retrieval

Here we describe strategies to retrieve demonstration examples.

4.2.1 Random Retrieval

The most straightforward strategy is to randomly select $k$ examples from the training set. The shortcoming is obvious: there is no guarantee that the retrieved examples are semantically close to the input.

4.2.2 $k$NN-based Retrieval

To resolve the relatedness issue in Sec 4.2.1, we can retrieve the $k$ nearest neighbors ($k$NN) of the input sequence from the training set Vilar et al. (2022); Liu et al. (2021): we first compute representations for all training examples, based on which we obtain the $k$ nearest neighbors for an input test sequence.

$k$NN based on Sentence-level Representations.

To find the $k$NN examples in the training set, one straightforward method is to use text similarity models such as SimCSE Gao et al. (2021): we first obtain sentence-level representations for the training examples and the input sequence, and use cosine similarity to find the $k$NN.
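A sketch of sentence-level retrieval is given below; the SimCSE checkpoint name and [CLS] pooling are our assumptions for illustration, not details specified by the paper.

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")
enc = AutoModel.from_pretrained("princeton-nlp/sup-simcse-bert-base-uncased")

def embed(sentences):
    # Sentence-level representations via [CLS] pooling, L2-normalized
    # so that the dot product equals cosine similarity.
    batch = tok(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        reps = enc(**batch).last_hidden_state[:, 0]
    return torch.nn.functional.normalize(reps, dim=-1)

def sentence_level_knn(query, train_sentences, k=8):
    sims = embed([query]) @ embed(train_sentences).T   # (1, |train|) similarities
    return [train_sentences[i] for i in sims[0].topk(k).indices.tolist()]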

The shortcoming of $k$NN based on sentence-level representations is obvious: NER is a token-level task that focuses on local evidence, rather than a sentence-level task concerned with sentence-level semantics. A retrieved sentence (e.g., he is a soldier) that is semantically similar to the input (e.g., John is a soldier) might shed no light on the entities the input contains: in the example above, the retrieved sentence contains no entity and thus provides no evidence for tagging the input.

Entity-level Embedding.
To resolve the issue above, we need to retrieve $k$NN examples based on token-level representations rather than sentence-level representations. We first extract entity-level representations for all tokens of all training examples as the datastore, using a fine-tuned NER tagging model. For a given input sequence of length $N$, we first iterate over all tokens within the sequence to find the $k$NNs for each token, obtaining $K \times N$ retrieved tokens. Next, we select the top $k$ tokens from the $K \times N$ retrieved tokens and use their associated sentences as demonstrations. We select several examples to better illustrate the demonstrations of the three retrieval strategies in Appendix C.
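Below is a sketch of this entity-level retrieval, assuming the datastore has already been built from token embeddings of a fine-tuned NER model; tensor shapes are noted in comments, and a real implementation would likely deduplicate retrieved sentences.

import torch

def entity_level_knn(token_reps, datastore_keys, datastore_sents, k=8):
    # token_reps: (N, m) embeddings of the N input tokens from the fine-tuned
    # NER model; datastore_keys: (D, m) token embeddings over the training set;
    # datastore_sents: list of D sentences, one per key.
    q = torch.nn.functional.normalize(token_reps, dim=-1)
    keys = torch.nn.functional.normalize(datastore_keys, dim=-1)
    per_token = (q @ keys.T).topk(k, dim=-1)        # kNN for each of the N tokens
    # Flatten the k*N candidates and keep the globally top-k scored ones.
    scores, ids = per_token.values.flatten(), per_token.indices.flatten()
    best = scores.topk(k).indices.tolist()
    return [datastore_sents[int(ids[i])] for i in best]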

4.3 Self-verification

LLMs significantly suffer from the hallucination or overprediction issue Braverman et al. (2020); Jiang et al. (2021); Zhao et al. (2021). Specifically for NER, LLMs have a strong inclination to over-confidently label NULL inputs as entities, even with demonstrations. The following is an example of overprediction:

Prompt:
I am an excellent linguist. The task is to label
location entities in the given sentence.
Below are some examples.
Input:Columbus is a city
Output:@@Columbus## is a city
Input:Rare Hendrix song sells for $17
Output:
GPT-3 Output:
Rare @@Hendrix## song sells for $17

where "Hendrix" is recognized as a location entity by GPT-3, which is obviously incorrect. To address this issue, we propose the self-verification strategy: given an entity extracted by the LLM, we ask the LLM to further verify whether the extracted entity is correct, answering with yes or no.

Figure 3: An example of the self-verification prompt using GPT-3. Suppose that we need to verify whether the word "Hendrix" in the given sentence "Rare Hendrix song sells for $17" is a Location entity. The prompt consists of three parts: (1) Task Description (red rectangle): it gives the definition of the current task: to discriminate whether the specified word in the given sentence belongs to the Location entity type. (2) Few-shot Demonstrations (yellow rectangle): several examples for GPT-3 to reference. (3) Input Sentence (blue rectangle): it indicates the current word that needs to be verified and the sentence it belongs to; the output of GPT-3 is colored green.

We construct the prompt for self-verification shown in Figure 3. Take the extraction of location entities as an example. The prompt starts with the task description:

The task is to verify whether the word is a location entity extracted from the given sentence”.

Again, we need few-shot demonstrations to boost the accuracy of the self-verifier. Shown in the yellow rectangle in Figure 3, each demonstration consists of three lines:

(1) “The input sentence: Only France and Britain backed Fischler’s proposal”,

(2) “Is the word "France" in the input sentence a location entity? Please answer with yes or no”.

(3) Yes.

We pack multiple demonstrations in the prompt in the few-shot setup. Demonstrations are followed by the test example, and fed to the LLM to obtain the output.

Demonstration Selection.

We need to select demonstrations for the few-shot self-verification. Since the center of the self-verification task is asking whether an extracted entity is of a specific entity type, we need to select training examples that are semantically related to the extracted entity rather than to the overall sentence-level semantics.

Therefore, we use the entity-level embedding described in Sec 4.2.2 for the $k$NN demonstration search rather than sentence-level representations: (1) first, we construct the datastore by extracting entity-level representations for all training examples using a fine-tuned NER model; (2) then, we use the same fine-tuned NER model to extract the representation of the queried word; (3) finally, we use the representation of the queried word to select $k$ examples from the datastore as few-shot demonstrations, whose answer is "Yes" if the retrieved entity belongs to the queried entity type, and "No" otherwise.
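A sketch of assembling the verification prompt of Figure 3 is given below; demos are (sentence, word, answer) triples retrieved via the entity-level $k$NN above, and the template wording follows the figure, while the function itself is our illustration.

def build_verification_prompt(sentence, word, entity_type, demos):
    # Task description, as in Figure 3.
    parts = [
        f"The task is to verify whether the word is a {entity_type} "
        "entity extracted from the given sentence."
    ]
    # Few-shot demonstrations: (demo_sentence, demo_word, "Yes"/"No") triples.
    for demo_sentence, demo_word, answer in demos:
        parts.append(
            f"The input sentence: {demo_sentence}\n"
            f'Is the word "{demo_word}" in the input sentence a '
            f"{entity_type} entity? Please answer with yes or no.\n{answer}"
        )
    # The extracted entity to verify.
    parts.append(
        f"The input sentence: {sentence}\n"
        f'Is the word "{word}" in the input sentence a '
        f"{entity_type} entity? Please answer with yes or no."
    )
    return "\n".join(parts)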

5 Experiments

We use GPT-3 Brown et al. (2020) (davinci-003) as the LLM backbone for all experiments. For davinci-003 parameters, we set the maximum output length to 512 tokens. Temperature is set to 0, top_p to 1, frequency_penalty to 0, presence_penalty to 0, and best_of to 1.
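For reference, a sketch of the completion call with these decoding parameters is shown below, written against the legacy OpenAI completions endpoint that served davinci-003; the client code is our assumption, not the paper's release.

import openai

def call_llm(prompt: str) -> str:
    # Decoding parameters as reported above (a sketch against the
    # legacy openai-python Completion API).
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=512,
        temperature=0,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        best_of=1,
    )
    return response["choices"][0]["text"].strip()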

5.1 Results on the Full Training Set

5.1.1 Results on Flat NER

For flat NER, entities can’t overlap with each other. We conduct experiments on the two widely-used flat-NER datasets, English CoNLL2003 and OntoNotes5.0, using span-level precision, recall, and F1 score as evaluation metrics.

CoNLL2003.

CoNLL2003 (Sang and De Meulder, 2003) is an English NER dataset containing four types of named entities: Location, Organization, Person, and Miscellaneous, and we followed protocols in Li et al. (2019a); Ma and Hovy (2016) to process the data.

OntoNotes5.0.

OntoNotes5.0 (Pradhan et al., 2013) is an English NER dataset containing 18 types of named entities: 11 types (e.g., Person, Organization) and 7 values (e.g., Date, Percent). More details (including entity types, sentence numbers, and examples) of the two flat NER datasets are shown in Appendix A.1.

Due to the fact that accessing davinci-003 can be expensive, in addition to the full test set, we randomly selected 100 test instances to make it easier for the community to replicate our results. We report performances on both the full and the partial test sets.

English CoNLL2003 (Sampled 100)
Model | Precision | Recall | F1
Baselines (Supervised Model)
ACE+document-context (Wang et al., 2020) | 97.8 | 98.28 | 98.04 (SOTA)
GPT-NER
GPT-3 + random retrieval | 88.18 | 78.54 | 83.08
GPT-3 + sentence-level embedding | 90.47 | 95 | 92.68
GPT-3 + entity-level embedding | 94.06 | 96.54 | 95.3
Self-verification (zero-shot)
+ GPT-3 + random retrieval | 88.95 | 79.73 | 84.34
+ GPT-3 + sentence-level embedding | 91.77 | 96.36 | 94.01
+ GPT-3 + entity-level embedding | 94.15 | 96.77 | 95.46
Self-verification (few-shot)
+ GPT-3 + random retrieval | 90.04 | 80.14 | 85.09
+ GPT-3 + sentence-level embedding | 92.92 | 95.45 | 94.17
+ GPT-3 + entity-level embedding | 94.73 | 96.97 | 95.85

English OntoNotes5.0 (Sampled 100)
Model | Precision | Recall | F1
Baselines (Supervised Model)
BERT-MRC+DSC (Li et al., 2019b) | 93.81 | 93.95 | 93.88 (SOTA)
GPT-NER
GPT-3 + random retrieval | 64.21 | 65.51 | 64.86
GPT-3 + sentence-level embedding | 76.08 | 83.06 | 79.57
GPT-3 + entity-level embedding | 78.38 | 83.9 | 81.14
Self-verification (zero-shot)
+ GPT-3 + random retrieval | 64.94 | 65.90 | 65.42
+ GPT-3 + sentence-level embedding | 77.33 | 83.29 | 80.31
+ GPT-3 + entity-level embedding | 79.05 | 83.71 | 81.38
Self-verification (few-shot)
+ GPT-3 + random retrieval | 65.21 | 66.25 | 65.73
+ GPT-3 + sentence-level embedding | 77.64 | 83.22 | 80.43
+ GPT-3 + entity-level embedding | 79.25 | 83.73 | 81.49

Table 1: Results on 100 sampled test instances for two flat NER datasets: CoNLL2003 and OntoNotes5.0.
English CoNLL2003 (Full)
Model | Precision | Recall | F1
Baselines (Supervised Model)
BERT-Tagger (Devlin et al., 2018) | - | - | 92.8
BERT-MRC (Li et al., 2019a) | 92.33 | 94.61 | 93.04
GNN-SL (Wang et al., 2022) | 93.02 | 93.40 | 93.2
ACE+document-context (Wang et al., 2020) | - | - | 94.6 (SOTA)
GPT-NER
GPT-3 + random retrieval | 77.04 | 68.69 | 72.62
GPT-3 + sentence-level embedding | 81.04 | 88.00 | 84.36
GPT-3 + entity-level embedding | 88.54 | 91.4 | 89.97
Self-verification (zero-shot)
+ GPT-3 + random retrieval | 77.13 | 69.23 | 73.18
+ GPT-3 + sentence-level embedding | 83.31 | 88.11 | 85.71
+ GPT-3 + entity-level embedding | 89.47 | 91.77 | 90.62
Self-verification (few-shot)
+ GPT-3 + random retrieval | 77.50 | 69.38 | 73.44
+ GPT-3 + sentence-level embedding | 83.73 | 88.07 | 85.9
+ GPT-3 + entity-level embedding | 89.76 | 92.06 | 90.91

English OntoNotes5.0 (Full)
Model | Precision | Recall | F1
Baselines (Supervised Model)
BERT-Tagger (Devlin et al., 2018) | 90.01 | 88.35 | 89.16
BERT-MRC (Li et al., 2019a) | 92.98 | 89.95 | 91.11
GNN-SL (Wang et al., 2022) | 91.48 | 91.29 | 91.39
BERT-MRC+DSC (Li et al., 2019b) | 91.59 | 92.56 | 92.07 (SOTA)
GPT-NER
GPT-3 + random retrieval | 58.8 | 64.36 | 61.58
GPT-3 + sentence-level embedding | 71.87 | 78.77 | 75.32
GPT-3 + entity-level embedding | 79.17 | 84.29 | 81.73
Self-verification (zero-shot)
+ GPT-3 + random retrieval | 59.14 | 64.44 | 61.79
+ GPT-3 + sentence-level embedding | 72.29 | 78.81 | 75.55
+ GPT-3 + entity-level embedding | 79.64 | 84.52 | 82.08
Self-verification (few-shot)
+ GPT-3 + random retrieval | 59.23 | 64.65 | 61.94
+ GPT-3 + sentence-level embedding | 72.35 | 78.79 | 75.57
+ GPT-3 + entity-level embedding | 79.89 | 84.51 | 82.20

Table 2: Results on the full test sets for two flat NER datasets: CoNLL2003 and OntoNotes5.0.
Baselines.

We adopt currently widely-used NER systems as baselines including:

  • BERT-Tagger Devlin et al. (2018) fine-tunes BERT on the full training dataset.

  • MRC-NER Li et al. (2019a) formulates the NER task as a machine reading comprehension (MRC) task and trains the MRC-NER model on the full training dataset.

  • MRC-NER+DSC Li et al. (2019b) is the current SOTA model on the OntoNotes5.0 dataset, leveraging dice loss in place of the standard cross-entropy loss during training.

  • GNN-SL Wang et al. (2022) fine-tunes RoBERTa Liu et al. (2019) on the full training dataset and uses a GNN to refer to the whole set of training examples at test time.

  • ACE+document-context Wang et al. (2020) is the current SOTA model on the CoNLL2003 dataset, optimizing the controller to find better concatenations of embeddings on the full training dataset.

Main Results.

Table 1 and Table 2 respectively show results on the partial and the full test set for flat NER. Observations are as follows:
1和表2分别显示了平坦 NER 的部分和完整测试集的结果。观察结果如下:

(1) $k$NN retrieval is of vital importance for the NER task. For the random retrieval strategy, where demonstrations are randomly selected rather than found through $k$NN search, performances are only 72.62 and 61.58 on the full CoNLL2003 and OntoNotes5.0 sets. Results skyrocket to 84.36 and 75.32 on the full CoNLL2003 and OntoNotes5.0 when sentence-level embeddings are used for the $k$NN demonstration retrieval.

(2) We observe a significant improvement from changing the sentence-level embedding to token-level embedding for the $k$NN demonstration search: 84.36 vs. 89.97 on the CoNLL2003 dataset and 75.32 vs. 81.73 on OntoNotes5.0. This is because NER is a token-level task that focuses on local evidence rather than sentence-level semantics: the two sentences "he is a soldier" and "John is a soldier" are semantically similar but do not share any identical entities. Using token-level representations for the $k$NN search helps retrieve demonstrations that are more similar with respect to the specific entity type, leading to better performances.

(3) We observe further improvements from adding self-verification: on the full CoNLL2003 dataset with entity-level embedding, 89.97 vs. 90.62 without and with self-verification under zero-shot learning, and 84.97 vs. 85.91 under few-shot learning. These results demonstrate the effectiveness of self-verification in alleviating the overprediction of GPT-3.

(4) LLM-based systems obtain results comparable to supervised baselines using BERT, i.e., 90.91 vs. 92.8 on the full CoNLL2003 dataset and 82.20 vs. 89.16 on the full OntoNotes5.0 dataset. A gap to the supervised SOTA models still remains: 94.6 vs. 90.91 on the full CoNLL2003 dataset and 92.07 vs. 82.20 on the full OntoNotes5.0 dataset. As will be shown in the ablation study, the performance has not plateaued when we reach the GPT-3 token limit with respect to the number of $k$NN demonstrations. This means that once the token limit is lifted, e.g., using GPT-4, whose token limit is more than 20K tokens, there is still room for improvement. We will update the performances when the GPT-4 API is accessible.

5.1.2 Results on Nested NER

For nested NER, entities in each sentence may overlap with each other, as in the following example:

Sentence: The Chinese embassy in France
Geographical Political Entities: Chinese, France
Facility Entities: The Chinese embassy in France

where the two geographical political entities “Chinese” and “France” overlap with the facility entity “The Chinese embassy in France”.

We conduct experiments on the three widely-used nested NER datasets: ACE2004, ACE2005 and GENIA, and use span-level precision, recall, and F1 score for evaluation.

ACE2004 and ACE2005.

ACE2004 Doddington et al. (2004) and ACE2005 Christopher et al. (2006) contain seven types of entities (e.g., organization entities and person entities). We follow the commonly adopted protocols in Katiyar and Cardie (2018) to process the two datasets by dividing them into train, dev, and test sets in an 8:1:1 ratio.

GENIA.

GENIA Ohta et al. (2002) is an English nested NER dataset in the molecular biology domain containing five entity types (e.g., DNA and RNA). More details (including entity types, sentence number, and examples) of three nested NER datasets ACE2004, ACE2005, and GENIA can be found in Appendix A.2.

Baselines.

For baselines, four widely used supervised models are included:

  • BERT-MRC (Li et al., 2019a): the current SOTA model on the GENIA dataset, formulating the NER task as a machine reading comprehension (MRC) task and training the MRC-NER model on the full training dataset.

  • Triaffine+BERT (Yuan et al., 2021): fine-tuning BERT Devlin et al. (2018) on the full training set and fusing heterogeneous factors for span representations and classification.

  • Triaffine+ALBERT (Yuan et al., 2021): fine-tuning ALBERT Lan et al. (2019) on the full training set and fusing heterogeneous factors for span representations and classification.

  • BINDER (Zhang et al., 2022): the current SOTA model on the ACE2004 dataset and ACE2005 dataset, leveraging a bi-encoder framework to apply contrastive learning to map candidate text spans and entity types into the same vector representation space for representation and classification.

ACE2004 (Full)
Model | Precision | Recall | F1
Baselines (Supervised Model)
BERT-MRC (Li et al., 2019a) | 85.05 | 86.32 | 85.98
Triaffine+BERT (Yuan et al., 2021) | 87.13 | 87.68 | 87.40
Triaffine+ALBERT (Yuan et al., 2021) | 88.88 | 88.24 | 88.56
BINDER (Zhang et al., 2022) | 88.3 | 89.1 | 88.7 (SOTA)
GPT-NER
GPT-3 + random retrieval | 55.04 | 41.76 | 48.4
GPT-3 + sentence-level embedding | 65.31 | 53.67 | 60.68
GPT-3 + entity-level embedding | 72.23 | 75.01 | 73.62
Self-verification (zero-shot)
GPT-3 + random retrieval | 55.44 | 42.22 | 48.83
GPT-3 + sentence-level embedding | 69.64 | 54.98 | 62.31
GPT-3 + entity-level embedding | 73.58 | 74.74 | 74.16
Self-verification (few-shot)
GPT-3 + random retrieval | 55.63 | 42.49 | 49.06
GPT-3 + sentence-level embedding | 70.17 | 54.87 | 62.52
GPT-3 + entity-level embedding | 73.29 | 75.11 | 74.2

ACE2005 (Full)
Model | Precision | Recall | F1
Baselines (Supervised Model)
Triaffine+BERT (Yuan et al., 2021) | 86.70 | 86.94 | 86.82
BERT-MRC (Li et al., 2019a) | 87.16 | 86.59 | 86.88
Triaffine+ALBERT (Yuan et al., 2021) | 87.39 | 90.31 | 88.83
BINDER (Zhang et al., 2022) | 89.1 | 89.8 | 89.5 (SOTA)
GPT-NER
GPT-3 + random retrieval | 45.5 | 46.24 | 45.37
GPT-3 + sentence-level embedding | 58.04 | 58.97 | 58.50
GPT-3 + entity-level embedding | 71.72 | 74.2 | 72.96
Self-verification (zero-shot)
GPT-3 + random retrieval | 45.06 | 46.62 | 45.84
GPT-3 + sentence-level embedding | 59.49 | 60.17 | 59.83
GPT-3 + entity-level embedding | 72.63 | 75.39 | 73.46
Self-verification (few-shot)
GPT-3 + random retrieval | 45.49 | 46.73 | 46.11
GPT-3 + sentence-level embedding | 59.69 | 60.35 | 60.02
GPT-3 + entity-level embedding | 72.77 | 75.51 | 73.59

GENIA (Full)
Model | Precision | Recall | F1
Baselines (Supervised Model)
Triaffine+BERT (Yuan et al., 2021) | 80.42 | 82.06 | 81.23
BERT-MRC (Li et al., 2019a) | 85.18 | 81.12 | 83.75 (SOTA)
GPT-NER
GPT-3 + random retrieval | 44.1 | 38.64 | 41.37
GPT-3 + sentence-level embedding | 63.43 | 44.17 | 51.68
GPT-3 + entity-level embedding | 61.38 | 66.74 | 64.06
Self-verification (zero-shot)
GPT-3 + random retrieval | 44.31 | 38.79 | 41.55
GPT-3 + sentence-level embedding | 59.54 | 44.26 | 51.9
GPT-3 + entity-level embedding | 61.77 | 66.81 | 64.29
Self-verification (few-shot)
GPT-3 + random retrieval | 44.68 | 38.98 | 41.83
GPT-3 + sentence-level embedding | 59.87 | 44.39 | 52.13
GPT-3 + entity-level embedding | 61.89 | 66.95 | 64.42

Table 3: Results on the full test sets for three nested NER datasets: ACE2004, ACE2005 and GENIA.
Main Results.

Results are shown in Table 3; phenomena similar to flat NER are observed:

(1) Again, $k$NN retrieval is of vital importance: on the full ACE2004 dataset, 48.4 for random retrieval vs. 73.62 for entity-level embedding retrieval using $k$NN search.

(2) For the $k$NN demonstration search, a significant improvement is observed by changing the sentence-level embedding to entity-level embedding: 60.68 vs. 73.62 on ACE2004 and 56.68 vs. 69.06 on GENIA.

(3) A further performance boost is obtained by adding self-verification, i.e., on the full ACE2004 dataset with sentence-level embedding, 60.68 vs. 62.31 for zero-shot learning and 60.68 vs. 62.52 for few-shot learning.

We also observe that the gap between GPT-NER and SOTA models is greater than for flat NER. This is because:

(1) Nested NER datasets contain more similar entity types, e.g., the location entities (LOC) and the geographical political entities (GPE). Since only a limited number of demonstrations is allowed, it is harder for GPT-3 to distinguish between them.

(2) The annotation guidelines for the three nested NER datasets are more complex and less straightforward. For example, the substring "The bodies of six people" within the sentence "The bodies of six people were found in the region" is annotated as a person entity. It is easier for a supervised model fine-tuned on the full training set to learn these complex rules, while it is much harder for an LLM with a limited number of demonstrations.

Figure 4: Low-resource comparisons on the CoNLL2003 dataset.

5.2 Results on Low-resource Scenario
5.2低资源场景下的结果

We conduct experiments to estimate the performance of GPT-NER in low-resource setups on the English CoNLL2003 dataset. To mimic the low-resource scenario, we randomly select a subset of the full training data as the training set: (a) 8 training sentences (0.063%); (b) 100 training sentences (0.788%); (c) 1K sentences (7.880%); and (d) 10K sentences (78.808%). For the setup with 8 training sentences, the dataset is constructed to ensure that each entity type contains one positive and one negative example. Evaluations are performed on the full test set.

Setups.

We use the same GPT parameters as in Sec 5. For baselines, we train the ACE model Wang et al. (2020) (which is the current SOTA model) on different training subsets. For GPT-NER, we use random demonstration retrieval and sentence-level embedding-based demonstration retrieval for demonstration selection in the few-shot learning stage. For the self-verification stage, we only use zero-shot learning where no demonstration is needed.

5.2.1 Results

Results are shown in Figure 4. Observations are as follows:

(1) When the size of the training set is extremely small (i.e., 8 or 100 sentences), the performance of the supervised model is far below that of GPT-3. Specifically, with only 8 training examples, the F1 score of GPT-NER is already about 60, while the performance of supervised models is around 0. This demonstrates the significantly better generalization ability of GPT-NER over supervised baselines in the low-resource setup.

(2) As the training data grows, the performance of $k$NN search grows faster than random retrieval, which accords with our expectations: for random retrieval, where all demonstrations are randomly selected, the impact of increasing the size of the training data is minimal; the outcomes of selecting $k$ demonstrations from 100 or 1,000 examples are similar, since they are all randomly selected. But for the $k$NN demonstration search, increasing the size of the training data means the selected demonstrations are more likely to be related to the input, leading to better performances.

(3) When the amount of data reaches 10% of the training set, as the size of the training data increases, the performance of the supervised model improves significantly, while the result of GPT-3 increases only marginally. This phenomenon indicates that for in-context learning, instead of focusing on increasing the amount of training data, it is more effective to focus on improving the quality of retrieved demonstrations (e.g., moving from random retrieval to $k$NN-based retrieval) and the prompt structure (e.g., adding self-verification).

6 Ablation Study

6.1 Varying the Format of LLM Output

In Sec 4.1.2, we propose to use the special tokens "@@" and "##" to regulate the format of the GPT-3 output, e.g., "@@Columbus## is a city" indicates that the word "Columbus" is the recognized entity. We compare the proposed output format with the following two formats:

BMES

directly outputs the beginning, middle, end, and singleton indicator for each token within the input:

Input:White House is in Washington
Output:B-ORG E-ORG O O O
Entity+Position

asks LLMs to output the entity within the sentence along with its position:

Input:White House is in Washington
Output:White House (0)

where "White House (0)" means that "White House" is an entity and its starting position in the input sentence is 0.

To enable apple-to-apple comparisons, we use the same setup for the three output formats and conduct experiments on the 100-sample CoNLL2003 dataset with 32 few-shot demonstrations.

The F1-scores for the proposed @@## strategy, BMES, and Entity+Position are 92.68, 29.75, and 38.73 respectively, i.e., BMES and Entity+Position significantly underperform the proposed @@## strategy. Explanations are as follows: for the BMES strategy, the LLM needs to learn the alignment between each input word and each BMES label: White to B-ORG, House to E-ORG, is to O, in to O, Washington to O. By analyzing the error samples, we find that it is often hard for the LLM even to output a BMES string of the correct length, especially when the input sentence is long, leading to poor final evaluation performances.

For the Entity+Position strategy, we find that the LLM often confuses the meaning of the position index (e.g., whether it is a character index or a word index), leading to incorrect entity positions. This problem can be partially alleviated by demonstrations but still exists given the 4,096-token limit of GPT-3. Incorrect position indexes make it hard to map the output text to the sequence-labeling evaluation format, leading to poor final evaluation performances.

6.2 The Number of Few-shot Demonstrations

We conduct experiments to estimate the effect of the number of demonstrations. Experiments are conducted on the 100-sample CoNLL2003 dataset. Results are shown in Figure 5. We observe that as $k$ increases, all three LLM-based results keep rising. As we approach the 4,096-token limit for demonstrations, the results still have not plateaued. This means performance would still rise if more demonstrations were allowed.

An interesting phenomenon is observed: when the number of demonstrations is small, i.e., $k=2,4$, the $k$NN-based strategies underperform the random retrieval strategy. The explanation is as follows: $k$NN-based retrieval tends to select demonstrations that are very similar to the input sentence. Therefore, if the input sentence does not contain any entity, the retrieved demonstrations are most likely to contain no entity either. In this case, the demonstrations do not carry the output-format information we wish to enforce, leading LLMs to output in an arbitrary format. Here we give an example:

When the number of demonstrations is small and GPT is required to recognize a certain kind of entity (e.g., organization) but the few-shot demonstrations are all sentences without entities of that kind, GPT gets confused and outputs in its own format, as illustrated in the following example:

Prompt:
I am an excellent linguist. The task is to label
organization entities in the given sentence. Below
are some examples.
Input: Korean pro-soccer games
Output: Korean pro-soccer games
Input: Australia defend the Ashes
Output: Australia defend the Ashes
Input: Japan get lucky win
Output:
GPT-3 Output:
Japan [Organization Entity] get lucky win
Figure 5: Comparisons by varying k-shot demonstrations.

7 Conclusion

In this paper, we propose GPT-NER to adapt LLMs to the NER task. To bridge the gap between the sequence labeling task and the text generation task, we instruct the LLM to generate a labeled sequence by surrounding entities with special tokens. Additionally, we propose a self-verification strategy to alleviate the hallucination issue of LLMs. We conduct experiments on both flat and nested NER datasets and achieve performances comparable to fully supervised baselines. Besides, we find that GPT-NER shows a remarkable ability in low-resource scenarios: when the amount of training data is extremely scarce, the results of GPT-NER are significantly better than those of the supervised model.

References

  • Braverman et al. (2020) Mark Braverman, Xinyi Chen, Sham Kakade, Karthik Narasimhan, Cyril Zhang, and Yi Zhang. 2020. Calibration, entropy rates, and memory in language models. pages 1089–1099.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Chiu and Nichols (2016) Jason PC Chiu and Eric Nichols. 2016. Named entity recognition with bidirectional lstm-cnns. Transactions of the association for computational linguistics, 4:357–370.
  • Chowdhery et al. (2022) Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2022. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
  • Christopher et al. (2006) Walker Christopher, Strassel Stephanie, Medero Julie, and Maeda Kazuaki. 2006. Ace 2005 multilingual training corpus.
  • Collobert et al. (2011) Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of machine learning research, 12(ARTICLE):2493–2537.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Doddington et al. (2004) George R Doddington, Alexis Mitchell, Mark A Przybocki, Lance A Ramshaw, Stephanie M Strassel, and Ralph M Weischedel. 2004. The automatic content extraction (ace) program-tasks, data, and evaluation. 2(1).
  • Du et al. (2022) Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. 2022. Glam: Efficient scaling of language models with mixture-of-experts. pages 5547–5569.
  • Gao et al. (2021) Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. Simcse: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
  • Gururangan et al. (2018) Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R Bowman, and Noah A Smith. 2018. Annotation artifacts in natural language inference data. arXiv preprint arXiv:1803.02324.
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. pages 3929–3938.
  • Hammerton (2003) James Hammerton. 2003. Named entity recognition with long short-term memory. pages 172–175.
  • Hegselmann et al. (2022) Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. 2022. Tabllm: Few-shot classification of tabular data with large language models. arXiv preprint arXiv:2210.10723.
  • Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
  • Jiang et al. (2021) Zhengbao Jiang, Jun Araki, Haibo Ding, and Graham Neubig. 2021. How can we know when language models know? on the calibration of language models for question answering. 9:962–977.
  • Katiyar and Cardie (2018) Arzoo Katiyar and Claire Cardie. 2018. Nested named entity recognition revisited. 1.
  • Lample et al. (2016) Guillaume Lample, Miguel Ballesteros, Sandeep Subramanian, Kazuya Kawakami, and Chris Dyer. 2016. Neural architectures for named entity recognition. arXiv preprint arXiv:1603.01360.
  • Lan et al. (2019) Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
  • Lazaridou et al. (2022) Angeliki Lazaridou, Elena Gribovskaya, Wojciech Stokowiec, and Nikolai Grigorev. 2022. Internet-augmented language models through few-shot prompting for open-domain question answering. arXiv preprint arXiv:2203.05115.
  • Li et al. (2022) Junlong Li, Zhuosheng Zhang, and Hai Zhao. 2022. Self-prompting large language models for open-domain qa. arXiv preprint arXiv:2212.08635.
  • Li et al. (2019a) Xiaoya Li, Jingrong Feng, Yuxian Meng, Qinghong Han, Fei Wu, and Jiwei Li. 2019a. A unified mrc framework for named entity recognition. arXiv preprint arXiv:1910.11476.
  • Li et al. (2019b) Xiaoya Li, Xiaofei Sun, Yuxian Meng, Junjun Liang, Fei Wu, and Jiwei Li. 2019b. Dice loss for data-imbalanced nlp tasks. arXiv preprint arXiv:1911.02855.
  • Liu et al. (2021) Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021. What makes good in-context examples for gpt-3? arXiv preprint arXiv:2101.06804.
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
  • Lu et al. (2021) Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2021. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. arXiv preprint arXiv:2104.08786.
  • Ma and Hovy (2016) Xuezhe Ma and Eduard Hovy. 2016. End-to-end sequence labeling via bi-directional lstm-cnns-crf. arXiv preprint arXiv:1603.01354.
  • Moslem et al. (2023) Yasmin Moslem, Rejwanul Haque, and Andy Way. 2023. Adaptive machine translation with large language models. arXiv preprint arXiv:2301.13294.
  • Ohta et al. (2002) Tomoko Ohta, Yuka Tateisi, Jin-Dong Kim, Hideki Mima, and Junichi Tsujii. 2002. The genia corpus: An annotated research abstract corpus in molecular biology domain.
  • Perez et al. (2021) Ethan Perez, Douwe Kiela, and Kyunghyun Cho. 2021. True few-shot learning with language models. Advances in neural information processing systems, 34:11054–11070.
  • Pietrzak et al. (2021) Ben Pietrzak, Ben Swanson, Kory Mathewson, Monica Dinculescu, and Sherol Chen. 2021. Story centaur: Large language model few shot learning as a creative writing tool.
  • Pradhan et al. (2013) Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using ontonotes.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  • Rae et al. (2021) Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  • Roberts et al. (2020) Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? arXiv preprint arXiv:2002.08910.
  • Robinson et al. (2022) Joshua Robinson, Christopher Michael Rytting, and David Wingate. 2022. Leveraging large language models for multiple choice question answering. arXiv preprint arXiv:2210.12353.
  • Rubin et al. (2021) Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2021. Learning to retrieve prompts for in-context learning. arXiv preprint arXiv:2112.08633.
  • Sang and De Meulder (2003) Erik F Sang and Fien De Meulder. 2003. Introduction to the conll-2003 shared task: Language-independent named entity recognition. arXiv preprint cs/0306050.
  • Sarzynska-Wawer et al. (2021) Justyna Sarzynska-Wawer, Aleksander Wawer, Aleksandra Pawlak, Julia Szymanowska, Izabela Stefaniak, Michal Jarkiewicz, and Lukasz Okruszek. 2021. Detecting formal thought disorder by deep contextualized word representations. Psychiatry Research, 304:114135.
  • Smith et al. (2022) Shaden Smith, Mostofa Patwary, Brandon Norick, Patrick LeGresley, Samyam Rajbhandari, Jared Casper, Zhun Liu, Shrimai Prabhumoye, George Zerveas, Vijay Korthikanti, et al. 2022. Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990.
  • Thoppilan et al. (2022) Romal Thoppilan, Daniel De Freitas, Jamie Hall, Noam Shazeer, Apoorv Kulshreshtha, Heng-Tze Cheng, Alicia Jin, Taylor Bos, Leslie Baker, Yu Du, et al. 2022. Lamda: Language models for dialog applications. arXiv preprint arXiv:2201.08239.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  • Vidal et al. (2022) Blanca Vidal, Albert Llorens, and Juan Alonso. 2022. Automatic post-editing of mt output using large language models. pages 84–106.
  • Vilar et al. (2022) David Vilar, Markus Freitag, Colin Cherry, Jiaming Luo, Viresh Ratnakar, and George Foster. 2022. Prompting palm for translation: Assessing strategies and performance. arXiv preprint arXiv:2211.09102.
  • Wang et al. (2022) Shuhe Wang, Yuxian Meng, Rongbin Ouyang, Jiwei Li, Tianwei Zhang, Lingjuan Lyu, and Guoyin Wang. 2022. Gnn-sl: Sequence labeling based on nearest examples via gnn. arXiv preprint arXiv:2212.02017.
  • Wang et al. (2020) Xinyu Wang, Yong Jiang, Nguyen Bach, Tao Wang, Zhongqiang Huang, Fei Huang, and Kewei Tu. 2020. Automated concatenation of embeddings for structured prediction. arXiv preprint arXiv:2010.05006.
  • Wei et al. (2021) Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652.
  • Yuan et al. (2021) Zheng Yuan, Chuanqi Tan, Songfang Huang, and Fei Huang. 2021. Fusing heterogeneous factors with triaffine mechanism for nested named entity recognition. arXiv preprint arXiv:2110.07480.
  • Zhang et al. (2022) Sheng Zhang, Hao Cheng, Jianfeng Gao, and Hoifung Poon. 2022. Optimizing bi-encoder for named entity recognition via contrastive learning. arXiv preprint arXiv:2208.14565.
  • Zhao et al. (2021) Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. pages 12697–12706.
Entities Annotations of English CoNLL2003
Entity Type Annotation
ORG organization entities are limited to named corporate, governmental, or other organizational entities
PER person entities are named persons or family
LOC location entities are the name of politically or geographically defined locations such as cities, provinces, countries, international regions, bodies of water, mountains, etc
MISC miscellaneous entities include events, nationalities, products and works of art
Table 4: Entity annotations of the flat NER dataset CoNLL2003.
Entities Annotations of English OntoNotes5.0
Entity Type Annotation
PERSON People, including fictional
NORP Nationalities or religious or political groups
FAC Buildings, airports, highways, bridges, etc
ORG Companies, agencies, institutions, etc
GPE Countries, cities, states
LOC Non-GPE locations, mountain ranges, bodies of water
PRODUCT Vehicles, weapons, foods, etc
EVENT Named hurricanes, battles, wars, sports events, etc
WORK_OF_ART Titles of books, songs, etc
LAW Named documents made into laws
LANGUAGE Any named language
DATE Absolute or relative dates or periods
TIME Times smaller than a day
PERCENT Percentage (including "%")
MONEY Monetary values, including unit
QUANTITY Measurements, as of weight or distance
ORDINAL "first", "second", etc
CARDINAL Numerals that do not fall under another type
Table 5: Entity annotations of the flat NER dataset OntoNotes5.0.
Statistics on English CoNLL2003
Dataset Sentences Tokens Entities
Training set 14,987 203,621 23,499
Development set 3,466 51,362 5,942
Test set 3,684 46,435 5,648
Statistics on OntoNotes5.0
Dataset Sentences Tokens Entities
Training set 59,924 1,088,503 81,828
Development set 8,528 147,724 11,066
Test set 8,262 152,728 11,257
Table 6: Number of sentences, tokens and entities of the flat NER dataset English CoNLL2003 and OntoNotes5.0.
Statistics on ACE2004
Dataset Sentences Entities Nested Entities Nested Percentage
Training set 6,200 22,204 10,149 45.71%
Development set 745 2,514 1,092 46.69%
Test set 812 3,035 1,417 45.61%
Statistics on ACE2005
Dataset Sentences Entities Nested Entities Nested Percentage
Training set 7,194 24,441 9,389 38.41%
Development set 969 3,200 1,112 34.75%
Test set 1,047 2,993 1,118 37.35%
Statistics on GENIA
Dataset Sentences Entities Nested Entities Nested Percentage
Training set 16,692 50,509 9,064 17.95%
Development set - - - -
Test set 1,854 5,506 1,199 21.78%
Table 7: Number of sentences, entities, nested entities, and nested percentage of the nested NER datasets ACE2004, ACE2005 and GENIA.
Entities Annotations of English ACE2004 and ACE2005
Entity Type Annotation
GPE geographical political entities are geographical regions defined by political and/or social groups such as countries, nations, regions, cities, states, government and its people
ORG organization entities are limited to companies, corporations, agencies, institutions and other groups of people
PER a person entity is limited to human including a single individual or a group
FAC facility entities are limited to buildings and other permanent man-made structures such as buildings, airports, highways, bridges
VEH vehicle entities are physical devices primarily designed to move, carry, pull or push the transported object such as helicopters, trains, ship and motorcycles
LOC location entities are limited to geographical entities such as geographical areas and landmasses, mountains, bodies of water, and geological formations
WEA weapon entities are limited to physical devices such as instruments for physically harming such as guns, arms and gunpowder
Table 8: Entity annotations of the datasets ACE2004 and ACE2005.
Example 1 Length Error
Task Description
I am an excellent linguist. The task is to label organization entities. Below are some examples.
Sentence-level Demonstrations
Input: Soccer - results of South Korean PRO-SOCCER games .
Output: O O O O O O O O O
Input: Soccer - results of South Korean PRO-SOCCER games .
Output: O O O O O O O O O
Input: Soccer - results of South Korean PRO-SOCCER games .
Output: O O O O O O O O O
Input: Soccer - Italian cup second round results .
Output: O O O O O O O O
Input Sentence and GPT-3 Output
Input: Soccer - Japan get lucky win , China in surprise defeat .
Output: O O O O O O O O O O O
Expected Output: O O O O O O O O O O O O
Example 2 Length Error and Entity Error
Task Description
I am an excellent linguist. The task is to label miscellaneous entities. Below are some examples.
Sentence-level Demonstrations
Input: Soccer - results of South Korean PRO-SOCCER games .
Output: O O O O B-MISC E-MISC O O O
Input: Soccer - results of South Korean PRO-SOCCER games .
Output: O O O O B-MISC E-MISC O O O
Input: Soccer - results of South Korean PRO-SOCCER games .
Output: O O O O B-MISC E-MISC O O O
Input: Soccer - Italian cup second round results .
Output: O O B-MISC E-MISC O O O O
Input Sentence and GPT-3 Output
Input: Soccer - Japan get lucky win , China in surprise defeat .
Output: O O B-MISC E-MISC O O O O B-MISC E-MISC O O
Expected Output: O O O O O O O O O O O O
Example 3 Length Error
Task Description
I am an excellent linguist. The task is to label person entities. Below are some examples.
Sentence-level Demonstrations
Input: Dubai 1996-08-26
Output: O O
Input: Dubai 1996-08-29
Output: O O
Input: Dubai 1996-08-29
Output: O O
Input: Dubai 1996-08-22
Output: O O
Input Sentence and GPT-3 Output
Input: AL-AIN , United Arab Emirates 1996-12-06
Output: O O
Expected Output: O O O O O O
Example 4 Length Error and Entity Error
Task Description
I am an excellent linguist. The task is to label location entities. Below are some examples.
Sentence-level Demonstrations
Input: Azerbaijan beat Switzerland 1-0 ( halftime 1-0 ) in their World Cup soccer European group three qualifying match on Saturday .
Output: S-LOC O S-LOC O O O O O O O O O O O O O O O O O O
Input: Nijmeh of Lebanon beat Nasr of Saudi Arabia 1-0 ( halftime 1-0 ) in their Asian club championship second round first leg tie on Saturday .
Output: O O S-LOC O O O B-LOC E-LOC O O O O O O O O O O O O O O O O O O
Input: Slovakia beat the Faroe Islands 2-1 ( halftime 1-0 ) in their World Cup soccer European group six qualifying match on Saturday .
Output: S-LOC O O B-LOC E-LOC O O O O O O O O O O O O O O O O O O
Input: Canada beat Panama 3-1 ( halftime 2-0 ) in their CONCACAF semifinal phase qualifying match for the 1998 World Cup on Friday .
Output: S-LOC O S-LOC O O O O O O O O O O O O O O O O O O O O
Input Sentence and GPT-3 Output
Input: Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
Output: S-LOC O O O O O O O O O O O O B-LOC E-LOC O O O O O O O O O O O O O O
Expected Output: S-LOC O O O O O O O O O O O O O O S-LOC O O O O O O O O O
Table 9: Examples for the BMES output format on the sample-100 CoNLL2003 dataset, where the error information is colored red and the expected correct output is colored blue.
Example 1 Position Error and Entity Error
Task Description
I am an excellent linguist. The task is to label miscellaneous entities. Below are some examples.
Sentence-level Demonstrations
Input: Soccer - results of South Korean PRO-SOCCER games .
Output: South Korean(4)
Input: Soccer - results of South Korean PRO-SOCCER games .
Output: South Korean(4)
Input: Soccer - results of South Korean PRO-SOCCER games .
Output: South Korean(4)
Input: Soccer - Italian cup second round results .
Output: Italian cup(2)
Input Sentence and GPT-3 Output
Input: Soccer - Japan get lucky win , China in surprise defeat .
Output: Japan(4), China(4)
Expected Output: None
Example 2 Position Error and Entity Error
Task Description
I am an excellent linguist. The task is to label organization entities. Below are some examples.
Sentence-level Demonstrations
Input: Dubai 1996-08-26
Output: None
Input: Dubai 1996-08-29
Output: None
Input: Dubai 1996-08-29
Output: None
Input: Dubai 1996-08-22
Output: None
Input Sentence and GPT-3 Output
Input: AL-AIN , United Arab Emirates 1996-12-06
Output: AL-AIN, United Arab Emirates
Expected Output: None
Example 3 Position Error
Task Description
I am an excellent linguist. The task is to label location entities. Below are some examples.
Sentence-level Demonstrations
Input: Third seed Arantxa Sanchez Vicario , the 1994 champion , and eighth-seeded Olympic gold medalist Lindsay Davenport dropped three game each en route to the second round .
Output: None
Input: Dutch champions Ajax Amsterdam faltered in their second league match of the season on Saturday losing 2-0 away at Heerenveen .
Output: None
Input: Soccer - disappointing Ajax slump 2-0 at Heerenveen .
Output: Heerenveen(7)
Input: Australian Open runner-up Anke Huber of Germany , the sixth seed , was undone by an unlucky draw that put her against 17th ranked South African Amanda Coetzer in her opening match .
Output: Germany(6)
Input Sentence and GPT-3 Output
Input: But China saw their luck desert them in the second match of the group , crashing to a surprise 2-0 defeat to newcomers Uzbekistan .
Output: China(2), Uzbekistan(9)
Expected Output: China(1), Uzbekistan(23)
Example 4 Position Error and Entity Error
Task Description
I am an excellent linguist. The task is to label miscellaneous entities. Below are some examples.
Sentence-level Demonstrations
Input: Azerbaijan beat Switzerland 1-0 ( halftime 1-0 ) in their World Cup soccer European group three qualifying match on Saturday .
Output: World Cup(10), European(13)
Input: Nijmeh of Lebanon beat Nasr of Saudi Arabia 1-0 ( halftime 1-0 ) in their Asian club championship second round first leg tie on Saturday .
Output: Asian(15)
Input: Slovakia beat the Faroe Islands 2-1 ( halftime 1-0 ) in their World Cup soccer European group six qualifying match on Saturday .
Output: World Cup(12), European(15)
Input: Canada beat Panama 3-1 ( halftime 2-0 ) in their CONCACAF semifinal phase qualifying match for the 1998 World Cup on Friday .
Output: World Cup(18)
Input Sentence and GPT-3 Output
Input: Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
Output: Asian(14)
Expected Output: Asian Cup(6)
Table 10: Examples for the entity-position output format on the sample-100 CoNLL2003 dataset, where the error information is colored red and the expected correct output is colored blue.
Example 1
Task Description
I am an excellent linguist. The task is to label miscellaneous entities. Below are some examples.
Sentence-level Demonstrations
Input: Seattle at Boston
Output: Seattle at Boston
Input: 3. Carla Sacramento ( Portugal ) 4:08.96
Output: 3. Carla Sacramento ( Portugal ) 4:08.96
Input: Director Budge Weidman , who has shepherded the project from the beginning , predicts it will take up to a decade to complete .
Output: Director Budge Weidman , who has shepherded the project from the beginning , predicts it will take up to a decade to complete .
Input: Hull 0 Barnet 0
Output: Hull 0 Barnet 0
Input: Scott Draper ( Australia ) vs. Galo Blanco ( Spain )
Output: Scott Draper ( Australia ) vs. Galo Blanco ( Spain )
Input: Standings in the French first
Output: Standings in the @@French## first
Input: He said only the removal of the government and an early election could save Pakistan from disaster . "
Output: He said only the removal of the government and an early election could save Pakistan from disaster . "
Input: Stock markets
Output: Stock markets
Input Sentence and GPT-3 Output
Input: Soccer - Japan get lucky win , China in surprise defeat .
Output: Soccer - Japan get lucky win , China in surprise defeat .
Expected Output: Soccer - Japan get lucky win , China in surprise defeat .
Example 2
Task Description
I am an excellent linguist. The task is to label organization entities. Below are some examples.
Sentence-level Demonstrations
Input: Jakob Hlasek ( Switzerland ) beat Alberto Berasategui ( Spain ) 7-6 ( 7-5 ) 7-6 ( 9-7 ) 6-0
Output: Jakob Hlasek ( Switzerland ) beat Alberto Berasategui ( Spain ) 7-6 ( 7-5 ) 7-6 ( 9-7 ) 6-0
Input: After bogeying the 10th hole to move to four-over for the round , he rallied for birdies on 15 and 18 .
Output: After bogeying the 10th hole to move to four-over for the round , he rallied for birdies on 15 and 18 .
Input: Abidjan 1996-08-29
Output: Abidjan 1996-08-29
Input: Falkirk 1 Partick 0
Output: @@Falkirk## 1 @@Partick## 0
Input: Williams seized two wickets in two deliveries and left-armer Ilott also captured two as Gloucestershire , 252 behind on first innings , slumped to 27 for four at the close on the third day of the four-day game at Colchester .
Output: Williams seized two wickets in two deliveries and left-armer Ilott also captured two as @@Gloucestershire## , 252 behind on first innings , slumped to 27 for four at the close on the third day of the four-day game at Colchester .
Input: South Queensland 21 4 0 17 210 460 8
Output: @@South Queensland## 21 4 0 17 210 460 8
Input: In Skopje : Sloga Jugomagnat ( Macedonia ) 0 Kispest Honved
Output: In Skopje : @@Sloga Jugomagnat## ( Macedonia ) 0 Kispest Honved
Input: Call C 98.00 pct 0.47 Dem 3.30 pct 202.90 X
Output: Call C 98.00 pct 0.47 Dem 3.30 pct 202.90 X
Input Sentence and GPT-3 Output
Input: AL-AIN , United Arab Emirates 1996-12-06
Output: @@AL-AIN## , United Arab Emirates 1996-12-06
Expected Output: AL-AIN , United Arab Emirates 1996-12-06
Example 3
Task Description
I am an excellent linguist. The task is to label miscellaneous entities. Below are some examples.
Sentence-level Demonstrations
Input: Serbian policeman shot dead in Kosovo province .
Output: @@Serbian## policeman shot dead in Kosovo province .
Input: British Labour Party leader Tony Blair won a narrow victory on Saturday when the party ’s Scottish executive voted 21-18 in favour of his plans for a referendum on a separate parliament for Scotland .
Output: British Labour Party leader Tony Blair won a narrow victory on Saturday when the party ’s @@Scottish## executive voted 21-18 in favour of his plans for a referendum on a separate parliament for Scotland .
Input: Newcastle 24 Western Reds 20
Output: Newcastle 24 Western Reds 20
Input: WSRL is part of the Welspun group which has a presence in the cotton yarn , terry towels and polyester yarn industry , the statement said .
Output: WSRL is part of the Welspun group which has a presence in the cotton yarn , terry towels and polyester yarn industry , the statement said .
Input: In Chicago , Erik Hanson outdueled Alex Fernandez , and Jacob Brumfield drove in Otis Nixon with the game ’s only run in the sixth inning as the Toronto Blue Jays blanked the White Sox 1-0 in a game shortened to six innings due to rain .
Output: In Chicago , Erik Hanson outdueled Alex Fernandez , and Jacob Brumfield drove in Otis Nixon with the game ’s only run in the sixth inning as the Toronto Blue Jays blanked the White Sox 1-0 in a game shortened to six innings due to rain .
Input: ( Corrects that Habsudova is sixth seed ) .
Output: ( Corrects that Habsudova is sixth seed ) .
Input: San Francisco at New York
Output: San Francisco at New York
Input: – The short-term price objective is $ 5 a share and the long-term objective is $ 9 .
Output: – The short-term price objective is $ 5 a share and the long-term objective is $ 9 .
Input Sentence and GPT-3 Output
Input: Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
Output: Japan began the defence of their @@Asian## Cup title with a lucky 2-1 win against @@Syria## in a Group C championship match on Friday .
Expected Output: Japan began the defence of their @@Asian Cup## title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
Table 11: Examples on the CoNLL2003 datasets with the random retrieval.
Example 1
Task Description
I am an excellent linguist. The task is to label location entities. Below are some examples.
Sentence-level Demonstrations
Input: Dubai 1996-08-26
Output: @@Dubai## 1996-08-26
Input: Dubai 1996-08-29
Output: @@Dubai## 1996-08-29
Input: Dubai 1996-08-29
Output: @@Dubai## 1996-08-29
Input: Dubai 1996-08-22
Output: @@Dubai## 1996-08-22
Input: Dubai 1996-08-25
Output: @@Dubai## 1996-08-25
Input: Baghdad 1996-08-24
Output: @@Baghdad## 1996-08-24
Input: Baghdad 1996-08-27
Output: @@Baghdad## 1996-08-27
Input: Baghdad 1996-08-28
Output: @@Baghdad## 1996-08-28
Input Sentence and GPT-3 Output
Input: AL-AIN , United Arab Emirates 1996-12-06
Output: @@AL-AIN## , @@United Arab Emirates## 1996-12-06
Expected Output: @@AL-AIN## , @@United Arab Emirates## 1996-12-06
Example 2
Task Description
I am an excellent linguist. The task is to label location entities. Below are some examples.
Sentence-level Demonstrations
Input: Azerbaijan beat Switzerland 1-0 ( halftime 1-0 ) in their World Cup soccer European group three qualifying match on Saturday .
Output: @@Azerbaijan## beat @@Switzerland## 1-0 ( halftime 1-0 ) in their World Cup soccer European group three qualifying match on Saturday .
Input: Nijmeh of Lebanon beat Nasr of Saudi Arabia 1-0 ( halftime 1-0 ) in their Asian club championship second round first leg tie on Saturday .
Output: Nijmeh of @@Lebanon## beat Nasr of @@Saudi Arabia## 1-0 ( halftime 1-0 ) in their Asian club championship second round first leg tie on Saturday .
Input: Slovakia beat the Faroe Islands 2-1 ( halftime 1-0 ) in their World Cup soccer European group six qualifying match on Saturday .
Output: @@Slovakia## beat the @@Faroe Islands## 2-1 ( halftime 1-0 ) in their World Cup soccer European group six qualifying match on Saturday .
Input: Canada beat Panama 3-1 ( halftime 2-0 ) in their CONCACAF semifinal phase qualifying match for the 1998 World Cup on Friday .
Output: @@Canada## beat @@Panama## 3-1 ( halftime 2-0 ) in their CONCACAF semifinal phase qualifying match for the 1998 World Cup on Friday .
Input: Soccer - Azerbaijan beat Switzerland in world cup Qualifier .
Output: Soccer - @@Azerbaijan## beat @@Switzerland## in world cup Qualifier .
Input: Soccer - Wales beat San Marino in world cup Qualifier .
Output: Soccer - @@Wales## beat @@San Marino## in world cup Qualifier .
Input: The United States edged Austria in Salzburg 3-2 in the opening round in April , and then blanked Japan 5-0 in Nagoya last month in the semifinals .
Output: The @@United States## edged @@Austria## in @@Salzburg## 3-2 in the opening round in April , and then blanked @@Japan## 5-0 in @@Nagoya## last month in the semifinals .
Input: Soccer - Slovakia beat Faroes in world cup Qualifier .
Output: Soccer - @@Slovakia## beat @@Faroes## in world cup Qualifier .
Input Sentence and GPT-3 Output
Input: Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
Output: @@Japan## began the defence of their Asian Cup title with a lucky 2-1 win against @@Syria## in a Group C championship match on Friday .
Expected Output: @@Japan## began the defence of their Asian Cup title with a lucky 2-1 win against @@Syria## in a Group C championship match on Friday .
Example 3
Task Description
I am an excellent linguist. The task is to label miscellaneous entities. Below are some examples.
Sentence-level Demonstrations
Input: Azerbaijan beat Switzerland 1-0 ( halftime 1-0 ) in their World Cup soccer European group three qualifying match on Saturday .
Output: Azerbaijan beat Switzerland 1-0 ( halftime 1-0 ) in their @@World Cup## soccer @@European## group three qualifying match on Saturday .
Input: Nijmeh of Lebanon beat Nasr of Saudi Arabia 1-0 ( halftime 1-0 ) in their Asian club championship second round first leg tie on Saturday .
Output: Nijmeh of Lebanon beat Nasr of Saudi Arabia 1-0 ( halftime 1-0 ) in their @@Asian## club championship second round first leg tie on Saturday .
Input: Slovakia beat the Faroe Islands 2-1 ( halftime 1-0 ) in their World Cup soccer European group six qualifying match on Saturday .
Output: Slovakia beat the Faroe Islands 2-1 ( halftime 1-0 ) in their @@World Cup## soccer @@European## group six qualifying match on Saturday .
Input: Canada beat Panama 3-1 ( halftime 2-0 ) in their CONCACAF semifinal phase qualifying match for the 1998 World Cup on Friday .
Output: Canada beat Panama 3-1 ( halftime 2-0 ) in their CONCACAF semifinal phase qualifying match for the 1998 @@World Cup## on Friday .
Input: Soccer - Azerbaijan beat Switzerland in world cup Qualifier .
Output: Soccer - Azerbaijan beat Switzerland in @@world cup## Qualifier .
Input: Soccer - Wales beat San Marino in world cup Qualifier .
Output: Soccer - Wales beat San Marino in @@world cup## Qualifier .
Input: The United States edged Austria in Salzburg 3-2 in the opening round in April , and then blanked Japan 5-0 in Nagoya last month in the semifinals .
Output: The United States edged Austria in Salzburg 3-2 in the opening round in April , and then blanked Japan 5-0 in Nagoya last month in the semifinals .
Input: Soccer - Slovakia beat Faroes in world cup Qualifier .
Output: Soccer - Slovakia beat Faroes in @@world cup## Qualifier .
Input Sentence and GPT-3 Output
Input: Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
Output: Japan began the defence of their @@Asian Cup## title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
Expected Output: Japan began the defence of their @@Asian Cup## title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
Table 12: Examples on the CoNLL2003 datasets with the sentence-level embedding.
Example 1
Task Description
I am an excellent linguist. The task is to label location entities. Below are some examples.
Sentence-level Demonstrations
Input: AL-RAM , West Bank 1996-08-30
Output: @@AL-RAM## , @@West Bank## 1996-08-30
Input: AL-MUNTAR , West Bank 1996-08-26
Output: @@AL-MUNTAR## , @@West Bank## 1996-08-26
Input: Teravainen ( U.S. ) , Jean Van de Velde ( France ) , Oyvind Rojahn
Output: Teravainen ( @@U.S.## ) , Jean Van de Velde ( @@France## ) , Oyvind Rojahn
Input: The greatest declines in the volume of help-wanted advertising were in the New England , Mountain and West South Central regions .
Output: The greatest declines in the volume of help-wanted advertising were in the @@New England## , @@Mountain## and @@West South Central## regions .
Input: Doug Flach ( U.S. ) beat Gianluca Pozzi ( Italy ) 7-5 7-6 ( 7-5 ) 2-6 7-6 ( 8-6 )
Output: Doug Flach ( @@U.S.## ) beat Gianluca Pozzi ( @@Italy## ) 7-5 7-6 ( 7-5 ) 2-6 7-6 ( 8-6 )
Input: Jeff Tarango ( U.S. ) beat Alex Radulescu ( Romania ) 6-7 ( 5-7 ) 6-4 6-1 retired , heat exhaustion
Output: Jeff Tarango ( @@U.S.## ) beat Alex Radulescu ( @@Romania## ) 6-7 ( 5-7 ) 6-4 6-1 retired , heat exhaustion
Input: Chelsea , 16 , was at President Bill Clinton ’s side as he rode the rails through parts of West Virginia , Kentucky and Ohio , and was introduced at every stop .
Output: Chelsea , 16 , was at President Bill Clinton ’s side as he rode the rails through parts of @@West Virginia## , @@Kentucky## and @@Ohio## , and was introduced at every stop .
Input: Clinton said on Saturday he had ordered U.S. forces in the Gulf to go on high alert and was reinforcing them in response to Iraqi attacks on Kurdish dissidents in northern Iraq .
Output: Clinton said on Saturday he had ordered @@U.S.## forces in the @@Gulf## to go on high alert and was reinforcing them in response to Iraqi attacks on Kurdish dissidents in northern @@Iraq## .
Input Sentence and GPT-3 Output
Input: AL-AIN , United Arab Emirates 1996-12-06
Output: @@AL-AIN## , @@United Arab Emirates## 1996-12-06
Expected Output: @@AL-AIN## , @@United Arab Emirates## 1996-12-06
Example 2
Task Description
I am an excellent linguist. The task is to label miscellaneous entities. Below are some examples.
Sentence-level Demonstrations
Input: The armed hijackers of the Airbus 310 Flight 150 , which is expected to arrive about 4 a.m. ( 0300 GMT ) , have said they intend to surrender and seek political asylum in Britain .
Output: The armed hijackers of the @@Airbus 310## @@Flight 150## , which is expected to arrive about 4 a.m. ( 0300 @@GMT## ) , have said they intend to surrender and seek political asylum in Britain .
Input: Toronto-based Barrick , the world ’s third largest gold producer , sweetened its July 11 bid to C$ 30 a share from C$ 27 on August 16 after a fresh batch of drill results from the Pierina deposit .
Output: @@Toronto-based## Barrick , the world ’s third largest gold producer , sweetened its July 11 bid to @@C$## 30 a share from @@C$## 27 on August 16 after a fresh batch of drill results from the Pierina deposit .
Input: The club , who put Manchester United out of last year ’s UEFA Cup , were fined $ 1,000 .
Output: The club , who put Manchester United out of last year ’s @@UEFA Cup## , were fined $ 1,000 .
Input: Shr C$ 0.12 C$ 0.15
Output: Shr @@C$## 0.12 @@C$## 0.15
Input: Shr C$ 0.04 C$ 0.08
Output: Shr @@C$## 0.04 @@C$## 0.08
Input: An Iraqi Kurdish group on Wednesday said it had agreed a new U.S.-brokered ceasefire with a rival faction after a previous accord was shattered by sporadic fighting between the groups in recent days .
Output: An @@Iraqi Kurdish## group on Wednesday said it had agreed a new @@U.S.-brokered## ceasefire with a rival faction after a previous accord was shattered by sporadic fighting between the groups in recent days .
Input: On Friday , Metro Holdings topped gainers , soaring by S$ 1.55 to close at S$ 6.05 on market rumours of a takeover bid by First Capital Corp .
Output: On Friday , Metro Holdings topped gainers , soaring by @@S$## 1.55 to close at @@S$## 6.05 on market rumours of a takeover bid by First Capital Corp .
Input: We ’re looking for it to stabilise now , " said one Euromark options trader at a U.S. bank .
Output: We ’re looking for it to stabilise now , " said one @@Euromark## options trader at a U.S. bank .
Input Sentence and GPT-3 Output
Input: Japan began the defence of their Asian Cup title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
Output: Japan began the defence of their @@Asian Cup## title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
Expected Output: Japan began the defence of their @@Asian Cup## title with a lucky 2-1 win against Syria in a Group C championship match on Friday .
Example 3
Task Description
I am an excellent linguist. The task is to label location entities. Below are some examples.
Sentence-level Demonstrations
Input: In April , China quashed a draft resolution by the U.N. Human Rights Commission expressing concern over continuing reports of Beijing ’s violations of fundamental freedoms .
Output: In April , @@China## quashed a draft resolution by the U.N. Human Rights Commission expressing concern over continuing reports of @@Beijing## ’s violations of fundamental freedoms .
Input: China thanks Gabon for support on human rights .
Output: @@China## thanks @@Gabon## for support on human rights .
Input: China says Taiwan spoils atmosphere for talks .
Output: @@China## says @@Taiwan## spoils atmosphere for talks .
Input: Asked what India would do if the pact were forwarded to the United Nations General Assembly , Gujral said : " That bridge I will cross when I come to it . "
Output: Asked what @@India## would do if the pact were forwarded to the United Nations General Assembly , Gujral said : " That bridge I will cross when I come to it . "
Input: China says militant Japan must face war past .
Output: @@China## says militant @@Japan## must face war past .
Input: The victory against Japan marked the Fed Cup debut of Monica Seles , who became a naturalised U.S. citizen in 1994 .
Output: The victory against @@Japan## marked the Fed Cup debut of Monica Seles , who became a naturalised @@U.S.## citizen in 1994 .
Input: The constitutional monarch , who last visited China in 1993 , was scheduled to meet Chinese President Jiang Zemin and Premier Li Peng during his visit , they said .
Output: The constitutional monarch , who last visited @@China## in 1993 , was scheduled to meet Chinese President Jiang Zemin and Premier Li Peng during his visit , they said .
Input: Atheist China officially bans missionary activities but often turns a blind eye to religious activities of people nominally employed as foreign language teachers , particularly in remote areas that are unable to attract other candidates .
Output: Atheist @@China## officially bans missionary activities but often turns a blind eye to religious activities of people nominally employed as foreign language teachers , particularly in remote areas that are unable to attract other candidates .
Input Sentence and GPT-3 Output
Input: But China saw their luck desert them in the second match of the group , crashing to a surprise 2-0 defeat to newcomers Uzbekistan .
Output: But @@China## saw their luck desert them in the second match of the group , crashing to a surprise 2-0 defeat to newcomers @@Uzbekistan##
Expected Output: But @@China## saw their luck desert them in the second match of the group , crashing to a surprise 2-0 defeat to newcomers @@Uzbekistan##
Table 13: Examples on the CoNLL2003 datasets with the entity-level embedding.

Appendix A Datasets

A.1 Flat NER

CoNLL2003.

CoNLL2003 Sang and De Meulder (2003) contains four types of named entities: Location, Organization, Person and Miscellaneous. Table 4 shows annotations for each entity type and Table 6 shows the number of sentences, tokens and entities in CoNLL2003.

OntoNotes5.0.

OntoNotes Pradhan et al. (2013) contains 18 types of named entities, and Table 5 lists each entity and its annotation. The number of sentences, tokens and entities of OntoNotes5.0 is shown in Table 6.

A.2 Nested NER

ACE2004 and ACE2005.

ACE2004 Doddington et al. (2004) and ACE2005 Christopher et al. (2006) are two English nested NER datasets containing seven entity types: geographical political entities (GPE), organization entities (ORG), person entities (PER), facility entities (FAC), vehicle entities (VEH), location entities (LOC) and weapon entities (WEA). Annotations for each entity type are shown in Table 8, and the number of sentences, entities, nested entities and nested percentages are shown in Table 7.

GENIA.

GENIA is an English nested NER dataset in the molecular biology domain and contains five entity types: cell line, cell type, DNA, RNA and protein. Each entity type is named according to its biological meaning, and the number of sentences, entities, nested entities and nested percentages is shown in Table 7.

Appendix B Error Cases of the BMES and Entity-Position Formats

We select several examples from the sample-100 CoNLL2003 dataset to better illustrate the ineffectiveness of the BMES and entity-position formats. Examples for the BMES format are shown in Table 9, and examples for the entity-position format are shown in Table 10.

From these examples, we can clearly observe that (1) for the BMES format, it is difficult for GPT-3 to generate output of the same length as the input sentence, especially when the input sentence is long; and (2) for the entity-position format, GPT-3 is easily confused when generating the correct position information.

Appendix C Examples

To better illustrate the demonstrations used by GPT-NER, we present several examples with random retrieval in Table 11, with sentence-level embeddings in Table 12, and with entity-level embeddings in Table 13. From these results, we can observe that:

(1) For random retrieval (Table 11), all sentences have the same chance of appearing as a demonstration, and the retrieved examples are usually not similar to the input sentence.

(2) For sentence-level embeddings (Table 12), the retrieved examples are semantically similar to the input sentence, but may not focus on the same local entities as the input sentence.

(3) For entity-level embeddings (Table 13), the retrieved examples do focus on the same local entities as the input sentence, which guides the prediction of GPT-3 more effectively. This phenomenon underscores the importance of demonstration quality for in-context learning.