Beyond the Known: Investigating LLMs Performance on Out-of-Domain Intent Detection
Pei Wang$^{1*}$, Keqing He$^{2*}$, Yejie Wang$^{1*}$, Xiaoshuai Song$^{1}$, Yutao Mou$^{1}$, Jingang Wang$^{2}$, Yunsen Xian$^{2}$, Xunliang Cai$^{2}$, Weiran Xu$^{1*}$
{wangpei, wangyejie, songxiaoshuai, myt, xuweiran}@bupt.edu.cn
{hekeqing, wangjingang, xianyunsen, caixunliang}@meituan.com
Abstract
Out-of-domain (OOD) intent detection aims to examine whether the user's query falls outside the predefined domain of the system, which is crucial for the proper functioning of task-oriented dialogue (TOD) systems. Previous methods address it by fine-tuning discriminative models. Recently, some studies have explored the application of large language models (LLMs), represented by ChatGPT, to various downstream tasks, but their ability on the OOD detection task remains unclear. This paper conducts a comprehensive evaluation of LLMs under various experimental settings and then outlines their strengths and weaknesses. We find that LLMs exhibit strong zero-shot and few-shot capabilities but are still at a disadvantage compared to models fine-tuned with full resources. Going deeper, through a series of additional analysis experiments, we discuss and summarize the challenges faced by LLMs and provide guidance for future work, including injecting domain knowledge, strengthening knowledge transfer from IND (in-domain) to OOD, and understanding long instructions.
Keywords: OOD, ChatGPT, LLM
1. Introduction
Traditional TOD systems are based on the closed-set hypothesis (Chen et al., 2019; Yang et al., 2021; Zeng et al., 2022) and can only handle queries within a limited scope of in-domain (IND) intents. However, users may input queries with out-of-domain (OOD) intents in the real open world, which poses new challenges for TOD systems. As shown in Figure 1, the OOD intent detection task aims to determine whether the intent of a user query falls outside the predefined intents, making it an essential component of TOD systems (Tulshan and Dhage, 2018; Lin and Xu, 2019; Zeng et al., 2021; Wu et al., 2022a,b; Mou et al., 2022, 2023; Song et al., 2023b).
Previous work on OOD detection relies on fine-tuning a pre-trained language model (PLM), extracting the output representation of the PLM's final layer as the intent feature, and employing scoring functions based on density, distance, or energy to detect OOD samples, as shown in Figure 2 (Zeng et al., 2021; Zhou et al., 2022; Mou et al., 2022; Cho et al., 2023; Wang et al., 2023b). Recently, the emergence of large language models (LLMs) such as ChatGPT$^{1}$ has injected new vitality into natural language processing (NLP) tasks. Their superior zero-shot learning capability enables a new paradigm of NLP research and applications by prompting LLMs without fine-tuning (Ouyang et al., 2022; Touvron et al., 2023; Jiao et al., 2023; Wei et al., 2023; Yang et al., 2023). Given the LLMs' training on broad
text corpora and their impressive generalization skills, it's worth considering the benefits and potential challenges they may face in open-scenario intent identification. Specifically, we raise the following questions:
Figure 1: Explanation of the role of OOD intent detection in the TOD system. When the system encounters an intent that is beyond its supported intents, it can detect this and prompt the user in a friendly manner.
What are the potential positive and negative effects of large language models on the out-of-domain (OOD) detection task?
What are the strengths and weaknesses of large language models compared with traditional fine-tuned models?
Why do large language models exhibit certain strengths and weaknesses?
How can we potentially address and improve these weaknesses?
In this work, we introduce two LLM-based OOD detection frameworks, ZSD-LLM and FSD-LLM, which rely on different IND priors to instruct the LLM to conduct intent detection (Section 3). Then we conduct comparative experiments between ChatGPT and discriminative methods (Section 4). To further explore the underlying reasons behind the experimental results, we conduct a series of analytical experiments covering the effect of the IND intent number, different data splits, a comparison of different LLMs, and the effect of different prompts (Section 5). Finally, we summarize the strengths and weaknesses of ChatGPT on OOD detection tasks and future improvement directions (Section 6). To the best of our knowledge, we are the first to comprehensively evaluate the performance of LLMs on OOD intent detection.
The key findings of this paper can be summarized as follows:
What ChatGPT does well:
ChatGPT can achieve good zero-shot performance without being provided any IND intent priors, demonstrating its powerful NLU capabilities.
When the number of IND intents is small, ChatGPT can achieve better accuracy in few-shot settings than discriminative models.
ChatGPT can not only perform OOD detection but also output the intent of the OOD samples, which is something that current methods based on discriminative models cannot achieve.
What ChatGPT does not do well:
ChatGPT performs significantly worse than the baselines when there is a large number of IND intents. This manifests as more misclassifications among IND intents and a substantial number of OOD samples being detected as IND when the number of IND intents is higher.
In rare instances, ChatGPT does not output according to our designed instructions. Particularly when the increase in intents leads to longer instructions, ChatGPT may overlook key information in the prompts, resulting in task failure.
Compared to discriminative models, the performance of ChatGPT is affected to some extent by the number of intents. This is primarily manifested when the number of IND intents increases, resulting in a significant decline in its performance.
ChatGPT struggles with fine-grained semantic distinctions, indicating that its comprehension of fine-grained intent labels is insufficient and misaligned with human-level understanding.
It is challenging for ChatGPT to acquire knowledge from IND demonstrations that could assist with OOD tasks. It might even perceive IND demonstrations as noise, which can harm the performance on OOD tasks.
We further summarize future LLM improvement directions, which include the following aspects: 1) injecting domain knowledge, 2) strengthening knowledge transfer from IND to OOD, and 3) understanding long instructions.$^{2}$

Figure 2: Comparison between the previous OOD detection method (upper part) and the LLM-based method (lower part). The previous method trains a feature extractor using IND samples in the first stage and estimates the confidence score of a sample using a designed scoring function over the extracted features; our end-to-end LLM-based OOD detection adds a task description to the prompt, and the LLM directly outputs the detection result.
2. Related Work
2.1. LLM
LLMs have become a popular paradigm for research and applications in natural language processing tasks. ChatGPT is a generative foundational model belonging to the GPT-3.5 series in the OpenAI GPT family, which includes its predecessors GPT, GPT-2, and GPT-3. Recently, there has been increasing interest in utilizing LLMs for various natural language processing (NLP) tasks. Several studies have systematically investigated the performance of ChatGPT on different downstream tasks, including machine translation (Jiao et al., 2023), information extraction (Wei et al., 2023), summarization (Yang et al., 2023), and clustering (Song et al., 2023a). However, the performance of ChatGPT on OOD detection remains unclear.
<Task description>
You are an out-of-domain intent detector, and your task is to detect whether the intents of users' queries belong to the intents supported by the system. If they do, return the corresponding intent label, otherwise return unknown. The supported intents include: [Intent 1], [Intent 2] ... [Intent N]
<Response format>
Please respond to me with the format of “Intent: XX” or “Intent: unknown”
<Utterance for test>
Please tell me the intent of this text: [Here is the utterance for test.]

ZSD-LLM

<Task description>
You are an out-of-domain intent detector, and your task is to detect whether the intents of users' queries belong to the intents supported by the system. If they do, return the corresponding intent label, otherwise return unknown. The supported intents include: [Intent 1] ([Example 1] [Example 2] ...), [Intent 2] ([Example 1] [Example 2] ...), ... The text in parentheses is the example of the corresponding intent.
<Response format>
Please respond to me with the format of “Intent: XX” or “Intent: unknown”
<Utterance for test>
Please tell me the intent of this text: [Here is the utterance for test.]

FSD-LLM
Figure 3: The two prompts we use to assist ChatGPT in performing OOD intent detection. FSD-LLM additionally incorporates examples of the intents in the prompt as prior knowledge.
2.2. OOD Detection
Previous OOD detection methods can be divided into two categories: supervised OOD detection (Fei and Liu, 2016; Kim and Kim, 2018; Larson et al., 2019; Zheng et al., 2020) and unsupervised OOD detection (Shu et al., 2017; Lee et al., 2018; Ren et al., 2019; Lin and Xu, 2019; Xu et al., 2020; Zeng et al., 2021; Mou et al., 2022). The former assumes that extensive labeled OOD samples are available in the training data. Classic supervised OOD algorithms treat the OOD detection problem as an $N+1$ classification problem (Fei and Liu, 2016; Larson et al., 2019). Unsupervised methods generally operate in two stages: learning intent representations and estimating confidence scores (Mou et al., 2022). Most previous research has focused on fine-tuning small-scale pre-trained language models (PLMs), such as BERT, to learn intent features from the training data (Wang et al., 2023b). However, with the recent advancements in LLMs, there is increasing interest in exploring their potential for OOD intent detection. Compared to small-scale PLMs, LLMs have a greater capacity for learning and generalization from data, making them promising candidates for OOD intent detection.
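To make the two-stage paradigm above concrete, the following is a minimal sketch of the scoring stage, assuming intent features from an already fine-tuned encoder are available as NumPy arrays; the value of k, the distance metric, and the rejection threshold are illustrative placeholders rather than the exact settings used by KNN-CL or UniNL.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_knn_index(ind_features: np.ndarray, k: int = 5) -> NearestNeighbors:
    """Index the IND training features (one row per sample) for k-NN lookup."""
    index = NearestNeighbors(n_neighbors=k, metric="euclidean")
    index.fit(ind_features)
    return index

def knn_ood_score(index: NearestNeighbors, query_feature: np.ndarray) -> float:
    """Distance to the k-th nearest IND training feature; larger means more OOD-like."""
    distances, _ = index.kneighbors(query_feature.reshape(1, -1))
    return float(distances[0, -1])

def detect(index: NearestNeighbors, query_feature: np.ndarray,
           threshold: float = 0.8) -> str:
    """Reject the query as OOD when its score exceeds a (validation-tuned) threshold."""
    return "OOD" if knn_ood_score(index, query_feature) > threshold else "IND"
```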
3. Methodology
3.1. Problem Formulation
Given a predefined set of intents $\mathcal{S}=\{l_{1}, l_{2}, \ldots, l_{N}\}$, which contains the $N$ intents supported by the system, the input is the user's natural language query $q=\{t_{1}, t_{2}, \ldots, t_{n}\}$, where $t_{i}$ denotes the $i$-th token in the query. The output is an intent label $l_{\text{pre}}$ that belongs to the set $\mathcal{S} \cup \{\mathrm{OOD}\}$.
3.2. Prompt Engineering
We evaluate the OOD intent detection capability of ChatGPT in an end-to-end manner. We heuristically propose two prompts based on different IND priors:
Zero-shot Detection (ZSD-LLM): This method only provides the IND intent set in the prompt as prior knowledge without supplying any IND samples. It can be utilized in scenarios where user privacy protection is required. The prompt template is: <Task description><Prior: $\mathcal{S}$><Response format><Utterance for test>.
Few-shot Detection (FSD-LLM): This method provides several samples for each intent in the prompt, allowing ChatGPT to extract useful knowledge from these samples and apply it to distinguish between IND and OOD intents. The prompt template is: <Task description><Prior: $D=\{(q_{1}, l_{1}), (q_{2}, l_{2}), \ldots, (q_{n}, l_{n})\}$><Response format><Utterance for test>.
We show these two methods in Figure 3. We discuss the exploration of different prompts in Section 5.4.
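As an illustration of how these templates can be assembled programmatically, the sketch below builds both the ZSD-LLM prompt (no demonstrations) and the FSD-LLM prompt (a few examples per intent) from the wording in Figure 3; the helper name, the example intents, and the demonstrations are hypothetical, and the call to the LLM backend is deliberately left out.

```python
from typing import Dict, List, Optional

TASK = ("You are an out-of-domain intent detector, and your task is to detect "
        "whether the intents of users' queries belong to the intents supported by "
        "the system. If they do, return the corresponding intent label, otherwise "
        "return unknown.")
FORMAT = 'Please respond to me with the format of "Intent: XX" or "Intent: unknown"'

def build_prompt(utterance: str,
                 ind_intents: List[str],
                 demos: Optional[Dict[str, List[str]]] = None) -> str:
    """ZSD-LLM when demos is None; FSD-LLM when example queries per intent are given."""
    if demos is None:
        prior = "The supported intents include: " + ", ".join(ind_intents)
    else:
        prior = "The supported intents include: " + ", ".join(
            f"{intent} ({' '.join(demos[intent])})" for intent in ind_intents
        ) + ". The text in parentheses is the example of the corresponding intent."
    test = f"Please tell me the intent of this text: {utterance}"
    return "\n".join([TASK, prior, FORMAT, test])

# Example with hypothetical intents and demonstrations (FSD-LLM style):
prompt = build_prompt("can you freeze my card",
                      ["card_lost", "exchange_rate"],
                      demos={"card_lost": ["i lost my card"],
                             "exchange_rate": ["what is the euro rate"]})
```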
4. Experiment
4.1. Setup
4.1.1. Dataset & Metric
Dataset We conduct experiments on two widely used benchmarks, CLINC (Larson et al., 2019) and Banking (Casanueva et al., 2020). CLINC consists of 150 intents distributed across 10 domains,
Table 1: The performance comparison between ChatGPT and the baselines on Banking. We select 25%, 50%, and 75% of all intents as IND intents. Each result is the average of three runs.
Table 2: The performance comparison between ChatGPT and the baselines on CLINC. We select 25%, 50%, and 75% of all intents as IND intents. Each result is the average of three runs.
| Statistic | Banking | CLINC |
| :--- | :---: | :---: |
| Avg utterance length | 12 | 9 |
| Intents | 77 | 150 |
| Training set size | 9003 | 15000 |
| Training samples per class | - | 100 |
| Development set size | 1000 | 3000 |
| Development samples per class | - | 20 |
| Testing set size | 3080 | 5500 |
| Testing samples per class | - | 30 |
Table 3: Statistics of datasets.
with each domain containing 15 intents. Banking contains intents from a single domain, totaling 77 intents. Consistent with previous research, we conduct OOD detection under three settings: 25%, 50%, and 75%. Here, 25% refers to selecting 25% of the intents as IND, with the remaining intents considered as OOD. We show the detailed statistics of the datasets in Table 3.
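A minimal sketch of this split procedure is shown below, assuming the full intent list is available; the seed value is arbitrary (Section 5.2 later repeats the split with five different seeds).

```python
import random
from typing import List, Tuple

def split_intents(all_intents: List[str], ind_ratio: float = 0.25,
                  seed: int = 0) -> Tuple[List[str], List[str]]:
    """Sample ind_ratio of the label set as IND; everything else is treated as OOD."""
    rng = random.Random(seed)
    n_ind = round(len(all_intents) * ind_ratio)
    ind = rng.sample(all_intents, n_ind)          # e.g. 25% of CLINC's 150 intents
    ood = [i for i in all_intents if i not in ind]
    return ind, ood
```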
Metric We employ six commonly used OOD detection metrics to evaluate performance: the IND metrics accuracy and macro-F1, the OOD metrics recall and macro-F1, as well as overall accuracy and macro-F1.
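The sketch below computes these six metrics with scikit-learn, assuming gold and predicted labels are strings and OOD is marked as "unknown"; the exact grouping follows common practice in this line of work and may differ in detail from the authors' evaluation script.

```python
from sklearn.metrics import accuracy_score, f1_score, recall_score

def evaluate(y_true, y_pred, ood_label="unknown"):
    # Restrict the IND metrics to samples whose gold label is an IND intent.
    ind_idx = [i for i, y in enumerate(y_true) if y != ood_label]
    ind_true = [y_true[i] for i in ind_idx]
    ind_pred = [y_pred[i] for i in ind_idx]
    return {
        "ALL-ACC": accuracy_score(y_true, y_pred),
        "ALL-F1": f1_score(y_true, y_pred, average="macro"),
        "IND-ACC": accuracy_score(ind_true, ind_pred),
        "IND-F1": f1_score(ind_true, ind_pred,
                           labels=sorted(set(ind_true)),
                           average="macro", zero_division=0),
        # Binary view of the OOD class: recall and F1 of "unknown" only.
        "OOD-Recall": recall_score(y_true, y_pred, labels=[ood_label],
                                   average="macro", zero_division=0),
        "OOD-F1": f1_score(y_true, y_pred, labels=[ood_label],
                           average="macro", zero_division=0),
    }
```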
4.1.2. Baselines
We compare ChatGPT with the following three state-of-the-art discriminative two-stage methods:
SCL (Zeng et al., 2021) It proposes a supervised contrastive learning objective to minimize intra-class variance by pulling together in-domain intents belonging to the same class and maximize inter-class variance by pushing apart samples from different classes.
KNN-CL (Zhou et al., 2022) It proposes a KNN-based contrastive loss for IND pre-training. KNN-CL selects the k-nearest neighbors from samples of the same class as positives and uses samples of different classes as negatives.
UniNL (Mou et al., 2022) It proposes a unified neighborhood learning framework to align representation learning with the scoring function to improve OOD detection performance. A KNCL objective is employed for IND pre-training, and a KNN-based score function is used for OOD detection.
4.2. ZSD-LLM Results
Our results are shown in Tables 1 and 2. They show that ZSD-LLM performs worse than the best baselines on all metrics. We analyze the results from the following aspects:
(1) The performance of IND intent recognition. There is a certain gap between ChatGPT and the strong baselines (UniNL, KNN-CL). Taking Banking-50% as an example, on the IND metrics ChatGPT's performance is 13.36% (IND-ACC) and 27.23% (IND-F1) lower than UniNL's. However, the gap between ChatGPT and SCL is slightly smaller, and ChatGPT even surpasses SCL in some settings, such as Banking-25% (71.57 -> 73.16). This demonstrates ChatGPT's strong zero-shot capability.
(2) The performance of OOD sample detection. Compared to IND classification, the performance gap between ChatGPT and the baselines is larger on the OOD metrics. Specifically, ChatGPT's OOD-Recall is 56% lower and its OOD-F1 is 47.41% lower than UniNL's on Banking-50%. This is generally observed across the three IND intent splits in both datasets. We speculate that the reason for the lower OOD metrics is that a large number of OOD samples are misclassified as IND intents by ChatGPT. Such results indicate that this kind of zero-shot prompting does not provide ChatGPT with sufficient prior knowledge to complete the OOD detection task.
(3) Comparison between datasets. The performance comparison of ChatGPT between the two datasets shows the same trend as the baselines.
The detection ability on the multi-domain dataset (CLINC) is better than on the single-domain dataset (Banking). On the simpler dataset CLINC, the gap between ChatGPT and UniNL is smaller than on Banking (ALL-ACC: 41.91 -> 23.84 for 25%, 34.96 -> 31.11 for 50%, 32.20 -> 27.53 for 75%). This indicates that the high granularity of intent division is a reason for the poor performance of ChatGPT.
(4) Task failure. In addition to the above results, we use the original OOD data from CLINC for OOD detection. This results in a total of 150 intents, which can be used to test ChatGPT's ability to perform large-scale OOD detection. Experimental results show that ChatGPT predicts new intents for approximately 8.49% of the test samples, returning neither an IND intent nor 'unknown', leading to task failure. This reflects the instability of LLMs in performing OOD detection.
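The sketch below illustrates how such failures can be flagged when post-processing replies: anything that maps to neither a supported IND label nor "unknown" is counted as a task failure. The parsing logic and helper name are assumptions based on the "Intent: XX" response format requested in the prompt.

```python
import re
from typing import List

def parse_reply(reply: str, ind_intents: List[str]) -> str:
    """Map a raw LLM reply to an IND label, 'unknown' (OOD), or a task-failure flag."""
    match = re.search(r"Intent:\s*(.+)", reply, flags=re.IGNORECASE)
    predicted = (match.group(1) if match else reply).strip().strip(' ."\'').lower()
    if predicted == "unknown":
        return "unknown"                      # detected as OOD
    for intent in ind_intents:
        if predicted == intent.lower():
            return intent                     # detected as this IND intent
    return "TASK_FAIL"                        # neither an IND label nor "unknown"
```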
4.3. FSD-LLM Results
Due to the length limitation of ChatGPT's conversations, we reduce the number of IND intents and randomly select $N = 5, 10, 20, 30, 40$ intents as IND intents, with the number of OOD intents fixed at 20. Under each setting, we test four groups of experiments with $K = 0, 1, 3, 5$ ($K$ is the number of samples provided for each intent). We show the detailed FSD-LLM results in Table 4 and the changing trend in Figure 4. We discover that:
(1) FSD-LLM demonstrates strong competitiveness compared to the baseline in settings with a limited number of IND intents.
When $N=5$ and $K=5$, ChatGPT outperforms UniNL by 0.76 and 3.16 on ALL-ACC and ALL-F1, respectively. When $N=10$ and $N=20$, ChatGPT is superior to UniNL in IND classification but inferior in OOD detection. When $N=30$ and $N=40$, UniNL widens the gap with ChatGPT.
(2) The more intents there are, the more demonstrations are needed for IND intent recognition.
From Figure 4, we find that when $N = 5, 10$, FSD-LLM achieves better F1-OOD and F1-IND performance at $K=1$; even ZSD-LLM achieves the best ACC-IND. However, as $N$ increases to 30 or 40, both ACC-IND and F1-IND show an upward trend as $K$ increases. This suggests that the more intents there are, the more prior knowledge about the intents is needed to help distinguish between them.
(3) Too many demonstrations may introduce noise into OOD detection.
OOD-Recall shows an overall trend of initially increasing and then decreasing. This demonstrates the model's negative transfer from IND to OOD data. We speculate that this could be due to significant feature distribution differences
Table 4: Performance of ChatGPT under different few-shot settings for five different numbers of IND intents.
between the IND and OOD data, making it challenging for the model to learn features from the IND samples that are useful for the OOD data.
5. Qualitative Analysis
5.1. Effect of IND intent number
In Section 4.3, we observed the varying performance of ChatGPT under different numbers of IND intents. In this section, we provide a detailed analysis of the changes in the effectiveness of the ZSD-LLM method as $N$ increases. Figure 5 shows the trend of these changes. The results reveal that as the number of IND intents increases:
(1) ChatGPT is sensitive to the number of intents compared with UniNL. As shown in Figures 5a and 5b, ChatGPT has the best OOD detection performance when $N=5$, but as the number of intents increases, both metrics consistently decrease. When $N$ reaches 30 and 40, ACC and F1 decrease by 21.33 and 8.8, respectively. In contrast, UniNL consistently demonstrates robust results across all intent numbers, still achieving an ACC of 83.69 and an F1 of 87.77 when $N=40$.
(2) The increase in intents leads to more severe confusion between labels. Figure 5c shows a continuous decrease in IND-ACC, indicating that more IND samples are misclassified. We find that, compared to IND samples being misclassified as OOD, the proportion misclassified as incorrect IND intents is increasing. This may be because ChatGPT's understanding of intent labels does not align with the human-defined labels, leading to confusion between different label meanings. As the number of intents increases, this confusion intensifies.

Figure 4: The effect of the number of demonstrations for different IND intent numbers. We show the changes of four metrics under different demonstration quantities. Due to the limitation of ChatGPT's input length, we conduct three sets of experiments with 1-shot, 3-shot, and 5-shot settings.

Figure 5: Changes in the ALL-ACC, ALL-F1, IND-ACC, and OOD-Recall of ChatGPT and UniNL as the number of IND intents increases on Banking-50%.
(3) The increase in intents causes a sharp drop in OOD-Recall. Figure 5d shows that with the increase in the number of intents, OOD samples are more likely to be misclassified as IND. This is because the increase in the number of IND intents introduces more interference into ChatGPT's OOD detection.
We believe that the advantage of ChatGPT over discriminative models lies in its OOD detection with fewer intents, where it can accurately make judgments based on its internal knowledge. However, as the number of labels increases, the confusion in label meanings becomes more prominent, resulting in a decline in both IND intent classification and OOD sample detection.
Figure 6: The variances of ChatGPT and UniNL on various metrics across five different data splits.
5.2. The robustness of ChatGPT OOD detection
The robustness of OOD detection is reflected in the ability to remain stable across different intent partitions. To verify this, we randomly select five different IND intent set splits (using five different seeds when selecting IND intents) for the experiment. Figure 6 shows the variance of the results from the five sets of experiments. We observe that UniNL and ChatGPT perform similarly overall. However, on the IND metrics IND-ACC and IND-F1, UniNL shows greater fluctuations, while on the OOD metrics, ChatGPT is inferior to UniNL. This highlights the differences between the two models in performing OOD intent detection: ChatGPT excels in IND intent recognition, but its stability in OOD detection is relatively poor.
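A minimal sketch of this robustness check, assuming the six metrics from the evaluation sketch above have already been computed once per seeded split; the population variance is used here purely for illustration.

```python
import statistics
from typing import Dict, List

def metric_variances(per_seed_results: List[Dict[str, float]]) -> Dict[str, float]:
    """per_seed_results: one metric dict (e.g. the six metrics above) per IND/OOD split."""
    return {metric: statistics.pvariance([run[metric] for run in per_seed_results])
            for metric in per_seed_results[0]}
```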
5.3. Comparison of Different LLMs
We apply ZSD-LLM to other mainstream LLMs and compare them with ChatGPT.
Text-davinci-002 and text-davinci-003$^{3}$ belong to InstructGPT, and text-davinci-003 is an improved version of text-davinci-002. Compared with GPT-3, the biggest difference of InstructGPT is that it is fine-tuned on human instructions.

Table 5: OOD detection performance of six different LLMs.
Claude is an artificial intelligence chatbot developed by Anthropic$^{4}$.
Llama2-70B-Chat$^{5}$ is developed and open-sourced by Meta AI. Llama-2-70b-Chat is the native open-source version with high-precision results.
GPT-4$^{6}$ is the latest and most advanced multimodal large model from OpenAI. GPT-4 can generate more factual and accurate statements than GPT-3.5 and other language models.
Results are shown in Table 5. GPT-4 leads in all six metrics compared to the other models. Surprisingly, the GPT-3 series performs better in IND intent recognition: text-davinci-003 even surpasses ChatGPT, and text-davinci-002 shows performance similar to ChatGPT on the IND metrics. However, ChatGPT exhibits significantly better results than the GPT-3 series on the OOD metrics. The differences in performance may be attributed to ChatGPT's inclusion of SFT (supervised fine-tuning) during the optimization phase, which gives it an advantage in understanding human instructions. In contrast, the GPT-3 series is slightly inferior in understanding the task, making it more inclined towards plain intent detection. Llama2-70B-Chat and Claude also exhibit a similar phenomenon to ChatGPT, but overall, ChatGPT outperforms Llama2-70B-Chat and Claude.
5.4. Effect of different prompts
Prompt engineering is a crucial strategy for LLMs. To verify the impact of different prompts, we devise four additional variations:
prompt.detector: Modify the role positioning of the LLM from intent detector to OOD detector.
prompt.discovery: Adopt a new task description for intent discovery, directly making the LLM return the labels of OOD intents.
Table 6: The performance of ChatGPT on various prompts.
prompt.order: Change the order of each part, using <Utterance for test> <Task description> <Utterance for test>.
prompt.reason: Output the reason before outputting the results.
As the version of ChatGPT changes over time, we choose the open-source Llama2-70B-Chat for the prompt experiment. Results are shown in Table 6. We find that in both the detector and discovery modes, the LLM tends to perform IND classification rather than OOD detection. The prompt.order variant leads to a decline in performance, which may be because the overly long instructions cause the LLM to forget the information at the beginning of the prompt. The reason mode exacerbates the LLM's use of incorrect domain knowledge. We leave the exploration of a better prompt for future research.
6. Challenge & Further Discussion
Based on the above experiments and analysis, we identify the challenging scenarios that LLMs encounter and offer guidance for future reference.
6.1. Conflict between Domain-Specific Knowledge and General Knowledge
The majority of errors are caused by the model's incorrect utilization of certain knowledge, which may be due to discrepancies between the LLM and humans in both intent and task understanding. We refer to this as a conflict between the generic knowledge within the model and the domain-specific knowledge required by the task. Specifically, it results in three types of errors: false association, focus deviation, and lack of domain knowledge. We display the relevant cases in Figure 7(a).
Our FSD-LLM can inject domain-specific knowledge, as shown in Section 4.3. However, the effectiveness of demonstrations seems to be influenced by various factors, such as the number of IND intents and the number of demonstrations. Besides, the knowledge conflict may be more pronounced in larger models than in smaller ones (Section 5.3), so how to inject domain-specific knowledge into LLMs and eliminate noise interference from useless general knowledge will be a future research direction.
Figure 7(a): General knowledge vs. domain-specific knowledge. Case (False Association) — Query: "Can you tell me if my top-up has been cancelled?" True: OOD. Predicted: top_up_by_bank_transfer_charge. Error: the model falsely associates the cancellation with the top-up charge.