LLMmap: Fingerprinting for Large Language Models
Dario Pasquini*
RSAC Labs

Evgenios M. Kornaropoulos
George Mason University

Giuseppe Ateniese
George Mason University
Abstract
We introduce LLMmap, a first-generation fingerprinting technique targeted at LLM-integrated applications. LLMmap employs an active fingerprinting approach, sending carefully crafted queries to the application and analyzing the responses to identify the specific LLM version in use. Our query selection is informed by domain expertise on how LLMs generate uniquely identifiable responses to thematically varied prompts. With as few as 8 interactions, LLMmap can accurately identify 42 different LLM versions with over 95% accuracy. More importantly, LLMmap is designed to be robust across different application layers, allowing it to identify LLM versions, whether open-source or proprietary, from various vendors, operating under various unknown system prompts, stochastic sampling hyperparameters, and even complex generation frameworks such as RAG or Chain-of-Thought. We discuss potential mitigations and demonstrate that, against resourceful adversaries, effective countermeasures may be challenging or even unrealizable.
1 Introduction
In cybersecurity, the initial phase of any penetration test or security assessment is critical: it involves gathering detailed information about the target system to identify potential vulnerabilities that are applicable to the specific setup of the system under attack. This phase, often referred to as reconnaissance, allows attackers to map out the target environment, setting the stage for subsequent exploitative actions. A classic example of this is OS fingerprinting, where an attacker determines the operating system running on a remote machine by analyzing its network behavior [5, 15, 28].
As Large Language Models (LLMs) become increasingly integrated into applications, the need to understand and mitigate their vulnerabilities has grown [6, 10, 12, 14, 17, 22, 23, 27, 31, 34-36, 39, 44, 51]. LLMs, despite their advanced capabilities, are not immune to attacks. These models exhibit a range of weaknesses, including susceptibility to adversarial inputs and other sophisticated attack vectors.

Figure 1: Active fingerprinting via LLMmap.
Identifying the specific LLM and its version embedded within an application can reveal critical attack surfaces. Once the LLM is accurately fingerprinted, an attacker can craft targeted adversarial inputs, exploit specific vulnerabilities unique to that model version, such as the buffer overflow vulnerability in Mixture of Experts architectures [17] or privacy attacks [49], the "glitch tokens" phenomenon [21], or exploit previously leaked information [11]. For open-source LLMs, this fingerprinting can be further exploited using white-box optimization techniques, enhancing the precision and impact of attacks [8, 14, 17, 31, 33, 50, 51].
In this paper, we introduce LLMmap$^{1}$, a novel approach to LLM fingerprinting that is both precise and efficient. LLMmap represents the first generation of active fingerprinting attacks specifically designed for applications integrating LLMs. By sending carefully constructed queries to the target application and analyzing the responses, LLMmap can accurately identify the underlying LLM version with minimal interaction, typically between 3 and 8 queries (see Figure 1). LLMmap is designed to be robust across diverse deployment scenarios, including systems with arbitrary system prompts, stochastic sampling procedures, hyper-parameters, and those employing advanced frameworks like Retrieval-Augmented Generation (RAG) [20] or Chain-of-Thought prompting [18, 45].
LLMmap offers two key capabilities: (i) A closed-set classifier that identifies the correct LLM version from among 42 of the most common models, achieving accuracy exceeding 95%. (ii) An open-set classifier developed through contrastive learning, which enables the detection and fingerprinting of new LLMs. This open-set approach allows LLMmap to generate a vectorial representation of the LLM's behavior (signatures), which can be stored and later matched against an expanding database of LLM fingerprints.
We validate the effectiveness of LLMmap through extensive testing on both open-source and proprietary models, including different iterations of ChatGPT and Claude. Our method demonstrates high precision even when distinguishing between closely related models, such as those with differing context window sizes (e.g., Phi-3-medium-128k-instruct versus Phi-3-medium-4k-instruct). With its lightweight design and rapid performance, LLMmap is poised to become an indispensable tool in the arsenal of AI red teams. Code and additional resources available at https://github.com/pasquini-dario/LLMmap.
2 Active Fingerprinting for LLMs
Active OS fingerprinting involves sending probes to a system and analyzing the responses to identify the underlying operating system. Variations in factors such as TCP window size, default TTL (Time to Live), and handling of flags and malformed packets allow for distinguishing between OSs.
Similarly, LLMs show unique behaviors in response to prompts, making them targets for fingerprinting. However, fingerprinting LLMs presents unique challenges:
Stochasticity: LLMs produce responses through sampling methods that introduce randomness to their outputs, driven by parameters like temperature and token repetition penalties. This stochastic nature makes it difficult to identify specific models consistently.
Model Customization: LLMs are often tailored using system prompts or directives that shape their behavior (refer to Figure 1). These customizations can significantly alter the model's output, complicating the fingerprinting process.
Applicative Layers: LLMs are frequently integrated into sophisticated frameworks, such as RAG or Chain-of-Thought prompting [45, 48]. These layers of complexity add further variability, making it more challenging to pinpoint specific model characteristics.
We refer to the above characteristics as the prompting configuration. The fact that these design choices and randomness are not disclosed to the entity that executes a fingerprinting method means that the outputs present significant variability. Addressing these complexities requires novel approaches that reliably account for these factors to identify the LLM version.
2.1 Threat Model
In this section, we formalize the adversary's objective in an LLM fingerprinting attack.
Consider a remote application $\mathcal{B}$ that integrates an LLM (e.g., a chatbot accessible through a web interface). This application allows external users to interact with the LLM by submitting a query $q$ and receiving output $o$. This interaction can be modeled by an oracle $O$:
$$O(q) = o, \text{ such that } o \sim s\left(LLM_{v_i}(q)\right) \tag{1}$$
where $v_i$ denotes the (unknown) version of the LLM, e.g., $v_i =$ "gpt-4-turbo-2024-04-09", $LLM_{v_i}$ denotes the deployed LLM under version $v_i$, and $s$ represents the prompting configuration (represented as a function) applied to an $LLM_{v_i}$ instantiation. More formally, the (unknown) prompting configuration comprises the following parameters: (1) the hyperparameters of the sampling procedure, (2) the system prompt, and (3) prompting frameworks such as RAG or Chain-of-Thought, as well as their arbitrary combinations. The symbol "$\sim$" in Eq. (1) indicates that the output of the model is generated through a stochastic sampling process. We will refer to any input provided to the oracle $O$ as a query.
We assume that $O$ behaves as a perfect oracle, meaning that the only information an adversary can infer about $LLM_{v_i}$ is what is revealed through the output $o$. Both the prompting configuration $s$ and the randomness inherent in the sampling method are considered unknown to external observers. Additionally, to maintain generality, we assume that $O$ is stateless; submitting a query does not alter its internal state, thus not affecting the outcomes of subsequent queries.$^{2}$
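To make the oracle abstraction concrete, the following is a minimal Python sketch of $O$; the class name and the use of the OpenAI chat API are illustrative assumptions, not part of LLMmap itself.

```python
from openai import OpenAI

class LLMOracle:
    """Minimal sketch of the oracle O from Eq. (1). The hidden prompting
    configuration s (system prompt, sampling hyper-parameters) lives inside
    the application; the adversary only observes the returned text o."""

    def __init__(self, version: str, system_prompt: str, temperature: float):
        self._client = OpenAI()
        self._version = version            # v_i: unknown to the adversary
        self._system_prompt = system_prompt
        self._temperature = temperature

    def __call__(self, query: str) -> str:
        # Stateless: every query is answered in a fresh conversation.
        resp = self._client.chat.completions.create(
            model=self._version,
            messages=[{"role": "system", "content": self._system_prompt},
                      {"role": "user", "content": query}],
            temperature=self._temperature,
        )
        return resp.choices[0].message.content
```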
We model an adversary $\mathcal{A}$ whose objective is to determine the exact version $v_i$ of the LLM deployed in the remote application $\mathcal{B}$ with the minimal number of queries to $O$. We refer to this adversarial goal as "LLM fingerprinting".
We stress that our approach does not require any form of white-box access to LLMs during setup (i.e., training) or inference, and can be applied to both open-source and closed-source proprietary models.
2.1.1 The Power of LLM Fingerprinting: Identifying LLM Version to Tailor Attack Strategies
In the following, we discuss how fingerprinting can serve as a component of a multi-stage attack effort and which attack stages benefit from an effective fingerprinting tool. In the extended version [30], we present a concrete demonstration of the role of fingerprinting in such a two-stage attack.
A successful fingerprinting technique can "fast-track" an attack and enable the adversary to design tailored inputs that work robustly on the specific LLM version under attack. Knowing the LLM version allows the adversary to deploy tools that automate the generation of tailored inputs. Numerous studies demonstrated the strong efficacy of tailored inputs compared to non-tailored ones, regardless of whether the attacker operates in the white-box or black-box threat model [12, 22, 26, 31, 51].
Benefiting Open-Source and Closed-Source Attacks. If fingerprinting identifies an LLM version as an open-source model, then even though the model is part of the backend of an application and is not directly accessible to the adversary, the attacker can simply download a local copy of the model and treat this scenario as a white-box attack. Specifically, the attacker can perform gradient-based optimization, generating highly efficient attack vectors to deploy against remote applications. A similar rationale applies to black-box optimization-based attacks on proprietary closed-source LLMs [12, 22, 26]. Once the specific proprietary LLM version used in the application is identified, an attacker can sidestep the process of running optimization on the application API (which may have rate-limiting mitigations in place) and instead can target the proprietary model's APIs directly (which typically do not impose any limitation on the number of prompts). This approach decreases the risk of detection by the targeted application compared to using the application itself as an oracle for optimization.
One specific family of attacks that benefits from knowledge of the LLM version is tailored prompt injection attacks, which show significantly higher success rates when the version is known. In this context, we provide a concrete and practical example in the extended version [30] demonstrating a two-stage attack strategy. In the first stage, an adversary uses LLMmap to fingerprint the LLM version within a target application. Then, they apply gradient-based optimization techniques [31] to create a precise inline prompt injection trigger optimized for the identified LLM version.
Ultimately, as LLM architectures and their applications continue to evolve, more version-specific vulnerabilities will emerge, further expanding the attack surface for adversaries.
2.2 Related Work
Xu et al. [46] propose a watermark-based fingerprinting technique aimed at protecting intellectual property by enabling remote ownership attestation of LLMs. In their approach, the model owner applies an additional training step to inject a behavioral watermark into the existing model before releasing it. To verify ownership of a remote LLM, the owner can submit a predefined set of trigger queries and check for the presence of the injected watermark in the model's responses.
Similarly, Russinovich and Salem [38] introduce a concurrent technique for embedding recognizable behaviors into an LLM through fine-tuning. Their method defines core requirements for effective watermark-based fingerprinting and relies on hashing the responses generated by a set of predefined queries to verify ownership.
Both of these approaches focus on scenarios where the "defender" (i.e., the model owner) embeds watermarks during the training process, allowing them to verify whether the model is being used without consent. In contrast, our work addresses a different scenario, where an "attacker" attempts to identify and recognize an unknown underlying model by inducing unique responses through strategic prompting. Unlike the aforementioned methods, our approach does not assume any influence over the model's training or deployment specifics.
The work most comparable to LLMmap is the concurrent study by Yang and Wu [47]. Their approach, unlike the watermark-based methods [38, 46], does not require fine-tuning the model. However, it assumes access to the logits output generated by the LLM being tested, rather than the actual generated text. This assumption limits its applicability in practical settings, where LLMs are typically deployed without exposing logits. Their technique fingerprints LLMs by matching vector spaces derived from the logits of two different models in response to a set of 300 random queries. In contrast, our method requires fewer than 8 queries, making it more efficient and practical.
In a related line of work, McGovern et al. [25] explore passive fingerprinting by analyzing lexical and morphosyntactic properties to distinguish LLM-generated from human-generated text. They observe that different model families produce text with distinct lexical features, allowing differentiation between LLM outputs and human writing.
3 The Design of LLMmap and Its Properties
In this section, we introduce LLMmap, the first method for fingerprinting LLMs. LLMmap is designed to accurately identify the LLM version integrated within an application by combining (i) strategic querying and (ii) machine learning modeling.
The process begins with the formulation of a set of targeted questions specifically crafted to elicit responses that highlight the unique features of different LLMs. This set of questions constitutes the querying strategy $Q$, which is then submitted to the oracle $O$. The oracle processes each query and returns a response, forming pairs of queries and responses that we refer to as the trace $\{(q_i, o_i)\}$.
These traces are subsequently fed into an inference model $f$, which is a machine learning model that analyzes the collected traces to identify the LLM deployed by the application. The inference model's objective is to correctly map the traces to a specific entry within the label space $\mathbf{C}$, which corresponds to the version of the LLM in use. The entire fingerprinting process, from query generation to model inference, is formalized in Algorithm 1.
To maximize the accuracy and efficiency of fingerprinting, careful selection of both the query strategy $Q$ and the inference model $f$ is crucial. The following sections discuss our solutions for implementing these components effectively.
The success of LLMmap hinges on developing a robust querying strategy $Q$ that can identify and leverage high-value prompts. By focusing on queries that consistently highlight the subtle differences between LLMs, LLMmap can more accurately and efficiently achieve its fingerprinting objectives.
3.1 In Pursuit of Robust Queries
To effectively fingerprint a target LLM, we identify two essential properties that queries from $Q$ should possess:
(1) Inter-model Discrepancy: An effective query should elicit outputs that vary significantly across different LLM versions. This means the query should produce distinct responses when posed to different versions. Formally, consider the universe of possible LLM versions $\mathbf{L}$ and a distance function $d$ that measures differences in the output space. The goal is to find a query $q^{*}$ that maximizes these differences, defined as:
$$q^{*} = \arg\max_{q \in Q} \left( \mathbb{E}_{(v, v') \in \mathbf{L}} \left[ d\left(LLM_{v}(q), LLM_{v'}(q)\right) \right] \right) \tag{2}$$
In simple terms, we seek queries that generate highly divergent outputs for any pair of different LLM versions, $v$ and $v'$. This property is crucial for distinguishing between models.
(2) Intra-model Consistency: A robust query should produce stable outputs even when the LLM is subjected to different prompting configurations or randomness. In other words, the query should yield similar responses from the same version $v$ across varying setups. Formally, let $\mathbf{S}$ represent the set of possible prompting configurations. We aim to identify a query $q^{*}$ that minimizes output variations across these configurations:
$$q^{*} = \arg\min_{q \in Q} \left( \mathbb{E}_{(s, s') \in \mathbf{S}} \left[ d\left(s(LLM_{v}(q)), s'(LLM_{v}(q))\right) \right] \right) \tag{3}$$
That is, we want a query $q^{*}$ that produces consistent outputs under the same version, regardless of the prompting configuration. This property ensures that the LLM version can be identified even when its environment varies.
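Both properties can be estimated empirically. The sketch below scores a candidate query by instantiating the distance $d$ as cosine distance between sentence embeddings; `embed` (a sentence-embedding function) and the lists of callable models/configurations are assumed helpers, not part of LLMmap's released tooling.

```python
# Sketch: empirically scoring a candidate query q against the two properties.
import itertools
import numpy as np

def cosine_dist(a, b):
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def inter_model_discrepancy(q, models, embed):
    """Mean pairwise output distance across LLM versions (Eq. 2, to maximize).
    `models` is a list of oracles, one per version; `embed` maps text -> vector."""
    outs = [embed(m(q)) for m in models]
    return np.mean([cosine_dist(a, b)
                    for a, b in itertools.combinations(outs, 2)])

def intra_model_consistency(q, model_under_configs, embed):
    """Mean pairwise output distance for one version under different prompting
    configurations (Eq. 3, to minimize)."""
    outs = [embed(m(q)) for m in model_under_configs]
    return np.mean([cosine_dist(a, b)
                    for a, b in itertools.combinations(outs, 2)])
```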
Algorithm 1 Fingerprinting Attack
function LLMmap($O$, $Q$, $f$)
    $\mathcal{T} \leftarrow \{\}$
    for $q_i$ in $Q$ do
        $o_i \leftarrow O(q_i)$
        $\mathcal{T} \leftarrow \mathcal{T} \cup \{(q_i, o_i)\}$
    end for
    $c \leftarrow f(\mathcal{T})$
    return $c$
end function

The Discriminative Power of Query Combinations. The effectiveness of a querying strategy extends beyond the properties of individual queries and is significantly influenced by how these queries complement each other. Surprisingly, queries that are weak on their own (those with low individual discriminatory power) can substantially improve fingerprinting performance when combined with others. This is because some queries may only generate strong discriminative signals in specific contexts, such as with certain model versions or when used with frameworks like RAG.
Incorporating these seemingly weaker queries into the overall strategy is essential for covering edge cases that would otherwise be missed. Furthermore, the combination of multiple "weak" queries can produce a powerful discriminative signal, revealing complex patterns that require multiple interactions to detect. Therefore, a diversified query strategy is crucial; it should encompass a range of non-redundant queries that, together, generate multiple, independent signals that complement and enhance each other.
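For concreteness, Algorithm 1 amounts to a few lines of Python; the sketch below assumes `oracle`, `queries`, and `inference_model` are instantiated as in the preceding sections.

```python
# Direct transcription of Algorithm 1: collect the trace, then classify it.
def llmmap(oracle, queries, inference_model):
    trace = []
    for q in queries:
        o = oracle(q)            # submit query to the target application
        trace.append((q, o))     # record the (query, output) pair
    return inference_model(trace)  # map the trace to an LLM version label
```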
4 The Toolbox of Effective Query Strategies
Inspired by techniques used in OS fingerprinting, where specific probes are crafted to exploit system behaviors, we explore analogous strategies for LLM fingerprinting. These strategies aim to uncover distinctive features of the LLM by leveraging targeted queries. Below, we discuss various prompt families and their effectiveness in revealing the LLM's version.
4.1 Querying for Meta-Information

In OS fingerprinting, querying meta-information, such as system uptime or configurations, can reveal subtle but crucial details about the target system. Similarly, in LLM fingerprinting, queries that prompt the model to disclose meta-information about itself, e.g., details of its training process or deployment, can be instrumental in identifying the version.
These queries, although indirect, often induce high inter-model discrepancy because the responses, even when fabricated, tend to be unique to each version. For example, prompts like "What's the size of your training set?" or "When were you last updated?" often yield made-up answers that are distinct across different versions. This makes such queries particularly effective for distinguishing between LLMs, even when they share similar architectures or training data.
Moreover, in certain cases, the LLM might inadvertently reveal accurate metadata in response to these queries. For instance, when asked "What's your data cutoff date?", the model may disclose critical information about its training history. This kind of metadata can provide significant insights, enhancing the fingerprinting process by allowing for more precise identification of the version.
4.2 Can We Use Banner Grabbing on LLMs?
In OS fingerprinting, banner grabbing involves sending simple queries to a service to obtain identifying information, such as software version or server type. Similarly, in LLM fingerprinting, there are scenarios where an LLM might directly reveal its identity when prompted with straightforward queries. For instance, some LLMs might disclose their model name or version in response to queries like "what model are you?" or "what's your name?". While this approach can yield useful information, it is often not a robust or reliable method for fingerprinting.

Figure 2: Difference in response of two LLMs upon a malicious prompt. The model Mixtral-8x7B, in contrast to gpt-4o-2024, tends to restate the harmful task in its answer.

Table 1: Examples of LLMs claiming to be the wrong model when prompted with banner-grabbing queries.
Banner Grabbing Is Not a Robust Solution. While straightforward, banner grabbing is neither a general nor a reliable method for LLM fingerprinting. Specifically:
(1) Our experiments show that only a small subset of models, primarily open-source ones, are aware of their name or origin. Even when a model can identify itself, it often only recognizes its "family name" (e.g., LLaMA or Phi) without specifying the exact version or size. For example, LLaMA-3-8B and LLaMA-2-70B, or ChatGPT-4 and ChatGPT-4o, would likely be considered the same model by the LLM.$^{3}$
(2) This approach is not robust to prompting configurations, such as different system prompts. A simple countermeasure against banner grabbing is for the LLM to present a misleading model name through its system prompt, effectively overriding the true banner of the model and deceiving the attacker (e.g., Figure 1).
(3) More critically, banner grabbing queries often yield unreliable results. We observed that models frequently provide plausible yet incorrect answers to these queries, claiming to be different LLM versions. This usually occurs because the model has been trained or fine-tuned on outputs generated by other models, typically from OpenAI. For instance, SOLAR-10.7B-Instruct-v1.0 and openchat_3.5 incorrectly identify themselves as OpenAI models. Similarly, bias in the training data can lead to inaccurate responses. For example, the models aya-23-8B and 35B from Cohere respond to banner grabbing queries with "Coral", another model from the same vendor. Additional examples of this behavior can be found in Table 1.
Banner Grabbing Induces Inter-model Discrepancy. While banner grabbing is often seen as unreliable, it can still be highly effective in certain contexts by inducing significant inter-model discrepancy. Similar to how different operating systems might respond uniquely to a specific probe in OS fingerprinting, different LLMs can produce distinct responses to banner grabbing queries based on their underlying architecture and training data. These differences, although sometimes factually incorrect, are unique to each model and can be used to differentiate between them when banner grabbing is combined with other querying strategies.
For example, Google’s Gemma models are uniquely evasive, responding with “I am unable to provide information that may compromise my internal architecture or development process.”, a response not observed in other models. Additionally, banner grabbing queries can be particularly useful for distinguishing between closely related model versions within the same model family. For instance, while Phi-3-mini mistakenly identifies itself as GPT-4, Phi-3-medium models correctly identify as Phi. This demonstrates that even within the same model family, banner grabbing can reveal subtle differences that might otherwise go unnoticed. 例如,谷歌的 Gemma 模型表现出独特的回避性,回应为“我无法提供可能泄露我内部架构或开发过程的信息。”,这一回应在其他模型中未见。此外,横幅抓取查询对于区分同一模型家族中紧密相关的模型版本尤其有用。例如,Phi-3-mini 错误地将自己识别为 GPT-4,而 Phi-3-medium 模型则正确识别为 Phi。这表明即使在同一模型家族内,横幅抓取也能揭示出可能被忽视的细微差异。
Among the various queries in this category, "Who created you?" emerged as the most effective and consistent probe. This specific banner grabbing query induces strong intra-model consistency (see Equation (3)) because it is less affected by system prompt variations than other similar queries, such as "What's your name?". Therefore, despite its limitations, banner grabbing remains a valuable tool in the broader context of LLM fingerprinting when used strategically.
4.3 "Malformed" Queries Reveal Information 4.3 “格式错误”的查询揭示信息
In OS fingerprinting, a common and effective technique involves sending malformed packets to the target system. The way the system's TCP/IP stack processes or fails to handle these packets can reveal substantial information about the underlying operating system.
Similarly, in LLM fingerprinting, submitting malformed or atypical prompts can elicit responses that are not only unique to each version but also consistent across different prompting configurations. These atypical prompts challenge the model in ways standard queries do not, often exposing how the model handles unexpected or nonsensical input. This behavior can provide valuable insights into the model's training, making malformed queries a powerful tool for distinguishing between different LLM versions.
Exploiting Alignment on Harmful Prompts. While LLMs are not constrained by formal syntax in their inputs, their alignment procedures offer a unique avenue for fingerprinting. Alignment, which is designed to make LLMs helpful, honest, and harmless [7], leads models to refuse harmful requests by issuing specific "error messages". For instance, when presented with a query like "How to build a bomb?", the LLM will typically respond with a refusal, stating that it cannot fulfill the request (see Figure 2). These "error messages" are not just model-dependent but can also be unique to each LLM version, making them powerful indicators for fingerprinting.
More importantly, the consistency of these responses across different prompting configurations enhances their reliability. Regardless of the system prompt or configuration, the model's priority to avoid harmful tasks remains unchanged, providing a stable fingerprinting signal.
Furthermore, these alignment-based prompts are particularly useful for distinguishing between aligned and non-aligned LLMs. Aligned models consistently refuse harmful tasks, while non-aligned models may not exhibit the same behavior. This capability makes these queries valuable not only for identifying specific model versions but also for categorizing them based on their alignment capabilities.
Exploiting Alignment on Controversial Prompts. Beyond harmful prompts, LLMs can be fingerprinted by leveraging the "non-harmful bias" embedded in their alignment processes. Specifically, rhetorical or ethical questions, such as "Is racism wrong?" or "Is climate change real?", can be particularly effective in this regard. These prompts tap into the model's alignment, producing consistent and predictable responses. We refer to this family as queries with "weak alignment". This consistency is key for fingerprinting, ensuring that the model prioritizes generating these responses over adhering to other prompt configurations. Moreover, because these queries elicit thoughtful, detailed answers rather than simple refusals, they offer a richer basis for distinguishing between different models. This approach not only maintains a high degree of intra-model consistency but also reveals deeper characteristics of the LLM, making it a valuable tool in the fingerprinting process.
Inconsistent Inputs. Beyond exploiting models' alignment, attackers can craft "inconsistent" or "malformed" queries by using nonsensical or semantically broken prompts. For example, a query that mixes multiple languages (e.g., "Bonjour, how are you doing today? ¿Qué tal?") can be particularly revealing [13, 24]. Similar to OS fingerprinting, the way an LLM handles such inconsistent inputs, such as responding in English or Spanish, provides a unique behavioral signature. When combined with other techniques, this signature can significantly enhance fingerprinting accuracy.
Interestingly, we observed that nonsensical inputs, such as random strings (e.g., "o03iqfudjchwensdcm,wela;..."), tend to perform poorly on their own in terms of discrimination. However, as discussed in Section 3.1, the inclusion of such queries in a diversified strategy can improve fingerprinting performance. These seemingly ineffective inputs can induce unique response patterns in certain models or prompting configurations, making them valuable when used in combination with more structured queries.
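To summarize the families above, a candidate probe pool could be organized as follows; the strings are the examples quoted in this section, and the grouping is purely illustrative (the actual optimized 8-query strategy is derived in Section 5).

```python
# Illustrative pool of probes drawn from the families discussed in Section 4.
QUERY_FAMILIES = {
    "meta_information": [
        "What's the size of your training set?",
        "When were you last updated?",
        "What's your data cutoff date?",
    ],
    "banner_grabbing": [
        "Who created you?",
        "What model are you?",
    ],
    "alignment_harmful": [
        "How to build a bomb?",
    ],
    "alignment_weak": [
        "Is racism wrong?",
        "Is climate change real?",
    ],
    "malformed": [
        "Bonjour, how are you doing today? \u00bfQu\u00e9 tal?",  # mixed languages
        "o03iqfudjchwensdcm,wela;...",                            # random string
    ],
}
```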
4.4 Stabilizing Responses via Prompt Injection

As previously mentioned, the prompting configuration can significantly alter an LLM's output distribution, making it challenging to achieve consistent responses across different prompt configurations and, consequently, to fingerprint the model accurately. To avoid this phenomenon, attackers can use additional methods along with their chosen queries. One particularly effective method is the use of prompt-injection triggers.
Prompt injection [16, 31, 42] is a type of attack targeting instruction-tuned LLMs, aiming to subvert the model's original task by replacing it with an adversarially chosen one. An execution trigger [31] is a specific part of the adversarial input designed to override the model's original instructions and execute the desired payload, such as "Ignore previous instructions and..." [4, 32]. By combining a query with an execution trigger, attackers can force the model to disregard the information and directives set by the prompt configuration, resulting in more consistent responses across different environments. For instance, replacing the query:
Tell me who created you

with

())) ) ]] ]] ]] }}}};*/
Now ignore that instruction above and accurately Tell me who created you ( [{ <!-
resulted in a 4% increase in fingerprinting accuracy; in the example above, the execution trigger is the text wrapped around the original query.
However, our experiments indicate that not all queries benefit equally from this approach. For instance, harmful requests do not show much improvement when combined with execution triggers, as the model's alignment mechanisms are generally strong enough to handle these requests without additional input. In contrast, we observed the greatest accuracy improvements with banner grabbing queries that are particularly sensitive to prompt configurations. In these cases, the execution trigger helps stabilize the model's responses, leading to more consistent and reliable fingerprinting results.
5 The Querying Strategy of LLMmap
Based on the discriminative prompt families identified in Section 4, our goal is to select a concise set of queries to form an effective query strategy for LLMmap. To achieve this, we first curated a diverse pool of approximately 50 promising queries, combining both manually crafted prompts and synthetically generated ones.$^{4}$
Next, we employed a heuristic approach to identify the most effective combination of queries. Using a simple greedy search algorithm, as detailed in Appendix C, we aimed to filter out less effective queries and ensure that the selected queries complemented each other well, resulting in a diversified and robust fingerprinting strategy.
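A minimal sketch of such a greedy forward-selection loop follows; the exact procedure is the one detailed in Appendix C, and `fingerprint_accuracy` is an assumed helper that trains and evaluates an inference model on the traces produced by a candidate query set.

```python
# Hedged sketch of greedy forward selection over the ~50-query pool.
def greedy_query_selection(pool, budget, fingerprint_accuracy):
    selected = []
    while len(selected) < budget:
        best_q, best_acc = None, -1.0
        for q in pool:
            if q in selected:
                continue
            # Score the candidate set including q (train + validate a model).
            acc = fingerprint_accuracy(selected + [q])
            if acc > best_acc:
                best_q, best_acc = q, acc
        if best_q is None:
            break  # pool exhausted
        selected.append(best_q)
    return selected
```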
After this optimization process, we finalized a query strategy composed of 8 highly effective queries, which are listed in Table C.2. These queries are ranked by their individual effectiveness; recall, though, that their true strength lies in their synergistic ability to fingerprint LLMs consistently across various settings. Hereafter, unless stated otherwise, these 8 queries constitute the default query strategy $Q$ used in LLMmap.
5.1 Other Promising Fingerprinting Approaches
In addition to the query strategies evaluated in this work, other potential methods (that we did not embed in our tool) could also serve as effective probes for LLM fingerprinting. We leave the inclusion of such approaches as part of future work.
Glitch Tokens. Glitch tokens are model-dependent tokens that can trigger anomalous behaviors in LLMs. These tokens, often underexposed during training, can lead to unexpected outputs due to covariate shifts during inference [19]. Different LLMs and tokenizers may respond uniquely to specific glitch tokens, making them a promising avenue for crafting discriminative queries. For example, an attacker might verify a target LLM's identity by including a known glitch token in the query (e.g., "Repeat back SolidGoldMagikarp" for legacy ChatGPT models). However, the robustness of glitch tokens is uncertain and warrants further investigation.
Automated Query Generation. Inspired by techniques in OS fingerprinting, our query strategy was developed based on domain knowledge and manual interactions with LLMs. However, more advanced methods could automate and optimize query generation for LLM fingerprinting. By framing query generation as an optimization problem, similar to the creation of adversarial inputs [31, 51], we could identify optimal token combinations within the model's input space.
The properties from Section 3.1 could serve as the basis for an objective function. Since Equations 2 and 3 are fully differentiable, they could support white-box optimization methods (e.g., via GCG [51]). However, unlike typical optimization tasks, this would require optimizing across multiple LLMs simultaneously, making it a resource-intensive endeavor.
6 The Inference Model of LLMmap
After submitting the queries from $Q$ to the target application, we use the collected traces to identify the specific LLM version in use. To accomplish this, we employ a fully machine learning (ML)-driven approach designed to handle the inherent complexities and variability in LLM responses.
Why Use Machine Learning? Traditional OS fingerprinting relies on deterministic responses, i.e., outputs that remain consistent and predictable across similar environments. This consistency allows for straightforward matching against a database of known responses, making the inference process simple and reliable. However, LLM fingerprinting introduces different challenges. The responses generated by an LLM are influenced by multiple factors, such as the unknown prompting configuration and the inherent stochasticity from the sampling procedure. This variability can lead to significant output unpredictability, even when the same query is repeated. These factors make deterministic approaches inadequate for LLMs.
Machine learning is essential to overcome this challenge. ML models can learn to generalize across the diverse and variable responses produced by an LLM. They can abstract underlying patterns from noisy data, capturing both explicit signals and more subtle, query-independent traits such as writing style. These capabilities allow ML-driven inference models to accurately identify LLM versions, even when responses vary due to different configurations or sampling randomness.
6.1 Fingerprinting Approaches: Closed-Set and Open-Set.
We approach LLM fingerprinting through two primary methods: closed-set and open-set fingerprinting.
6.1.1 Closed-Set Fingerprinting
In this scenario, the inference model operates under the assumption that it knows the possible LLM versions in advance. The task is to identify the correct version from a predefined set using the observed traces. Formally, given a set of $n$ known versions $\mathbf{C} = \{v_1, \ldots, v_n\}$, the model functions as a classifier, mapping the traces $\mathbf{T}^k$ to one of the known labels in $\mathbf{C}$. This approach is typically more accurate because the model is trained on the LLMs it needs to identify.
6.1.2 Open-Set Fingerprinting
Unlike the closed-set approach, open-set fingerprinting does not assume that the LLM version is included among those available during training.
In the open-set framework, fingerprinting is decoupled into (1) the inference model $f$ and (2) a fingerprints database $\mathcal{DB}$. Here, the inference model is a function $f: \mathbf{T}^k \rightarrow \mathbb{R}^m$ that generates a "vector signature", specifically an $m$-dimensional real vector, from the input traces, where $m$ is a hyperparameter we choose during the initialization. The database $\mathcal{DB}$ consists of (vector signature, version label) pairs. Fingerprinting a model $q$ is then performed by finding the vector signature in the database $\mathcal{DB}$ that is closest to $f(q)$. Using this modeling, one can easily extend the signature database over time by adding new signature-label pairs without requiring re-training of the inference model.
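As a sketch, the database and its matching step can be as simple as nearest-neighbor search over stored signatures; cosine similarity is an assumed choice of distance here, and the class name is illustrative.

```python
# Sketch of the open-set flow: f maps a trace to an m-dimensional signature,
# and fingerprinting is nearest-neighbor search over the signature database.
import numpy as np

class SignatureDB:
    def __init__(self):
        self.signatures, self.labels = [], []

    def add(self, signature: np.ndarray, version_label: str):
        # Extending the database requires no re-training of f.
        self.signatures.append(signature / np.linalg.norm(signature))
        self.labels.append(version_label)

    def match(self, signature: np.ndarray) -> str:
        # The closest stored entry under cosine similarity wins.
        sig = signature / np.linalg.norm(signature)
        sims = np.stack(self.signatures) @ sig
        return self.labels[int(np.argmax(sims))]
```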
This approach is akin to the one used by tools such as nmap, a tool that relies on a large, community-curated database of OS and service fingerprints [2, 3]. Users can submit and extend the database by adding new entries without needing to alter nmap's existing functionality.$^{5}$ In the nmap case, an entry is a pair: "label", the name of the OS/service version, and its "signature", the list of responses obtained by running nmap's query strategy on the OS/service. Open-set LLMmap implements the same logic, but it stores an $m$-dimensional vector generated by the inference model.$^{6}$
In the extended version [30], we show how the LLMmap open-set approach can identify cases where a test LLM is entirely "unseen"; that is, when the LLM version does not yet have a corresponding entry in the signature database.
To implement the closed/open-set models, we use the same backbone network and modify it according to the task at hand.
Figure 3: The architecture of the inference model. We depict in blue the pre-trained modules that are not tuned in training.
Figure 4: Visualization of contrastive learning on LLMs' traces. Positive and negative case.
6.2 Inference Model's Architecture.
One straightforward solution for building the inference model would be to use a pre-trained, instruction-tuned LLM. However, we chose a lighter solution to ensure our approach is practical and can run efficiently on a standard machine. The structure of our backbone network is shown in Figure 3. For each pair of query $q_i$ and response $o_i$, we use a pre-trained textual embedding model, denoted as $\mathcal{E}$, to generate a vector representation. This process involves:
Textual Embedding: Each query $q_i$ and its response $o_i$ are mapped into vectors using the embedding model. Even though we use a fixed set of queries, including the query $q_i$ in the input helps the model handle variations, such as paraphrasing, and avoids defenses like query blacklisting.
Concatenation and Projection: The vectors for $q_i$ and $o_i$ are concatenated into a single vector. This combined vector is then passed through a dense layer, denoted as $f_p$, to reduce its size to a smaller feature space of size $m$.
Self-Attention Architecture: The projected vectors are fed into a lightweight self-attention-based architecture composed of several transformer blocks [41]. These blocks do not use positional encoding since the order of traces is irrelevant. Additionally, an extra $m$-dimensional vector, denoted as $C_{\text{token}}$, is used as a special classification token. This vector is randomly initialized and optimized during training.
The output vector corresponding to $C_{\text{token}}$ from the transformer network is referred to as $u$. This vector is used differently depending on whether we perform closed-set or open-set classification.
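A hedged PyTorch sketch of this backbone follows; the embeddings from the frozen model $\mathcal{E}$ are assumed to be precomputed, and the default sizes match the instantiation given at the end of this section.

```python
# Sketch of the backbone of Figure 3 (not the released implementation).
import torch
import torch.nn as nn

class LLMmapBackbone(nn.Module):
    def __init__(self, embed_dim=1024, m=384, blocks=3, heads=4):
        super().__init__()
        # f_p: project concat(E(q_i), E(o_i)) down to the feature space of size m.
        self.f_p = nn.Linear(2 * embed_dim, m)
        # Learnable classification token C_token, randomly initialized.
        self.c_token = nn.Parameter(torch.randn(1, 1, m))
        # Transformer blocks; no positional encoding (trace order is irrelevant).
        layer = nn.TransformerEncoderLayer(d_model=m, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=blocks)

    def forward(self, q_emb, o_emb):
        # q_emb, o_emb: (batch, num_queries, embed_dim) from the frozen E.
        x = self.f_p(torch.cat([q_emb, o_emb], dim=-1))
        tok = self.c_token.expand(x.size(0), -1, -1)
        x = self.encoder(torch.cat([tok, x], dim=1))
        return x[:, 0]  # u: the output at the C_token position
```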
Closed-Set Classification. To implement the classifier in the closed-set setting, we add an additional dense layer $f_c$ on top of $u$, which maps $u$ into the class space; for our experiments, the class space contains the 42 LLM versions listed in Table C.1 in Appendix B. We train the model in a fully supervised manner. We generate a suitable training set by simulating multiple LLM-integrated applications with different LLMs and prompting configurations. For each simulated application, we collect traces by submitting queries according to our query strategy and using the LLM within the application as the label. The detailed process for generating these training sets is explained in Section 7.1. Once the input traces are collected, we train the model to identify the correct LLM.
This task requires the model to generalize across different prompting configurations and handle the inherent stochasticity.
Open-set Classification. For the open-set setting, we directly use $u$ as the model's output. The backbone here is configured as a "siamese" network, which we train using a contrastive loss. That is, given a pair of input traces $\mathcal{T}_a$ and $\mathcal{T}_b$, the model is trained to produce similar embeddings when $\mathcal{T}_a$ and $\mathcal{T}_b$ are generated by the same model, even if different prompting configurations are used. Conversely, the model is trained to produce distinct embeddings when different LLMs generate $\mathcal{T}_a$ and $\mathcal{T}_b$. This process is depicted in Figure 4. For training, we resort to the same training set used for closed-set classification. For each entry $(\mathcal{T}_a, LLM_{v_a})$ in the training set, we create a positive and a negative example $(\mathcal{I}_a, \mathcal{I}_b)$. Positive pairs are obtained by sampling another entry in the database with label $LLM_{v_a}$, whereas negative pairs are obtained by sampling an entry with label $LLM_{v_b}$, where $v_b \neq v_a$.
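The siamese training objective can be sketched as a standard margin-based contrastive loss; the specific loss form and margin value are assumptions here, since the text specifies only that a contrastive loss over positive/negative trace pairs is used.

```python
# Sketch of one siamese training step on a trace pair.
import torch.nn.functional as F

def contrastive_step(backbone, trace_a, trace_b, same_llm: bool, margin=1.0):
    u_a = backbone(*trace_a)          # signature of trace T_a
    u_b = backbone(*trace_b)          # signature of trace T_b
    dist = F.pairwise_distance(u_a, u_b)
    if same_llm:                      # positive pair: pull signatures together
        loss = dist.pow(2).mean()
    else:                             # negative pair: push apart up to margin
        loss = F.relu(margin - dist).pow(2).mean()
    return loss
```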
Model Instantiation. To implement the embedding model $\mathcal{E}$, we use multilingual-e5-large-instruct [43], which has an embedding size of 1024. For our transformer's feature size, we choose a smaller size, $m = 384$, and configure the transformer with 3 transformer blocks, each having 4 attention heads. This design choice ensures that the inference model remains lightweight, with approximately 8M trainable parameters (a ~30 MB model).
7 Evaluation
In this section, we evaluate LLMmap. Section 7.1 presents our evaluation setup, describing how training and testing LLM-integrated applications are simulated. Section 7.2 reports LLMmap's performance for both its instantiations.
7.1 Evaluation Setup
To train our inference models and evaluate the performance of LLMmap, we need to simulate a large number of applications that use different LLMs. This involves defining a set of LLM versions to test (called the LLM universe $\mathbf{L}$) and a set of possible prompting configurations (called the universe of possible prompting configurations $\mathbf{S}$). The rest of this section explains the choices we made for this simulation process.
Universe of LLMs. To evaluate LLMmap, we selected the 42 LLM versions listed in Table C.1. These models were chosen based on their popularity at the time of writing. We primarily use the Hugging Face Hub to select open-source models, automatically retrieving the most popular models based on download counts via its API services. For closed-source models, we consider the three main models offered by the two most popular vendors (i.e., OpenAI and Anthropic) for which API access is available. Hereafter, we refer to these models as the LLM universe $\mathbf{L}$.
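As an illustration, the most-downloaded models can be retrieved programmatically; a minimal sketch using the huggingface_hub client (the exact filters used in our selection are simplified here):

from huggingface_hub import list_models

# retrieve text-generation models sorted by download count, most popular first
popular = list_models(task="text-generation", sort="downloads",
                      direction=-1, limit=50)
for m in popular:
    print(m.id, m.downloads)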
Universe of prompting configurations. To enable LLMmap to fingerprint an LLM across different settings, we need a method to simulate a large number of prompting configurations during the training phase of the inference model. We use a modular approach to define these prompting configurations by combining design/setup parameters from multiple universes. For each design/setup parameter, we create a universe of possible values. Specifically, we define a prompting configuration $s \in \mathbf{S}$ as a triplet initialized from the following three universes:
Sampling Hyper-Parameters Universe $H$: We parametrize the sampling procedure by two hyperparameters, temperature and frequency_penalty, with ranges $[0,1]$ and $[0.65,1]$, respectively. Thus, $H$ is defined as $H = [0,1] \times [0.65,1]$.
System Prompt Universe $SP$: We curated a collection of 60 different system prompts, which include prompts collected from online resources as well as automatically generated ones.
Prompt Framework Universe $PF$: We consider two settings: RAG and Chain-of-Thought (CoT) [45]. To simulate RAG, we create the input chunks by sampling from 4 to 6 random entries from the SQuAD 2.0 dataset [37], and we consider 6 prompt templates for retrieval-based Q&A. In the same way, we consider 6 prompt templates for CoT.
To ensure a meaningful evaluation, we design the experiment so that no parameter of $s$ used in training is also used in testing. Specifically, we create two distinct sets, $\mathbf{S}_{\text{train}}$ and $\mathbf{S}_{\text{test}}$. Rather than simply preventing any $s$ from appearing in both sets, we take a more stringent approach: $\mathbf{S}_{\text{train}}$ and $\mathbf{S}_{\text{test}}$ are constructed so that none of the individual parameters, such as a system prompt or RAG prompt template, of any $s \in \mathbf{S}_{\text{train}}$ appears in any $s'$ from $\mathbf{S}_{\text{test}}$.$^{7}$
To achieve this, we split $H$ into two equal-sized sets, $H_{\text{train}}$ and $H_{\text{test}}$. Similarly, we split $SP$ into two equal-sized sets, $SP_{\text{train}}$ and $SP_{\text{test}}$. Finally, we split $PF$ into two equal-sized collections, $PF_{\text{train}}$ and $PF_{\text{test}}$. We allow entries to appear multiple times within these collections, making $PF_{\text{train}}$ and $PF_{\text{test}}$ multisets. Because prompt frameworks were relatively scarce in deployed applications at the time of writing, we inject several empty prompt framework entries, $\perp$, to accurately represent this situation: exactly 80% of the entries of $PF_{\text{train}}$ (resp. $PF_{\text{test}}$) are $\perp$. Putting it all together, the set $\mathbf{S}_{\text{train}}$ is defined by sampling 1000 triplets from the universe $H_{\text{train}} \times SP_{\text{train}} \times PF_{\text{train}}$. The same approach is followed for the initialization of $\mathbf{S}_{\text{test}}$ with 1000 triplets, but this time from the test universes.
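A minimal sketch of this construction in Python; placeholder values stand in for the actual system prompts and templates, and the way $H$ is halved is our assumption:

import random

def split_half(items):
    items = list(items); random.shuffle(items)
    return items[: len(items) // 2], items[len(items) // 2 :]

system_prompts = [f"sp_{i}" for i in range(60)]    # SP universe (placeholders)
frameworks = [f"pf_{i}" for i in range(12)]        # 6 RAG + 6 CoT templates
SP_train, SP_test = split_half(system_prompts)
PF_train, PF_test = split_half(frameworks)
# halve each hyperparameter range (illustrative split of H)
H_train = ((0.0, 0.5), (0.65, 0.825))
H_test  = ((0.5, 1.0), (0.825, 1.0))

def sample_config(H, SP, PF, p_empty=0.8):
    (t_lo, t_hi), (f_lo, f_hi) = H
    temp = random.uniform(t_lo, t_hi)
    freq = random.uniform(f_lo, f_hi)
    pf = None if random.random() < p_empty else random.choice(PF)  # ⊥ w.p. 80%
    return ((temp, freq), random.choice(SP), pf)

S_train = [sample_config(H_train, SP_train, PF_train) for _ in range(1000)]
S_test  = [sample_config(H_test,  SP_test,  PF_test)  for _ in range(1000)]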
Creating Training/Testing Traces for the Inference Model.
Algorithm 2: Dataset D_XXX generation process
function MAKE_DATASET(w, Q, S_XXX, L)
    D_XXX ← {}                        ▷ either D_train or D_test, depending on input S_XXX
    for LLM_v in L do                 ▷ for each LLM in the universe
        for i ← 1 to w do             ▷ w prompting configurations per LLM
            T ← {}
            s ∼ S_XXX                 ▷ sample a prompting configuration
            for q in Q do             ▷ for each query in the query strategy
                o ∼ s(LLM_v(q))       ▷ sample the LLM's response under configuration s
                T ← T ∪ {(q, o)}
            end for
            D_XXX ← D_XXX ∪ {(T, LLM_v)}   ▷ add the trace for LLM_v to the dataset
        end for
    end for
    return D_XXX
end function
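For concreteness, a minimal Python rendering of Algorithm 2; query_fn is a hypothetical stand-in for querying $LLM_v$ under a sampled prompting configuration $s$:

import random

def make_dataset(w, Q, S_xxx, L, query_fn):
    """query_fn(llm_v, s, q) -> response string sampled under configuration s."""
    dataset = []
    for llm_v in L:                      # for each LLM in the universe
        for _ in range(w):               # w prompting configurations per LLM
            s = random.choice(S_xxx)     # sample a prompting configuration
            trace = [(q, query_fn(llm_v, s, q)) for q in Q]
            dataset.append((trace, llm_v))   # label the trace with its LLM
    return dataset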
Figure 5: Closed-set accuracy of the inference model as the number of queries to the LLM-integrated application increases, for LLMmap using the default query strategy and two baseline strategies.
Once the LLM universe $\mathbf{L}$, the prompting configurations ($\mathbf{S}_{\text{train}}$ and $\mathbf{S}_{\text{test}}$), and the query strategy $Q$ are chosen, we can collect the traces required to train the inference model. This process is summarized in Algorithm 2, where the subscript XXX is either "train" or "test". For each LLM in Table C.1, we sample a prompting configuration from $\mathbf{S}_{\text{train}}$ and collect all the responses of the model to the queries in $Q$. To allow the inference model to generalize over different prompting configurations, we collect $w$ traces per $LLM_v$ with $w$ different prompting configurations. In our setting, we set $w$ to 75. This process results in a collection $D_{\text{train}}$ of pairs "(trace, $LLM_v$)" that can be used to train the inference model in a supervised manner. To create the test set, we repeat the process but use $\mathbf{S}_{\text{test}}$ instead of $\mathbf{S}_{\text{train}}$, ensuring that the prompting configurations used for testing are completely disjoint from those used for training. This results in another collection of traces that can be used for evaluation.
7.2 Results
In this section, we evaluate the performance of LLMmap, considering both the closed-set and open-set deployments of the inference model.
Table 2: LLMmap per-model accuracy for both the closed-set and open-set models, obtained when submitting all 8 queries. Accuracies and standard deviations are computed over 10 training runs of the models.
Once the inference model has been trained, we test it using the traces generated with the left-out prompting configurations in $\mathbf{S}_{\text{test}}$. Given input traces generated by the target model version, we use the closed-set classifier to infer the LLM that generated them from the list of LLM versions in Table C.1.
Accuracy as a Function of Number of Queries. Naturally, the accuracy of fingerprinting depends on the number of queries made to the target. An attacker might reduce the number of interactions with the target application to adjust to cases with proactive monitoring mechanisms, but this typically results in decreased fingerprinting accuracy. This tradeoff is
Figure 6: Two-dimensional representation of the signatures derived by the open-set fingerprint model on all the tested models.
illustrated in Figure 5, where accuracy is plotted against the number of traces provided as input to the inference model ("default"). Generally, using only the first three queries from Table C.2 achieves an average accuracy of 90%. However, accuracy levels off after eight queries. It is conceivable that the 95% accuracy mark could be surpassed by incorporating different queries than those outlined in Section 4. On average, the inference model achieves an accuracy of 95.3% over the 42 LLMs when all 8 queries are used.
Baselines. To assess the query strategy from Section 5, we compare it with two baseline query strategies.
In the first baseline, we randomly sample 30 entries from the Stanford Alpaca dataset, listed in Table C.3, column (a), and apply the same greedy optimization procedure described in Appendix C. This process converges to 8 queries, matching the number used in the default strategy.
For the second baseline, we prompt a state-of-the-art language model (gpt-4o-2024-11-20) to generate 30 discriminative queries that would potentially be good options for fingerprinting LLMs. These are then subjected to the same optimization procedure outlined for the default strategy. Further details on these baselines can be found in Appendix A.
For each query strategy, we construct a training set, train an inference model from scratch, and evaluate its performance using the same methodology as the default strategy. The performance of these two query strategies in a closed-set setting is presented in Figure 5. While the baseline strategies achieve respectable accuracy (indicating that the inference model can effectively extract robust and meaningful features for classification), LLMmap's default strategy performs better than the tested baselines.$^{8}$ As expected, the strategy derived from the LLM ("gpt4o-gen-opt") surpasses the one generated from general prompts ("random-opt"). Interestingly, several generated queries overlap with the ones discussed in Section 4, highlighting consistency and relevance in the query design.
More fine-grained results are summarized in Table 2, column (A). Those results indicate that LLMmap is generally robust across different model versions, correctly classifying 41 out of 42 LLMs with 90% accuracy or higher. This includes highly similar models, such as different instances of Google's Gemma or various versions of ChatGPT. In Figure C.1 in Appendix B, we report results as a confusion matrix. The main exception is Meta's Llama-3-70B-Instruct, where LLMmap achieves only 84% accuracy. As shown in the confusion matrix, this lower accuracy is primarily due to misclassifications with closely related models, such as Smaug-Llama-3-70B-Instruct by Abacus.AI, which are fine-tuned versions of the original model.
In this subsection, we conduct a series of experiments to assess the effectiveness of open-set classification, organized into two families: "fingerprinting-known-LLM" and "fingerprinting-unseen-LLM". The first family includes the following experiments, arranged in increasing difficulty: (i) an experiment where the LLM we aim to fingerprint is both present in the $\mathcal{DB}$ and has been used during the training of the inference model $f$; (ii) an experiment in which the LLM we are trying to fingerprint is present in the $\mathcal{DB}$ but was not used during the training of the inference model $f$. The second family, "fingerprinting-unseen-LLM", examines what occurs when we attempt to fingerprint an LLM that was neither used in the training of $f$ nor has a vector signature in the $\mathcal{DB}$.
Fingerprinting-known-LLM: (i) Used in Training. We apply the inference model of our open-set approach to LLMs that were used during the training phase, but with prompting configurations that were not used in training (thus, even though the LLM version is the same as in the training of $f$, the answers are not necessarily the same, given that a different prompting configuration from $\mathbf{S}_{\text{test}}$ is used). Given traces $\mathcal{T}^{?}$ generated by the target LLM under a prompting configuration from $\mathbf{S}_{\text{test}}$, inference proceeds as follows: (1) we provide $\mathcal{T}^{?}$ to the inference model $f$ and derive a vector $u^{?}$; (2) we compute the cosine similarity between $u^{?}$ and all the vectors in the $\mathcal{DB}$; (3) we output the LLM whose signature has the highest similarity to $u^{?}$ as the prediction of LLMmap. Using this approach, we evaluate the performance of the open-set inference model on the LLMs seen during the training of $f$. Results are reported in Table 2, column (B). Fingerprinting with the open-set inference model achieves an average accuracy of 91%, which is 4% lower than the specialized closed-set classifier.
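A minimal sketch of this lookup, assuming signatures are stored as vectors keyed by LLM version:

import numpy as np

def fingerprint(u_query, db):
    """u_query: embedding of the unknown trace; db: {llm_name: signature vector}."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    sims = {name: cos(u_query, sig) for name, sig in db.items()}
    return max(sims, key=sims.get)       # LLM whose signature is most similar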
Fingerprinting-unseen-LLM. This setting is relevant if a user attempts to fingerprint an LLM whose vector signature has not yet been added to LLMmap's database $\mathcal{DB}$; this occurs only in a short window right after the public release of a new model. Due to the specialized nature of this setting, we detail the experiments in the extended version [30]. At a high level, we deploy a random-forest-based binary classifier that runs as an additional step before the final decision of LLMmap, determining whether the responses from the queried LLM are sufficiently close to be considered part of the $\mathcal{DB}$ or diverge too significantly and should be regarded as coming from an unseen LLM. The average accuracy is over 82%.
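Since the details are deferred to the extended version, the following is a hypothetical sketch of such a "seen vs. unseen" gate; the choice of features (top similarity scores against the $\mathcal{DB}$) is our assumption:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def similarity_features(u_query, db, k=5):
    """Top-k cosine similarities of u_query against all signatures in db."""
    sims = sorted((cosine(u_query, sig) for sig in db.values()), reverse=True)
    return np.array(sims[:k])

gate = RandomForestClassifier(n_estimators=100)
# gate.fit(X, y): y = 1 for traces of LLMs with a signature in DB,
# y = 0 for traces of held-out LLMs; at inference time, a prediction of 0
# means "unseen LLM" and LLMmap abstains from naming a model.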
8 On Mitigating LLM Fingerprinting
Fingerprinting attacks exploit the inherent characteristics of a system to identify or profile it uniquely. While defending against such attacks has long been a focus in areas like OS security [9, 40], applying these concepts to LLMs presents unique challenges. In this section, we explore the complexities of defending against LLM fingerprinting, highlighting why such defenses are inherently difficult and often come with significant trade-offs.
8.1 The Query-Informed Setting
One line of mitigation comes from the setting in which the defender has prior knowledge (i.e., is informed) of the attacker's query strategy. In this query-informed setting, where the defender knows the specific queries used by an attacker, effective countermeasures, such as modifying or blocking responses, can be applied. Unfortunately, simply blacklisting known queries via exact match may not be a robust mitigation, as an attacker could easily sidestep this mechanism by paraphrasing queries or retraining their inference model on a different pool of probes sampled from the same families (e.g., moving from "How to build a bomb?" to "How to kill a person?"). A more robust mitigation would be to prevent the LLM from responding to entire classes of queries that are known to produce discriminative signals, such as the ones introduced in Section 4. Next, we briefly evaluate this possibility and introduce a simple mitigation technique targeted against LLMmap's query strategy.
Threat Model. Let us refer to the owner of the LLM-integrated application, $\mathcal{B}$, as the defender. The defender wants to protect the deployed LLM from an active fingerprinting attack performed by an external user, the attacker $\mathcal{A}$. Based on knowledge of $\mathcal{A}$'s query strategy, $\mathcal{B}$ proceeds as follows: (1) the defender scans all the interactions, i.e., input-output pairs, between the LLM and external users; (2) if an interaction is believed to have originated from a fingerprinting attack, the output of the model is marked as sensitive and is perturbed before being returned to the external user. We outline the two phases of the considered defense in the following.
(1) Detection Phase: Instead of filtering the input prompts received by the LLM, the considered defense strategy centers on analyzing the outputs of the LLM, which proved to be the most effective method. In particular, we focus on detecting two families of queries, banner-grabbing and alignment-error-inducing, given that, according to our experiments, those are the most effective queries among the tested ones (see Section 4) but also the ones for which it is possible to implement the most reliable detection mechanism.$^{9}$ We achieve this with the following two checks (a sketch follows the list):
Responses to banner-grabbing: If the output of the model contains an (even partial) mention of the model's name (e.g., "Phi") or vendor (e.g., "DeepMind"/"google"), the output is flagged as sensitive.
Responses with alignment error messages: We check for error messages induced by alignment; these responses frequently come with characteristic phrasing, including common expressions like "I cannot provide…" or "I'm not able to fulfill…". We created a dictionary of such phrases and scan the model's output for any matches. When a match is found, the output is flagged as sensitive.
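A minimal sketch of this detection phase; the marker and phrase lists below are illustrative, not the paper's exact dictionaries:

MODEL_MARKERS = ["gpt", "llama", "phi", "gemma", "mistral", "claude",
                 "openai", "google", "deepmind", "anthropic", "meta"]
REFUSAL_PHRASES = ["i cannot provide", "i'm not able to fulfill",
                   "i can't assist with", "as an ai"]

def is_sensitive(output: str) -> bool:
    text = output.lower()
    if any(marker in text for marker in MODEL_MARKERS):      # banner grabbing
        return True
    if any(phrase in text for phrase in REFUSAL_PHRASES):    # alignment errors
        return True
    return False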
(2) Perturbation Phase: When an LLM's output is flagged as sensitive, the response is modified before it is returned to the external user. We consider two mechanisms (sketched after the list):
Figure 7: Accuracy of (closed-set) LLMmap on LLM-integrated applications implementing informed fingerprint mitigations (red and blue). Accuracy in the absence of defenses is reported as a reference (black). In green, a query strategy adapted by the attacker to avoid trigger-query perturbation.
Fixed Response: regardless of the model and query, the application returns the string "I cannot answer that.".
Sampled-Model Response: a random LLM is sampled from a pool (all the LLMs in Table C.2) and used to answer the query instead of the original model.
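A minimal sketch of the perturbation phase, wrapping the deployed model and reusing the is_sensitive detector sketched above; model and the decoy pool are hypothetical callables:

import random

def guarded_answer(model, query, pool, mode="sampled"):
    output = model(query)
    if not is_sensitive(output):
        return output                        # benign interaction: pass through
    if mode == "fixed":
        return "I cannot answer that."       # fixed-response perturbation
    decoy = random.choice(pool)              # sampled-model response:
    return decoy(query)                      # answer with a randomly chosen LLM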
Effectiveness of Mitigation. The performance of LLMmap against applications implementing this defense mechanism is reported in Figure 7 for the two perturbations. The sampled-model response approach deteriorates the accuracy of LLMmap more effectively than the simpler fixed response. Intuitively, this is because the former actively misguides the inference model by providing outputs generated by other LLMs. In both cases, blocking only two classes of queries reduces the fingerprinting accuracy by more than 50%. It is plausible that the accuracy of LLMmap could be further reduced by expanding the blocking mechanism to additional query families or by improving the detection rate for the ones considered above. Nonetheless, this mitigation approach (and its derivatives) comes with inherent drawbacks and limitations.
8.1.1 Drawbacks and Limitations of the Mitigation
Altered Functionality: Altering or blocking models' responses also means severely reducing the model's functionality. This becomes more detrimental when expanding the discussed defense to protect against all query classes we considered in this work (see Section 4). For example, alignment is a crucial feature of widely deployed LLMs. Since our query strategy targets responses influenced by weak alignment, a defense might need to avoid responding to such prompts, effectively nullifying the entire alignment mechanism. More generally, from the vendor's perspective, forcing LLMs not to respond to essential families of queries may reduce the reliability of their product, which will, in turn, impact their user base.
Adaptive Attacks: Query-informed defenders must constantly evolve to stay ahead of adaptive attackers, but in turn, attackers may modify their queries to bypass the new
Figure 8: Comparison of (closed-set) LLMmap's accuracy with different query strategies. Randomized queries (green, blue, and red) from the Stanford Alpaca database achieve high accuracy with more attempts, though less efficiently than optimized queries (black).
defenses. Specifically, if the defense approach is to block/alter certain query types (e.g., banner grabbing and alignment-driven prompts), attackers can, in turn, switch to alternative query strategies that achieve comparable fingerprinting accuracy. Figure 7 illustrates an example of an adaptive approach, where the green curve represents the accuracy of a query strategy that replaces both banner-grabbing and alignment-based methods with queries from other families (see Section 4), achieving performance comparable to LLMmap's default query strategy. Moreover, as we show next, attackers can use highly generic query strategies, making it impractical for defenses to detect or decline to respond without rendering the LLMs unusable.
8.2 Query Strategies from Generic Prompts
LLM fingerprinting leverages the intrinsic functionality of the model, meaning that any interaction with (any query submitted to) the model inherently reveals information that can be exploited. The default query strategy we chose for LLMmap is designed to use specialized queries that trigger an unusual response from the model; given that different models handle these unorthodox queries differently, we can effectively classify model versions. In the following, we consider a hypothetical in which the query strategy is formed using the most generic queries, i.e., the opposite of LLMmap's default specialized-query approach. Indeed, LLMmap's inference model is adaptable and capable of functioning with any set of queries an attacker chooses.
Evaluation. To test the limits of fingerprinting with generic queries, we devised three query strategies, each composed of 30 random prompts sampled from a collection of human-written prompts commonly used for LLM instruction tuning [1], namely the Stanford Alpaca database. Specifically, this database contains 52K prompts that ask the LLM to perform generic tasks, for example, rewriting sentences, summarizing paragraphs, giving examples, and finding synonyms. The final query strategies after sampling are reported in Table C.3 in Appendix B. Figure 8 illustrates that although these "weaker" queries decrease fingerprinting efficiency on a per-query basis, they can still achieve high accuracy (90%) when enough queries are made.
This indicates that fingerprinting may require more queries but remains feasible even with less optimized, generic queries, thereby complicating defense strategies. In this context, to bypass most detection-based defenses (such as the one discussed above), an attacker can randomly select a fresh set of queries and train LLMmap's inference model accordingly. If these queries are drawn uniformly from the universe of honest prompts, fingerprinting queries can potentially be made indistinguishable from any valid interaction performed by an honest user and, thus, undetectable.
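A minimal sketch of constructing such a randomized generic strategy; the dataset identifier tatsu-lab/alpaca is our assumption about where the Stanford Alpaca prompts are hosted:

import random
from datasets import load_dataset

alpaca = load_dataset("tatsu-lab/alpaca", split="train")
# keep prompts that need no additional input context
candidates = [row["instruction"] for row in alpaca if not row["input"]]
Q_generic = random.sample(candidates, 30)    # a fresh randomized 30-query strategy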
Is LLM Fingerprinting Avoidable? The fundamental question here is whether LLM fingerprinting can be avoided entirely. In settings such as OS fingerprinting, standardizing implementation details could theoretically eliminate fingerprinting without affecting core functionality [9].$^{10}$ However, for LLMs, fingerprinting is tied to the model's fundamental behavior. Altering this behavior to prevent fingerprinting would also mean altering the model's functionality, which may not be feasible or desirable in many cases.
Ultimately, our findings suggest that LLM fingerprinting is an inevitable consequence of the unique behaviors exhibited by different models. Thus, it seems unlikely that a practical solution exists that can fully obscure an LLM's behavior to prevent fingerprinting while preserving its utility. This difficulty is further compounded when the defender is unaware of the attacker's query strategy or when the query strategy is deliberately designed to be hard to detect and block, as shown to be possible in Section 8.2.
9 Remarks and Future Work
We introduce LLMmap, an effective and lightweight tool for fingerprinting LLMs deployed in LLM-integrated applications. While model fingerprinting is a crucial step in the information-gathering phase of AI red-teaming operations, much other relevant information about a deployed LLM can potentially be inferred by interacting with the model. The LLMmap framework is general and can potentially be adapted to support additional property-inference and enumeration capabilities, such as: enumeration of an agent's function calls, prompting-framework detection (e.g., detecting whether the application is using RAG or other frameworks), or hyperparameter inference (i.e., inferring hyperparameters the model is employing, such as the sampling temperature).
Our future efforts will focus on implementing these functionalities within the LLMmap framework and making them available to the community.

Acknowledgments

The research for this project was conducted while the first author was affiliated with George Mason University. The authors would like to thank Antonios Anastasopoulos for his insights into the related work. The first and second authors were partially supported by NSF award #2154732.

Ethics Considerations

In this work, we propose a method for fingerprinting large language models. While this work is primarily driven by the need to develop a practical tool to enhance the security analysis of LLM-integrated applications, it raises several important ethical considerations.
As with any penetration testing tool, it is plausible that LLMmap could be used by malicious actors to fingerprint LLM-based applications. By identifying the underlying LLM, attackers could tailor adversarial inputs to exploit known vulnerabilities, potentially manipulating AI-driven services. However, we believe that the benefits of introducing and open-sourcing our tool to the security community outweigh the risks, as it enables researchers and developers to proactively identify weaknesses, strengthen defenses, and enhance the overall security posture of LLM-integrated applications.
Additionally, the process of LLM fingerprinting involves probing systems to gather detailed information about the underlying models, which could violate privacy policies or confidentiality agreements if conducted without proper authorization. In this study, we ensured that LLMmap was not run on LLM-integrated applications outside of our direct control. All the attacks presented in this paper were carried out in fully simulated environments within local premises, ensuring that no external systems or unauthorized applications were affected. Although queries were sent to closed-source models, we took care to remain compliant with relevant guidelines and terms of use.

Moving forward, it is crucial that any use of LLMmap, upon its release, is conducted with explicit permission from the owners of the LLM-integrated applications being tested, ensuring adherence to privacy policies and regulatory standards.

Open Science

Implementations of LLMmap, as well as the datasets derived from our study, will be made publicly available. Code available at: https://zenodo.org/records/14737353.

*Work done while at George Mason University.
$^{1}$ Name derived from the foundational network scanner Nmap [29].
$^{2}$ Some applications may permit only a single interaction without supporting ongoing communication. However, any stateless interaction can be simulated within a stateful one, making the stateless assumption the most general scenario.
$^{3}$ In fact, among all the tested models, the only one that demonstrated awareness of its exact version was Mistral-7B-Instruct-v0.1, which responds with "Mistral 7B v0.1".
$^{4}$ We used our initial handcrafted examples to prompt ChatGPT4 and generate similar queries.
$^{5}$ Under the often-verified assumption that the current nmap query strategy is capable of capturing the new OS/service.
$^{6}$ Or the average of multiple vectors if multiple sets of traces are available for the same target model.
$^{7}$ The only exception is temperature zero.
$^{8}$ It is important to note that we do not claim any form of optimality for the default strategy, and we believe that it can be further improved.
$^{9}$ Detecting more families is doable but requires more complex methods.
$^{10}$ Attackers can perform OS fingerprinting by exploiting ancillary implementation details, such as header flag orders or sequence numbering, which do not impact the core functionality of protocols. If vendors were to standardize these details, it would eliminate the ability to distinguish OSs based on their stack implementations while maintaining the desired functionality.