
Poof! What do you need? —

ChatGPT unexpectedly began speaking in a user’s cloned voice during testing

Woolf: "OpenAI just leaked the plot of Black Mirror's next season."

An illustration of a computer synthesizer spewing out letters.

On Thursday, OpenAI released the "system card" for ChatGPT's new GPT-4o AI model that details model limitations and safety testing procedures. Among other examples, the document reveals that in rare occurrences during testing, the model's Advanced Voice Mode unintentionally imitated users' voices without permission. Currently, OpenAI has safeguards in place that prevent this from happening, but the instance reflects the growing complexity of safely architecting an AI chatbot that could potentially imitate any voice from a small clip.

Advanced Voice Mode is a feature of ChatGPT that allows users to have spoken conversations with the AI assistant.

In a section of the GPT-4o system card titled "Unauthorized voice generation," OpenAI details an episode where a noisy input somehow prompted the model to suddenly imitate the user's voice. "Voice generation can also occur in non-adversarial situations, such as our use of that ability to generate voices for ChatGPT’s advanced voice mode," OpenAI writes. "During testing, we also observed rare instances where the model would unintentionally generate an output emulating the user’s voice."

In this example of unintentional voice generation provided by OpenAI, the AI model blurts out "No!" and continues the sentence in a voice that sounds similar to the "red teamer" heard at the beginning of the clip. (A red teamer is a person hired by a company to do adversarial testing.)

It would certainly be creepy to be talking to a machine and then have it unexpectedly begin talking to you in your own voice. Ordinarily, OpenAI has safeguards to prevent this, which is why the company says this occurrence was rare even before it developed ways to prevent it completely. But the example prompted BuzzFeed data scientist Max Woolf to tweet, "OpenAI just leaked the plot of Black Mirror's next season."

Audio prompt injections

How could voice imitation happen with OpenAI's new model? The primary clue lies elsewhere in the GPT-4o system card. To create voices, GPT-4o can apparently synthesize almost any type of sound found in its training data, including sound effects and music (though OpenAI discourages that behavior with special instructions).

As noted in the system card, the model is fundamentally capable of imitating any voice based on a short audio clip. OpenAI guides this capability safely by providing an authorized voice sample (of a hired voice actor) that the model is instructed to imitate. It provides the sample in the AI model's system prompt (what OpenAI calls the "system message") at the beginning of a conversation. "We supervise ideal completions using the voice sample in the system message as the base voice," writes OpenAI.

In text-only LLMs, the system message is a hidden set of text instructions that guides behavior of the chatbot that gets added to the conversation history silently just before the chat session begins. Successive interactions are appended to the same chat history, and the entire context (often called a "context window") is fed back into the AI model each time the user provides a new input.
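As a rough illustration, the flow looks something like the sketch below. The message format and the generate() function are hypothetical stand-ins for how a chat loop works in general, not OpenAI's actual API:

```python
# A minimal sketch of a chat context window, assuming a hypothetical
# generate() call; this is not OpenAI's real interface.

def generate(context: list[dict]) -> str:
    """Stand-in for a call to the language model."""
    ...

# The hidden system message is silently prepended before the chat begins.
context = [
    {"role": "system",
     "content": "You are a helpful chatbot. You do not talk about violent acts."},
]

def chat_turn(user_input: str) -> str:
    # Each new user input is appended to the same running history...
    context.append({"role": "user", "content": user_input})
    # ...and the entire context window is fed back into the model.
    reply = generate(context)
    context.append({"role": "assistant", "content": reply})
    return reply
```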

(It's probably time to update this diagram created in early 2023 below, but it shows how the context window works in an AI chat. Just imagine that the first prompt is a system message that says things like "You are a helpful chatbot. You do not talk about violent acts, etc.")

A diagram showing how GPT conversational language model prompting works.
Benj Edwards / Ars Technica

Since GPT-4o is multimodal and can process tokenized audio, OpenAI can also use audio inputs as part of the model's system prompt, and that's what it does when OpenAI provides an authorized voice sample for the model to imitate. The company also uses another system to detect if the model is generating unauthorized audio. "We only allow the model to use certain pre-selected voices," writes OpenAI, "and use an output classifier to detect if the model deviates from that."
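Conceptually, that means the system message itself can carry audio. Continuing the hypothetical message format from the earlier sketch (the "audio" field and the tokenize_audio() helper are illustrative assumptions, not OpenAI's documented interface), it might look something like this:

```python
# Sketch of a multimodal system prompt carrying an authorized voice
# sample; the message fields and tokenize_audio() are hypothetical.

def tokenize_audio(path: str) -> list[int]:
    """Stand-in: convert an audio clip into discrete audio tokens."""
    ...

context = [
    {
        "role": "system",
        "content": "Speak only in the voice provided in this sample.",
        # The authorized voice actor's clip lives in the context window
        # as audio tokens, alongside everything the user later says.
        "audio": tokenize_audio("authorized_voice_sample.wav"),
    },
]
```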


In the case of the unauthorized voice generation example, it appears that audio noise from the user confused the model and served as a sort of unintentional prompt injection attack that replaced the authorized voice sample in the system prompt with an audio input from the user.

Remember, all of these audio inputs (from OpenAI and the user) are living in the same context window space as tokens, so user audio is there for the model to grab and imitate at any time if the AI model were somehow convinced that doing so is a good idea. It's unclear how noisy audio led to that scenario exactly, but the audio noise could get translated to random tokens that provoke unintended behavior in the model.

This brings to light another issue. Just like prompt injections, which typically tell an AI model to "ignore your previous instructions and do this instead," a user could conceivably do an audio prompt injection that says "ignore your sample voice and imitate this voice instead."

That's why OpenAI now uses a standalone output classifier to detect these instances. "We find that the residual risk of unauthorized voice generation is minimal," writes OpenAI. "Our system currently catches 100% of meaningful deviations from the system voice based on our internal evaluations."
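One plausible way such an output classifier could work (an illustrative guess, not OpenAI's documented design) is to compare a speaker embedding of each generated audio chunk against the authorized voice sample and cut off the stream when similarity drops too far:

```python
import numpy as np

# Illustrative sketch of an output classifier; speaker_embedding()
# and the 0.8 threshold are assumptions, not OpenAI's actual system.

def speaker_embedding(audio_chunk: bytes) -> np.ndarray:
    """Stand-in: map audio to a fixed-size voice-identity vector."""
    ...

def is_authorized_voice(chunk: bytes, reference: np.ndarray,
                        threshold: float = 0.8) -> bool:
    emb = speaker_embedding(chunk)
    # Cosine similarity between the generated chunk and the
    # pre-selected voice's reference embedding.
    sim = float(np.dot(emb, reference) /
                (np.linalg.norm(emb) * np.linalg.norm(reference)))
    # Below the threshold, the voice has deviated: stop the output.
    return sim >= threshold
```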

The weird world of AI audio genies

Obviously, the ability to imitate any voice with a small clip is a huge security problem, which is why OpenAI has previously held back similar technology and why it's putting the output classifier safeguard in place to prevent GPT-4o's Advanced Voice Mode from being able to imitate any unauthorized voice.

"My reading of the system card is that it’s not going to be possible to trick it into using an unapproved voice because they have a really robust brute force protection in place against that," independent AI researcher Simon Willison told Ars Technica in an interview. Willison coined the term "prompt injection" back in 2022 and regularly experiments with AI models on his blog.

While that's almost certainly a good thing in the short term as society braces itself for this new audio synthesis reality, at the same time, it's wild to think (if OpenAI had not restricted its model's outputs) of potentially having an unhinged vocal AI model that could pivot instantaneously between voices, sounds, songs, music, and accents like a robotic, turbocharged version of Robin Williams—an AI audio genie.

"Imagine how much fun we could have with the unfiltered model," says Willison. "I’m annoyed that it’s restricted from singing—I was looking forward to getting it to sing stupid songs to my dog."

Willison points out that while the full potential of OpenAI's voice synthesis capability is currently restricted by OpenAI, similar tech will likely appear from other sources over time. "We are definitely going to get these capabilities as end users ourselves pretty soon from someone else," he told Ars Technica. "ElevenLabs can already clone voices for us, and there will be models that do this that we can run on our own machines sometime within the next year or so."

So buckle up: It's going to be a weird audio future.
