Introducing Voice Control
By Lydia Schooler, Lorenss Martinsons on Dec 2, 2024
![Group 124](https://directus.hume.ai/assets/f9755631-dcdd-4c8d-bc8e-1f6f22c450b8/Group%20124.png?width=1920&height=1920&quality=75&format=webp&fit=inside)
Introducing Voice Control – a novel interpretability-based method for AI voice customization
- We’re introducing Voice Control, a novel interpretability-based method that brings precise control to AI voice customization without the risks of voice cloning.
- Our tool gives developers control over 10 voice dimensions, labeled “gender,” “assertiveness,” “buoyancy,” “confidence,” “enthusiasm,” “nasality,” “relaxedness,” “smoothness,” “tepidity,” and “tightness.”
- Unlike prompt-based approaches, Voice Control enables continuous adjustments along these dimensions, allowing for precise control and making voice modifications reproducible across sessions.
- We’re releasing Voice Control in beta so that developers can create one-of-a-kind voices for any application, but we’re still working on making voice quality 100% reliable for extreme parameter combinations.
- Through an intuitive no-code interface, you can easily tinker with this frontier technology to craft the perfect voice for your brand or application.
Faced with an increasingly recognizable set of preset voices from AI providers, creators still struggle to find voices that match their product, brand, or application without compromising on quality.
Today, we're introducing Voice Control, our experimental feature for the Empathic Voice Interface (EVI) that transforms how custom AI voices are created through interpretable, continuous controls.
Why voice control matters
Until today, finding the perfect AI voice for your product has been a compromise—either settling for stock voices that aren’t uniquely suited to your brand's identity or wrestling with voice-cloning approaches that are riskier, take more time, and often compromise on quality. We’re introducing Voice Control so that developers can design their own unique voice in seconds. On our playground, you can now tinker with voice characteristics in real-time until you find one that matches your vision—you'll know it when you hear it.
What started as a research project has evolved into an artistic tool—each voice a unique creation that captures a specific mood, personality, or character.
Interpretable control for voice AI
As scientists working at the intersection of emotion science and AI, our research goal was to develop interpretability tools for speech-language models. What makes this particularly challenging is that people’s perceptions of voices are far more granular than they can articulate in words. Consider how parents can instantly distinguish their child's voice in a playground full of young, squeaky, enthusiastic voices, or how you'd struggle to describe your best friend's voice to a stranger—despite immediately recognizing it yourself. Nuanced, ineffable voice characteristics are not just highly recognizable to humans, but extremely psychologically salient.
Given these constraints, we decided to develop a slider-based approach to voice interpretability and control that reflects the nuances of human voice perception without forcing them through the bottleneck of language.
Modifiable voice attributes
The following attributes can be modified to personalize any of the base voices:
- Masculine/Feminine: The vocalization of gender, ranging between more masculine and more feminine.
- Assertiveness: The firmness of the voice, ranging between timid and bold.
- Buoyancy: The density of the voice, ranging between deflated and buoyant.
- Confidence: The assuredness of the voice, ranging between shy and confident.
- Enthusiasm: The excitement within the voice, ranging between calm and enthusiastic.
- Nasality: The openness of the voice, ranging between clear and nasal.
- Relaxedness: The stress within the voice, ranging between tense and relaxed.
- Smoothness: The texture of the voice, ranging between smooth and staccato.
- Tepidity: The liveliness behind the voice, ranging between tepid and vigorous.
- Tightness: The containment of the voice, ranging between tight and breathy.
Each voice attribute can be adjusted relative to the base voice's characteristics. Values range from -100 to 100, with 0 as the default. Setting all attributes to their default values will keep the base voice unchanged.
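As a minimal sketch of the attribute scheme described above, the snippet below models the ten sliders with their documented -100 to 100 range and default of 0. The `make_settings` helper and its clamping behavior are illustrative assumptions, not part of Hume's actual SDK; only the attribute names and value range come from the article.

```python
# Illustrative only: the helper below is a hypothetical wrapper, not Hume's API.
# The ten attribute names and the -100..100 range (default 0) are from the article.

ATTRIBUTES = [
    "masculine_feminine", "assertiveness", "buoyancy", "confidence",
    "enthusiasm", "nasality", "relaxedness", "smoothness",
    "tepidity", "tightness",
]

def make_settings(**overrides):
    """Return a full attribute map: unspecified attributes stay at the
    default of 0, and every value is clamped to the -100..100 range."""
    settings = {name: 0 for name in ATTRIBUTES}
    for name, value in overrides.items():
        if name not in ATTRIBUTES:
            raise ValueError(f"unknown attribute: {name}")
        settings[name] = max(-100, min(100, value))
    return settings

# A voice nudged to be more assertive and less nasal than the base voice.
custom = make_settings(assertiveness=40, nasality=-25)
```

Leaving every attribute at 0 reproduces the base voice unchanged, per the paragraph above.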
These sliders represent perceptual qualities that listeners tend to associate with specific voice characteristics – for instance, what people commonly interpret as a voice that sounds 'confident' or 'feminine' – rather than making claims about someone’s underlying gender or confidence level (after all, these are synthetic voices that don’t correspond to any real person).
Disentangling voice characteristics
One of our core technical achievements is ensuring that, in general, modifications to one voice characteristic don't influence others. This is particularly challenging as many voice attributes are highly correlated across real speakers, so we decided to develop a new, unsupervised approach that preserves most characteristics of each base voice when specific parameters are varied.
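The article does not disclose how its unsupervised disentanglement works, so as a hedged toy illustration of the general idea, the sketch below orthogonalizes correlated attribute direction vectors in an embedding space with Gram–Schmidt, so that moving along one direction has no component along another. This is a classic textbook device, not Hume's method.

```python
# Toy illustration of disentanglement, NOT Hume's actual unsupervised approach:
# orthogonalize attribute direction vectors so editing one does not move another.
import numpy as np

def orthogonalize(directions):
    """Gram-Schmidt: return unit vectors with each direction's projections
    onto the earlier ones removed."""
    basis = []
    for d in directions:
        v = d.astype(float).copy()
        for b in basis:
            v -= np.dot(v, b) * b   # remove the component along b
        v /= np.linalg.norm(v)
        basis.append(v)
    return basis

# Two correlated "attribute" directions in a toy 3-D embedding space.
raw = [np.array([1.0, 0.2, 0.0]), np.array([0.9, 1.0, 0.1])]
ortho = orthogonalize(raw)
# Moving along ortho[1] now has zero component along ortho[0].
```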
Implementation and integration
Voice Control is immediately available through our platform. The creation process is straightforward:
- Select a base voice as your starting point
- Adjust the voice attributes using intuitive sliders
- Preview your changes in real-time
- Deploy your custom voice through the EVI configuration
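The steps above can be sketched programmatically. Everything below is hypothetical: the base-voice name, the payload shape, and the commented endpoint are illustrative assumptions, so check Hume's actual EVI configuration documentation before relying on any of them.

```python
# Hypothetical end-to-end sketch of the four steps above. The base-voice name,
# payload shape, field names, and commented endpoint are all assumptions for
# illustration -- consult Hume's real EVI configuration docs for the actual API.
import json

def build_voice_config(base_voice, **attribute_overrides):
    """Steps 1-2: pick a base voice, then adjust attributes (default 0,
    clamped to the documented -100..100 range)."""
    attrs = {k: max(-100, min(100, v)) for k, v in attribute_overrides.items()}
    return {"base_voice": base_voice, "attributes": attrs}

config = build_voice_config("sample-base-voice", enthusiasm=30, relaxedness=20)

# Step 4: the config would then be submitted to a (hypothetical) EVI
# configuration endpoint, e.g.:
#   requests.post("https://api.hume.ai/...",  # placeholder, not a real path
#                 headers={"X-Hume-Api-Key": "<key>"},
#                 data=json.dumps({"voice": config}))
payload = json.dumps({"voice": config})
```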
The system ensures that voice customizations are:
- Reproducible across sessions
- Stable across different utterances
- Computationally efficient for real-time applications
What's next
This release marks just the beginning of our vision for voice customization. We're actively working on:
- Expanding our range of base voices
- Introducing additional interpretable dimensions
- Enhancing preservation of voice characteristics under extreme modifications
- Developing advanced tools for analyzing and visualizing voice characteristics
Learn More: Transform AI interactions with EVI. Create customizable, emotionally intelligent voice AI for any industry to build AI applications that better understand and respond to human emotional behavior. Start building more engaging AI apps today.