Roadmap: Voice AI

Voice AI isn’t just an upgrade to software’s UI — it's transforming how businesses and customers connect.

Imagine this: Your flight has just been canceled, and you are standing at the airport gate, trying to contact your airline’s customer service line, but they’ve informed you that “due to increased call volume, call waiting times are longer than usual.” You’re stuck in an endless maze of automated menus, repeating “speak to a representative” after every option fails to address what you need. When you finally connect with a person, they transfer you, forcing you to explain your situation all over again to someone else. Meanwhile, the minutes are ticking by. You’re no closer to rebooking your flight or resolving the issue, and the prospect of staying at an overnight airport hotel feels inevitable. This is a stressful, expensive, and all-too-familiar travel nightmare millions experience.

Now picture this: You call the airline, and instead of a long hold, a robotic voice, or a string of options, you’re greeted by an AI that instantly understands your situation. It recognizes that your flight was canceled, suggests the best alternatives based on your preferences, and handles rebooking, all with the ease and fluidity of talking to a “human.” This is just one example of the promise of voice AI technology applied to a problem we all know. As with most applications of transformative technologies, we’ve yet to discover the most compelling use cases, because they weren’t possible before AI.

With advancements across all layers of the voice technology stack, voice AI solutions can finally engage in human-like conversations, personalize the customer experience, and scale infinitely to meet spikes in demand any time of day. Those frustrating robotic interactions are a relic of the past. There may even be a future state where consumers prefer to speak with an AI agent over a human because it’s the most expedient way to solve their problems.

Voice AI isn’t just an upgrade to software’s UI; it’s transforming how businesses and customers connect. The convergence of speech-native AI models and multimodal capabilities has positioned voice AI to transform industries where human communication is critical. We believe investing in voice AI will unlock a new era of business communications, allowing companies to meet rising customer expectations while scaling their operations more efficiently.

If you prefer to listen to the top insights from this roadmap, here is an AI-generated podcast via NotebookLM. 

The massive market for voice

Humans like to talk. We speak extensively, making tens of billions of phone calls each day. And despite the prevalence of other communication forms like texting, emailing, and social media, phone calls remain the dominant mode of communication for most businesses. Across industries such as healthcare, legal services, home services, insurance, logistics, and others, businesses rely on phone-based communication to convey complex information more effectively, provide personalized services or advice, handle high-value transactions, and address urgent, time-sensitive needs.

Yet a vast majority of calls go unanswered. For instance, SMBs miss 62% of their calls on average, missing out on addressing customer needs and winning more business. There are multiple inefficiencies with the status quo; calls are sent to voicemail after working hours, humans can only handle one call at a time, and the quality of support is inconsistent — leading to long wait times, after-hours delays, and poor customer experience. Despite pouring money into larger call centers or legacy automation systems, companies struggle to overcome these fundamental constraints.

Previous attempts to integrate technology that could augment phone-based work have seen lackluster success. Returning to our example of calling the airlines, customers often need help navigating through an outdated Interactive Voice Response (IVR) system, a technology that dates back to the 1970s. IVR is when automated systems say things such as, “Press 1 for rebooking” or “In a few words, tell me what you are calling about.” This outdated technology was first designed to automate call handling. Still, it’s built on a rigid system that can only process pre-set commands and cannot truly understand the intent or urgency behind a call. There is no shortage of demand for better voice automation technologies. However, businesses are limited by the technical capabilities to deliver voice products in a way that solves customers’ problems efficiently and pleasantly.

Why now is the time to build in voice

To explain why now is such an important inflection point for voice-as-an-interface, we’ll reflect on the evolution of voice technology. First, there were IVR systems, as described above, which enterprises and consumers almost universally dislike, even though IVR still represents a market of over $5 billion today.

Improvements led to the second wave of innovation in voice, as Automatic Speech Recognition (ASR) software, also known as Speech-to-Text (STT) models, focused on transcription, enabling machines to convert spoken language into text in real-time. As ASR approached human-level performance over the past decade, we saw several new companies emerge building on top of ASR, including Gong and our portfolio company Rev. Advancements in ASR/STT have continued with the release of OpenAI’s open-source Whisper model in late 2022, and many others, that have helped to power more natural conversational systems capable of processing natural speech rather than rigid menu selections. Despite these improvements, ASR can still struggle with accents, background noise, and nuanced understanding of tone, humor, emotion, etc.

In the past year, the voice AI landscape has seen a surge of transformative advancements across the research, infrastructure, and application layers. Rapid progress has stemmed from generative voice, with companies like ElevenLabs and others redefining Text-To-Speech (TTS) technology, creating models that produce voices with unprecedented emotional nuance, making AI sound more human than ever before. Google’s launch of Gemini 1.5 brought multimodal search into the fold, combining voice, text, and visual inputs to create a richer user experience. Shortly after that, OpenAI’s Voice Engine further pushed the boundaries of voice synthesis, generating speech that closely mimicked natural conversation. The most significant breakthrough, however, came with the unveiling of GPT-4o, a model capable of real-time reasoning natively across audio, vision, and text. This represents a monumental leap forward, showcasing how AI can understand and process human speech and respond with depth and intelligence across multiple modalities.

These innovations are leading to two main developments:

First, a growing array of high-quality models has emerged to support the conversational voice stack, which has led to an influx of developers experimenting with voice applications. Traditionally, voice AI apps have all leveraged a “cascading” architecture: a user’s speech is first transcribed into text using an STT model, then the text is processed by a large language model (LLM) to generate a response, which is finally converted back into speech by a TTS model. However, the cascading nature of this architecture presents two significant drawbacks: latency and the loss of non-textual context.

Latency is one of the biggest drivers of a negative user experience, particularly when it exceeds 1000 ms, since the gap between turns in typical human conversation is only 200 to 500 milliseconds. Within the past year, models like GPT-4 Turbo were released, substantially reducing latency. However, developers still had to invest significant engineering effort to bring their apps closer to human-level latency. As for context, emotional and contextual cues are often lost when converting from audio to text, and these systems struggle with interruptions or overlapping speech due to their rigid, turn-based interaction structure.

These technologies (STT, LLMs, and TTS) are rapidly advancing and converging to similar performance levels, which is great news for developers. Certain models perform better on different dimensions, such as latency, expressiveness, and function calling, so developers can pick and choose which models they want to use based on their specific use cases.
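
To make the latency drawback of the cascading architecture concrete, here is a minimal sketch of a latency budget for a three-stage pipeline. The `Stage` class, the function names, and every per-stage number are hypothetical and chosen only to illustrate the arithmetic; they are not measurements of any real model. The only figure taken from the text is the 1000 ms threshold.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    latency_ms: float  # assumed, illustrative value; not a benchmark


def total_latency_ms(stages):
    """Cascaded stages run one after another, so their latencies add up."""
    return sum(stage.latency_ms for stage in stages)


def within_budget(stages, budget_ms=1000.0):
    """True if the end-to-end response time stays at or under the budget."""
    return total_latency_ms(stages) <= budget_ms


# Hypothetical per-stage latencies for an STT -> LLM -> TTS cascade.
CASCADE = [
    Stage("stt", 300.0),
    Stage("llm", 600.0),
    Stage("tts", 250.0),
]
```

With these assumed numbers, `total_latency_ms(CASCADE)` is 1150.0, so the pipeline misses the 1000 ms budget even though no single stage looks slow on its own. That additive effect is why each hop in a cascade matters.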

Second, we’re seeing groundbreaking progress with the rise of Speech-To-Speech (STS) models, specifically designed to handle speech-based tasks without transcribing audio into text. These models address key limitations in traditional cascading architectures, particularly latency and conversational dynamics. Unlike their predecessors, speech-native models process raw audio inputs and outputs directly, leading to significant improvements:

  • Ultra-low latency with response times of ~300 milliseconds, closely mirroring natural human conversational latency.
  • Contextual understanding allows these models to retain information from earlier in the conversation, interpret the purpose behind spoken words (even when phrased in varied or complex ways), and identify multiple speakers without losing track of the dialogue.
  • Enhanced emotional and tonal awareness, capturing the speaker’s emotions, tone, and sentiment and reflecting those nuances in the model’s responses. This results in more fluid and natural interactions.
  • Real-time voice activity detection allows these models to keep listening even while they are speaking, meaning a user can interrupt them at any time. This is a substantial step forward from cascading applications, which typically rely on rigid turn-taking dynamics where the user has to wait for the agent to finish speaking before it will listen to them. This provides a much more natural and efficient user experience.
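
The interruption ("barge-in") behavior in the last bullet can be illustrated with a toy energy-based voice activity detector. This is only a sketch of the general idea, not how speech-native models detect speech internally. The function names, the threshold, and the frame-count rule are all assumptions made for illustration.

```python
import math


def rms(frame):
    """Root-mean-square energy of one audio frame (a list of float samples)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def detect_barge_in(frames, threshold=0.1, min_speech_frames=3):
    """Return the index of the frame where an interruption is declared, or None.

    An interruption is declared once `min_speech_frames` consecutive frames
    have energy above `threshold`; a quiet frame resets the count. Both
    parameters are assumed values for this sketch.
    """
    run = 0
    for index, frame in enumerate(frames):
        if rms(frame) > threshold:
            run += 1
            if run >= min_speech_frames:
                return index
        else:
            run = 0
    return None
```

An agent loop would call something like `detect_barge_in` on microphone frames while its own audio is playing and stop playback as soon as an index is returned. Requiring several consecutive loud frames is a simple way to avoid reacting to a single click or cough.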

[Figure: cascading architecture]

Speech-native models are the future of conversational voice. Alongside OpenAI’s recently released Realtime API, which supports Speech-to-Speech (STS) interactions via GPT-4o, several companies, open-source projects, and research initiatives are advancing the development of this new STS paradigm. Notable examples include Kyutai’s open-source Moshi model, Alibaba’s two open-source foundational speech models, SenseVoice and CosyVoice, and Hume’s voice-to-voice Empathetic Voice Interface, among many others.

Key challenges to industry adoption

Quality, trust, and reliability are the biggest challenges holding back enterprise adoption of voice agents. In part, customers are jaded by poor experiences with legacy IVR products, and many modern AI voice agents are still not reliable enough for many use cases or for broader rollouts. Most enterprises start by employing voice agents in low-stakes situations, and as they move to higher-value use cases, the bar for reliable performance becomes very high. For example, a small roofing company might happily employ an agent to field inbound customer calls after hours, when it has no alternative. But in a business like this, where each customer call could represent a $30K project, the company may be slow to adopt a voice agent as its primary answering service, since businesses like this have very little tolerance for an AI agent that fumbles a call and costs them a valuable lead.

Generally, complaints about voice AI agents can be characterized as performance reliability issues. These range from the call dropping entirely to the agent hallucinating, the latency being too high, or the customer getting frustrated and hanging up. The good news is that voice agents are improving on these dimensions. Developer platforms providing more reliable infrastructure for voice agents are on the rise, focusing on optimizing latency and failing gracefully without interrupting the conversation. Conversational orchestration platforms give the agent a deterministic flow to follow in the conversation, which minimizes hallucinations and provides guardrails around what the agent is allowed to discuss with customers.
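
A deterministic conversation flow of the kind orchestration platforms provide can be sketched as a small state machine. The step names, the `FLOW` table, and the `advance` function below are hypothetical and do not reflect any particular platform's API. The point is only that the code, not the model's prompt, decides when the conversation may move on.

```python
# Hypothetical flow: each step names its successor and, optionally, a fact
# that must be established before the conversation may leave that step.
FLOW = {
    "greet": {"next": "verify_identity"},
    "verify_identity": {"next": "collect_reason", "requires": "identity_confirmed"},
    "collect_reason": {"next": "done"},
}


def advance(step, facts):
    """Return the next step, staying put while a required fact is missing."""
    if step == "done":
        return "done"
    node = FLOW[step]
    required = node.get("requires")
    if required is not None and not facts.get(required, False):
        return step  # guardrail: do not proceed until the requirement is met
    return node["next"]
```

For example, `advance("verify_identity", {})` stays on `"verify_identity"`, while `advance("verify_identity", {"identity_confirmed": True})` moves to `"collect_reason"`. The language model can still phrase each turn freely, but it cannot skip the verification step.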

Our Voice AI market map

We’re witnessing innovation at every layer of the stack — from foundational models and core voice infrastructure to developer platforms and verticalized applications. We’re looking to back founders building solutions at every level of the Voice AI stack, and there are several key areas we find particularly exciting:

[Figure: Voice AI market map]

Models: Under the hood, the foundational model providers build technologies that power various speech-driven use cases. Existing players primarily focus on specific skills designed for cascading architectures, such as STT, LLMs, and TTS. However, it’s clear that the future of voice AI will depend on multimodal or speech-native models that can process audio natively without needing back-and-forth transcription between text and audio. Next-gen voice AI players leverage new architectures and multimodal capabilities to introduce novel approaches. For example, companies like Cartesia are pioneering an entirely new architecture using State Space Models (SSMs). Across the board, we expect significant improvements in foundation models, and we are particularly excited about the development of smaller models that can handle more straightforward conversational turns without relying on the most powerful models. This ability to offload less complex tasks to smaller models will help reduce latency and cost.
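
The idea of offloading simpler turns to a smaller model can be sketched as a tiny router. The word-count heuristic and the model labels `"small-model"` and `"large-model"` are placeholders invented for this example. A production router would more plausibly use a classifier or the dialogue state rather than utterance length.

```python
def choose_model(utterance, max_simple_words=6):
    """Pick a model tier for one conversational turn.

    Assumption for this sketch: short utterances (confirmations, yes/no
    answers) are "simple" and can go to a smaller, faster model.
    """
    word_count = len(utterance.split())
    if word_count <= max_simple_words:
        return "small-model"  # hypothetical label for a cheap, low-latency model
    return "large-model"      # hypothetical label for the most capable model
```

Under this heuristic, a turn like "Yes, that works" is routed to the small tier, while a long, multi-clause request goes to the large one.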

Developer Platforms: While the underlying models have significantly improved latency, cost, and context windows, building voice agents and managing real-time voice infrastructure is still a substantial challenge for developers. Thankfully, a category of voice-focused developer platforms has rapidly emerged to help developers abstract away much of the complexity. A few of the core challenges that these developer tools help solve are:

  • Optimizing latency and reliability: Maintaining the infrastructure to provide scalable and performant real-time voice agents is a significant burden that would often require an entire engineering team to manage at scale.
  • Managing conversational cues, background noise, and non-textual context: Many STT models struggle to determine when a user is done speaking, so developers often need to build their own “end-pointing” detection to address this issue. Additionally, developers often need to enhance the background noise filtering and the sentiment and emotion detection provided by existing models. These seemingly small features can be critical to improving the call quality and bridging the gap between a demo and the higher expectations customers have in production environments.
  • Efficient error handling and retries: It is still common for voice model APIs to fail occasionally, bringing a conversation to a screeching halt. The key to building reliable applications on top of this unreliable infrastructure is to quickly identify failed API calls, buy time by inserting filler words in the conversation, and retry the API call to another model, which all needs to be done incredibly quickly.
  • Integrations into third-party systems and support for retrieval-augmented generation (RAG): Most business use cases require access to knowledge bases and integrations into third-party systems to provide more intelligent responses and take action on behalf of a user. Doing this in a low-latency fashion that fits naturally into a conversational system is non-trivial.
  • Conversational flow control: Flow control allows a developer to specify a deterministic flow of the conversation, giving them far more control than they would get by just providing the model with a prompt to guide the conversation. These flow control systems are particularly important in sensitive or regulated conversations like a healthcare call, where a voice agent must confirm the right patient identity before moving on to the next step in a conversation.
  • Observability, analytics, and testing: Observability and testing of voice agents are still in their infancy in many ways, and developers are looking for better ways to evaluate their performance both in development and production and, ideally, A/B test multiple agents. In addition, tracking these agents' conversational quality and performance at scale in production remains a significant challenge.
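
The error-handling pattern from the list above (detect a failed call quickly, buy time with a filler phrase, retry against another model) can be sketched with Python's standard `asyncio` library. The providers here are simulated stand-ins, and `respond_with_fallback`, `say`, the timeout, and the filler text are all assumptions for illustration rather than any vendor's API.

```python
import asyncio


async def respond_with_fallback(prompt, providers, say, timeout_s=0.5,
                                filler="One moment, please."):
    """Try each provider in order until one answers in time.

    On a timeout or simulated failure, speak a filler phrase (if another
    provider remains) and move on to the next provider.
    """
    for index, provider in enumerate(providers):
        try:
            return await asyncio.wait_for(provider(prompt), timeout=timeout_s)
        except (asyncio.TimeoutError, RuntimeError):
            if index < len(providers) - 1:
                await say(filler)  # keep the caller engaged while retrying
    raise RuntimeError("all providers failed")


# Hypothetical providers standing in for real model APIs.
async def flaky_provider(prompt):
    raise RuntimeError("simulated API failure")


async def backup_provider(prompt):
    return f"echo: {prompt}"
```

Calling `asyncio.run(respond_with_fallback("hi", [flaky_provider, backup_provider], say))` with a `say` coroutine that plays audio would speak the filler once and then return the backup provider's answer. The short timeout is the important design choice: a slow failure is as damaging to the conversation as an outright error.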

Most developers building a voice agent prefer to focus on creating the business logic and customer experience unique to their product rather than managing the infrastructure and models required to address the mentioned challenges. As a result, many companies have emerged, offering orchestration suites and platforms that simplify the process for developers and/or business users to build, test, deploy, and monitor automated voice agents.

One example is Vapi, which abstracts away the complexity of voice infrastructure and provides the tools to quickly build high-quality, reliable voice agents for enterprises and self-serve customers.

Applications: Finally, companies at the application layer are developing voice-based automation products for a wide range of use cases. We are particularly excited about applications that a) fully “do the work” for customers, handling a complete function end-to-end and delivering valuable outcomes, b) leverage AI’s ability to scale on demand, such as handling thousands of calls simultaneously during peak moments, and c) build highly specialized, vertically-focused solutions with deep integrations into relevant third-party systems. These capabilities allow voice applications to command high annual contract values (ACVs), especially when they are used in revenue-generating scenarios or significantly reduce costs. Additionally, we’re seeing instances where Voice AI products are creating net new technology budgets in customer segments that typically didn’t spend much on technology, substantially expanding the total addressable market (TAM) in markets previously considered too small for venture-backed companies.

However, it is also worth noting that quality is a top priority for voice applications. While it's easy to show a compelling demo that gets customers to buy, customers are also quick to churn if a voice agent doesn’t consistently deliver high-quality, reliable service, which is easier said than done. Building a high-quality product requires combining the right models, integrations, conversational flows, and error handling to create an agent that efficiently solves users’ issues without going off the rails. Going the extra mile to build this level of quality is not only key to satisfying customers, but it also helps to enhance product defensibility.

We have identified several functional opportunities for Voice AI at the application layer. These include transcription (e.g., taking notes, suggesting follow-ups based on conversation), inbound calling (e.g., booking appointments, closing warm leads, managing customer success), outbound calling and screening (e.g., sourcing and screening candidates for recruiting, appointment confirmation), training (e.g., single-player mode for sales or interview training), and negotiation (e.g., procurement negotiations, bill disputes, insurance policy negotiations).

We’ve been excited to back some of the leaders of the first voice AI wave, which primarily focused on transcription use cases. This is evident from our investments in Abridge, which documents clinical conversations in healthcare; Rilla, which analyzes and coaches field sales reps in the home services industry; and Rev, which provides best-in-class AI and human-in-the-loop transcription across industries.

In this second voice AI wave, companies are expanding into fully conversational voice applications across various use cases and industries. One example of an inbound calling solution tailored for a specific industry is Sameday AI, which provides AI sales agents for the home services industry. For instance, when a homeowner calls an HVAC contractor in urgent need of repairs, the AI agent can field the call, provide a quote based on the issue, handle the negotiation, schedule a technician in the customer’s system of record, take payment, and ultimately close what might have been a lost lead.

In the outbound calling space, companies like Wayfaster are automating parts of the interview process for recruiters by integrating with applicant tracking systems to conduct initial screening calls automatically. This allows recruiters to screen hundreds of candidates in a fraction of the time it would take a human team to do so and focus more of their energy on closing the top candidates.

Voice agents are also becoming capable of handling complex tasks across multiple modalities. For example, some companies are helping medical offices use voice agents to negotiate insurance coverage with carriers, leveraging LLMs to sift through thousands of insurance documents and patient records and utilizing those findings for real-time negotiations with insurance agents.

Our principles for investing in Voice AI technologies

As underlying models advance rapidly, most entrepreneurial opportunities currently sit at the developer platform and application layers. The accelerated pace of model improvements has also enabled entrepreneurs to quickly create effective MVPs, allowing for quick testing and iteration on a product’s value proposition without requiring significant upfront investment. These conditions make it an exciting time to be building in the voice AI ecosystem.

While much of our voice AI thesis aligns with the framework we’ve developed for investing in vertical AI businesses, we wanted to highlight a few key nuances specific to voice solutions. In particular, we emphasize the importance of voice agent quality. It’s easy to develop a compelling demo, but moving from a demo to a production-grade product requires a deep understanding of industry- and customer-specific pain points and the ability to solve a wide range of engineering challenges. Ultimately, we believe that agent quality and execution speed will be the defining factors for success in this category.

Below are our voice AI-specific principles for building in the space:

1. Solutions deeply embedded in industry-specific workflows and across modalities. The most impactful voice AI applications are those deeply embedded within industry-specific workflows. This high level of focus allows companies to tailor their voice agents to the language and types of conversations relevant to the sector, enabling deep integrations with third-party systems, which are essential for agents to take action on a user’s behalf. For example, a voice agent for auto dealerships could integrate with CRMs, leveraging past customer interaction data to improve service and accelerate deployment. Furthermore, applications that combine voice with other modalities add further defensibility by addressing complex, multi-step processes typically reserved for humans.

2. Deliver superior product quality through robust engineering. While building an exciting voice agent demo for a hackathon may be relatively straightforward, the real challenge lies in creating applications that are highly reliable, scalable, and capable of handling a wide range of edge cases. Enterprises require consistent performance, low latency, and seamless integration with existing systems. Founders should focus on designing systems that can handle the unpredictable nature of real-world voice inputs, ensure security, and maintain high uptime. It’s not just about functionality — it’s about building a foundation that guarantees resilience, reliability, and adaptability, distinguishing top-tier voice AI applications from simple prototypes.

3. Balancing growth with retention and product quality KPIs. Voice agents unlock capabilities across revenue-driving functions like sales, and many voice application companies are experiencing rapid and efficient growth as customers look to turbocharge their go-to-market (GTM) functions.

Metrics to measure

Call quality and reliability become increasingly critical as faulty voice agents lead to dissatisfied users who may turn to competitors. Founders should prioritize tracking critical data that reflects product quality, including the following:

  • Churn: Churn will be a clear, although lagging, indicator of quality, and we’ve observed that many voice applications have struggled with high churn, particularly in their early days. This is most common in cases where customers shift valuable workflows from a human to an agent that ultimately fails to deliver reliable and consistent user experiences, resulting in customer dissatisfaction.
  • Self-Serve Resolution: The higher the self-serve resolution rate, the more effective the voice agent is in fully solving the end user’s problem without human intervention.
  • Customer Satisfaction Score: This reflects the overall satisfaction of customers who interact with a voice agent, providing insight into the quality of the experience.
  • Call Termination Rates: High call termination rates indicate unsatisfactory user experiences and unresolved issues, signaling that the voice agent may not perform as expected.
  • Cohort Call Volume Expansion: This measures whether customers increase their use of a voice agent over time, a key indicator of product value and end-user engagement.
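
Several of the metrics above reduce to simple ratios over call and customer records. The sketch below shows one plausible way to compute them. The `Call` fields and the exact definitions (for example, treating an early hang-up as a "termination") are assumptions made for illustration, since the text does not define the formulas.

```python
from dataclasses import dataclass


@dataclass
class Call:
    resolved_by_agent: bool  # solved end-to-end with no human handoff
    ended_early: bool        # caller hung up before the agent finished


def self_serve_resolution_rate(calls):
    """Share of calls the agent fully resolved on its own."""
    return sum(c.resolved_by_agent for c in calls) / len(calls) if calls else 0.0


def call_termination_rate(calls):
    """Share of calls the caller abandoned early."""
    return sum(c.ended_early for c in calls) / len(calls) if calls else 0.0


def churn_rate(customers_at_start, customers_lost):
    """Fraction of customers lost over a period."""
    return customers_lost / customers_at_start


def cohort_volume_expansion(first_period_calls, later_period_calls):
    """Ratio of a cohort's later call volume to its initial volume (>1 means growth)."""
    return later_period_calls / first_period_calls
```

As an illustration with made-up numbers, a cohort that placed 1,000 calls in its first month and 1,500 in a later month has an expansion ratio of 1.5, and losing 10 of 200 customers in a period is a churn rate of 0.05.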

What’s ahead for the future of voice?

Explosive step-function improvements in voice AI models have unlocked exciting startup opportunities across the voice stack in just the past couple of years. As the underlying model and infrastructure technologies in voice keep improving, we expect to see even more products emerge, solving increasingly complex problems with conversational voice. We’re eager to partner with the most ambitious founders building in this space across all stages.

If you are working on a business in voice AI or want to learn more about how we think about building and investing in these businesses, please don’t hesitate to contact Mike Droesch (mdroesch@bvp.com), Aia Sarycheva (asarycheva@bvp.com), and Libbie Frost (lfrost@bvp.com).