FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Tongyi SpeechTeam, Alibaba Group
FunAudioLLM@list.alibaba-inc.com
Abstract
This report introduces FunAudioLLM, a model family designed to enhance natural voice interactions between humans and large language models (LLMs). At its core are two innovative models: SenseVoice, which handles multilingual speech recognition, emotion recognition, and audio event detection; and CosyVoice, which facilitates natural speech generation with control over multiple languages, timbre, speaking style, and speaker identity. SenseVoice-Small delivers exceptionally low-latency ASR for 5 languages, and SenseVoice-Large supports high-precision ASR for over 50 languages, while CosyVoice excels in multi-lingual voice generation, zero-shot in-context learning, cross-lingual voice cloning, and instruction-following capabilities. The models related to SenseVoice and CosyVoice have been open-sourced on Modelscope and Huggingface, along with the corresponding training, inference, and fine-tuning code released on GitHub. By integrating these models with LLMs, FunAudioLLM enables applications such as speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration, thereby pushing the boundaries of voice interaction technology. Demos are available at https://fun-audio-llm.github.io, and the code can be accessed at https://github.com/FunAudioLLM.
1 Introduction
In recent years, advances in artificial intelligence (AI), such as GPT-4o (OpenAI, 2023), Gemini-1.5 (Reid et al., 2024), and others (Bai et al., 2023b; Chu et al., 2023), have dramatically transformed how humans interact with machines. This transformation is particularly evident in the realm of voice processing, where capabilities such as high-precision speech recognition (Radford et al., 2023), emotion recognition (Ma et al., 2024b), and voice generation (Wang et al., 2023a; Du et al., 2024a) are paving the way for more intuitive and human-like interactions. In this report, we introduce FunAudioLLM, an innovative framework designed to facilitate natural voice interactions between humans and large language models (LLMs) (Team, 2023; Bai et al., 2023a; Touvron et al., 2023). At the core of FunAudioLLM are our two groundbreaking models: SenseVoice, for voice understanding, and CosyVoice, for voice generation.
SenseVoice is our state-of-the-art voice understanding model, which excels in multiple domains of voice processing. We offer both SenseVoice-Small and SenseVoice-Large variants. We have open-sourced SenseVoice-Small, which supports multilingual recognition in Chinese, English, Cantonese, Japanese, and Korean, delivering extremely low inference latency by employing a non-autoregressive end-to-end architecture. This design choice makes inference more than 5 times faster than Whisper-small and more than 15 times faster than Whisper-large (Radford et al., 2023). SenseVoice-Large, on the other hand, supports speech recognition in over 50 languages, with significant advantages in recognizing Chinese and Cantonese. In addition to speech recognition, SenseVoice offers state-of-the-art capabilities in emotion recognition and audio event detection (Mesaros et al., 2021), making it an ideal choice for creating low-latency, human-like voice interaction systems.
Figure 1: An overview of our FunAudioLLM models for voice understanding and generation.
Our suite of applications is further enriched by CosyVoice (Du et al., 2024a), a family of fundamental speech generation models designed to produce natural-sounding voices for a variety of contexts. CosyVoice excels in generating multi-lingual voices tailored to specific speakers, zero-shot adaptation to new speakers (Wang et al., 2023a), cross-lingual voice cloning (Zhang et al., 2023), creating emotionally resonant voices (Shin et al., 2022), and offering nuanced control over speech output through instructional text (Ji et al., 2024). CosyVoice supports five languages: Chinese, English, Japanese, Cantonese, and Korean. CosyVoice comes in three open-source models: CosyVoice-base-300M, which specializes in accurately representing speaker identity, zero-shot learning, and cross-lingual voice cloning; CosyVoice-instruct-300M, which focuses on generating emotionally expressive voices and allows for meticulous adjustments via instructional text, extending its capabilities to controllability over various aspects such as speaker identity (Shimizu et al., 2023), speaking style (Ji et al., 2024), and fine-grained paralinguistic features (Kanda et al., 2024); and CosyVoice-sft-300M, which has been fine-tuned on seven multilingual speakers and is ready for immediate deployment.
By integrating SenseVoice, CosyVoice, and LLMs like Qwen (Team, 2023), FunAudioLLM offers a range of rich application demos. These include Speech-to-Speech Translation (Berard et al., 2018), which allows users to speak in foreign languages using their own voice; Emotional Voice Chat (Xue et al., 2024), which enables the model to understand and respond to emotions for more human-like interactions; Interactive Podcast (Laban et al., 2022), wherein users can engage in live discussions with multiple large models; and AudioBook (Chalamandaris et al., 2014), allowing the model to perform expressive, multi-character narration for audiobooks.
Overall, FunAudioLLM leverages the strengths of SenseVoice and CosyVoice to push the boundaries of voice interaction technology, enabling more natural and seamless communication between humans and large language models.
2 FunAudioLLM Models
2.1 Overview of FunAudioLLM
FunAudioLLM consists of two foundation models for voice understanding and generation, named SenseVoice and CosyVoice, respectively. SenseVoice supports multi-lingual speech recognition and is trained on over 300k hours of data. Specifically, SenseVoice-Small is efficient in inference, with a recognition latency of less than 80 ms, and is more than 5 and 15 times faster than Whisper-small and Whisper-large, respectively, while SenseVoice-Large supports high-precision ASR for over 50 languages. Furthermore, SenseVoice supports rich transcription, including state-of-the-art emotion recognition, audio event detection, inverse text normalization (Pusateri et al., 2017), and punctuation (Chen et al., 2020).
Our voice generation model, CosyVoice, can generate multi-lingual speech. It is trained on over 170k hours of data in five languages: Chinese (ZH), English (EN), Japanese (JP), Cantonese (Yue) and Korean (KO). CosyVoice generated samples can achieve a WER of less than [...] and a speaker similarity of over [...], which reaches the quality level of human parity. CosyVoice supports zero-shot in-context learning, which enables voice cloning with a prompt speech of even 3 seconds. The timbre, emotion, prosody and style can be reproduced within or across languages. We also released an instruction model, which can control speaker identity, speaking style (e.g., emotion) and other fine-grained paralinguistic features with natural textual instructions. An overview of FunAudioLLM models is shown in Figure 1.
Figure 2: SenseVoice is a comprehensive speech foundation model designed to perform various speech understanding tasks, including Automatic Speech Recognition (ASR), Language Identification (LID), Speech Emotion Recognition (SER), and Audio Event Detection (AED). SenseVoice-Small [Top]: An encoder-only model optimized for rapid speech understanding. It offers high-speed processing while supporting 5 languages. SenseVoice-Large [Bottom]: An encoder-decoder model aimed at achieving more precise speech understanding across a broader range of languages. It excels in accuracy and supports an extensive set of language capabilities.
SenseVoice is a speech foundation model with multiple voice understanding capabilities, including automatic speech recognition (ASR), spoken language identification (LID), speech emotion recognition (SER), and audio event classification (AEC) or audio event detection (AED). Two models with different sizes and architectures are proposed to suit different requirements: SenseVoice-Small, an encoder-only speech foundation model for rapid speech understanding, and SenseVoice-Large, an encoder-decoder (Vaswani et al., 2017) speech foundation model for more accurate speech understanding with more languages supported, as illustrated in Figure 2.
SenseVoice-Small is a non-autoregressive encoder-only model for multi-lingual multi-style ASR and multiple speech understanding tasks. Given the input waveform, we first compute the 80-dimensional log-mel filter-bank, and then stack consecutive frames and down-sample them by a factor of 6. The extracted feature is then mapped to the dimension \(D\) of the encoder, denoted as \(X_{\text{speech}} \in \mathbb{R}^{T \times D}\), where \(T\) is the length of the down-sampled feature. The encoder is implemented as a memory-equipped self-attention network (SAN-M) (Gao et al., 2020). To specify the task, we prepend four embeddings to the speech feature as the input to the encoder:
\[ X = \text{concat}(e_{\text{LID}}, e_{\text{SER}}, e_{\text{AEC}}, e_{\text{ITN/NoITN}}, X_{\text{speech}}) \]
\[ P = \text{Softmax}(\text{Linear}_{D \to |V'|}(\text{Encoder}(X))) \]
Here \(X \in \mathbb{R}^{(T+4) \times D}\), \(P \in \mathbb{R}^{(T+4) \times |V'|}\), and \(V'\) is the vocabulary including tokens for ASR and other tasks. \(e_{\text{LID}}\), \(e_{\text{SER}}\), \(e_{\text{AEC}}\), and \(e_{\text{ITN/NoITN}}\) are embeddings of four special tokens:
<LID> indicates the LID task. If <LID> is prepended, the model is trained to predict the language token at the corresponding position of the output. In the training stage, we randomly replace <LID> with the ground truth language token with probability 0.8, so that the model can either predict the language token or be configured with a specified language token in the inference stage.
<SER> indicates the SER task. If <SER> is prepended, the model is trained to predict the speech emotion label at the corresponding position of the output.
<AEC> indicates the AEC task. If <AEC> is prepended, the model is trained to predict the audio event label at the corresponding position of the output.
<ITN> or <NoITN> specifies the transcription style. If <ITN> is provided, the model is trained to transcribe with inverse text normalization (ITN) and punctuation. If <NoITN> is provided, the model is trained to transcribe without ITN and punctuation.
In the training stage, the LID, SER, and AEC tasks are optimized using the cross-entropy loss. The ASR task is optimized using the CTC loss (Graves et al., 2006).
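As a rough illustration of the input construction described above, the following Python sketch stacks filter-bank frames, picks the language prompt token during training, and prepends the four task embeddings to the speech feature. It is a minimal sketch under stated assumptions, not the released implementation: the function names, the non-overlapping stacking scheme, and the toy embedding table are hypothetical, and the projection to the encoder dimension is omitted.

```python
import random

def stack_and_downsample(frames, factor=6):
    """Concatenate each group of `factor` consecutive frames into one frame.

    Assumption: groups do not overlap and a ragged tail is dropped.
    """
    usable = len(frames) - len(frames) % factor
    return [
        [value for frame in frames[start:start + factor] for value in frame]
        for start in range(0, usable, factor)
    ]

def choose_lid_token(true_language, replace_prob=0.8, rng=random):
    """Training-time choice for the language slot of the prompt.

    With probability `replace_prob` the ground-truth language token is used,
    otherwise the generic <LID> token is kept so the model must predict it.
    """
    return true_language if rng.random() < replace_prob else "<LID>"

def build_encoder_input(speech, embed, lid="<LID>", itn=True):
    """Prepend four task embeddings to the (already projected) speech feature."""
    prompt = [lid, "<SER>", "<AEC>", "<ITN>" if itn else "<NoITN>"]
    return [embed[token] for token in prompt] + list(speech)

# 20 frames of 80-dim filter-bank features become 3 frames of 480 dims.
fbank = [[float(t)] * 80 for t in range(20)]
stacked = stack_and_downsample(fbank)

# Toy embedding table with D = 2 (hypothetical values).
embed = {
    token: [float(i), float(i)]
    for i, token in enumerate(["<LID>", "<SER>", "<AEC>", "<ITN>", "<NoITN>", "<zh>"])
}
speech = [[0.5, 0.5]] * 7  # T = 7 projected frames
encoder_input = build_encoder_input(speech, embed, lid="<zh>")
print(len(stacked), len(stacked[0]), len(encoder_input))
```

The encoder input has length T + 4, matching the shape of X above; the first four output positions would carry the LID, SER, AEC, and style predictions, while the remaining T positions feed the CTC loss.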
SenseVoice-Large is an autoregressive encoder-decoder model for multi-lingual ASR and multiple speech understanding tasks. Similar to Whisper (Radford et al., 2023), SenseVoice-Large specifies tasks by a sequence of input tokens to the decoder. Specifically, we specify whether to predict language, speech emotion, and audio events with timestamps by including <LID>, <SER>, and <AED> tokens, respectively. Compared to SenseVoice-Small, the advantages of SenseVoice-Large are its transcription accuracy and its support for a vast number of languages (50+).
Table 1 gives examples of transcriptions from Whisper, SenseVoice-S, and SenseVoice-L, together with the ground truth of the ASR task.
Whisper: Absolute shock, but in a great way. Wow. That was awesome. That was awesome. What way to open a song. That was awesome. Awesome. ...
SenseVoice-S: <music> Absolute shocked but in a great way my. <happy> That was awesome, that was awesome what way to open a song that was awesome, awesome,
SenseVoice-L: <music> Absolutely shocked but in a great way. That was awesome, <music> that was awesome <happy> what way to open a song, that was awesome, awesome,
Ground Truth: Absolutely shocked, but in a great way. Who am I? Wow. That was awesome. That was awesome. What way to open a song. That was awesome. Awesome. ...
Table 1: Examples of transcriptions of Whisper, SenseVoice-S, SenseVoice-L, and the ground truth.
2.3 Semantic Speech Tokenizer
A speech tokenizer transforms vocal signals into discrete tokens, enabling their modeling and prediction by autoregressive transformers for speech generation. Our preliminary experiments indicated that the choice of speech tokenizer is pivotal for overall system performance as well as for the requirements on both data quality and volume. We evaluated three classes of speech tokenizers: 1) those based on residual quantization, like SoundStream (Zeghidour et al., 2022), Encodec (Défossez et al., 2022) and FunCodec (Du et al., 2024b); 2) those utilizing multi-grouped quantization, such as HifiCodec (Yang et al., 2023); and 3) "semantic" speech tokens, specifically HuBERT (Hsu et al., 2021). All of the above tokenizers are trained in an unsupervised or self-supervised manner. Thus, their association with semantic content is often tenuous, contributing to an unstable synthesis process and a substantial demand for clean training data. Moreover, unsupervised tokenizers are susceptible to data noise, necessitating meticulously curated clean data sets.
Building on the success of SenseVoice models, we introduce a supervised semantic speech tokenizer, denoted as \(S^3\) (Du et al., 2024a). Using the pre-trained SenseVoice-Large model as a foundation, we incorporate a vector quantizer after the first six layers of the encoder, as shown in Figure 3. Importantly, an additional positional embedding applied after quantization enhances temporal information. The combination of \(\text{Encoder}_1\) and the vector quantizer is considered the speech tokenizer, which uses the index of the closest code vector as the speech token. The vector quantizer uses a single codebook with 4,096 entries. The resulting token sequence has a rate of 50 Hz, which reduces the computational load of text-to-token generation within language models.
Projects | Languages | Server
Bark | 13 |
ChatTTS | en, zh | WebUI
parler-tts | en |
EmotiVoice | en, zh | WebUI
GPT-SoVITS | en, zh, jp | WebUI
OpenVoice | en, sp, fr, zh, jp, kr |
CosyVoice | en, zh, jp, yue, kr | WebUI, gRPC
[Table 2 also has Zero-shot, Style & Speaker Control, Fine-grained, and SFT columns; their per-project marks are not recoverable here.]
Table 2: Comparison on released features between CosyVoice and other open-sourced projects.
Since the speech tokenizer is trained to minimize the recognition errors of rich text in an end-to-end manner, the extracted tokens have a strong semantic relationship to textual and paralinguistic information. Furthermore, our \(S^3\) tokenizer benefits from supervised training, which enhances its robustness to data noise and reduces the reliance on pristine data collection. Consequently, a broader spectrum of data can be utilized for training the model.
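The token extraction step of a single-codebook vector quantizer can be sketched in a few lines of Python. This is only an illustration of nearest-code-vector lookup, assuming squared Euclidean distance (the distance measure is not stated here) and a four-entry toy codebook in place of the 4,096-entry one; `quantize` and `tokens_for_duration` are hypothetical helper names.

```python
def quantize(frames, codebook):
    """Return, for each frame, the index of the nearest code vector."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(frame, codebook[k]))
        for frame in frames
    ]

def tokens_for_duration(seconds, rate_hz=50):
    """Number of speech tokens produced for an utterance at a fixed token rate."""
    return int(seconds * rate_hz)

# Toy 2-dimensional codebook with 4 entries (the real one has 4,096).
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_frames = [[0.9, 0.1], [0.1, 0.8], [0.2, 0.1]]
tokens = quantize(encoder_frames, codebook)
print(tokens)                     # [1, 2, 0]
print(tokens_for_duration(10.0))  # 500 tokens for 10 s of speech at 50 Hz
```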
Figure 3: An illustration of our supervised semantic speech tokenizer.
CosyVoice, a family of fundamental speech generation models (Du et al., 2024a), utilizes \(S^3\) tokens to synthesize natural-sounding voices suitable for various applications. As a versatile model, CosyVoice excels in tasks such as generating multi-lingual voices tailored to specific speakers, adapting to new speakers without training (zero-shot in-context learning), replicating voices across different languages (cross-lingual voice cloning), creating emotionally resonant voices, and offering nuanced influence over speech output through instructional text. CosyVoice supports five languages: Chinese (ZH), English (EN), Japanese (JP), Cantonese (Yue) and Korean (KO). We released three open-source models. The first, CosyVoice-base-300M, excels in accurately representing speaker identity, adapting to contexts without any fine-tuning, and cloning voices across languages. The second, CosyVoice-instruct-300M, is adept at generating emotionally expressive voices and allows for meticulous adjustments via instructional text. Lastly, CosyVoice-sft-300M has been fine-tuned on seven multi-lingual speakers and is ready for immediate deployment. All of them share a common model architecture and learning framework. Compared with other open-sourced projects, CosyVoice has released the widest spectrum of supported features, as shown in Table 2.
Figure 4: A semantic diagram of CosyVoice models.
Figure 5: Sequence construction for (a) zero-shot in-context learning and (b) cross-lingual voice cloning. LID represents language identifier.
2.4.1 System Overview
CosyVoice incorporates an autoregressive Transformer-based language model (LM) to generate speech tokens for the input text. An ordinary differential equation based (ODE-based) diffusion model, flow matching (Lipman et al., 2023), reconstructs the Mel spectrum from the generated tokens. A HiFTNet-based vocoder (Li et al., 2023) then synthesizes waveforms from the reconstructed Mel spectrum. Dashed models are optional for certain applications, such as cross-lingual cloning and speaker fine-tuned inference.
2.4.2 Model Training
At the training stage, the autoregressive language model (LM) is trained using a teacher-forcing paradigm. In this process, tokenized text and a left-shifted version of the speech tokens are provided as input to predict the subsequent speech tokens.
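The shift between LM inputs and targets under teacher forcing can be made concrete with a small Python sketch. It is illustrative only: the start marker `<T>` and the helper name are assumptions introduced here, while `<E>` stands for the end-of-sequence token; the actual special tokens and sequence layout are defined by the released code.

```python
def teacher_forcing_pair(text_tokens, speech_tokens, start="<T>", end="<E>"):
    """Build (input, target) sequences for next-token prediction.

    The speech part of the input is offset by one position relative to the
    targets, so each position sees only ground-truth tokens from earlier steps.
    """
    speech_input = [start] + list(speech_tokens)  # what the LM conditions on
    targets = list(speech_tokens) + [end]         # what it must predict next
    lm_input = list(text_tokens) + speech_input
    return lm_input, targets

text = ["he", "llo"]
speech = [7, 8, 9]
lm_input, targets = teacher_forcing_pair(text, speech)
print(lm_input)  # ['he', 'llo', '<T>', 7, 8, 9]
print(targets)   # [7, 8, 9, '<E>']
```

The loss would be computed only over the speech positions: the prediction made at `<T>` is scored against the first speech token, and the prediction made at the last speech token is scored against `<E>`.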
The flow matching model is developed to estimate the conditional probabilities \(P(S \mid X, v, S_{\text{ref}})\), where \(X\) and \(v\) denote the speech tokens and speaker embeddings (Wang et al., 2023b), respectively. \(S\) and \(S_{\text{ref}}\) represent the Mel spectrum of target and reference speech, respectively. A convolutional Transformer U-Net (Mehta et al., 2023) is employed to estimate the vector field between the prior distribution and the desired one, which is derived from the optimal transport ODE (OT-ODE). Because the OT-ODE is straightforward to solve, the number of iterations during the inference stage can be reduced significantly: typically only five to ten iterations are required to produce a satisfactory Mel spectrogram. We also employ the classifier-free guidance (CFG) (Ho & Salimans, 2022) technique and mask out the proceeding feature conditions to boost the in-context learning ability.
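To illustrate why a handful of solver steps can suffice, the sketch below integrates a vector field with fixed-step Euler and applies classifier-free guidance at each step. It is a toy under stated assumptions: `toy_field` is a closed-form stand-in for the U-Net (it points along a straight path to the condition), and the guidance rule \((1+\beta)\,v_{\text{cond}} - \beta\,v_{\text{uncond}}\) is one common CFG formulation rather than a confirmed detail of CosyVoice.

```python
def guided_field(field, x, t, cond, guidance):
    """Blend conditional and unconditional vector-field estimates (CFG)."""
    cond_v = field(x, t, cond)
    uncond_v = field(x, t, None)
    return [(1.0 + guidance) * c - guidance * u for c, u in zip(cond_v, uncond_v)]

def euler_sample(field, x0, cond, steps=10, guidance=0.0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = guided_field(field, x, i * dt, cond, guidance)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

def toy_field(x, t, cond):
    """Stand-in for the learned field: straight-line flow toward the condition."""
    target = cond if cond is not None else [0.0] * len(x)
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

noise = [0.3, -1.2]   # sample from the prior
target = [1.0, 2.0]   # plays the role of the conditioned Mel frame
sample = euler_sample(toy_field, noise, target, steps=5)
print(sample)  # close to [1.0, 2.0] after only 5 steps
```

Because the toy path is a straight line, Euler integration is exact up to rounding regardless of the step count, which mirrors the intuition behind optimal-transport paths. With a non-zero `guidance` the sample is pushed past the unconditional solution, toward the conditional one.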
For the synthesis of waveforms from the predicted Mel spectrograms, we utilize a vocoder based on HiFTNet (Li et al., 2023). Modifications have been made on HiFTNet to support streaming generation, including the replacement and redesign of certain components. Complete details regarding these adjustments are available in our released code.
CosyVoice models exhibit zero-shot in-context learning capabilities, allowing for the replication of an arbitrary voice with only a brief reference speech sample. This process entails the careful construction of input sequences for the token language model (LM), depicted in Figure 5. For prompt speech and input text in the same language, we merge them to form a unified input, treating the prompt speech tokens as pre-generated. With this input sequence, the autoregressive LM iteratively predicts subsequent tokens until it encounters the "end of sequence" token (E). However, when the prompt speech and input text differ linguistically, we omit the text and tokens associated with the prompt to prevent prosodic characteristics of the original language from influencing the target language. It is important to note that the prompt text, which corresponds to the prompt speech's content, can be transcribed either through human annotation or by ASR models such as SenseVoice. Similar to the prompt text, the prompt tokens are extracted from the prompt speech with the \(S^3\) tokenizer.
After generating the speech tokens, they are appended after the prompt tokens, forming a composite condition for the flow-matching model. Additionally, the speaker embedding and the Mel spectrogram of the prompt speech are incorporated to further enhance timbre and environmental consistency.
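The two sequence-construction cases can be summarized in a short Python sketch. The function names and the list-based token representation are hypothetical and special tokens are left out; the sketch only mirrors the rule stated above, namely that the prompt text and prompt speech tokens are kept for same-language prompts and dropped from the LM input for cross-lingual cloning.

```python
def build_lm_prompt(target_text, prompt_speech_tokens, prompt_text=None,
                    same_language=True):
    """Return (text tokens, pre-generated speech tokens) for the LM.

    Same language: prompt text and prompt speech tokens are kept, and the
    speech tokens are treated as already generated.
    Cross-lingual: both are omitted so the source-language prosody does not
    leak into the target language.
    """
    if same_language:
        if prompt_text is None:
            raise ValueError("same-language prompting needs the prompt transcript")
        return list(prompt_text) + list(target_text), list(prompt_speech_tokens)
    return list(target_text), []

def flow_matching_condition(prompt_speech_tokens, generated_tokens):
    """Generated tokens are appended after the prompt tokens."""
    return list(prompt_speech_tokens) + list(generated_tokens)

prompt_tokens = [11, 12]
zero_shot = build_lm_prompt(["hello"], prompt_tokens, prompt_text=["hi"])
cross_lingual = build_lm_prompt(["bonjour"], prompt_tokens, same_language=False)
condition = flow_matching_condition(prompt_tokens, [21, 22, 23])
print(zero_shot)      # (['hi', 'hello'], [11, 12])
print(cross_lingual)  # (['bonjour'], [])
print(condition)      # [11, 12, 21, 22, 23]
```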
2.4.4 Instruction Fine-tuning
To enable further controllability on CosyVoice, we experiment with integrating additional instruction fine-tuning (Ji et al., 2023). CosyVoice-instruct extends CosyVoice-base with enhanced instruction-following capabilities. Specifically, it supports controllability over various aspects such as speaker identity (i.e., the speaker's characteristics), speaking style (including emotion, gender, speaking rate, and pitch), and fine-grained paralinguistic features. These features include the ability to insert laughter and breaths, to speak while laughing, and to emphasize certain words. Table 3 shows some examples of speaker identity, speaking style, and fine-grained paralinguistic features.
Speaker Identity
1. Selene 'Moonshade', is a mysterious, elegant dancer with a connection to the night. Her movements are both mesmerizing and deadly.<endofprompt>Hope is a good thing.
2. Theo 'Crimson', is a fiery, passionate rebel leader. Fights with fervor for justice, but struggles with impulsiveness.<endofprompt>You don't know about real loss.
Speaking Style
1. A happy girl with high tone and quick speech.<endofprompt>The sun is shining brightly today.
2. A sad woman with normal tone and slow speaking speed.<endofprompt>I failed my important exam.
Fine-grained Paralinguistics
1. Well that's kind of scary [laughter].
2. I don't think I over eat yeah [breath] and um I do exercise regularly.
3. Well that pretty much covers <laughter>the subject</laughter> well thanks for calling me.
4. The team's <strong>unity</strong> and <strong>resilience</strong> helped them win the championship.
Table 3: Examples of speaker identity, speaking style, and fine-grained paralinguistics.
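The input strings in Table 3 follow a simple pattern: a natural-language description, the `<endofprompt>` marker, then the text to be spoken, with inline tags for paralinguistic events. The Python helpers below merely assemble such strings; their names are hypothetical, and how the resulting string is passed to the model is left to the released inference code.

```python
END_OF_PROMPT = "<endofprompt>"

def instruct_input(description, text):
    """Join a speaker or style description and the text to synthesize."""
    return f"{description}{END_OF_PROMPT}{text}"

def emphasize(word):
    """Wrap a word in the emphasis tags shown in Table 3."""
    return f"<strong>{word}</strong>"

def laughing(span):
    """Mark a span as spoken while laughing."""
    return f"<laughter>{span}</laughter>"

style = instruct_input(
    "A happy girl with high tone and quick speech.",
    "The sun is shining brightly today.",
)
emphasis = f"The team's {emphasize('unity')} and {emphasize('resilience')} helped them win the championship."
print(style)
print(emphasis)
```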
3 Dataset
3.1 Training Set for SenseVoice
Figure 6 provides an overview of the dataset utilized for training the SenseVoice models. The SenseVoice-Small model was trained on an extensive audio data corpus of approximately 300,000 hours, covering 5 languages: Chinese, Cantonese, English, Japanese, and Korean. To further enhance the multilingual ability of SenseVoice-Large, an additional 100,000 hours of diverse multilingual data were integrated into the training corpus. To obtain rich transcription labels from speech data, we leveraged open-source models for audio event detection (AED) and speech emotion recognition (SER) to generate pseudo labels, yielding an extensive rich transcription dataset. Specifically, the AED data amounted to 150 million entries, while the SER data comprised 30 million entries.
Figure 6: Hours of SenseVoice training data across languages (in log scale).
Language | Duration (hr)
ZH | 130,000
EN | 30,000
Yue | 5,000
JP | 4,600
KO | 2,200
Table 4: Hours of CosyVoice training data across languages.
3.2 Training Set for CosyVoice
To train the CosyVoice models, we have amassed a considerable dataset comprising multiple languages. Throughout the collection process, we utilize specialized in-house tools for speech detection, signal-to-noise ratio (SNR) estimation, speaker diarization, and separation. Subsequently, pseudo text labels are generated using SenseVoice-Large and Paraformer. These labels undergo a refinement process with the aid of forced-alignment (FA) models, which helps eliminate low-quality data and enhances the accuracy of punctuation. A comprehensive breakdown of the training data's duration across various languages is presented in Table 4.
For the CosyVoice-instruct model, we fine-tuned CosyVoice-base using instruction training data without incorporating speaker embedding in the autoregressive language model. Table 5 presents the duration of the training data for different types of instructions.
Type | Duration (hr)
Speaker Identity | 101
Speaking Style | 407
Fine-grained Paralinguistics | 48
Table 5: Duration statistics of instruction training data by type.
Figure 7: Comparison of SenseVoice and Whisper on Common Voice, with or without LID.
4 Experimental Results
4.1 Multilingual Speech Recognition
Metrics. We use Character Error Rate (CER) to evaluate the models in five languages: Chinese, Cantonese, Japanese, Korean, and Thai, and use the Word Error Rate (WER) for all other languages. Both the ground truth transcriptions and the recognition outputs are standardized using text normalization before the error rate calculation, in alignment with the methodology used by Whisper. All Chinese characters were converted into the simplified Chinese version, together with an additional text normalization pipeline.
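Both metrics are the edit distance between reference and hypothesis divided by the reference length, computed over words for WER and over characters for CER. The following Python sketch shows the computation; it assumes text normalization has already been applied (the normalization pipeline itself is not reproduced), and the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]

def error_rate(reference, hypothesis, unit="word"):
    """WER when unit == "word", CER when unit == "char"."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)

# Already-normalized fragment of the Table 1 example: 2 substitutions in 7 words.
wer = error_rate("absolutely shocked but in a great way",
                 "absolute shock but in a great way")
# One wrong character out of six.
cer = error_rate("今天天气很好", "今天天汽很好", unit="char")
print(round(wer, 3), round(cer, 3))
```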
Table 6 compares Whisper, SenseVoice, and Paraformer (Gao et al., 2022, 2023; Shi et al., 2024) on popular open speech recognition benchmarks, including AISHELL-1 (Bu et al., 2017), AISHELL-2 (Du et al., 2018), WenetSpeech (Zhang et al., 2022), Librispeech (Panayotov et al., 2015), and Common Voice (Ardila et al., 2019). SenseVoice-S and SenseVoice-L outperform their Whisper counterparts by a significant margin on most test sets; Librispeech is the exception.
Figure 7 compares SenseVoice-Large and Whisper-Large-V3 on a broader range of languages, with or without ground-truth language identification (LID) as input. While SenseVoice-Large performs comparably with Whisper-Large-V3 in general, it performs significantly better in languages such as Cantonese (Yue), Catalan (CA), and Marathi (MR).
The evaluation of inference efficiency is shown in Table 7. The real-time factor (RTF, the ratio of the transcribing time to the audio length) and the 10s audio latency (the average time cost when transcribing a 10-second audio clip) are benchmarked on an A800 machine with a decoding batch size of 1. For the encoder-decoder-based models (Whisper-S, Whisper-L-V3, and SenseVoice-L), we perform beam search in decoding with a beam size of 5. Owing to its non-autoregressive architecture, SenseVoice-S obtains extremely low inference latency: it is more than 5 times faster than Whisper-S and more than 15 times faster than Whisper-L-V3. SenseVoice-L performs close to Whisper-L-V3.
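As a quick check of the speedup claims, the sketch below recomputes them from the 10s audio latencies reported in Table 7. The helper names are illustrative only; the numbers are copied from the table.

```python
def real_time_factor(transcribe_seconds, audio_seconds):
    """RTF: time spent transcribing divided by the audio duration."""
    return transcribe_seconds / audio_seconds


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# 10s audio latency in milliseconds, as reported in Table 7.
latency_ms = {
    "Whisper-S": 518,
    "Whisper-L-V3": 1281,
    "Paraformer-zh": 100,
    "SenseVoice-S": 70,
    "SenseVoice-L": 1623,
}

vs_small = speedup(latency_ms["Whisper-S"], latency_ms["SenseVoice-S"])     # about 7.4x
vs_large = speedup(latency_ms["Whisper-L-V3"], latency_ms["SenseVoice-S"])  # about 18.3x

# 70 ms for a 10 s clip corresponds to an RTF of about 0.007 for SenseVoice-S.
rtf_sensevoice_s = real_time_factor(latency_ms["SenseVoice-S"] / 1000.0, 10.0)
```

Latency on a single 10-second clip and RTF are measured separately in the table, so the two columns need not agree exactly for every model.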
Table 6: Performance comparisons among different models on Chinese and English open corpora.

| Model | Framework | Parameters | Support Language | RTF | 10s Audio Latency (ms) |
| --- | --- | --- | --- | --- | --- |
| Whisper-S | Autoregressive | 224 M | | 0.042 | 518 |
| Whisper-L-V3 | Autoregressive | 1550 M | | 0.111 | 1281 |
| Paraformer-zh | Non-autoregressive | 220 M | zh | 0.009 | 100 |
| SenseVoice-S | Non-autoregressive | 234 M | zh, yue, en, ja, ko | 0.007 | 70 |
| SenseVoice-L | Autoregressive | 1587 M | | 0.110 | 1623 |

Table 7: Comparison of model architecture, parameter scale, supported languages, and inference efficiency of SenseVoice, Paraformer, and Whisper.

4.2 Speech Emotion Recognition
We evaluate the SER ability of SenseVoice on 7 popular emotion recognition datasets, including CREMA-D (Cao et al., 2014), MELD (Poria et al., 2019), IEMOCAP (Busso et al., 2008), MSP-Podcast (Martinez-Lucas et al., 2020), CASIA (Zhang & Jia, 2008), MER2023 (Lian et al., 2023), and the dataset of Zhou et al. (2021). These corpora cover both Chinese and English, and scenarios such as acted speech, TV dramas, and daily conversation. We report unweighted average accuracy (UA), weighted average accuracy (WA), macro F1 score (F1), and weighted average F1 (WF1), and compare them in Table 8 with results from recently published SER benchmarks: EmoBox (Ma et al., 2024a), Emo-Superb (Wu et al., 2024), and MerBench (Lian et al., 2024). SenseVoice achieves good performance on all test sets and all metrics, even without fine-tuning on the target domain.
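The four reported metrics can be sketched as follows. This assumes the conventions common in the SER literature, where WA is overall accuracy (each class weighted by its number of samples) and UA is the unweighted mean of per-class recall; the report itself does not spell out the formulas, so treat this as an illustration rather than the evaluation script.

```python
from collections import Counter


def ser_metrics(y_true, y_pred):
    """Compute UA, WA, macro F1, and weighted F1 for single-label predictions."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    labels = sorted(set(y_true))
    support = Counter(y_true)
    n = len(y_true)
    recalls, f1s = {}, {}
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = support[c] - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        recalls[c] = recall
        f1s[c] = (2 * precision * recall / (precision + recall)
                  if precision + recall else 0.0)
    return {
        "WA": sum(1 for t, p in zip(y_true, y_pred) if t == p) / n,
        "UA": sum(recalls.values()) / len(labels),
        "F1": sum(f1s.values()) / len(labels),
        "WF1": sum(f1s[c] * support[c] for c in labels) / n,
    }


# Toy example with an imbalanced label distribution.
scores = ser_metrics(
    ["happy", "happy", "happy", "sad"],
    ["happy", "happy", "sad", "sad"],
)
```

On imbalanced corpora UA and WA can differ noticeably, which is one reason to report both.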
Figure 8: Weighted Average Accuracy (WA(%)) comparison with other open-source SER models.
We further compare SenseVoice with several open-source SER models; results are shown in Figure 8. XLSR-SER is the most popular SER model on HuggingFace, and Qwen-Audio (Chu et al., 2023) and SALMONN (Tang et al., 2024) are two Audio-LLM models that can recognize speech emotion from natural language prompts. Results from EmoBox are also included in the figure for reference. SenseVoice-Large achieves the best results on almost all datasets, while SenseVoice-Small also outperforms the other baseline models on the majority of datasets.
Table 8: SER performance comparisons on different evaluation benchmarks.
4.3 Audio Event Detection
Both SenseVoice-Small and SenseVoice-Large can classify audio events in speech, including music, applause, and laughter. SenseVoice-L can further predict the start and end position of each audio event, while SenseVoice-Small can only predict what happened in the audio, with at most one event per utterance. SenseVoice-Small can, however, detect more kinds of events, such as coughing, sneezing, breathing, and crying, which could occur during human-machine interaction.
Figure 9: F1(%) score comparison of SenseVoice with the audio event detection models BEATs and PANNs on different audio event detection tasks.
We compare SenseVoice with the SOTA audio event detection models BEATs (Chen et al., 2023a) and PANNs (Kong et al., 2020) on different tasks, including environment sound classification (ESC50) (Piczak, 2015), baby cry/laugh detection, coughing detection (Coswara) (Sharma et al., 2020), and in-home talk show event detection. As SenseVoice only predicts the events of interest to us, which may not cover the event categories of other models, we use the F1 score on each event for evaluation. Qwen-Audio is also evaluated for comparison.
We find that SenseVoice serves as a good audio event classification or detection model, though BEATs and PANNs may achieve better F1 scores. This may be attributed to two reasons. First, BEATs and PANNs can adjust the detection threshold to trade off precision against recall and obtain a higher F1 score, whereas threshold adjustment is much more difficult for SenseVoice and Qwen-Audio. (An interesting finding is that SenseVoice and Qwen-Audio always have much higher precision than recall, which could be friendlier for human-machine interaction.) Second, SenseVoice is trained on ASR data with AED pseudo labels rather than on AED-specific data.
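The threshold effect mentioned above can be made concrete with a toy sketch. It assumes a detector that emits a per-clip confidence score for one event, which is the setting in which a threshold can be tuned; the scores and labels below are invented for illustration and are not taken from any of the evaluated models.

```python
def precision_recall_f1(scores, labels, threshold):
    """Binary precision, recall, and F1 when predicting positive at score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: precision_recall_f1(scores, labels, t)[2])


# Invented confidence scores and ground-truth labels (1 = event present).
toy_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
toy_labels = [1, 1, 0, 1, 0, 0]

strict = precision_recall_f1(toy_scores, toy_labels, 0.7)   # high precision, lower recall
relaxed = precision_recall_f1(toy_scores, toy_labels, 0.35)  # lower precision, full recall
chosen = best_threshold(toy_scores, toy_labels, [0.35, 0.5, 0.7])
```

A model that only emits discrete event tags behaves like the strict operating point and cannot be moved along this curve after the fact.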
4.4 Preserving Semantic Information by Tokenizer
To assess the tokenizer's ability to preserve semantic information, we compared the recognition performance of the quantizer-augmented SenseVoice-L against its original version and the Whisper-Large V3 model. The models were evaluated on the Common Voice zh-CN and en benchmarks, with the findings detailed in Table 9.
From the table, we can see that our tokens demonstrate robust recognition performance on both the Chinese and English test sets. Notably, on the common_voice_zh-CN set, the tokens surpass the Whisper-Large V3 model, achieving a relative reduction in error rate. This suggests a substantial correlation between the tokens and semantic content. It is worth noting that the tokenizer has only a single codebook, with a dictionary size of 4,096 entries.
| Test set | Whisper-L-V3 (w/o lid) | Whisper-L-V3 (w/ lid) | SenseVoice-L (w/o lid) | SenseVoice-L (w/ lid) | tokens (w/o lid) | tokens (w/ lid) |
| --- | --- | --- | --- | --- | --- | --- |
| common_voice_zh-CN | 12.82 | 12.55 | 8.76 | 8.68 | 12.24 | 12.06 |
| common_voice_en | 13.55 | 9.39 | 9.79 | 9.77 | 15.43 | 15.38 |

Table 9: Evaluation of the tokens' capability to preserve semantic information. We employ character and word error rates for the zh-CN and en languages, respectively, on the Common Voice benchmarks. Please note that the SenseVoice-L model in this table is an intermediate version and is not identical to the one presented in Table 6.
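For readers who want to relate the table entries to a relative error-rate reduction, the sketch below applies the usual formula, (baseline - candidate) / baseline, to the common_voice_zh-CN row of Table 9. Which of the two LID conditions the text refers to is not stated, so both are computed; the function name is illustrative.

```python
def relative_reduction(baseline, candidate):
    """Relative error-rate reduction of candidate over baseline, in percent."""
    return (baseline - candidate) / baseline * 100.0


# CER (%) on common_voice_zh-CN, copied from Table 9.
whisper_zh = {"w/o lid": 12.82, "w/ lid": 12.55}
tokens_zh = {"w/o lid": 12.24, "w/ lid": 12.06}

reduction_zh = {
    cond: relative_reduction(whisper_zh[cond], tokens_zh[cond])
    for cond in whisper_zh
}
# reduction_zh is roughly 4.5% without LID and roughly 3.9% with LID.
```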
4.5 Evaluation on Generation Quality of CosyVoice
We evaluate the quality of CosyVoice's speech synthesis by examining content consistency and speaker similarity. The "test-clean" subset of LibriTTS (Zen et al., 2019) and the test set of AISHELL-3 (Shi et al., 2021) are used to construct the evaluation sets for English and Chinese, respectively. For each text in these sets, we randomly select a prompt speech. Content consistency is evaluated using Whisper-Large V3 (Radford et al., 2023) for English recognition and Paraformer (Gao et al., 2022) for Chinese recognition. Speaker similarity is quantified as the cosine similarity between the speaker embeddings of the generated and prompt speeches, extracted using ERes2Net (Chen et al., 2023b).
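The speaker similarity measure reduces to a cosine similarity between two embedding vectors. A minimal sketch, assuming the embeddings are already available as plain lists of floats (the ERes2Net extraction step is not shown, and the vectors below are invented):

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Invented 3-dimensional "embeddings" for illustration only.
prompt_embedding = [1.0, 2.0, 3.0]
generated_embedding = [2.0, 4.0, 6.0]   # same direction as the prompt
unrelated_embedding = [3.0, 0.0, -1.0]  # orthogonal to the prompt

same_speaker = cosine_similarity(prompt_embedding, generated_embedding)
different = cosine_similarity(prompt_embedding, unrelated_embedding)
```

The SS values in Tables 10 and 11 appear to be on a 0-100 scale, which would correspond to this quantity multiplied by 100; that scaling is an assumption.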
Similar to other autoregressive language models, we employ a random sampling decoding strategy for our token LM and assess the synthesis process using five different random seed values, including 123 and 1,337. The resulting evaluation metrics are averaged across seeds to obtain the mean and standard deviation. Additionally, we conduct ASR re-ranking to demonstrate potential performance improvements in offline mode.
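One plausible reading of ASR re-ranking is sketched below: synthesize one candidate per seed, transcribe each candidate, and keep the one whose transcript has the fewest word errors against the input text. The report does not give the selection rule in detail, so this is an assumption; `synthesize` and `transcribe` are hypothetical callables standing in for the TTS and ASR models, and the "audio" in the toy example is just a string.

```python
from typing import Any, Callable, Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def rerank_by_asr(
    text: str,
    synthesize: Callable[[str, int], Any],  # hypothetical TTS: (text, seed) -> audio
    transcribe: Callable[[Any], str],       # hypothetical ASR: audio -> transcript
    seeds: Iterable[int],
) -> Tuple[int, int, Any]:
    """Return (errors, seed, audio) for the candidate with the fewest word errors."""
    best = None
    for seed in seeds:
        audio = synthesize(text, seed)
        errors = word_edit_distance(text.split(), transcribe(audio).split())
        if best is None or errors < best[0]:
            best = (errors, seed, audio)
    if best is None:
        raise ValueError("at least one seed is required")
    return best


# Toy stand-ins: each seed maps to a canned "utterance".
CANNED = {
    1: "hello world",
    2: "hello there world again",
    3: "hello brave new world",
}

best_errors, best_seed, best_audio = rerank_by_asr(
    "hello brave new world",
    synthesize=lambda text, seed: CANNED[seed],
    transcribe=lambda audio: audio,
    seeds=[1, 2, 3],
)
```

Because every candidate must be synthesized and transcribed before one is chosen, this procedure suits offline use rather than streaming.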
Tables 10 and 11 present the results for English and Chinese, respectively. On the English dataset, CosyVoice attains human-level performance, with similar content recognition and higher speaker similarity. ASR re-ranking notably enhances content consistency, reducing the word error rate (WER) to 1.51% (Table 10). CosyVoice outperforms ChatTTS in WER and in the number of insertion and deletion errors, indicating superior content consistency. We did not assess speaker similarity for ChatTTS because it has not released voice-cloning capabilities.
| Model | WER (%) | #Ins.&Del. | SS |
| --- | --- | --- | --- |
| Original | 2.66 | 92 | 69.67 |
| ChatTTS | 8.32 | 441 | - |
| CosyVoice | | | |
| + re-ranking | 1.51 | 47 | 74.30 |

Table 10: Comparison of original and CosyVoice-generated speech on the LibriTTS test-clean set in terms of word error rate (WER) and speaker similarity (SS). "±" joins the mean and standard deviation for each evaluation metric.
As for the results on Chinese, the utterances generated by CosyVoice achieve a CER and a number of insertion and deletion errors comparable to those of the original utterances. It seems that ChatTTS has a better generation ability on Chinese than on English in terms of CER. Although ChatTTS and CosyVoice achieve a similar CER, ChatTTS produces more insertion and deletion errors. This is due to the problem of speaker leaking, where modal particles of another speaker are generated unexpectedly. In contrast, CosyVoice does not suffer from this problem and produces far fewer insertion and deletion errors. With ASR re-ranking, CosyVoice reaches a remarkably low CER of 1.84% (Table 11). As with English, CosyVoice also exhibits greater speaker similarity than the original utterances, showcasing its effective voice-cloning proficiency.

| Model | CER (%) | #Ins.&Del. | SS |
| --- | --- | --- | --- |
| Original | 2.52 | 25 | 74.15 |
| ChatTTS | 3.87 | 111 | - |
| CosyVoice | | | |
| + re-ranking | 1.84 | 11 | 81.58 |

Table 11: Comparison of original and CosyVoice-generated speech on the AISHELL-3 test set in terms of character error rate (CER) and speaker similarity (SS). "±" joins the mean and standard deviation for each evaluation metric.
4.6 Evaluation on Emotion Controllability of CosyVoice
To verify emotion controllability, we use the public speech emotion recognition model emo2vec (Ma et al., 2024b). We generate and evaluate 100 English utterances for each of six emotions: happy, angry, sad, surprised, fearful, and disgusted. The content of the synthesized text is designed to match the target emotion. We then measure, for each emotion, the accuracy of the emotion predicted from the synthesized speech.
Table 12 compares the emotion control accuracy of CosyVoice-base and CosyVoice-instruct. For CosyVoice-instruct, the input consists of the content text accompanied by a speaking style instruction (e.g., "Happy.Content Text"). In contrast, CosyVoice-base receives only the content text as input. The results indicate that CosyVoice-instruct with emotional instructions shows a significant improvement over both CosyVoice-base and CosyVoice-instruct without emotional instructions.
| Model | Happy | Sad | Angry | Surprised | Fearful | Disgusted |
| --- | --- | --- | --- | --- | --- | --- |
| CosyVoice-base | | | | | | |
| CosyVoice-instruct | | | | | | |
| w/o instruction | | | | | | |

Table 12: Comparison of emotion control accuracy between CosyVoice-base and CosyVoice-instruct. "±" joins the mean and standard deviation for each evaluation metric.
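The "mean ± standard deviation" entries can be produced as in the sketch below: compute the accuracy for one target emotion in each run, then aggregate across runs. The predictions are invented, and the use of the population standard deviation (`statistics.pstdev`) is an assumption, since the report does not say which estimator was used.

```python
import statistics


def emotion_accuracy(target, predictions):
    """Percentage of synthesized utterances whose predicted emotion matches the target."""
    if not predictions:
        raise ValueError("predictions must not be empty")
    return 100.0 * sum(1 for p in predictions if p == target) / len(predictions)


def format_cell(values):
    """Format per-run accuracies as 'mean±std' (population standard deviation assumed)."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return f"{mean:.2f}±{std:.2f}"


# Invented classifier outputs for the target emotion "happy" over three runs.
runs = [
    ["happy", "happy", "sad", "happy"],
    ["happy", "happy", "happy", "happy"],
    ["happy", "sad", "sad", "happy"],
]
per_run = [emotion_accuracy("happy", preds) for preds in runs]  # 75.0, 100.0, 50.0
cell = format_cell(per_run)
```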
4.7 CosyVoice as a Data Generator
A straightforward application of CosyVoice is as a data generator to augment the training data of other tasks, such as ASR and speech-to-speech translation (S2ST). Taking the ASR task as an example, we conduct an experiment on the Librispeech corpus to evaluate CosyVoice's capability to generate high-quality data. The experimental results are shown in Table 13, where "Librispeech" denotes the original 960-hour data, and "Syn on LS text" and "Syn on MLS text" denote the data generated with text from the Librispeech and MLS training sets, respectively. From the table, we can see that, trained only on the synthesized data, the ASR model achieves results comparable to those obtained with the original Librispeech training set. Combining the two yields a notable improvement in recognition accuracy. An interesting finding is that adding the data synthesized from MLS text significantly improves recognition performance. This may indicate that text diversity is more critical for the ASR task than the duration of speech itself. The improvement can be attributed to the varied linguistic content introduced by the CosyVoice-synthesized samples. The findings from our evaluation underscore the high quality of the samples generated by CosyVoice.
Footnote (emo2vec model used in Section 4.6): https://modelscope.cn/models/iic/emotion2vec_base_finetuned
| Training Data | dev_clean | dev_other | test_clean | test_other |
| --- | --- | --- | --- | --- |
| Librispeech | 2.77 | 5.84 | 2.79 | 5.97 |
| Syn on LS text | 2.79 | 6.37 | 3.00 | 6.59 |
| Librispeech + Syn on LS text | 2.44 | 5.52 | 2.56 | 5.68 |
| Librispeech + Syn on MLS text | 2.51 | 5.23 | 2.68 | 5.26 |
| Librispeech + Syn on LS, MLS text | | | | |

Table 13: Evaluation of CosyVoice generation quality by treating it as a data generator. Word error rates (%) on the human-uttered test sets are used as the evaluation metrics.
5 Applications
FunAudioLLM is an innovative framework designed to facilitate natural voice interactions between humans and large language models (LLMs). By integrating SenseVoice, CosyVoice, and LLMs, FunAudioLLM offers a variety of rich application demos, including speech-to-speech translation, emotional voice chat, interactive podcasts, and expressive audiobook narration. The demos are available at https://fun-audio-llm.github.io.
By combining SenseVoice, LLMs, and CosyVoice, we can effortlessly perform speech-to-speech translation (S2ST), as illustrated in Figure 10. SenseVoice recognizes the input speech in its original language, the LLM translates the source language to the target language, and CosyVoice synthesizes the target speech with cross-lingual voice cloning. This allows users to speak in foreign languages using their own voice.
Figure 10: A diagram of Speech-to-Speech Translation.
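The three-stage S2ST flow can be sketched as a thin wrapper around three callables. Everything here is hypothetical scaffolding: `recognize`, `translate`, and `synthesize` stand in for SenseVoice, an LLM, and CosyVoice respectively and are not the released APIs of those models. The source audio is passed to the synthesis step as the voice prompt to mirror cross-lingual voice cloning.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class S2STPipeline:
    recognize: Callable[[bytes], str]          # hypothetical ASR: audio -> source text
    translate: Callable[[str, str], str]       # hypothetical LLM: (text, target_lang) -> text
    synthesize: Callable[[str, bytes], bytes]  # hypothetical TTS: (text, voice_prompt) -> audio

    def run(self, source_audio: bytes, target_lang: str) -> bytes:
        source_text = self.recognize(source_audio)
        target_text = self.translate(source_text, target_lang)
        # Reuse the speaker's own audio as the voice prompt for cloning.
        return self.synthesize(target_text, source_audio)


# Toy stand-ins that treat audio as UTF-8 bytes, for illustration only.
pipeline = S2STPipeline(
    recognize=lambda audio: audio.decode("utf-8"),
    translate=lambda text, lang: f"[{lang}] {text}",
    synthesize=lambda text, prompt: (text + "|" + prompt.decode("utf-8")).encode("utf-8"),
)
translated_audio = pipeline.run(b"ni hao", "en")
```

Because the stages are chained, a recognition error flows into translation and synthesis unchanged.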
By integrating SenseVoice, LLMs, and CosyVoice, we can develop an Emotional Voice Chat application, as depicted in Figure 11. SenseVoice recognizes the input speech together with its emotion and audio events, the LLM generates the response content along with a speaking style description, and CosyVoice produces emotional speech that follows the given speaking style description.
Figure 11: A diagram of Emotional Voice Chat.
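A single turn of such a chat loop might look like the sketch below. The `Understanding` record, the prompt layout, and the three callables are all assumptions made for illustration; the report does not specify how the recognized emotion and events are passed to the LLM or how the style description is formatted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Understanding:
    text: str
    emotion: str
    events: List[str] = field(default_factory=list)


def chat_turn(
    audio: bytes,
    understand: Callable[[bytes], Understanding],  # hypothetical speech understanding step
    respond: Callable[[str], Tuple[str, str]],     # hypothetical LLM: prompt -> (style, reply)
    synthesize: Callable[[str, str], bytes],       # hypothetical TTS: (reply, style) -> audio
) -> bytes:
    heard = understand(audio)
    events = ", ".join(heard.events) if heard.events else "none"
    # Hypothetical prompt layout exposing emotion and audio events to the LLM.
    prompt = f"(emotion: {heard.emotion}; events: {events}) {heard.text}"
    style, reply = respond(prompt)
    return synthesize(reply, style)


# Toy stand-ins for the three stages.
reply_audio = chat_turn(
    b"...",
    understand=lambda audio: Understanding("I passed my exam!", "happy", ["laughter"]),
    respond=lambda prompt: ("cheerful", "Congratulations!"),
    synthesize=lambda reply, style: f"<{style}> {reply}".encode("utf-8"),
)
```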
By leveraging SenseVoice, an LLM-based multi-agent system with real-time world knowledge, and CosyVoice, we can create an interactive podcast, as shown in Figure 12. An LLM plugin fetches real-time daily knowledge, which a content-generation agent then transforms into a podcast script. The multi-agent system matches podcast roles, and CosyVoice synthesizes the voices. Users can also insert themselves into the podcast for interactive dialogues with the multi-agent system.
Figure 12: A diagram of Interactive Podcast.
By using the analytical capabilities of LLMs to structure books and identify the emotions within them, and combining this analysis with CosyVoice synthesis, we achieve audiobooks with enhanced expressiveness, as illustrated in Figure 13. The LLM is used for narrative and dialogue analysis, character analysis, and fine-grained sentiment analysis, while CosyVoice synthesizes the speech with enhanced expressiveness.
Figure 13: A diagram of Expressive Audiobook.
6 Limitations
SenseVoice has certain limitations that need to be addressed. First, ASR performance generally remains much lower for under-resourced languages. Second, SenseVoice is not designed for streaming transcription. Future work may therefore focus on developing streamable voice understanding models based on SenseVoice.
CosyVoice also has several limitations. First, it supports a limited number of languages. Second, while it can express emotions and speaking styles based on explicit instructions, it cannot infer the appropriate emotion or style from the semantic content of the text. Third, CosyVoice does not perform well when tasked with singing. Finally, there is still room for improvement in achieving expressive emotional changes while maintaining the original timbre of the voice.
Another limitation is that the two innovative models within FunAudioLLM are not trained end-to-end with LLMs. This pipeline approach may introduce error propagation, which could affect overall performance.
7 Authors (alphabetical order of family name)