ChatGPT 与 WebRTC 的结合：生成式人工智能对实时通信的意义

May 8, 2023 2023 年 5 月 8 日

ChatGPT is changing computing and as an extension how we interact with machines. Here’s how it is going to affect WebRTC.
ChatGPT 正在改变计算，进而改变我们与机器的交互方式。以下是它对 WebRTC 的影响。

ChatGPT became the service with the highest growth rate of any internet application, reaching 100 million active users within the first two months of its existence. A few are using it daily. Others are experimenting with it. Many have heard about it. All of us will be affected by it in one way or another.
ChatGPT 成为所有互联网应用程序中增长率最高的服务，在其推出的头两个月内，活跃用户就达到了 1 亿。一些人每天都在使用它。其他人正在试用。许多人听说过它。我们所有人都将或多或少地受到它的影响。

I’ve been trying to figure out what exactly does a “ChatGPT WebRTC” duo means – or in other words – what does ChatGPT means for those of us working with and on WebRTC.
我一直想弄明白 "ChatGPT WebRTC "二重奏到底是什么意思，或者换句话说，ChatGPT 对我们这些使用 WebRTC 的人来说意味着什么。

Here are my thoughts so far.
以下是我目前的想法。

Table of contents 目录

Crash course on ChatGPT
ChatGPT 速成班
- BI, AI and Generative AI
  商业智能、人工智能和生成式人工智能
- The stellar rise of ChatGPT
  ChatGPT 的辉煌崛起
Why ChatGPT and WebRTC are like oil and water
为什么说 ChatGPT 和 WebRTC 就像油和水？
What have people done with ChatGPT and WebRTC so far?
到目前为止，人们用 ChatGPT 和 WebRTC 做了些什么？
Broadening the scope: Generative AI
扩大范围：生成式人工智能
Fitting Generative AI to the world of RTC
将生成式人工智能融入 RTC 世界
Is the future of WebRTC generative (AI)?
WebRTC 的未来是生成式（人工智能）吗？

Crash course on ChatGPT
ChatGPT 速成班

Let’s start with a quick look at what ChatGPT really is (in layman terms, with a lot of hand waving, and probably more than a few mistakes along the way).
首先，让我们快速了解一下 ChatGPT 的真正含义（通俗地说，其中有很多手势，可能还有一些错误）。

BI, AI and Generative AI
商业智能、人工智能和生成式人工智能

I’ll start with a few slides I cobbled up for a presentation I did for a group of friends who wanted to understand this.
我先给大家看几张幻灯片，这是我为一群想了解这个问题的朋友做演讲时拼凑的。

ChatGPT is a product/service that makes use of machine learning. Machine learning is something that has been marketed a lot as AI – Artificial Intelligence. If you look at how this field has evolved, it would be something like the below:
ChatGPT 是一种利用机器学习的产品/服务。机器学习在市场上经常被称为 AI（人工智能）。如果你看看这个领域是如何发展起来的，就会发现它的发展过程如下：

We started with simple statistics – take a few numbers, sum them up, divide by their count and you get an average. You complicate that a bit with weighted average. Add a bit more statistics on top of it, collect more data points and cobble up a nice BI (Business Intelligence) system.
我们从简单的统计开始--取几个数字，求和，除以它们的个数，就得到了平均数。如果使用加权平均值，就会变得更加复杂。在此基础上添加更多的统计数据，收集更多的数据点，拼凑出一个漂亮的 BI（商业智能）系统。

At some point, we started looking at deep learning:
后来，我们开始研究深度学习：

Here, we train a model by using a lot of data points, to a point that the model can infer things about new data given to it. Things like “do you see a dog in this picture?” or “what is the text being said in this audio recording?”.
在这里，我们通过使用大量数据点来训练模型，使模型能够推断出新数据的内容。比如 "你在这张图片中看到一只狗吗？"或 "这段录音中的文字是什么？"。

Here, a lot of 3 letter acronyms are used like HMM, ANN, CNN, RNN, GNN…
这里使用了大量 3 个字母的缩写，如 HMM、ANN、CNN、RNN、GNN...

What deep learning did in the past decade or two was enable machines to describe things – be able to identify objects in images and videos, convert speech to text, etc.
深度学习在过去一二十年里所做的工作，就是让机器能够描述事物--能够识别图像和视频中的物体，将语音转换成文本等。

It made it the ultimate classifier, improving the way we search and catalog things.
它使其成为终极分类器，改善了我们搜索和编目事物的方式。

And then came a new field of solutions in the form of Generative AI. Here, machine learning is used to generate new data, as opposed to classifying existing data:
随后，一个新的解决方案领域--生成式人工智能（Generative AI）应运而生。在这里，机器学习被用来生成新数据，而不是对现有数据进行分类：

Here what we’re doing is creating a random input vector, pushing it into a generator model. The generator model creates a sample for us – something that *should* result in the type of thing we want created (say a picture of a dog). That sample that was generated is then passed to the “traditional” inference model that checks if this is indeed what we wanted to generate. If it isn’t, we iteratively try to fine tune it until we get to a result that is “real”.
在这里，我们要做的就是创建一个随机输入向量，并将其推送到生成器模型中。生成器模型会为我们创建一个样本--一个*应该*生成我们想要的类型的东西（比如一张狗的图片）。生成的样本会被传递给 "传统 "推理模型，后者会检查这是否真的是我们想要生成的东西。如果不是，我们就会反复调整，直到得到一个 "真实 "的结果。

This is time consuming and resource intensive – but it works rather well for many use cases (like some of the images on this site’s articles that are now generated with the help of Midjourney).
这样做既费时又耗费资源，但在很多情况下效果都很好（比如本网站文章中的一些图片就是在 Midjourney 的帮助下生成的）。

So…

We started with averages and statistics
我们从平均数和统计数据入手
Moved to “deep learning”, which is just hard for us to explain how the algorithms got to the results they did (it isn’t based on simple rules any longer)
转为 "深度学习"，只是我们很难解释算法是如何得出结果的（它不再基于简单的规则）
And we then got to a point where AI generates new data
然后我们就到了人工智能产生新数据的阶段

The stellar rise of ChatGPT
ChatGPT 的辉煌崛起

The thing is that all this thing I just explained wouldn’t be interesting without ChatGPT – a service that came to our lives only recently, becoming the hottest thing out there:
问题是，如果没有 ChatGPT，我刚才解释的这一切就不会有趣--这项服务最近才出现在我们的生活中，并成为最热门的服务：

ChatGPT is based on LLMs – Large Language Models – and it is fast becoming the hottest thing around. No other service grew as fast as ChatGPT, which is why every business in the world now is trying to figure out if and how ChatGPT will fit into their world and services.
ChatGPT 基于LLMs （大型语言模型），正迅速成为最热门的服务。没有任何其他服务能像 ChatGPT 一样快速发展，这就是为什么现在世界上的每家企业都在试图弄清楚 ChatGPT 是否以及如何融入他们的世界和服务。

Why ChatGPT and WebRTC are like oil and water
为什么说 ChatGPT 和 WebRTC 就像油和水？

So it begged the question: what can you do with ChatGPT and WebRTC?
这就引出了一个问题：使用 ChatGPT 和 WebRTC 能做什么？

Problem is, ChatGPT and WebRTC are like oil and water – they don’t mix that well.
问题是，ChatGPT 和 WebRTC 就像油和水一样，不能很好地融合在一起。

ChatGPT generates data whereas WebRTC enables people to communicate with each other. The “generation” part in WebRTC is taken care of by the humans that interact mostly with each other on it.
ChatGPT 生成数据，而 WebRTC 使人们能够相互通信。在 WebRTC 中，"生成 "部分主要由通过它进行交互的人类来完成。

On one hand, this makes ChatGPT kinda useless for WebRTC – or at least not that obvious to use for it.
一方面，这使得 ChatGPT 对 WebRTC 有点无用--或者至少使用起来不是那么明显。

But on the other hand, if someone succeeds to crack this one up properly – that someone will have an innovative and unique thing.
但另一方面，如果有人成功地破解了这一难题，那么他就会拥有一个新颖独特的东西。

What have people done with ChatGPT and WebRTC so far?
到目前为止，人们用 ChatGPT 和 WebRTC 做了些什么？

It is interesting to see what people and companies have done with ChatGPT and WebRTC in the last couple of months. Here are a few things that I’ve noticed:
在过去几个月里，人们和公司在 ChatGPT 和 WebRTC 上做了些什么，这很有意思。以下是我注意到的几件事：

Arin Sime decided to ask ChatGPT about the future of WebRTC. Nice, but not really something that gets WebRTC and ChatGPT more integrated with one another
Arin Sime 决定就 WebRTC 的未来向 ChatGPT 提问。很好，但并不能让 WebRTC 和 ChatGPT 更好地相互集成
LiveKit shows how to connect ChatGPT to a live WebRTC video call. The result is mindbogglingly good – practically giving voice to ChatGPT
LiveKit 演示了如何将 ChatGPT 连接到实时 WebRTC 视频通话。结果好得令人匪夷所思--实际上是让 ChatGPT 发出声音
Twilio showcases a similar thing – connecting ChatGPT to their Programmable Voice service. Slightly less compelling but just as practical
Twilio 也展示了类似的功能--将 ChatGPT 连接到其可编程语音服务。稍逊一筹，但同样实用
Then there’s the whole transcription space, where you see ChatGPT and its ilk used for the generation of summaries and action items from the meeting transcription
在整个转录领域，你会看到 ChatGPT 及其同类产品用于生成会议转录摘要和行动项目。

In LiveKit’s and Twilio’s examples, the concept is to use the audio source from humans as part of prompts for ChatGPT after converting them using Speech to Text and then converting the ChatGPT response using Text to Speech and pass it back to the humans in the conversation.
在 LiveKit 和 Twilio 的示例中，其概念是在使用语音到文本（Speech to Text）转换后，将来自人类的音频源作为 ChatGPT 提示的一部分，然后使用文本到语音（Text to Speech）转换 ChatGPT 响应，并将其传回对话中的人类。

Broadening the scope: Generative AI
扩大范围：生成式人工智能

ChatGPT is one of many generative AI services. Its focus is on text. Other generative AI solutions deal with images or sound or video or practically any other data that needs to be generated.
ChatGPT 是众多生成式人工智能服务之一。它的重点是文本。其他生成式人工智能解决方案则处理图像、声音、视频或几乎任何其他需要生成的数据。

I have been using MidJourney for the past several months to help me with the creation of many images in this blog.
在过去的几个月里，我一直在使用 MidJourney 来帮助我制作博客中的许多图片。

Today it seems that in any field where new data or information needs to be created, a generative AI algorithm can be a good place to investigate. And in marketing-speak – AI is overused and a new overhyped term was needed to explain what innovation and cutting edge is – so the word “generative” was added to AI for that purpose.
如今，在任何需要创建新数据或信息的领域，生成式人工智能算法似乎都是一个很好的研究对象。而在市场营销术语中，人工智能已经被过度使用，需要一个被过度炒作的新词来解释什么是创新和前沿，因此，"生成 "一词被添加到人工智能中，以达到这一目的。

Fitting Generative AI to the world of RTC
将生成式人工智能融入 RTC 世界

How does one go about connecting generative AI technologies with communications then? The answer to this question isn’t an obvious or simple one. From what I’ve seen, there are 3 main areas where you can make use of generative AI with WebRTC (or just RTC):
那么，如何将人工智能生成技术与通信联系起来呢？这个问题的答案并不明显或简单。据我所知，生成式人工智能与 WebRTC（或 RTC）的结合主要有三个方面：

Conversations and bots 对话和机器人
Media compression 媒体压缩
Media processing 媒体处理

Here’s what it means 👇
这意味着 👇

Conversations and bots 对话和机器人

In this area, we either have a conversation with a bot or have a bot “eavesdrop” on a conversation.
在这方面，我们要么与机器人对话，要么让机器人 "偷听 "对话。

The LiveKit and Twilio examples earlier are about striking a conversation with a bot – much like how you’d use ChatGPT’s prompts.
前面的 LiveKit 和 Twilio 示例都是关于与机器人进行对话--就像你使用 ChatGPT 的提示一样。

A bot eavesdropping to a conversation can offer assistance throughout a meeting or after the meeting –
窃听对话的机器人可以在整个会议期间或会后提供帮助。

It can try to capture to essence of a session, turning it into a summary
它可以尝试捕捉会议的精髓，将其转化为摘要
Help with note taking and writing down action items
帮助记笔记和撰写行动项目
Figure out additional resources to share during the conversation – such as knowledge base items that reflect what a customer is complaining about to a call center agent
找出在对话过程中可共享的其他资源，例如反映客户向呼叫中心座席人员投诉内容的知识库项目

As I stated above, this has little to do with WebRTC itself – it takes place elsewhere in the pipeline; and to me, this is mostly an application capability.
如上所述，这与 WebRTC 本身关系不大--它发生在管道的其他地方；对我来说，这主要是一种应用能力。

Media compression 媒体压缩

An interesting domain where AI is starting to be investigated and used is media compression. I’ve written about Lyra, Google’s AI enabled speech codec in the past. Lyra makes assumptions on how human speech sounds and behaves in order to send less data over the network (effectively compressing it) and letting the receiving end figure out and fill out the gaps using machine learning. Can this approach be seen as a case of generative AI? Maybe
媒体压缩是人工智能开始研究和应用的一个有趣领域。我曾写过一篇关于谷歌人工智能语音编解码器 Lyra 的文章。Lyra 对人类语音的发音和行为做出假设，以便在网络上发送更少的数据（有效地压缩数据），并让接收端利用机器学习找出并填补空白。这种方法可以看作是生成式人工智能的一种吗？也许可以

Would investigating such approaches where the speakers are known to better compress their audio and even video makes sense?
在已知扬声器能更好地压缩音频甚至视频的情况下，研究这种方法是否有意义？

How about the whole super resolution angle? Where you send video at resolutions of WVGA or 720p and then having the decoder scale them up to 1080p or 4K, losing little in the process. We’re generating data out of thin air, though probably not in the “classic” sense of generative AI.
整个超分辨率的角度如何？以 WVGA 或 720p 分辨率发送视频，然后让解码器将其放大到 1080p 或 4K，在此过程中几乎不会损失什么。我们正在凭空生成数据，不过可能不是 "经典 "意义上的生成式人工智能。

I’d also argue that if you know the initial raw content was generated using generative AI, there might be a better way in which the data can be compressed and sent at lower bitrates. Is that something worth pursuing or investigating? I don’t know.
我还想说的是，如果你知道最初的原始内容是通过生成式人工智能生成的，也许有更好的方法可以压缩数据并以更低的比特率发送。这值得研究吗？我不知道。

Media processing 媒体处理

Similar to how we can have AI based codecs such as Lyra, we can also use AI algorithms to improve quality – better packet loss concealment that learns the speech patterns in real time and then mimics them when there’s packet loss. This is what Google is doing with their WaveNetEQ, something I mentioned in my WebRTC unbundling article from 2020.
与我们可以使用 Lyra 等基于人工智能的编解码器类似，我们也可以使用人工智能算法来提高质量--更好的丢包隐藏功能可以实时学习语音模式，然后在丢包时模仿它们。这就是谷歌在其 WaveNetEQ 中正在做的事情，我在 2020 年发表的 WebRTC 解绑文章中也提到了这一点。

Here again, the main question is how much of this is generative AI versus simply AI – and does that even matter?
这里的主要问题还是，生成式人工智能与单纯的人工智能之间到底有多大的区别？

Is the future of WebRTC generative (AI)?
WebRTC 的未来是生成式（人工智能）吗？

ChatGPT and other generative AI services are growing and evolving rapidly. While WebRTC isn’t directly linked to this trend, it certainly is affected by it:
ChatGPT 和其他生成式人工智能服务正在快速增长和发展。虽然 WebRTC 与这一趋势没有直接联系，但肯定会受到其影响：

Applications will need to figure out how (and why) to incorporate generative AI with WebRTC as part of what they offer
应用程序需要弄清楚如何（以及为什么）将生成式人工智能与 WebRTC 结合起来，作为其服务的一部分
Algorithms and codecs in WebRTC are evolving with the assistance of AI (generative or otherwise)
WebRTC 的算法和编解码器在人工智能（生成或其他）的帮助下不断发展

Like any other person and business out there, you too should see if and how does generative AI affects your own plans.
与其他个人和企业一样，您也应该看看生成式人工智能是否会影响您的计划，以及会如何影响您的计划。

Previous 上一页

Next 下一页