Improved Jitter Buffer Management for WebRTC
YUSUF CINAR, National University of Ireland, Galway
PETER POCTA, University of Zilina
DESMOND CHAMBERS and HUGH MELVIN, National University of Ireland, Galway
Abstract
This work studies the jitter buffer management algorithm for Voice over IP in WebRTC. In particular, it details the core concepts of WebRTC's jitter buffer management. Furthermore, it investigates how the jitter buffer management algorithm behaves under network conditions with packet bursts. It also proposes an approach, different from the default WebRTC algorithm, to avoid distortions that occur under such network conditions. Under packet bursts, when the packet buffer becomes full, the WebRTC jitter buffer algorithm may discard all the packets in the buffer to make room for incoming packets. The proposed approach offers a novel strategy to minimize the number of packets discarded in the presence of packet bursts. Therefore, voice quality as perceived by the user is improved. ITU-T Rec. P.863, which also confirms the improvement, is employed to objectively evaluate the listening quality.
CCS Concepts: • Applied computing → Internet telephony; • Information systems → Multimedia streaming;
Additional Key Words and Phrases: WebRTC, jitter buffer management, voice quality
ACM Reference format:
Yusuf Cinar, Peter Pocta, Desmond Chambers, and Hugh Melvin. 2021. Improved Jitter Buffer Management for WebRTC. ACM Trans. Multimedia Comput. Commun. Appl. 17, 1, Article 30 (April 2021), 20 pages. https://doi.org/10.1145/3410449
1 INTRODUCTION
Web Real-Time Communications (WebRTC) is a standardisation effort for real-time communications by the World Wide Web Consortium and the Internet Engineering Task Force (IETF) [4, 5]. There is also an open source project with the same name, i.e., WebRTC, that implements the standards [22]. The software developed by the WebRTC project today runs on billions of devices [2], and countless Voice over IP (VoIP) calls take place every day over the Internet.
VoIP applications produce voice packets at regular intervals, for instance every 20 ms. Nevertheless, the Internet is a non-deterministic medium in which packets are delivered based on a best-effort service level. As a result, packets are exposed to network impairments such as delay, loss, jitter, and packet bursts. VoIP solutions, and inherently WebRTC, employ jitter buffer management algorithms to combat some of the effects of the non-deterministic nature of the Internet. In a typical VoIP application, additional techniques, such as packet loss concealment, also exist to reduce the effects of network impairments. However, the jitter buffer management algorithm, which is the main focus of this article, attempts to absorb the effects of network jitter and packet bursts.
In VoIP applications, voice packets are typically placed into a buffer, known as a jitter buffer, as they arrive. The jitter buffer holds a number of packets waiting to be picked up by the playout logic. Jitter buffer management is an important part of VoIP application design, as it can greatly influence VoIP quality. Jitter buffer management algorithms control how packets are retrieved from the jitter buffer and how they are played out, typically with voice and conversation quality in mind [10].
This work studies the jitter buffer management algorithm for Voice over IP in WebRTC. In particular, it investigates how the algorithm behaves under network conditions in which high jitter causes packet bursts. It also proposes an improvement over the default WebRTC algorithm to avoid distortions that occur under such network conditions. The network conditions are composed of real-world network traces, captured over Wi-Fi and LTE networks, and simulated artificial network traces.
The resulting recordings are evaluated using ITU-T Rec. P.863 (03/2018) [13], a state-of-the-art objective listening quality assessment model. ITU-T Rec. P.863 is also known as POLQA, an abbreviation of Perceptual Objective Listening Quality Assessment. ITU-T Rec. P.863 shows a quality improvement with the proposed approach compared to the default WebRTC behaviour.
In this article, we propose a burst-aware jitter buffer management algorithm that improves voice quality. Section 2 reviews the related literature. Section 3 describes the proposed approach. Section 4 presents the methodology for the experiments conducted in this study. Section 5 presents and explains the results. Section 6 concludes the article.
2 JITTER BUFFER MANAGEMENT AND RELATED WORK
Research and development efforts on VoIP technologies have accelerated since the early 2010s with the increase in VoIP deployments. IP telephony, conferencing technologies, Over-The-Top services, and now the emergence of WebRTC have made VoIP technologies much more ubiquitous than ever before. Jitter buffer management for VoIP solutions is one of the important dimensions of these efforts.
2.1 Jitter and Jitter Buffer Management
VoIP applications typically produce voice packets at a constant rate, for instance one packet every 20 ms, except when mechanisms such as silence suppression or discontinuous transmission (DTX) are incorporated; in these cases, fewer packets are transmitted during inactive periods [20]. It is worth noting here that DTX is used in mobile VoIP systems, i.e., VoLTE and VoWiFi (operator services). The regularity of VoIP packets is impaired by the stochastically varying conditions of packet-switched networks, typically due to the effects of routing, queuing, scheduling, and serialization on the path through the network [19]. Therefore, RTP packets arrive at the destination with random jitter in their arrival times, and they may also arrive out of order. Because the decoder typically expects a speech packet at the packetisation frequency to output speech samples in periodic blocks, a jitter buffer is employed to absorb the jitter in the packet arrival time [1]. It is worth noting here that a packet is assumed to transport a single codec frame. Packets arriving later than their playout time are typically discarded. Additionally, if more packets arrive than the jitter buffer can hold, then packets will again be discarded. This situation is called buffer overflow. Consequently, the ability of the jitter buffer management logic to absorb network jitter improves as the jitter buffer size gets larger [1, 19].

However, VoIP services are sensitive to delay; therefore, it is critical to keep the end-to-end delay as low as possible to maintain bi-directional conversation. Conversational quality degrades if the end-to-end delay rises above a perceivable level and problems, such as parties starting to talk over each other, occur [10, 17, 20]. Long delays can also result in audible talker acoustic echo, which consequently reduces the interactivity [8]. While a larger jitter buffer prevents overflows to some extent, it may result in an extended end-to-end delay that degrades the conversational VoIP quality. Jitter buffer management solutions deal with this tradeoff by adequately adjusting the jitter buffer size.

Fig. 1. System described in Reference [16].
2.2 Jitter Buffer Management in WebRTC
Although there have been many studies on jitter buffer management, few focus on WebRTC. A small number of previous studies concern WebRTC's jitter buffer management, and they only evaluate its performance and share their findings [3, 6, 7, 14, 17, 18]. References [3, 7, 17, 18] show that WebRTC's voice quality suffers under network conditions with high jitter and that there is room for improvement. Moreover, it is worth noting here that the previous studies do not identify the root causes of the degradation introduced in WebRTC under impaired network conditions.
The IETF standardises some aspects of media handling in WebRTC; however, it does not define a Jitter Buffer Management (JBM) solution. The design and architectural concepts of JBM are left to the practitioners. The WebRTC open source project includes a module called NetEQ among its audio coding modules. The NetEQ module implements a jitter buffer management solution. A detailed analysis of Reference [16] and an investigation of the WebRTC NetEQ code base show that some of the core concepts come from "Method and receiver for determining a jitter buffer level" [16]. References [17, 18] also confirm that the jitter buffer management algorithm in NetEQ is based on Reference [16]. Figure 1 illustrates the receiver side of a VoIP system described in Reference [16].
Fig. 2. Simplified flow chart for the system described in Reference [16].

The essential idea presented in Reference [16] is founded on trading off the internal delay in the jitter buffer against packet loss by adapting to the current network conditions. The receiver side of the VoIP session collects statistical measures describing network conditions to monitor the nature and magnitude of the jitter. Inter-arrival times between consecutive packets are used to continuously update a probability mass function, which is then used to define an expected duration of an empty jitter buffer, i.e., the probability of underflowing the jitter buffer. A map of the expected duration of an empty buffer for a certain buffer level is produced. In Reference [16], the duration of an empty jitter buffer is also called the outage time, during which synthetic data or packet loss concealment data is generated. The combination of the two quantities, internal delay and robustness against network jitter, forms the basis of the optimisation problem. Reference [16] minimises a cost function, shown in Equation (1), that includes the jitter buffer delay and the expected duration of an empty buffer for a given buffer level, $B$. Thus, it proposes determining a target buffer level. The cost function refers to the total cost of the jitter buffer delay and the expected duration of an empty jitter buffer for a given buffer level, $B$:
$$\eta(B) = C \cdot B + E[\text{length}(\text{underflow}) \mid B] = C \cdot B + \int_{B}^{\infty} (t - B)\, f_{\tau}(t)\, dt \qquad (1)$$
In Equation (1), $C$ is the weighting factor, which sets the relative impact of the two quantities: a large $C$ indicates a higher impact of the internal buffer delay, whereas a small $C$ indicates a higher impact of buffer underflow. The objective is to determine the target buffer level $B$ that minimises the cost function.
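As a rough illustration of Equation (1), the sketch below evaluates the cost function over a discretised inter-arrival distribution and picks the level with the lowest cost. The function names, the bin width, the PMF values, and the values of C are assumptions made for this example; they are not taken from Reference [16] or from NetEQ.

```python
def expected_underflow(pmf, bin_ms, level_ms):
    """Discrete form of the integral in Equation (1): E[(t - B)+].

    pmf[i] is the assumed probability that the inter-arrival time is i * bin_ms.
    """
    return sum(p * max(i * bin_ms - level_ms, 0.0) for i, p in enumerate(pmf))


def target_buffer_level(pmf, bin_ms, c):
    """Return the candidate level B (in ms) that minimises c * B + E[(t - B)+]."""
    candidates = [i * bin_ms for i in range(len(pmf))]
    return min(
        candidates,
        key=lambda b: c * b + expected_underflow(pmf, bin_ms, b),
    )


# Hypothetical PMF over inter-arrival times of 0, 20, 40, 60, and 80 ms.
example_pmf = [0.0, 0.7, 0.2, 0.0, 0.1]
level_low_c = target_buffer_level(example_pmf, bin_ms=20, c=0.15)
level_high_c = target_buffer_level(example_pmf, bin_ms=20, c=0.5)
```

With this hypothetical PMF, C = 0.15 selects a 40 ms target and C = 0.5 selects a 20 ms target, which matches the statement that a larger C puts more weight on buffer delay.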
This target buffer level is then compared with the actual current buffer level. In case of a difference, the decoded signal samples are modified accordingly, i.e., lengthened or shortened. This process indirectly controls the current jitter buffer level and shifts it toward the target buffer level. As a result, the playout logic's demand for packets increases or decreases the rate at which packets are extracted from the jitter buffer, hence converging the current buffer level toward the target buffer level. The flow chart for this algorithm is illustrated in Figure 2.
Reference [16] does not define how the decoded signal samples are modified; however, investigation of the source code shows that WebRTC implements a timescale modification algorithm for lengthening or shortening the decoded signals.
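The comparison between the current and target buffer levels can be sketched as a small decision function. This is only an illustration of the control idea described above; the enum, the function name, and the `tolerance` parameter are hypothetical and do not reproduce the NetEQ implementation.

```python
from enum import Enum


class Operation(Enum):
    NORMAL = "normal"          # play the decoded signal unchanged
    ACCELERATE = "accelerate"  # shorten the signal so the buffer drains faster
    EXPAND = "expand"          # lengthen the signal so the buffer drains slower


def choose_operation(current_level, target_level, tolerance=0):
    """Pick a timescale operation that moves the buffer toward the target.

    Levels are expressed in packets; `tolerance` is an assumed dead band
    that avoids switching operations on very small differences.
    """
    if current_level > target_level + tolerance:
        return Operation.ACCELERATE
    if current_level < target_level - tolerance:
        return Operation.EXPAND
    return Operation.NORMAL
```

For example, a buffer holding 12 packets against a target of 8 would be accelerated, while a buffer holding 3 packets would be expanded.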
2.3 Analysis of NetEQ Source Code
NetEQ was originally developed as a codec-agnostic joint adaptive jitter buffer management and error concealment solution by GIPS [9], which was later acquired by Google [23]. The idea of joint jitter management and packet loss concealment is studied in the literature as an enhanced alternative to treating them as distinct components [15, 21]. Note that the internal packet loss concealment procedure of the Opus codec is not used in NetEQ.$^{1}$
The NetEQ source code$^{2}$ is openly available in the WebRTC repository. We investigated$^{3}$ the source code of the NetEQ component in the WebRTC code base. The investigation reveals interesting details regarding the internal architectural building blocks and call flows. We provide a summarised description of the architecture and of the flow of actions for an RTP packet that arrives into NetEQ and for the extraction of audio frames from NetEQ for playout.
The NetEqImpl class, defined in neteq_impl.h and implemented in neteq_impl.cc, is the main entry point and the overall governing control logic of NetEQ. It exposes methods to be used by the other components in a WebRTC application. Two focal methods, among others, are InsertPacket and GetAudio. They are typically consumed by two different threads: a network/socket thread that places the incoming packets into the packet buffer, and a renderer thread that fetches the audio frames to deliver to the sound card.
The NetEqImpl class has a packet buffer object, an instance of the PacketBuffer class. PacketBuffer is the actual jitter buffer storage in WebRTC, defined in packet_buffer.h and implemented in packet_buffer.cc. As the RTP packets arrive from the network components, they are placed into the PacketBuffer, which provides a First-in, First-out data structure for the packets. It is worth mentioning that the audio payload is in encoded form while it is in the PacketBuffer. NetEQ allows the user to define a maximum buffer capacity for PacketBuffer. In this study, we exploited this setting to adjust the maximum capacity for different experiments. In the absence of a user-defined capacity, the default capacity used in WebRTC was 50 packets since its inception; however, it was increased to 200 packets$^{4}$ because of reported issues with frequent buffer flushes caused by buffer overflows and with packet loss concealment operations occurring under network conditions with high delay.$^{5}$ These conditions are similar to the conditions described in later sections of the article for our experiments. It is acknowledged that the increased capacity results in higher jitter buffer delay, while the overall positive impact is still greater. Two hundred packets, or even 50 packets, can be deemed too many considering the nature of VoIP and the delay recommendations in Reference [11]; therefore, it is advantageous to configure an adequate capacity depending on the requirements of a given application. In addition to PacketBuffer, there are two further buffers in NetEQ that are worth mentioning to describe NetEQ principles, SyncBuffer and AlgorithmBuffer, which are explained in the following paragraphs.
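The overflow behaviour discussed in this article can be modelled with a few lines of code. The class below is a simplified stand-in written for illustration, not the PacketBuffer class from the WebRTC code base; its name, its methods, and the `discarded` counter are assumptions.

```python
from collections import deque


class FlushingPacketBuffer:
    """Toy FIFO buffer that discards everything when it overflows."""

    def __init__(self, capacity=200):
        self.capacity = capacity  # maximum number of packets held
        self.packets = deque()
        self.discarded = 0        # total packets lost to flushes

    def insert(self, packet):
        """Insert a packet; return True if the insert triggered a flush."""
        flushed = False
        if len(self.packets) >= self.capacity:
            # Buffer is full: drop every stored packet to make room.
            self.discarded += len(self.packets)
            self.packets.clear()
            flushed = True
        self.packets.append(packet)
        return flushed

    def pop(self):
        """Return the oldest packet, or None if the buffer is empty."""
        return self.packets.popleft() if self.packets else None
```

With a capacity of 10, inserting 11 packets without any playout in between leaves only the last packet in the buffer and counts 10 packets as discarded.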
The NetEqImpl class has a delay manager object, an instance of the DelayManager class, defined in delay_manager.h and implemented in delay_manager.cc. After the last received packet is inserted into the PacketBuffer, DelayManager is updated, which in turn updates the inter-arrival time histogram for the incoming packets and the other statistics it holds. Based on these calculations and the current packet buffer level, the target packet buffer level is adjusted. Examination of the delay manager implementation shows that it generally aligns with the concepts described in Reference [16].
In summary, an incoming packet is received through the InsertPacket method in NetEqImpl, which places it into the packet buffer and updates the delay manager. DelayManager allows the setting of some user-defined parameters, such as minimum and maximum base delays, which are taken into consideration when calculating the target buffer level.
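One way to picture the inter-arrival time histogram is as a probability mass function that is nudged toward each new observation. The sketch below assumes an exponential forgetting factor and an initial state with all mass at one packet interval; both choices, as well as the class and parameter names, are illustrative assumptions rather than details of DelayManager.

```python
class InterArrivalHistogram:
    """Toy PMF of inter-arrival times, measured in packet intervals."""

    def __init__(self, num_bins=64, packet_ms=20, forget=0.99):
        self.packet_ms = packet_ms
        self.forget = forget        # assumed weight kept from the old PMF
        self.pmf = [0.0] * num_bins
        self.pmf[1] = 1.0           # assumed start: perfectly regular arrivals
        self.last_arrival_ms = None

    def on_packet(self, arrival_ms):
        """Update the PMF with the gap since the previous packet."""
        if self.last_arrival_ms is not None:
            gap_ms = arrival_ms - self.last_arrival_ms
            # Round the gap to the nearest whole number of packet intervals.
            index = int(gap_ms / self.packet_ms + 0.5)
            index = min(max(index, 0), len(self.pmf) - 1)
            self.pmf = [p * self.forget for p in self.pmf]
            self.pmf[index] += 1.0 - self.forget
        self.last_arrival_ms = arrival_ms
```

Because every update scales the old values by `forget` and adds `1 - forget` to one bin, the values continue to sum to one.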
GetAudio is invoked to provide 10 ms of decoded audio data, which will typically be delivered to a sound card for playout by the application. GetAudio starts a decision-making process to determine how best to reconstruct 10 ms of audio from the already received packets. The decision is made through the GetDecision method of the DecisionLogic instance in NetEqImpl. DecisionLogic is defined in decision_logic.h and implemented in decision_logic.cc. The output of GetDecision is an Operations enum, which describes the timescale modification operation that should be performed to prepare the requested 10 ms of audio under the current circumstances. The Operations enum is defined in the defines.h file and has values such as kNormal, kExpand, and kAccelerate, which correspond to normal, slower, and faster playout, respectively. While determining the operation that should be executed, DecisionLogic utilises statistics and histograms, data such as recent packet loss, delay, and jitter profiles in the network, and the current and target packet buffer levels, provided by other classes such as DelayManager, PacketBuffer, and SyncBuffer. NetEQ allows the user to enable or disable a fast accelerate mode, which, if enabled, can remove multiple pitch periods by relaxing the requirements on finding strong correlations for more aggressive scaling. This setting is not enabled during the experiments carried out in this study.
After DecisionLogic is executed, a packet is extracted from PacketBuffer and the encoded payload is decoded. The decoded audio is processed with the timescale modification operation chosen in the decision logic stage, and the processed audio is placed into AlgorithmBuffer, which is an ephemeral buffer cleared at each GetAudio call. The processed audio data in the AlgorithmBuffer, for instance accelerated or expanded, is then placed into SyncBuffer, which is the long-lived interim playout buffer that holds the audio frames after they are decoded and processed with timescale modification based on the decision logic operation. Finally, 10 ms of audio data is extracted from SyncBuffer and written into the memory location provided by the pointer passed as an argument to the GetAudio method call.
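The path from PacketBuffer through AlgorithmBuffer and SyncBuffer can be sketched with plain Python containers. This is a heavily simplified toy model: decoding is replaced by copying a list of samples, timescale modification is replaced by crude truncation or repetition, underruns are zero-filled instead of concealed, and a packet is only decoded when the playout buffer holds less than 10 ms. All of these are assumptions made for illustration and none of them reflects the actual NetEQ code.

```python
from collections import deque

FRAME_SAMPLES = 480  # 10 ms of audio at 48 kHz


def get_audio(packet_buffer, sync_buffer, operation="normal"):
    """Return one 10 ms frame, pulling a packet only when more audio is needed."""
    algorithm_buffer = []  # ephemeral: rebuilt on every call
    if len(sync_buffer) < FRAME_SAMPLES and packet_buffer:
        decoded = list(packet_buffer.popleft())  # stand-in for codec decoding
        if operation == "accelerate":
            decoded = decoded[: len(decoded) // 2]             # toy shortening
        elif operation == "expand":
            decoded = decoded + decoded[len(decoded) // 2 :]   # toy lengthening
        algorithm_buffer.extend(decoded)
    sync_buffer.extend(algorithm_buffer)  # long-lived playout buffer
    # Emit exactly one frame; zeros stand in for concealment on underrun.
    return [sync_buffer.popleft() if sync_buffer else 0 for _ in range(FRAME_SAMPLES)]
```

In this model a single 20 ms packet (960 samples) played with the normal operation feeds two consecutive calls, and a third call on empty buffers returns silence.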
The interested reader is referred to the code base for further details on the inner workings of NetEQ. We are also planning a separate study and an article dedicated to the analysis and documentation of NetEQ, considering its broad dimensions and the potential benefit of such a study for the research community in the area.
This article proposes a novel approach to better manage the jitter buffer for WebRTC with the aim of improving voice quality. The approach is realised with an implementation in NetEQ, the related component of WebRTC. By tracking the incoming packets, it estimates the number of packets due to arrive in a burst. Instead of discarding all the packets in the buffer, as WebRTC does, it discards only the number of packets necessary for all the packets in the burst to fit into the buffer.
Packets leave the RTP sender side regularly at every packetisation interval, say, 20 ms, 30 ms, and so on. The packetisation interval, called $x$ in this article, is negotiated during RTP session establishment for codecs that support it, e.g., Opus; otherwise, it is implicitly defined by the codec frame length, which is often fixed. However, it is well known that packets do not arrive at the receiver side very regularly. When the regularity of the packets is impaired, it may indicate the presence of jitter in the network. If the time elapsed since the arrival of the last packet is several multiples of the packetisation interval, then a packet burst may be due to occur. To estimate the number of packets in the burst, the timestamp of the arrival of the last packet is tracked. The time elapsed since the arrival of the last packet is, in this article, called $\Delta t$. By dividing $\Delta t$ by the packetisation interval $x$, the number of packets in the burst, $k$, is deduced as in Equation (2).
Packet bursts typically present issues when the number of packets in the burst is greater than the number of packets the jitter buffer can hold. As a simple, concrete example, consider a case where the maximum number of packets that the jitter buffer can hold is $B$, the number of packets in the burst is $k$, and $k$ is greater than $B$. Here, $k$ is the number of packets arriving at nearly the same time in a packet burst scenario, and the buffer is assumed to be empty. When the $(B+1)$th packet arrives, WebRTC, as per its current design, flushes all $B$ packets in the buffer. If the jitter buffer is not empty and there are already packets in it, then the jitter buffer will be filled even quicker, and a smaller $k$ will be sufficient to fill it. It should be reiterated here that the WebRTC jitter buffer implementation is unaware of the number of packets in the burst. Equation (2) can be used to calculate the minimum number of packets to flush, which reduces the number of packets discarded. It may not actually be necessary to flush the entire buffer for the remaining $k-B$ packets to fit into the buffer. By making WebRTC burst-aware as per Equation (2), it is possible to calculate the minimum number of packets to remove, which results in playing out the maximum number of packets to the user:
$$n = k - B = \frac{\Delta t}{x} - B, \qquad (2)$$
where
$\Delta t$: Time elapsed since last packet arrived
$x$: Packetisation interval
$k$: Estimated number of packets in burst
$B$: Jitter buffer capacity
$n$: Number of packets to remove.
An example scenario:
Let the maximum jitter buffer capacity, $B$, be 10 packets.
Let the packetisation interval, $x$, be 20 ms.
Let the time elapsed since the last packet arrived, $\Delta t$, at a given burst, be 220 ms.
Then, according to Equation (2), the number of packets in the burst, $k$, will be 11.
The number of packets to remove, $n$, will therefore be just 1 according to Equation (2).
WebRTC would normally remove all of the 10 packets. However, with the burst-aware approach, 9 packets are saved and played out to the user. The impact of saving 9 packets on listening quality is detailed in the results section, Section 5.
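The example scenario can be reproduced with a short simulation. The sketch below is a hedged illustration of Equation (2), not the authors' NetEQ patch: the function names are hypothetical, the floor division and the clamp at zero are assumptions, and removing the oldest packets first is also an assumption, since the text does not state which packets are removed.

```python
from collections import deque


def packets_to_remove(delta_t_ms, packet_ms, capacity):
    """Equation (2): n = delta_t / x - B, floored and clamped at zero (assumed)."""
    burst_size = int(delta_t_ms // packet_ms)  # estimated k
    return max(burst_size - capacity, 0)


def insert_burst(burst, capacity, n_remove=None):
    """Insert a burst into an empty buffer and return (retained, discarded).

    n_remove=None models the default behaviour (flush the whole buffer);
    otherwise at most n_remove packets are dropped on overflow, oldest first.
    """
    buffer, discarded = deque(), 0
    for packet in burst:
        if len(buffer) >= capacity:
            # At least one packet must go to make room for the new arrival.
            drop = capacity if n_remove is None else max(n_remove, 1)
            for _ in range(min(drop, len(buffer))):
                buffer.popleft()
                discarded += 1
        buffer.append(packet)
    return list(buffer), discarded


n = packets_to_remove(delta_t_ms=220, packet_ms=20, capacity=10)
burst = list(range(1, 12))  # 11 packets arriving almost together
default_kept, default_lost = insert_burst(burst, capacity=10)
aware_kept, aware_lost = insert_burst(burst, capacity=10, n_remove=n)
```

For these inputs `n` is 1, the default strategy keeps 1 packet and discards 10, and the burst-aware strategy keeps 10 packets and discards 1, so 9 more packets reach playout.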
One might consider the end-to-end delay impact of the proposed approach. The proposed approach aims to minimise the number of packets that are dropped due to buffer overflows while not directly imposing a playout operation mode (normal, accelerate, expand) for the packets. However, NetEQ logic typically chooses the accelerate mode to speed up the playout when the buffer is above the target buffer level in the packet burst cases. This topic is further discussed in Section 5.4.
4 METHODOLOGY
This section explains the methodology used for the study. Section 4.1 describes the implementation methodology for the proposed approach. Section 4.3 explains the samples used. The experiments are carried out for various maximum jitter buffer sizes under various impaired network conditions, as explained in Sections 4.2 and 4.4, respectively.
Methods for assessing listening quality typically come in two forms, objective and subjective. In this study, we relied heavily on the state-of-the-art objective model, ITU-T Rec. P.863 (03/2018). The burst-aware approach reduces the rate of packet drops compared to the default WebRTC approach; therefore, we also analysed the reduction in the packet drop rate.
We modify the source code of the NetEQ component in the WebRTC audio coding module to implement the proposed burst-aware approach. The NetEQ code base comes with a utility called neteq_rtpplay that takes an RTP file containing the RTP packets to be placed into the jitter buffer. neteq_rtpplay is also modified to perform experiments and validate the approach in a reproducible manner. Each test is first run with the default WebRTC approach of NetEQ and later with the modified version of NetEQ for the burst-aware approach. neteq_rtpplay uses and configures NetEQ by consuming the methods it exposes. The implemented logic for burst awareness in NetEQ is made configurable, and it can be enabled or disabled. The modified neteq_rtpplay exploits this configuration during the test runs, so the automated tests can be executed without compiling the binary again during the experiments, which shortens the experiment feedback loop.
4.2 Jitter Buffer Size
Experiments are run for different maximum jitter buffer sizes in the NetEQ component. Moreover, we wanted to evaluate the effects of network conditions on different jitter buffer sizes. Therefore, NetEQ is configured with the following maximum jitter buffer sizes: 10, 15, 20, 25, 30, 35, 40, 45, and 50 packets, where 50 packets was the previous default jitter buffer capacity for WebRTC. To cover the current WebRTC default of 200 packets, additional experiments are also run with a jitter buffer capacity of 200 packets.
4.3 Speech Samples
Samples are carefully chosen according to best practices and ITU recommendations. ITU suggests that a sample should contain a start period and an end period that are each at least 250 ms long. Moreover, a silence period between the sentences/active periods should be at least 500 ms long. For further details, see the corresponding section of ITU-T Rec. P.863.1 [13].
Four reference speech signals, i.e., the English samples from ITU-T Rec. P.501 [12] representing two male and two female speakers, were used in the experiments. Each sample consists of two utterances with a small pause between them. The samples are 8 s long and stored as 16-bit, 48-kHz linear PCM.
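The timing constraints quoted above can be checked mechanically once the active speech periods of a sample are known. The following is a minimal sketch, assuming that the 250-ms start and end periods refer to leading and trailing silence and that the active periods are supplied as (start, end) pairs in milliseconds; the function name and the example boundaries are hypothetical and are not taken from the study.

```python
from typing import List, Tuple

MIN_EDGE_SILENCE_MS = 250  # start/end period quoted in the text
MIN_PAUSE_MS = 500         # minimum pause between active periods quoted in the text


def check_sample_timing(active_periods_ms: List[Tuple[int, int]],
                        duration_ms: int) -> List[str]:
    """Return a list of timing violations; an empty list means the sample passes."""
    if not active_periods_ms:
        return ["no active speech periods given"]
    problems = []
    periods = sorted(active_periods_ms)
    # Assumption: the start and end periods are silence around the speech.
    if periods[0][0] < MIN_EDGE_SILENCE_MS:
        problems.append("leading silence shorter than 250 ms")
    if duration_ms - periods[-1][1] < MIN_EDGE_SILENCE_MS:
        problems.append("trailing silence shorter than 250 ms")
    # Compare the end of each active period with the start of the next one.
    for (_, prev_end), (next_start, _) in zip(periods, periods[1:]):
        pause = next_start - prev_end
        if pause < MIN_PAUSE_MS:
            problems.append(f"pause of {pause} ms is shorter than 500 ms")
    return problems


# Hypothetical boundaries for an 8-s sample with two sentences.
print(check_sample_timing([(300, 3500), (4200, 7700)], 8000))  # []
```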
4.4 Network Conditions
In this study, we have run our experiments under three main types of network conditions: Wi-Fi, LTE, and simulated conditions. Both the Wi-Fi and LTE network conditions are packet traces coming from real-world device measurements, whereas the simulated traces are artificially created. The traces contain one value per frame indicating a delay, and consequently a jitter, value. The Wi-Fi network conditions are based on packet traces captured on a university campus at various times and locations to cover different network behaviours [3]. It is worth noting here that the sender and receiver endpoint machines were time synchronised in the setup of the corresponding measurements. The Wi-Fi packet traces are available in the Online Appendix. The LTE network conditions are based on packet traces from an Internet Service Provider measurement lab where the performance of WebRTC VoIP quality over LTE technologies under different LTE radio settings was analysed [18]. Some specifics of the LTE packet traces are covered in the Online Appendix; however, the interested reader is referred to Reference [18] for further details on the LTE network traces.
The simulated conditions are artificially created to represent packet bursts whose sizes range from the maximum size of the jitter buffer, in packets, up to a predefined number of max_number_of_packets_in_burst packets, as formulated in Equation (3):
jitter_buffer_size < burst_size < max_number_of_packets_in_burst. (3)
We have chosen max_number_of_packets_in_burst as 75 in the experiments carried out under simulated network conditions. The reason is that, in the real-world measurements, the maximum number of packets in a burst was 64 packets, which occurred in the LTE case. Therefore, we increased max_number_of_packets_in_burst to 75 to cover even more extreme cases. Additionally, packet bursts of 205 and 225 packets are tested with a jitter buffer capacity of 200 packets to cover the latest WebRTC default jitter buffer capacity.
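A simulated trace of this kind can be sketched as follows. This is only an illustration under stated assumptions: the trace holds one extra-delay value per 20-ms frame, and each packet of a burst is held back just long enough for the whole burst to arrive together (consistent with the 55-packet, 1,100-ms example discussed in Section 5.4). The function names are hypothetical, and the upper bound of the burst range is treated as inclusive because the results section reports bursts of 11 up to 75 packets.

```python
from typing import List

FRAME_MS = 20  # packetisation interval used in the experiments


def burst_delay_trace(total_frames: int, burst_start: int,
                      burst_size: int) -> List[int]:
    """Per-frame extra delay (ms) so that burst_size consecutive packets arrive together."""
    if burst_start < 0 or burst_size < 1 or burst_start + burst_size > total_frames:
        raise ValueError("burst does not fit inside the trace")
    delays = [0] * total_frames
    for k in range(burst_size):
        # Packet k of the burst is held until the whole burst is released.
        delays[burst_start + k] = (burst_size - k) * FRAME_MS
    return delays


def burst_sizes(jitter_buffer_size: int,
                max_number_of_packets_in_burst: int = 75) -> range:
    """Burst sizes larger than the buffer, up to and including the configured maximum."""
    return range(jitter_buffer_size + 1, max_number_of_packets_in_burst + 1)


trace = burst_delay_trace(total_frames=400, burst_start=35, burst_size=55)
print(trace[35])             # 1100
print(len(burst_sizes(10)))  # 65
```

With these assumptions, the first packet of a 55-packet burst is delayed by 1,100 ms and the last by 20 ms, so all 55 packets share a single arrival instant.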
It is worth noting here that, with the help of timescale modification techniques, WebRTC handles packet bursts well as long as they are smaller than the jitter buffer size. Therefore, the experiments are only run for packet bursts that contain more packets than the capacity of the buffer. We also applied the packet bursts in a way that allowed us to observe the impact of bursts at different locations of the speech signal. We studied the impact when the packet burst corresponds to the beginning, the middle, and the end of a talk spurt. For the middle of a talk spurt, we studied the impact both during the utterance of a word and during the transition from one word to another.
Therefore, packet bursts are introduced, for each sentence, in the following sections of each signal:
Beginning of a talk spurt
Middle of a talk spurt
Transition between talk spurts/words
End of a talk spurt
A waveform of a typical sample used in our experiments is shown in Figure 3. It is an 8-s-long male speech signal with two sentences. The arrows in Figure 3 show the locations of the bursts introduced.
4.5 RTP Session Settings
The packetisation interval is 20 ms. Since the speech signals are 8 s long, this results in 400 RTP packets per signal. The codec used is Opus operating at 32 kbps (the NetEQ default) with a sampling rate of 48 kHz. Variable bit rate and VoIP mode are used with Opus. The base delay is set to zero, and the fast accelerate mode, mute state, and rtx handling parameters are not enabled in the NetEQ configuration. Voice activity detection is not enabled. It is worth noting here that the proposed approach is codec agnostic. Opus is selected in the experiments as it is the primary voice codec for WebRTC [4].
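The figures in this section follow from simple arithmetic, sketched below. The packet count comes directly from the text; the average payload size and the buffer depth in milliseconds are derived values that the text does not state, and the constant and function names are illustrative only.

```python
SIGNAL_DURATION_MS = 8_000
PACKET_INTERVAL_MS = 20
BITRATE_BPS = 32_000

packets_per_signal = SIGNAL_DURATION_MS // PACKET_INTERVAL_MS

# Average payload only: Opus runs in variable-bit-rate mode, so real packets differ in size.
avg_payload_bytes = BITRATE_BPS * PACKET_INTERVAL_MS // 1000 // 8


def buffer_depth_ms(capacity_packets: int) -> int:
    """Amount of audio (ms) that a buffer of the given packet capacity can hold."""
    return capacity_packets * PACKET_INTERVAL_MS


print(packets_per_signal)    # 400
print(avg_payload_bytes)     # 80
print(buffer_depth_ms(50))   # 1000
print(buffer_depth_ms(200))  # 4000
```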
4.6 Assessment Methods
ITU-T Rec. P.863 (03/2018) is used to perform objective listening quality assessments. Each time an experiment is run, a degraded wav file is produced, which is provided to the ITU-T Rec. P.863 model together with the reference (original) wav audio file. It should be noted here that the fullband mode of the ITU-T Rec. P.863 model is deployed in the experiments.
4.7 Packet Drops
Each experiment produces logs that highlight the important events during the tests. An example of such an event is a packet drop. When a packet drop occurs, in our case due to a buffer overflow, a log statement is written. At the end of each test, the number of packets dropped is carried over into the results for further analysis.
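Once the number of dropped packets is known for a test, it can be expressed as a percentage. The sketch below assumes that the rate is taken relative to the 400 RTP packets of one 8-s signal; the text does not state how the reported percentages were normalised, so this denominator and the function name are assumptions.

```python
def packet_drop_rate(dropped: int, total_packets: int = 400) -> float:
    """Dropped packets as a percentage of the packets in one test signal."""
    if total_packets <= 0:
        raise ValueError("total_packets must be positive")
    if not 0 <= dropped <= total_packets:
        raise ValueError("dropped must be between 0 and total_packets")
    return 100.0 * dropped / total_packets


# Example: 25 dropped packets versus 3 dropped packets in a 400-packet signal.
print(packet_drop_rate(25))  # 6.25
print(packet_drop_rate(3))   # 0.75
```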
Fig. 3. Original waveform with burst locations.
5 RESULTS
The experiments are carried out with various parameters as explained in Section 4. These parameters are as follows:
Jitter buffer strategy as explained in Section 4.1, i.e., default WebRTC vs. burst-aware approach.
Jitter buffer sizes as explained in Section 4.2.
Speech samples as explained in Section 4.3.
Network conditions as explained in Section 4.4.
The following subsections present the results with respect to listening quality and packet drops, as explained in Sections 4.6 and 4.7, respectively, for both the default WebRTC approach and the proposed burst-aware approach under these parameters. The listening quality results come in the form of MOS-LQOf values, which correspond to Quality of Experience from a user perspective. Packet drops, however, can be thought of as a Quality of Service metric.
5.1 Wi-Fi Results
Results obtained for the Wi-Fi network conditions are presented in Figure 4. A total of 96 Wi-Fi packet traces were used in the experiments. For each buffer capacity, all 96 packet traces were tested against the 4 speech samples described in Section 4.3. In total, 384 test conditions were produced, which resulted in 384 degraded samples per buffer capacity, i.e., buffer size. Overall, 1,536 test runs were executed.
Fig. 4. Average MOS-LQOf as predicted by the ITU-T Rec. P.863 model along with 95% confidence intervals and packet drop results for the Wi-Fi packet traces per buffer size tested.
It can be clearly seen from Figure 4 that the number of dropped packets decreased significantly with the burst-aware approach, especially for the smaller buffer capacities. For example, for a buffer size of 10 in Figure 4, the share of dropped packets went down from approximately 5% to 2% on average. The average MOS-LQOf results, depicted in the upper part of the figure, also confirm the quality improvement. For a buffer size of 10 over 384 test conditions, the quality perceived by the end user is improved by 0.42 MOS-LQOf on average.
Of the 384 test conditions, there are several with packet burst occurrences for which the improvement is even more significant. For instance, for a test condition with a jitter buffer capacity of 25 packets, we found a case where the MOS-LQOf was 1.10 and 4.28 for the default WebRTC approach and the burst-aware approach, respectively. The burst-aware approach thus achieved an improvement of 3.18 MOS-LQOf. On the QoS front, the number of packets lost in the jitter buffer was 25 and 3 for the default WebRTC approach and the burst-aware approach, respectively. The burst-aware approach reduced the number of packets dropped by 22 packets.
It is worth noting here that the real-world Wi-Fi packet traces [3] used in the experiments did not include any delay that would result in a packet burst of more than 25 packets. Therefore, no packet was dropped for a buffer capacity of 30 or more, and the results in these cases are the same for both the proposed burst-aware and the default WebRTC approaches. For this reason, they are not included in Figure 4. Demo samples are provided in the Online Appendix.
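Averages with 95% confidence intervals, such as those plotted per buffer size, can be computed along the following lines. This is a sketch only: the text does not say how the intervals were derived, so the normal approximation (z = 1.96) used here is an assumption, and the function name is illustrative.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_ci95(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, half-width of the 95% confidence interval) for MOS-LQOf scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    mean = statistics.fmean(scores)
    # Normal approximation; reasonable when there are hundreds of test conditions.
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, half_width


# Made-up scores, only to show the call; they are not results from the study.
mean, half_width = mean_ci95([1.0, 2.0, 3.0, 4.0, 5.0])
print(f"{mean:.2f} +/- {half_width:.2f}")
```

For small numbers of conditions, a Student's t quantile would be a more conservative choice than the fixed 1.96 factor.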
Fig. 5. Average MOS-LQOf as predicted by the ITU-T Rec. P.863 model along with 95% confidence intervals and packet drop results for the LTE packet traces per buffer size tested.
5.2 LTE Results
Results obtained for the LTE network conditions are presented in Figure 5. A total of 33 LTE packet traces were used in the experiments. For each buffer capacity, i.e., buffer size, all 33 packet traces were tested against the 4 speech samples described in Section 4.3. In total, 132 test conditions were produced, which resulted in 132 degraded samples per buffer capacity. Overall, 1,188 test runs were executed.
It is worth noting here that the LTE network conditions, compared to the Wi-Fi conditions, include more extreme delay and jitter values. This is due to the nature of the LTE technology, as explained in Reference [17]. Therefore, buffer overflows occur for jitter buffer sizes of up to 50 packets, as can be seen in Figure 5.
As reported above for the Wi-Fi results, the burst-aware approach again outperformed the default WebRTC approach in terms of QoS under the LTE network conditions, i.e., the number of packets dropped is lower for the burst-aware approach. As can be seen from Figure 5, the number of packets dropped was again reduced significantly. Regarding the Quality of Experience in the LTE setting, the ITU-T Rec. P.863 results show that the burst-aware approach mostly offers an improved QoE, i.e., an increase in the quality perceived by the end user of up to 0.51 MOS-LQOf on average, observed for a jitter buffer size of 10 packets.
Similar to the previous case, i.e., the Wi-Fi results presented in Section 5.1, only a few of the 132 test conditions include packet bursts of 30 packets or more. It should be mentioned here that the improvement for those cases, i.e., the test conditions involving packet bursts larger than the buffer capacity, is much more significant than the averages illustrated in Figure 5. For instance, for a test condition with a jitter buffer capacity of 45 packets, we found a case where the MOS-LQOf was 1.33 and 4.15 for the default WebRTC approach and the burst-aware approach, respectively. The burst-aware approach thus achieved an improvement of 2.82 MOS-LQOf. On the QoS front, the number of packets lost in the jitter buffer was 47 and 18 for the default WebRTC approach and the burst-aware approach, respectively. The burst-aware approach reduced the number of packets dropped by 29 packets. Demo samples are provided in the Online Appendix.
Fig. 6. Average MOS-LQOf as predicted by the ITU-T Rec. P.863 model along with 95% confidence intervals and packet drop results for the simulated packet traces per buffer size tested.
Results for the simulated network conditions are presented in Figure 6. In the simulated setting, we produced many more network conditions to represent a wider range of packet bursts, from 11 packets up to 75 packets, i.e., 65 different packet burst configurations. Each packet burst configuration is also placed carefully to correspond to eight different locations in each of the four speech signals, as described in Section 4.4. Therefore, a total of 2,080 test conditions were produced, which resulted in 2,080 degraded samples per buffer capacity, offering a higher statistical significance/power of the obtained results than the previous cases focusing on the real-world traces. Overall, 18,720 test runs were evaluated.
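The test counts above can be reproduced by enumerating the experiment matrix. In the sketch below, the eight locations are modelled as four burst positions in each of the two sentences, which is how Section 4.4 describes them; the sample labels and variable names are placeholders rather than identifiers from the study.

```python
from itertools import product

BUFFER_SIZES = [10, 15, 20, 25, 30, 35, 40, 45, 50]
SPEECH_SAMPLES = ["female_1", "female_2", "male_1", "male_2"]  # placeholder labels
BURST_SIZES = range(11, 76)  # 11 up to and including 75 packets
BURST_POSITIONS = ["beginning", "middle", "transition", "end"]
SENTENCES = [1, 2]

# One test condition = one burst size at one location of one speech sample.
conditions = list(product(BURST_SIZES, SENTENCES, BURST_POSITIONS, SPEECH_SAMPLES))
# Every condition is repeated for every buffer size.
runs = list(product(BUFFER_SIZES, conditions))

print(len(BURST_SIZES))  # 65
print(len(conditions))   # 2080
print(len(runs))         # 18720
```

The same pattern gives the counts for the real-world traces: 96 Wi-Fi traces or 33 LTE traces take the place of the burst sizes and locations.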
The results for the simulated network conditions depicted in Figure 6 clearly show the impact of the burst-aware approach in the presence of packet bursts. In terms of Quality of Experience, there is a difference of about 0.16 to 0.38 MOS-LQOf in favour of the burst-aware approach. Regarding the Quality of Service metric, the superior performance of the burst-aware approach over the default WebRTC approach is also apparent in Figure 6. For instance, the packet loss rate is 4.81% for the default WebRTC approach and only 1.25% for the burst-aware approach.
Fig. 7. Average MOS-LQOf as predicted by the ITU-T Rec. P.863 model along with 95% confidence intervals under simulated packet traces per burst location.
The simulated network conditions are adjusted to show the effects of the network impairments at different locations of the speech signals. Figure 7 presents the impact of packet bursts at the different burst locations, i.e., the start of a talk spurt, during a talk spurt, between talk spurts, and the end of a talk spurt, for both investigated approaches, i.e., the burst-aware approach and the default approach. As can be seen from Figure 7, the burst-aware approach significantly outperforms the default WebRTC approach for all of the burst locations.
It is also observed that the packet drop behaviour of the burst-aware approach converges to that of the default WebRTC approach when the number of packets in the burst is close to a multiple of the maximum jitter buffer capacity. This behaviour can be clearly seen in the lower part of Figure 8.
Fig. 8. Average MOS-LQOf as predicted by the ITU-T Rec. P.863 model along with 95% confidence intervals and packet drop results under simulated packet traces for a maximum jitter buffer size of 10 packets.
Similarly to the Wi-Fi and LTE results presented in Sections 5.1 and 5.2, the test conditions are the same for all the buffer sizes, allowing a fair comparison between the different buffer sizes. This means that test conditions with a high number of packets in the burst occur less often than those with a low number of packets in the burst. It is worth noting here that this reflects reality, as packet bursts involving a high number of packets are less frequent in current networks than those involving a low number of packets. For instance, for a buffer capacity of 50 packets, only the packet bursts containing 51 to 75 packets cause buffer overflows, whereas packet bursts containing 11 to 50 packets do not. Therefore, the improvement for those cases, e.g., packet bursts containing 51 to 75 packets for a buffer capacity of 50 packets, is much more significant than the averages illustrated in Figure 6. For instance, for a test condition with a jitter buffer capacity of 10 packets, we found a case where the MOS-LQOf was 1.78 and 3.98 for the default WebRTC approach and the burst-aware approach, respectively. The burst-aware approach thus achieved an improvement of 2.2 MOS-LQOf. On the QoS front, the number of packets lost in the jitter buffer was 10 and 1 for the default WebRTC approach and the burst-aware approach, respectively. The burst-aware approach reduced the number of packets dropped by 9 packets. Demo samples are provided in the Online Appendix.
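Why only bursts larger than the buffer capacity cause losses under the default strategy can be illustrated with a toy model of the flush-on-full behaviour described in this article. The sketch below assumes that the buffer is empty when the burst arrives and that no packet is played out while the burst is being inserted; both are simplifications, the function name is hypothetical, and the burst-aware strategy is deliberately not modelled here.

```python
def default_flush_drops(burst_size: int, capacity: int) -> int:
    """Packets lost when a burst arrives at once and a full buffer is flushed on insert."""
    if burst_size < 0 or capacity < 1:
        raise ValueError("burst_size must be >= 0 and capacity >= 1")
    buffered = 0
    dropped = 0
    for _ in range(burst_size):
        if buffered >= capacity:
            # Full buffer: everything stored so far is discarded.
            dropped += buffered
            buffered = 0
        buffered += 1  # the incoming packet is then stored
    return dropped


print(default_flush_drops(50, 50))  # 0
print(default_flush_drops(51, 50))  # 50
print(default_flush_drops(21, 10))  # 20
```

In this simplified model, a burst of exactly the buffer capacity is absorbed without loss, while a single additional packet costs the entire buffer content, which is the behaviour the burst-aware approach is designed to avoid.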
As mentioned in Section 2.3, the default WebRTC jitter buffer capacity was changed from 50 to 200 packets. Therefore, we also experimented with the current default jitter buffer capacity for completeness. Since buffer overflows can, in this case, only happen for packet bursts containing more than 200 packets, we introduced two bursts of 205 and 225 packets. The MOS-LQOf and packet drop results are shown in Figure 9.
The default WebRTC approach causes significant speech loss in these experiments. Figure 10 clearly shows that, for a given speech signal, the default WebRTC approach lost the first sentence almost entirely, while the burst-aware approach preserved the entire sentence. Demo samples are provided in the Online Appendix.
5.4 End-to-end Delay Impact of the Proposed Approach
The proposed approach reduces the number of packets dropped due to buffer overflows, which has the benefit of playing out more packets and preventing speech loss compared to the default WebRTC approach. NetEQ activates the acceleration mode to speed up the playout when the buffer is above the target buffer level, e.g., when it is nearly full in the packet burst cases. Since the burst-aware approach prevents packets from being dropped due to buffer overflows, there are more packets to be played out than with the default WebRTC approach, and it can therefore take longer for the buffer to settle at the target buffer level.
Fig. 9. Average MOS-LQOf as predicted by the ITU-T Rec. P.863 model along with 95% confidence intervals and packet drop results under simulated packet traces for a maximum jitter buffer size of 200 packets.
Figure 11 shows the delay impact of the proposed approach by displaying the relative delay of packet arrival and playout for the default WebRTC approach as well as the burst-aware approach. As shown in the figure, a delay is introduced at 700 ms. Two different burst sizes are shown, where a total of 55 and 65 packets are delayed for their respective durations, i.e., 1,100 ms and 1,300 ms, and all arrive at the receiver at the same time. As is clear from Figure 11, the default WebRTC approach settles more quickly, at the expense of significant speech loss due to the packets dropped by the full buffer flush. The proposed approach, however, prevents the speech loss by removing only a fraction of the packets rather than all of them, which has the side effect of an extended end-to-end delay for a brief period. This brief catch-up period varies stochastically depending on the network conditions. In the examples shown in Figure 11, it takes about 2 s for the relative playout delay to fall below 400 ms. The advantage of the proposed approach, however, is significant. To be more precise, four additional words are played out to the listener that would otherwise all be lost, as can be seen from the waveforms in Figure 11. We also provide the speech recordings in the Online Appendix for the interested reader.
Fig. 10. Speech signals produced under network conditions with a packet burst of 205 packets on the left and with a packet burst of 225 packets on the right; (a) with the default WebRTC approach, (b) with the burst-aware approach.
5.5 Discussion
The results show that the burst-aware approach reduces the number of packets dropped under network conditions where packet bursts occur, for the Wi-Fi, LTE, and simulated network conditions alike. It is also seen that the burst-aware approach offers a higher MOS-LQOf based on the ITU-T Rec. P.863 measurement averages.
The LTE packet traces used in the experiments had more severe packet bursts than the Wi-Fi ones. For instance, the maximum number of packets involved in a burst was less than 30 for the Wi-Fi packet traces, whereas it was 64 in the LTE case. It is also worth noting here that such severe packet bursts happen rather rarely. For instance, there was only one case in the LTE experiment with more than 50 packets in the burst, namely the burst containing 64 packets. It should be mentioned here that the number of available network traces was rather limited in both the LTE and the Wi-Fi experiments, i.e., 33 in the LTE case and 96 in the Wi-Fi case. Therefore, we created the simulated conditions, covering packet bursts ranging between 11 and 75 packets and different burst locations, to provide the interested reader with results of higher statistical significance/power when benchmarking the proposed and default approaches.
Demo samples demonstrating the quality improvements offered by the burst-aware approach over the default WebRTC approach for each network type are provided in the Online Appendix.
In this article, we have studied the core design concepts of WebRTC jitter buffer management in its NetEQ audio coding module. Furthermore, we have investigated how WebRTC performs in terms of voice quality under different network conditions with packet bursts. The network conditions included real Wi-Fi and LTE packet traces as well as simulated network conditions. We have identified that the jitter buffer management component of the open source WebRTC project flushes all the packets stored in the jitter buffer once a packet arrives at a full jitter buffer. This causes a very high packet drop rate, resulting in voice quality degradation. On the basis of that, this article has proposed a burst-aware jitter buffer management algorithm that can be deployed/implemented to prevent the excessive packet drops in such cases. An extensive set of experiments was run against both the default and the modified jitter buffer management algorithms to benchmark the proposed and default approaches. The results have shown that the burst-aware approach can prevent the excessive packet drops that would otherwise occur. As an objective listening quality assessment tool, the ITU-T Rec. P.863 model was used, and the MOS-LQOf results have confirmed the quality improvement provided by the burst-aware approach.
Future work will focus on improving the proposed approach by detecting which packets stored in the buffer contain voice and which contain silence. This information can help us to remove/discard only the packets containing silence, as those have a lower impact on the quality perceived by the end user than the packets containing voice. By doing so, the quality perceived by the user should be improved further. Moreover, an extension of this work toward the video modality is planned to see the effect of the proposed approach in that context.
REFERENCES 参考文献
[1] 3GPP TS 26.448 Version 16.0.0. 2019. Codec for Enhanced Voice Services (EVS); Fitter Buffer Management. Technical Specification (TS) 26.448. 3rd Generation Partnership Project (3GPP). Version 16.0.0. Retrieved from https://portal. 3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1470. [1] 3GPP TS 26.448 版本 16.0.0。2019。增强语音服务(EVS)编解码器;更合适的缓冲区管理。技术规范(TS)26.448。第三代合作伙伴计划(3GPP)。版本 16.0.0。检索自 https://portal.3gpp.org/desktopmodules/Specifications/SpecificationDetails.aspx?specificationId=1470。
[2] ABIResearch. 2013. 4.7 billion mobile WebRTC devices by 2018 despite lack of open support from Apple and Microsoft. Retrieved from https://www.abiresearch.com/press/47-billion-mobile-webrtc-devices-by-2018-despite-1/. [2] ABIResearch。2013。尽管苹果和微软缺乏开放支持,2018 年移动 WebRTC 设备将达到 47 亿台。检索自 https://www.abiresearch.com/press/47-billion-mobile-webrtc-devices-by-2018-despite-1/。
[3] Mohannad Al-Ahmadi, Yusuf Cinar, Hugh Melvin, and Peter Pocta. 2016. Investigating the extent and impact of time-scaling in WebRTC voice traffic under light, moderate and heavily congested Wi-Fi APs. In Proceedings of the 5th ISCA/DEGA Workshop on Perceptual Quality of System (PQS’16). [3] Mohannad Al-Ahmadi, Yusuf Cinar, Hugh Melvin 和 Peter Pocta. 2016. 研究在轻度、中度和重度拥堵的 Wi-Fi 接入点下,WebRTC 语音流量中时间缩放的程度及其影响。载于第五届 ISCA/DEGA 系统感知质量研讨会(PQS’16)论文集。
[4] H Alvestrand. 2021. Overview: Real-Time Protocols for Browser-Based Applications. RFC 8825. https://tools.ietf.org/ html/rfc8825. [4] H Alvestrand. 2021. 概述:基于浏览器应用的实时协议。RFC 8825。https://tools.ietf.org/html/rfc8825。
[5] Adam Bergkvist, Daniel C. Burnett, Cullen Jennings, and Anant Narayanan. 2021. WebRTC 1.0: Real-Time Communication Between Browsers. W3C Recommendation. https://www.w3.org/TR/2021/REC-webrtc-20210126/. [5] Adam Bergkvist, Daniel C. Burnett, Cullen Jennings 和 Anant Narayanan. 2021. WebRTC 1.0:浏览器间的实时通信。W3C 推荐标准。https://www.w3.org/TR/2021/REC-webrtc-20210126/。
[6] Yusuf Cinar and Hugh Melvin. 2014. Webrtc quality assessment: Dangers of black-box testing. In Proceedings of the The 10th International Conference on Digital Technologies 2014. IEEE, 32-35. [6] Yusuf Cinar 和 Hugh Melvin. 2014. WebRTC 质量评估:黑盒测试的风险。载于第十届数字技术国际会议论文集(2014)。IEEE,32-35。
[7] Yusuf Cinar, Hugh Melvin, and Peter Pocta. 2016. A black-box analysis of the extent of time-scale modification introduced by webrtc adaptive jitter buffer and its impact on listening speech quality. Commun. Sci. Lett. Univ. Zilina 18, 1 (2016), 17-22. [7] Yusuf Cinar、Hugh Melvin 和 Peter Pocta。2016。对 WebRTC 自适应抖动缓冲引入的时间尺度修改程度的黑箱分析及其对听觉语音质量的影响。《Commun. Sci. Lett. Univ. Zilina》18 卷,第 1 期(2016),17-22 页。
[8] Nicolas Côté. 2011. Integral and Diagnostic Intrusive Prediction of Speech Quality. Springer Science & Business Media. [8] Nicolas Côté。2011。语音质量的积分和诊断性侵入式预测。Springer Science & Business Media 出版。
[9] Global IP Solutions. 2007. GIPS NetEQ. Retrieved May 26, 2020 from http://www.gipscorp.alcatrazconsulting.com/ files/english/datasheets/NetEQ.pdf. [9] Global IP Solutions。2007。GIPS NetEQ。2020 年 5 月 26 日检索自 http://www.gipscorp.alcatrazconsulting.com/files/english/datasheets/NetEQ.pdf。
[10] B. E. Hugh Melvin. 2004. The Use of Synchronised Time in Voice over Internet Protocol (VoIP) Applications. Ph.D. Dissertation. University College Dublin. [10] B. E. Hugh Melvin。2004。互联网协议语音(VoIP)应用中同步时间的使用。博士论文。都柏林大学学院。
[11] ITU-T. 2003. Recommendation ITU-T G. 114 One-way transmission time. [11] ITU-T. 2003. 建议 ITU-T G.114 单向传输时间。
[12] ITU-T. 2017. Recommendation ITU-T P.501, Test signals for use in telephonometry. [12] ITU-T. 2017. 建议 ITU-T P.501,电话测量中使用的测试信号。
[13] ITU-T. 2018. Recommendation ITU-T P.863, Perceptual objective listening quality assessment. [13] ITU-T. 2018. 建议 ITU-T P.863,感知客观听觉质量评估。
[14] Oliwia Komperda, Hugh Melvin, and Peter Pocta. 2016. A black box analysis of WebRTC mouth-to-ear delays. Commun. Sci. Lett. Univ. Zilina 18, 1 (2016), 11-16. [14] Oliwia Komperda, Hugh Melvin, 和 Peter Pocta. 2016. WebRTC 口到耳延迟的黑盒分析。通讯科学通讯,Zilina 大学 18 卷,1 期 (2016),11-16。
[15] Yi J. Liang, Nikolaus Farber, and Bernd Girod. 2003. Adaptive playout scheduling and loss concealment for voice communication over IP networks. IEEE Trans. Multimedia 5, 4 (2003), 532-543. [15] Yi J. Liang, Nikolaus Farber 和 Bernd Girod. 2003. 用于 IP 网络语音通信的自适应播放调度和丢包隐藏。IEEE 多媒体学报 5 卷,4 期 (2003),532-543。
[16] Henrik Fahlberg Lundin. 2010. Method and receiver for determining a jitter buffer level. US Patent 7,733,893. [16] Henrik Fahlberg Lundin. 2010. 用于确定抖动缓冲区级别的方法和接收器。美国专利 7,733,893。
[17] Najmeddine Majed. 2018. Measuring and Improving the Quality of Experience of Mobile Voice Over IP. Ph.D. Dissertation. Ecole nationale supérieure Mines-Télécom Atlantique. [17] Najmeddine Majed. 2018. 移动语音 IP 质量体验的测量与提升。博士论文。法国国家高等矿业电信学院。
[18] Najmeddine Majed, Stephane Ragot, Xavier Lagrange, Alberto Blanc, Jerome Dufour, and Guillaume Grao. 2017. Experimental evaluation of WebRTC voice quality in LTE coverage tests. In Proceedings of the 2017 9th International Conference on Quality of Multimedia Experience (QoMEX’17). IEEE, 1-6. [18] Najmeddine Majed, Stephane Ragot, Xavier Lagrange, Alberto Blanc, Jerome Dufour 和 Guillaume Grao. 2017. LTE 覆盖测试中 WebRTC 语音质量的实验评估。载于 2017 年第九届多媒体体验质量国际会议(QoMEX’17)论文集。IEEE,1-6。
[19] B. Oklander and M. Sidi. 2008. Jitter buffer analysis. In Proceedings of the 2008 Proceedings of 17th International Conference on Computer Communications and Networks. 1-6. DOI : https://doi.org/10.1109/ICCCN.2008.ECP. 33 [19] B. Oklander 和 M. Sidi. 2008. 抖动缓冲区分析。载于 2008 年第 17 届国际计算机通信与网络会议论文集。1-6 页。DOI:https://doi.org/10.1109/ICCCN.2008.ECP.33
[20] Alexander Raake. 2006. Speech Quality of VoIP. Wiley Online Library. [20] Alexander Raake. 2006. VoIP 的语音质量。Wiley 在线图书馆。
[21] Jan Skoglund, Ermin Kozica, Jan Linden, Roar Hagen, and W Bastiaan Kleijn. 2008. Voice over IP: Speech transmission over packet networks. In Springer Handbook of Speech Processing. Springer, 307-330. [21] Jan Skoglund, Ermin Kozica, Jan Linden, Roar Hagen 和 W Bastiaan Kleijn. 2008. 互联网语音协议:基于分组网络的语音传输。载于《Springer 语音处理手册》。Springer 出版社,307-330 页。
[22] WebRTC Team. 2017. WebRTC. Retrieved January 1, 2017 from WebRTC.org. [22] WebRTC 团队. 2017. WebRTC。2017 年 1 月 1 日从 WebRTC.org 检索。
[23] Wikipedia contributors. 2019. Global IP Solutions. Wikipedia, The Free Encyclopedia. Retrieved May 26, 2020 from https://en.wikipedia.org/w/index.php?title=Global_IP_Solutions&oldid=933140402. [23] 维基百科贡献者。2019 年。Global IP Solutions。维基百科,自由的百科全书。2020 年 5 月 26 日检索自 https://en.wikipedia.org/w/index.php?title=Global_IP_Solutions&oldid=933140402。
Received January 2020; revised June 2020; accepted July 2020