
1. A Composite T60 Regression and Classification Approach for Speech Dereverberation

REVERBERATION occurs in everyday environments due to the reflection of sounds off the many surfaces in a room, such as the furniture, walls and floors. This causes listeners to hear a combination of the direct speech signal and the reflections. The effects of reverberation often smear speech across time and frequency, which negatively impacts individuals with impaired hearing [1], [2], since reverberation degrades perceptual quality and intelligibility. This also creates challenges for various voice-based applications, including automatic speech recognition (ASR) [3], [4], speaker identification [5], [6] and speaker localization [7], [8], to name a few.

Current monaural approaches often use deep neural networks (DNNs) to remove reverberation. A spectral mapping method [9], proposed by Han et al., maps the noisy and reverberant signal to an anechoic signal in the time-frequency (T-F) domain using a fully-connected DNN. It has a post-processing stage that performs iterative phase reconstruction to re-synthesize the estimated time-domain signal. Xiong et al. [10] use a multi-layer perceptron (MLP) to estimate T60 from Gabor filterbank features. A DNN can also estimate the complex ideal ratio mask (cIRM) [11], which captures the magnitude and phase responses through its real and imaginary components; this approach simultaneously handles noisy and reverberant conditions. The use of feed-forward DNNs, however, is a limitation since they do not capture long-term contextual information. To overcome this limitation, Santos et al. proposed a dereverberation method that uses a recurrent neural network (RNN) [12] to capture long-term contextual information, along with a 2-D convolutional encoder to extract local contextual features. Another RNN model, with a long short-term memory (LSTM) network [13] proposed by Zhao et al., predicts late reflections and subtracts them from the reverberant signal to estimate the direct and early components of reverberation. Zhao et al. later proposed a dereverberation model that uses temporal convolutional networks (TCNs) with a self-attention module [14]. This method uses self-attention to extract dynamic features from the input, and uses the TCN to learn the non-linear mapping from reverberant to anechoic speech. All the above approaches operate in the T-F domain.
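As a concrete reference for the masking target mentioned above, the cIRM is the complex ratio between the clean and reverberant STFTs, so that applying it to the reverberant spectrogram recovers the clean one. A minimal numpy sketch (array names are illustrative):

```python
import numpy as np

def cirm(S, Y, eps=1e-8):
    """Complex ideal ratio mask M = S / Y, computed per T-F bin.

    S, Y: complex STFTs of clean and reverberant speech (freq x time).
    Returns the real and imaginary mask components, so that
    S ~= (Mr + 1j * Mi) * Y.
    """
    denom = Y.real**2 + Y.imag**2 + eps
    Mr = (Y.real * S.real + Y.imag * S.imag) / denom
    Mi = (Y.real * S.imag - Y.imag * S.real) / denom
    return Mr, Mi
```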

Traditional signal processing methods have also been developed. These approaches can be divided into two categories: spectral subtraction and inverse filtering. Lebart et al. proposed a spectral subtraction approach [15] that removes reverberation by canceling the smearing effects in phonemic energy using prior knowledge of the reverberation time and phonemes. More specifically, the approach estimates the power spectral density (PSD) of the reverberation; the square root of the estimated PSD is then subtracted from the reverberant signal's spectrum, yielding an estimate of the dereverberated signal's spectrum. Yoshioka et al. provided a generalized subband-domain multi-channel linear prediction approach (also known as weighted prediction error, WPE) that requires no prior knowledge of the acoustic conditions [16]. Nakatani et al. [17] used a delayed linear prediction (DLP) model to cancel the late reverberation without prior knowledge of the room impulse response (RIR). WPE estimates a filter that predicts the reverberation tail and subtracts it from the reverberant signal to obtain the maximum likelihood estimate. It has been used in many applications [18], [19].
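The delayed linear prediction at the core of WPE can be sketched compactly: for each frequency bin, a filter over delayed past frames predicts the reverberation tail, which is then subtracted. Below is a single-channel, single-iteration numpy sketch; the delay, the number of taps, and the use of |y|² as the PSD estimate are simplifying assumptions (the published algorithm iterates the PSD and filter estimates):

```python
import numpy as np

def wpe_one_freq(y, delay=3, taps=10, eps=1e-8):
    """Delayed linear prediction for one frequency bin (WPE-style).

    y: complex STFT sequence of one frequency bin, shape (T,).
    Predicts the late reverberation from frames y[t-delay-taps+1 .. t-delay]
    and subtracts it, leaving the direct sound plus early reflections.
    """
    T = y.shape[0]
    # Stack delayed past frames: Ytilde[k, t] = y[t - delay - k]
    Ytilde = np.zeros((taps, T), dtype=complex)
    for k in range(taps):
        shift = delay + k
        Ytilde[k, shift:] = y[:T - shift]
    lam = np.maximum(np.abs(y) ** 2, eps)   # PSD estimate (one iteration only)
    Yw = Ytilde / lam                        # variance-weighted copies
    R = Yw @ Ytilde.conj().T                 # weighted correlation matrix
    r = Yw @ y.conj()                        # weighted cross-correlation
    g = np.linalg.solve(R + eps * np.eye(taps), r)
    return y - g.conj() @ Ytilde             # subtract the predicted tail
```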

All the above approaches operate directly on the reverberant signal and either (1) are agnostic of the acoustical and contextual information about the room and signal or (2) assume this information is known and do not estimate it. In particular, the approaches do not leverage or estimate the reverberation time, T60, which is a strong indicator of the smearing effects of reverberation. It is possible to estimate this information; for instance, Bryan et al. [20] use a convolutional neural network (CNN) to downsample the input to a single T60 value. Considering how important the room environment is to dereverberation, Wu et al. [21] investigate how different contextual information affects the suppression of reverberation. The approach uses a reverberation-time-aware DNN that estimates the reverberation time based on the proper selection of the frame shift and context window sizes during feature extraction. It then supplies the log-power spectrogram to the DNNs for dereverberation. Instead of manually generating different contextual information based on the selected frame shift and window size, a self-attention module learns the different representations automatically. This method, however, requires manually choosing the contextual information for different T60s and a reliable T60 estimator. This reverberation-time-aware DNN has been used as a front-end process for ASR [22]. Another temporally context-aware approach is proposed by Wang et al. [23]. The main dereverberation model uses time-aware context frames to predict the dereverberated spectrogram, while jointly optimizing the reverberation time. This environment-aware network jointly performs reverberation-time estimation and dereverberation; however, the approach does not always generalize and needs improved performance in real-world settings. These approaches indicate that optimizing with reverberation time offers benefits to dereverberation.
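To make the idea of downsampling an input to a single T60 value concrete, the following PyTorch sketch shows one plausible shape of such a CNN regressor; the layer sizes are illustrative and not those of the model in [20]:

```python
import torch
import torch.nn as nn

class T60Regressor(nn.Module):
    """Downsamples a log-magnitude spectrogram (B, 1, F, T) to one T60 value."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),   # collapse the remaining F x T grid
        )
        self.head = nn.Linear(64, 1)   # scalar T60 in seconds

    def forward(self, spec):
        h = self.conv(spec).flatten(1)
        return self.head(h).squeeze(-1)
```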

In this paper, we propose a joint-learning approach for speech dereverberation that accurately estimates T60 and late reflections. Early reflections are beneficial to speech intelligibility, so we decided to remove only the late reflections [24], [25]. We separately train a T60 estimator that matches the model from our preliminary work [26], and a dereverberation network [13]. Additional features from the T60 estimator are also provided to the dereverberation module. This new feature connection links the two networks, and fine-tuning is subsequently performed to generate the final dereverberated result. We run experiments to determine which loss function from [26] is best for the joint approach. There are four major differences between the adopted dereverberation model and the model in [13]: 1) the window and FFT sizes differ; 2) we stack three LSTM layers instead of two; 3) we remove the weight-dropping approach during training, instead using dropout and training with a larger and more balanced dataset; and 4) we incorporate different input features.
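As a purely schematic illustration of connecting the two networks (module sizes are ours, not the proposed architecture), a T60-derived feature vector can be broadcast along time and concatenated with the spectral frames fed to a three-layer LSTM with dropout:

```python
import torch
import torch.nn as nn

class JointDereverb(nn.Module):
    """Concatenates a T60-derived embedding with spectral frames (schematic)."""
    def __init__(self, n_freq=257, t60_dim=64, hidden=512):
        super().__init__()
        self.lstm = nn.LSTM(n_freq + t60_dim, hidden, num_layers=3, batch_first=True)
        self.drop = nn.Dropout(0.3)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, mag, t60_feat):
        # mag: (B, T, F) reverberant magnitudes; t60_feat: (B, t60_dim)
        ctx = t60_feat.unsqueeze(1).expand(-1, mag.shape[1], -1)
        h, _ = self.lstm(torch.cat([mag, ctx], dim=-1))
        return self.out(self.drop(h))   # estimated dereverberated magnitudes
```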

2. A GRU-Based Late Reverberation Suppression Method for Single-Channel Speech Dereverberation

Speech signal processing technologies now receive great attention in many fields. With closed spaces and distant microphones, reverberation becomes a source of degradation to the intelligibility and recognizability of the recorded speech [1]. Broadly viewed as the multipath propagation of sound in an enclosure, reverberation has been shown to divide into early reverberation and late reverberation [2]. The early reverberation, within 50∼80 ms, is considered to be the result of early reflections. After that comes a series of numerous indistinguishable reflections which lead to late reverberation. Exploring efficient dereverberation methods has been an important topic for improving the intelligibility of speech and the recognition performance of automatic speech recognition systems [3] [4].
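This early/late decomposition can be made concrete by splitting the RIR at a boundary in the 50∼80 ms range; a numpy sketch with an illustrative 50 ms boundary measured from the direct-path peak:

```python
import numpy as np
from scipy.signal import fftconvolve

def split_reverb(clean, rir, sr, boundary_ms=50):
    """Convolve clean speech with the early and late parts of an RIR.

    The split point is measured from the direct-path peak, so the early
    part keeps the direct sound plus reflections within boundary_ms.
    """
    direct = np.argmax(np.abs(rir))                 # direct-path arrival
    cut = direct + int(sr * boundary_ms / 1000)
    early_rir = rir[:cut]
    late_rir = np.r_[np.zeros(cut), rir[cut:]]      # keep time alignment
    early = fftconvolve(clean, early_rir)[:len(clean)]  # training target
    late = fftconvolve(clean, late_rir)[:len(clean)]    # component to remove
    return early, late
```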

In recent years, using deep neural networks, researchers have provided plenty of new methods to remove reverberation [5] [6] [7] [8]. Wu in [6] proposed a reverberation-time-aware speech dereverberation framework based on the deep neural network (DNN) to handle a wide range of reverberation times. It is worth mentioning that the principle of this algorithm is spectrum enhancement: restoring the amplitude spectrum of clean speech from that of the reverberant speech. Convolutional neural networks (CNNs) have also been tried for reverberation removal [7] and have successfully improved the quality of speech.

With the emergence of the recurrent neural network (RNN), its application in speech enhancement has been equally fruitful. The RNN shows more advantages than the DNN in the field of speech processing owing to its ability to capture the correlations of time series. Saleem et al. in [9] employed both DNN and RNN to accomplish a single-channel speech enhancement task by learning the spectral mapping between the degraded and clean speech signals, concluding that the RNN denoises speech better than the DNN. Subsequently, long short-term memory (LSTM), an improvement on the structure of the RNN which captures long-term dependencies and avoids the problems of gradient explosion and gradient vanishing, has been utilized for time series prediction [10]. As a simplified variant of the LSTM, the gated recurrent unit (GRU) can achieve similar effects with a simpler structure [11]. The GRU network has also been attempted in the area of speech enhancement [12]. However, the use of neural networks with a recurrent structure in speech dereverberation has not been fully exploited. A speech dereverberation algorithm using an LSTM to suppress late reverberation, proposed in [13], is the only attempt we are aware of.

Traditionally, there exists a class of dereverberation approaches that do not try to fully suppress the reverberation, but only to remove the late reverberation. This approach is based on the observation that early reverberation does no harm to, and may even benefit, the intelligibility of the speech [14]. Following this approach, many effective methods have been produced [14] [15] [16]. To retain the advantage that recurrent neural networks can learn the temporal correlation of speech sequences, while accomplishing late reverberation suppression with a more simplified network structure, a GRU model is selected. In this paper, we propose a system using a GRU network to suppress the late reverberation of speech signals. The mapping from the magnitude spectrum of reverberant speech to that of speech with only early reverberation is learned by the GRU network through training.
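A minimal PyTorch sketch of such a GRU mapping network is given below; the layer sizes are illustrative, and training would minimize, e.g., the MSE between the network output and the magnitude spectrogram of speech containing only early reverberation:

```python
import torch
import torch.nn as nn

class GRUDereverb(nn.Module):
    """Maps reverberant magnitude spectra (B, T, F) to early-reverberant ones."""
    def __init__(self, n_freq=257, hidden=512, layers=2):
        super().__init__()
        self.gru = nn.GRU(n_freq, hidden, num_layers=layers, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, n_freq),
                                 nn.ReLU())   # magnitudes are non-negative

    def forward(self, mag):
        h, _ = self.gru(mag)
        return self.out(h)
```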

3. CAT-DUnet: Enhancing Speech Dereverberation via Feature Fusion and Structural Similarity Loss

REVERBERATION originates from sound waves reflecting off walls and surfaces. These reflected waves, when overlaid with the direct sound and captured by a distant microphone, can distort the primary characteristics of the initial speech waveform, resulting in a reduction in both the clarity and intelligibility of the speech signal [1, 2].

With the progress of deep learning, speech dereverberation schemes based on Deep Neural Networks (DNN) have emerged and achieved considerable results. Early approaches leverage fully convolutional networks, such as the Unet architecture [3, 4, 5], which employs convolutional encoders for hierarchical feature extraction and transposed convolutional decoders for up-sampling. Subsequently, Recurrent Neural Networks (RNN) are commonly used to capture temporal dependencies [6, 7, 8]. With the advent of the attention mechanism [9], which assigns attention weights to elements within a sequence for improved feature representations, several studies have recently proposed time-attention and frequency-attention modules to respectively model the temporal and frequency sequence features [10, 11, 12]. As a recent advancement, the diffusion model demonstrates effectiveness in complex scenarios by employing generative techniques, albeit with higher computational demands [13].

Although existing approaches have demonstrated efficacy in handling complex data, mainstream solutions still predominantly employ simple residual connections to link the encoder and decoder within their network architectures. However, deep convolutional networks compute a feature hierarchy layer by layer, starting with basic feature extraction such as edges and textures in initial layers, and progressing to more complex pattern recognition in deeper layers. This progression results in varying semantic levels of features and a consequent semantic gap between these layers [14]. The exclusive dependence on simple residual connections may not fully ensure effective feature fusion at different semantic levels and could result in a decline in network performance [15]. Therefore, a more advanced form of connectivity is required [16, 17].

On the other hand, speech features such as rhythm and pitch exhibit distinctive structural characteristics in spectrograms, closely linked to speech clarity and intelligibility. Consequently, training objectives frequently concentrate on the spectrogram domain [7, 18]. However, existing methods often prioritize the reduction of numerical errors using ℓ1 or ℓ2 norms, potentially overlooking the structural nuances within spectrograms crucial for speech signals. Just as in computer vision, where distance-based norms have limited capacity to capture structural nuances and align with human perception [19], similar concerns may extend to speech processing.

To address the aforementioned issues, we first propose CAT-DUnet: Channel Attention, Time-frequency attention, and Dilated-convolutional Unet to enhance feature fusion. We incorporate the channel attention blocks to model the channel importance of low semantic-level features. Additionally, the time-frequency attention blocks and dilated convolutional blocks are employed to enhance global and local time-frequency feature modeling, respectively. To improve loss perception characteristics, we employ Structural Similarity (SSIM) [20] as a training objective, which incorporates luminance, contrast, and structure, leading to improved congruence with human perceptual judgments. On another front, acknowledging the non-linear nature of human auditory frequency perception [21], we introduce an innovative time-weighted Mel-spectrogram to better reflect the characteristics of the spectrogram structure in correlation with human perception.
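For reference, SSIM compares local means, variances, and covariance of two images; applied to spectrograms normalized to [0, 1], a compact PyTorch version with a uniform window reads as follows (the window size and stability constants are the common defaults, used here as assumptions):

```python
import torch
import torch.nn.functional as F

def ssim_loss(x, y, win=7, c1=0.01**2, c2=0.03**2):
    """1 - mean SSIM between two spectrograms shaped (B, 1, F, T) in [0, 1]."""
    mu_x = F.avg_pool2d(x, win, 1, win // 2)
    mu_y = F.avg_pool2d(y, win, 1, win // 2)
    var_x = F.avg_pool2d(x * x, win, 1, win // 2) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, 1, win // 2) - mu_y ** 2
    cov = F.avg_pool2d(x * y, win, 1, win // 2) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim.mean()
```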

Our main contributions are three-fold: 1) We design a novel fusion block that facilitates feature fusion within the Unet architecture, validated through ablation studies for its efficacy. 2) To our knowledge, we are the first to utilize the SSIM loss in speech dereverberation, and we demonstrate its effectiveness through experiments. 3) We propose the adoption of a time-weighted Mel-spectrogram to better capture the correspondence between spectrograms and human auditory perception, substantiated by experimental verification.

4. Complex-Valued Time-Frequency Self-Attention for Speech Dereverberation

Speech dereverberation is a method for eliminating the smearing effects introduced by surrounding reflections when speech is captured by a distant microphone. Statistical speech enhancement techniques have played a crucial role as front-end processors in several speech processing pipelines, such as [1, 2, 3, 4]. The requirement for such robust speech systems to assist in the recognition of naturalistic speech recorded by distant microphones has become more important as human-machine interaction technologies gain traction [5, 6, 7, 8]. Additionally, advances in technologies such as hearing aids require speech systems to enhance the perceptual quality of speech captured in adverse environmental conditions, thus improving human hearing abilities. Several deep learning (DL)-based speech enhancement systems have been successfully developed to address concurrent improvements in perceptual quality and in the performance of back-end speech and language applications, using fully convolutional neural networks (FCN) and recurrent networks (RNN) [9, 10, 11, 12]. The majority of these approaches work with the complex short-time Fourier transform (STFT) of distorted speech, either to enhance the log-power spectrum (LPS) and reuse the unaltered distorted phase signal [13, 14, 15, 16, 17], or to estimate the complex ratio mask (cRM) [18, 19, 20] and directly enhance the complex spectrogram to restore a cleaner time-domain signal. Enhancing the magnitude response or LPS enables back-end speech applications to operate more efficiently, because most back-end speech applications are trained using LPS-derived speech features. Alternatively, speech applications aimed at enhancing the perceived quality and intelligibility of speech make extensive use of complex spectrograms to recover the magnitude and phase of the distorted signal using DL approaches.

As deep neural networks (DNNs) advance to be compatible with complex representations, researchers have investigated many speech enhancement strategies that estimate the cRM using deep complex neural networks (DCNN). To address reverberation, which distorts the signal in both time and frequency, many sequence-to-sequence learning strategies such as recurrent neural networks (RNNs) and long short-term memory (LSTM) [21, 11] have also been explored. In addition to FCNs, these methods capture and leverage temporal correlations for speech dereverberation. In recent years, self-attention (SA) has become a widely utilized mechanism for sequence-to-sequence learning tasks [22, 23, 24, 25]. SA is a mechanism for selective context aggregation that generates an output sequence by computing a weighted average of the input sequence. The learned weights represent the level of attention the network pays to subsets of the input sequence while generating an output sequence. For the speech dereverberation task, SA allows the network to attend to time-frequency (TF) locations so as to reduce the smearing effects of reverberation. However, conventional SA approaches used in DCNN networks do not account for the inter-dependencies between the real and imaginary components of complex-valued features.
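For reference, scaled dot-product self-attention computes a weighted average of the input sequence. The sketch below shows the plain real-valued mechanism and the channel-wise concatenation of real and imaginary parts used, e.g., with SDAB; as noted above, this concatenation alone does not model the inter-dependencies between the two components:

```python
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence x of shape (B, T, D)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v   # weighted average of the input

# Complex STFT features (B, T, F) handed to a real-valued SA block by
# concatenating real and imaginary parts along the feature axis:
spec = torch.randn(2, 100, 257, dtype=torch.cfloat)
x = torch.cat([spec.real, spec.imag], dim=-1)   # (B, T, 2F)
D = x.shape[-1]
out = self_attention(x, *(torch.randn(D, D) for _ in range(3)))
```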

The purpose of this study is to develop a complex-valued time-frequency (T-F) self-attention mechanism that computes attention using both real and imaginary components to accurately model temporal dependencies using deep neural networks. To demonstrate the effectiveness of our proposed complex-valued SA mechanism, we integrate two SA approaches with DCCRN: (i) the conventional self-attention mechanism [22], and (ii) the sample independent dual attention block (SDAB) [26] using channel-wise concatenated real and imaginary components. The REVERB challenge corpus is used to examine the improvements in overall speech quality and back-end speech application performance achieved by integrating our proposed self-attention with a fully convolutional and recurrent network, DCCRN.

5. D²Net: A Denoising and Dereverberation Network Based on Two-branch Encoder and Dual-path Transformer

Speech is easily degraded by background noise and room reverberation, resulting in a significant decline in speech intelligibility and speech recognition performance. Reverberation is the accumulation of multiple reflections of a signal as it travels around the room from source to microphone. Room reverberation can be characterized by the room impulse response (RIR), which depends on the positions of the sound source and microphone [1] [2]. Speech denoising and dereverberation processing is an indispensable front-end task and has been widely studied in speech recognition [3] [4]. In this paper, we focus on the task of single-channel denoising and dereverberation, which is more challenging.

Generally, in multi-channel dereverberation, spatial information such as inter-channel phase differences (IPDs) [5] and inter-channel convolution differences (ICDs) [6] is extracted and fed into the network. For the task of single-channel dereverberation, however, such spatial information is not available. Some studies exploit cascaded modules to accomplish denoising and dereverberation [7] [8].

According to the input feature, speech enhancement methods can be divided into two categories: time-domain methods and time-frequency (T-F) domain methods. The time-domain methods estimate the clean waveform directly from the mixture speech in the time domain [9] [10]. The traditional T-F domain methods usually adopt the magnitude spectrogram obtained by the short-time Fourier transform (STFT) operation [11]. In this case, the performance is limited because the phase information cannot be utilized effectively. Many recent studies have begun to adopt the complex-valued spectrogram, which can be decomposed into the amplitude and phase in polar coordinates or the real and imaginary parts in Cartesian coordinates. Using the complex-valued spectrogram as input has been verified to improve the performance of speech denoising networks [12].
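The two decompositions are straightforward to compute from the STFT; a numpy sketch using scipy (frame sizes are illustrative):

```python
import numpy as np
from scipy.signal import stft

def complex_features(wave, sr=16000, n_fft=512, hop=256):
    """Return polar and Cartesian views of the complex spectrogram."""
    _, _, Y = stft(wave, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    mag, phase = np.abs(Y), np.angle(Y)   # polar coordinates
    real, imag = Y.real, Y.imag           # Cartesian coordinates
    # The two views describe the same complex spectrogram:
    assert np.allclose(mag * np.exp(1j * phase), real + 1j * imag)
    return mag, phase, real, imag
```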

In recent years, dual-path networks have shown good performance in speech separation [13] [14] and speech enhancement [10]. Wang K et al. introduced the transformer [15] into the dual-path network structure and proposed a time-domain speech enhancement model, the two-stage transformer based neural network (TSTNN), which greatly improves speech enhancement performance [10]. Several studies have reported that dot-product self-attention may not be indispensable to transformer models. Xu M et al. proposed the local dense synthesizer attention (LDSA), which dispenses with dot products and pairwise interactions and restricts the attention scope to a local range around the current central frame to reduce the computational complexity and improve the performance [16].
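A simplified sketch of the idea behind LDSA follows: attention weights over a local window are synthesized directly from the current frame by a linear layer, with no dot products or pairwise interactions (the window size is illustrative, and this is a schematic reading of [16], not its exact formulation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalSynthesizerAttention(nn.Module):
    """Attends to a local window around each frame with synthesized weights."""
    def __init__(self, dim, window=11):
        super().__init__()
        self.window = window
        self.to_weights = nn.Linear(dim, window)   # weights from the frame itself

    def forward(self, x):                          # x: (B, T, D)
        w = F.softmax(self.to_weights(x), dim=-1)  # (B, T, W), no dot products
        pad = self.window // 2
        xp = F.pad(x, (0, 0, pad, pad))            # pad the time axis
        # Gather the local window for every frame: (B, T, W, D)
        windows = xp.unfold(1, self.window, 1).permute(0, 1, 3, 2)
        return (w.unsqueeze(-1) * windows).sum(dim=2)
```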

Inspired by the above-mentioned works, we propose a single-channel network for simultaneous denoising and dereverberation named D²Net, in which a two-branch encoder (TBE) is designed to extract and selectively fuse features of different granularity from the two branches. Meanwhile, we design a global-local dual-path transformer (GLDPT) which introduces the LDSA into the dual-path transformer to improve the perception of local information. We evaluated our proposed D²Net and conducted ablation studies on the VoiceBank+DEMAND and WHAMR! datasets.

6. DNN-Based Linear Prediction Residual Enhancement for Speech Dereverberation

In daily-life scenarios, the sound that reaches the ears usually includes the original source sound and its reflections off various surfaces, especially in confined spaces. When microphones or other receiving devices capture the sound, the original sound is combined with these attenuated, delayed reflections to form a reverberant signal. Reverberation inevitably causes a decrease in speech recognizability and speech quality [1]. In a reverberant environment, the speech intelligibility for hearing-impaired listeners is significantly reduced [2], and severe reverberation also has a certain impact on normal-hearing listeners [3]. Reverberation in speech affects human perception and poses a challenge to the robustness of identity authentication systems and speech recognition systems as well [4] [5]. Methods that solve the reverberation problem will benefit many speech technology applications.

To meet the requirements of human ears for speech intelligibility, a variety of methods have been studied for speech dereverberation. One method is to enhance or modify the linear prediction (LP) residuals of short-term speech segments [6] [7]. In this method, the linear prediction coefficients (LPC) of a short-term reverberant speech segment are assumed to be equal to those of the corresponding clean speech segment, or can be estimated to high accuracy through spatial averaging over reverberant speech segments collected at different positions in the room [8]. Then, the LP residual of the reverberant speech segment is processed through various algorithms to make it approach that of the clean speech [7] [9] [10], and the enhanced speech segment is obtained by exciting the all-pole filter represented by the LP coefficients with the processed LP residual.
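This analysis/synthesis chain can be sketched for a single short-term frame with librosa's LPC; the residual-processing step, which varies across [7] [9] [10], is left as a placeholder:

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def lp_residual_enhance(frame, order=12, process=lambda e: e):
    """LP analysis, residual processing, and all-pole re-synthesis of one frame."""
    a = librosa.lpc(frame, order=order)     # [1, a1, ..., ap], analysis filter A(z)
    residual = lfilter(a, [1.0], frame)     # e(n) = A(z) x(n)
    enhanced_res = process(residual)        # placeholder for residual processing
    return lfilter([1.0], a, enhanced_res)  # excite 1/A(z) with processed residual
```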

In recent years, with the broad application of the Deep Neural Network (DNN) in various signal processing areas, the DNN has also been used for speech enhancement [11]. Using a DNN to eliminate speech reverberation was first proposed in [12] [13]: the DNN was trained to learn the spectral mapping relationship between reverberant speech and clean speech. Subsequently, Wu et al. [14] used the reverberation time (RT60) [15], assumed known, to select the appropriate frame shift and frame extension length to further improve the DNN-based dereverberation method. In [16], a denoising step was added to the DNN to suppress the noise as well as the reverberation.

In this paper, we propose to use a DNN to process the residual of the reverberant speech. Instead of using the DNN to model the mapping from the short-term spectrum of the reverberant speech to that of the clean speech as in [13] [14], the DNN in this method is trained to learn the relationship between the spectrum of the LP residual of the reverberant speech and that of the clean speech; hereafter this method will be referred to as the DNN-based linear prediction residual mapping (DNN-LPRM) method. Since the DNN mapping is nonlinear, and it is well known that nonlinear processing of speech signals can introduce artifacts [17], in DNN-LPRM we avoid the direct speech-to-speech mapping and only apply the mapping to the LP residual, which avoids this problem to some extent.

7. DRC-NET: DENSELY CONNECTED RECURRENT CONVOLUTIONAL NEURAL NETWORK FOR SPEECH DEREVERBERATION

Reverberation is widely experienced in daily life; it is caused by the multi-path transmission of sound waves in an enclosed space. The reflections and energy loss during acoustic transmission make the received signal a mixture of countless attenuated speech copies. Thus, a strong correlation exists between speech and its reverberation, which is difficult to eliminate. Even modest speech reverberation can have a detrimental impact on speech intelligibility. For example, in a heavily reverberant room, the speech captured by a far-field microphone sounds blurred and confusing in teleconferencing, and people with hearing impairment cannot track the discussions.

Dereverberation is a process of removing reverberation or late reverberation components from the observed mixture, usually by signal processing or data-driven deep learning. In practice, the reverberation can be mathematically modeled as the convolution of the source signal with the acoustic impulse response (AIR). Harmonicity-based dereverberation (HERB) [1] first estimates a harmonic direct signal by adapting a time-varying harmonic comb filter, and then forms a time-invariant dereverberation filter from the observed signal and the estimated harmonic direct-arrival signal. It performs well but incurs high latency to calculate an accurate inverse filter. The well-known weighted prediction error (WPE) [2][3] is a widely used dereverberation algorithm, which can work in both single-channel and multi-channel conditions. WPE models the dereverberation as an auto-regressive process. As a parameter for calculating the dereverberation filter, the short-time power spectral density (PSD) of the speech is estimated iteratively. Due to the linear prediction, WPE introduces no non-linear distortion to the signal phase. For the multi-channel case, the spatial directivity of a beamformer or the channel correlations of the microphone array [4] can be utilized to effectively suppress reverberation with a spatial beam pattern [5], which can also be combined with WPE [6][7].

All those conventional algorithms are formulated on concrete and complete theoretical foundations. But real-world reverberation is too complex to be fit mathematically by handcrafted models. Instead, the data-driven deep neural network (DNN) model, with its powerful non-linear fitting ability, has recently shown promising performance [8]. Some studies, such as [9], combine deep learning methods with conventional methods, mostly by using a DNN to robustly estimate the critical parameters of those conventional methods. These combined methods usually perform well with lower computational complexity but are limited by the upper bound of the conventional part. Currently, end-to-end neural networks for speech signal processing have become popular, with impressive performance. There are mainly two ways to promote neural network performance: one is to properly design the learning target and the output constraint [10][11][12]; the other is to design the topology of the model or neural operators that can effectively fit the problem [13][14][15]. Convolutional Recurrent Networks (CRN) are representative models [16][17][18] for speech enhancement, which learn speech patterns effectively by combining CNN and RNN together.

The dense convolutional neural network (DenseNET) [19] has been utilized recently for speech dereverberation [20][21]. A substantial improvement was reported, which makes it the SOTA algorithm. The DenseNET is a very deep model with a total of 56 layers. Its dominant part is the 54-layer CNN, and only a 2-layer LSTM is embedded in the middle of the U-Net structure. Within such a structure, however, the LSTM might not be fully utilized. In fact, the mechanisms of the CNN and the RNN are quite different: they can be viewed as finite and infinite impulse response filters, respectively. This motivates us to pursue a sufficient and complete combination of these two mechanisms, which is reasonable and promising. In our previous work on dual-channel speech enhancement [15], by using the channel-wise LSTM, we explored the independence of frequency bin-wise processing, which improves the performance and significantly reduces the computational complexity. In the same way, RNNs can be equipped in DRC-NET on a large scale.
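One way to realize such a channel-wise LSTM is to run a shared LSTM along time independently for every frequency bin, treating the channel dimension as its input features; a PyTorch sketch with illustrative shapes:

```python
import torch
import torch.nn as nn

class ChannelWiseLSTM(nn.Module):
    """Runs a shared LSTM along time, independently for each frequency bin."""
    def __init__(self, channels, hidden):
        super().__init__()
        self.lstm = nn.LSTM(channels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, channels)

    def forward(self, x):                      # x: (B, C, T, F)
        B, C, T, Fq = x.shape
        # One time sequence per frequency bin, with channels as features
        seq = x.permute(0, 3, 2, 1).reshape(B * Fq, T, C)
        h, _ = self.lstm(seq)
        out = self.proj(h).reshape(B, Fq, T, C).permute(0, 3, 2, 1)
        return out                             # (B, C, T, F), weights shared over F
```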

Our study makes three main contributions. First, we introduce the channel-wise LSTM and unify its usage across the time and frequency dimensions. Second, we propose a recurrent convolutional (RC) unit which combines the channel-wise LSTM and CNN. Third, we utilize the RC unit to construct a densely connected recurrent convolutional model for speech dereverberation, which dramatically improves the performance compared to the SOTA algorithm.

8. Dual-stream Speech Dereverberation Network Using Long-term and Short-term Cues

With the development of speech processing, hands-free speech techniques are becoming increasingly popular in various applications, such as chat robots and smart speakers. However, speech captured by distant microphones in such scenarios is inevitably corrupted by room reverberation. To overcome this challenging problem, speech dereverberation (SD) [1] algorithms often function as a preprocessing module that helps suppress the reverberation in observed speech signals before they are fed into the following stages of speech applications.

Recently, SD methods based on deep learning (DL) have significantly improved speech intelligibility performance due to their strong regression capability. The general solution is to view the SD problem as a multivariate regression problem, where the nonlinear regression function is parametrized by various networks, such as deep neural networks (DNNs) [2]–[5], convolutional neural networks (CNNs) [6]–[8], recurrent neural networks (RNNs) [9]–[12] and generative adversarial networks (GANs) [13], [14]. Based on how the enhanced target is achieved, the techniques can be categorized into mapping-based methods [3] and masking-based methods [4]. The former aims to learn a nonlinear mapping function from the observed reverberant speech to the desired clean speech, while the latter focuses on learning a time-frequency mask that converts the observed reverberant speech to the desired clean speech. Despite the success of these existing methods, they often fail to meet complex scenarios, for the following reason: because traditional DL-based SD can only obtain cues from the current speech frame within a short-term or context frame [2], [11], [13], its ability to model the long-term information needed to remove late reverberation is limited.
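The two families differ mainly in the learning target; a magnitude-domain numpy sketch of typical mapping and masking targets (one common choice each, used here as an illustration):

```python
import numpy as np

def training_targets(S, Y, eps=1e-8):
    """Targets for mapping- vs masking-based SD, from clean/reverberant STFTs."""
    mapping_target = np.log(np.abs(S) + eps)   # mapping: clean log-magnitude
    masking_target = np.clip(np.abs(S) / (np.abs(Y) + eps), 0.0, 1.0)  # T-F mask
    return mapping_target, masking_target
```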

Generally, a low-dimensional representation extracted from the input feature is a prerequisite for target learning [15]. Therefore, designing a neural network whose input feature contains both long-term and short-term cues is important for DL-based SD. Motivated by the fact that reverberation is affected by many previous speech frames at multiple levels, we propose a dual-stream speech dereverberation network (DualSDNet). First, to maximize the utilization of long-term speech to remove late reverberation, a finite impulse response (FIR) filter is used to design a multi-scale long-term storage filter. The FIR design is inspired by the generation process of reverberation, and it can store long-term information of at most T_R seconds. Second, due to the different effects of long-term and short-term speech on the current frame, we further design a dual-stream network. On the one hand, the fully convolutional networks [6] in the long and short streams are used to learn the relationship between previous information and the current direct-sound frame, respectively. On the other hand, a gating unit based on a multi-subband attention mechanism gives higher weight to the dual-stream learned speech representation.
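As a toy illustration of multi-scale long-term storage (filter lengths are ours, not those of DualSDNet), causal FIR averaging filters of increasing length can summarize progressively longer spans of past frames, with the longest filter bounding the stored history:

```python
import numpy as np

def multiscale_context(frames, scales=(5, 20, 80)):
    """Causal moving averages of past frames at several temporal scales.

    frames: (T, F) feature sequence; returns (T, F * len(scales)).
    The longest scale bounds how far back information is stored.
    """
    T, Fq = frames.shape
    outs = []
    for L in scales:
        kern = np.ones(L) / L
        # Full convolution truncated to T keeps only past and current frames
        avg = np.apply_along_axis(lambda c: np.convolve(c, kern)[:T], 0, frames)
        outs.append(avg)
    return np.concatenate(outs, axis=1)
```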

9. Extending DNN-based Multiplicative Masking to Deep Subband Filtering for Improved Dereverberation

In modern communication devices, recorded speech is corrupted when clean speech sources are affected by interfering speakers, background noise and room acoustics. Speech restoration aims to recover clean speech from the corrupted signal; two distinct tasks, denoising and dereverberation, are considered here [1, 2].

Traditional speech restoration algorithms are based on statistical methods, exploiting properties of the target and interfering signals to discriminate between them [3]. These include linear prediction [4], spectral enhancement [5], inverse filtering [6], and cepstral processing [7]. Modern approaches rely mostly on machine learning. In this field, predictive methods, learning a one-to-one mapping between corrupted and clean speech through a deep neural network (DNN), are most popular [8, 9]. A large portion of DNNs used in speech restoration are trained for mask estimation, i.e. they learn a mask value to be applied to each single bin of the signal, either in a learnt domain [10] or in the time-frequency (TF) domain [11, 12]. In contrast, some approaches employ deep filtering [13], which means that their final stage involves a convolution between the input signal and a learnt multi-frame TF filter [14, 15, 16, 17, 18, 19, 20]. In [17], this filter is parameterized as a multi-frame MVDR [21] for denoising. A DNN-parameterized weighted prediction error subband filter is proposed in [19, 18, 20]. A deep filter can also be directly learnt, e.g. in [15] as a frequency-independent time filter or in [13, 14] as a joint time-frequency filter.

In this paper, we propose a deep subband filtering extension (DSFE) scheme to transform masking-based speech restoration DNNs into deep subband filters. The proposed extension is implemented by using a learnable temporal convolution at the output of the original masking DNN backbone and training the resulting architecture in an end-to-end fashion in the TF domain. Most of the time, the original masking DNN already handles multi-frame filtering internally through e.g. temporal convolutions. However, we show that enforcing explicit multi-frame subband filtering as the final stage of processing results in a significant performance increase for dereverberation while leaving the denoising performance virtually unaltered. We justify our approach by relating time-frequency multiplicative masking and deep subband filtering to the noising and reverberation corruption models respectively. The proposed approach has negligible computational overhead and constitutes a generic module that can be plugged into any masking-based system.
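The contrast between the two operations can be written in a few lines: multiplicative masking applies one complex gain per TF bin, while deep subband filtering convolves each frequency band over past frames, with masking as the single-tap special case. A numpy sketch:

```python
import numpy as np

def apply_mask(X, M):
    """Multiplicative masking: one complex gain per T-F bin. X, M: (F, T)."""
    return M * X

def apply_subband_filter(X, H):
    """Deep subband filtering: S(f, t) = sum_n H(f, n, t) * X(f, t - n).

    X: (F, T) complex STFT; H: (F, N, T) per-bin causal filters.
    Masking is the special case N = 1.
    """
    Fq, N, T = H.shape
    S = np.zeros_like(X)
    for n in range(N):
        S[:, n:] += H[:, n, n:] * X[:, :T - n]
    return S
```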

10. Monaural Speech Dereverberation Using Deformable Convolutional Networks

SPEECH captured by distant microphones in reverberant environments often suffers from reduced intelligibility and quality. To address this, dereverberation systems are designed to suppress energies from reflections and retain speech from the direct path. These systems are often employed as preprocessors for various speech applications like automatic speech recognition (ASR) [1], speaker identification (SID) [2], [3], [4], hearing aids, teleconferencing, hands-free communication, and voice control for human-assistance devices. Despite extensive research on both signal processing (SP) and deep learning (DL)-based techniques for single-channel [5], [6], [7], [8] and multi-channel [9], [10], [11] speech capture over the past decade, this problem still persists in many speech applications that rely on distant microphones for speech capture in naturalistic environments [12], [13].

A reverberant speech signal can be mathematically modeled as a convolutive mixture of an anechoic clean speech with a room impulse response (RIR), which comprises the energies captured from: (i) the direct path, (ii) early reflections (ER), within the first 50 milliseconds after the direct path, and (iii) late reflections (LR), after 50 milliseconds of the direct path. The energies within these reflections are significantly influenced by the size of the room and the damping properties of the surfaces of the environment, as well as the distance from the source to the capturing microphone. The early and late reflections are the leading cause of degraded speech quality, as they create smearing effects in speech-specific frequency bands over time.

An effective speech dereverberation system aims to suppress the convolutive effects of early and late reflections, preserving only the speech from the direct path. In recent years, many speech enhancement (SE) systems have been proposed using fully convolutional networks (FCN) [14], [15], recurrent neural networks (RNN) [16], [17], etc. that estimate time-frequency (TF) masks to reduce background noise [18], [19], [20], reverberation [21], [22], [23], [24], or both [25], [26], [27]. Early DL-based research focused on enhancing only the magnitude response of the complex-valued spectrogram and reusing the phase of reverberant speech samples [28], [29]. Although this improved the performance of back-end speech applications, it couldn't significantly enhance overall speech quality, supporting the notion that the phase response is crucial to perceptual quality and intelligibility. Later, complex ratio masks (cRM) [9], [30], [31], [32] and end-to-end systems [33], [34], [35] were proposed for simultaneously enhancing the magnitude and phase responses of distorted speech signals to improve the overall perceptual quality of speech.

Although these systems have demonstrated satisfactory performance, they commonly overlook two issues: (i) DL-systems' predetermined weights are insensitive to the degree of environmental distortion in speech during processing, and (ii) DL-systems fail to explicitly address the loss of formant structure resulting from destructive interference between desired speech and undesired reflections in certain time-frequency (TF) bins caused by the smearing effects of reverberation.

To address the latter issue, different approaches such as generative adversarial network (GAN)-based methods [36], [37], [38], [39], [40], [41], multi-branch decoupling networks [42], [43], [44], and deep filtering (DF) [45] have been proposed to regenerate the lost formant structure. To address the former issue, DL-systems are trained on a vast amount of real-world and simulated speech data to increase their resilience to various levels and types of environmental distortions that may be introduced into speech. Moreover, several DL-systems have been proposed with the use of multi-head self-attention [46], [47], [48], input-dependent gating mechanisms [49], [50], [51], and environment-aware multi-tasking abilities [52], [53], [54], [55], [56], to enable the system to adapt its processing during inference by learning dynamic representations of distorted speech signals. Several multi-frame processing algorithms and deep learning counterparts, including multi-frame filtering techniques for single-channel and multi-channel processing [57], [58], [59], [60], have been developed to estimate complex-valued causal filters from covariance matrices of input reverberant speech signals during inference. These methods also enable DL-systems to adapt their processing during inference and improve their ability to handle unforeseen reverberation levels.

In this paper, we propose an end-to-end DL-based system for single-channel speech dereverberation which combines the potentials of deformable convolutional networks (DCN) [61], [62] and single-channel multi-frame filtering [57] to enhance the perceptual quality of reverberant speech signals. The following are our primary contributions:

• We propose using deformable convolutional networks (DCNs) with kernel offset prediction modules to identify the optimal TF sampling locations on the input reverberant speech spectrogram before employing convolution operations (see the sketch after this list). These offset prediction modules are optimized through the network's training loss, without additional supervision. As a result, our system can adapt its processing to varying levels of distortion in the input reverberant speech signals.
• We replace the conventional complex-valued time-frequency (TF) masking approach with deep filtering [45], which estimates a complex-valued TF filter that can be applied to each TF bin and its neighboring regions on the spectrogram, providing us the potential to recover any lost formant structure.
• We estimate a DL-based complex-valued multi-frame filter similar to [58], [63] from speech embedding(s) extracted using deformable convolutional layers for each TF bin, to implicitly enhance the phase response, which is crucial for improved perceptual quality.
• We propose using frame-level speech activity detection (SAD) to further suppress energies in non-speech regions and enhance the efficiency of the proposed speech dereverberation system.
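As a minimal sketch of the offset-predicting deformable convolution named in the first contribution (channel counts are illustrative, and this shows only the mechanism, not the proposed network), torchvision's deform_conv2d can be driven by offsets predicted from the input itself:

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableBlock(nn.Module):
    """3x3 deformable conv whose sampling offsets are predicted from the input."""
    def __init__(self, in_ch=16, out_ch=16, k=3):
        super().__init__()
        # 2 offsets (dy, dx) per kernel element and output location
        self.offset_pred = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.05)

    def forward(self, x):                  # x: (B, C, F, T) spectrogram features
        offsets = self.offset_pred(x)      # trained via the network loss only
        return deform_conv2d(x, offsets, self.weight, padding=1)
```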

11. Monaural Speech Dereverberation Using Temporal Convolutional Networks With Self Attention

IN REAL-WORLD environments, when people converse in a room or communicate with a device, the speech signal is inevitably distorted by its delayed and damped reflections from various surfaces (walls, ceilings, tables, and so forth) during sound propagation. This type of distortion, namely room reverberation, degrades the speech quality and intelligibility for human listeners [19], especially when the reverberation time (T60) is long. Moreover, reverberation poses a serious problem for many speech-related applications including automatic speech recognition (ASR), which is widely utilized in smart speakers and in-car systems for voice control. It has been shown that the performance of ASR systems is severely degraded under far-field conditions [49]. Therefore, reducing the effect of reverberation is beneficial for both human listeners and machine perception systems. In this study, we address room reverberation in monaural scenarios, which is easier to apply but more challenging for lack of the spatial information provided by a microphone array.

Due to the importance of speech dereverberation, it has been extensively studied and many algorithms have been developed in the past decades. For example, by assuming an exponentially decaying model of the room impulse response (RIR), Lebart et al. [25] proposed a power spectral subtraction algorithm to remove late reverberation. Wu and Wang [47] proposed a two-stage dereverberation algorithm, which employed an inverse filter to reduce early reflections in the first stage and spectral subtraction to remove long-term reverberation in the second stage. In [28], López et al. considered the magnitude spectrum of late reverberation as a sparse linear combination of the magnitude spectra of past frames and proposed to predict the late reverberation using a Lasso-based approach. In order to obtain an inverse filter in the short-time Fourier transform (STFT) domain, Nakatani et al. [32] employed a long-term linear prediction method, and achieved good dereverberation performance [6], [24]. Specifically, based on a relatively large number of past frames, frequency-dependent linear prediction filters were first estimated by minimizing the weighted prediction error (WPE). Then the enhanced signal was obtained by subtracting the filtered signal (the estimated late reverberation) from the reverberant signal. Since it operates in the complex domain, both the magnitude and phase information could be recovered. Although few distortions were introduced in WPE-processed signals, a certain amount of reverberation remains.
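As a concrete illustration of the linear-prediction idea behind WPE, the sketch below runs simplified single-channel delayed linear prediction on the STFT frames of one frequency band. The delay, tap count, and iteration count are illustrative, and mature implementations (multi-channel, with careful numerics) differ in detail.

    import numpy as np

    def wpe_one_band(x, delay=3, taps=10, iters=3, eps=1e-8):
        """Toy weighted-prediction-error dereverberation for one band.

        x : complex STFT frames of a single frequency bin, shape (T,)
        Returns frames with the predicted late reverberation subtracted.
        """
        T = len(x)
        d = x.copy()                               # desired-signal estimate
        for _ in range(iters):
            lam = np.maximum(np.abs(d) ** 2, eps)  # time-varying weights
            R = np.zeros((taps, taps), dtype=complex)
            p = np.zeros(taps, dtype=complex)
            for t in range(delay + taps - 1, T):
                xt = x[t - delay - taps + 1:t - delay + 1][::-1]  # delayed taps
                R += np.outer(xt, xt.conj()) / lam[t]
                p += xt * np.conj(x[t]) / lam[t]
            g = np.linalg.solve(R + eps * np.eye(taps), p)  # prediction filter
            for t in range(delay + taps - 1, T):
                xt = x[t - delay - taps + 1:t - delay + 1][::-1]
                d[t] = x[t] - np.vdot(g, xt)       # subtract predicted tail
        return d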

In recent years, deep neural networks (DNNs) have been widely used in speech enhancement or separation, and substantially outperform conventional enhancement methods [44]. For speech dereverberation, Han et al. [13], [14] first proposed to learn a spectral mapping from the log magnitude spectrum of reverberant speech to that of anechoic speech by using a DNN. Noting the importance of reverberation-dependent parameters in supervised training, Wu et al. [46] developed a reverberation-time-aware approach to suppress reverberation over a wide range of reverberation times. Better performance over Han et al.'s system was reported. However, the feed-forward DNNs utilized in these approaches can only capture limited contextual information. To overcome such a limitation, Santos and Falk [39] proposed a recurrent neural network (RNN) to capture long-term contexts while employing a 2-D convolutional layer to extract short-term contextual information. Their study reported some benefits from residual connections. By employing an RNN with long short-term memory (LSTM), we previously estimated the magnitude spectrum of late reverberation first, and then subtracted it from the magnitude spectrum of reverberant speech [52]. Our training used the magnitude spectrum of the direct sound plus early reflections as the training target. It is worth noting that this design can be viewed as implicitly adding skip connections from the LSTM's input layer to the output layer.

Despite the advantage of incorporating long-range contexts, one drawback of RNN-based approaches is that the output of the current time step depends on the computation of the previous time steps, which prevents parallelization. On the other hand, stacking multiple layers in convolutional neural networks (CNNs) also captures contextual information. Another benefit is that the hierarchical structure of a CNN provides a shorter path for incorporating distant information than the chain structure of an RNN [9]. Systematic evaluations [2] show better performance of convolutional architectures for sequence modelling. This motivates us to develop a CNN-based system for speech dereverberation.

As indicated in a previous study [46], to deal with various reverberant environments, some reverberation-time-dependent parameters should be included. Instead of using a reverberation time estimator, we believe that such information can be encoded in the relationship among input features (e.g., the magnitude spectrum) extracted from reverberant speech. This inspires us to apply an attention mechanism [43] to input features. By exploring the relevance among features at different time steps, the attention mechanism is expected to produce a dynamic representation according to different reverberant environments, e.g., different T60s and direct-to-reverberation ratios (DRRs). We take an attention layer as a feature enhancement module for the rest of the dereverberation system.

Based on the above analyses, we propose a monaural speech dereverberation algorithm using temporal convolutional networks (TCNs) with self attention. More specifically, a self-attention module is first applied to raw input features to generate dynamic representations, and then a temporal convolutional network is employed to learn the nonlinear mapping from the enhanced features to the magnitude spectrum of anechoic speech. Finally, a 1-D convolutional layer is added to smooth the estimated magnitude spectrum among adjacent frames. A recent study [36] employing wide residual networks is related to our algorithm, but we use a different network architecture and introduce the self-attention mechanism. It is worth noting that TCNs have been successfully used for speaker separation [29] and speech enhancement [33], both in the time domain.
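For readers unfamiliar with TCNs, the following sketch shows one dilated residual block of the kind stacked in such a network. Channel sizes and the causal padding are illustrative assumptions; the exact block design in our system may differ.

    import torch
    import torch.nn as nn

    class DilatedResBlock(nn.Module):
        """One TCN residual block: 1-D dilated convolution over time.

        Stacking blocks with dilations 1, 2, 4, ... grows the temporal
        receptive field exponentially with depth.
        """
        def __init__(self, channels=256, hidden=512, kernel=3, dilation=1):
            super().__init__()
            pad = (kernel - 1) * dilation          # causal: pad the past only
            self.net = nn.Sequential(
                nn.Conv1d(channels, hidden, 1), nn.PReLU(),
                nn.ConstantPad1d((pad, 0), 0.0),
                nn.Conv1d(hidden, hidden, kernel, dilation=dilation), nn.PReLU(),
                nn.Conv1d(hidden, channels, 1),
            )

        def forward(self, x):                      # x: (batch, channels, frames)
            return x + self.net(x)                 # residual connection

    blocks = nn.Sequential(*[DilatedResBlock(dilation=2 ** i) for i in range(6)])
    y = blocks(torch.randn(1, 256, 100))           # toy input: 100 frames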

12. REAL-TIME DENOISING AND DEREVERBERATION WITH TINY RECURRENT U-NET

In this paper, we focus on developing a deep learning-based speech enhancement model for real-world applications that meets the following criteria: 1. a small and fast model that can reduce the single-frame real-time factor (RTF) as much as possible while keeping competitive performance against state-of-the-art deep learning networks; 2. a model that can perform both denoising and dereverberation simultaneously.

To address the first issue, we aim to improve a popular neural architecture, U-Net [1], which has proven its superior performance on speech enhancement tasks [2, 3, 4]. Previous approaches that use U-Net in source separation applications apply the convolution kernel not only on the frequency axis but also on the time axis. This non-causal nature of U-Net increases computational complexity because additional computations are required on past and future frames to infer the current frame. Therefore, it is not suitable for online inference scenarios where the current frame needs to be processed in real time. In addition, the time-axis kernel makes the network computationally inefficient because there exists redundant computation between adjacent frames in both the encoding and decoding paths of U-Net. To tackle this problem, we propose a new neural architecture, Tiny Recurrent U-Net (TRU-Net), which is suitable for online speech enhancement. The architecture is designed to enable efficient decoupling of the frequency-axis and time-axis computations, which makes the network fast enough to process a single frame in real time. The number of parameters of the proposed network is only 0.38 million (M), which is small enough to deploy the model not only on a laptop but also on a mobile device, and even on an embedded device when combined with a quantization technique [5]. The details of TRU-Net are described in Section 2.
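A minimal sketch of the decoupling idea, assuming illustrative layer sizes: convolutions with a time kernel of one encode each frame independently (no look-ahead, no redundant cross-frame computation), and a recurrent layer then carries information across frames. This is not the published TRU-Net configuration, only the structural principle.

    import torch
    import torch.nn as nn

    class FrameWiseEncoder(nn.Module):
        """Decoupled frequency/time processing: frequency-axis conv + time RNN."""
        def __init__(self, freq_bins=256, channels=32, hidden=64):
            super().__init__()
            # Input (batch, 2, time, freq); kernel (1, 5) touches frequency only.
            self.freq_conv = nn.Conv2d(2, channels, kernel_size=(1, 5),
                                       stride=(1, 2), padding=(0, 2))
            self.time_rnn = nn.GRU(channels * (freq_bins // 2), hidden,
                                   batch_first=True)

        def forward(self, spec):                    # (batch, 2, T, F)
            h = torch.relu(self.freq_conv(spec))    # (batch, C, T, F//2)
            b, c, t, f = h.shape
            h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)
            out, _ = self.time_rnn(h)               # recurrence over frames
            return out                              # (batch, T, hidden)

    x = torch.randn(1, 2, 100, 256)                 # real/imag parts, 100 frames
    print(FrameWiseEncoder()(x).shape)              # torch.Size([1, 100, 64])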

Next, to suppress noise and reverberation simultaneously, we propose a phase-aware β-sigmoid mask (PHM). The proposed PHM is inspired by [6], in which the authors propose to estimate phase by reusing an estimated magnitude mask value from a trigonometric perspective. The major difference between PHM and the approach in [6] is that PHM is designed to respect the triangular relationship between the mixture, the target source, and the remaining part, hence the sum of the estimated target source and the remaining part is always equal to the mixture. We extend this property into a quadrilateral by producing two different PHMs simultaneously, which allows us to effectively deal with both denoising and dereverberation. We will discuss PHM in further detail in Section 3.
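The triangular relationship can be illustrated with a toy numpy sketch: given magnitude mask values for the target and the remaining part that satisfy the triangle inequality, the law of cosines fixes the target's phase offset so that the two estimates sum exactly to the mixture. The mask values and sign input below are stand-ins for network outputs, and the actual PHM derivation in the paper differs in detail.

    import numpy as np

    def phase_aware_masks(X, m_s, m_r, sign=1.0):
        """Toy phase-aware masking with a triangle constraint.

        X        : complex mixture STFT bins
        m_s, m_r : magnitude mask values for source and remainder, assumed
                   to satisfy |1 - m_s| <= m_r <= 1 + m_s (triangle inequality)
        """
        # Law of cosines on the triangle with sides 1, m_s, m_r.
        cos_th = np.clip((1.0 + m_s ** 2 - m_r ** 2) / (2.0 * m_s + 1e-8),
                         -1.0, 1.0)
        theta = sign * np.arccos(cos_th)      # sign resolves the +/- ambiguity
        S = m_s * np.abs(X) * np.exp(1j * (np.angle(X) + theta))
        R = X - S                             # remainder: S + R == X by design
        return S, R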

13. Receptive Field Analysis of Temporal Convolutional Networks for Monaural Speech Dereverberation

In far-field recording environments, reverberation affects the quality and intelligibility of the recorded audio signal [1]. This remains a problem for many domains in speech technology [2], [3]. Dereverberation of speech signals has been studied thoroughly over the past decades [4], [5] based on machine learning models and signal processing (SP) techniques [2], [6]. Most SP approaches model reverberant speech as a mixture of the anechoic speech signal summed with delayed, exponentially decaying weighted sums of itself. The sequence of weights used in this summation is commonly referred to as the RIR, which is typically modelled in three parts: the direct path, the early reflections (ERs) and the late reflections (LRs) [6]. ERs in speech are typically assumed to occur within the first 50 ms after the direct path. SP methodologies for suppressing reverberant content in speech signals span a range of techniques, with the most prominent recent approaches using spectral suppression or linear predictive modelling [7], [8]. DL models have mostly surpassed pure SP approaches for enhancing reverberant speech signals on objective measures such as word error rate (WER) or perceptual evaluation of speech quality (PESQ) [9]–[11]. Time-domain audio separation networks (TasNets) [12] were proposed for speech separation and were later applied to dereverberation as well [13]. Convolutional TasNets (Conv-TasNets) [14] replace the BLSTM network of TasNets with a fully convolutional model using a TCN [15]. TCNs have also been shown to be effective at more general speech enhancement tasks including dereverberation [16]. A dereverberation network using a TCN with self attention was proposed in [11], which demonstrated that TCN models give competitive results with other state-of-the-art techniques such as deep neural network (DNN) weighted prediction error (WPE) models.
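The three-part RIR model can be made concrete with a short sketch that splits an RIR at the direct-path peak and the 50 ms early-reflection boundary mentioned above. The peak-picking rule and boundary follow the common convention described in the text, not any one cited paper.

    import numpy as np

    def split_rir(rir, fs, early_ms=50.0):
        """Split an RIR into direct path, early and late reflections."""
        d = int(np.argmax(np.abs(rir)))           # index of the direct path
        e = d + int(early_ms * fs / 1000.0)       # end of early reflections
        direct = np.zeros_like(rir)
        early = np.zeros_like(rir)
        late = np.zeros_like(rir)
        direct[:d + 1] = rir[:d + 1]
        early[d + 1:e] = rir[d + 1:e]
        late[e:] = rir[e:]
        return direct, early, late

    # reverberant = np.convolve(speech, rir)
    # target      = np.convolve(speech, direct + early)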

In this work, Conv-TasNets are analysed for application to monaural dereverberation of speech. The main focus is to analyse the interplay between the receptive field (RF), model size and RIR length on the capability of Conv-TasNets to dereverberate speech.

14. SkipConvGAN: Monaural Speech Dereverberation using Generative Adversarial Networks via Complex Time-Frequency Masking

AS THE SOUND propagates in naturalistic environments such as conference rooms, lobbies or cafeterias, speech captured by a device is perceived as a distorted mixture of clean speech, intrusive background noise, and its own reflections from surrounding objects and surfaces. As a result, speech captured by distant microphones in such confined spaces faces two major challenges: (a) background noise: speech from multiple overlapping speakers, music, or other acoustical sounds picked up from the environment; (b) reverberation: reflection-induced self-distortion. Speech intelligibility is greatly reduced in either or both circumstances. These challenges are well-known in the research community and have been addressed through the use of a variety of signal processing and deep neural network (DNN) techniques. The terms “speech denoising” and “speech dereverberation” refer to systems developed to address them. These systems fall within the broader category of “speech enhancement”, which is the task of recovering a desired speaker’s clean speech in a noisy and reverberant environment. Speech enhancing techniques for eliminating background noise or reducing the impact of reverberation from captured speech signals have been shown to be beneficial in a wide range of back-end speech applications, including automatic speech recognition (ASR) systems, speaker identification (SID), hearing aids for cochlear implants, teleconferencing systems, hands-free communication, and voice control for human-assisting devices.

The speech denoising aspect of enhancement is a widely used technique aimed at separating the desired audio signal from intrusive noise [1]. Background noises of various types and levels, i.e., signal-to-noise ratios (SNRs), have a direct impact on the performance of these systems. Background noises that closely resemble the desired speech, such as those produced in a cafeteria or lunchroom at lower SNRs, make the task extremely difficult for denoising systems because they share frequency bands with the desired speech [2].

On the other hand, the speech dereverberation aspect of enhancement refers to the task of suppressing the convolutive effects of reflections induced into speech by the environment. The major factors that determine the number of reflections induced by the environment are the room size, the damping properties of surfaces in the environment, and the distance of the capturing microphone from the origin of the desired speech. Lower signal-to-reverberation modulation energy ratio (SRMR) [3], [4] conditions, which correspond to stronger reflections, significantly reduce the performance of dereverberation systems because the reflections cause smearing effects across time in the frequency bands occupied by the desired speech. Over time, researchers have introduced multiple techniques using single microphones, as well as techniques incorporating multiple microphones to leverage spatial information about the desired speech, background noises, and reflections in addition to acoustical information. In this study, we focus exclusively on the speech dereverberation aspect of enhancement to improve the performance of systems using a single microphone.

One early attempt to reduce the effects of reverberation was to identify distinct peaks in the cepstrum of the speech signal and design a comb filter to cancel them out. However, it was discovered that this approach is inefficient in severe reverberant conditions [5]. Spectral subtraction, a widely used technique for speech denoising, was applied to dereverberation via statistical modeling of room impulse responses (RIRs) using Gaussian noise modulated with exponentially decaying functions [6]. The decay rate of the exponential function was used to govern the rate at which the reflections fade in a given environment, i.e., the reverberation time. This technique has shown promising performance when extended to multi-microphone speech recordings. Likewise, the statistical estimation of the room impulse response was used to compute an inverse filter capable of removing the effects of reverberation. This approach, however, could not produce satisfactory results for speech in realistic environments because it assumes the RIR to be a minimum-phase function, which is frequently not satisfied in naturalistic environments. Later, researchers proposed echo cancellation strategies for estimating RIRs from the second-order statistics of reverberant speech signals and using them to invert reverberation effects. Many signal processing-based algorithms have been proposed with the goal of balancing processing latency against the level of attenuation required. Later still, many array-processing solutions were proposed, such as spatio-temporal filtering, eigendecomposition, and multi-channel system identification, which use multiple microphones to capture spatial information in order to pick out the direct path and reduce the energy of reflections in recorded speech signals.
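The statistical RIR model described above (Gaussian noise modulated by an exponential decay governed by the reverberation time) is simple enough to sketch directly; the length and random seed below are arbitrary choices.

    import numpy as np

    def synthetic_rir(t60, fs, length_s=0.5, seed=0):
        """Toy statistical RIR: white Gaussian noise times an exponential
        decay whose rate is set by T60 (energy falls 60 dB over t60 seconds).
        Not a replacement for measured or image-method RIRs.
        """
        rng = np.random.default_rng(seed)
        n = int(length_s * fs)
        t = np.arange(n) / fs
        envelope = 10.0 ** (-3.0 * t / t60)   # amplitude: -60 dB energy at t60
        return rng.standard_normal(n) * envelope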

To address speech denoising tasks, time-frequency masking approaches were first introduced. Later, this technique was adapted to address speech dereverberation too. The ideal binary mask (IBM) has been shown in studies to improve speech intelligibility in reverberant speech. To compute the speech presence probability (SPP) in each time-frequency cell of a reverberant spectrogram, the IBM treats the direct path and early reflections as the target speech and the rest as the masker. This mask is then applied to the reverberant spectrogram in order to resynthesize the dereverberated speech signal. IBM-based approaches were considered to have a hard decision boundary, which was the cause of musical noise artifacts in synthesized speech. To address these concerns, the hard decision boundaries were converted to soft boundaries to create the ideal ratio mask (IRM). Initially, the IBM and IRM were applied to the magnitude responses of the distorted spectrograms, and the unaltered phase of the reverberant spectrogram was used to reconstruct the enhanced speech signal. This method has assisted many speech back-end systems in improving their performance. However, because the phase distortions offset the magnitude response improvements, this approach was unsuitable for improving the perceptual quality of the speech signal. To address this issue, researchers proposed the complex ideal ratio mask (cIRM) [7]–[9], which estimates a complex time-frequency (TF) mask with the goal of enhancing both the magnitude and phase of the reverberant speech complex spectrogram.
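A minimal sketch of the two masks discussed above, computed from oracle target and masker spectrograms. The local criterion lc_db and the exponent beta are common illustrative choices rather than values from the cited works.

    import numpy as np

    def ideal_masks(S, R, beta=0.5, lc_db=0.0):
        """Ideal binary and ratio masks from oracle spectrograms.

        S, R : complex STFTs of the target (direct sound plus early
        reflections) and the masker (late reverberation plus noise).
        IBM makes a hard speech-dominance decision per T-F cell; the
        IRM softens it into a ratio.
        """
        ps, pr = np.abs(S) ** 2, np.abs(R) ** 2
        snr_db = 10.0 * np.log10(ps / (pr + 1e-12) + 1e-12)
        ibm = (snr_db > lc_db).astype(float)      # hard decision boundary
        irm = (ps / (ps + pr + 1e-12)) ** beta    # soft boundary
        return ibm, irm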

With advancements in deep neural networks (DNNs) [10]–[13], most speech researchers have begun developing speech denoising and dereverberation strategies to estimate TF masks. Many researchers have also looked into denoising autoencoders, recurrent networks, gated recurrent units (GRUs) [14], and convolutional networks and their variants, such as temporal convolutional networks (TCNs) [15], [16], fully convolutional networks (FCNs) [17], [18], and gated convolutional networks. Unlike most signal processing methods, deep neural networks can learn patterns for either denoising or dereverberation and generalize them to larger unseen scenarios with the help of nonlinear optimization. In recent years, many approaches from the field of image processing have been adapted to speech enhancement for enhancing time-frequency spectrograms. Self-attention networks (SANs) [19], [20] and generative adversarial networks (GANs) [21]–[23] are two such deep learning techniques that have attracted attention. SANs are known for their performance in sequence-to-sequence tasks, such as machine translation. Self attention (SA) [24], [25] is a selective context aggregation, in which the predictions are computed based only on a subset of the input sequence that the model learns to attend to. This ability to learn which regions of a sequence to attend to for better estimation has been successfully used in many speech applications, including speech enhancement. On the other hand, GANs are known for their ability to generate convincing unseen images when trained on natural images. Lately, researchers have proposed many improvements to the GAN architecture, which help improve the quality of the generated images by enhancing their fine details. Many researchers are still exploring the potential of this approach for speech enhancement in the presence of additive noise. Some of the GAN approaches that have shown promising improvements in the perceptual quality of speech are the speech enhancement GAN (SEGAN) [26], the frequency-based speech enhancement GAN (FSEGAN) [22], and MetricGAN [27]. Current studies focus on various GAN approaches for speech denoising, but few have looked into GANs' abilities for speech dereverberation [28], [29]. In this study, we keep our focus on building a speech dereverberation system using self-attention and a GAN. We use the REVERB challenge corpus [30], [31], which is widely used to benchmark improvements in the field of speech dereverberation, to evaluate our proposed system.

Our main contributions towards the proposed network are summarized as follows:

• We propose a novel fully complex-valued generative adversarial network, SkipConvGAN, that uses time-frequency masking to enhance the complex spectrogram. Unlike other GAN-based systems that handle the magnitude response or time-domain speech signals using real-valued networks, our method incorporates complex-valued networks into the design of both the generator and discriminator of the proposed network (a minimal sketch of complex-valued convolution follows this list).

• We replace the skip connections in the generator network with the proposed complex-valued convolutional modules, i.e., Skipconv blocks (SB), to bridge the semantic gap between feature maps exchanged between the encoder and decoder of the generator network and strengthen feature representation.

• We propose a complex-valued time-frequency selfattention (TF-SA) module that attends to features in both time and frequency dimensions while also preserving the interdependence of the real and imaginary components of the intermediate complex-valued feature maps.

• We train the proposed network with feature loss in addition to adversarial loss, utilizing the complex-valued patch-discriminator network as a feature extractor.
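As background for the complex-valued building blocks in the list above, here is a minimal sketch of a complex-valued 2-D convolution: it implements (Wr + jWi)(xr + jxi) with two real convolutions, preserving the interdependence of real and imaginary parts instead of treating them as unrelated channels. Layer sizes are illustrative, and this is not the exact SkipConvGAN module.

    import torch
    import torch.nn as nn

    class ComplexConv2d(nn.Module):
        """Complex conv: (Wr*xr - Wi*xi) + j(Wr*xi + Wi*xr)."""
        def __init__(self, in_ch, out_ch, kernel=3, padding=1):
            super().__init__()
            self.wr = nn.Conv2d(in_ch, out_ch, kernel, padding=padding, bias=False)
            self.wi = nn.Conv2d(in_ch, out_ch, kernel, padding=padding, bias=False)

        def forward(self, xr, xi):
            yr = self.wr(xr) - self.wi(xi)   # real part of the product
            yi = self.wr(xi) + self.wi(xr)   # imaginary part of the product
            return yr, yi

    xr = xi = torch.randn(1, 1, 64, 64)      # real/imag spectrogram patches
    yr, yi = ComplexConv2d(1, 8)(xr, xi)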

15. SkipConvNet: Skip Convolutional Neural Network for Speech Dereverberation using Optimally Smoothed Spectral Mapping

Recent years have seen an exponential rise in the need for effective distant speech systems to improve the experience of human-machine interactions. These systems find their applications in many consumer devices today as personal assistants. Speech captured by these devices in confined spaces such as conference rooms, lobbies, cafeterias, etc. faces two major challenges: (a) reverberation: self-distortion due to reflections, which greatly reduces the intelligibility of the speech, and (b) background noise: speech from multiple overlapping speakers, music or other acoustical sounds picked up from the environment. These two challenges are well known and have been addressed with various signal processing and deep neural network (DNN) based speech enhancement strategies.

In earlier days, statistical signal enhancement methods played a crucial part in all speech processing pipelines [1, 2, 3]. However, over the past few years, many DNN based approaches for speech enhancement have shown promising results in enhancing reverberant speech. While DNN strategies for time-domain and frequency-domain processing [4, 5] were developed, most approaches prefer to operate on the short-time Fourier transform (STFT) of reverberant speech, enhancing the log-power spectrum (LPS) and reusing the unaltered noisy phase signal to restore a clean time-domain signal. As reverberation has its effects spread over time and frequency, sequence-to-sequence learning strategies like recurrent neural networks (RNNs) and long short-term memory (LSTM) [6] have been explored to capture and leverage the temporal correlations for speech dereverberation. Despite the strong capability of these networks to capture temporal correlations in speech, they fail to capture the spectral structure of formants encoded in the STFT. Therefore, researchers have moved to convolutional neural networks (CNNs), which learn dependencies from a group of neighboring time-frequency pixels. In a conventional CNN, the spectral structure learned using 2-D convolutions is compromised due to the presence of fully connected layers. For this reason, researchers used fully convolutional networks (FCNs), which substitute the fully connected layers with 1x1 convolutions in order to prevent the loss of spectral structure information. In the past couple of years, many FCN architectures like U-Net, ResNet, DenseNet, etc. were adopted from computer vision for various speech applications [7, 8, 9, 10, 11]. Since these networks have shown significant success, exploration of network architectures that further improve system performance in speech-related tasks has been an active part of research. In this study, we propose modifications to one such FCN architecture, U-Net, specifically designed for the speech dereverberation task. We also show that using pre-processed LPS for training such networks improves efficiency significantly.
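The common LPS pipeline described above can be sketched in a few lines. Here enhance_fn is a placeholder for any trained magnitude-domain model, not a specific network from the cited works.

    import numpy as np
    from scipy.signal import stft, istft

    def enhance_lps(noisy, fs, enhance_fn, nperseg=512):
        """Enhance the log-power spectrum and reuse the noisy phase."""
        f, t, X = stft(noisy, fs=fs, nperseg=nperseg)
        lps = np.log(np.abs(X) ** 2 + 1e-12)        # log-power spectrum
        lps_hat = enhance_fn(lps)                   # DNN-based mapping (stub)
        mag_hat = np.sqrt(np.exp(lps_hat))
        X_hat = mag_hat * np.exp(1j * np.angle(X))  # keep the unaltered phase
        _, y = istft(X_hat, fs=fs, nperseg=nperseg)
        return y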

16. SPEECH DEREVERBERATION WITH A REVERBERATION TIME SHORTENING TARGET

Severe late reverberation brings significant damage to the quality and intelligibility of speech [1] and will also cause performance degradation for back-end tasks such as automatic speech recognition (ASR) [2]. Normally, early reflections do not cause such negative effects [3]. Speech dereverberation, especially in the single-channel case, is still a challenging task.

Before deep neural networks (DNNs) were widely used, traditional dereverberation methods were based on statistical models and signal processing algorithms. The essential problem of dereverberation is the deconvolution between the speech signal and the room impulse response (RIR). Deconvolution can be accomplished by applying an inverse filter of the RIR to the reverberant speech, which is referred to as inverse filtering, e.g. [4], [5], [6]. For inverse filtering methods, an accurate RIR must first be blindly identified, which is very challenging, especially in the single-channel case [7]. Even if the RIR is known, due to its non-minimum phase characteristics in typical cases, directly computing its inverse filter will cause system instability or non-causality [8], [9]. Moreover, inverse filtering is very sensitive to noise. Alternatively, instead of resolving the inverse filter of the RIR, weighted prediction error (WPE) [10, 11] uses linear prediction to directly estimate the inverse filter from the reverberant signal, and applies the inverse filter to remove late reverberation. WPE has achieved remarkable performance, and is one of the most popular dereverberation methods. Another technical line of dereverberation is spectral subtraction, following the perspective of speech enhancement. Late reverberation can be considered as additive noise, which is assumed to be independent of the direct-path signal and early reflections [12, 13]. In [14], methods for estimating the power spectral density of late reverberation are summarized.

The application of DNNs has made great progress in solving speech dereverberation. The basic idea is to construct a nonlinear mapping function, based on supervised learning with a DNN, from the spectral features of reverberant speech to those of the target speech. The input feature could be the time-domain signal directly, or the short-time Fourier transform (STFT) coefficients or magnitude spectrum of reverberant speech. Correspondingly, the output features could be the time-domain signal, STFT coefficients, magnitude spectrum or magnitude mask of the target speech. The network architectures used for single-channel speech dereverberation have evolved considerably, from the initial fully connected networks [15, 16] to recurrent neural networks (RNNs) with long short-term memory (LSTM) for time series modeling [17, 18], to convolutional neural networks (CNNs) such as U-NET [19, 20] and temporal convolutional networks (TCNs) [21], and then to (self-)attention-based methods [22, 23].

In this work, we experimentally study how to adapt our two previously proposed speech denoising networks, i.e. the subband LSTM network [24] (referred to as SubNet) and FullSubNet [25], for speech dereverberation. Based on the cross-band filter model [26], the time-domain convolution between source speech and RIR can be decomposed into subband convolutions, and hence speech dereverberation can in principle be performed per subband based on deconvolution or inverse filtering. SubNet inputs the noisy spectra of one frequency and its neighbouring frequencies, and outputs/predicts the clean speech spectra of this frequency, which seems exactly suitable for speech dereverberation by mimicking the inverse filtering process. FullSubNet combines SubNet with a fullband network to also exploit the fullband spectral pattern, as the enhanced speech should have a correct spectral pattern across all frequencies. FullSubNet used for speech dereverberation can be seen as a combination of speech spectral regression and subband inverse filtering. Experiments show that SubNet and FullSubNet are indeed able to achieve outstanding dereverberation performance.
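A minimal sketch of how SubNet-style subband inputs can be assembled from a magnitude spectrogram; the neighbor count here is illustrative, not necessarily the value used in the cited papers.

    import numpy as np

    def subband_inputs(mag, n_neighbors=15):
        """For every frequency, stack its magnitude spectrum with the
        +/- n_neighbors adjacent frequencies across all frames. Each
        (frames, 2*n+1) slice is the sequence fed to the subband LSTM
        for that frequency.
        """
        T, F = mag.shape                          # (frames, frequencies)
        padded = np.pad(mag, ((0, 0), (n_neighbors, n_neighbors)),
                        mode="reflect")
        # Result: (F, T, 2*n_neighbors + 1), one training sequence per band.
        return np.stack([padded[:, f:f + 2 * n_neighbors + 1]
                         for f in range(F)])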

More importantly, this work proposes a new learning target based on reverberation time shortening (RTS). DNN-based methods in the literature normally take the direct-path speech as the learning target, which is actually a very strict target, as it requires removing all reverberation. As a result, they normally have a large prediction error, which may cause speech distortion. Since early reflections do not cause speech quality degradation, they are often preserved and only late reverberation is removed, such as in WPE [10, 11] and the spectral subtraction methods [14]. Preserving early reflections in the learning target reduces the prediction error of the network. However, preserving only early reflections also reduces the sound naturalness, as sounds in real life never have this type of reverberation form. In addition, no matter which training target is used, the direct path or early reflections, the network needs to learn a sudden truncation of reverberation, which is not well suited to network training and will cause signal distortion. The proposed learning target is a shortened version of the original RIR, with a small target T60, e.g. 0.15 s. Instead of suddenly truncating the RIR, this target still maintains the property of exponential decay, which preserves the sound naturalness and also eases network training. Experiments show that using the proposed learning target can more effectively suppress reverberation and signal distortion.
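One way to construct an RTS target, assuming the original decay rate (T60) of the RIR is known, is to multiply the tail by an extra exponential window so that the combined decay matches the target T60. The sketch below follows that idea with an illustrative direct-path margin; the exact windowing used in the paper may differ in detail.

    import numpy as np

    def shorten_rir(rir, fs, t60_orig, t60_target=0.15, t_direct_ms=5.0):
        """Reverberation-time-shortened target RIR (toy construction)."""
        n0 = int(np.argmax(np.abs(rir))) + int(t_direct_ms * fs / 1000.0)
        t = np.maximum(np.arange(len(rir)) - n0, 0) / fs
        # 60 dB energy decay over T60 corresponds to an amplitude envelope
        # 10**(-3*t/T60); the window supplies only the *extra* decay needed.
        window = 10.0 ** (-3.0 * (1.0 / t60_target - 1.0 / t60_orig) * t)
        return rir * window   # still exponentially decaying, no truncation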

17. TeCANet: Temporal-Contextual Attention Network for Environment-Aware Speech Dereverberation

When speech signals are obtained in an enclosed space by one or more microphones positioned at a distance from the talker, the observed signal consists of a superposition of many delayed and attenuated copies of the speech signal due to multiple reflections from the surrounding walls, ceilings, and floors [1]. As a result, the intelligibility and quality of speech are degraded, especially when the reverberation effects are severe [2]. The performance of automatic speech recognition (ASR) [3], hearing aids and source localization is also severely affected. Therefore, reducing the effect of reverberation is beneficial for speech applications.

To deal with reverberation, many dereverberation techniques have been developed in the past decades, such as estimating an inverse filter of the room impulse response (RIR) [4, 5], and separating speech and reverberation via homomorphic transformation [6, 7]. In recent years, deep neural networks (DNNs) [8] have been utilized in speech enhancement [9] and source separation [10], and substantially outperform conventional methods. For speech dereverberation, Han et al. [11] first proposed to learn a spectral mapping from reverberant speech to anechoic speech with a DNN. Considering the importance of reverberation-dependent parameters in supervised training, Wu et al. [12] developed a reverberation-time-aware DNN approach to suppress reverberation, which investigated the effect of different contexts over a wide range of reverberation times (RT60s). However, the RT60-dependent context is manually chosen and an RT60 estimator is needed during inference, which is quite impractical in speech applications. In order to capture long-term contextual information, Santos and Falk [13] proposed a context-aware recurrent neural network (RNN) method, and Zhao et al. [14] proposed a dereverberation algorithm using temporal convolutional networks. Compared with speech denoising [15, 16], which often benefits from incorporating long-range contexts, the correlations between context frames are even more crucial for speech dereverberation. In a strong reverberant environment, the correlations between adjacent frames tend to be strong and a long context is needed to capture adequate contextual information. On the other hand, a long context may introduce redundant information in a weak reverberant environment where the correlations between neighboring frames are weak. In addition, as shown in Figure 1, different frequency bands attenuate differently in a real RIR, where the attenuation of high frequency bands is faster than that of low frequency bands [1]. Although several approaches have applied self attention networks [14, 17, 18] to explore the relevance among features at different time steps and frequency bands, they ignored the physical relationships between the contextual information and reverberant environments, as well as the different attenuation across frequency bands, which could be exploited to realize the full potential of neural networks.

In this paper, we propose a novel attention-based approach for speech dereverberation, which can adaptively attend to contextual information by perceiving the reverberant environment. More specifically, a temporal attention module is first applied to the input features (i.e. the log-power spectrum of reverberant speech) to generate dynamic representations according to the correlations between frames, and a dereverberation network is then employed to learn the nonlinear mapping from the representations to the log-power spectrum of anechoic speech. We use a small-footprint DNN model as the dereverberation network due to its suitability for practical applications. Inspired by the distinct attenuation across frequency bands in the RIR, two types of approaches are proposed to obtain the attention weights, including a FullBand based Temporal Attention approach (FTA) and a SubBand based Temporal Attention approach (STA). The whole system is trained with the joint goals of dereverberation and RT60 estimation, so that it is more aware of the reverberant conditions. Our experiments are conducted on the public AISHELL-1 corpus [19], and the results show that our proposed method significantly outperforms our previous reverberation-time-aware methods [12] and generalizes well to real RIRs.
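A minimal full-band variant of the temporal attention idea: per-frame weights over a fixed window of past context frames are derived from feature correlations and used to form a weighted context representation. The window length is illustrative, and the paper's FTA/STA modules additionally use learned projections and subband grouping.

    import torch
    import torch.nn.functional as F

    def fullband_temporal_attention(lps, context=10):
        """lps: (frames, freq) log-power spectrum; returns (frames, freq)."""
        T, Fdim = lps.shape
        padded = F.pad(lps, (0, 0, context, 0))        # pad past frames
        ctx = padded.unfold(0, context + 1, 1)         # (T, Fdim, context+1)
        q = lps.unsqueeze(1)                           # (T, 1, Fdim)
        # Correlation-based scores between each frame and its context window.
        scores = torch.bmm(q, ctx) / Fdim ** 0.5       # (T, 1, context+1)
        w = torch.softmax(scores, dim=-1)
        out = torch.bmm(ctx, w.transpose(1, 2))        # weighted context
        return out.squeeze(-1)                         # (T, Fdim)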

18. Time-Frequency Masking in the Complex Domain for Speech Dereverberation and Denoising

ROOM acoustics affect the speech signal transmitted inside a room. When someone is having a conversation, they hear not only the sound that directly reaches their ears, but also reflections off the walls, ceiling and furniture. These reflections, termed reverberation, are altered versions of the original speech. In fact, reverberant speech consists of three components: the direct sound, early reflections and late reflections. The direct sound is the anechoic part corresponding to the first wavefront, early reflections typically arrive up to 50 ms after the direct sound, and late reflections arrive anytime thereafter.

Reverberation is problematic because the reflections cause smearing across time and frequency, which interferes with the direct sound. This is particularly challenging for hearing-impaired listeners, since the smearing affects their ability to recognize speech [2], [28]. Additionally, the performance of speech processing applications is degraded in reverberant environments, where reverberation causes automatic speech recognition (ASR) [22] and speaker identification systems [45] to become less accurate. The problem is worsened when background noise is present. Roman and Woodruff [34] show that reverberation combined with additive noise can be detrimental to the speech intelligibility of normal hearing listeners. A solution for removing reverberation and noise would be beneficial for a variety of speech processing tasks.

Many approaches have been developed to remove reverberation. Delcroix et al. use a weighted prediction error (WPE) algorithm and beamforming to remove room reverberation [6]. Reverberant speech corresponds to convolving a room impulse response (RIR) with anechoic speech (i.e. the direct sound). WPE is an unsupervised approach that operates in the complex time-frequency (T-F) domain and uses linear prediction to shorten the RIR, which in effect removes late reverberation [44]. Although WPE helps with dereverberation, it does not address the noise that is typically present in real situations. Inverse filtering is another technique for dereverberation. Inverse filters attempt to undo the effects of the RIR, since the convolution of the inverse filter with the reverberant signal results in anechoic speech. Inverse filters, however, cannot be fully realized, since the inverse of the RIR is unstable due to the RIR's non-minimum-phase nature [29]. Miyoshi and Kaneda [26] address the realizability of the inverse filter by utilizing multiple finite impulse response (FIR) filters. In [21] and [35], the T-F magnitude response of the RIR is estimated. Another approach uses the RIR magnitude response and nonnegative matrix factorization (NMF) to remove reverberation [27]. A two-stage algorithm for enhancing reverberant speech is described by Wu and Wang [43], where the first stage estimates an inverse filter and the second stage uses spectral subtraction to minimize long-term reverberation. A monaural pitch-based method that estimates an inverse filter [33] has also been investigated. It should also be noted that inverse filtering is fundamentally sensitive to RIRs, which strongly limits the robustness of this approach [20], [32].

More recent studies perform dereverberation in a supervised manner. In [20], Jin and Wang use a multi-layer perceptron (MLP) to learn a mapping from pitch-based features to grouping cues that encode the posterior probability of a T-F unit being speech dominant given the reverberant observation. The mapping results in a binary mask that is used to retain the speech-dominant units. Evaluations show that this system generalizes well in various reverberant environments. Jiang et al. [19] use deep neural networks (DNNs) to estimate the ideal binary mask (IBM), where binaural and monaural features are used to train a DNN. Weninger et al. [40] use deep bidirectional long short-term memory (LSTM) recurrent neural networks (RNNs) to dereverberate features that are input to an ASR system. Very recently, Han et al. [13] learn a spectral mapping from the log-magnitude spectra of noisy and reverberant speech to the log-magnitude spectra of clean speech using a DNN. Although each of these approaches produces improvements in various conditions, their performance is limited since they only enhance the magnitude response, and use the reverberant and noisy phase during signal reconstruction. As a result, the quality of separated speech is not good when interference is strong, and there is a strong need to produce speech estimates with high quality in reverberant and noisy environments.

When dealing with background noise, we recently found that performing T-F masking in the complex domain is very beneficial [42]. This approach jointly enhances the magnitude and phase response of noisy speech by estimating the complex ideal ratio mask (cIRM) in the real and imaginary domains. The performance of complex domain processing is not bounded since a full (magnitude and phase) reconstruction of speech is possible in the ideal case. Results show that the estimated cIRM substantially outperforms directly estimating speech in the time domain, traditional ideal ratio mask (IRM) estimation in the magnitude domain, and other related methods. Furthermore, cIRM estimation is shown to outperform methods that separately enhance the magnitude and phase of noisy speech. More details about different phase enhancement techniques can be found in [11].
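Following the standard definition from the complex-masking literature, the cIRM is the complex mask M such that S = M * Y in the STFT domain; its real and imaginary components can be computed as below (eps is a small constant added for numerical stability).

    import numpy as np

    def complex_ideal_ratio_mask(Y, S, eps=1e-12):
        """cIRM components from the mixture Y and the desired source S.

        Y : complex STFT of reverberant (and noisy) speech
        S : complex STFT of the direct sound (the desired output)
        """
        denom = Y.real ** 2 + Y.imag ** 2 + eps
        Mr = (Y.real * S.real + Y.imag * S.imag) / denom
        Mi = (Y.real * S.imag - Y.imag * S.real) / denom
        return Mr, Mi          # applying it: S_hat = (Mr + 1j * Mi) * Y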

Complex ratio masking, however, has not been investigated in adverse conditions with both room reverberation and background noise. In this paper, we propose to use DNNs to learn a mapping from reverberant (and noisy) speech to the cIRM. We extend the definition of the cIRM to deal with reverberant (and noisy) spectra, where the desired output is the spectra of the direct sound source. Unlike previous approaches, applying the cIRM enables the complete reconstruction of the clean and anechoic speech, since it jointly enhances the magnitude and phase. To our knowledge, this is the first supervised separation study that addresses dereverberation and denoising in the complex domain. A preliminary version of this work is published in [41].

19. UTTERANCE WEIGHTED MULTI-DILATION TEMPORAL CONVOLUTIONAL NETWORKS FOR MONAURAL SPEECH DEREVERBERATION

Speech dereverberation remains an important task for robust speech processing [1–3]. Far-field speech signals, such as those used for automatic meeting transcription and digital assistants, normally require preprocessing to remove the detrimental effects of interference in the signal [4–6]. A number of methods have been proposed for speech dereverberation for both single-channel and multichannel models [7]. Recent advances in speech dereverberation performance in a number of domains have been driven by deep neural network (DNN) models [8–12].

Convolutional neural network models are commonly used for sequence modelling in speech dereverberation tasks [13–15]. One such fully convolutional model, known as the TCN, has been proposed for a number of speech enhancement tasks [16–18]. TCNs are capable of monaural speech dereverberation as well as more complex tasks such as joint speech dereverberation and speech separation [17]. The best performing TCN models for speech dereverberation tasks typically have a larger receptive field for data with higher reverberation times (T60) and a smaller receptive field for data with small T60s [19], which forms the motivation for this paper.

In this work, a novel TCN architecture is proposed which is able to focus on specific temporal context within its receptive field. This is achieved by using an additional depthwise convolution kernel with a small dilation factor in the depthwise-separable convolution. Inspired by work on dynamic convolutional networks, an attention network is used to select how to weight each of the depthwise kernels [20, 21].
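A minimal sketch of such a weighted multi-dilation depthwise-separable convolution: an utterance-level attention value mixes a small-dilation and a large-dilation depthwise kernel before the pointwise convolution. Channel sizes, dilations, and the attention network are illustrative assumptions rather than the exact configuration proposed here.

    import torch
    import torch.nn as nn

    class WeightedMultiDilationDSConv(nn.Module):
        """Depthwise-separable conv with two dilations mixed by attention."""
        def __init__(self, channels=256, kernel=3, d_small=1, d_large=8):
            super().__init__()
            self.dw_small = nn.Conv1d(channels, channels, kernel,
                                      groups=channels, dilation=d_small,
                                      padding=(kernel - 1) * d_small // 2)
            self.dw_large = nn.Conv1d(channels, channels, kernel,
                                      groups=channels, dilation=d_large,
                                      padding=(kernel - 1) * d_large // 2)
            self.attn = nn.Sequential(nn.Linear(channels, 1), nn.Sigmoid())
            self.pw = nn.Conv1d(channels, channels, 1)  # pointwise conv

        def forward(self, x):                   # x: (batch, channels, frames)
            sigma = self.attn(x.mean(dim=2))    # utterance-level weight in [0, 1]
            sigma = sigma.unsqueeze(-1)         # (batch, 1, 1)
            h = sigma * self.dw_small(x) + (1 - sigma) * self.dw_large(x)
            return self.pw(h)

    y = WeightedMultiDilationDSConv()(torch.randn(1, 256, 100))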