
Enhancing Code-Switching Speech Recognition With Interactive Language Biases

Publisher: IEEE

Abstract:
Languages usually switch within a multilingual speech signal, especially in a bilingual society. This phenomenon is referred to as code-switching (CS), making automatic speech recognition (ASR) challenging under a multilingual scenario. We propose to improve CS-ASR by biasing the hybrid CTC/attention ASR model with multi-level language information comprising frame-and token-level language posteriors. The interaction between various resolutions of language biases is subsequently explored in this work. We conducted experiments on datasets from the ASRU 2019 code-switching challenge. Compared to the baseline, the proposed interactive language biases (ILB) method achieves higher performance and ablation studies highlight the effects of different language biases and their interactions. In addition, the results presented indicate that language bias implicitly enhances internal language modeling, leading to performance degradation after employing an external language model.
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 14-19 April 2024
Date Added to IEEE Xplore: 18 March 2024
Conference Location: Seoul, Republic of Korea

SECTION 1.

INTRODUCTION

Code-switching (CS) refers to the switching of languages within a spontaneous multilingual recording. Automatic speech recognition (ASR) faces challenges in a code-switching scenario due to the inter- and intra-sentence language varieties compared to its monolingual counterparts [1], [2], [3], [4], [5]. Although conventional ASR approaches can operate on code-switching speech in the same way as on monolingual data, early works identify languages before speech recognition or perform these processes jointly [6], [7], [8]. In contrast, recent CS-ASR techniques tackle language confusion by incorporating language information into modules within the ASR model.

One such approach involves the use of a bi-encoder model built on the transformer architecture [9], [10], where the modeling of English and Mandarin is decoupled by two encoders pre-trained independently on each language. Since the dual-encoder approach has been shown to be language-discriminative, CS-ASR approaches that adopt similar architectures were subsequently proposed [11], [12], [13]. Apart from dual encoders, a language-specific attention mechanism has also been proposed to reduce the confusion caused by code-switching contexts [14], [15]. This attention mechanism is employed within the transformer decoders and processes monolingual token embeddings which are separated from code-switching token sequences. In addition, a conditional factorization method factorizes CS-ASR into two monolingual recognitions before composing the recognized monolingual segments into a single bilingual sequence which may or may not be code-switched [16].

Although existing approaches mitigate the language confusion in CS-ASR, they are generally confined to a single module within the CS-ASR model. Since language-aware modules have been shown to be effective, it is natural to consider incorporating language information into all modules to further enhance the performance of existing approaches. In addition, these approaches utilize language information either at the frame level (dual-encoder methods) or at the token level (diarization or transformer-decoder-based approaches) [15], [17], [18]. Since the ASR process aims to align acoustic frames to text (e.g., characters, words), it is desirable to associate frame- and token-level language information and utilize them jointly for CS-ASR.

Inspired by the success of incorporating language information [17], we propose to enhance language-aware CS-ASR using interactive language biases (ILB). In particular, the proposed method comprises two contributions. Firstly, we bias the connectionist temporal classification (CTC), encoder, and decoder modules jointly within a hybrid CTC/attention CS-ASR model with language posteriors. It is useful to note that the language information transitions from frames to tokens (i.e., from the encoder to the CTC and decoder) intrinsically. As opposed to existing models, our method utilizes the interaction between frame- and token-level language information, resulting in an integrated and language-discriminative model. In addition, the proposed architecture allows the research community to gain insight into how language biases influence a CS-ASR model beyond improving performance. Experimental results suggest that the CS-ASR model is capable of developing a robust internal language model after learning from language information.

SECTION 2.

METHODOLOGY

2.1. Language posterior bias

The language posterior bias approach [17] has been developed on the hybrid CTC/attention ASR model, which comprises an encoder module, a decoder module, and a CTC module [19], [20]. These encoder and decoder modules consist of conformer encoder layers and transformer decoder layers [2], [9], respectively.

Consider a speech signal with its acoustic features X = (x_t ∈ ℝ^F | t = 1,…,T) and token sequence W = (w_n ∈ V | n = 1,…,N), where V is a vocabulary of size V, and T and N are the lengths of the acoustic features and the token sequence, respectively. The encoder generates output H = (h_t ∈ ℝ^D | t = 1,…,T_1) from X, which is subsequently fed into the decoder and CTC modules. Tokens are first embedded into W = (w_n ∈ ℝ^D | n = 1,…,N) before being fed into the decoder module along with H. The ASR model is optimized jointly with a language diarization (LD) decoder, where the LD decoder computes a V_ld-dimensional token-level language posterior bias p(l_{n−1} | w_{1:n−1}, X). Here, V_ld is the language vocabulary size and l_{n−1} is the language index for the n-th token. The token embedding w_{n−1} is then biased by its language posterior. The ASR decoder output is subsequently computed via

\begin{align}
\mathbf{H} &= \mathrm{Encoder}(X), \tag{1} \\
\tilde{\mathbf{w}}_{n-1} &= \mathrm{Concat}\big(\mathbf{w}_{n-1},\, p(l_{n-1} \mid w_{1:n-1}, X)\big), \tag{2} \\
p(w_n \mid w_{1:n-1}, X) &= \mathrm{Decoder}\big(\tilde{\mathbf{w}}_{1:n-1}, \mathbf{H}\big), \tag{3}
\end{align}

where Concat(·) denotes the concatenation operation. The matrix W̃ = (w̃_n ∈ ℝ^(D+V_ld) | n = 1,…,N) consists of the input token embeddings of the ASR decoder, which are subsequently projected back to D dimensions by a linear layer. The decoding process is similar to (3), but with the input token embeddings W of the ASR decoder replaced by W̃. The integrated model is optimized via the multi-task objective function

\begin{equation}
\mathcal{L}_{\mathrm{joint}} = \alpha \mathcal{L}_{\mathrm{ctc}} + (1 - \alpha)\,\mathcal{L}_{\mathrm{att}} + \beta\,\mathcal{L}_{\mathrm{ld}}, \tag{4}
\end{equation}

where β is a multi-task learning parameter and ℒ_ld is a label-smoothed cross-entropy loss between the predicted and ground-truth language labels for the LD decoder.
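
To make (2)–(4) concrete, the following is a minimal PyTorch sketch of the token-level bias: a token embedding is concatenated with its token-level language posterior and projected back to D dimensions before entering the ASR decoder, and the three losses are combined as in (4). Module names, shapes, and default values are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn

class TokenLevelLanguageBias(nn.Module):
    # Sketch of Eq. (2): concatenate token embeddings with token-level language
    # posteriors from the LD decoder, then project back to the model dimension D
    # before the ASR decoder (Eq. (3)). Hypothetical module, not the released code.
    def __init__(self, d_model: int = 256, v_ld: int = 3):
        super().__init__()
        self.proj = nn.Linear(d_model + v_ld, d_model)

    def forward(self, token_emb: torch.Tensor, lang_posterior: torch.Tensor) -> torch.Tensor:
        # token_emb:      (batch, N, D)    -- embeddings w_{n-1}
        # lang_posterior: (batch, N, V_ld) -- p(l_{n-1} | w_{1:n-1}, X)
        biased = torch.cat([token_emb, lang_posterior], dim=-1)  # Eq. (2)
        return self.proj(biased)                                  # back to D dimensions

def joint_loss(l_ctc, l_att, l_ld, alpha=0.3, beta=0.8):
    # Multi-task objective of Eq. (4); alpha and beta follow the values reported in Section 3.1.
    return alpha * l_ctc + (1.0 - alpha) * l_att + beta * l_ld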

2.2. Interactive language biases

We propose to extend the language posterior bias method to frame-level language information. We note that frame-level language identification (LID) is undesirable since the LID performance generally degrades with shorter speech [21], [22], [23]. However, since acoustic frames are tightly associated with tokens in ASR, frame-level language identification over H may benefit from token-level language diarization in (2). The frame-level language posteriors, therefore, enhance the hidden output H to achieve high language discrimination before being fed into the ASR and LD decoders. Consequently, frame- and token-level language posteriors interact and jointly improve the model performance in CS-ASR.

Fig. 1. The hybrid CTC/attention model with interactive language biases.

With reference to Fig. 1, the frame-level language bias is achieved through an LID layer before being concatenated with the hidden output. In particular, the biased hidden output H̃ is computed via

\begin{equation}
\tilde{\mathbf{h}}_t = \mathrm{Concat}\big(\mathbf{h}_t,\, p(l_t \mid \mathbf{h}_t)\big), \tag{5}
\end{equation}

which subsequently replaces H in (3) to facilitate the interaction between the language information of frames and tokens. In addition, H̃ is also employed to develop a language-aware CTC module. The ASR decoder output is next obtained via

\begin{equation}
p(w_n \mid w_{1:n-1}, X) = \mathrm{Decoder}\big(\tilde{\mathbf{w}}_{1:n-1}, \tilde{\mathbf{H}}\big). \tag{6}
\end{equation}

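As a rough sketch of the frame-level bias in (5), the module below applies a softmax LID layer to every encoder frame and concatenates the resulting posterior with the hidden output; the biased frames H̃ would then feed the CTC module and both decoders. The layer structure shown is an assumption for illustration only.

import torch
import torch.nn as nn

class FrameLevelLanguageBias(nn.Module):
    # Sketch of Eq. (5): a frame-level LID layer produces p(l_t | h_t), which is
    # concatenated with h_t to form the biased hidden output. Hypothetical module.
    def __init__(self, d_model: int = 256, v_ld: int = 3):
        super().__init__()
        self.lid = nn.Linear(d_model, v_ld)  # unsupervised frame-level LID layer

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, T', D) encoder output H
        lang_posterior = torch.softmax(self.lid(h), dim=-1)  # p(l_t | h_t)
        return torch.cat([h, lang_posterior], dim=-1)        # biased output of size (batch, T', D + V_ld)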

During training, the frame-level LID is optimized in an unsupervised manner similar to [10] (i.e., frame-level language annotations are not provided during training). However, frames in H are trained to be aligned with their corresponding token-level language labels within the language diarization decoder. An assumption made here is that an accurate frame-to-token alignment enriches the unsupervised LID process with supervised information through backpropagation. Optimization of the model is achieved similarly to that of (4).

During inference, the frame- and token-level language posteriors are computed before biasing the hidden output and the ASR decoding, respectively. The decoding process is similar to that presented in [19], which is defined to maximize the linear combination of the logarithmic CTC and attention objectives, i.e.,

\begin{equation}
\hat{W} = \arg\max_{W}\big\{\alpha \log p_{\mathrm{ctc}}(W \mid X) + (1 - \alpha)\log p_{\mathrm{att}}(W \mid X)\big\}. \tag{7}
\end{equation}
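
The hypothesis score in (7) is a weighted sum of the two log-probabilities; the toy helper below illustrates how candidate hypotheses could be ranked with α = 0.4, assuming per-hypothesis CTC and attention scores are already available (this is not ESPnet's actual beam-search code).

def joint_score(log_p_ctc: float, log_p_att: float, alpha: float = 0.4) -> float:
    # Linear combination of logarithmic CTC and attention scores, as in Eq. (7).
    return alpha * log_p_ctc + (1.0 - alpha) * log_p_att

# Toy usage: rank a (placeholder) n-best list and keep the best hypothesis.
hypotheses = [("hyp_a", -12.3, -10.1), ("hyp_b", -11.8, -10.9)]
best = max(hypotheses, key=lambda h: joint_score(h[1], h[2]))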

SECTION 3.

DATASET, EXPERIMENTS, AND RESULTS

3.1. Dataset and experiment setup

All experiments are conducted on datasets from the ASRU 2019 Mandarin-English code-switching speech recognition challenge [24]. The challenge comprises four datasets: a 500-hour Mandarin-only training set, a 200-hour intra-sentence English-Mandarin code-switching training set, a 40-hour intra-sentence English-Mandarin code-switching development set, and a 20-hour intra-sentence English-Mandarin code-switching test set. We employed ESPnet to train all models on the 200-hour CS training set; the models are validated on the development set and evaluated on the test set [25].

SpecAugment is applied to augment the training data [26]. Words are transformed into a total of V = 6,923 tokens, which include 3,000 English byte-pair encoding (BPE) tokens, 3,920 Mandarin characters, and three special tokens for unk, blank, and sos/eos. All tokens are mapped to language labels to build V_ld, which comprises e for English BPEs, m for Mandarin characters, and sos/eos. The language labels in V_ld are used as the LD outputs. We extracted F = 83-dimensional features, comprising 80-dimensional log-fbanks and 3-dimensional pitch, for each speech sample before applying global mean and variance normalization.
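
The LD targets described above can be derived by mapping every ASR token to a language label (e for English BPE units, m for Mandarin characters, and sos/eos); a heuristic sketch of such a mapping is shown below, where the handling of special tokens is an assumption.

def token_to_language_label(token: str) -> str:
    # Map an ASR token to its LD label: 'e' for English BPE units,
    # 'm' for Mandarin characters; the special-token handling is assumed.
    if token == "<sos/eos>":
        return "sos/eos"
    if any("\u4e00" <= ch <= "\u9fff" for ch in token):  # CJK Unified Ideographs
        return "m"
    return "e"

print([token_to_language_label(t) for t in ["▁hello", "你", "好", "<sos/eos>"]])
# ['e', 'm', 'm', 'sos/eos']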

We chose a hybrid CTC/attention ASR model comprising twelve conformer encoder layers and six transformer decoder layers as the baseline model [2], [9], [27]. In addition, we adopted the multi-task learning model and the language posterior bias approach as our benchmarks [17]. All self-attention encoder and decoder layers have four attention heads with input and output dimensions D = 256, and the inner layer of the position-wise feed-forward network has 2048 dimensions. During training, we set the parameters α = 0.3 and β = 0.8 in (4), and a label smoothing factor of 0.1 is used for all cross-entropy losses. The ten best models during validation are averaged for inference. All models are trained on two GeForce RTX 3090 GPUs; the baseline was trained for seventy epochs, while the other models were trained for eighty epochs due to their larger number of parameters.
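
Averaging the ten best validation checkpoints can be sketched as an element-wise mean over their saved parameters, as below; file paths and the assumption that each file stores a plain state dict are placeholders, not the actual training pipeline.

import torch

def average_checkpoints(paths):
    # Element-wise average of model parameters over the selected checkpoints.
    avg = None
    for p in paths:
        state = torch.load(p, map_location="cpu")  # assumed to be a plain state dict
        if avg is None:
            avg = {k: v.clone().float() for k, v in state.items()}
        else:
            for k in avg:
                avg[k] += state[k].float()
    return {k: v / len(paths) for k, v in avg.items()}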

During inference, we set α = 0.4 in (7). Ten-best beam search is used before selecting the best hypothesis. The language model (LM) used in this paper is a sixteen-layer transformer model with each attention layer comprising eight heads. The proposed systems are evaluated using the mix error rate (MER), which comprises the word error rate (WER) for English and the character error rate (CER) for Mandarin.
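
MER treats English words and Mandarin characters as the error units. The sketch below computes it with a plain Levenshtein distance; the tokenization rule is an assumption and scoring details (e.g., text normalization) are omitted.

import re

def mixed_units(text: str):
    # Split a transcript into MER units: single Mandarin characters and English words.
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+", text)

def edit_distance(ref, hyp):
    # Standard Levenshtein distance between two unit sequences (single-row DP).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def mer(ref: str, hyp: str) -> float:
    r, h = mixed_units(ref), mixed_units(hyp)
    return edit_distance(r, h) / max(len(r), 1)

print(mer("我 想 listen to music", "我 要 listen music"))  # 0.4: one substitution + one deletion over five reference units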

Table 1. Performance comparison of models utilizing various-level language information without using an external language model, evaluated by MER (%)

3.2. Baseline and single language-biased models

The results of the benchmark models are shown in Table 1 as systems 1.0, 1.1, and 1.2. Compared to the vanilla hybrid CTC/attention CS-ASR model, incorporating an auxiliary language diarization task and employing the token-level LPB proposed in [17] lead to higher performance. These results indicate that incorporating language information benefits the CS-ASR process, which is consistent with the observation presented in [17]. However, the token-level LPB approach shows no performance improvement over the multi-task optimization since Mandarin is the primary language in this dataset and languages do not switch frequently.

The above data characteristics also result in performance degradation for model configuration 1.3 when the frame-level LID is not sufficiently accurate. To prevent the CTC outputs from interacting with the unsupervised frame-level LID, the input of the CTC is set to H while the input of the ASR decoder is set to H̃ in model configuration 1.3. As mentioned in Section 2.2, frame-level LID is generally less accurate than token-level LID. These incorrect language posteriors may increase language confusion when transmitted into the ASR decoder module.

3.3. Results of models with interactive language biases

We next investigate how the interaction between frame- and token-level language information improves the model performance using systems 1.4, 1.5, and 1.6. In model configuration 1.4, as opposed to model configuration 1.3, the input of the CTC is set to H̃ so as to bias the CTC module with language information. The CTC performs frame-level classification before computing the optimal alignment, where the language biases are infused with the acoustic features and combined intrinsically when generating tokens. Therefore, the performance improvement shown in Table 1 when comparing model configuration 1.4 with 1.3 indicates that language-biased frames can perform better than vanilla frames. This underpins the efficacy of the frame-level language bias when used for the CTC.

Table 2. Performance comparison of models using an external language model during inference, evaluated by MER (%), where "Reduction" denotes the absolute MER reduction compared to the no-LM counterparts

Model configuration 1.5 employs frame- and token-level language biases jointly but excludes the CTC module from being biased. Model configuration 1.5 shows significantly higher performance than the single-language-biased models 1.2 and 1.3. This implies that the token-level language bias compensates for the inaccurate frame-level LID, especially given that model configuration 1.3 degrades the performance of model 1.1, and demonstrates that the interactive language biases are effective for CS-ASR.

Model configuration 1.6 further combines the two language biases with the CTC module and achieves the highest performance among all model configurations, with a 7.8% relative improvement compared to the baseline model. It is not surprising that this configuration achieves higher performance than model configurations 1.4 and 1.5, since biasing the CTC with language information improves the performance over the encoder LPB approach. In addition, the above implies that enriching all modules within a CS-ASR model with language information obtains a higher gain than enriching a single module, which is consistent with our proposition in Section 1.

3.4. Results of external language modeling

Since end-to-end ASR approaches perform language modeling internally, we explore whether the internal LM is stronger than an external LM trained on the same corpora.

We present the results with respect to external language models in Table 2. The vanilla hybrid CTC/attention model shows higher performance after being integrated with the external LM during inference. However, the results show that all language-aware CS-ASR models suffer from performance degradation compared to the baseline model. This implies that a CS-ASR model biased by language information could develop a more robust internal language model than an external model trained on the same text data. Since training an external language model can be time-consuming [28], robust internal language modeling can thus be regarded as an advantage of the proposed interactive language biases approach.

Fig. 2. Comparison between attention matrices with respect to the frame-to-token alignment within the language diarization decoder after employing token-level LPB (above) and interactive language biases (below).

SECTION 4.

DISCUSSION

Although the language diarization decoder adopted in this work does not generate timestamps for language changes, the frame-to-language alignment can be obtained from the attention matrices within the LD decoder as shown in Fig. 2.

The token-level LPB and interactive language biases (model configurations 1.2 and 1.6) are selected to compare a single language bias with interactive language biases. As illustrated in Fig. 2, the attention mechanism identifies language changes in the first and second heads, and captures sequential information in the third and fourth heads. Compared to the token-level LPB, the attention matrices of our proposed interactive language biases approach exhibit clearer vertical language boundaries and smoother diagonal frame-to-token alignment. This indicates that the proposed approach improves not only ASR but also language diarization performance, which is consistent with our assumption in Section 2.2.
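
The alignments in Fig. 2 can be reproduced by plotting the cross-attention weights of the LD decoder; the snippet below assumes the per-head attention matrices have already been extracted (e.g., via a forward hook) into an array of shape (heads, tokens, frames) and only handles the plotting.

import matplotlib.pyplot as plt

def plot_ld_attention(attn, heads=(0, 1, 2, 3)):
    # attn: array of shape (num_heads, N_tokens, T_frames) taken from the LD decoder's
    # cross-attention; each panel shows one head's frame-to-token alignment.
    fig, axes = plt.subplots(1, len(heads), figsize=(4 * len(heads), 3))
    for ax, head in zip(axes, heads):
        ax.imshow(attn[head], aspect="auto", origin="lower", cmap="viridis")
        ax.set_title(f"head {head}")
        ax.set_xlabel("encoder frames")
        ax.set_ylabel("tokens")
    fig.tight_layout()
    plt.show()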

SECTION 5.

CONCLUSION

We proposed an interactive language biases approach to improve CS-ASR through the interaction between frame- and token-level language information. The experimental results indicate that the proposed approach outperforms the benchmark in CS-ASR. We next visualized the attention matrices within the LD decoder. The proposed interactive language biases achieve higher language diarization performance compared with the single token-level language bias, highlighting the efficacy of the proposed approach. In addition, the results show that a language-aware CS-ASR model can develop a robust internal LM, resulting in performance degradation when an external language model is used during inference.

References

1.
D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. IEEE Workshop Autom. Speech Recognit. Understanding, 2011.
2.
A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, et al., “Conformer: Convolution-augmented transformer for speech recognition,” in Proc. Interspeech, 2020.
3.
Z. Nan, T. Dang, V. Sethu, and B. Ahmed, “Variational connectionist temporal classification for order-preserving sequence modeling,” arXiv preprint arXiv:2309.11983, 2023.
4.
C. Chen, N. Hou, Y. Hu, S. Shirol, and E. S. Chng, “Noise-robust speech recognition with 10 minutes unparalleled in-domain data,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2022, pp. 4298–4302.
5.
Y. Hu, N. Hou, C. Chen, and E. S. Chng, “Interactive feature fusion for end-to-end noise-robust speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2022, pp. 6292–6296.
6.
N. T. Vu, D.-C. Lyu, J. Weiner, D. Telaar, T. Schlippe, F. Blaicher, E.-S. Chng, T. Schultz, and H. Li, “A first speech recognition system for Mandarin-English code-switch conversational speech,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2012, pp. 4889–4892.
7.
Z. Zeng, Y. Khassanov, V. T. Pham, H. Xu, E. S. Chng, and H. Li, “On the end-to-end solution to Mandarin-English code-switching speech recognition,” in Proc. Interspeech, 2019, pp. 2165–2169.
8.
H. Liu, L. P. G. Perera, X. Zhang, J. Dauwels, A. W. H. Khong, S. Khudanpur, and S. J. Styles, “End-to-end language diarization for bilingual code-switching speech,” in Proc. Interspeech, 2021, pp. 1489–1493.
9.
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
10.
Y. Lu, M. Huang, H. Li, J. Guo, and Y. Qian, “Bi-encoder transformer network for Mandarin-English code-switching speech recognition using mixture of experts,” in Proc. Interspeech, 2020, pp. 4766–4770.
11.
M. S. Mary N J, V. M. Shetty, and S. Umesh, “Investigation of methods to improve the recognition performance of Tamil-English code-switched data in transformer framework,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 7889–7893.
12.
S. Dalmia, Y. Liu, S. Ronanki, and K. Kirchhoff, “Transformer-transducers for code-switched speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2021, pp. 5859–5863.
13.
T. Song, Q. Xu, M. Ge, L. Wang, H. Shi, Y. Lv, Y. Lin, and J. Dang, “Language-specific characteristic assistance for code-switching speech recognition,” in Proc. Interspeech, 2022, pp. 3924–3928.
14.
L. Dong, S. Xu, and B. Xu, “Speech-transformer: A no-recurrence sequence-to-sequence model for speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 5884–5888.
15.
S. Zhang, J. Yi, Z. Tian, J. Tao, Y. T. Yeung, and L. Deng, “Reducing multilingual context confusion for end-to-end code-switching automatic speech recognition,” in Proc. Interspeech, 2022, pp. 3894–3898.
16.
B. Yan, C. Zhang, M. Yu, S.-X. Zhang, S. Dalmia, D. Berrebbi, C. Weng, S. Watanabe, and D. Yu, “Joint modeling of code-switched and monolingual ASR via conditional factorization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2022, pp. 6412–6416.
17.
H. Liu, H. Xu, L. P. Garcia, A. W. H. Khong, Y. He, and S. Khudanpur, “Reducing language confusion for code-switching speech recognition with token-level language diarization,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2023, pp. 1–5.
18.
Y. Jiang, Z. Chen, R. Tao, L. Deng, Y. Qian, and H. Li, “Prompt-driven target speech diarization,” arXiv preprint arXiv:2310.14823, 2023.
19.
S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE J. Sel. Topics Signal Process., vol. 11, no. 8, pp. 1240–1253, 2017.
20.
A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proc. Int. Conf. Mach. Learn., 2006, pp. 369–376.
21.
H. Liu, L. P. G. Perera, A. W. H. Khong, S. J. Styles, and S. Khudanpur, “PHO-LID: A unified model incorporating acoustic-phonetic and phonotactic information for language identification,” in Proc. Interspeech, 2022, pp. 2233–2237.
22.
S. O. Sadjadi, T. Kheyrkhah, C. S. Greenberg, E. Singer, D. A. Reynolds, L. P. Mason, and J. Hernandez-Cordero, “Performance analysis of the 2017 NIST language recognition evaluation,” in Proc. Interspeech, 2018, pp. 1798–1802.
23.
L.-H. Tseng, Y.-K. Fu, H.-J. Chang, and H.-y. Lee, “Mandarin-English code-switching speech recognition with self-supervised speech representation models,” arXiv preprint arXiv:2110.03504, 2021.
24.
X. Shi, Q. Feng, and L. Xie, “The ASRU 2019 Mandarin-English code-switching speech recognition challenge: Open datasets, tracks, methods and results,” arXiv preprint arXiv:2007.05916, 2020.
25.
S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. Enrique Yalta Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Proc. Interspeech, 2018, pp. 2207–2211.
26.
D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, “SpecAugment: A simple data augmentation method for automatic speech recognition,” in Proc. Interspeech, 2019, pp. 2613–2617.
27.
S. Karita, N. E. Y. Soplin, S. Watanabe, M. Delcroix, A. Ogawa, and T. Nakatani, “Improving transformer-based end-to-end speech recognition with connectionist temporal classification and language model integration,” in Proc. Interspeech, 2019, pp. 1408–1412.
28.
C. Chen, Y. Hu, C.-H. H. Yang, S. M. Siniscalchi, P.-Y. Chen, and E. S. Chng, “Hyporadise: An open baseline for generative speech recognition with large language models,” arXiv preprint arXiv:2309.15701, 2023.