
In search of a robust facial expressions recognition model: A large-scale visual cross-corpus study

Elena Ryumina a, Denis Dresvyanskiy b,c, Alexey Karpov a
a St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS),
St. Petersburg 199178, Russia
b Ulm University, Ulm 89081, Germany
c ITMO University, St. Petersburg 191002, Russia

ARTICLE INFO

Article history

Received 3 November 2021
Revised 29 August 2022
Accepted 1 October 2022
Available online 7 October 2022
Communicated by Zidong Wang

Keywords:

Visual emotion recognition
Affective computing
Paralinguistic analysis
Cross-corpus analysis
Deep learning
End-to-end model

Abstract

Many researchers have been seeking a robust emotion recognition system for the last two decades. It would advance computer systems to a new level of interaction, providing much more natural feedback during human-computer interaction thanks to the analysis of the user's affective state. However, one of the key problems in this domain is a lack of generalization ability: we observe a dramatic degradation of model performance when a model is trained on one corpus and evaluated on another one. Although some studies have been done in this direction, the visual modality still remains under-investigated. Therefore, we introduce a visual cross-corpus study conducted with the utilization of eight corpora, which differ in recording conditions, participants' appearance characteristics, and complexity of data processing. We propose a visual-based end-to-end emotion recognition framework, which consists of a robust pre-trained backbone model and a temporal sub-system in order to model temporal dependencies across many video frames. In addition, a detailed analysis of the mistakes and advantages of the backbone model is provided, demonstrating its high generalization ability. Our results show that the backbone model has achieved state-of-the-art accuracy on the AffectNet dataset, outperforming all previously reported results. Moreover, the CNN-LSTM model has demonstrated a decent efficacy on dynamic visual datasets during cross-corpus experiments, achieving results comparable with the state of the art. In addition, we provide the backbone and CNN-LSTM models for future researchers: they can be accessed via GitHub.

© 2022 Elsevier B.V. All rights reserved.

1. Introduction

In recent decades, affective computing has become a new, fast-growing, and promising domain due to its high importance in human-computer interaction (HCI) systems. Requirements for current HCI systems have increased and now include not only recognition of the user's speech or face, but also analysis of the user's state, including its emotional component. Such information then serves to adjust the system response so that it fits the user in the best way. Possible applications are almost limitless: it is being utilized in robotics [1,2], marketing [3], entertainment [4], medicine [5,6], education [7] and many others. Reliable affect recognition in those areas is becoming one of the key features capable of advancing the quality of HCI systems to a new level.
Currently, there are two most common models for emotion attribution: the categorical model [8] (also well known as Ekman's model) and the time-continuous model [9] (also well known as Russell's circumplex model). The circumplex model has 3 dimensions, which are arousal, valence, and dominance, although the dominance axis is often omitted. Ekman's model divides the emotional space into 7 categorical states (6 salient emotions plus the Neutral state) and is exploited by researchers much more often, becoming a fundamental approach in emotion description. Because of the simplicity of the annotation, categorical datasets prevail over time-continuous ones, leading to more data available for training machine learning (ML) models, including deep learning (DL) models. Therefore, in this work, we focus on using categorical data, since DL models are well known for their hunger for tremendous amounts of data.
Today, DL models dominate over typical ML models in the affective computing domain. There are several reasons for that: (1) DL models do not need sophisticated feature engineering methods, because they are able to consume data as is; (2) DL models are almost infinitely scalable and therefore can build up performance (accuracy) if needed; (3) DL models preserve the knowledge of the learned domain, opening the possibility to use transfer learning; and (4) DL models can be constructed as end-to-end (E2E) models, omitting the necessity of breaking down the problem into several steps (feature engineering, training, and others).
However, even powerful DL models suffer from dataset biases [10]. This issue is known as the cross-corpus problem and lies in the model "sticking" to the specific corpus it was trained on. Applying such a model to other corpora, researchers observe a dramatic decrease in model performance due to the different conditions in which the data were acquired [10]. The current methodology in such a case is fine-tuning the model on new data, which is only a temporary solution, because it has to be repeated every time new data come up. Moreover, the necessity of fine-tuning basically represents the inability of the model to cover all possible variations of the emotions humans express [11], which is crucial for future systems.
In our work, we introduce a visual-based end-to-end emotion recognition (ER) framework able to identify emotions with high performance on different datasets. Aiming to train a data-unbiased model, we employed many data balancing and augmentation techniques to provide the model with data as varied as possible. Moreover, we have conducted extensive experiments with many different datasets, which are diverse in terms of lighting, rate of occlusions, noise, age, ethnicity, and head poses, analysing the strengths of the model and its efficiency.
To sum up, the main contributions of the article are:
  • We propose a flexible pipeline of the facial expressions recognition (FER) system consisting of the backbone FER model and several temporal FER models. Every component of the system can be substituted with other similar models: for instance, instead of the backbone model, researchers can use other feature extractors utilized in the computer vision domain.
  • We introduce an efficient feature extractor for the FER task (named the backbone model) demonstrating state-of-the-art performance on the AffectNet dataset. Moreover, we provide this model for further scientific usage and describe all fine-tuning steps done to obtain it.
  • We present a large-scale visual cross-corpus study leveraging the leave-one-corpus-out experiment protocol. During the experiment, several different temporal models have been examined, while the best one was chosen based on the models' performance and generalization ability.
  • We demonstrate the robustness and decent performance of the backbone model via analysis of its functioning on various complex frames from dynamic FER datasets.
The rest of the article is organized as follows: we analyze the current state of both visual emotion recognition and cross-corpus models in Section 2. Section 3 presents the developed end-to-end framework and describes the utilized data. Next, Section 4 provides the setup of the conducted experiments and the results obtained with the proposed framework. In Section 5 we discuss the results and analyse the features of the developed framework. Lastly, Section 6 summarizes the performed work and considers the directions of future research in cross-corpus affective computing.
2. Related Work

In this Section, we first observe the state-of-the-art visual emotion recognition systems and then analyse the progress made in eliminating the cross-corpus (data-biasing) problem.

Earlier, in the pre-deep-learning era, affective computing researchers were forced to engineer and exploit neat hand-crafted features such as Facial Action Units (FAUs) [12], Facial Landmarks, Histogram of Oriented Gradients (HOG) feature maps [13], and many others. The emotion recognition systems were highly dependent on the quality of the extracted features, while the machine learning algorithm was chosen according to the generated features in an expert way, based on the knowledge and experience of the developers. The extracted and selected hand-crafted features were more important than the machine learning techniques used. The situation was revolutionized in the last decade, approximately at the time when VGG [14] and ResNet [15] were introduced. Such models, fine-tuned on emotional datasets, were able to show comparable performance, while having the possibility to consume raw data (images) without the feature extraction phase.
Starting from 2015, numerous research works using DL models have been published. H. W. Ng et al. [16] utilized an ImageNet model fine-tuned on the FER [17] and EmotiW datasets. Using such a cascade fine-tuning process, they dramatically outperformed the challenge baseline in terms of accuracy. To reduce the confusing factors and, thus, the amount of data needed for training, G. Levi and T. Hassner [18] proposed a novel image transformation technique and applied it to the training of an ensemble of VGG [14] and GoogleNet [19] Convolutional Neural Networks (CNNs). Averaging the class scores of the ensemble members, the authors obtained a considerable improvement over the baseline results. S. A. Bargal et al. [20] used VGG-based and ResNet-based CNNs pre-trained on a combination of emotional datasets for feature extraction. Normalizing the features and extracting the statistics across all frames in the video file, they trained a Support Vector Machine (SVM) to predict one emotion for the whole video file, resulting in an increase of the recognition rate in comparison with the baseline performance.
There were also numerous studies devoted to bringing context consideration into DL facial emotion recognition [21-23]. Besides faces, such networks take into account the context in which the humans appear, providing complementary information for score or feature fusion. Combining the facial and context-based features with a simple NN or SVM, the authors showed a significant performance growth in the considered tasks.
Over the last few years, the emotion recognition research community has concentrated on developing multi-modal emotion recognition systems, processing different information channels independently and then aggregating them. In most cases, for the feature extraction from every data channel, researchers utilize deep neural networks (DNNs), usually CNNs. The information aggregation is done, however, by different techniques: Deep Belief Networks [24,25], the Attention Mechanism [26,27], Weighted Score Fusion [28], SVMs [29,30], etc.
We should note here that, although multi-modal systems normally show better results in comparison with uni-modal systems, their most efficient component is the visual sub-system [31]. Therefore, in this work, we have focused on obtaining an efficient unbiased visual emotion recognition system so that other researchers could use it in further experiments with visual or multi-modal frameworks.
It is also important to underline that the considered studies have been conducted without rigorous cross-corpus experiments, and therefore they are applicable only to the concrete datasets they were experimented on.
Obtaining a system that is not sensitive to changes in data conditions and distribution is a long-standing problem in almost every ML domain. In developing affective computing frameworks, it has become one of the key points, since humans express their emotions in an enormous number of ways, depending on ethnicity, culture, language, and even age and gender.
To the best of our knowledge, most of the studies in this direction are done for the audio modality, because of its computational ease of processing in comparison with the video channel. However, in this subsection we survey several relevant acoustic cross-corpus works to show the general trends and features in the cross-corpus direction.
H. Kaya and A. Karpov [32] proposed a cascaded normalization approach for minimization of the speaker- and corpus-related effects. Utilizing the Extreme Learning Machine as a classifier, they applied the proposed normalization to openSMILE audio features on 5 corpora, resulting in robust features and a significant improvement of the model performance in comparison with other baseline normalization techniques. In [33] the authors introduced an approach called Adversarial Discriminative Domain Generalization, which forces the deep encodings of two (or more) datasets to be as close as possible, teaching the model to generalize the representation of the emotion despite the differences in datasets. Conducting experiments on 3 datasets, they showed that such generalization increases model performance, while also decreasing the variance of the model.
H. Kaya et al. [34] investigated the application of a Long Short-Term Memory (LSTM) network to the set of frame-level acoustic Low-Level Descriptors (LLDs) on 3 different corpora. Combining predictions of the LSTM-based model with predictions of the Weighted Least-Squares Kernel classifier via weighted score-level fusion, they outperformed the challenge baseline systems on the development set. However, such an approach showed worse results on the test set, likely because of the class distribution mismatch among datasets. Nevertheless, in comparison with other exploited models, the LSTM-based classifier demonstrated a decent performance improvement due to its ability to model temporal dependencies.
In [35] the authors analyzed the generalization ability of DL models (CNN, LSTM, and CNN-LSTM), training the considered DL systems on 5 corpora combined altogether, and presenting the model evaluation on a development out-of-domain dataset. Utilizing t-Distributed Stochastic Neighbor Embedding (t-SNE) to visualize the learned data representations, the authors observed that increasing the variability of the data by combining corpora enhances the model performance on the development set. Moreover, the LSTM-only model was proven to be prone to overfitting, while the CNN was not, making the CNN-based approach more attractive for cross-corpus training. H. Meng et al. [36] proposed an E2E architecture, which consists of a dilated CNN and a bidirectional LSTM (B-LSTM) with an attention mechanism, taking advantage of both networks. Conducting experiments on two datasets, the authors stated that the proposed E2E DL framework demonstrated a high generalization ability and robustness to changes in data distribution caused by switching to an unseen corpus.
Regarding cross-corpus video-based studies, there are far fewer works that conducted cross-corpus experiments, due to the complexity of video processing. Moreover, some of the works presented below are not specifically directed at cross-corpus study, yet they have done experiments on several datasets in a cross-corpus style to show the model's generalization ability.
In [37], the authors introduced an E2E Inception-based DNN architecture, conducting comprehensive cross-corpus experiments on several datasets. Utilizing the proposed model, they outperformed state-of-the-art results, mostly obtained by exploiting ML classifiers and hand-crafted features. W. Xie et al. [38] proposed a feature sparseness-based regularization integrated into the loss function, outperforming the model trained with L2 regularization. The study was done on 4 corpora in a cross-corpus style, showing the superiority of the model in comparison with former state-of-the-art models. M. V. Zavarez et al. [39] fine-tuned the VGGFace [40] model, pre-trained on facial images, on several emotional datasets, following the leave-one-out cross-corpus experimental setup. They showed that CNN models pre-trained on domains related to emotion recognition substantially outperform randomly initialized ones, including in the cross-corpus experiments.
In [41], the authors proposed a CNN ensembling method with a modified architecture of each CNN, exploiting so-called maxout layers. Carrying out the experiments on three corpora, they concluded that the developed ensemble of CNNs is able to surpass the default CNN model if the data amount is sufficient. Z. Meng et al. [42] introduced a novel identity-aware CNN with a self-designed architecture. Training the proposed model with sophisticated identity-sensitive and expression-sensitive contrastive losses, they outperformed all baseline and most of the state-of-the-art models on 3 emotional datasets, following the cross-corpus protocol.
In [43], B. Hasani and M. H. Mahoor developed a facial emotion recognition framework consisting of a 3D Inception-based ResNet with an LSTM stacked on top of it. To emphasize the facial components instead of regions, the authors add Facial Landmarks to the input as complementary information. With the results obtained within the cross-corpus setup, their emotion recognition system outperforms the state-of-the-art systems on 3 out of 4 datasets.
In our former work [44], we conducted cross-corpus experiments with hand-crafted features (Facial Landmarks) to investigate their applicability instead of using E2E DL approaches. Utilizing an ensemble of classifiers, we showed that the classification accuracy highly depends on the sequence length and on the diversity of the dataset expressed by the number of different participants.
We should note that none of the aforementioned corpora (except FER2013 and those used in our former research [44]) are used in our study. Moreover, to make the final data as diverse as possible, we have selected datasets so that they have as few intersections as possible in terms of recording setup and conditions: lighting, subjects' movement, obstacles, age, ethnicity, and culture.

3. Materials and Methods

In this section, we describe the methodology of our work and data used.

3.1. Experimental Data

Emotional datasets are a key element in building a reliable emotion recognition system. They can contain one to several modalities, most often visual, acoustic, linguistic, or their combinations. However, one of the main sources of information in HCI nowadays is the video channel, and therefore, in this research we focus on Visual Emotional Datasets (VEDs), which can be divided into static or dynamic ones, depending on the type of presented images of facial expressions. Most VEDs are annotated in terms of the 6 basic Ekman's emotions (Happiness, Sadness, Surprise, Fear, Disgust, Angry) [8] plus the Neutral state, resulting in a 7-class task. There are also VEDs acquired with a continuous valence-arousal scheme; however, they are much less common because of the annotation ambiguity and problems with raters' agreement. The training of a reliable E2E method requires a tremendous amount of data with high variety, and therefore we have decided to focus on categorical VEDs, which are more widespread nowadays.
Looking from a different point of view, VEDs can be separated depending on the recording conditions: laboratory (imitation of facial expressions) and "in the wild" (natural non-acted facial expressions). The choice of the VEDs has a significant impact on the effectiveness of the emotion recognition systems, especially on DL models, which become more efficient with more data seen.
For the fairness of the study, we have selected 8 VEDs varied in recording conditions. They are publicly distributed and have been widely used for analysis, research, and experiments in the affective computing domain. An overview of the chosen data is presented in Table 1. Next, we describe each of the presented VEDs in detail.
To the best of our knowledge, RAMAS [45] is the sole VED with persons having a Slavic appearance and Russian speech. In total, the dataset contains 564 videos annotated by 21 experts (at least 5 experts per video). The specificity of this corpus is that the raters could mark different time intervals for the presence of one to several emotions. Moreover, sometimes the intervals of different experts could overlap, causing ambiguity for the overlapping regions in terms of the annotated emotion. The participants were not limited in the movements of the head and arms, therefore there are frames in which the face is completely covered by hands or is in a difficult position for visual emotion recognition. However, such frames are also labelled with one of the emotions, usually depending on the previous context. These factors make it difficult to work with RAMAS.
IEMOCAP [46] consists of 151 videos divided into chunks. Overall, 6 raters (at least 3 experts per video) were involved in the annotation process, assigning one to several emotions to the sections. The total number of sections is presented in Table 1. The dataset was acquired for recognition of 5 emotional states (Happiness, Angry, Sadness, Frustration, Neutral) and continuous emotional dimensions (Valence, Arousal, and Dominance). However, experts were free to assign other emotions (Disgust, Fear, Surprise) if they were present in the sections. In addition to the videos, the authors provide the Motion Capture (MOCAP) data, which represents information about the muscle movements of the face, head, and arms. Although the videos have a good resolution, one-third of the frame is occupied by the participant's interlocutor (an interviewer, who is not annotated), resulting in a much smaller effective resolution for the rated participant. In addition, the participant's head in many frames is quite far from the camera, making it difficult to extract deep features from such a small amount of pixels.
Unlike the two former datasets, which have 10 young participants each, CREMA-D [47] provides materials with a wide diversity in terms of ethnicity and age (from 20 to 74). Each video file has one of six labeled emotional states (the emotion Surprise was not considered). However, in addition to the emotional label, the authors provide the confidence level of the raters for every video. Such a feature allows us to select video files with a high confidence level, discarding the noisy data.
The specificity of RAVDESS [48] lies in containing both speech and melodic (sung) reproduction of emotions. The melodic reproduction of emotions makes the dataset suitable for non-pharmacological treatment in the rehabilitation of neurological and motor disorders. In addition to the 7 common emotional state classes, the authors have annotated the emotion Calm as well. All emotions were recorded in both normal and strong emotional intensity regimes.
The uniqueness of the SAVEE [49] dataset is that during the video recording, the authors showed the participants facial expressions and text prompts on a display. The main goal of this was to convey to the participants the emotion presented on screen in the most accurate way. In addition, to capture the key features of every facial expression, 60 blue markers were painted on the participants' forehead, eyebrows, cheeks, lips, and jaws.
While working with all the aforementioned datasets, we have noticed that the participants of the CREMA-D, RAVDESS, and SAVEE datasets were located at an equal distance from the camera, which simplified their preprocessing in comparison with RAMAS and IEMOCAP.
AffWild2 [52,53,50,54-57] was introduced within the Affective Behavior Analysis in-the-Wild (ABAW) competition. The dataset participants have different ages (from babies to elderly people) and ethnicities. Since the corpus was collected "in the wild", the video frames of faces have a wide range of head poses, lighting conditions, occlusions, and a variety of emotional expressions. It was annotated for the 6 basic Ekman's emotions plus the Neutral state, time-continuous valence-arousal dimensions, and Facial Action Units (FAUs) in a frame-by-frame way. Since the dataset was used for a competition, it is deliberately divided into train, development, and test sets by the authors; the test set is hidden for final model efficacy evaluation. Moreover, starting from 2020, three ABAW competitions have already been introduced, resulting in different versions of the AffWild2 dataset. We should clarify that we have used the original version presented in the ABAW-2020 competition. The complexity of processing AffWild2 lies in the videos themselves: some videos contain more than one person in the frame at the same time and therefore the faces may overlap, or, even worse, the main face may be completely lost from the frame, while the annotated label remains equal to the previous one. Such a problem challenges the model to consider the context, paying attention to previous frames.
The example frames of the participants from all observed VEDs are presented in Fig. 1.
FER2013 [17] is the only gray-scale corpus selected in this work. However, despite the gray-scale format and low resolution, the corpus is still actively exploited to date for training emotion recognition systems. FER2013 is divided into train, development, and test sets and is publicly available.
AffectNet [51] is the largest corpus of image-based categorical emotions, acquired in in-the-wild conditions. The authors collected the data using three search engines (Google, Bing, and Yahoo). AffectNet contains more than 1 million facial images with extracted facial landmarks, 450,000 of which have also been annotated in terms of 8 categorical emotions (Contempt is taken in addition to the 6 basic Ekman's emotions and the Neutral state). Moreover, every facial image is also annotated in terms of continuous valence-arousal dimensions. As in the FER2013 corpus, facial regions are localized.
Table 1
Overview of the research VEDs. St. means states, Part. - participants, Con. - the recording condition, Lab. - the laboratory recording conditions, Wild - the "in-the-wild" recording conditions, "-" - not available.
VED    # St.    # Part.    # Samples/hours    FPS    Resolution    Con.
RAMAS[45] 7 10 Lab.
IEMOCAP[46] 8 10 30 Lab.
CREMA-D[47] 6 91 30 Lab.
RAVDESS[48] 8 24 60 Lab.
SAVEE[49] 7 4 various
AffWild2[50] 7 458 - - Wild
FER2013[17] 7 35888 various
AffectNet[51] 8 450000 Wild
Fig. 1. Sample frames from dynamic VEDs.
The corpus is divided into train, development, and test sets; however, the test set is not publicly available.
Thus, for this research we have chosen both types of the VEDs (with static and dynamic facial expressions), widely covering the high diversity of participants' gender, age, and ethnicity as well as variability in head poses, lighting conditions, occlusions, and the degree of the recording control.

3.2. Methodology

The purpose of the current study is to create a robust and efficient system for the categorical emotion recognition task. We have done it in two steps: (1) implementation of the backbone emotion recognition model, which is able to predict an emotion from a raw image with high performance, and (2) fine-tuning of the backbone emotion recognition system, adding a temporal axis to it (for instance, LSTM layers) and utilizing the cross-corpus training protocol.

3.2.1. Static and Temporal Emotion Recognition Approaches

In this subsection, we describe the structure of every approach we have implemented in this research, specifying the training pipeline of these models. We call the static facial emotion recognition model the backbone model. The training process pipeline for the model is shown in Fig. 2. Hereinafter, the vectors under the facial expression images represent the emotional state in the following order: Neutral, Happiness, Sadness, Surprise, Fear, Disgust, Anger.
Typically, static facial images express the peak levels of emotion. We believe that even a reliable robust model trained on static images will not be able to demonstrate a good generalization capability for real-world applications, since in the wild people often do not express emotions in such a clear way.
Fig. 2. The training process pipeline for the categorical emotion recognition model.
Moreover, when deciding on the state of the interlocutor, humans are well known to rely on context information, taking into account previous emotional states as complementary information. Thus, we have decided to enhance the emotion recognition model with different temporal aggregation techniques. Teaching the model to catch temporal dependencies is a complex task, which requires a huge amount of data. Therefore, we have chosen 6 big different VEDs (the description is presented in Section 3.1), which represent the naturalistic way people express emotions and which are well suited for learning to capture temporal context.
As a baseline, we have implemented two simple temporal approaches:
  • CNN-W: The backbone model is used to get probability predictions for each frame of a video. Next, the emotion label of the video is predicted from the normalized sum of the evaluated probability predictions.
  • CNN-S: The backbone model is used to get probability predictions for each frame of a video. Next, a Hamming window is utilized to smooth the predictions within a window of a chosen length (we have tried different window lengths). The formula for calculating the weights of the Hamming window is presented below:
$$w(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \leq n \leq N-1,$$

where $n$ is an integer index taking values from $0$ to $N-1$, and $N$ is the window length. The Hamming window was applied to the entire video step by step, starting from the first frame and with a shift equal to one frame. Then, all predictions were averaged, resulting in one probability vector.
The first baseline approach (CNN-W) is the simplest one, because it takes into account only the "global" information, essentially catching the most frequent emotion occurring in the considered video. Conversely, the CNN-S approach exploits a weighted aggregation of the frames within a fixed window, assigning the weights depending on the frame's "proximity" to the central frame under consideration. This allows capturing the "local" temporal information. Both methods do not require any training and can be used directly after the deep embedding extraction done by the backbone model.
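For clarity, a minimal sketch of these two training-free baselines is given below. It is our own illustration rather than the released code, and it assumes that the backbone model has already produced a matrix of per-frame emotion probabilities.

```python
# Sketch of the CNN-W and CNN-S aggregation, assuming `frame_probs` is an
# (n_frames, 7) array of per-frame probabilities from the backbone model.
import numpy as np

def cnn_w(frame_probs: np.ndarray) -> np.ndarray:
    """CNN-W: video-level prediction as the normalized sum of frame predictions."""
    summed = frame_probs.sum(axis=0)
    return summed / summed.sum()                 # one probability vector per video

def cnn_s(frame_probs: np.ndarray, window: int = 71) -> np.ndarray:
    """CNN-S: Hamming-weighted smoothing over a sliding window (shift = 1 frame)."""
    if len(frame_probs) < window:                # short clips: fall back to CNN-W
        return cnn_w(frame_probs)
    n = np.arange(window)
    w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (window - 1))   # Hamming weights
    smoothed = []
    for start in range(len(frame_probs) - window + 1):
        chunk = frame_probs[start:start + window]
        smoothed.append((chunk * w[:, None]).sum(axis=0) / w.sum())
    # average the window-level predictions into one probability vector
    return np.mean(smoothed, axis=0)

video_probs = np.random.dirichlet(np.ones(7), size=200)      # dummy frame predictions
print(cnn_w(video_probs).round(3), cnn_s(video_probs).round(3))
```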
However, the CNN-W and CNN-S approaches are not flexible enough for capturing fairly complex temporal dependencies in the data. Therefore, we have developed more sophisticated temporal approaches using the deep embeddings extracted by the backbone model:
  • CNN-SVM: For every video, the deep embeddings are combined into fixed-length sequences (windows). Next, we calculate Means and Standard Deviations (STDs) for every deep embedding within the considered window, as we suggested in [58]. The evaluated statistics are then fed into an SVM, which makes a final emotion prediction for the whole window. To make it clearer, we illustrate the scheme of the considered method in Fig. 3.
  • CNN-LSTM: For every video, the deep embeddings are combined into sequences (windows) of 2 s. However, here we downsample every video to 5 frames per second (FPS) due to the absence of the necessity to calculate statistics. Next, an LSTM network with 512 and 256 consecutive neurons is trained on the dynamic VEDs. To keep the LSTM from overfitting, after each LSTM layer we have added L2 regularization of 0.001 and dropout with a rate of 0.2 (a minimal sketch of this configuration is given after this list). The pipeline of the CNN-LSTM approach is presented in Fig. 4.
  • CNN-GRU: To decrease the number of configurable parameters, we also tried to replace the LSTM layers with Gated Recurrent Unit (GRU) layers. All other hyperparameters are similar to the CNN-LSTM method. Due to the similarity of these two approaches, we did not present the CNN-GRU method in the pipeline depicted in Fig. 4.
  • CNN-LSTM-A: We improved the CNN-LSTM approach by inserting the Attention mechanism proposed in [60] between the LSTM layers. In addition, to speed up the convergence of the training, batch normalization was inserted after every LSTM layer. The pipeline of the proposed method is presented in Fig. 4 as well.
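A minimal Keras-style sketch of the CNN-LSTM configuration referenced above is given below. It is our illustration rather than the authors' released implementation; in particular, the embedding dimensionality is an assumption (taken to match the 512-neuron penultimate dense layer of the backbone model).

```python
# Sketch of the CNN-LSTM temporal head: 2-second windows of backbone embeddings
# (10 frames at 5 FPS), two LSTM layers with 512 and 256 units, L2 = 0.001 and
# dropout = 0.2 after each recurrent layer, softmax over 7 emotions.
import tensorflow as tf
from tensorflow.keras import layers, regularizers

SEQ_LEN = 10        # 2 s * 5 FPS
EMB_DIM = 512       # assumed size of the backbone deep embeddings
NUM_CLASSES = 7     # Neutral, Happiness, Sadness, Surprise, Fear, Disgust, Anger

def build_cnn_lstm_head() -> tf.keras.Model:
    inputs = tf.keras.Input(shape=(SEQ_LEN, EMB_DIM))
    x = layers.LSTM(512, return_sequences=True,
                    kernel_regularizer=regularizers.l2(0.001))(inputs)
    x = layers.Dropout(0.2)(x)
    x = layers.LSTM(256, kernel_regularizer=regularizers.l2(0.001))(x)
    x = layers.Dropout(0.2)(x)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    return model

model = build_cnn_lstm_head()
model.summary()
```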
To evaluate all the proposed approaches, we utilized the leave-one-corpus-out (cross-corpus) cross-validation procedure.

3.2.2. Experimental Setup of Hyper-parameter Search for the Backbone Model

To create an efficient backbone categorical emotion recognition system, we have conducted numerous experiments, varying the following training hyper-parameters:
  • Schedulers of learning rates: Constant (the learning rate is constant throughout the training process); Time-based (the learning rate changes at each epoch); Piecewise Constant (the learning rate is constant at a given iteration); Cosine Annealing [61].
  • Optimization algorithms (Adam, SGD) and initial learning rates.
  • DNN architectures: ResNet-50 [62], SeNet-50 [63], VGG-16 [14] pre-trained on the VGG-Face2 dataset [40], EfficientNet-B0 [64], ResNet-101-V2 [62], MobileNet-V2 [65] pre-trained on the ImageNet dataset [66].
  • Logarithmic Class Weighting [58] and Inversely Proportional Class Weighting.
Fig. 3. The pipeline of the CNN-SVM approach.
Fig. 4. The pipeline of the CNN-LSTM and CNN-LSTM-A approaches.
  • Regularization Methods: Dropout, L2, Gaussian noise.
  • Convolutional Layers Freezing: full freezing, without freezing, and freezing up to, but excluding the last convolutional layer.
  • Number of neurons of the last dense layer: 256, 512, and 1048.
  • Data Augmentation Techniques: Affine Transformations, Combining different VEDs, and Mixup[67].
Experimenting with all the mentioned hyper-parameters, we have chosen the best ones by monitoring the model's performance on the development set during the training of the categorical emotion recognition model. We should note that the backbone model is trained using the static VEDs (AffectNet and FER2013).

4. Experimental Results

In this Section, we present the data pre-processing description and the experimental results, including the backbone categorical model and the conducted cross-corpus analysis.

4.1. Data Pre-Processing

In contrast to the static facial expressions in AffectNet [51] and FER2013 [17], the other VEDs contain dynamic video sequences without pre-detection of the faces. Therefore, the necessity to identify the face region in each video sequence has arisen. Currently, there are many face detectors publicly available for research, and each of them has pros and cons. To choose one for our research, we have evaluated three DL face detectors: the Single Shot Multibox Detector (SSD) [68], RetinaFace [69], and the Multi-task Cascaded CNN (MTCNN) [70]. The efficiency was scored according to two metrics: FPS and the Intersection over Union (IoU). The formula for the IoU is presented below:
$$IoU = \frac{TP}{TP + FP + FN},$$

where $TP$ is the number of true positive samples, $FP$ is the number of false positive samples, and $FN$ is the number of false negative samples.
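A tiny helper computing this detection-level IoU from such counts might look as follows (a sketch for illustration only).

```python
def detection_iou(tp: int, fp: int, fn: int) -> float:
    """IoU over detection counts: correctly found faces divided by the union
    of detected and ground-truth faces (TP + FP + FN)."""
    return tp / (tp + fp + fn)

# Example with the MTCNN counts from Table 2: 1484 TP, 21 FP, 315 FN -> ~81.5%.
print(round(100 * detection_iou(1484, 21, 315), 1))
```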

The experiments were carried out on a randomly selected video from the AffWild2 dataset [52] with the name "99-30-720x720", which contains 1800 frames with one person. AffWild2 is a good dataset for testing the effectiveness of face detectors, since it has many video frames with occlusions, dim or too bright lighting, and a high variety of head poses. We set a minimum confidence threshold for the detectors, while the maximum size of the frame in width/height was set to 300 pixels. The experimental results are presented in Table 2.
Table 2 shows that only RetinaFace was able to find all 1800 faces; however, it has also found 13 additional erroneous faces (FP). The analysis of the mistakes made by RetinaFace showed that setting a higher confidence threshold eliminates all of its errors. However, when setting such a threshold for the other face detectors, the IoU value drops significantly. Thus, since RetinaFace showed the best IoU and a good FPS, we have selected it as the base face detector for our research. For a more detailed study of the effectiveness of face detectors, the reader is kindly referred to the paper [71].
In addition to the face detection, we have considered the raters' annotation confidence level where it was possible (RAMAS, IEMOCAP, and CREMA-D). It has been experimentally proven [72] that with an increase of the annotation confidence level, the accuracy of recognition systems grows. Therefore, currently in the emotion recognition domain, it is common to keep only the samples whose annotation confidence level reaches a certain threshold [73]. This has been applied to the corpora mentioned above as well.
Moreover, to ensure the same processing conditions for the recurrent neural networks in terms of temporality, we have equalized the FPS of every video from the VEDs by downsampling all the video files to 5 FPS. We have utilized the simplest downsampling process - the selection of every n-th frame, while all other frames are skipped.
Table 2
The efficiency of the face detectors expressed via different metrics.
Face detector    TP    FP    FN    IoU, %    FPS
SSD 1787 41 13 99.0
MTCNN 1484 21 315 81.5
RetinaFace 1800 13 -
For instance, if we have a video with an FPS of 10, to downsample it to an FPS of 5, we need to take every second frame.
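A minimal sketch of this every-n-th-frame downsampling is given below; it assumes the source frame rate is an integer multiple of the target frame rate.

```python
# Keep every n-th frame so that the effective frame rate becomes the 5 FPS target.
def downsample_frames(frames, source_fps: int, target_fps: int = 5):
    step = source_fps // target_fps        # e.g. 10 FPS -> 5 FPS keeps every 2nd frame
    return frames[::step]

frames = list(range(30))                   # 3 s of a dummy 10-FPS video
print(downsample_frames(frames, source_fps=10))   # [0, 2, 4, ..., 28]
```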
We should note here that there are some other, more sophisticated downsampling techniques (e.g. [74]); we did not focus on that hyperparameter due to the significant computational time of such techniques.
We present an overview of the datasets utilized in our research, including the class distribution, in Table 3. For the VEDs consisting of static images (AffectNet and FER2013), the number of samples is shown as the overall number of frames. We did it for AffWild2 as well, since it is annotated following the frame-by-frame protocol (meaning that every frame is rated separately).
Observing Table 3, one can note that most of the samples belong to the Neutral and Happiness categories, accounting for almost 76% of the overall data. It is well known that a disproportionately distributed number of samples in the training set negatively affects the performance of ML models [75]. Therefore, to mitigate this problem, we apply data augmentation techniques such as affine transformations, combining different VEDs, and Mixup. Furthermore, we have combined the AffectNet dataset with the data of several emotional categories from FER2013, increasing the number of minority class samples: Sadness by 16%, Surprise by 18.4%, Fear by 39.1%, Disgust by 12.2%, and Angry by 13.8%.
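The training of the backbone model additionally relies on class weighting (inversely proportional class weighting is listed among the training settings in Sections 3.2.2 and 4.2). A minimal sketch of such weighting, using the AffectNet training counts from Table 3 for illustration, is given below; the exact weighting formula used in the experiments may differ.

```python
import numpy as np

def inverse_proportional_weights(class_counts: dict) -> dict:
    """Weight each class inversely to its share of the training data, so that
    rare classes (e.g. Disgust, Fear) contribute more to the loss."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {c: total / (n_classes * n) for c, n in class_counts.items()}

# AffectNet training-set counts from Table 3 (Neutral ... Angry).
counts = {"NE": 74874, "HA": 134415, "SA": 25459, "SU": 14090,
          "FE": 6378, "DI": 3803, "AN": 24882}
print({c: round(w, 2) for c, w in inverse_proportional_weights(counts).items()})
```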

4.2. Backbone Model

The main experiments on the training parameter selection for the backbone model are presented in Table 4. It is important to note that, besides all the methods mentioned in Table 4, the following training parameters were used for all experiments: random affine transformations and contrast variation; the SGD optimization algorithm; two dense layers with 512 and 7 neurons stacked on top of the CNN; dropout with a rate of 0.2 after every dense layer; inversely proportional class weighting; a batch size (BS) equal to 64; 30 training epochs. For the experiments with a constant learning rate, it was set to 0.0001. In addition, since we have exploited a model pre-trained on the VGGFace2 dataset, all the images were normalized before feeding into the CNN in the same fashion as in [40]. The normalization process consists of (1) conversion of the channel scheme from RGB to BGR; (2) centering of each channel according to the means calculated on the VGGFace2 dataset.
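A minimal sketch of this normalization step is given below. The per-channel mean values are an assumption (commonly reported for VGGFace2-pretrained models) and should be taken from the preprocessing utilities of the chosen checkpoint.

```python
import numpy as np

# Per-channel means in BGR order (assumed values; verify against the
# preprocessing of the pre-trained VGGFace2 model actually used).
VGGFACE2_BGR_MEANS = np.array([91.4953, 103.8827, 131.0912], dtype=np.float32)

def normalize_face(image_rgb: np.ndarray) -> np.ndarray:
    """(1) swap channels RGB -> BGR, (2) subtract the VGGFace2 channel means."""
    image_bgr = image_rgb[..., ::-1].astype(np.float32)
    return image_bgr - VGGFACE2_BGR_MEANS
```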
The best model was chosen by monitoring the model efficiency on the development set, in other words, by using early stopping.
To make the pre-processing procedure clearer, the batch generation process is presented in Fig. 5.
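As a rough illustration of this batch generation with Mixup [67], the sketch below mixes each sample with a randomly permuted partner using a coefficient drawn from 0 to 1, as indicated in Fig. 5. The exact sampling scheme is an assumption on our part (standard Mixup draws the coefficient from a Beta distribution).

```python
import numpy as np

def mixup_batch(images: np.ndarray, labels: np.ndarray):
    """Mix each sample with a randomly chosen partner from the same batch.
    `labels` are one-hot encoded; the coefficient is drawn uniformly from
    [0, 1], matching the random value shown in Fig. 5."""
    alpha = np.random.uniform(0.0, 1.0)
    perm = np.random.permutation(len(images))
    mixed_images = alpha * images + (1.0 - alpha) * images[perm]
    mixed_labels = alpha * labels + (1.0 - alpha) * labels[perm]
    return mixed_images, mixed_labels
```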
Analyzing the results from Table 4, we should mention that the Mixup data augmentation technique makes a significant contribution to improving the accuracy of the backbone model (compare experiments 5 and 9). ResNet-50 showed a better performance in comparison with VGG-16 and MobileNet-V2 (see experiments 4, 5, 7, and 8). L2 regularization and adding the FER2013 data to the training slightly improved the model performance as well. Thus, utilizing all the proposed methods significantly magnified the model efficacy on an absolute scale (experiment 9). The confusion matrix for experiment 9 on the AffectNet development set is presented in Fig. 6.
From the confusion matrix, one can see that the backbone model is well-balanced and mostly confuses adjacent emotions: for example, Fear is mostly recognized correctly, while most of its errors are related to Surprise. These emotions are characterized by similar facial features such as wide-set eyebrows (with Fear they are usually lowered, while with Surprise they are raised). Another example lies in the emotions Disgust and Anger: in both cases the eyebrows are usually lowered and pulled together, which explains why the model sometimes predicts Anger instead of Disgust (in 14% of cases).
Table 3
Number of samples per emotional category. NE denotes Neutral state, HA - Happiness, SA - Sadness, SU - Surprise, FE - Fear, DI - Disgust, AN - Angry, TS - train set, DS - development set.
VED NE HA SA SU FE DI AN
RAMAS 172 496 187 341 208 214 249
IEMOCAP 1828 856 1183 155 56 8 1336
CREMA-D 907 1028 233 - 402 679 594
AffWild2 (TS) 589215 152010 101295 39035 11155 12704 24080
AffWild2 (DS) 183636 53702 39486 23113 9754 5825 8002
AffectNet (TS) 74874 134415 25459 14090 6378 3803 24882
Part, % 53.71 22.01 11.54 5.07 2.09 1.54 4.04
Table 4
Experiments on the hyperparameter selection for the training process of the backbone model. The figures from 1 to 9 denote the number of the experiment. GN means Gaussian noise, CA - Cosine Annealing.
Method 1 2 3 4 5 6 7 8 9
ResNet-50 + + + + + + - - +
VGG-16 - - - - - - + - -
MobileNet-V2 - - - - - - - + -
- - - + + + + + +
CA (5 cycles) - - - - + - - + +
Time-based - - - - - + - - -
Fig. 5. The process of batch generation; the mixing coefficient is a random value from 0 to 1.
To read more about the manifestation of particular emotions, the reader is kindly referred to [8]. We would like to note that, in general, we obtained a good, well-balanced, and efficient backbone emotion recognition system with a high recognition rate for every emotion, which is quite a strong result for such a subjective task as emotion recognition. Moreover, to the best of our knowledge, the developed backbone emotion recognition system obtained the highest state-of-the-art results on the AffectNet validation set. To demonstrate it, we have compared the proposed model with other known state-of-the-art results in Table 5.

4.3. Cross-Corpus Analysis

As we described before, for the cross-corpus analysis we have taken dynamic VEDs containing videos with different recording conditions. Since we now have an additional temporal dimension, the necessity to model it has arisen. To accomplish this, we have chosen a window of 2 s for temporal modeling. This length was selected because of the limitation of one VED, CREMA-D: the average length of the videos in this dataset is close to this value, making it difficult to set a bigger window size, since zero/same padding for short videos would only confuse the model.
Fig. 6. Confusion matrix for experiment 9 on the AffectNet development set. NE denotes Neutral state, HA - Happiness, SA - Sadness, SU - Surprise, FE - Fear, DI - Disgust and AN - Anger.

For the CNN-S approach, we have experimented with different lengths of the smoothing window. The number of frames was searched in a brute-force manner over all odd values from 3 to 99. During the experiments, we faced contradictory results: the optimal window size (in terms of model performance) varied from small to high values (61, 37, 73, 23, 97, and 97 for RAMAS, RAVDESS, CREMA-D, IEMOCAP, SAVEE, and AffWild2, respectively). This shows how important the length of the context is in dynamic affective computing. However, to generalize the system, we had to choose one window length for all datasets. This was done by averaging the Unweighted Average Recall (UAR) over all datasets for every window length and choosing the highest one. Ultimately, the window size of 71 was selected, and therefore we present the experimental results (see Table 6) with this window size.
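The selection of a single window length can be expressed as a small grid search: for every odd window size, average the UAR over all corpora and keep the argmax. The sketch below assumes a hypothetical `evaluate_uar(dataset, window)` function and only illustrates the selection rule, not the actual evaluation pipeline.

```python
import numpy as np

DATASETS = ["RAMAS", "RAVDESS", "CREMA-D", "IEMOCAP", "SAVEE", "AffWild2"]

def evaluate_uar(dataset: str, window: int) -> float:
    """Placeholder for the real evaluation of the CNN-S model with a given window."""
    rng = np.random.default_rng(hash((dataset, window)) % (2 ** 32))
    return float(rng.uniform(0.2, 0.8))

candidate_windows = range(3, 100, 2)               # all odd sizes from 3 to 99
mean_uar = {w: np.mean([evaluate_uar(d, w) for d in DATASETS])
            for w in candidate_windows}
best_window = max(mean_uar, key=mean_uar.get)      # window with the highest average UAR
print(best_window, round(mean_uar[best_window], 3))
```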
For all the experiments with the LSTM-based and GRU-based networks, SGD with a learning rate of 0.0001 was chosen. We have selected the following hyperparameters empirically: 512 and 256 neurons for the first and the second recurrent layers, a dropout rate of 0.2, and a regularization parameter of 0.001 for the recurrent layers. As we noted earlier, we have implemented two temporal aggregation approaches: a plain recurrent network (CNN-LSTM and CNN-GRU) and an attention-based LSTM network (CNN-LSTM-A), which is supposed to focus more on the significant deep embeddings extracted by the trained CNN (backbone model). For the CNN-LSTM-A model we have exploited the attention mechanism developed by Z. Yang et al. in [60].
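Under the hyperparameters listed above (two recurrent layers with 512 and 256 units, dropout of 0.2, a regularization penalty of 0.001 on the recurrent layers, and SGD with a learning rate of 0.0001), a minimal Keras sketch of the CNN-LSTM temporal block could look as follows. The window length, embedding size, the exact placement of dropout and regularization, and the use of L2 as the penalty are assumptions, not the released implementation.

```python
import tensorflow as tf

WINDOW, EMB_DIM, NUM_CLASSES = 40, 512, 7            # assumed shapes

def build_cnn_lstm() -> tf.keras.Model:
    """Temporal block stacked on top of pre-extracted backbone embeddings."""
    reg = tf.keras.regularizers.l2(0.001)
    inputs = tf.keras.Input(shape=(WINDOW, EMB_DIM))
    x = tf.keras.layers.LSTM(512, return_sequences=True,
                             kernel_regularizer=reg, recurrent_regularizer=reg)(inputs)
    x = tf.keras.layers.Dropout(0.2)(x)
    x = tf.keras.layers.LSTM(256, kernel_regularizer=reg, recurrent_regularizer=reg)(x)
    x = tf.keras.layers.Dropout(0.2)(x)
    outputs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),
                  loss="categorical_crossentropy",
                  metrics=[tf.keras.metrics.CategoricalAccuracy()])
    return model

build_cnn_lstm().summary()
```

Replacing the two LSTM layers with GRU layers, or adding an attention layer over the first LSTM's outputs, would yield the CNN-GRU and CNN-LSTM-A variants discussed in the text.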
The results of the cross-corpus experiments are presented in Table 6. According to the results, utilizing the Hamming window (CNN-W) for smoothing allows a slight increase in the models' performance on all considered datasets. Surprisingly, the CNN-SVM approach performed worse on all VEDs except CREMA-D. This can be due to the temporal complexity, which the statistical parameters (means and STDs) used by the SVM were not able to encode well enough.
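For reference, smoothing per-frame class probabilities with a Hamming window (as in CNN-W) and pooling mean/STD functionals for an SVM (as in CNN-SVM) can be sketched as below. The window length and the fact that the functionals are applied to probabilities rather than deep embeddings are illustrative assumptions.

```python
import numpy as np

def hamming_smooth(probs: np.ndarray, window: int = 71) -> np.ndarray:
    """Smooth (num_frames, num_classes) scores with a normalized Hamming window."""
    kernel = np.hamming(window)
    kernel /= kernel.sum()
    return np.stack([np.convolve(probs[:, c], kernel, mode="same")
                     for c in range(probs.shape[1])], axis=1)

def mean_std_functionals(features: np.ndarray) -> np.ndarray:
    """Video-level means and STDs, as used for the SVM input in the CNN-SVM variant."""
    return np.concatenate([features.mean(axis=0), features.std(axis=0)])

frame_probs = np.random.dirichlet(np.ones(7), size=200)   # toy per-frame scores
print(hamming_smooth(frame_probs).shape, mean_std_functionals(frame_probs).shape)
```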
Regarding the CNN-LSTM and CNN-LSTM-A approaches, both have demonstrated a significant improvement in the UAR in comparison with the other approaches. Replacing the LSTM layers with GRU layers has overall decreased the performance. This can be partially explained by the decrease in the number of tunable parameters, which worsened the generalization ability of the GRU-based network. From the observation of the results, however, it is difficult to single out either CNN-LSTM or CNN-LSTM-A, since the outcome highly depends on the considered VED.
Table 5
Comparison with the state-of-the-art results on the AffectNet validation set.

Research                 Approach                      Accuracy, %
Wang et al. [76]         SCN                           60.2
Kervadec et al. [77]     CAKE                          61.7
She et al. [78]          Res-50IBN                     63.1
Georgescu et al. [79]    VGG and BOVW features + SVM   63.3
Kollias et al. [80]      FaceBehaviorNet               65.0
Savchenko [81]           EfficientNet-B2               66.3
This work                Backbone model
Table 6
The results of the cross-corpus experiments (UAR, %).
VED CNN-W CNN-S CNN-SVM CNN-LSTM CNN-GRU CNN-LSTM-A CNN-LSTM-CV
RAMAS 42.8 42.8 26.5 50.2
RAVDESS 59.6 59.8 56.6 69.7
CREMA-D 54.5 55.4 60.6 79.0
IEMOCAP 25.7 25.9 26.3 28.7
SAVEE 62.1 63.7 38.3 82.8
AffWild2 45.1 46.2 33.0 52.9
Average UAR 48.3 48.8 40.2 60.6
We have conducted Student's paired t-test on the CNN-LSTM and CNN-LSTM-A results to figure out whether the difference in their efficiency is statistically significant. The t-test found no statistically significant difference between these two methods. Thus, both of them can be used for efficient emotion recognition, yet the CNN-LSTM model contains fewer parameters and, therefore, can be preferable. In conclusion, we should say that during the leave-one-corpus-out cross-validation procedure we have achieved, on average, an increase in terms of UAR compared to the simplest baseline approach.
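The significance check described here can be reproduced with a paired t-test over per-corpus UAR scores; the score lists below are hypothetical placeholders, not the values from Table 6.

```python
from scipy import stats

# Hypothetical per-corpus UAR scores (%) for the two temporal models.
uar_cnn_lstm = [50.0, 70.0, 79.0, 29.0, 83.0, 53.0]
uar_cnn_lstm_a = [49.0, 71.0, 78.0, 29.5, 83.5, 52.0]

t_score, p_value = stats.ttest_rel(uar_cnn_lstm, uar_cnn_lstm_a)
alpha = 0.05
print(f"t = {t_score:.3f}, p = {p_value:.3f}, significant = {p_value < alpha}")
```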
To test the model's generalization ability further, we have also conducted participant-independent cross-validation (see Table 6, column CNN-LSTM-CV). This was carried out with 5 folds for every dataset (4 folds for SAVEE). We should also note that we have preserved the deliberate separation of AffWild2 into the train and development sets, while performing leave-one-session-out validation for the RAMAS and IEMOCAP datasets.
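Participant-independent folds of this kind can be built with a grouped splitter, so that no subject appears in both the training and validation parts. A minimal sketch, assuming a subject identifier is available per sample, is shown below.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Toy data: 20 video-level samples, 7-class labels, and subject IDs.
X = np.random.rand(20, 256)
y = np.random.randint(0, 7, size=20)
subjects = np.repeat(np.arange(5), 4)              # 5 subjects, 4 samples each

for fold, (train_idx, val_idx) in enumerate(
        GroupKFold(n_splits=5).split(X, y, groups=subjects)):
    # No subject may appear in both partitions of a fold.
    assert not set(subjects[train_idx]) & set(subjects[val_idx])
    print(f"fold {fold}: train={len(train_idx)}, val={len(val_idx)}")
```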
We have used the CNN-LSTM approach for this evaluation, since it outperformed all the others and is computationally more efficient than CNN-LSTM-A. The results of these experiments are also presented in Table 6. Participant-independent cross-validation allows analyzing how the model functions when it faces a completely new, previously unseen person. An efficient model should "omit" the unimportant differences between faces, highlighting and exploiting only the facial features salient for affective computing (in our case, emotion recognition). As one can see from Table 6, the participant-independent experiments have shown a dramatic increase, on average, in the CNN-LSTM model performance in comparison with the cross-corpus experiments. While this was expected, we would like to point out that both conducted experiments prove the ability of the model to generalize the learned features regardless of the variability in faces, lighting, obstacles, and other recording conditions.

5. Discussion

In this Section, we discuss the features of the developed framework, compare it with other state-of-the-art emotion recognition works, and provide an analysis of the augmentation techniques, which dramatically influenced the model's efficacy.

5.1. Comparison with State-of-the-Art Works

The cross-corpus problem has been attracting researchers' attention for a long time, since without resolving it, it is not possible to develop a truly general emotion recognition system. Although there are quite a few studies devoted to it, most of them investigate the audio modality and use 2-3 corpora. To the best of our knowledge, our work, in turn, is the first research conducted on such a large number of VEDs for the visual modality.
For this reason, it is difficult to compare the obtained results with other papers, where the same datasets were exploited separately. Such a comparison would be inadequate, since in those works the system was trained and tested on the same corpus, which obviously leads to higher performance. However, we can collate the results of our former work [44], where we conducted a study on two corpora (CREMA-D and RAVDESS), with the current one. The comparison in terms of the Accuracy performance measure is presented in Table 7. We can note that the current model's efficiency is comparable with our former study despite the bigger amount of in-the-wild training data, which can "confuse" the model during the evaluation phase on the laboratory-obtained data.
To broaden the comparison, we have included in Table 7 the results of the participant-independent cross-validation studies [82-84] in terms of UAR and F1-score (the measure was chosen depending on the scores reported by the authors). Such works can be viewed as cross-corpus research to some extent, since the validation part of the data does not contain participants included in the training data; nevertheless, this setting is much easier than the original cross-corpus problem. From the results, we can mention that the developed system has a lower UAR on the CREMA-D and SAVEE datasets, while a higher F1-score on the AffWild2 dataset. As we noted earlier, this can be caused by the "nature" of the data: our model was trained on more diverse data biased towards in-the-wild conditions (represented by AffWild2), while the participant-independent cross-validation studies utilized solely laboratory-controlled data.
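Since the comparison mixes UAR and macro F1-score, it may help to recall how both are computed; in scikit-learn terms, UAR corresponds to macro-averaged recall (balanced accuracy). The labels below are toy values, not results from the paper.

```python
from sklearn.metrics import recall_score, f1_score

y_true = [0, 1, 2, 2, 3, 4, 5, 6, 6, 1]
y_pred = [0, 1, 2, 3, 3, 4, 5, 6, 5, 1]

uar = recall_score(y_true, y_pred, average="macro")   # unweighted average recall
f1 = f1_score(y_true, y_pred, average="macro")        # macro F1-score
print(f"UAR = {uar:.3f}, macro F1 = {f1:.3f}")
```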

5.2. Backbone Model

Currently, the facial expression recognition models in the affective computing field are still far from perfect in terms of accuracy. This is because of two main things: (1) the imperfection of the models and (2) the corruption in the data annotation (mislabelling, raters' subjectivity) used for training. Although we cannot state that the backbone model obtained during our study has no shortcomings at all, we would like to demonstrate its robustness and decent performance via an analysis of its functioning on various complex frames from the dynamic VEDs.
To show where the model is "looking" during the decision-making process, we have exploited the GradCAM [85] technique, obtaining gradient heatmaps for every frame under consideration. Fig. 7 presents the heatmaps for eight correctly and incorrectly recognized facial expressions, which were randomly chosen from the dynamic VEDs. The redder the area of the image, the more attention the model "pays" to this area.
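A compact Grad-CAM sketch for a Keras-style backbone is given below; the convolutional layer name, input preprocessing and model format are assumptions and would need to match the actual released model.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model: tf.keras.Model, image: np.ndarray,
             class_idx: int, conv_layer: str) -> np.ndarray:
    """Return a [0, 1] heatmap with the spatial size of the chosen conv layer.

    `image` is assumed to be an already preprocessed float32 array of shape (H, W, 3).
    """
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, class_idx]
    grads = tape.gradient(score, conv_out)                 # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))           # global-average-pooled gradients
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```

For a ResNet-50-style backbone, one would typically pass the last convolutional block, e.g. `grad_cam(backbone, face_crop, class_idx=2, conv_layer="conv5_block3_out")` (the layer name is an assumption).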
Table 7
The comparison with other cross-corpus and participant-independent cross-validation studies. CC means cross-corpus, PI - participant-independent.

# Type VED Metric CNN-LSTM Other studies
1 CC RAVDESS Accuracy 67.3
2 CC CREMA-D Accuracy
3 PI CREMA-D UAR 66.6
4 PI SAVEE UAR 82.8
5 PI AffWild2 F1-score
Fig. 7. GradCAM heatmaps for the backbone model with correctly/incorrectly recognized facial expressions.
Analyzing Fig. 7, we can note that the model correctly identifies the important regions of the faces for certain emotions despite the difficulties caused by head tilt. For instance, in the picture with Disgust, it correctly (according to Ekman) pays attention to the raised upper lip and the wrinkles around the nose. When we take a look at a picture with Sadness, we can observe that the model takes into account the dropped eyelids and pulled-together brows.
Nevertheless, there were also cases where the model made mistakes. For example, the model wrongly predicts the state of the man depicted in the bottom-left corner of Fig. 7. Looking at the almost closed eyes and the noticeable bags under the eyes, the model is confused, predicting Sadness instead of the annotated emotion Anger. Observing another participant's face in the bottom-right corner (Happiness), we suppose that the model was confused by the lack of a smile on the face. Taking into account the wrinkles, it wrongly predicted the emotion Disgust.
Because of the subjectivity of emotions, datasets frequently contain corrupted annotations, which can strongly bias the model, confusing it and decreasing its efficiency. To examine this, we have chosen several frames that, in our opinion (and based on Ekman's rules), show emotions different from the annotated ones, and depicted them in Fig. 8.
For instance, the woman with the annotated emotion Neutral has all the features of the emotion Sadness, except for the lips: the eyebrows are raised and pulled together, the eyelids are almost fully dropped, there are tears on the face, and wrinkles on the sides of the nose. That is why the model has almost no response in terms of a gradient for the Neutral emotion. On the other hand, when the emotion Sadness is chosen for the GradCAM, the model correctly takes into account all the aforementioned facial areas, predicting Sadness.

Considering another instance initially annotated as Anger (Fig. 8, upper right corner), we can note similar things: the jaw dropped open, raised and pulled-together eyebrows, raised upper eyelids, and tensed lower eyelids indicate that it is rather the emotion Fear. The raters likely chose Anger because of the widely opened eyes and pulled-together eyebrows. It could also be caused by the interpolation of the annotations: while two frames (separated by the annotation frequency) were expressing, for example, Anger, the participant's Fear state between these two time points was missed.
Nevertheless, we can partially eliminate such problems as corrupted labels by using the Mixup technique, which mixes two images and, therefore, weakens the influence of the corrupted labels on the model. In the next subsection, we present its analysis, demonstrating the functioning of the model on separate and mixed images.

5.3. The Mixup Technique Analysis

As we described earlier, the backbone model was trained using the Mixup data augmentation technique (see Section 4.2), which has given a significant gain in the model performance expressed in UAR. This is because Mixup allows us to weaken (smooth) the labels by combining images from different categories, forcing the network to refrain from prediction overconfidence. Moreover, besides regularizing the model, it increases its robustness by resisting corrupted labels (labels which were annotated incorrectly) [67]. To show the aforementioned Mixup features, we have randomly taken two images from the AffectNet development set, applied the Mixup technique to them with a randomly generated mixing ratio, and obtained class probability predictions from the trained backbone model. The result is depicted in Fig. 9.
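A minimal Mixup sketch consistent with this description (a mixing value drawn uniformly from 0 to 1, applied to both the images and their one-hot labels) is shown below. The uniform sampling follows Fig. 5, whereas the original Mixup formulation [67] draws the coefficient from a Beta distribution; the image size and class indices are illustrative.

```python
import numpy as np

def mixup(x1: np.ndarray, y1: np.ndarray,
          x2: np.ndarray, y2: np.ndarray,
          rng: np.random.Generator = np.random.default_rng()):
    """Blend two images and their one-hot labels with a random ratio in [0, 1]."""
    lam = rng.uniform(0.0, 1.0)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2                # soft (smoothed) label
    return x, y, lam

# Toy example with two 224x224 RGB images and 7-class one-hot labels.
img_a, img_b = np.random.rand(224, 224, 3), np.random.rand(224, 224, 3)
lab_a, lab_b = np.eye(7)[1], np.eye(7)[6]          # e.g. Happiness vs. Anger
mixed_img, mixed_lab, lam = mixup(img_a, lab_a, img_b, lab_b)
print(lam, np.round(mixed_lab, 2))
```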
Fig. 8. GradCAM heatmaps for the backbone model with incorrectly/correctly annotated facial expressions.
Fig. 9. The evaluation of the Backbone model on examples with and without applied Mixup (panel labels: Normalized images, Mixup).
The vectors in green and blue are probability vectors for the 7 emotional classes (the order is Neutral, Happiness, Sadness, Surprise, Fear, Disgust, Anger). The ground-truth labels (green) are one-hot vectors, while the blue row presents the probability prediction generated by the model. Analyzing Fig. 9, we can observe that the backbone model functions well both with one-hot and smoothed labels. For instance, the model predicted the right one of the normalized images as the Happiness state with a probability very close to the one-hot encoding. On the other hand, when mixing this image with another one (the Mixup block, right image), the model correctly forecasts the Happiness state (the error is only 0.1), while also predicting the Anger state of the other superimposed image with a probability of 0.2. Thus, one can see that during training the model was adapted to both one-hot and smoothed labels, which allowed it to overcome the problem of prediction overconfidence.
During the training, we have examined the erroneously classified images in the AffectNet development set. After the analysis, we have found that some of them are corrupted in the same way as we noted in the previous section. However, it is interesting to see how the model works on mixed images with corrupted and non-corrupted labels. Fig. 10 shows an illustrative example. Images framed in red/green show an incorrect/correct predicted class label.
From Fig. 10 one can see that the first normalized image is labeled as Disgust, while it contains all the features of the Surprise emotion (eyebrows raised, but not drawn together; upper eyelids raised, lower eyelids neutral; jaw dropped down), turning it into a corrupted label.

Fig. 10. The evaluation of the Backbone model on an incorrectly annotated example with and without applied Mixup.
After mixing this image with a Sadness-annotated image, we got a probability of 0.3 for Sadness and 0.7 for Disgust for the first generated sample, and vice versa for the second one. In case the corrupted label prevails (the first generated image), one can see that the model is confused, trying to focus simultaneously on the two presented emotions, although preferring to stay with the emotion Surprise. Such confusion forces the model to be doubtful and not make strict decisions (like one-hot predictions), which helps it not to focus entirely on the corrupted label. Regarding the second generated image, we can observe that when the emotion is expressed in a really strong way, the model prefers to give such an emotion a high probability, nevertheless softening it a little bit. Thus, the Mixup technique allows us to soften the bias of an image with a corrupted label by mixing it with a non-corrupted one, obtaining two new samples (less corrupted in comparison with the previous ones), and making the model doubt and cope with complex "two-emotional" states.
In addition, we would like to note that the figures presented in this subsection indicate once more that the backbone model is adapted to different ages, genders, lighting conditions, head movements, and obstacles, making it a good base for an efficient emotion recognition model.

6. Conclusions

In this article, we have presented probably one of the largest cross-corpus studies on visual emotion recognition to date. We suggested a novel and effective E2E emotion recognition framework consisting of two key elements, which are employed for different functions: (1) the backbone emotion recognition model, which is based on the VGGFace2 ResNet50 model, trained in a balanced way, and able to predict emotion from a raw image with high performance; and (2) the temporal block stacked on top of the backbone model and trained with dynamic VEDs using the cross-corpus protocol in order to show its reliability and effectiveness.
During the research, the backbone model was fine-tuned on the largest facial expression corpus containing static images, AffectNet. Our backbone model achieved state-of-the-art accuracy on the AffectNet validation set, outperforming all currently known works.
Models trained on static images are usually biased, since such images demonstrate peak levels of emotions that are rarely expressed in in-the-wild conditions. Moreover, utilizing only the backbone model, it is not possible to take temporal information into account, yet it is extremely important in in-the-wild circumstances: exploiting context information (previous video frames), the model can overcome such problems as bad lighting, occasional occlusion, noise, etc. Therefore, we have conducted extensive experiments on six datasets (RAMAS, IEMOCAP, CREMA-D, RAVDESS, SAVEE, AffWild2) with various temporal aggregation techniques (CNN-SVM, CNN-LSTM, CNN-GRU, and CNN-LSTM-A), leveraging the leave-one-corpus-out protocol to figure out the best method in terms of its performance and ability to generalize.
Analysis of the results shows that the CNN-LSTM and CNN-LSTM-A approaches provide the best efficiency on most of the research datasets (RAMAS, IEMOCAP, CREMA-D, RAVDESS, SAVEE, and AffWild2). Moreover, we have achieved and, in some cases, outperformed the state-of-the-art results. However, there is no statistically significant difference in performance between these two approaches. Thus, we suggest using the CNN-LSTM model, since it has fewer parameters and is more computationally efficient.
Additionally, to show the predictive balance of the model, we have studied its functioning on images randomly selected from the dynamic VEDs, analyzing the model's behavior in the case of mixed labels and utilizing the GradCAM technique. We found that the backbone model demonstrates good performance not only on images expressing one emotion, but also on complex combined emotional states. Moreover, the usage of GradCAM allowed us to show that the model "looks" at the right salient areas of the facial expressions even when they have occlusions, bad lighting conditions, or are turned away from the camera.
To enable the reproduction of our results, we provide the trained models, as well as the code for testing them, on GitHub².
In our future research, we plan to perform a cross-corpus analysis for both acoustic and linguistic modalities in order to fuse different information channels and systems that process them into one multi-modal emotion recognition system, which is a promising approach based on some previous studies [86,87].

Funding

This work was supported by the Analytical Center for the Government of the Russian Federation (IGK 000000D730321P5Q0002), agreement No. 70-2021-00141.

CRediT authorship contribution statement

Elena Ryumina: Conceptualization, Methodology, Investigation, Validation, Visualization, Writing - original draft. Denis Dresvyanskiy: Conceptualization, Methodology, Investigation, Writing - original draft, Visualization. Alexey Karpov: Supervision, Writing - review & editing, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

[1] J. Yang, R. Wang, X. Guan, M.M. Hassan, A. Almogren, A. Alsanad, AI-enabled emotion-aware robot: The fusion of smart clothing, edge clouds and robotics, Future Generation Computer Systems 102 (2020) 701-709, https://doi.org/ 10.1016/j.future.2019.09.029.
[1] J. Yang, R. Wang, X. Guan, M.M. Hassan, A. Almogren, A. Alsanad, AI 啟用的情感感知機器人:智能服裝、邊緣雲和機器人的融合,未來世代計算機系統 102(2020)701-709,https://doi.org/10.1016/j.future.2019.09.029。
[2] Z. Liu, M. Wu, W. Cao, L. Chen, J. Xu, R. Zhang, M. Zhou, J. Mao, A facial expression emotion recognition based human-robot interaction system, IEEE/ CAA Journal of Automatica Sinica 4 (4) (2017) 668-676, https://doi.org/ 10.1109/JAS.2017.7510622.
[2] Z. Liu, M. Wu, W. Cao, L. Chen, J. Xu, R. Zhang, M. Zhou, J. Mao, 基於面部表情情感識別的人機交互系統,IEEE/ CAA 自動化學報 4(4)(2017)668-676,https://doi.org/10.1109/JAS.2017.7510622。
[3] A. Shukla, S.S. Gullapuram, H. Katti, K. Yadati, M. Kankanhalli, R. Subramanian, Affect recognition in ads with application to computational advertising, in: 25th ACM International Conference on Multimedia, 2017, pp. 1148-1156, https://doi.org/10.1145/3123266.3123444.
[3] A. Shukla, S.S. Gullapuram, H. Katti, K. Yadati, M. Kankanhalli, R. Subramanian, 應用於計算廣告的廣告情感識別, 在: 第 25 屆 ACM 多媒體國際會議, 2017, 頁 1148-1156, https://doi.org/10.1145/3123266.3123444.
[4] S. Cosentino, E.I. Randria, J.-Y. Lin, T. Pellegrini, S. Sessa, A. Takanishi, Group emotion recognition strategies for entertainment robots, in: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 813-818, https://doi.org/10.1109/IROS.2018.8593503.
[4] S. Cosentino, E.I. Randria, J.-Y. Lin, T. Pellegrini, S. Sessa, A. Takanishi, 娛樂機器人的群體情感識別策略, 在: IEEE/RSJ 智能機器人與系統國際會議(IROS), 2018, 頁 813-818, https://doi.org/10.1109/IROS.2018.8593503.
[5] Z. Fei, E. Yang, D.D.-U. Li, S. Butler, W. Ijomah, X. Li, H. Zhou, Deep convolution network based emotion analysis towards mental health care, Neurocomputing 388 (2020) 212-227, https://doi.org/10.1016/j.neucom.2020.01.034.
[5] Z. Fei, E. Yang, D.D.-U. Li, S. Butler, W. Ijomah, X. Li, H. Zhou, 基於深度卷積網絡的情感分析對心理健康護理的影響, Neurocomputing 388 (2020) 212-227, https://doi.org/10.1016/j.neucom.2020.01.034.
[6] M.S. Hossain, G. Muhammad, Emotion-aware connected healthcare big data towards 5G, IEEE Internet of Things Journal 5 (4) (2017) 2399-2406, https:// doi.org/10.1109/JIOT.2017.2772959.
[6] M.S. Hossain, G. Muhammad, 情感感知連接醫療大數據走向 5G, IEEE 物聯網期刊 5 (4) (2017) 2399-2406, https://doi.org/10.1109/JIOT.2017.2772959.
[7] D. Yang, A. Alsadoon, P.C. Prasad, A.K. Singh, A. Elchouemi, An emotion recognition model based on facial recognition in virtual learning environment, Procedia Computer Science 125 (2018) 2-10, https://doi.org/10.1016/j. procs.2017.12.003.
[7] D. Yang, A. Alsadoon, P.C. Prasad, A.K. Singh, A. Elchouemi, 基於虛擬學習環境中的面部識別的情感識別模型, Procedia 計算機科學 125 (2018) 2-10, https://doi.org/10.1016/j.procs.2017.12.003.
[8] P. Ekman, W. Friesen, Nonverbal leakage and clues to deception, Psychiatry 32 (1) (1969) 88-106, https://doi.org/10.1080/00332747.1969.11023575.
[8] P. Ekman, W. Friesen, 非語言泄漏和欺騙線索, 精神病學 32 (1) (1969) 88-106, https://doi.org/10.1080/00332747.1969.11023575.
[9] J.A. Russell, A circumplex model of affect, Journal of Personality and Social Psychology 39 (6) (1980) 1161-1178, https://doi.org/10.1037/h0077714.
[9] J.A. Russell, 情感的圓形模型, 《人格與社會心理學雜誌》39 (6) (1980) 1161-1178, https://doi.org/10.1037/h0077714.
[10] J. Parry, D. Palaz, G. Clarke, P. Lecomte, R. Mead, M. Berger, G. Hofer, Analysis of deep learning architectures for cross-corpus speech emotion recognition, Interspeech (2019) 1656-1660, https://doi.org/10.21437/Interspeech.20192753.
[10] J. Parry, D. Palaz, G. Clarke, P. Lecomte, R. Mead, M. Berger, G. Hofer, 跨語料庫語音情感識別的深度學習架構分析, Interspeech (2019) 1656-1660, https://doi.org/10.21437/Interspeech.20192753.
[11] S. Zhang, S. Zhang, T. Huang, W. Gao, Q. Tian, Learning affective features with a hybrid deep model for audio-visual emotion recognition, IEEE Transactions on Circuits and Systems for Video Technology 28 (10) (2018) 3030-3043, https:// doi.org/10.1109/TCSVT.2017.2719043
[11] S. Zhang, S. Zhang, T. Huang, W. Gao, Q. Tian, 用混合深度模型學習情感特徵以進行視聽情感識別, 《IEEE 視頻技術電路與系統交易》28 (10) (2018) 3030-3043, https://doi.org/10.1109/TCSVT.2017.2719043
[12] E. Friesen, P. Ekman, Facial action coding system: a technique for the measurement of facial movement, Palo Alto 3 (2) (1978) 5.
[12] E. Friesen, P. Ekman, 面部動作編碼系統:用於測量面部運動的技術,Palo Alto 3 (2) (1978) 5.
[13] C. Shu, X. Ding, C. Fang, Histogram of the oriented gradient for face recognition, Tsinghua Science and Technology 16 (2) (2011) 216-224, https://doi.org/ 10.1016/S1007-0214(11)70032-3.
[13] C. Shu, X. Ding, C. Fang, 面部識別的導向梯度直方圖,清華科技 16 (2) (2011) 216-224,https://doi.org/ 10.1016/S1007-0214(11)70032-3.
[14] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, in: 3rd International Conference on Learning Representations (ICLR), 2015, pp. 1-14.
[14] K. Simonyan, A. Zisserman, 用於大規模圖像識別的非常深度卷積網絡,在:第 3 屆國際學習表示會議(ICLR),2015 年,頁 1-14。
[15] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778, https://doi.org/10.1109/CVPR.2016.90.
[15] K. He, X. Zhang, S. Ren, J. Sun, 深度殘差學習用於圖像識別, 在: 2016 年 IEEE 計算機視覺和模式識別會議(CVPR), pp. 770-778, https://doi.org/10.1109/CVPR.2016.90.
[16] H.-W. Ng, V.D. Nguyen, V. Vonikakis, S. Winkler, Deep learning for emotion recognition on small datasets using transfer learning, in: 17th ACM on International Conference on Multimodal Interaction, 2015, pp. 443-449, https://doi.org/10.1145/2818346.2830593.
[16] H.-W. Ng, V.D. Nguyen, V. Vonikakis, S. Winkler, 小數據集上使用轉移學習進行情感識別的深度學習, 在: 2015 年第 17 屆 ACM 多模式交互國際會議, pp. 443-449, https://doi.org/10.1145/2818346.2830593.
[17] I.J. Goodfellow, D. Erhan, P.L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.H. Lee, et al., Challenges in representation learning: A report on three machine learning contests, in: International Conference on Neural Information Processing, 2013, pp. 117-124, https://doi. org/10.1007/978-3-642-42051-1_16.
[17] I.J. Goodfellow, D. Erhan, P.L. Carrier, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.H. Lee, 等, 表征學習中的挑戰: 三個機器學習競賽報告, 在: 2013 年國際神經信息處理會議, pp. 117-124, https://doi. org/10.1007/978-3-642-42051-1_16.
[18] G. Levi, T. Hassner, Emotion recognition in the wild via convolutional neural networks and mapped binary patterns, in: 17th ACM on International Conference on Multimodal Interaction, 2015, pp. 503-510, https://doi.org/ 10.1145/2818346.2830587.
[18] G. Levi,T. Hassner,通過卷積神經網絡和映射二進制模式在野外進行情感識別,於 2015 年第 17 屆 ACM 國際多模態交互會議上,頁面 503-510,https://doi.org/10.1145/2818346.2830587。
[19] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9, https://doi.org/10.1109/CVPR.2015.7298594.
[19] C. Szegedy,W. Liu,Y. Jia,P. Sermanet,S. Reed,D. Anguelov,D. Erhan,V. Vanhoucke,A. Rabinovich,通過卷積神經網絡深入探究,於 2015 年 IEEE 計算機視覺和模式識別(CVPR)會議上,頁面 1-9,https://doi.org/10.1109/CVPR.2015.7298594。
[20] S.A. Bargal, E. Barsoum, C.C. Ferrer, C. Zhang, Emotion recognition in the wild from videos using images, in: 18th ACM International Conference on Multimodal Interaction, 2016, pp. 433-436, https://doi.org/10.1145/ 2993148.2997627.
[20] S.A. Bargal,E. Barsoum,C.C. Ferrer,C. Zhang,通過圖像從視頻中進行野外情感識別,在 2016 年第 18 屆 ACM 國際多模態交互會議上,頁面 433-436,https://doi.org/10.1145/2993148.2997627。
[21] P. Balouchian, H. Foroosh, Context-sensitive single-modality image emotion analysis: A unified architecture from dataset construction to cnn classification, in: 25th IEEE International Conference on Image Processing (ICIP), 2018, pp. 1932-1936, https://doi.org/10.1109/ICIP.2018.8451048.
[21] P. Balouchian, H. Foroosh, 上下文敏感的單模式圖像情感分析: 從數據集構建到 cnn 分類的統一架構, 在: 第 25 屆 IEEE 國際圖像處理大會(ICIP), 2018, 頁 1932-1936, https://doi.org/10.1109/ICIP.2018.8451048.
[22] M.-C. Sun, S.-H. Hsu, M.-C. Yang, J.-H. Chien, Context-aware cascade attentionbased RNN for video emotion recognition, in: First Asian Conference on
[22] M.-C. Sun, S.-H. Hsu, M.-C. Yang, J.-H. Chien, 上下文感知級聯注意力 RNN 用於視頻情感識別, 在: 第一屆亞洲情感計算和智能交互會議(ACII Asia), 2018, 頁 1-6. doi:10.1109/ACIIAsia.2018.8470372.

Affective Computing and Intelligent Interaction (ACII Asia), 2018, pp. 1-6. doi:10.1109/ACIIAsia.2018.8470372.
[23] J. Lee, S. Kim, S. Kim, J. Park, K. Sohn, Context-aware emotion recognition networks, IEEE/CVF International Conference on Computer Vision (2019) 10143-10152, https://doi.org/10.1109/ICCV.2019.01024.
[23] J. Lee, S. Kim, S. Kim, J. Park, K. Sohn, 具有上下文感知的情緒識別網絡, IEEE/CVF 國際計算機視覺會議 (2019) 10143-10152, https://doi.org/10.1109/ICCV.2019.01024.
[24] D. Nguyen, K. Nguyen, S. Sridharan, A. Ghasemi, D. Dean, C. Fookes, Deep spatio-temporal features for multimodal emotion recognition, in: IEEE Winter Conference on Applications of Computer Vision (WACV), 2017, pp. 1215-1223, https://doi.org/10.1109/WACV.2017.140.
[24] D. Nguyen, K. Nguyen, S. Sridharan, A. Ghasemi, D. Dean, C. Fookes, 用於多模式情緒識別的深度時空特徵, 在: IEEE 冬季計算機視覺應用會議 (WACV), 2017, pp. 1215-1223, https://doi.org/10.1109/WACV.2017.140.
[25] S. Zhang, S. Zhang, T. Huang, W. Gao, Q. Tian, Learning affective features with a hybrid deep model for audio-visual emotion recognition, IEEE Transactions on Circuits and Systems for Video Technology 28 (10) (2017) 3030-3043, https:// doi.org/10.1109/TCSVT.2017.2719043.
[25] S. Zhang, S. Zhang, T. Huang, W. Gao, Q. Tian, 通過混合深度模型學習具有情感特徵的音視頻情緒識別, IEEE 視頻技術電路與系統交易 28 (10) (2017) 3030-3043, https:// doi.org/10.1109/TCSVT.2017.2719043.
[26] T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, D. Manocha, M3er: Multiplicative multimodal emotion recognition using facial, textual, and speech cues, in: AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 1359-1367. doi:10.1609/aaai.v34i02.5492.
[26] T. Mittal, U. Bhattacharya, R. Chandra, A. Bera, D. Manocha, M3er: 使用面部、文本和語音線索的乘法多模情感識別, 在: AAAI 人工智慧大會, Vol. 34, 2020, pp. 1359-1367. doi:10.1609/aaai.v34i02.5492.
[27] J. Huang, J. Tao, B. Liu, Z. Lian, M. Niu, Multimodal transformer fusion for continuous emotion recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 3507-3511, https://doi.org/10.1109/ICASSP40776.2020.9053762.
[27] J. Huang, J. Tao, B. Liu, Z. Lian, M. Niu, 多模變壓器融合連續情感識別, 在: IEEE 國際聲學、語音和信號處理大會(ICASSP), 2020, pp. 3507-3511, https://doi.org/10.1109/ICASSP40776.2020.9053762.
[28] H. Kaya, F. Gürpınar, A.A. Salah, Video-based emotion recognition in the wild using deep transfer learning and score fusion, Image and Vision Computing 65 (2017) 66-75, https://doi.org/10.1016/j.imavis.2017.01.012.
[28] H. Kaya, F. Gürpınar, A.A. Salah, 在野外使用深度轉移學習和分數融合的基於視頻的情感識別, 圖像和視覺計算 65 (2017) 66-75, https://doi.org/10.1016/j.imavis.2017.01.012.
[29] E. Avots, T. Sapiński, M. Bachmann, D. Kamińska, Audiovisual emotion recognition in wild, Machine Vision and Applications 30 (2019) 975-985, https://doi.org/10.1007/s00138-018-0960-9.
[29] E. Avots, T. Sapiński, M. Bachmann, D. Kamińska, 野外音視覺情緒識別, 機器視覺與應用 30 (2019) 975-985, https://doi.org/10.1007/s00138-018-0960-9.
[30] F. Noroozi, M. Marjanovic, A. Njegus, S. Escalera, G. Anbarjafari, Audio-visual emotion recognition in video clips, IEEE Transactions on Affective Computing 10 (1) (2017) 60-75, https://doi.org/10.1109/TAFFC.2017.2713783.
[30] F. Noroozi, M. Marjanovic, A. Njegus, S. Escalera, G. Anbarjafari, 影片中的音視覺情緒識別, IEEE 情感計算期刊 10 (1) (2017) 60-75, https://doi.org/10.1109/TAFFC.2017.2713783.
[31] M. Wu, W. Su, L. Chen, W. Pedrycz, K. Hirota, Two-stage fuzzy fusion basedconvolution neural network for dynamic emotion recognition, IEEE Transactions on Affective Computing 1 (2020) 1-13, https://doi.org/10.1109/ TAFFC.2020.2966440.
[31] M. Wu, W. Su, L. Chen, W. Pedrycz, K. Hirota, 基於兩階段模糊融合的卷積神經網絡用於動態情緒識別, IEEE 情感計算期刊 1 (2020) 1-13, https://doi.org/10.1109/ TAFFC.2020.2966440.
[32] H. Kaya, A.A. Karpov, Efficient and effective strategies for cross-corpus acoustic emotion recognition, Neurocomputing 275 (2018) 1028-1034, https://doi.org/ 10.1016/j.neucom.2017.09.049.
[32] H. Kaya, A.A. Karpov, 跨語料庫聲學情感識別的高效和有效策略, Neurocomputing 275 (2018) 1028-1034, https://doi.org/10.1016/j.neucom.2017.09.049.
[33] B. Zhang, E.M. Provost, G. Essl, Cross-corpus acoustic emotion recognition with multi-task learning: Seeking common ground while preserving differences, IEEE Transactions on Affective Computing 10 (1) (2017) 85-99, https://doi.org/ 10.1109/TAFFC.2017.2684799.
[33] B. Zhang, E.M. Provost, G. Essl, 多任務學習的跨語料庫聲學情感識別: 尋求共同點並保留差異, IEEE Transactions on Affective Computing 10 (1) (2017) 85-99, https://doi.org/10.1109/TAFFC.2017.2684799.
[34] H. Kaya, D. Fedotov, A. Yesilkanat, O. Verkholyak, Y. Zhang, A. Karpov, LSTM based cross-corpus and cross-task acoustic emotion recognition, Interspeech (2018) 521-525, https://doi.org/10.21437/Interspeech.2018-2298.
[34] H. Kaya, D. Fedotov, A. Yesilkanat, O. Verkholyak, Y. Zhang, A. Karpov, 基於 LSTM 的跨語料庫和跨任務聲學情感識別, Interspeech (2018) 521-525, https://doi.org/10.21437/Interspeech.2018-2298.
[35] J. Parry, D. Palaz, G. Clarke, P. Lecomte, R. Mead, M. Berger, G. Hofer, Analysis of deep learning architectures for cross-corpus speech emotion recognition, Interspeech (2019) 1656-1660, https://doi.org/10.21437/Interspeech.20192753.
[35] J. Parry, D. Palaz, G. Clarke, P. Lecomte, R. Mead, M. Berger, G. Hofer, 深度學習架構分析跨語料庫語音情感識別, Interspeech (2019) 1656-1660, https://doi.org/10.21437/Interspeech.20192753.
[36] H. Meng, T. Yan, F. Yuan, H. Wei, Speech emotion recognition from 3D log-mel spectrograms with deep learning network, IEEE Access 7 (2019) 125868125881, https://doi.org/10.1109/ACCESS.2019.2938007.
[36] H. Meng, T. Yan, F. Yuan, H. Wei, 利用深度學習網絡從 3D log-mel 頻譜圖識別語音情感, IEEE Access 7 (2019) 125868125881, https://doi.org/10.1109/ACCESS.2019.2938007.
[37] A. Mollahosseini, D. Chan, M.H. Mahoor, Going deeper in facial expression recognition using deep neural networks, in: IEEE Winter Conference on Applications of Computer Vision (WACV), 2016, pp. 1-10, https://doi.org/ 10.1109/WACV.2016.7477450.
[37] A. Mollahosseini, D. Chan, M.H. Mahoor, 利用深度神經網絡在面部表情識別中深入研究, in: IEEE Winter Conference on Applications of Computer Vision (WACV), 2016, pp. 1-10, https://doi.org/ 10.1109/WACV.2016.7477450.
[38] W. Xie, X. Jia, L. Shen, M. Yang, Sparse deep feature learning for facial expression recognition, Pattern Recognition 96 (2019), https://doi.org/ 10.1016/j.patcog.2019.106966.
[38] W. Xie, X. Jia, L. Shen, M. Yang, 用於面部表情識別的稀疏深度特徵學習, 圖像識別 96 (2019), https://doi.org/10.1016/j.patcog.2019.106966.
[39] M.V. Zavarez, R.F. Berriel, T. Oliveira-Santos, Cross-database facial expression recognition based on fine-tuned deep convolutional network, in: 30th IEEE Conference on Graphics, Patterns and Images (SIBGRAPI), 2017, pp. 405-412, https://doi.org/10.1109/SIBGRAPI.2017.60.
[39] M.V. Zavarez, R.F. Berriel, T. Oliveira-Santos, 基於微調的深度卷積網絡的跨數據庫面部表情識別, 在: 第 30 屆 IEEE 圖形、模式和圖像會議(SIBGRAPI), 2017, 頁 405-412, https://doi.org/10.1109/SIBGRAPI.2017.60.
[40] Q. Cao, L. Shen, W. Xie, O. Parkhi, A. Zisserman, Vggface2: A dataset for recognising faces across pose and age, in: 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2018, pp. 67-74, https://doi. org/10.1109/FG.2018.00020.
[40] Q. Cao, L. Shen, W. Xie, O. Parkhi, A. Zisserman, Vggface2: 一個用於識別不同姿勢和年齡的人臉的數據集, 在: 第 13 屆 IEEE 國際自動人臉和手勢識別大會(FG), 2018, 頁 67-74, https://doi.org/10.1109/FG.2018.00020.
[41] G. Wen, Z. Hou, H. Li, D. Li, L. Jiang, E. Xun, Ensemble of deep neural networks with probability-based fusion for facial expression recognition, Cognitive Computation 9 (2017) 597-610, https://doi.org/10.1007/s12559-017-9472-6.
[41] G. Wen, Z. Hou, H. Li, D. Li, L. Jiang, E. Xun, 概率融合深度神經網絡集成用於面部表情識別,《認知計算》9 (2017) 597-610, https://doi.org/10.1007/s12559-017-9472-6.
[42] Z. Meng, P. Liu, J. Cai, S. Han, Y. Tong, Identity-aware convolutional neural network for facial expression recognition, in: 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2017, pp. 558565, https://doi.org/10.1109/FG.2017.140.
[42] Z. Meng, P. Liu, J. Cai, S. Han, Y. Tong, 面部表情識別的身份感知卷積神經網絡,《第 12 屆 IEEE 國際自動人臉和手勢識別大會(FG)》, 2017, pp. 558565, https://doi.org/10.1109/FG.2017.140.
[43] B. Hasani, M.H. Mahoor, Facial expression recognition using enhanced deep 3D convolutional neural networks, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 30-40, https://doi.org/ 10.1109/CVPRW.2017.282.
[43] B. Hasani, M.H. Mahoor, 使用增強的深度 3D 卷積神經網絡進行面部表情識別,《IEEE 計算機視覺和模式識別會議研討會論文集(CVPRW)》, 2017, pp. 30-40, https://doi.org/ 10.1109/CVPRW.2017.282.
[44] E. Ryumina, A. Karpov, Facial expression recognition using distance importance scores between facial landmarks, CEUR Workshop Proceedings 2744 (2020) 1-10, https://doi.org/10.51130/graphicon-2020-2-3-32.
[44] E. Ryumina, A. Karpov, 利用面部標誌之間的距離重要性分數進行面部表情識別, CEUR Workshop Proceedings 2744 (2020) 1-10, https://doi.org/10.51130/graphicon-2020-2-3-32.
[45] O. Perepelkina, E. Kazimirova, M. Konstantinova, RAMAS: Russian multimodal corpus of dyadic interaction for affective computing, in: 20th International Conference on Speech and Computer, 2018, pp. 501-510, https://doi.org/ 10.1007/978-3-319-99579-3_52.
[45] O. Perepelkina, E. Kazimirova, M. Konstantinova, RAMAS: 俄羅斯雙人互動情感計算的多模態語料庫, 在: 第 20 屆國際語音和計算機大會, 2018, 頁 501-510, https://doi.org/10.1007/978-3-319-99579-3_52.
[46] C. Busso, M. Bulut, C.C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, S.S. Narayanan, IEMOCAP: Interactive emotional dyadic motion capture
[46] C. Busso, M. Bulut, C.C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J.N. Chang, S. Lee, S.S. Narayanan, IEMOCAP: 互動情感雙人運動捕捉

database, Language Resources and Evaluation 42 (2008) 335-359, https://doi. org/10.1007/s10579-008-9076-6.
資料庫,語言資源和評估 42(2008)335-359,https://doi.org/10.1007/s10579-008-9076-6。
[47] H. Cao, D.G. Cooper, M.K. Keutmann, R.C. Gur, A. Nenkova, R. Verma, CREMA-D: Crowd-sourced emotional multimodal actors dataset, IEEE Transactions on Affective Computing 5 (4) (2014) 377-390, https://doi.org/10.1109/ TAFFC.2014.2336244.
[47] H. Cao,D.G. Cooper,M.K. Keutmann,R.C. Gur,A. Nenkova,R. Verma,CREMA-D:眾包情感多模態演員數據集,IEEE 情感計算期刊 5(4)(2014)377-390,https://doi.org/10.1109/TAFFC.2014.2336244。
[48] S.R. Livingstone, F.A. Russo, The ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in north american english, PLoS One 13 (5) (2018), https://doi.org/ 10.1371/journal.pone. 0196391 .
[48] S.R. Livingstone,F.A. Russo,The ryerson audio-visual database of emotional speech and song(RAVDESS):北美英語中面部和聲音表情的動態多模態集,PLoS One 13(5)(2018),https://doi.org/10.1371/journal.pone.0196391。
[49] S. Haq, P. Jackson, J.R. Edge, Audio-visual feature selection and reduction for emotion classification, in: International Conference on Auditory-Visual Speech Processing, 2008, pp. 185-190.
[49] S. Haq, P. Jackson, J.R. Edge, 視聽特徵選擇和縮減用於情緒分類, 國際聽覺視覺語音處理會議, 2008, 頁 185-190.
[50] D. Kollias, S. Zafeiriou, Expression, affect, action unit recognition: Aff-Wild2, multi-task learning and ArcFace, ArXiv abs/1910.04855 (2019) 1-15.
[50] D. Kollias, S. Zafeiriou, 表情、情感、動作單元識別: Aff-Wild2, 多任務學習和 ArcFace, ArXiv abs/1910.04855 (2019) 1-15.
[51] A. Mollahosseini, B. Hasani, M.H. Mahoor, Affectnet: A database for facial expression, valence, and arousal computing in the wild, IEEE Transactions on Affective Computing 10 (1) (2017) 18-31, https://doi.org/10.1109/ TAFFC.2017.2740923
[51] A. Mollahosseini, B. Hasani, M.H. Mahoor, Affectnet: 野外臉部表情、價值和覺醒計算數據庫, IEEE 情感計算期刊 10 (1) (2017) 18-31, https://doi.org/10.1109/ TAFFC.2017.2740923
[52] D. Kollias, A. Schulc, E. Hajiyev, S. Zafeiriou, Analysing affective behavior in the first ABAW 2020 competition, in: 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2020, pp. 794-800, https://doi. org/10.1109/FG47880.2020.00126.
[52] D. Kollias,A. Schulc,E. Hajiyev,S. Zafeiriou,分析第一屆 ABAW 2020 競賽中的情感行為,在第 12 屆 IEEE 國際自動面部和手勢識別大會(FG)上,2020 年,頁面 794-800,https://doi.org/10.1109/FG47880.2020.00126。
[53] D. Kollias, S. Zafeiriou, A multi-task learning & generation framework: Valence-arousal, action units & primary expressions, ArXiv abs/1811.07771 (2018) 1-9.
[53] D. Kollias,S. Zafeiriou,多任務學習和生成框架:情緒價值,行動單元和主要表達,ArXiv abs/1811.07771(2018)1-9。
[54] D. Kollias, S. Zafeiriou, Aff-Wild2: Extending the Aff-Wild database for affect recognition, ArXiv abs/1811.07770 (2018) 1-8.
[54] D. Kollias,S. Zafeiriou,Aff-Wild2:擴展 Aff-Wild 數據庫以進行情感識別,ArXiv abs/1811.07770(2018)1-8。
[55] D. Kollias, P. Tzirakis, M.A. Nicolaou, A. Papaioannou, G. Zhao, B. Schuller, I. Kotsia, S. Zafeiriou, Deep affect prediction in-the-wild: Aff-Wild database and challenge, deep architectures, and beyond, International Journal of Computer Vision 127 (2019) 907-929, https://doi.org/10.1007/s11263-019-01158-4.
[55] D. Kollias, P. Tzirakis, M.A. Nicolaou, A. Papaioannou, G. Zhao, B. Schuller, I. Kotsia, S. Zafeiriou, 在野外的深度情感預測: Aff-Wild 資料庫和挑戰, 深度架構, 以及更多, 國際電腦視覺期刊 127 (2019) 907-929, https://doi.org/10.1007/s11263-019-01158-4.
[56] S. Zafeiriou, D. Kollias, M.A. Nicolaou, A. Papaioannou, G. Zhao, B. Schuller, I. Kotsia, Aff-wild: Valence and arousal 'in-the-wild' challenge, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1980-1987, https://doi.org/10.1109/CVPRW.2017.248,
[56] S. Zafeiriou, D. Kollias, M.A. Nicolaou, A. Papaioannou, G. Zhao, B. Schuller, I. Kotsia, Aff-wild: 在野外的情緒價值和覺醒挑戰, 在: IEEE 電腦視覺和模式識別會議研討會 (CVPRW), 2017, 頁 1980-1987, https://doi.org/10.1109/CVPRW.2017.248,
[57] D. Kollias, M.A. Nicolaou, I. Kotsia, G. Zhao, S. Zafeiriou, Recognition of affect in the wild using deep neural networks, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 1972-1979, https:// doi.org/10.1109/CVPRW.2017.247.
[57] D. Kollias, M.A. Nicolaou, I. Kotsia, G. Zhao, S. Zafeiriou, 使用深度神經網絡在野外識別情感, 在: IEEE 電腦視覺和模式識別會議研討會 (CVPRW), 2017, 頁 1972-1979, https:// doi.org/10.1109/CVPRW.2017.247.
[58] D. Dresvyanskiy, E. Ryumina, H. Kaya, M. Markitantov, A. Karpov, W. Minker, End-to-end modeling and transfer learning for audiovisual emotion recognition in-the-wild, Multimodal Technologies and Interaction 6 (2) (2022) 1-23, https://doi.org/10.3390/mti6020011.
[58] D. Dresvyanskiy, E. Ryumina, H. Kaya, M. Markitantov, A. Karpov, W. Minker, 面向野外音視覺情感識別的端到端建模和遷移學習, 多模態技術與互動 6 (2) (2022) 1-23, https://doi.org/10.3390/mti6020011.
[59] G. Winata, O. Kampman, in: F. P, Attention-based LSTM, for psychological stress detection from spoken language using distant supervision, in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 6204-6208, https://doi.org/10.1109/ICASSP.2018.8461990.
[59] G. Winata, O. Kampman, 在: F. P, 基於注意力的 LSTM, 用於使用遠程監督從口語語言中檢測心理壓力, 在 IEEE 國際聲學、語音和信號處理大會(ICASSP), 2018, pp. 6204-6208, https://doi.org/10.1109/ICASSP.2018.8461990.
[60] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, H.E., Hierarchical attention networks for document classification, in: Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1480-1489. doi:10.18653/v1/N16-1174.
[60] Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, H.E., 用於文檔分類的分層注意力網絡, 在: 北美計算語言學協會年會: 人類語言技術, 2016, pp. 1480-1489. doi:10.18653/v1/N16-1174.
[61] I. Loshchilov, F. Hutter, SGDR: Stochastic gradient descent with warm restarts, ArXiv abs/1608.03983 (2016) 1-16.
[61] I. Loshchilov, F. Hutter, SGDR: 具有溫暖重啟的隨機梯度下降, ArXiv abs/1608.03983 (2016) 1-16.
[62] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778, https://doi.org/10.1109/CVPR.2016.90.
[62] K. He, X. Zhang, S. Ren, J. Sun, 圖像識別的深度殘差學習, 在: IEEE 計算機視覺和模式識別會議(CVPR), 2016, pp. 770-778, https://doi.org/10.1109/CVPR.2016.90.
[63] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7132-7141.
[63] J. Hu, L. Shen, G. Sun, 擠壓和激勵網絡, 在: IEEE 計算機視覺和模式識別會議(CVPR), 2018, pp. 7132-7141.
[64] M. Tan, Q.V. Le, EfficientNet: Rethinking model scaling for convolutional neural networks, in: International Conference on Machine Learning (ICML), 2019, pp. 6105-6114.
[64] M. Tan, Q.V. Le, EfficientNet: 重新思考卷積神經網絡的模型尺度,於:2019 年機器學習國際會議(ICML),頁 6105-6114。
[65] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L. Chen, Mobilenetv 2: Inverted residuals and linear bottlenecks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4510-4520, https://doi.org/10.1109/ CVPR.2018.00474.
[65] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L. Chen, Mobilenetv 2: 反向殘差和線性瓶頸,於:2018 年 IEEE 計算機視覺和模式識別會議(CVPR),頁 4510-4520,https://doi.org/10.1109/ CVPR.2018.00474。
[66] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248-255, https://doi.org/10.1109/ CVPR.2009.5206848.
[66] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, L. Fei-Fei, Imagenet: 一個大規模的分層圖像數據庫,於:2009 年 IEEE 計算機視覺和模式識別會議(CVPR),頁 248-255,https://doi.org/10.1109/ CVPR.2009.5206848。
[67] H. Zhang, M. Cissé, Y.N. Dauphin, D. Lopez-Paz, Mixup: Beyond empirical risk minimization, in: 3rd International Conference on Learning Representations (ICLR), 2018,
[67] H. Zhang, M. Cissé, Y.N. Dauphin, D. Lopez-Paz, Mixup: 超越經驗風險最小化, 在: 第 3 屆國際學習表示會議 (ICLR), 2018 年,
[68] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu, A.C. Berg, SSD: Single shot multibox detector, in: European Conference on Computer Vision, Amsterdam, 2016, pp. 21-37. doi:10.1007/978-3-319-46448-0_2.
[68] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.Y. Fu, A.C. Berg, SSD: 單次拍攝多框檢測器, 在: 歐洲計算機視覺大會, 阿姆斯特丹, 2016 年, 頁 21-37. doi:10.1007/978-3-319-46448-0_2.
[69] J. Deng, J. Guo, E. Ververas, I. Kotsia, S. Zafeiriou, RetinaFace: Single-shot multilevel face localisation in the wild, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5203-5212, https://doi.org/10.1109/ CVPR42600.2020.00525.
[69] J. Deng, J. Guo, E. Ververas, I. Kotsia, S. Zafeiriou, RetinaFace: 野外單次多級面部定位, 在: IEEE 計算機視覺和模式識別大會 (CVPR), 2020 年, 頁 5203-5212, https://doi.org/10.1109/ CVPR42600.2020.00525.
[70] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, Joint face detection and alignment using multitask cascaded convolutional networks, IEEE Signal Processing Letters 23 (10) (2016) 1499-1503, https://doi.org/10.1109/LSP.2016.2603342.
[70] K. Zhang, Z. Zhang, Z. Li, Y. Qiao, 使用多任務級聯卷積網絡進行聯合人臉檢測和對齊, IEEE 信號處理通信 23 (10) (2016) 1499-1503, https://doi.org/10.1109/LSP.2016.2603342.
[71] E. Ryumina, D. Ryumin, D. Ivanko, A. Karpov, A novel method for protective face mask detection using convolutional neural networks and image histograms, in: International Archives of the Photogrammetry Remote
[71] E. Ryumina, D. Ryumin, D. Ivanko, A. Karpov, 使用卷積神經網絡和圖像直方圖進行防護口罩檢測的新方法, 國際攝影測量遙感和空間信息科學存檔 XLIV-2/W1-2021, 2021, pp. 177182, https://doi.org/10.5194/isprs-archives-XLIV-2-W1-2021-177-2021.

Sensing and Spatial Information Sciences XLIV-2/W1-2021, 2021, pp. 177182, https://doi.org/10.5194/isprs-archives-XLIV-2-W1-2021-177-2021.
[72] E. Ryumina, O. Verkholyak, A. Karpov, Annotation confidence vs. training sample size: Trade-off solution for partially-continuous categorical emotion recognition, Interspeech (2021) 3690-3694, https://doi.org/10.21437/ Interspeech.2021-1636.
[72] E. Ryumina, O. Verkholyak, A. Karpov, 注釋信心與訓練樣本大小:部分連續分類情感識別的權衡解決方案, Interspeech (2021) 3690-3694, https://doi.org/10.21437/ Interspeech.2021-1636.
[73] S. Poria, N. Majumder, D. Hazarika, E. Cambria, A. Gelbukh, A. Hussain, Multimodal sentiment analysis: Addressing key issues and setting up the baselines, IEEE Intelligent Systems 33 (6) (2018) 17-25, https://doi.org/ 10.1109/MIS.2018.2882362.
[73] S. Poria, N. Majumder, D. Hazarika, E. Cambria, A. Gelbukh, A. Hussain, 多模態情感分析:解決關鍵問題並建立基線, IEEE Intelligent Systems 33 (6) (2018) 17-25, https://doi.org/ 10.1109/MIS.2018.2882362.
[74] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, L.V. Gool, Temporal segment networks: Towards good practices for deep action recognition, in: European conference on computer vision, Springer, 2016, pp. 20-36.
[74] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, L.V. Gool, 時間段網絡:深度動作識別的良好實踐, in: European conference on computer vision, Springer, 2016, pp. 20-36.
[75] E. Ryumina, A. Karpov, Comparative analysis of methods for imbalance elimination of emotion classes in video data of facial expressions, Scientific and Technical Journal of Information Technologies, Mechanics and Optics 20 (5 (129)) (2020) 683-691, https://doi.org/10.17586/2226-1494-2020-20-5-683691.
[75] E. Ryumina, A. Karpov, 情感類別在面部表情視頻數據中消除不平衡的方法的比較分析, 信息技術、力學和光學科學技術期刊 20 (5 (129)) (2020) 683-691, https://doi.org/10.17586/2226-1494-2020-20-5-683691.
[76] K. Wang, X. Peng, J. Yang, S. Lu, Y. Qiao, Suppressing uncertainties for largescale facial expression recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 6897-6906.
[76] K. Wang, X. Peng, J. Yang, S. Lu, Y. Qiao, 抑制大規模面部表情識別中的不確定性, 在: 2020 年 IEEE/CVF 計算機視覺和模式識別會議論文集, 頁 6897-6906.
[77] C. Kervadec, V. Vielzeuf, S. Pateux, A. Lechervy, F. Jurie, CAKE: a compact and accurate k-dimensional representation of emotion, British Machine Vision Association (2018) 1-12.
[77] C. Kervadec, V. Vielzeuf, S. Pateux, A. Lechervy, F. Jurie, CAKE: 一種情感的緊湊準確的 k 維表示, 英國機器視覺協會 (2018) 1-12.
[78] J. She, Y. Hu, H. Shi, J. Wang, Q. Shen, T. Mei, Dive into ambiguity: latent distribution mining and pairwise uncertainty estimation for facial expression recognition, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6248-6257.
[78] J. She, Y. Hu, H. Shi, J. Wang, Q. Shen, T. Mei, 深入探討模糊性:潛在分佈挖掘和面部表情識別的成對不確定性估計, 在:2021 年 IEEE/CVF 計算機視覺和模式識別會議論文集, 頁 6248-6257.
[79] M.-I. Georgescu, R.T. Ionescu, M. Popescu, Local learning with deep and handcrafted features for facial expression recognition, IEEE Access (2019) 64827-64836, https://doi.org/10.1109/ACCESS.2019.2917266.
[79] M.-I. Georgescu, R.T. Ionescu, M. Popescu, 深度和手工特徵的本地學習用於面部表情識別, IEEE Access (2019) 64827-64836, https://doi.org/10.1109/ACCESS.2019.2917266.
[80] D. Kollias, V. Sharmanska, S. Zafeiriou, Distribution matching for heterogeneous multi-task learning: A large-scale face study, ArXiv abs/2105.03790 (2021) 1-15.
[81] A.V. Savchenko, Facial expression and attributes recognition based on multi-task learning of lightweight neural networks, in: 2021 IEEE 19th International Symposium on Intelligent Systems and Informatics (SISY), 2021, pp. 119-124, https://doi.org/10.1109/SISY52375.2021.9582508.
[82] E. Ghaleb, M. Popa, S. Asteriadis, Multimodal and temporal perception of audio-visual cues for emotion recognition, in: 8th IEEE International Conference on Affective Computing and Intelligent Interaction (ACII), 2019, pp. 552-558, https://doi.org/10.1109/ACII.2019.8925444.
[83] L.N. Do, H.J. Yang, H.D. Nguyen, S.H. Kim, G.S. Lee, I.S. Na, Deep neural network-based fusion model for emotion recognition using visual data, The Journal of Supercomputing 77 (2021) 10773-10790, https://doi.org/10.1007/s11227-021-03690-y.
[84] D. Gera, S. Balasubramanian, Affect expression behaviour analysis in the wild using spatio-channel attention and complementary context information, ArXiv abs/2009.14440 (2020) 1-8.
[85] R.R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, Grad-CAM: Visual explanations from deep networks via gradient-based localization, in: IEEE International Conference on Computer Vision, 2017, pp. 618-626, https://doi.org/10.1109/ICCV.2017.74.
[86] M. Gogate, A. Adeel, A. Hussain, A novel brain-inspired compression-based optimised multimodal fusion for emotion recognition, in: IEEE Symposium Series on Computational Intelligence (SSCI), 2017, pp. 1-7, https://doi.org/10.1109/SSCI.2017.8285377.
[87] S. Yoon, S. Dey, H. Lee, K. Jung, Attentive modality hopping mechanism for speech emotion recognition, in: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 3362-3366, https://doi.org/10.1109/ICASSP40776.2020.9054229.
Elena Ryumina received the M.E. degree from ITMO University in 2021. She is currently a joint Ph.D. student at ITMO University and works as a junior researcher in the Speech and Multimodal Interfaces Laboratory at the St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS). Her main research interests include affective computing, audio-visual emotion recognition, computational paralinguistics, machine learning, neural networks, and human-machine interfaces.
Denis Dresvyanskiy received his B.E. and M.S. degrees in system analysis from the Reshetnev Siberian State University of Science and Technology in 2017 and 2019, respectively. He is currently a joint Ph.D. student at Ulm University and ITMO University. His research interests include human-computer interaction, signal processing, computer vision, deep and transfer learning, and paralinguistic analysis. In particular, his work focuses on the evaluation and analysis of conversational characteristics such as engagement, dominance, and the affective states of interlocutors.
Alexey Karpov is Doctor of Technical Sciences (2013) and Professor. He currently works as the chief researcher and head of the Speech and Multimodal Interfaces Laboratory at the St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, St. Petersburg Federal Research Center of the Russian Academy of Sciences (SPC RAS). He also works as a Full Professor (part-time) at ITMO University. His research interests are speech technology, automatic speech recognition, audio-visual speech processing, multimodal human-computer interfaces, and computational paralinguistics. He is a general chair of the International Conference on Speech and Computer (SPECOM).

• Corresponding author.
E-mail addresses: ryumina_ev@mail.ru (E. Ryumina), denis.dresvyanskiy@uni-ulm.de (D. Dresvyanskiy), karpov@iias.spb.su (A. Karpov).
    These authors contributed equally to this work.