
Multimodal Video Sentiment Analysis Using Deep Learning Approaches, a Survey

Sarah A. Abdu, Ahmed H. Yousef, Ashraf Salem
Ain Shams University, Faculty of Engineering, Computers & Systems Department, Egypt
Nile University, Center of Informatics Science (CIS), Egypt


Keywords:

Sentiment analysis
Sentiment classification
Multimodal sentiment analysis
Multimodal fusion
Audio, visual and text information fusion

Abstract

Deep learning has emerged as a powerful machine learning technique for multimodal sentiment analysis tasks. In recent years, many deep learning models and algorithms have been proposed in the field of multimodal sentiment analysis, which creates the need for survey papers that summarize the recent research trends and directions. This survey paper provides a comprehensive overview of the latest developments in this field. We present a fine-grained categorization of thirty-five state-of-the-art models recently proposed for video sentiment analysis into eight categories based on the architecture used in each model. The effectiveness and efficiency of these models have been evaluated on the two most widely used datasets in the field, CMU-MOSI and CMU-MOSEI. After carrying out an intensive analysis of the results, we conclude that the most powerful architecture for multimodal sentiment analysis is the Multi-Modal Multi-Utterance based architecture, which exploits both the information from all modalities and the contextual information from the neighbouring utterances in a video in order to classify the target utterance. This architecture mainly consists of two modules whose order may vary from one model to another. The first is a Context Extraction Module that models the contextual relationship among the neighbouring utterances in the video and highlights which of the relevant contextual utterances are more important for predicting the sentiment of the target one. In most recent models, this module is usually a bidirectional recurrent neural network based module. The second is an Attention-Based Module that fuses the three modalities (text, audio and video) and prioritizes only the important ones.
Furthermore, this paper provides a brief summary of the most popular approaches used to extract features from multimodal videos, together with a comparative analysis of the most popular benchmark datasets in the field. We expect these findings to give newcomers a panoramic view of the entire field and quick access to helpful insights, guiding them easily towards the development of more effective models.

1. Introduction

Sentiments play a very important role in our daily lives. They help us to communicate, learn and make decisions, which is why, over the past two decades, AI researchers have been trying to make machines capable of analyzing human sentiments. Early efforts in sentiment analysis focused on textual sentiment analysis, where only words are used to analyze the sentiment. However, textual sentiment analysis is insufficient to extract the sentiment expressed by humans; the meaning of words and sentences spoken by speakers often changes dynamically according to non-verbal behaviours [1]. For example, the word 'Amazing' can express a negative sentiment if it is accompanied by a sarcastic laugh or a sarcastic tone of voice.
Intensive research over many years has shown that multimodal systems are more efficient at recognizing the sentiment of a speaker than unimodal systems. The way humans naturally communicate and express their emotions and sentiments is usually multimodal: the textual, audio and visual modalities are concurrently fused to effectively extract the information conveyed during communication. A survey of multimodal sentiment analysis published in 2015 [2] reported that "multimodal systems were consistently ( of systems) more accurate than their best unimodal counterparts, with an average improvement of ". Several surveys published later also suggest that textual information is not sufficient for predicting human sentiments, especially in cases of sarcasm or ambiguity [3-6]. For example, it is impossible to recognize the sentiment of the sarcastic sentence "Great" as negative from the textual information alone. However, if the system can access the visual modality, it can easily detect the unpleasant gestures of the speaker and classify the sentence with the negative sentiment polarity. Similarly, acoustic features play an important role in the correctness of the system. In 2018, Poria et al. [7] introduced an intuitive explanation of the improved performance in the multimodal scenario through visualizations of the MOSI dataset [8], where both unimodal features and multimodal features are used to give information about the dataset distribution (see Fig. 1). With the textual modality only, comprehensive clustering can be seen, with substantial overlap; this overlap is reduced in the multimodal case [7].
With the recent growth of social media platforms and advances in technology, people started recording videos and uploading them to platforms like YouTube or Facebook to inform subscribers of their views. These videos may be product reviews, movie reviews, political debates, consultations, or views on any random topic. A video provides a good source for extracting multimodal information: in addition to the visual frames, it also provides the acoustic and textual representation of the spoken language. This is what urged most AI researchers to direct their research towards multimodal sentiment analysis, to leverage the varieties of (often distinct) information from multiple sources for building a more efficient system. There are many alternative ways to fuse information from different modalities; however, selecting the best way is challenging [9].
Multimodal sentiment analysis focuses on modelling intra-modal dynamics (view-specific dynamics) and inter-modal dynamics (cross-view dynamics) [10]. Intra-modal dynamics are the interactions within a specific modality, independent of other modalities, for example the interactions between words in a given sentence. Intra-modal dynamics are particularly challenging for language analysis, since multimodal sentiment analysis is performed on spoken language. A spoken opinion such as "I think it was alright ... Hmmm ... let me think ... yeah ... no ... ok yeah" almost never occurs in written text. This volatile nature of spoken opinions, where proper language structure is often ignored, complicates sentiment analysis. Inter-modal dynamics, on the other hand, refer to the interactions between the different modalities and are divided into synchronous and asynchronous categories. An example of synchronous cross-view dynamics is a smile and a positive word occurring simultaneously; an example of asynchronous cross-view dynamics is the delayed occurrence of laughter after the end of a sentence. The main challenges in multimodal sentiment analysis are intra-modal representation and selecting the best approach to fuse features from different modalities.

Fig. 1. T-SNE 2D visualization of MOSI dataset when text features and multimodal features are used [7].
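The comparison in Fig. 1 can be mimicked in a few lines: project text-only features and concatenated trimodal features to 2-D and inspect the cluster overlap. The sketch below uses random stand-in features and a plain PCA projection (a deliberately lightweight stand-in for t-SNE, which requires an external library); all names and dimensions are illustrative, not the real MOSI feature sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in features for N utterances (dimensions are illustrative).
N = 200
text  = rng.normal(size=(N, 300))   # e.g. sentence-embedding-style vectors
audio = rng.normal(size=(N, 74))    # e.g. prosodic/acoustic descriptors
video = rng.normal(size=(N, 47))    # e.g. facial-expression descriptors

def pca_2d(x):
    """Project features to 2-D with PCA (a lightweight stand-in for t-SNE)."""
    x = x - x.mean(axis=0)
    # SVD of the centred data: rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T

unimodal_2d   = pca_2d(text)                                    # text only
multimodal_2d = pca_2d(np.concatenate([text, audio, video], axis=1))

print(unimodal_2d.shape, multimodal_2d.shape)  # (200, 2) (200, 2)
```

With real MOSI features one would scatter-plot both projections coloured by sentiment label; the multimodal projection is the one expected to show less class overlap.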

1.1. The scope of this survey

As multimodal sentiment analysis research continues to gain popularity, the number of articles published in this field every year continues to increase, which creates the need for survey papers that summarize the recent research trends and directions in the field.
A long, detailed survey was presented by S. Poria et al. [3], in which the authors gave an overview of state-of-the-art methodologies and trends in the field of multimodal sentiment analysis. However, this survey was published in 2017, and many new articles and models have been introduced since, which creates the need for a new survey paper. For example, the two most popular datasets used in the field (CMU-MOSI and CMU-MOSEI) are not summarized in [3], mainly because both were introduced after the paper was published. Furthermore, the most powerful models introduced in the field are not referenced in that survey.
In 2021, D. Gkoumas et al. [13] replicated the implementations of eleven state-of-the-art models in the field. They evaluated the performance of the referenced models on multimodal sentiment analysis tasks using two benchmark datasets, CMU-MOSI [8] and CMU-MOSEI [14]. An experimental categorization of the models was also provided with respect to the fusion approach used in each model, and the authors eventually concluded that attention mechanism approaches are the most effective for the task. However, this paper has some shortcomings. First, D. Gkoumas et al. [13] completely ignored a certain category of models in which the contextual information of neighbouring utterances is used to predict the sentiment of the target utterance, although these models have shown sophisticated performance in multimodal sentiment analysis tasks. Second, the authors categorized only eleven models, and the categorization was based only on the fusion approach, completely ignoring the architecture of each model. Third, they did not give any overview of the datasets used in the field beyond a brief summary of CMU-MOSI [8] and CMU-MOSEI [14].
The contribution of our survey is significant for many reasons. First, we noticed that each group of models shares a common architecture; this is what urged us to categorize thirty-five models in the field into eight categories based on the architecture used in each model.
Second, we compare the effectiveness and efficiency of the thirty-five models on two widely used datasets for multimodal sentiment analysis (CMU-MOSI and CMU-MOSEI). After carrying out an intensive analysis of the results, we arrive at several conclusions. (1) The most powerful architecture for the multimodal sentiment analysis task is the Multi-Modal Multi-Utterance based architecture, which exploits both the information from all modalities and the contextual information from the neighbouring utterances in a video in order to classify the target utterance. This architecture mainly consists of two modules whose order may vary from one model to another. The first module is the Context Extraction Module, which models the contextual relationship among the neighbouring utterances in the video and highlights which of the relevant contextual utterances are more important for predicting the sentiment of the target one. In most recent models, this module is usually a bidirectional recurrent neural network based module. The second module is the Attention-Based Module, which is responsible for fusing the three modalities (text, audio and video) and prioritizing only the important ones. (2) Using scaled dot-product attention and the concept of multi-head attention is most effective for the multimodal sentiment analysis task. (3) Moreover, the results obtained show that bimodal attention frameworks achieve better performance than self-attention frameworks. We believe that these findings can help researchers to develop more effective models more easily and to choose the appropriate technique for a given application.
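A minimal sketch of the attention-based fusion module described above, using the scaled dot-product formulation that conclusion (2) singles out. The three modality vectors and the shared dimensionality are illustrative stand-ins, not taken from any particular model; upstream encoders are assumed to have already projected each modality to a common size.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # shared representation size (illustrative)

# One utterance represented in three modalities, already projected
# to a common dimensionality by upstream encoders.
text, audio, video = (rng.normal(size=(d,)) for _ in range(3))
modalities = np.stack([text, audio, video])          # (3, d)

def scaled_dot_product_attention(q, k, v):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (3, 3) pairwise scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v, weights

# Self-attention over the modality vectors: each fused output is a weighted
# mix of all modalities, so the important modalities get prioritized.
fused, weights = scaled_dot_product_attention(modalities, modalities, modalities)
print(fused.shape)  # (3, 64)
```

Multi-head attention repeats this computation with several independent learned projections of Q, K and V and concatenates the results.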
Finally, in order to help newcomers gain a panoramic view of the entire field, we provide brief details of the algorithms used in each model. In addition, we provide a comparative analysis between the most

popular benchmark datasets in the field and a brief summary of the most popular approaches that have been used to extract features from multimodal videos.
In Table 1, we present a brief comparison between the surveys presented by S. Poria et al. [3] and D. Gkoumas et al. [13] and our survey. The first column gives the reference to the survey, and the second the year in which the survey was published. The third column shows the number of models categorized in each survey, and the number of datasets summarized is given in the fourth column. The basis the authors used for categorizing the referenced models in their survey can be seen in the fifth column. The last column specifies, with a Yes/No answer, whether the article reviewed the feature extraction methods.

1.2. Multimodal sentiment analysis process on multimodal data

To classify the sentiment of any video, the visual, acoustic and textual features should first be extracted using appropriate visual, acoustic and textual feature extractors, respectively. The extracted features of the three modalities are then passed into a classification model to predict the correct sentiment. In the next sections, we discuss the different types of visual, acoustic and textual feature extractors and also provide a comprehensive overview of the latest models used in the field. The multimodal sentiment analysis process on multimodal data can be seen in Fig. 2.
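The pipeline just described (per-modality extraction, then fusion, then classification) can be sketched end to end. Everything below is a toy stand-in: the "extractors", the random embedding table and the random linear classifier head exist only to make the data flow concrete.

```python
import numpy as np

rng = np.random.default_rng(2)

embeddings = rng.normal(size=(1000, 8))  # toy word-embedding table

# --- 1. Modality-specific feature extraction (stand-ins for real extractors
#        such as a CNN for frames, acoustic descriptors, word embeddings).
def extract_visual(frames):  return frames.mean(axis=0)        # pool over frames
def extract_acoustic(wave):  return np.array([wave.mean(), wave.std()])
def extract_textual(tokens): return embeddings[tokens].mean(axis=0)

# --- 2. Toy inputs for one utterance.
frames = rng.normal(size=(30, 16))       # 30 frames x 16 visual descriptors
wave   = rng.normal(size=(4000,))        # raw audio samples
tokens = np.array([5, 42, 7])            # word ids of the transcript

# --- 3. Early fusion by concatenation, then a linear classifier head.
features = np.concatenate([extract_visual(frames),
                           extract_acoustic(wave),
                           extract_textual(tokens)])           # (16 + 2 + 8,)
W = rng.normal(size=(3, features.shape[0]))                    # 3 sentiment classes
logits = W @ features
pred = ["negative", "neutral", "positive"][int(np.argmax(logits))]
print(pred)
```

Real systems differ mainly in step 3: the models surveyed later replace plain concatenation with learned fusion modules such as attention.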
This paper is organized as follows: Section 2 provides a summary of the most popular datasets in multimodal sentiment analysis. Section 3 covers the most popular feature extraction techniques and their related articles. Section 4 categorizes the state-of-the-art models in multimodal sentiment analysis and the corresponding articles into eight architectures. Section 5 describes the evaluation metrics used to evaluate the efficiency and effectiveness of the models. Section 6 presents the results and discussion, and finally the conclusion and future research trends are presented in Section 7.
The most popular datasets used in multimodal sentiment analysis are summarized in Table 2. The name of the dataset is shown in the first column, and the year the dataset was published in the second. The number of videos in each dataset is shown in the third column, and the number of utterances in the fourth. For some datasets the authors did not report the number of utterances, which is why the fourth column is sometimes left empty. The number of distinct speakers in the whole dataset can be seen in the fifth column. The sixth column gives the language in which the videos are recorded, and the seventh the source from which the videos were collected. The eighth column indicates where the dataset can be downloaded. The ninth column lists the sentiments that have been used for labelling the data. Finally, the last column gives the topics of the videos included in the dataset; the videos can be product reviews, movie reviews, debates, etc.
Table 1
Comparison between our survey and surveys introduced by S. Poria et al. [3] and D. Gkoumas et al. [13].
Paper | Year | # Models | # Datasets | Basis of categorization | Feature extraction reviewed
S. Poria et al. [3] | 2017 | - | 3 | Fusion approach | -
D. Gkoumas et al. [13] | 2020 | 11 | 2 | Fusion approach | -
Ours | 2021 | 35 | 7 | Architecture of each model | Yes
Fig. 2. Multimodal sentiment analysis process on a video.

2.1. YouTube Dataset

The YouTube Dataset was developed in 2011 by Morency et al. [15]. The dataset was collected from the YouTube website in such a way that it is not based on one particular topic; the videos were collected using keywords such as: opinion, review, product review, best perfume, toothpaste, war, job, business, cosmetics review, camera review, baby product review, I hate, I like, etc. Morency et al. [15] took care that the collected videos were diverse and contained noise, so as to encompass the different facets of sentiment analysis. The dataset consists of 47 generalized videos, each containing 3-11 utterances. The ages of the speakers ranged from 14 to 60 years; 40 videos feature female speakers, while the rest feature male speakers. Although the speakers are from different cultures, they all expressed their opinions in English. Each video in the dataset was labelled with one of three sentiments: positive, negative or neutral, giving a final set of 13 positively, 12 negatively and 22 neutrally labelled videos.

2.2. MOUD Dataset

The Multimodal Opinion Utterances Dataset (MOUD) was developed in 2013 by Perez-Rosas et al. [16], who collected 80 videos from the YouTube website using several keywords likely to lead to a product review or recommendation. The ages of the speakers ranged from 20 to 60 years; 15 videos feature female speakers, while the rest feature male speakers. All videos were recorded in Spanish. Eventually, a multimodal dataset of 498 utterances was created, with an average utterance duration of 5 seconds. Each utterance in the dataset was labelled with one of three sentiments: positive, negative or neutral, giving a final set of 182 positively, 231 negatively and 85 neutrally labelled utterances.

2.3. ICT-MMMO Dataset

The Institute for Creative Technologies Multi-Modal Movie Opinion
Table 2
Comparative analysis across the most popular multimodal sentiment analysis datasets.
Dataset | Year | # Videos | # Utterances | # Speakers | Language | Source | Available at | Sentiments | Topics
YouTube | 2011 | 47 | 280 | 47 | English | YouTube | By sending mail | positive, negative, neutral | No particular topic
MOUD | 2013 | 80 | 498 | 80 | Spanish | YouTube | Publicly available at umich.edu/mihalcea/downloads.html | positive, negative, neutral | Product reviews
ICT-MMMO | 2013 | 308 | - | 370 | English | YouTube, ExpoTV | By sending mail | 5 classes (strongly negative to strongly positive) | Movie reviews
POM | 2014 | 1000 | - | 352 | English | ExpoTV | - | Persuasiveness (1-7) and speaker traits | Movie reviews
CMU-MOSI | 2016 | 93 | 2199 | 89 | English | YouTube | Publicly available at multicomp.cs.cmu.edu/raw_datasets/processed_data/ | 7-point (-3 to +3) | No particular topic
CMU-MOSEI | 2018 | 3228 | 22,777 | 1000 | English | YouTube | - | 7-point (-3 to +3) | Random topics; most frequent three: reviews, debate, consulting
CH-SIMS | 2020 | 60 | 2281 | 474 | Chinese | - | Publicly available | 5 classes (negative to positive) | Movies, TV series and variety shows
(ICT-MMMO) database was created in 2013 by Wollmer et al. [17]. The dataset consists of 370 online videos of English-language movie reviews collected from YouTube and ExpoTV. Each video in the dataset was labelled with one of five sentiment labels: strongly positive, weakly positive, neutral, weakly negative and strongly negative.

2.4. Persuasive Opinion Multimedia (POM) Dataset

S. Park et al. [18] collected 1000 movie reviews from ExpoTV, where all reviewers expressed their opinions in English. Each movie review is a video of a speaker talking about a particular movie, together with the speaker's direct rating of the movie on a scale from 1 star (most negative review) to 5 stars (most positive review). Each video in the corpus shows a frontal view of one person talking about a particular movie, and the average length of the videos is about 93 seconds. The dataset can be used for two purposes. First, it is used to study persuasiveness in the context of online social multimedia: each video is annotated from 1 (very unpersuasive) to 7 (very persuasive). Second, it is used to recognize speaker traits: each movie review video was annotated with speaker traits such as confidence, entertaining, trusting, passion, relaxed, persuasive, dominance, nervous, credibility, reserved and humorous. 903 videos were split into 600 for training, 100 for validation and 203 for testing.

2.5. CMU-MOSI Dataset

The CMU-MOSI Dataset was developed in 2016 by Amir Zadeh et al. [8]. The dataset consists of 93 videos collected from YouTube video blogs, or vlogs. Vlogs are YouTube videos in which users express their opinions about many different subjects; this type of video usually contains only one speaker, looking primarily at the camera.
The ages of the speakers ranged from 20 to 30 years; 41 videos feature female speakers, while the rest feature male speakers. Although the speakers are from different cultures, they all expressed their opinions in English. One big advantage of these videos is that they address diversity and contain noise: all videos were recorded in different setups, with some users having high-tech microphones and cameras while others use less professional recording devices. Users are also at different distances from the camera, and background and lighting conditions differ from one video to another. The videos were kept at their original resolution without any quality enhancement.
Each utterance in the dataset was labelled with one of the following seven sentiments: strongly positive (labelled as +3), positive (+2), weakly positive (+1), neutral (0), weakly negative (-1), negative (-2) and strongly negative (-3).
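As an illustration of this labelling scheme, a small helper can bin a continuous annotation in [-3, +3] into the seven classes. The nearest-integer rule below is only an illustrative convention; papers differ in how they bin the continuous CMU-MOSI annotations.

```python
def mosi_label(score: float) -> str:
    """Map a sentiment score in [-3, +3] to an illustrative 7-point label."""
    names = {-3: "strongly negative", -2: "negative", -1: "weakly negative",
              0: "neutral",
              1: "weakly positive",  2: "positive",   3: "strongly positive"}
    # Round to the nearest integer and clamp to the valid range.
    bucket = max(-3, min(3, round(score)))
    return names[bucket]

print(mosi_label(2.6))   # strongly positive (rounds to +3)
print(mosi_label(-1.2))  # weakly negative
```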

2.6. CMU-MOSEI Dataset

The CMU-MOSEI dataset was developed in 2018 by Zadeh et al. [14]. CMU-MOSEI is a larger scale dataset consisting of 3228 videos divided into 22,777 utterances from more than 1000 online YouTube speakers. The videos cover 250 distinct topics, but the three most frequent topics are reviews, debate and consulting (1.8%). Each video contains only one speaker, looking primarily at the camera. Like CMU-MOSI, CMU-MOSEI also addresses diversity and contains noise. Each utterance in the dataset was labelled with one of seven sentiments: strongly positive (labelled as +3), positive (+2), weakly positive (+1), neutral (0), weakly negative (-1), negative (-2) and strongly negative (-3).
It is worth mentioning that most recent research uses the CMU-MOSI and CMU-MOSEI datasets to evaluate the performance of models in multimodal sentiment analysis.

2.7. CH-SIMS Dataset

The CH-SIMS Dataset was developed in 2020 by Yu et al. [19]. The dataset consists of 60 videos divided into 2281 utterances collected from movies, TV series and variety shows. The average utterance length is 3.67 seconds, and in each video clip no faces other than the speaker's appear. For each utterance, the annotators give one multimodal annotation and three unimodal annotations per video clip. This allows researchers to use SIMS for both unimodal and multimodal sentiment analysis tasks, and to develop new methods for multimodal sentiment analysis with these additional annotations. The annotations in this dataset can be one of the following: positive, weakly positive, neutral, weakly negative, negative.

3. Feature Extraction

3.1. Visual Feature Extraction

Facial expressions have always been primary cues for analyzing the emotions and sentiments of a speaker. Many measurement systems for facial expressions have been developed; the most famous is the Facial Action Coding System (FACS), developed by Ekman and Friesen and published in 1978. FACS is an anatomically based system for describing all visually discriminable facial movements. It reconstructs facial expressions in terms of Action Units (AUs). The facial muscles of all humans are almost identical, and AUs are based on the movements of these muscles. FACS distinguishes facial actions only and does not itself identify emotions; FACS codes are used to infer emotions with the help of various available resources. Some resources use combinations of AUs to infer emotions, such as the FACS Investigators' Guide [20], the FACS interpretive database, and a large body of empirical research [21].
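As a toy illustration of inferring emotions from AU combinations (the FACS codes themselves do not name emotions), the snippet below matches active AUs against a few commonly cited EMFACS-style prototypes. The mapping is illustrative, not a faithful reproduction of the Investigators' Guide; real systems are considerably more nuanced.

```python
# Illustrative EMFACS-style emotion prototypes as sets of AU numbers.
EMOTION_PROTOTYPES = {
    "happiness": {6, 12},        # cheek raiser + lip corner puller
    "sadness":   {1, 4, 15},     # inner brow raiser + brow lowerer + lip corner depressor
    "surprise":  {1, 2, 5, 26},  # brow raisers + upper lid raiser + jaw drop
    "anger":     {4, 5, 7, 23},  # brow lowerer + lid raiser/tightener + lip tightener
}

def infer_emotion(active_aus):
    """Return the first prototype emotion whose AUs are all present, if any."""
    active = set(active_aus)
    for emotion, proto in EMOTION_PROTOTYPES.items():
        if proto <= active:
            return emotion
    return "unknown"

print(infer_emotion([6, 12]))  # happiness
print(infer_emotion([4, 17]))  # unknown
```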
Ekman's work encouraged many researchers to exploit image and video processing methods in order to analyze facial expressions. Yacoob et al. [22] and Black et al. [23] used high-gradient points on the face, together with head and facial movements, to recognize facial expressions. In 1999, Zhang et al. [24] used geometrical features with a multi-scale, multi-orientation Gabor wavelet based representation to identify expressions. In 2000, Haro et al. [25] used Kalman filtering and principal component analysis (PCA) to enhance the features. A stochastic gradient descent based technique [26] and the active appearance model (AAM) [27] were used to recover face shape and texture parameters as facial features. Donato et al. [28] provided a comparison of several techniques, such as optical flow, PCA, independent component analysis (ICA), local feature analysis and Gabor wavelets, for the recognition of action units, and observed that the Gabor wavelet representation and ICA performed better on most datasets. In 2001, Tian et al. [29] claimed that every part of the face is an important feature and thus introduced a multi-state face component model to make use of both permanent and transient features. Permanent features are those which remain the same through the ages, for example the opening and closing of lips and eyes, pupil location, eyebrows and cheek areas. Transient features are observed only at the time of facial expressions, such as the contraction of the corrugator muscle that produces vertical furrows between the eyebrows. Texture features of the face have also been used for facial expression analysis in a number of feature extraction methods, including image intensity [30], image difference [31], edge detection [29] and Gabor wavelets [32]. In 2002, Ekman et al. [20] introduced an updated version of FACS in which the description of each AU, and of AU combinations, was refined; moreover, details on head movements and eye positions were added.
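To make the Gabor-wavelet representation used by Zhang et al. [24] and Donato et al. [28] concrete, the sketch below builds a small multi-scale, multi-orientation filter bank and takes one response per filter over a face patch. All parameter values are illustrative; real pipelines convolve the filters over the whole face at many locations.

```python
import numpy as np

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lam=6.0, psi=0.0):
    """Real part of a 2-D Gabor filter (one scale and orientation)."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates to the filter orientation.
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + yr**2) / (2 * sigma**2))  # Gaussian envelope
    carrier = np.cos(2 * np.pi * xr / lam + psi)          # sinusoidal carrier
    return envelope * carrier

# A tiny multi-scale (2 wavelengths), multi-orientation (4 angles) bank.
bank = [gabor_kernel(theta=t, lam=l)
        for t in np.linspace(0, np.pi, 4, endpoint=False)
        for l in (4.0, 8.0)]

patch = np.random.default_rng(3).normal(size=(15, 15))   # stand-in face patch
features = np.array([(k * patch).sum() for k in bank])   # one response per filter
print(features.shape)  # (8,)
```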
Many facial expression recognition techniques, face tracking methods and feature extraction methods have been introduced. The most popular ones are Active Appearance Models (AAM) [33], Optical flow models [22], Active Shape Models (ASM) [34], 3D Morphable Models (3DMM) [35], Muscle-based models [36], 3D wireframe models [37], Elastic net model [38], Geometry-based shape models [39], 3D Constrained Local Model (CLM-Z) [66], Adaptive View-based Appearance Model (GAVAM) [40].
All the pre-mentioned methods do not work well for videos because they do not model temporal information. An important facet of video-based methods is maintaining accurate tracking throughout the video sequence. A wide range of deformable models, such as muscle-based models [36], 3D wireframe models [37], elastic net models [38] and geometry-based shape models, have been used to track facial features in videos. Following this, many automatic image-based and video-based methods for detecting facial features and facial expressions were proposed [42, 43].
Research in psychology has shown that body gestures carry significant cues about the emotion and sentiment of the speaker. A detailed study demonstrated that body gestures are highly related to the speaker's emotions and that different emotions can result in various combinations of body gesture dimensions and qualities [44]. This is what urged some researchers to focus on extracting features from body gestures for sentiment analysis and emotion recognition [45-48].
The previous sections described how to extract handcrafted features from the visual modality and how to create mathematical models for facial expression analysis. With the advent of deep learning, models can learn the best features automatically without prior intervention. The deep learning framework enables robust and accurate feature learning in both supervised and unsupervised settings, which in turn produces strong performance on a range of applications, including digit recognition, image classification, feature learning, visual recognition, musical signal processing and NLP [3]. Motivated by this success in feature extraction, sentiment analysis tasks started to adopt deep learning algorithms to extract visual features, especially the convolutional neural network (CNN). In [49], a novel visual sentiment prediction framework was designed to understand images using a CNN. The framework is based on transfer learning from a CNN pre-trained on large-scale data for object recognition, which is then used for sentiment prediction. The main advantage of the proposed framework is that no domain knowledge is required for visual sentiment prediction. In 2014, You et al. [50] employed a 2D-CNN for visual sentiment analysis, coupled with a progressive strategy to fine-tune the network and filter out noisy training data. They also used domain transfer learning to improve performance. In 2015, Tran et al. [51] proposed a deep 3D convolutional network (3D-CNN) for spatiotemporal feature extraction. This network consists of 8 convolution layers, 5 pooling layers, 2 fully connected layers and a softmax output layer, and has proved very robust for extracting spatiotemporal features. In 2016, Poria et al. [52] proposed a convolutional recurrent neural network to extract visual features from multimodal sentiment analysis and emotion recognition datasets, in which a CNN and an RNN are stacked and trained together.

Recent trend in visual feature extraction

Over the past few years, many sentiment analysis models have used either a 3D-CNN to extract visual features from videos, or publicly available libraries such as FACET, OKAO and CERT. The FACET library extracts a set of visual features including facial action units, facial landmarks, head pose, gaze tracking and HOG features. OKAO Vision is a commercial software package that detects the face in each frame, then extracts facial features and extrapolates some basic facial expressions as well as eye gaze direction. The main facial expression recognized is the smile; this is a well-established technology that can be found in many digital cameras. The Computer Expression Recognition Toolbox (CERT) [53] automatically extracts smile and head pose estimates and facial AUs. These features describe the presence of two or more AUs that define one of eight emotions (anger, contempt, disgust, fear, joy, sadness, surprise and neutral). For example, unit AU12 describes the pulling of the lip corners, which usually suggests a smile but, when associated with a cheek raiser movement (unit AU6), defines the happiness emotion.

3.2. Acoustic Feature Extraction

Early research on audio feature extraction focused on the acoustic properties of spoken language. Psychological studies related to emotion showed that vocal parameters, especially pitch, intensity, speaking rate and voice quality, play a great role in sentiment analysis and emotion recognition [54]. Other studies showed that acoustic parameters change through both oral variations and personality traits. Much research has been conducted to find the types of features that allow better analysis, and researchers have found that pitch- and energy-related features play a key role in emotion recognition and sentiment analysis. Other researchers used additional features such as pause duration, spectral centroid, spectral flux, beat histogram, beat sum, strongest beat, formant frequencies, mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), log frequency power coefficients (LFPC) and Teager-energy-operator based features. Please refer to [3] for a brief description of these features.
It is also worth mentioning that some articles have classified affective reactions to sound into discrete feeling states and dimension-based states [57, 58]. Discrete feeling states are defined as spontaneous, uncontrollable emotions. The dimension-based states are hedonic valence (pleasantness), arousal (activation, intensity) and dominance.
Audio affect features have also been divided into local features and global features. The common approach to analyzing the audio modality is to segment each utterance into either overlapping or non-overlapping segments and examine them; within a segment, the signal is considered stationary. The features extracted from these segments are called local features. In speech, the audio signal of each utterance can be divided into several such segments. Global features are calculated by measuring statistics of the local features, such as their average, mean and deviation. Global features are the most commonly used features in the literature: they are fast to compute and, as they are fewer in number than local features, the overall computation is faster [59]. However, computing global features has some drawbacks. Some of them are only useful for detecting high-arousal effects, e.g., anger and disgust; for lower arousal, global features are less effective, e.g., they are less able to distinguish between anger and joy. Global features also lack temporal information and cannot capture the dependence between two segments in an utterance.
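The relationship between local and global features can be sketched as follows. The per-segment pitch values below are hypothetical; in practice, local features would come from an extraction toolkit such as openSMILE:

```python
import statistics

def global_features(local_pitch):
    """Summarize segment-level (local) pitch values into utterance-level global statistics."""
    return {
        "mean": statistics.mean(local_pitch),
        "stdev": statistics.pstdev(local_pitch),
        "min": min(local_pitch),
        "max": max(local_pitch),
    }

# Hypothetical per-segment pitch estimates (Hz) for one utterance.
pitch = [210.0, 215.0, 230.0, 225.0, 220.0]
g = global_features(pitch)
```

Note how the summary discards the order of the segments, which is exactly the loss of temporal information described above.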
The benchmark results reported by Navas et al. [60] showed that speaker-dependent approaches often give much better results than speaker-independent approaches. However, the speaker-dependent approach is not feasible in many practical applications that deal with a very large number of users.
As in computer vision, deep learning is increasingly gaining attention in audio classification research. A natural research question is whether deep neural networks can likewise be used for automatic feature extraction from audio data. An answer was given in 2015 by a group of researchers [61], who used a CNN to extract features from audio, which were then fed to a classifier for the final emotion classification task. Deep neural networks based on Generalized Discriminant Analysis (GerDA) are also a very popular approach in the literature for extracting features automatically from raw audio data. However, most deep learning approaches in the audio emotion classification literature still rely on handcrafted features [62].

Recent trend in acoustic feature extraction

Recently, most multimodal sentiment analysis models use OpenSMILE [63], COVAREP [64] or OpenEAR [65] to extract acoustic features. These are freely available, popular audio feature extraction toolkits that can extract all the key features elaborated above. The OpenSMILE feature extraction toolkit unites feature extraction algorithms from the speech processing and music information retrieval communities. It is used to extract CHROMA and CENS features, loudness, MFCCs, perceptual linear predictive cepstral coefficients, linear predictive coefficients, line spectral frequencies, fundamental frequency and formant frequencies [63]. COVAREP is used to extract acoustic features including 12 mel-frequency cepstral coefficients, pitch, voiced/unvoiced segmenting features, glottal source parameters, peak slope parameters and maxima dispersion quotients. All extracted features relate to emotion and tone of speech, and are automatically extracted from the audio track of each video clip [64]. OpenEAR is open source software that can automatically compute pitch and voice intensity; speaker normalization is performed using z-standardization [65].

3.3. Textual Feature Extraction

Traditionally, the bag-of-words (BoW) model had been used to extract features for sentences and documents in NLP and text mining. It is called a "bag" of words, because any information about the order or structure of words in the document is ignored. The model is only concerned with whether known words occur in the document, not where in the document.
Based on BoW, a document is transformed into a numeric feature vector of fixed length, where each element in the vector is scored. This score can be a binary score for the presence or absence of a word, a word frequency, or a TF-IDF score. Despite its popularity, BoW has some shortcomings. First, the dimension of the vector equals the size of the vocabulary, so as the vocabulary grows, the vector representation of documents grows with it; for a very large corpus, such as thousands of books, the vector might have thousands or millions of positions. Second, BoW can barely encode the semantics of words, since word order is ignored: two documents can have exactly the same representation as long as they share the same words. Third, each document may contain very few of the known words in the vocabulary, which results in a vector with many zero scores, called a sparse vector or sparse representation. Sparse vectors require more memory and computational resources when modeling, and the vast number of positions or dimensions can make the modeling process very challenging for traditional algorithms.
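As a minimal sketch of the scoring schemes above (with a toy vocabulary and naive whitespace tokenization; real pipelines also handle punctuation and may use TF-IDF weights):

```python
from collections import Counter

def bow_vector(document, vocabulary, scoring="count"):
    """Score a document against a fixed vocabulary (binary or count scoring)."""
    counts = Counter(document.lower().split())
    if scoring == "binary":
        return [1 if word in counts else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

vocab = ["the", "movie", "was", "great", "terrible"]
vec = bow_vector("The movie was great the plot was great", vocab)       # [2, 1, 2, 2, 0]
binary = bow_vector("The movie was great the plot was great", vocab,
                    scoring="binary")                                   # [1, 1, 1, 1, 0]
```

The zero in the last position ("terrible" never occurs) hints at the sparsity problem: with a realistic vocabulary of tens of thousands of words, almost every position would be zero.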
Later, a more sophisticated model was introduced that creates a vocabulary of grouped words, called bag-of-n-grams, an extension of BoW. This changes the scope of the vocabulary and allows the bag-of-words to capture a little more meaning from the document. In this approach, each word or token is called a "gram", and creating a vocabulary of two-word pairs is called a bigram model. Again, only the bigrams that appear in the corpus are modeled, not all possible bigrams. In this model, the scores can highlight words that are distinctive (contain useful information) in a given document. Thus, the model can consider word order within a short context (the n-gram); however, it still suffers from data sparsity and high dimensionality [66].
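A bigram vocabulary can be built from only the pairs that actually occur in the text, as in this small sketch (whitespace tokenization assumed). Note how "not a" preserves a trace of word order that unigram BoW discards:

```python
def ngrams(tokens, n=2):
    """Collect the n-grams that actually appear in a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "not a good movie".split()
bigrams = ngrams(tokens)  # ['not a', 'a good', 'good movie']
```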
To overcome the shortcomings of BoW and n-grams, word embedding techniques were proposed. A word embedding is a feature extraction technique that uses neural networks to learn a representation of text such that words with similar meanings have similar representations. Word embedding transforms the words in a vocabulary into vectors of continuous real numbers. The technique normally involves embedding a high-dimensional sparse vector (e.g., a one-hot vector) into a lower-dimensional dense vector that can encode some semantic and syntactic properties of words. Each dimension of the embedding vector represents a latent feature of a word. Word embeddings thus solve the shortcomings of one-hot vectors and achieve dimensionality reduction, which is why they may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems [66].
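The gain over one-hot vectors can be seen in a toy sketch: with dense vectors, semantic similarity becomes directly measurable, e.g. via cosine similarity. The 3-dimensional vectors below are invented for illustration only; pre-trained GloVe or word2vec vectors typically have 50-300 dimensions:

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Hypothetical 3-dimensional embeddings for three sentiment-bearing words.
emb = {
    "good":  [0.9, 0.1, 0.3],
    "great": [0.8, 0.2, 0.35],
    "awful": [-0.7, 0.9, 0.1],
}

# Words with similar meaning end up closer in the embedding space.
assert cosine(emb["good"], emb["great"]) > cosine(emb["good"], emb["awful"])
```

Under one-hot encoding, all three words would be equidistant, so no such comparison is possible.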

Recent trend in textual feature extraction

More recently, the trend in multimodal sentiment analysis is to use word embeddings pre-trained on a large corpus, such as GloVe [67] or word2vec [68]. All the models referenced in this paper use word embeddings to learn a representation for written text.
In Table 3 and Table 4, we present the methods used by each of the models referenced in this paper to extract visual, acoustic and textual features.
Table 3
Effectiveness results for some multimodal sentiment analysis models on the MOSI dataset and the feature extraction methods used in each model.

| Ref | Model | F1 Score | MAE | Corr | Textual | Visual | Acoustic |
|---|---|---|---|---|---|---|---|
| [74] | SVM | - | 1.1 | 0.559 | GloVe | FACET | COVAREP |
| [13] | EF-LSTM | - | 1.000 | 0.630 | GloVe | FACET | COVAREP |
| - | LF-LSTM | - | 0.987 | 0.624 | GloVe | FACET | COVAREP |
| - | - | - | 1.143 | 0.518 | GloVe | FACET | COVAREP |
| - | Majority | - | 1.864 | 0.057 | GloVe | FACET | COVAREP |
| - | MARN | - | 0.968 | 0.625 | GloVe | FACET | COVAREP |
| [73] | MFN | - | 0.965 | 0.632 | GloVe | FACET | COVAREP |
| - | RMFN | - | 0.922 | 0.681 | GloVe | FACET | COVAREP |
| [89] | Multilogue-Net | - | - | - | CNN | 3D-CNN | openSMILE |
| [74] | TFN | - | 1.040 | 0.587 | GloVe | FACET | COVAREP |
| [81] | LMF | - | 0.912 | 0.668 | GloVe | FACET | COVAREP |
| [84] | MRRF | - | 0.912 | 0.772 | GloVe | FACET | COVAREP |
| [83] | HFFN | - | - | - | GloVe | FACET | COVAREP |
| [85] | MMUUSA | 79.52 | - | - | word2vec | 3D-CNN | openSMILE |
| [13] | MMUUBA | - | 0.947 | 0.675 | GloVe | FACET | COVAREP |
| [86] | BIMHA | - | 0.9694 | 0.644 | BERT [105] | LibROSA [106] | LibROSA |
| [87] | SWAFN | - | 0.88 | 0.697 | GloVe | FACET | COVAREP |
| [13] | RAVEN | - | 0.948 | 0.674 | GloVe | FACET | COVAREP |
| - | Bc-LSTM | - | - | - | word2vec | 3D-CNN | openSMILE |
| [85] | MMMU-BA | - | - | - | word2vec | 3D-CNN | openSMILE |
| [94] | MMMUBA 2 | - | - | - | word2vec | 3D-CNN | openSMILE |
| - | MHSAN | - | - | - | word2vec | 3D-CNN | openSMILE |
| [98] | MHAM | - | - | - | - | - | - |
| [95] | Bi-LSTM | - | - | - | word2vec | 3D-CNN | openSMILE |
| [96] | MARNN | - | - | - | word2vec | 3D-CNN | openSMILE |
| [13] [99] | MCTN | - | 0.909 | 0.676 | - | - | - |
| - | MulT | - | 0.871 | 0.698 | GloVe | FACET | COVAREP |
| [103] | QMF | - | 0.6399 | 0.6575 | - | - | - |
Table 4
Effectiveness results for some multimodal sentiment analysis models on the MOSEI dataset and the feature extraction methods used in each model.

| Ref | Model | F1 Score | MAE | Corr | Textual | Visual | Acoustic |
|---|---|---|---|---|---|---|---|
| [13] | EF-LSTM | - | 0.687 | 0.573 | GloVe | FACET | COVAREP |
| [13] | LF-LSTM | - | 0.655 | 0.614 | GloVe | FACET | COVAREP |
| - | - | - | 1.143 | 0.518 | GloVe | FACET | COVAREP |
| - | Majority | - | 1.864 | 0.057 | GloVe | FACET | COVAREP |
| [13] | MFN | - | 0.646 | 0.626 | GloVe | FACET | COVAREP |
| [14] | Graph MFN | - | 0.71 | 0.54 | - | - | - |
| [13] | MARN | - | 0.646 | 0.629 | GloVe | FACET | COVAREP |
| [89] | Multilogue-Net | - | 0.59 | 0.5 | GloVe | FACET | OpenSMILE |
| [13] | TFN | - | 1.009 | 0.605 | GloVe | FACET | COVAREP |
| [13] | LMF | - | 0.660 | 0.623 | GloVe | FACET | COVAREP |
| [85] | MMUUSA | - | - | - | GloVe | FACET | COVAREP |
| [13] | MMUUBA | - | 0.627 | 0.672 | GloVe | FACET | COVAREP |
| [13] | RAVEN | - | 0.636 | 0.654 | GloVe | FACET | COVAREP |
| [100] | MulT | - | 0.580 | 0.703 | GloVe | FACET | COVAREP |
| [103] | - | - | 0.9146 | 0.6959 | - | - | - |

4. State of the art models in multimodal sentiment analysis

In this section, we classify thirty-five recent models in multimodal sentiment analysis into eight categories, based on the architecture of each model, as shown in Fig. 3.

4.1. Early Fusion based models (Feature Level Fusion)

In this category, all modalities are concatenated into a single view, which is then used as input to a prediction model, as shown in Fig. 4a. The prediction models can be as simple as Hidden Markov Models (HMMs) [69], Support Vector Machines (SVMs) [70] or Hidden Conditional Random Fields (HCRFs) [71]. Later, with the advances of deep learning, recurrent neural networks, especially Long Short-Term Memory (LSTM) networks [72], have been used for sequence modeling.

Although this simple concatenation has had some success in modeling multi-view problems, it causes over-fitting when the training dataset is small and is not intuitively meaningful, because view-specific dynamics are ignored, losing the context and temporal dependencies within each modality [73]. Some of the models that used this architecture are reviewed in this section.
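The concatenation step itself can be sketched as follows (the feature dimensions and values are hypothetical, and the three modalities are assumed to have already been aligned to the same number of timesteps):

```python
def early_fusion(text_feats, audio_feats, visual_feats):
    """Feature-level fusion: concatenate per-timestep modality features into one vector."""
    assert len(text_feats) == len(audio_feats) == len(visual_feats)  # aligned timesteps
    return [t + a + v for t, a, v in zip(text_feats, audio_feats, visual_feats)]

# Hypothetical 2-timestep sequences: 3-d text, 2-d audio, 2-d visual features.
fused = early_fusion(
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
    [[1.0, 1.1], [1.2, 1.3]],
    [[2.0, 2.1], [2.2, 2.3]],
)
# Each fused timestep is now a single 7-dimensional vector fed to the predictor
# (an HMM, SVM, or LSTM), which sees no modality boundaries.
```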

4.1.1. THMM (Tri-modal Hidden Markov Model)

Morency et al. [15] were the first to address the problem of tri-modal sentiment analysis. They used an HMM for classification after concatenation.

4.1.2. SVM (Support Vector Machine)

Perez-Rosas, Mihalcea, and Morency [16] combined the multimodal streams into a single feature vector, resulting in one vector for each utterance in the dataset, which is used by an SVM for binary classification to decide the sentiment of the utterance. On the other hand, S. Park et al. [18] used Support Vector Machines (SVMs) for classification and Support Vector Regression (SVR) with the radial basis function kernel for regression experiments.

Fig. 3. SOTA models in multimodal sentiment analysis.

Fig. 4. Traditional fusion techniques (a) Early fusion architecture, (b) Late fusion architecture.

4.1.3. EF-LSTM (Early Fusion LSTM)

Zadeh et al. [74] and D. Gkoumas et al. [13] concatenate the different modalities at each timestep into a single feature vector and use it as input to an LSTM [75]. Zadeh et al. [74] used a bidirectional stacked LSTM, while D. Gkoumas et al. [13] passed the last hidden state of the LSTM to two fully connected layers to produce the output sentiment. Both reported a very slight difference in performance metrics.

4.2. Late Fusion based models

In late fusion, different models are built for each modality, and their decisions are then combined by averaging, weighted sum [76], majority voting [77], or deep neural networks, as shown in Fig. 4b. The late fusion method integrates the different modalities at the prediction level.

The advantage of these models is that they are very modular: one can build a multimodal model from individual pre-trained unimodal models by fine-tuning only the output layer. These methods are generally strong in modeling view-specific dynamics and can also outperform unimodal models, but they have problems modelling cross-view dynamics, since cross-view interactions are normally more complex than a decision vote [78]. Some of the models that used this architecture are reviewed in this section.

4.2.1. Decision Voting (Deep Fusion DF)

Nojavanasghari et al. [79] train one deep neural model for each modality and perform decision voting on the output of each modality network.

4.2.2. Averaging

Nojavanasghari et al. [79] train one deep neural model for each modality. The output scores from all deep models are averaged.

4.2.3. LF-LSTM

D. Gkoumas et al. [13] build separate LSTMs for the textual, visual and acoustic modalities, and then concatenate the last hidden states of the three LSTMs. The concatenated hidden states are passed into two fully connected layers to produce the output sentiment.

4.2.4. Majority

Zadeh et al. [74] perform majority voting for classification tasks, and predict the expected label for regression tasks.

4.3. Temporal based Fusion

In this architecture, the model accounts for view-specific and cross-view interactions and continuously models them over time with a special attention mechanism. In most models, the architecture consists of two main components, as shown in Fig. 5.

- System of LSTMs/LSTHMs

Each modality is assigned one LSTM or LSTHM to model its view-specific dynamics. At each time step, the modality-specific features are fed into the corresponding LSTM/LSTHM.

- Attention Block

This layer is an explicitly designed attention mechanism responsible for attending to the most important components of the output of the system of LSTMs/LSTHMs in order to model the cross-view dynamics.

All the models in this section differ only in the way they apply attention to the output of the LSTM/LSTHM system, except Multilogue-Net, whose architecture is somewhat different. In what follows, each modality contributes its acoustic, visual and textual features at every time step.

4.3.1. Memory Fusion Network (MFN)

At each timestamp of the MFN recursion, the memories of the three LSTMs are concatenated together with their memories at the previous timestamp and passed to the attention block. The attention block in this component is a simple neural network with softmax at the output layer to calculate the attention weights, called the Delta-memory Attention Network (DMAN). The output of this module is the attended memories of the LSTMs, which is passed to a Multi-view Gated Memory [73]. The Multi-view Gated Memory is a unifying memory which stores the cross-view interactions over time. It has two gates, controlled by two neural networks, called retain and update respectively. At each timestep, the retain gate determines how much of the current state of the Multi-view Gated Memory to remember, while the update gate determines how much of the Multi-view Gated Memory to update, as shown in Fig. 6.
Finally, the final state of the Multi-view Gated Memory and the output of the three LSTMs at the last timestamp of the input sequence are concatenated together to construct the multimodal sequence representation that can be used to produce the output sentiment using two fully connected layers [73].
The power of MFN is that DMAN can model asynchronous cross-view interactions because it attends to the memories in the System of LSTMs which can carry information about the observed inputs across different timestamps [73].
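The retain/update gating of the Multi-view Gated Memory can be illustrated with a scalar sketch. In the actual MFN, the memory and gates are vectors produced by two small neural networks from the attended LSTM memories; the gate logits below are hypothetical stand-ins for those network outputs:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_memory_step(memory, candidate, retain_logit, update_logit):
    """One illustrative step of a Multi-view Gated Memory-style update (scalar form)."""
    r = sigmoid(retain_logit)   # how much of the old memory to keep
    u = sigmoid(update_logit)   # how much of the new candidate to write
    return r * memory + u * math.tanh(candidate)

# A large retain logit keeps most of the old state; a neutral update logit
# mixes in half of the squashed candidate.
m = 0.5
m = gated_memory_step(m, candidate=1.2, retain_logit=2.0, update_logit=0.0)
```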

4.3.2. Graph MFN (DFG)

Zadeh et al. [14] replicate the architecture of MFN, except that they replace the DMAN in MFN [73] with a new neural-based component called the Dynamic Fusion Graph (DFG). Please refer to the original paper [14] for a detailed explanation of the DFG.

4.3.3. Multi-Attention Recurrent Network (MARN)

The architecture of MARN is very similar to MFN; the main difference is that Zadeh et al. [74] modified the LSTM to create a hybrid LSTM (LSTHM), reformulating the memory component of each LSTM to carry hybrid information: the view-specific dynamics of its modality and the cross-view dynamics code related to that modality. The asynchronous cross-view dynamics are captured at each time step using a neural-based attention block called the multi-attention block (MAB), as shown in Fig. 7. The authors claimed that more than one cross-view dynamic may occur simultaneously across the three modalities, so the MAB consists of several neural networks, each with softmax at the output layer, responsible for modelling cross-view dynamics. The output of this module is the attended output of the LSTHMs, which undergoes dimensionality reduction and is then passed into a deep neural network to produce the cross-view dynamics code for that timestep. The cross-view dynamics code represents all cross-modal interactions discovered at this timestep and is fed back into the intramodal LSTHMs as an additional input for the next timestep.

Fig. 5. Architecture of temporal fusion based models.

Fig. 6. Architecture of MFN.

Fig. 7. Architecture of MARN.
Finally, the cross-view dynamics code and the output of the three LSTHMs at the last timestamp of the input sequence are concatenated together to construct the multimodal sequence representation that can be used to produce the output sentiment. MARN is so powerful because it can model asynchronous cross-view dynamics.

4.3.4. Recurrent Memory Fusion Network (RMFN)

Similar to MARN, RMFN uses a hybrid LSTM (LSTHM) to model the view-specific dynamics [80]. For each timestep, the output of the three LSTHMs are concatenated together and passed to the attention block. The attention block in RMFN consist of multiple stages. In each stage, the most important modalities are highlighted using an Attention LSTM, then passed to a Fuse LSTM which integrates the highlighted modalities with the fusion representations from previous stages. Then the output of the final stage is fed into a SUMMARIZE module to generate a summarized cross-modal representation which represents all cross-modal interactions discovered at this timestep and is fed back into the intramodal LSTHMs as an additional input for the next timestep [80]. Finally, the last summarized cross-modal representation and the output of the three LSTHMs at the last timestamp of the input sequence are concatenated together to construct the multimodal sequence representation that can be used to produce the output sentiment.

The experiments conducted by the authors reveal that the multiple stages coordinate to capture both synchronous and asynchronous multimodal interactions.

4.3.5. Multilogue-Net: A Context Aware RNN for Multi-modal Emotion Detection and Sentiment Analysis in Conversation
Shenoy et al. [89] assume that the sentiment governing a specific utterance depends on four factors: speaker state, speaker intent, the preceding and future emotions, and the context of the conversation. The speaker intent is particularly difficult to model because it depends on prior knowledge about the speaker, yet modelling the other three factors separately, in an interrelated manner, was theorized to produce meaningful results if they could be captured effectively. The authors attempt to simulate the setting in which an utterance is said, and use the actual utterance at that point to gain better insight into its sentiment. The model uses information from all modalities, learning multiple state vectors (representing speaker state) for a given utterance, followed by a pairwise attention mechanism that attempts to better capture the relationship between all pairs of the available modalities. In particular, the model uses two gated recurrent units (GRUs) per modality to model the speaker's state and emotion. Alongside these GRUs, the model also uses an interconnected context network, consisting of as many GRUs as there are available modalities, to model a different learned context representation for each modality. The incoming utterance representations and the historical GRU outputs are used at every timestamp to predict the sentiment for that timestamp [89].

4.4. Utterance-Level Non-temporal Fusion

This architecture relies on collapsing the time dimension of the inputs. Unlike architecture 3, where the model works on every time step, in this architecture the model works with the whole utterance. We define three matrices which are formed from the concatenation of the language features, acoustic features and visual features respectively of each utterance. Most models in this architecture consist of two main components, as shown in Fig. 8.

- Modality Embedding Subnetwork

This subnetwork is responsible for modelling the view-specific dynamics. Its outputs are the acoustic, visual and textual embeddings of the utterance.

- Modality Fusion Subnetwork

The fusion layer is used to combine the acoustic, visual and textual embeddings into one compact representation to model the cross-view dynamics of the whole utterance.

4.4.1. Tensor Fusion Network (TFN)

Zadeh et al. [9] explicitly model the view-specific dynamics of the textual modality with an LSTM followed by a fully connected deep network, while the view-specific dynamics of the visual and acoustic modalities are both modelled by mean pooling followed by a fully connected deep network.
Then they perform fusion by creating a multi-dimensional tensor that captures unimodal, bimodal and trimodal interactions across the three modalities. It is mathematically equivalent to the outer product between the visual, acoustic and textual embeddings: a 3D cube of all possible combinations of unimodal embeddings, as shown in Fig. 9. Each utterance can thus be represented by a multimodal tensor, which is passed to a fully connected deep neural network called the Sentiment Inference Subnetwork to produce a vector representation that can be used to predict the sentiment (see Fig. 10).
The main disadvantage is that the produced tensor is very high dimensional, and its dimensions increase exponentially with the number of modalities. Consequently, the number of learnable parameters in the weight tensor of the Sentiment Inference Subnetwork also increases exponentially. This introduces an exponential increase in computational cost and memory, and exposes the model to the risk of overfitting [81,82]. Despite the effectiveness these methods have achieved, they give little consideration to the variations across different portions of a feature vector, which may contain disparate aspects of information, and thus fail to render the fusion procedure more specialized [83].
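To make the dimensionality argument concrete, the outer-product fusion can be sketched in a few lines of NumPy. The embedding sizes below are illustrative toy values, not the ones used in the paper; appending a constant 1 to each embedding is what makes the unimodal and bimodal subtensors appear inside the resulting cube alongside the trimodal one.

```python
import numpy as np

def tensor_fusion(z_l, z_a, z_v):
    """TFN-style fusion sketch: append a constant 1 to each unimodal
    embedding, then take the triple outer product. The cube contains
    unimodal, bimodal and trimodal subtensors."""
    z_l = np.concatenate([z_l, [1.0]])
    z_a = np.concatenate([z_a, [1.0]])
    z_v = np.concatenate([z_v, [1.0]])
    # outer product over the three modalities: shape (dl+1, da+1, dv+1)
    return np.einsum('i,j,k->ijk', z_l, z_a, z_v)

# Toy embeddings (dimensions are illustrative, not the paper's)
z_l, z_a, z_v = np.ones(128), np.ones(32), np.ones(32)
cube = tensor_fusion(z_l, z_a, z_v)
print(cube.shape)   # (129, 33, 33) -> roughly 140k fusion features
```

Even with these small toy dimensions the fused tensor already has over 140,000 entries, which is the exponential growth the text describes.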
Fig. 9. Tensor fusion with three types of subtensors: unimodal, bimodal and trimodal [9].

4.4.2. Temporal Tensor Fusion Network (T2FN)

Proceeding from the fact that clean data exhibits correlations across time and across modalities and produces low-rank tensors, while noisy data breaks these natural correlations and leads to higher-rank tensors, Paul et al. [82] propose the Temporal Tensor Fusion Network (T2FN), which builds tensor representations from multimodal data but uses low-rank tensor approximation to implement more efficient tensors that represent the true correlations and latent structures in multimodal data more accurately, thus eliminating imperfections in the input. The architecture of T2FN is very similar to TFN. The main difference is that T2FN models the view-specific dynamics of the three modalities with three LSTMs that are not followed by a deep network. And while TFN uses a single outer product between the three embeddings to obtain a multimodal tensor of rank one, T2FN performs outer products between the individual representations at every time step of each utterance to obtain a number of tensors equal to the number of time steps in the utterance. These tensors are then summed to obtain a final tensor whose rank is upper bounded by the number of time steps in each utterance. This final tensor can be used by the fully connected layer to predict the sentiment. The main advantage of this model is that the tensor rank minimization acts as a simple regularizer for training in the presence of noisy data. It is also computationally efficient and has fewer parameters compared to previous tensor-based methods.
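The rank bound can be illustrated with a toy NumPy sketch. The random matrices below stand in for per-time-step LSTM outputs, and the dimensions are illustrative: each time step contributes one rank-1 outer product, so any matrix unfolding of the summed tensor has rank at most T.

```python
import numpy as np

rng = np.random.default_rng(0)
T, dl, da, dv = 5, 8, 4, 4   # toy sequence length and feature sizes

# Per-time-step unimodal representations (stand-ins for LSTM states)
L = rng.standard_normal((T, dl))
A = rng.standard_normal((T, da))
V = rng.standard_normal((T, dv))

# T2FN: one rank-1 outer product per time step, summed over time.
# The rank of the resulting tensor is upper-bounded by T.
M = sum(np.einsum('i,j,k->ijk', L[t], A[t], V[t]) for t in range(T))
print(M.shape)   # (8, 4, 4)

# Unfolding the tensor along the first mode shows the rank bound.
print(np.linalg.matrix_rank(M.reshape(dl, da * dv)))   # at most T = 5
```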

4.4.3. Low-rank Multimodal Fusion (LMF)

Similar to T2FN, Liu et al. [81] propose low-rank weight tensors to make multimodal fusion efficient without compromising on performance. The architecture of LMF is very similar to that of TFN [9]. LMF decreases the number of parameters as well as the computation complexity.
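A minimal sketch of the low-rank idea follows, with toy dimensions and randomly initialized factor matrices standing in for learned weights: instead of materializing the fused tensor and a full weight tensor, each modality is projected by r rank-specific factors and the projections are multiplied elementwise, so the parameter count grows linearly rather than multiplicatively with the input sizes.

```python
import numpy as np

rng = np.random.default_rng(1)
dl, da, dv, dh, r = 128, 32, 32, 64, 4   # toy sizes; r is the chosen rank

# Unimodal embeddings with the constant 1 appended, as in TFN/LMF
z_l, z_a, z_v = (np.concatenate([rng.standard_normal(d), [1.0]])
                 for d in (dl, da, dv))

# One factor matrix per modality and per rank component
Wl = rng.standard_normal((r, dh, dl + 1))
Wa = rng.standard_normal((r, dh, da + 1))
Wv = rng.standard_normal((r, dh, dv + 1))

# LMF: project each modality with each rank-i factor, multiply
# elementwise across modalities, then sum over the r components.
h = sum((Wl[i] @ z_l) * (Wa[i] @ z_a) * (Wv[i] @ z_v) for i in range(r))
print(h.shape)   # (64,)

# Parameter count is linear in the input sizes, while a full TFN-style
# weight tensor would need dh*(dl+1)*(da+1)*(dv+1) entries.
low_rank_params = Wl.size + Wa.size + Wv.size
full_tensor_params = dh * (dl + 1) * (da + 1) * (dv + 1)
print(low_rank_params, full_tensor_params)
```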
Fig. 8. Architecture of utterance-level non-temporal fusion based architecture.
Fig. 10. Architecture of TFN.

4.4.4. Modality based Redundancy Reduction Fusion (MRRF)

Inspired by how TFN [9] fuses modalities through the outer product tensor of the input modalities, and how LMF [81] reduces the number of elements in the resulting tensor through low-rank factorization, Barezi et al. [84] introduce MRRF, which builds on the above two models. The main difference is that the factorization used in LMF applies a single compression rate across all modalities, while MRRF uses Tucker's tensor decomposition, which allows a different compression rate for each modality. This allows the model to adapt to variations in the amount of useful information between modalities. Modality-specific factors are chosen by maximizing performance on a validation set. Applying a modality-based factorization method is useful in removing redundant information that is duplicated across modalities, and results in fewer parameters with minimal information loss, leading to a less complicated model and reduced overfitting [84].

4.4.5. Hierarchical Feature Fusion Network (HFFN)

Again inspired by how TFN [9] fuses multimodal features through the outer product tensor of the input modalities, Mai et al. [83] try to improve efficiency and avoid the creation of high-dimensional tensors by introducing a new model called HFFN, which has been empirically shown to achieve a significant drop in computational complexity compared to other tensor-based methods. The model consists of three main stages: 'divide', 'conquer' and 'combine' (see Fig. 11). In the 'divide' stage, the feature vectors of the three modalities of the target utterance are aligned to form a multimodality embedding, which is then divided into multiple local chunks using a sliding window to explore inter-modality dynamics locally. In the 'conquer' stage, the local chunks are passed into the Local Fusion Module (LFM), where the outer product is applied to fuse the features within each local chunk and model the inter-modality dynamics. This considerably reduces the computational complexity by dividing the holistic tensor of TFN into multiple local ones. In the 'combine' stage, global interactions are
Fig. 11. Architecture of HFFN.

modelled by exploring interconnections and context dependency across the local fused tensors. However, the limited and fixed size of the sliding window may divide the complete process of expressing a sentiment into different local portions, which is why a bidirectional flow of information between local tensors is warranted to compensate for this problem. This is what urged the authors to design an RNN variant called the Attentive Bi-directional Skip-connected LSTM (ABS-LSTM); it is bidirectional and supported by two levels of attention mechanism: Regional Interdependence Attention and Global Interaction Attention. ABS-LSTM can transmit information and learn cross-modal interactions more effectively. Finally, the output of the global fusion module is passed to the Emotion Inference Module (EIM) to predict the sentiment of the target utterance.

4.4.6. Multimodal Uni-utterance Self Attention (MMUUSA)

Ghosaly et al. [85] explicitly model the view-specific dynamics of the three modalities with a bidirectional GRU followed by a fully connected deep network to generate the visual, acoustic and textual embeddings (modality embedding subnetwork), as shown in Fig. 12. The three embeddings are concatenated to form the information matrix of the utterance. In order to model the cross-modal interactions, self-attention is applied to the information matrix to produce the attention matrix. Finally, the information matrix and attention matrix are concatenated and passed to the output layer to predict the sentiment of each utterance.
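The self-attention step over the utterance's information matrix can be sketched as below. This is a plain dot-product formulation with toy dimensions; the exact scoring function in [85] may differ in details.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(M):
    """Dot-product self-attention over the rows of an information
    matrix M (one row per modality embedding)."""
    scores = softmax(M @ M.T)        # (3, 3) cross-modal affinities
    return scores @ M                # attention matrix, same shape as M

rng = np.random.default_rng(2)
d = 16
# Rows: textual, acoustic and visual embeddings of one utterance
info = np.stack([rng.standard_normal(d) for _ in range(3)])
att = self_attention(info)

# MMUUSA concatenates the information and attention matrices before
# the output layer.
fused = np.concatenate([info, att], axis=-1)
print(fused.shape)   # (3, 32)
```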

4.4.7. Multimodal Uni-utterance Bi-Attention (MMUUBA)

In the same paper [85], Ghosaly et al. try modelling the cross-view dynamics using bi-modal attention. Pairwise attentions are computed across all possible combinations of modality embeddings, i.e., linguistic-visual, linguistic-acoustic, and visual-acoustic. Finally, individual modality embeddings and bimodal attention pairs are concatenated to create the multimodal representation that can be used for the prediction of the sentiment of each utterance as shown in Fig. 13.
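A hedged sketch of the pairwise bi-modal attention follows. The shapes are toy values and the attention formulation is a generic dot-product variant, which may differ in details (e.g. scaling) from the exact equations in [85].

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bimodal_attention(X, Y):
    """Pairwise (bi-modal) attention between two modality matrices.
    Each modality attends over the other; both attended views are
    returned and later concatenated with the unimodal embeddings."""
    M = X @ Y.T                        # cross-modality affinity matrix
    att_X = softmax(M, axis=-1) @ Y    # X attending over Y
    att_Y = softmax(M.T, axis=-1) @ X  # Y attending over X
    return att_X, att_Y

rng = np.random.default_rng(3)
text, vis, aud = (rng.standard_normal((4, 8)) for _ in range(3))

# All pairwise combinations: text-visual, text-acoustic, visual-acoustic
pairs = [bimodal_attention(a, b) for a, b in [(text, vis), (text, aud), (vis, aud)]]

# Concatenate unimodal embeddings with all bimodal attention pairs
multimodal = np.concatenate([text, vis, aud] + [m for p in pairs for m in p], axis=-1)
print(multimodal.shape)   # (4, 72)
```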

4.4.8. Bimodal Information-augmented Multi-Head Attention (BIMHA)

BIMHA [86] consists of four layers. The first layer models the view-specific dynamics within each single modality. The second layer models the cross-view dynamics: Wu et al. [86] adopt a tensor-fusion-based approach, which calculates the second-order Cartesian product of the embeddings of pairwise modalities to obtain the interaction information. After that, in order to adapt to the attention calculation module, the extracted unimodal features are fed into two fully connected layers to convert the features to a uniform dimension. Different from the first fully connected layer, which is private to each modality, the second fully connected layer is shared to reduce parameters. The third layer extracts the bimodal interaction information: the multi-head attention mechanism is applied to conduct bimodal interaction and calculate the bimodal attention, obtaining features with assigned attention weights. Finally, the individual modality embeddings and bimodal attention pairs are concatenated to create the multimodal representation that can be used to predict the sentiment of each utterance.

4.4.9. Sentimental Words Aware Fusion Network for Multimodal Sentiment Analysis (SWAFN)
Very similar to the architecture of MMUUBA, Chen and Li [87] explicitly model the view-specific dynamics of the three modalities with LSTMs to generate the visual, acoustic and textual embeddings (modality embedding subnetwork). They then use the co-attention mechanism introduced by Xiong et al. [88] to learn the co-dependent representation between the language modality and each of the other modalities (i.e. vision or acoustic) separately, by capturing the attention contexts of each modality. This kind of bimodal fusion between the language modality and the other modalities is called the shallow fusion part of the network, as the trimodal fusion and the knowledge existing in the language modality are not yet well captured at this point [87].
In order to capture the knowledge existing in the language modality, the authors concatenate the two kinds of bimodal fusion representation and the language embedding, and input the result to an LSTM layer to aggregate. Moreover, the authors believe that the sentimental words information that exist in the language modality can also be incorporated into a fusion model to learn richer multimodal representation. This is what urged them to design a sentimental words prediction task as an auxiliary task to guide the aggregation of the shallow fusion of multiple modality features and obtain the final sentimental words aware deep fusion representation. This part of the network is called the aggregation part and this part is mainly the main difference between MMUUBA and SWAFN. Finally, the final representation is input to a fully-connected layer and a prediction layer to get the sentiment prediction [87]. Please refer to the original paper [87] for detailed explanation of the aggregation part.

4.5. Word Level Fusion

In this architecture, every word in a sequence is fused with the accompanying nonverbal (acoustic and visual) features to learn variation vectors that either (1) disambiguate or (2) emphasize the existing word representations for multimodal prediction tasks [78].

4.5.1. RAVEN

Wang et al. [78] assume that the exact sentiment behind an uttered word can always be derived from the embedding of the uttered words combined with a shift in the embedding space introduced by the accompanying nonverbal (acoustic and visual) features. Proceeding from this assumption, Wang et al. [78] use the visual and acoustic
Fig. 12. Architecture of MMUUSA.
Fig. 13. Architecture of MMUUBA.
modalities accompanying an uttered word to learn variation vectors that can either disambiguate or emphasize the existing word representations for multimodal prediction tasks.
In RAVEN, each word in an utterance is accompanied by two sequences from the visual and acoustic modalities. The model consists of three major components: (1) the Nonverbal Subnetwork, where the visual and acoustic features are passed into visual and acoustic LSTMs respectively to model the view-specific dynamics and compute the nonverbal (visual and acoustic) embeddings; (2) the Gated Modality-mixing Network, which takes as input the original word embedding as well as the visual and acoustic embeddings, and uses an attention gating mechanism to yield the nonverbal shift vector, which characterizes how far and in which direction the meaning of the word has changed due to the nonverbal context; (3) Multimodal Shifting, which computes the multimodal-shifted word representation by integrating the nonverbal shift vector into the original word embedding.
By applying the same method for every word in a sequence, the original sequence triplet (language, visual and acoustic) is transformed into one sequence of multimodal-shifted representations (E) which corresponds to a shifted version of the original sequence of word representations fused with information from its accompanying nonverbal contexts. This sequence of multimodal-shifted word representations is then used in the high-level hierarchy to predict sentiments or emotions expressed in the utterance.
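The gated shifting idea can be sketched as below. The gate parametrization and the weight names (`Wv`, `Wa`) are simplifications for illustration, not the paper's exact equations; the point is that the nonverbal shift vanishes when the nonverbal context carries no signal, leaving the word embedding unchanged.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def nonverbal_shift(e_word, e_vis, e_ac, Wv, Wa, scale=1.0):
    """RAVEN-style gated modality mixing (simplified): attention gates
    weigh the visual and acoustic embeddings, their gated sum is the
    nonverbal shift vector, which is added to the word embedding."""
    g_v = sigmoid(Wv @ np.concatenate([e_word, e_vis]))   # visual gate
    g_a = sigmoid(Wa @ np.concatenate([e_word, e_ac]))    # acoustic gate
    shift = g_v * e_vis + g_a * e_ac                      # nonverbal shift vector
    return e_word + scale * shift                         # multimodal-shifted word

rng = np.random.default_rng(4)
d = 8
e_word, e_vis, e_ac = (rng.standard_normal(d) for _ in range(3))
Wv, Wa = rng.standard_normal((d, 2 * d)), rng.standard_normal((d, 2 * d))
shifted = nonverbal_shift(e_word, e_vis, e_ac, Wv, Wa)
print(shifted.shape)   # (8,)
```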

4.6. Multi-Modal Multi-Utterance Fusion

All the previous approaches consider each utterance as an independent entity and ignore the relationships and dependencies between utterances in the same video. Utterance-level sentiment analysis and traditional fusion techniques cannot extract context from multiple utterances. In practice, however, utterances in the same video follow a sequence and can be highly correlated. Thus, identifying relevant and important information from the pool of utterances is necessary to make a model more robust and accurate [85,90,91]. Proceeding from this claim, some researchers have recently started to exploit the contextual information of other utterances in order to classify an utterance in a video. However, every modality and utterance may not have the same importance in sentiment and emotion classification; therefore, the architecture mainly consists of two main modules whose order may vary from one model to another:

- Context Extraction Module

This module is used to model the contextual relationship among the neighbouring utterances in the video. Moreover, not all neighbouring utterances are equally important for the sentiment classification of the target utterance. Thus it is necessary to highlight which of the relevant contextual utterances are more important to predict the sentiment of the target utterance. In most architectures this module is a bidirectional recurrent neural network based module.

- Attention-Based Module for Modality Fusion

This module is responsible for fusing the three modalities (text, audio and video) and prioritizing only the important ones.
In the following subsections, we define three matrices which represent the multi-modal information (i.e. text, visual and acoustic) for a sequence of utterances in a video.

4.6.1. Bidirectional Contextual LSTM (Bc-LSTM)

Poria et al. [90] propose an LSTM-based network that takes as input the sequence of utterances in a video. Initially, the unimodal features are extracted from each utterance separately without considering the contextual dependency between the utterances. For each video, a matrix is constructed from the unimodal features of all utterances in the video, then this matrix is used as input to a contextual LSTM cell such that each utterance can get contextual information from the neighboring utterances in the video. The output of each LSTM cell is passed into a dense layer followed by a softmax layer. The dense layer activations serve as the output features (see Fig. 14).
The authors [90] also consider several variants of the contextual LSTM architecture in their experiments. First they consider the simple LSTM (sc-LSTM) where the contextual LSTM architecture consists of unidirectional LSTM cells. They also consider the hidden LSTM (h-LSTM) where the dense layer after the LSTM cell is omitted. Furthermore they consider the Bi-directional LSTMs (bc-LSTM); this variant gives the best performance since an utterance can get information from utterances occurring before and after itself in the video.

4.6.2. Multi-Utterance - Self Attention (MU-SA)

Ghosaly et al. [85] extract the context between the neighboring utterances at one level using bidirectional recurrent neural networks based models. The proposed framework takes multi-modal information (i.e. text, visual & acoustic) for a sequence of utterances and feeds it into three separate bi-directional Gated Recurrent Unit (GRU) [92] (one for each modality), followed by three fully-connected dense layers (one for each modality) resulting in three matrices that contain modality specific contextual information between the utterances. This is the only level of context extraction and is called unimodal context extraction.
Next self-attention is applied on the utterances of each modality separately, and used for classification (see Fig. 15). Specifically, for the
Fig. 14. Architecture of bc-LSTM.
Fig. 15. Architecture of MUSA.
three modalities, three separate attention blocks are required, where each block takes multi-utterance information of a single modality and computes the self-attention matrix. Finally the attention matrices, along with output of the dense layers are concatenated and passed to the output layer for classification.

4.6.3. Multi-Modal Multi-utterance Bi-Attention (MMMUBA)

Ghosaly et al. [85] replicate the same context architecture module of MU-SA, however in this model, multimodal attention is applied on the outputs of the dense layers in order to learn the joint-association between the multiple modalities & utterances, and to emphasize on the contributing features by putting more attention to these. In particular, bi-modal attention framework is employed, where an attention function
Fig. 16. Architecture of MMMUBA.

is applied on the representations of pairwise modalities i.e. visual-text, text-acoustic and acoustic-visual to obtain bimodal representations. Finally the bimodal representations along with individual modalities are concatenated using residual skip attention-based networks [93] and passed to a softmax layer for classification as shown in Fig. 16.

4.6.4. Multi-Modal Multi-utterance Bi-Attention 2 (MMMUBA II)

Huddar et al. [94] modified MMMUBA by adding two more levels of context extraction between the neighboring utterances using bidirectional recurrent neural networks based models. The introduced model is almost identical to MMMUBA, that's why we called it MMMUBA II although the authors haven't named their proposed model explicitly. The only difference between the two models is that before the final classification, the bimodal representations are fed into a bidirectional recurrent neural network-based module in order to extract a bimodal contextual feature vector. This is the second level of context extraction and is called bimodal context extraction. The bimodal contextual feature vectors are concatenated using residual skip attention-based networks [93] to obtain a trimodal attention matrix (Trimodal Fusion), which is passed into a bidirectional LSTM to obtain trimodal contextual features. This is the third level of context extraction and is called trimodal context extraction. Finally the contextual trimodal attention matrix is fed into a softmax classifier to obtain the final classification label for the utterance.

4.6.5. Multi-Head Self-Attention Network (MHSAN)

Cao et al. [1] claim that traditional multimodal sentiment analysis methods are mainly based on RNNs, which cannot utilize the correlation between sentences well. To address this issue, they propose a multimodal sentiment analysis model based on the multi-head attention mechanism, considering both the context information between sentences and the contributing factors of the different modalities.
The proposed framework extracts the context between the neighboring utterances at one level using multi-head attention based networks (see Fig. 17). The multimodal information (i.e. text, visual and acoustic) for a sequence of utterances is fed into three separate multi-head attention networks. Then, the context-dependent unimodal features are concatenated and fed into an attention network that can dynamically assign the contribution of the multimodal information to sentiment classification.

4.6.6. Contextual Attention BiLSTM

For each utterance, the feature vectors of all the three modalities are fed into a fully-connected layer for dimensionality equalization, then they are concatenated vertically into a single vector called the multimodal feature vector of an utterance [95] (see Fig. 18).
The second layer in the model is called Attention-Based Network for Multimodal Fusion (AT-Fusion). This layer takes as input the multimodal feature vector of each utterance, and outputs the attended modality features of each utterance. The third layer is the Contextual Attention LSTM (CAT-LSTM) which is used to model the contextual relationship among utterances and highlight the important contextual information for classification. CAT-LSTM accepts the attended modality features (output of the second layer) of a sequence of utterances per video and outputs a new representation of those utterances based on the surrounding utterances. CAT-LSTM consists of a number of LSTM cells equal to the number of utterances in the sequence followed by an attention network to amplify the contribution of context-rich utterances. The output of each cell in the CAT-LSTM represents the new representation of each utterance and is sent into a softmax layer for sentiment classification.
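An AT-Fusion-style modality attention can be sketched as follows. The weight shapes and the small scoring network are illustrative assumptions; the layer scores each modality column of the multimodal feature matrix and returns the attention-weighted combination of the modality features.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def at_fusion(F, W1, w2):
    """AT-Fusion-style layer sketch: score each modality column of F
    with a small attention network, then return the attention-weighted
    combination of the modality features."""
    scores = w2 @ np.tanh(W1 @ F)   # one scalar score per modality: shape (3,)
    alpha = softmax(scores)         # modality attention weights, sum to 1
    return F @ alpha                # attended modality features: shape (d,)

rng = np.random.default_rng(5)
d = 16
F = rng.standard_normal((d, 3))     # columns: text, audio, video (equalized dims)
W1 = rng.standard_normal((d, d))
w2 = rng.standard_normal(d)
fused = at_fusion(F, W1, w2)
print(fused.shape)   # (16,)
```

The `fused` vector for each utterance would then be passed to the CAT-LSTM layer described above.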

4.6.7. Multi Attention Recurrent Neural Network (MA-RNN)

The architecture of MA-RNN [96] is almost identical to that of the Contextual Attention BiLSTM [95], except that Kim et al. use scaled dot-product attention to calculate the attention score of each modality, and a multi-head attention mechanism to learn features in multiple representation subspaces at different positions. In addition, an attention-based BiGRU is used to model the contextual relationship among utterances instead of the CAT-LSTM used in the Contextual Attention BiLSTM.
In particular, for each utterance, the feature vectors of all three modalities are fed into a fully-connected layer for dimensionality equalization, then concatenated horizontally into a single matrix called the multimodal feature matrix of the utterance [96]. The second layer in the model is an attention layer that fuses the multimodal data of each utterance and reduces dimensionality. This layer takes as input the multimodal feature matrix of each utterance and produces the attended modality feature vector of each utterance. The scaled dot-product attention and the concept of multi-head attention introduced by Vaswani et al. [97] are applied in this layer; we refer the reader to [97] for a more detailed explanation. The third layer is an attention-based BiGRU used to model the contextual relationship among utterances and highlight the important contextual information for classification. It accepts the attended modality features (output of the second layer) of a sequence of utterances per video and outputs a new representation of those utterances based on the surrounding utterances. The attention-based BiGRU consists of a number of Bi-GRU cells equal to the number of utterances in the sequence, followed by an attention network to amplify the contribution of context-rich utterances. The output of each cell in the Bi-GRU represents the new representation of each utterance and is sent into a softmax layer for sentiment classification [96].
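The scaled dot-product, multi-head computation used in this fusion layer can be sketched as below, with toy shapes and random matrices standing in for learned projections:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, n_heads):
    """Scaled dot-product attention with several heads, in the style of
    Vaswani et al. [97]; shapes are illustrative."""
    n, d = X.shape
    dh = d // n_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    heads = []
    for h in range(n_heads):
        # slice out this head's subspace of the projections
        q, k, v = (M[:, h * dh:(h + 1) * dh] for M in (Q, K, V))
        att = softmax(q @ k.T / np.sqrt(dh)) @ v   # scaled dot-product
        heads.append(att)
    return np.concatenate(heads, axis=-1)          # (n, d)

rng = np.random.default_rng(6)
n, d = 3, 16                       # 3 modality rows, model dimension 16
X = rng.standard_normal((n, d))    # multimodal feature matrix of an utterance
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = multi_head_attention(X, Wq, Wk, Wv, n_heads=4)
print(out.shape)   # (3, 16)
```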
Fig. 17. Architecture of MHSAN.
Fig. 18. Architecture of biLSTM.
4.6.8. Multimodal sentiment analysis based on multi-head attention mechanism (MHAM)
Xi et al. [98] propose a model based on the multi-head attention mechanism, which uses self-attention to extract the intra-modal features and multi-head mutual attention to analyze the correlation between different modalities. It contains a total of six modules derived from the multi-head attention mechanism.

4.7. Sequence to Sequence (Seq2Seq) Models

4.7.1. Multimodal Cyclic Translation Network (MCTN)

Inspired by the success of Seq2Seq models, Pham et al. [99] propose the Multimodal Cyclic Translation Network (MCTN) to learn robust joint multimodal representations by translating between modalities. Translation from a source modality to a target modality results in an intermediate representation that captures joint information between the two modalities. MCTN extends this insight using a cyclic translation loss involving both forward translations from the source to the target modalities, and backward translations from the predicted target back to the source modality. The authors call this multimodal cyclic translation, which ensures that the learned joint representations capture maximal information from both modalities. The model is a hierarchical neural machine translation network with a source modality and two target modalities. The first level learns a joint representation using back translation. This intermediate representation is then translated into the second target modality without back translation. The multimodal representation is fed into an RNN for final classification. The power of MCTN is that once the translation model is trained with paired multimodal data, only data from the source modality is needed at test time for the final sentiment prediction, which makes the model robust to perturbations or missing information in the other modalities [99].
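The cyclic objective can be boiled down to a toy form: translate the source into the target, then translate the intermediate representation back, and penalize reconstruction error in both directions. A sketch with linear maps standing in for MCTN's learned Seq2Seq encoder/decoders (all names, shapes, and weights are illustrative, not the authors' implementation):

```python
import numpy as np

def mse(a, b):
    """Mean squared reconstruction error between two feature arrays."""
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(1)
# toy linear "translators" standing in for trained Seq2Seq networks
W_fwd = rng.standard_normal((16, 16)) * 0.1   # source -> target direction
W_bwd = rng.standard_normal((16, 16)) * 0.1   # target -> source direction

text   = rng.standard_normal((10, 16))  # source modality features
visual = rng.standard_normal((10, 16))  # target modality features

# forward translation produces the intermediate joint representation
joint = text @ W_fwd

# cyclic translation loss: forward reconstruction of the target plus
# back-translation of the joint representation to the source
cyclic_loss = mse(joint, visual) + mse(joint @ W_bwd, text)
```

At test time only `text` and `W_fwd` would be needed to produce `joint` for the downstream classifier, which mirrors the paper's claim that the other modalities can be absent during inference.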

4.7.2. Multimodal Transformer (MulT)

Tsai et al. [100] propose the Multimodal Transformer for Unaligned Multimodal Language Sequences (MulT). The authors extend the standard Transformer network introduced by Vaswani et al. in 2017 [97] to produce a modified transformer called the cross-modal transformer. MulT fuses multimodal time series using a feed-forward fusion process built from multiple directional pairwise cross-modal transformers. The proposed cross-modal transformer has no encoder-decoder structure; instead, it enables one modality to receive information from another (i.e., it repeatedly reinforces a target modality with low-level features from a source modality) by learning the attention across the two modalities' features with a cross-modal attention block. Each cross-modal transformer consists of several layers of cross-modal attention blocks that directly attend to low-level features of each pair of modalities rather than to intermediate-level features (removing the self-attention), which helps preserve the low-level information of each modality. The multi-head attention block is also adopted to learn the inter-modal attention. The outputs of the cross-modal transformers that share the same target modality are then concatenated, and each concatenation is passed through a self-attention transformer [97]. Finally, the last elements of the self-attention transformers are extracted and passed through fully connected layers to make predictions.
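In a single simplified head, cross-modal attention is scaled dot-product attention whose queries come from the target modality and whose keys/values come from the source. A minimal sketch of MulT's directional pairwise structure with text as the shared target modality (unlearned, single-head, illustrative shapes only):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def crossmodal_attend(target, source):
    """Target queries attend over source keys/values (single head,
    no learned projections, for illustration).
    target: (Lt, d), source: (Ls, d) -> (Lt, d)."""
    d = target.shape[-1]
    return softmax(target @ source.T / np.sqrt(d)) @ source

rng = np.random.default_rng(0)
d = 32
text, audio, vision = (rng.standard_normal((L, d)) for L in (12, 40, 25))

# directional pairwise cross-modal blocks sharing text as the target:
# audio -> text and vision -> text, concatenated along the feature axis
text_from_audio  = crossmodal_attend(text, audio)
text_from_vision = crossmodal_attend(text, vision)
fused_text = np.concatenate([text_from_audio, text_from_vision], axis=-1)
print(fused_text.shape)  # (12, 64)
```

Because the sequences never need to share a length (12, 40, and 25 timesteps here), this structure is what lets MulT operate on unaligned multimodal sequences.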

4.8. Quantum based models

All existing multimodal sentiment analysis models that are mainly based on neural networks model the multimodal interactions in a way that is implicit and hard to understand. These models suffer from low interpretability: the way the modalities interact is ambiguous and implicit at both levels of interaction, mainly because most models rely on neural structures to fuse multimodal data, which act like black boxes with few numerical constraints [11]. Although the aforementioned models have been successful, researchers are looking for ways to understand a model in order to know whether it can be trusted and deployed in real applications, or whether it raises privacy or security issues [101]. Interpretability has therefore become an important concern for machine learning researchers, and they have started to develop quantum-based approaches for fusing multimodal data. To the best of our knowledge, only three models based on quantum fusion had been introduced by 2021. The first is QMR (Zhang et al. [102] were the first to apply Quantum Theory (QT) to sentiment analysis). The second is QMF [103], which addresses this limitation with inspiration from quantum theory, which offers principled methods for modeling complicated interactions and correlations. In this quantum-inspired framework, the view-specific dynamics and the cross-view dynamics are formulated with superposition and entanglement, respectively, at different stages. The complex-valued neural network implementation of the framework achieves comparable results to state-of-the-art systems on both the MOSI and MOSEI datasets. The third model is QMN (quantum-like multimodal network) [104], which leverages the mathematical formalism of quantum theory (QT) and a long short-term memory (LSTM) network. Specifically, the QMN framework consists of a multimodal decision fusion approach inspired by quantum interference theory to capture the interactions within each utterance, and a model inspired by quantum measurement theory to model the interactions between adjacent utterances. It is worth mentioning, however, that QMN was tested on emotion recognition tasks, not sentiment analysis [104].
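As a rough intuition for the superposition step only (not the actual QMF or QMN formulation), unimodal features can be encoded as unit-norm complex amplitude vectors and mixed by superposition; measuring the squared amplitudes then yields a probability distribution whose cross terms carry interference between the views:

```python
import numpy as np

rng = np.random.default_rng(0)

def unit_state(n):
    """A random unit-norm complex amplitude vector over n basis states
    (the basis size and phases here are illustrative assumptions)."""
    v = rng.standard_normal(n) + 1j * rng.standard_normal(n)
    return v / np.linalg.norm(v)

text_state, audio_state = unit_state(4), unit_state(4)

# superposition of the two views with weights summing to 1 in probability,
# then renormalization to keep a valid quantum-like state
alpha, beta = np.sqrt(0.6), np.sqrt(0.4)
mixed = alpha * text_state + beta * audio_state
mixed /= np.linalg.norm(mixed)

# "measurement": squared amplitudes form a probability distribution whose
# deviation from 0.6*|text|^2 + 0.4*|audio|^2 is the interference term
probs = np.abs(mixed) ** 2
print(probs.sum())  # 1.0 (up to floating-point error)
```

The complex phases are what make the mixture differ from a plain weighted average of probabilities, which is the mathematical hook these quantum-inspired fusion models exploit.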

5. Performance Evaluation

We evaluate the performance of the thirty-five models in terms of both effectiveness and efficiency on the CMU-MOSI and CMU-MOSEI datasets.

5.1. Effectiveness

We use five evaluation performance metrics introduced in prior work