
DARNet: Dual Attention Refinement Network with Spatiotemporal Construction for Auditory Attention Detection

Sheng Yan$^{1*}$ \quad Cunhang Fan$^{1*\dagger}$ \quad Hongyu Zhang$^{1}$ \quad Xiaoke Yang$^{1}$ \quad Jianhua Tao$^{2}$ \quad Zhao Lv$^{1}$

$^{1}$ Anhui Province Key Laboratory of Multimodal Cognitive Computations,
School of Computer Science and Technology, Anhui University
$^{2}$ Department of Automation, Tsinghua University

Abstract

At a cocktail party, humans exhibit an impressive ability to direct their attention. The auditory attention detection (AAD) approach seeks to identify the attended speaker by analyzing brain signals, such as EEG signals. However, current AAD algorithms overlook the spatial distribution information within EEG signals and lack the ability to capture long-range latent dependencies, limiting the model’s ability to decode brain activity. To address these issues, this paper proposes a dual attention refinement network with spatiotemporal construction for AAD, named DARNet, which consists of the spatiotemporal construction module, dual attention refinement module, and feature fusion & classifier module. Specifically, the spatiotemporal construction module aims to construct more expressive spatiotemporal feature representations by capturing the spatial distribution characteristics of EEG signals. The dual attention refinement module aims to extract different levels of temporal patterns in EEG signals and enhance the model’s ability to capture long-range latent dependencies. The feature fusion & classifier module aims to aggregate temporal patterns and dependencies from different levels and obtain the final classification results. The experimental results indicate that compared to the state-of-the-art models, DARNet achieves an average classification accuracy improvement of 5.9% for 0.1 s, 4.6% for 1 s, and 3.9% for 2 s on the DTU dataset. While maintaining excellent classification performance, DARNet significantly reduces the number of required parameters. Compared to the state-of-the-art models, DARNet reduces the parameter count by 91%. Code is available at: https://github.com/fchest/DARNet.git

1 Introduction

Auditory attention detection (AAD) aims to study human auditory attention tendencies by analyzing brain signals [1, 2, 3]. Auditory attention refers to the ability of individuals to isolate or concentrate on specific sounds, which aids them in focusing on a single speaker amidst a multi-speaker environment, a scenario commonly referred to as the “cocktail party scenario” [4]. However, this ability may diminish or even completely disappear for individuals with hearing impairment. Therefore, finding solutions to assist these individuals in overcoming this challenge has become an urgent matter.
Mesgarani and Chang [5] have demonstrated a close connection between auditory attention and brain activity, which indicates that researchers can study auditory attention by analyzing brain activity. Following this concept, many methods such as electrocorticography (ECoG) [5], magnetoencephalography [6, 7], and electroencephalography (EEG) [8, 9] have been used to implement auditory attention detection. Among these methods, EEG-based approaches are widely applied in AAD due to their high temporal resolution, non-invasive mode, and excellent maneuverability [9, 10, 11].

According to the conclusions of Mesgarani and Chang [5], previous studies have utilized stimulus-reconstruction or speech envelope reconstruction methods, which necessitate clean auditory stimuli as input [12, 13]. However, in most real-world scenarios, environments consist of multiple sounds simultaneously. Listeners are exposed to a mixture of these sounds, posing a challenge in obtaining clean auditory stimuli. Therefore, in recent years, the academic community has increasingly focused solely on utilizing EEG signals as input for AAD research [14, 15]. The research method proposed in this paper also exclusively utilizes EEG signals.

Traditional AAD tasks relied on linear models to process EEG signals [16, 17]. However, brain activity is inherently nonlinear, posing challenges for linear models in capturing this complexity. Consequently, they necessitate longer decision windows to extract brain activity features [18]. Some previous studies have indicated that decent decoding performance can be achieved by analyzing different spatial distribution features within each frequency band. These studies project the extracted differential entropy (DE) values onto 2D topological maps and decode them with convolutional neural networks [19, 15]. However, EEG signals are fundamentally time-series data, and these methods overlook the dynamic temporal patterns of EEG signals. Other studies analyze EEG signals only in the time domain. For instance, they use long short-term memory (LSTM) networks to capture dependencies within EEG signals and achieve decent decoding performance [2]. However, these studies only focus on the temporal information within EEG signals, neglecting the spatial distribution features, which reflect the dynamic patterns of different brain regions when receiving, processing, and responding to auditory stimuli. Meanwhile, numerous noise points and outliers make it difficult to capture long-range latent dependencies.
To address these issues, this paper proposes a dual attention refinement network with spatiotemporal construction for AAD, named DARNet, which effectively captures the spatiotemporal features and long-range latent dependencies of EEG signals. Specifically, our model consists of three modules: (1) Spatiotemporal Construction Module. The spatiotemporal construction module employs a temporal convolutional layer and a spatial convolutional layer. The temporal convolutional layer effectively captures the temporal dynamic features of EEG signals, and the spatial convolutional layer captures the spatial distribution features among different channels, thereby constructing a robust embedding for the next layer. (2) Dual Attention Refinement Module. The dual-layer self-attention refinement module consists of two layers, each comprising a multi-head self-attention and a refinement layer. This design is intended to capture long-range latent dependencies and deeper sequence patterns in EEG signals. (3) Feature Fusion & Classifier Module. The attention features generated by the dual-layer self-attention refinement module, comprising both shallow and deep levels, are fed into the feature fusion module to obtain richer representations, enhancing the model’s robustness and generalization. The fused features are input into a classifier to predict the auditory attention tendencies of the subjects.

To this end, we evaluated the decoding performance of DARNet on three datasets: DTU, KUL, and MM-AAD. The results demonstrate that DARNet outperforms the current state-of-the-art model on all three datasets. The main contributions of this paper are summarized as follows:
  • We propose a novel auditory attention decoding architecture, which consists of a spatiotemporal construction module, a dual attention refinement module, and a feature fusion module. This architecture could fully leverage the spatiotemporal features and capture long-range latent dependencies of EEG signals.
  • The DARNet achieves remarkable decoding accuracy within very short decision windows, surpassing the current state-of-the-art (SOTA) model by 5.9% on the DTU dataset and 4.9% on the KUL dataset, all under a 0.1-second decision window. Furthermore, compared to the current state-of-the-art model with 0.91 million training parameters, DARNet achieves further parameter reduction, requiring only 0.08 million parameters.

Figure 1: The framework of the DARNet model for AAD, which mainly consists of three modules: (a) spatiotemporal construction module, (b) dual attention refinement module, and (c) feature fusion & classifier module. The model inputs are common spatial patterns (CSP) extracted from EEG signals, and the outputs are two predicted labels related to auditory attention.

2 Methodology

The previous AAD methods overlooked the influence of spatial distribution characteristics on decoding performance and struggled to capture the long-range dependencies in EEG signals [14, 19]. To address these issues, we proposed DARNet, which consists of a spatiotemporal construction module, a dual attention refinement module, and a feature fusion & classifier module, see Figure 1. Our proposed DARNet effectively captures the spatiotemporal features of EEG signals and has the capability to capture long-range latent dependencies in EEG signals.
By employing a moving window on the EEG data, we obtain a series of decision windows, each containing a small duration of EEG signals. Let $R=\left[r_{1}, \ldots, r_{i}, \ldots, r_{N}\right] \in \mathbb{R}^{T \times N}$ represent the EEG signals of a decision window, where $r_{i} \in \mathbb{R}^{N \times 1}$ represents the EEG data at the $i$-th time point within a decision window and contains $N$ channels. Here $N$ represents the number of EEG channels and $T$ denotes the length of the decision window. Before inputting EEG data into the DARNet, we employ a common spatial patterns (CSP) algorithm to extract raw features from the EEG data under different brain states [20, 21].
$$E = \mathrm{CSP}(R) \in \mathbb{R}^{c_{in} \times T}$$
where $\mathrm{CSP}(\cdot)$ represents the CSP algorithm and $E$ represents the processed EEG signal. $c_{in}$ is the number of components of the CSP algorithm and $T$ denotes the length of the decision window.
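For intuition, the following is a minimal sketch of how CSP spatial filters could be computed from labeled decision windows and applied to a single window $R$; the exact CSP configuration used by the authors (regularization, normalization, number of components) is not specified here, and the function names are ours.

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(windows_a, windows_b, n_components=16):
    """Estimate CSP spatial filters from two classes of decision windows.

    windows_a, windows_b: arrays of shape (n_windows, n_channels, T) holding
    windows for the two attention classes (e.g. attended-left vs attended-right).
    Returns W with shape (n_channels, n_components).
    """
    def mean_cov(windows):
        covs = [w @ w.T / np.trace(w @ w.T) for w in windows]  # normalized covariance per window
        return np.mean(covs, axis=0)

    cov_a, cov_b = mean_cov(windows_a), mean_cov(windows_b)
    # Generalized eigendecomposition: directions with maximal variance ratio
    # between the two classes (eigenvalues are returned in ascending order).
    eigvals, eigvecs = eigh(cov_a, cov_a + cov_b)
    order = np.argsort(eigvals)
    picks = np.concatenate([order[:n_components // 2], order[-n_components // 2:]])
    return eigvecs[:, picks]

def apply_csp(window, filters):
    """Project one decision window R (shape (T, n_channels)) onto the filters,
    yielding E = CSP(R) with shape (n_components, T)."""
    return filters.T @ window.T
```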

2.1 Spatiotemporal Construction Module

EEG signals record the brain’s neuronal electrical activity, varying over time and reflecting activity patterns and connectivity across brain regions [22]. By constructing spatiotemporal features from EEG signals, it’s possible to analyze the brain’s response patterns to auditory stimuli. However, previous studies only focused on local temporal patterns in EEG data, overlooking the spatial distribution features. Therefore, in addition to the conventional use of temporal filters, we introduced a spatial filter [23] to construct the spatiotemporal features of EEG signals.
Firstly, we use temporal convolution layers to capture the instantaneous changes in EEG signals, thereby constructing the temporal patterns $E_{t}$ of the EEG signals. It can be formulated as follows:
$$E_{t} = \mathrm{GELU}(\mathrm{TemporalConv2d}(E)) \in \mathbb{R}^{4 d_{\text{model}} \times c_{in} \times T}$$
where $\mathrm{TemporalConv2d}(\cdot)$ performs 2-D convolutional filtering (kernel size $= 1 \times 8$) along the time dimension with the $\mathrm{GELU}(\cdot)$ activation function. $d_{\text{model}}$ represents the embedding dimension.

Subsequently, we employ a spatial convolutional layer with a receptive field spanning all channels to capture the spatial distribution features $S$ of EEG signals across different channels, thereby aiding the model in comprehensively understanding the brain’s activity patterns in response to various auditory stimuli.
$$S = \mathrm{GELU}(\mathrm{SpatialConv2d}(E_{t})) \in \mathbb{R}^{d_{\text{model}} \times T}$$
where $\mathrm{SpatialConv2d}(\cdot)$ performs 2-D convolutional filtering with a $c_{in} \times 1$ kernel along the spatial dimension. By doing so, we not only capture the temporal patterns in EEG signals but also integrate the spatial distribution characteristics of EEG signals, thereby constructing an input embedding $S$ containing comprehensive spatiotemporal information for the next layer. This integrated input better reflects the complex features within EEG signals, providing richer information for subsequent analysis and processing.
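As a concrete illustration, a minimal PyTorch sketch of this module is given below. It assumes the CSP features are arranged as a $(\text{batch}, 1, c_{in}, T)$ tensor and uses "same" padding along time so that $T$ is preserved; the padding scheme and any details beyond the kernel sizes and channel counts stated above are our assumptions.

```python
import torch
import torch.nn as nn

class SpatiotemporalConstruction(nn.Module):
    """Sketch of the spatiotemporal construction module (assumptions noted above)."""
    def __init__(self, c_in: int = 16, d_model: int = 16):
        super().__init__()
        # Temporal filtering: 1x8 kernels slide along the time axis only.
        self.temporal = nn.Conv2d(1, 4 * d_model, kernel_size=(1, 8), padding="same")
        # Spatial filtering: the kernel spans all c_in rows at once.
        self.spatial = nn.Conv2d(4 * d_model, d_model, kernel_size=(c_in, 1))
        self.act = nn.GELU()

    def forward(self, x):                     # x: (batch, 1, c_in, T)
        e_t = self.act(self.temporal(x))      # (batch, 4*d_model, c_in, T)
        s = self.act(self.spatial(e_t))       # (batch, d_model, 1, T)
        return s.squeeze(2)                   # (batch, d_model, T)

# Shape check with a 1-second window at 128 Hz and 16 CSP components
x = torch.randn(2, 1, 16, 128)
print(SpatiotemporalConstruction()(x).shape)  # torch.Size([2, 16, 128])
```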

2.2 Dual Attention Refinement Module

Previous psycho-acoustic research has demonstrated that human attention is a dynamic and time-related activity [24, 25]. The brain activity from the preceding moment can profoundly influence subsequent brain activity [26]. However, previous AAD algorithms were hindered by model depth and the noise and outliers in EEG data, making them ineffective at capturing the long-range latent dependencies in EEG signals.

To address this issue, we proposed a dual self-attention mechanism, which has greater potential for capturing long-range latent dependencies and deeper sequence patterns in EEG signals. Inspired by Zhou et al. [27] and Yu et al. [28], we introduced a self-attention refinement operation, which refines the dominant temporal features through convolution and pooling operations, compressing the original EEG series of length $T$ to half its length. This self-attention refinement operation reduces the impact of noise and outliers, while also decreasing the model’s parameter count. This enhances the model’s generalization and robustness. The single-layer attention refinement module can be formulated as follows:
$$F = \mathrm{MaxPool}(\mathrm{ELU}(\mathrm{Conv1d}(\mathrm{MultiHeadAttention}(x))))$$
where $\mathrm{MultiHeadAttention}(\cdot)$ denotes the multi-head self-attention algorithm [29], $\mathrm{Conv1d}(\cdot)$ represents 1-D convolutional filtering (kernel width = 3) along the time dimension, $\mathrm{ELU}(\cdot)$ is the activation function proposed by Clevert et al. [30], and $\mathrm{MaxPool}$ denotes a max-pooling layer with stride 2.
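A minimal PyTorch sketch of one such attention refinement layer is shown below; the number of attention heads and the pooling kernel size are our assumptions (the paper specifies only the convolution kernel width of 3 and the pooling stride of 2).

```python
import torch
import torch.nn as nn

class AttentionRefinement(nn.Module):
    """One attention refinement layer: self-attention, 1-D convolution (kernel 3),
    ELU, and max-pooling with stride 2, halving the sequence length."""
    def __init__(self, d_model: int = 16, n_heads: int = 1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()
        self.pool = nn.MaxPool1d(kernel_size=2, stride=2)

    def forward(self, x):                      # x: (batch, T, d_model)
        a, _ = self.attn(x, x, x)              # multi-head self-attention over time
        a = self.conv(a.transpose(1, 2))       # (batch, d_model, T)
        a = self.pool(self.act(a))             # (batch, d_model, T // 2)
        return a.transpose(1, 2)               # (batch, T // 2, d_model)

# A 1-second window (T = 128) is refined to 64 time steps
print(AttentionRefinement()(torch.randn(2, 128, 16)).shape)  # torch.Size([2, 64, 16])
```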

Before applying the temporal attention feature extraction module, we add the absolute positional embedding [29] to the input embedding $S$ as follows:
$$s_{i} = s_{i} + p_{i}$$
where $p_{i} \in \mathbb{R}^{d_{\text{model}}}$ denotes the position of the $i$-th time step.
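For reference, a small sketch of the sinusoidal absolute positional embedding of [29] is given below; it is one standard realization of $p_{i}$, and whether DARNet uses this sinusoidal form or a learned variant is an assumption on our part.

```python
import torch

def positional_embedding(T: int, d_model: int) -> torch.Tensor:
    """Sinusoidal absolute positional embedding [29]: returns P of shape (T, d_model),
    so that each time step s_i can be updated as s_i + p_i."""
    position = torch.arange(T, dtype=torch.float32).unsqueeze(1)          # (T, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float32)
        * (-torch.log(torch.tensor(10000.0)) / d_model)
    )
    p = torch.zeros(T, d_model)
    p[:, 0::2] = torch.sin(position * div_term)
    p[:, 1::2] = torch.cos(position * div_term)
    return p

# s: (batch, T, d_model) embedding from the spatiotemporal construction module
s = torch.randn(2, 128, 16)
s = s + positional_embedding(128, 16)          # s_i = s_i + p_i
```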

To obtain different levels of temporal features from EEG signals and to capture the long-range latent dependencies, we stacked two of the above attention refinement extraction modules.
$$F_{1} = \mathrm{MaxPool}(\mathrm{ELU}(\mathrm{Conv1d}(\mathrm{MultiHeadAttention}(S)))) \in \mathbb{R}^{d_{\text{model}} \times \frac{T}{2}}$$
$$F_{2} = \mathrm{MaxPool}(\mathrm{ELU}(\mathrm{Conv1d}(\mathrm{MultiHeadAttention}(F_{1})))) \in \mathbb{R}^{d_{\text{model}} \times \frac{T}{4}}$$
where $F_{1}$ and $F_{2}$ contain different levels of dependencies and temporal patterns in the EEG signals, respectively.

2.3 Feature Fusion & Classifier Module

Features at different levels can reflect various characteristics of the pattern. By optimizing and combining these different features, effective discriminative information from features at different levels is preserved while, to some extent, redundant information is eliminated [31]. Therefore, we designed a feature fusion module as follows:
First, we project the features $F_{1}$ and $F_{2}$ into the same dimension.
$$F_{1}^{\prime} = \mathrm{Linear}(\mathrm{AdaptiveAvgPool}(F_{1})) \in \mathbb{R}^{4}$$
$$F_{2}^{\prime} = \mathrm{Linear}(\mathrm{AdaptiveAvgPool}(F_{2})) \in \mathbb{R}^{4}$$
where $\mathrm{AdaptiveAvgPool}$ denotes an adaptive average pooling layer and $\mathrm{Linear}$ denotes a linear layer. Second, we concatenate the features $F_{1}^{\prime}$ and $F_{2}^{\prime}$ to obtain the fused feature vector $F$.
$$F = \left[F_{1}^{\prime}, F_{2}^{\prime}\right]$$
Finally, we employ a fully connected layer to obtain the final auditory attention prediction.
$$p = w(F + b)$$
where $w$ and $b$ are the weight and the bias of the fully connected layer. In the training stage, we employ the cross entropy loss function to supervise the network training.
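Putting the above together, the following is a minimal PyTorch sketch of the feature fusion & classifier module. The dimensions follow those quoted in Section 3.3 ($d_{\text{model}} = 16$, projection to 4, fused feature of size 8, two classes); whether the two levels share the projection layer is our assumption (here they do not).

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Fuse the two levels of attention features and classify the attention direction."""
    def __init__(self, d_model: int = 16, proj_dim: int = 4, n_classes: int = 2):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool1d(1)          # global average over time
        self.proj1 = nn.Linear(d_model, proj_dim)    # F1 -> F1'
        self.proj2 = nn.Linear(d_model, proj_dim)    # F2 -> F2'
        self.fc = nn.Linear(2 * proj_dim, n_classes)

    def forward(self, f1, f2):                       # f1: (B, d_model, T/2), f2: (B, d_model, T/4)
        f1p = self.proj1(self.pool(f1).squeeze(-1))  # (B, proj_dim)
        f2p = self.proj2(self.pool(f2).squeeze(-1))  # (B, proj_dim)
        fused = torch.cat([f1p, f2p], dim=-1)        # F = [F1', F2'], shape (B, 8)
        return self.fc(fused)                        # logits p, trained with cross entropy

# Example with a 1-second window (T = 128): F1 has 64 time steps, F2 has 32
f1, f2 = torch.randn(4, 16, 64), torch.randn(4, 16, 32)
print(FusionClassifier()(f1, f2).shape)              # torch.Size([4, 2])
```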

3 Experiments

3.1 Dataset

In this section, we conduct experiments on three publicly available datasets, namely KUL [32, 33], DTU [34, 35], and MM-AAD [19], which are commonly used in auditory attention detection, to evaluate the effectiveness of our DARNet. KUL and DTU only contain EEG data of the auditory stimulus scenes. MM-AAD contains EEG data of the audio-only scene and the audio-visual scene. We summarize the details of the above datasets in Table 1.
  1. KUL Dataset: In this dataset, 64-channel EEG data were collected from 16 normal-hearing subjects using a BioSemi ActiveTwo device at a sampling rate of 8,192 Hz in a soundproof room. Each subject was instructed to focus on one of two simultaneous speakers. The auditory stimuli were filtered at 4 kHz and set at 60 dB through in-ear headphones, and contain four Dutch short stories narrated by three male Flemish speakers. Two listening conditions were employed: dichotic (dry) presentation with one speaker per ear, and head-related transfer function (HRTF) filtered presentation, simulating speech from $90^{\circ}$ left or right. Each subject listened to 8 trials, each lasting 6 minutes.
  2. DTU Dataset: In this dataset, 64-channel EEG data were collected from 18 normal-hearing subjects using a BioSemi ActiveTwo device at a sampling rate of 512 Hz. Each subject was instructed to focus on one of two simultaneous speakers, who presented at $60^{\circ}$ relative to the subject. The auditory stimuli were set at 60 dB through ER-2 earphones and contain Danish audiobooks narrated by three male speakers and three female speakers. Each subject listened to 60 trials, each lasting 50 seconds.
  3. MM-AAD Dataset: In this dataset, 32-channel EEG data were collected from 50 normal-hearing subjects (34 males and 16 females) at a sampling rate of 4 kHz, following the 10/20 international system. Each subject was exposed to both audio-only and audio-visual stimuli. They were instructed to focus on one of two simultaneous speakers, who presented at the left or right spatial direction relative to the subject. The auditory stimuli comprised 40 classic Chinese stories narrated by both male and female speakers. Each subject listened to 20 trials, each lasting 165 seconds.

3.2 Data Processing

To fairly compare the performance of the proposed DARNet model, specific preprocessing steps are applied to each dataset (KUL, DTU, and MM-AAD). For the KUL dataset, the EEG data were firstly re-referenced to the average response of mastoid electrodes, then bandpass filtered between 0.1 Hz and 50 Hz, and finally down-sampled to 128 Hz. For the DTU dataset, the EEG data were filtered to remove 50 Hz linear noise and harmonics. Eye artifacts were eliminated through joint decorrelation
Table 1: Details of three datasets used in the experiments.

| Dataset | Subjects | Scene | Language | Duration per subject (minutes) | Total duration (hours) |
| :---: | :---: | :---: | :---: | :---: | :---: |
| KUL | 16 | audio-only | Dutch | 48 | 12.8 |
| DTU | 18 | audio-only | Danish | 50 | 15.0 |
| MM-AAD | 50 | audio-only | Chinese | 55 | 45.8 |
| | 50 | audio-visual | Chinese | 55 | 45.8 |
and the EEG data were re-referenced to the average response of mastoid electrodes. Finally, the EEG data were down-sampled to 64 Hz. For the MM-AAD dataset, the EEG data were firstly bandpass filtered between 0.1 Hz and 50 Hz, and 50 Hz noise was then removed through a notch filter. Additionally, eye artifacts were eliminated, and further noise removal was achieved, using independent component analysis (ICA). Finally, the EEG data were down-sampled to 128 Hz.
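As an illustration, the MM-AAD chain described above could be expressed with MNE-Python roughly as follows; the number of ICA components and the indices of the artifact components to exclude are placeholders, not values from the paper.

```python
import mne

def preprocess_mm_aad(raw: mne.io.BaseRaw) -> mne.io.BaseRaw:
    """Rough sketch of the MM-AAD preprocessing described above (placeholders noted)."""
    raw.filter(l_freq=0.1, h_freq=50.0)              # band-pass between 0.1 Hz and 50 Hz
    raw.notch_filter(freqs=50.0)                     # remove 50 Hz line noise
    ica = mne.preprocessing.ICA(n_components=20, random_state=0)  # 20 components: placeholder
    ica.fit(raw)
    ica.exclude = [0, 1]                             # eye-artifact components: placeholders
    ica.apply(raw)
    raw.resample(128)                                # down-sample to 128 Hz
    return raw
```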

We evaluated our proposed DARNet model and compared it with other state-of-the-art models under three decision window lengths: 0.1 s, 1 s, and 2 s. Specifically, we selected three publicly available models as our baseline for comparison: SSF-CNN [36], MBSSFCC [15], and DBPNet [19].

3.3 Implementation Details

In previous AAD research, the accuracy of auditory attention prediction classification has been used as a benchmark for model performance. We followed this convention and evaluated our proposed DARNet on the KUL, DTU, and MM-AAD datasets. In what follows, we take the KUL dataset with a 1-second decision window as an example to illustrate implementation details, including training settings and network configuration.

Firstly, we set the proportions of the training, validation, and test sets to 8:1:1. For each subject of the KUL dataset, we get 4,600 decision windows for training, 576 decision windows for validation, and 576 decision windows for testing. Meanwhile, we set the batch size to 32 and the maximum number of epochs to 100, and employ an early stopping strategy: training stops if the loss function value on the validation set does not decrease for 10 consecutive epochs. Additionally, we utilize the Adam optimizer with a learning rate of 5e-4 and weight decay of 3e-4 to train the model. DARNet is implemented using PyTorch.

Before inputting EEG data into the DARNet, we employ the CSP algorithm to extract raw features $E \in \mathbb{R}^{128 \times 64}$ from the EEG data. The data is transposed and expanded, represented as $E^{\prime} \in \mathbb{R}^{1 \times 64 \times 128}$. Then, through the spatiotemporal construction module ($c_{in}$ is set to 16), we can get embedding data $S \in \mathbb{R}^{16 \times 1 \times 128}$. After dimensionality reduction, transposition, and the addition of absolute positional embedding, the data is fed into the dual attention refinement module, resulting in two distinct level features, $F_{1} \in \mathbb{R}^{16 \times 64}$ and $F_{2} \in \mathbb{R}^{16 \times 32}$. The $F_{1}$ and $F_{2}$ are sent to the feature fusion module, where they undergo global average pooling and dimensionality reduction via a fully connected (FC) layer (input: 16, output: 4) before being concatenated to obtain the fused feature $F \in \mathbb{R}^{8}$. Finally, $F$ is passed through another FC layer (input: 8, output: 2) to obtain the final auditory attention prediction $p \in \mathbb{R}^{2}$.
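The training procedure described above (Adam with learning rate 5e-4 and weight decay 3e-4, batch size 32, at most 100 epochs, early stopping after 10 epochs without validation-loss improvement, cross-entropy loss) can be sketched as follows; `model`, `train_loader`, and `val_loader` stand for DARNet and the per-subject decision-window splits and are placeholders.

```python
import torch
import torch.nn as nn

def train_darnet(model, train_loader, val_loader, max_epochs=100, patience=10):
    """Sketch of the training loop with early stopping on validation loss."""
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=3e-4)
    best_val, bad_epochs = float("inf"), 0

    for epoch in range(max_epochs):
        model.train()
        for x, y in train_loader:                    # batches of 32 decision windows
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

        model.eval()
        with torch.no_grad():
            val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)

        if val_loss < best_val:
            best_val, bad_epochs = val_loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:               # stop after 10 epochs without improvement
                break
    return model
```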

4 Results

4.1 Performance of DARNet

To evaluate the performance of DARNet, we conducted comprehensive experiments under decision windows of 0.1-second, 1-second, and 2-second, respectively, as shown in Figure 2. Additionally, we compared our DARNet with other advanced models, as shown in Table 2. The baseline results are replicated from the corresponding papers.

DARNet has outperformed the current state-of-the-art models on the KUL, DTU, and MM-AAD datasets, achieving further enhancements in performance. On the KUL dataset, DARNet achieves average accuracies of 91.6% (SD: 4.83%), 96.2% (SD: 3.04%), and 97.2% (SD: 2.50%) under the 0.1-second, 1-second, and 2-second decision windows, respectively. On the DTU dataset, DARNet achieves average accuracies of 79.5% (SD: 5.84%) for the 0.1-second decision window, 87.8% (SD: 6.02%) for the 1-second decision window, and 89.9% (SD: 5.03%) for the 2-second decision window. On the MM-AAD dataset, DARNet also demonstrates outstanding decoding accuracies of 94.9% (SD: 4.79%) for 0.1-second, 96.0% (SD: 4.00%) for 1-second, and 96.5% (SD: 3.59%) for 2-second in the audio-only scene, and 95.8% (SD: 4.04%) for 0.1-second, 96.4% (SD: 3.72%) for 1-second, and 96.8% (SD: 3.44%) for 2-second in the audio-visual scene.
Overall, DARNet’s decoding accuracy increases with larger decision windows, consistent with prior research [15, 14]. This is because longer decision windows provide more information for the model to make judgments while also mitigating the impact of individual outliers on the predictions. However, DARNet still maintains excellent performance under the 0.1-second decision window. Additionally, we observe that in the MM-AAD dataset, performance is better in the audio-visual condition than in the audio-only condition. We attribute this improvement to the visual cues aiding humans in localizing sound sources.
Table 2: Auditory attention detection accuracy (%) comparison on the DTU, KUL, and MM-AAD datasets. The results annotated by * are taken from [19]. Our experimental setup is consistent with theirs to ensure fairness in comparison; hence, we directly cite their results.

| Dataset | Scene | Model | 0.1-second | 1-second | 2-second |
| :---: | :---: | :---: | :---: | :---: | :---: |
| KUL | audio-only | SSF-CNN* [36] | $76.3 \pm 8.47$ | $84.4 \pm 8.67$ | $87.8 \pm 7.87$ |
| | | MBSSFCC* [15] | $79.0 \pm 7.34$ | $86.5 \pm 7.16$ | $89.5 \pm 6.74$ |
| | | BSAnet [37] | - | $93.7 \pm 4.02$ | $95.2 \pm 3.08$ |
| | | DenseNet-3D [38] | - | $94.3 \pm 4.3$ | $95.9 \pm 4.3$ |
| | | DBPNet* [19] | $87.1 \pm 6.55$ | $95.0 \pm 4.16$ | $96.5 \pm 3.50$ |
| | | DARNet (ours) | $91.6 \pm 4.83$ | $96.2 \pm 3.04$ | $97.2 \pm 2.50$ |
| DTU | audio-only | SSF-CNN* [36] | $62.5 \pm 3.40$ | $69.8 \pm 5.12$ | $73.3 \pm 6.21$ |
| | | MBSSFCC* [15] | $66.9 \pm 5.00$ | $75.6 \pm 6.55$ | $78.7 \pm 6.75$ |
| | | BSAnet [37] | - | $83.1 \pm 6.75$ | $85.6 \pm 6.47$ |
| | | EEG-Graph Net [39] | $72.5 \pm 7.41$ | $78.7 \pm 6.47$ | $79.4 \pm 7.16$ |
| | | DBPNet* [19] | $75.1 \pm 4.87$ | $83.9 \pm 5.95$ | $86.5 \pm 5.34$ |
| | | DARNet (ours) | $79.5 \pm 5.84$ | $87.8 \pm 6.02$ | $89.9 \pm 5.03$ |
| MM-AAD | audio-only | SSF-CNN* [36] | $56.5 \pm 5.71$ | $57.0 \pm 6.55$ | $57.9 \pm 7.47$ |
| | | MBSSFCC* [15] | $75.3 \pm 9.27$ | $76.5 \pm 9.90$ | $77.0 \pm 9.92$ |
| | | DBPNet* [19] | $91.4 \pm 4.63$ | $92.0 \pm 5.42$ | $92.5 \pm 4.59$ |
| | | DARNet (ours) | $94.9 \pm 4.79$ | $96.0 \pm 4.00$ | $96.5 \pm 3.59$ |
| | audio-visual | SSF-CNN* [36] | $56.6 \pm 3.82$ | $57.2 \pm 5.59$ | $58.2 \pm 6.39$ |
| | | MBSSFCC* [15] | $77.2 \pm 9.01$ | $78.1 \pm 10.1$ | $78.4 \pm 9.57$ |
| | | DBPNet* [19] | $92.1 \pm 4.47$ | $92.8 \pm 5.94$ | $93.4 \pm 4.86$ |
| | | DARNet (ours) | $95.8 \pm 4.04$ | $96.4 \pm 3.72$ | $96.8 \pm 3.44$ |

4.2 Ablation Study

We conducted comprehensive ablation experiments by removing the spatial feature extraction module, the temporal feature extraction module, and the feature fusion module. Additionally, we supplemented our study with ablation experiments using a single-layer attention refinement module on the KUL and DTU datasets, referred to as single-DARNet. All experimental conditions remained the same as in previous settings. The results of the ablation experiments are shown in Table 3.

Experimental results show that on the DTU dataset, after removing the spatial feature extraction module from DARNet, the average accuracy decreased by 10.1% under a 0.1 s decision window, 13.1% under a 1 s decision window, and 12.0% under a 2 s decision window. After removing the temporal feature extraction module, the average accuracy for the 0.1 s, 1 s, and 2 s decision windows decreased by 10.9%, 14.2%, and 11.9%, respectively. After removing the feature fusion module, the average accuracy decreased by 2.5% under a 0.1 s decision window, 1.6% under a 1 s decision window,
Table 3: Ablation study on the KUL, DTU, and MM-AAD datasets.

| Dataset | Scene | Model | 0.1-second | 1-second | 2-second |
| :---: | :---: | :---: | :---: | :---: | :---: |
| KUL | audio-only | w/o spatial feature | $81.1 \pm 6.51$ | $87.5 \pm 7.24$ | $89.0 \pm 4.93$ |
| | | w/o temporal feature | $81.2 \pm 6.94$ | $86.7 \pm 7.30$ | $89.1 \pm 5.37$ |
| | | w/o feature fusion | $90.5 \pm 5.09$ | $95.2 \pm 3.58$ | $96.1 \pm 3.46$ |
| | | single-DARNet | $91.1 \pm 5.18$ | $95.5 \pm 3.28$ | $96.2 \pm 2.98$ |
| | | DARNet (ours) | $91.6 \pm 4.83$ | $96.2 \pm 3.04$ | $97.2 \pm 2.50$ |
| DTU | audio-only | w/o spatial feature | $71.5 \pm 5.79$ | $76.3 \pm 7.21$ | $79.1 \pm 6.79$ |
| | | w/o feature fusion | $70.9 \pm 5.62$ | $75.3 \pm 6.62$ | $79.2 \pm 6.84$ |
| | | single-DARNet | $79.1 \pm 7.07$ | $86.3 \pm 6.16$ | $88.6 \pm 5.71$ |
| | | DARNet (ours) | $79.5 \pm 5.84$ | $87.8 \pm 6.02$ | $89.9 \pm 5.03$ |
| MM-AAD | audio-only | w/o spatial feature | $90.0 \pm 5.76$ | $91.0 \pm 5.38$ | $92.5 \pm 4.76$ |
| | | w/o temporal feature | $87.8 \pm 5.66$ | $89.0 \pm 5.20$ | $91.1 \pm 5.31$ |
| | | w/o feature fusion | $94.3 \pm 3.94$ | $94.8 \pm 4.05$ | $95.7 \pm 4.21$ |
| | | DARNet (ours) | $94.9 \pm 4.79$ | $96.0 \pm 4.00$ | $96.5 \pm 3.59$ |
| MM-AAD | audio-visual | w/o spatial feature | $90.4 \pm 5.99$ | $91.2 \pm 5.56$ | $93.1 \pm 5.36$ |
| | | w/o temporal feature | $89.4 \pm 6.96$ | $90.5 \pm 6.04$ | $92.1 \pm 5.72$ |
| | | w/o feature fusion | $95.3 \pm 4.57$ | $95.7 \pm 3.88$ | $96.1 \pm 3.96$ |
| | | DARNet (ours) | $95.8 \pm 4.04$ | $96.4 \pm 3.72$ | $96.8 \pm 3.44$ |
and 1.4% under a 2 s decision window. On the KUL dataset and the MM-AAD dataset, removing the aforementioned modules also resulted in similar trends of decreased average accuracy.

Figure 2: AAD accuracy(%) of DARNet across all subjects on KUL, DTU, and MMAAD datasets.

Figure 3: AAD accuracy(%) of the ablation study across all subjects on the DTU dataset.

5 Discussion

5.1 Comparative Analysis

To further evaluate the performance of our proposed DARNet, we compared it with other advanced AAD models, as shown in Table 2. The results indicate that our DARNet has achieved a significant improvement over the current state-of-the-art results.

For example, on the DTU dataset, our DARNet shows relative improvements of 27.2%, 18.8%, 9.7%, and 5.9% for the 0.1-second decision window, compared to the SSF-CNN, MBSSFCC, EEG-Graph Net, and DBPNet models, respectively. Compared to the SSF-CNN, MBSSFCC, BSAnet, EEG-Graph Net, and DBPNet models, the relative improvements reach 25.8%, 16.1%, 5.7%, 11.6%, and 4.6% for the
Table 4: The training parameter counts of our DARNet and three open-source models. “M” denotes a million.

| Model | Trainable parameter count |
| :---: | :---: |
| SSF-CNN [36] | 4.21 M |
| MBSSFCC [15] | 83.91 M |
| DBPNet [19] | 0.91 M |
| DARNet (ours) | **0.08 M** |
1-second decision window, and 22.6%, 14.2%, 5.0%, 13.2%, and 3.9% for the 2-second decision window. On both the KUL and MM-AAD datasets, DARNet has achieved similar improvements compared to the state-of-the-art models. The particularly outstanding results achieved across all three datasets under the 0.1-second decision window indicate the potential of DARNet for real-time decoding of auditory attention.

Overall, the excellent performance of DARNet across different datasets and decision windows demonstrates its robustness and versatility in various contexts. This further validates the potential of DARNet as an effective EEG analysis model and provides strong support for its widespread application in real-world scenarios.

5.2 Ablation Analysis

As shown in Table 3 and Figure 3, DARNet outperforms the variants that remove the spatial feature extraction step, remove the temporal feature extraction step, remove the feature fusion module, or use a single-layer attention refinement module. We believe DARNet performs excellently for the following reasons:
  1. Integrating multiple sources of information: DARNet integrates temporal and spatial distribution features from EEG signals, constructing richer and more robust spatiotemporal features. This enables the model to comprehensively understand the spatiotemporal information within EEG signals, thereby enhancing the understanding of brain activity. In contrast, removing any single feature may lead to information loss or the inability to capture the transient changes in EEG signals, thereby impacting the model’s performance.
    整合多种信息源:DARNet 整合了脑电信号的时间和空间分布特征,构建了更丰富、更稳健的时空特征。这使模型能够全面理解脑电信号中的时空信息,从而增强对大脑活动的理解。相反,去除任何一个特征都可能导致信息丢失或无法捕捉脑电信号的瞬时变化,从而影响模型的性能。
  2. Comprehensive capture of temporal dependencies: The dual attention refinement module and feature fusion module of DARNet comprehensively capture temporal patterns and dependencies at different levels, enabling the model to better understand the temporal dynamics within EEG signals. This holistic consideration of features at different time scales is crucial for the analysis of EEG data.
    全面捕捉时间依赖性:DARNet 的双重注意力细化模块和特征融合模块全面捕捉不同层次的时间模式和依赖关系,使模型能够更好地理解脑电信号的时间动态。这种对不同时间尺度特征的整体考虑对于脑电图数据分析至关重要。
  3. Robust feature representation: Although removing the feature fusion module does not cause a significant drop in accuracy on the three datasets, it substantially increases the variability of DARNet's performance. We believe the feature fusion module integrates temporal patterns and dependencies at different levels, enabling the model to better understand and exploit the complex relationships within the data, thereby enhancing its robustness and generalization.
    稳健的特征表示:尽管在三个数据集中移除特征融合模块并未导致准确率的显著下降,但 DARNet 的性能变异性却大幅增加。我们认为,特征融合模块整合了不同层次的时间模式和依赖关系,使模型能够更好地理解和利用数据中的复杂关系,从而增强了模型的鲁棒性和泛化能力。

5.3 Computational Cost 5.3 计算成本

We compared the training parameter counts of our DARNet, SSF-CNN [36], MBSSFCC [15], and DBPNet [19]; the results are shown in Table 4. The parameter count of DARNet is 51.6 times lower than that of SSF-CNN, 1331.5 times lower than that of MBSSFCC, and about 91% lower than that of DBPNet. Compared to the other models, DARNet thus offers superior parameter efficiency. Despite having far fewer parameters, DARNet maintains good performance, indicating that it can be applied to AAD analysis in resource-constrained environments and demonstrating its practical utility.
我们比较了 DARNet、SSF-CNN [36]、MBSSFCC [15] 和 DBPNet [19]的训练参数数,结果如表 4 所示。DARNet 的参数数是 SSF-CNN 的 51.6 倍,是 MBSSFCC 的 1331.5 倍,是 DBPNet 的 0.91 倍。与其他模型相比,DARNet 的参数效率更高。尽管参数较少,但 DARNet 仍能保持良好的性能,这表明它能够在资源有限的环境中应用于 AAD 分析,从而体现了实用性。
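As a point of reference, trainable parameters can be counted directly from the model object. The snippet below is a minimal sketch assuming a PyTorch implementation; the `DARNet` constructor and its arguments are hypothetical and not taken from the released code.

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> float:
    """Return the number of trainable parameters, in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Hypothetical usage; the constructor arguments depend on the released implementation.
# model = DARNet(num_channels=64, window_length=128)
# print(f"{count_trainable_params(model):.2f} M trainable parameters")
```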

6 Conclusion 6 结论

In this paper, we propose DARNet, a novel dual attention refinement network with spatiotemporal construction for auditory attention detection. By employing spatial convolution operations across all channels, DARNet effectively leverages the spatial information embedded in EEG signals, thereby constructing a more robust spatiotemporal feature representation. Additionally, DARNet integrates dual attention refinement and feature fusion techniques to comprehensively capture temporal patterns and dependencies at various levels, enhancing the model's ability to capture the temporal dynamics within EEG signals.
在本文中,我们提出了用于听觉注意力检测的具有时空结构的新型双重注意力细化网络--DARNet。通过对所有通道进行空间卷积运算,DARNet 有效地利用了脑电信号中蕴含的空间信息,从而构建了更稳健的时空特征。此外,DARNet 还集成了双

We evaluate the performance of DARNet on three datasets: KUL, DTU, and MM-AAD. DARNet achieves a decoding accuracy of 96.2% on the 1-second decision window of the KUL dataset and 87.8% on the 1-second decision window of the DTU dataset, demonstrating significant improvements over current state-of-the-art models. The experimental results validate the effectiveness and efficiency of the DARNet architecture, indicating its potential for practical applications. In future research, we plan to further explore DARNet's performance on cross-subject tasks to verify its generalization and robustness.
我们利用注意力细化和特征融合技术来全面捕捉不同层次的时间模式和依赖关系,从而提高模型捕捉脑电信号中时间动态的能力。我们在三个数据集上评估了 DARNet 的性能:KUL、DTU 和 MM-AAD。在 KUL 数据集的 1 秒决策窗口中,DARNet 实现了 96.2 % 96.2 % 96.2%96.2 \% 的解码精度,在 DTU 数据集的 1 秒决策窗口中,DARNet 实现了 87.8 % 87.8 % 87.8%87.8 \% 的解码精度,与当前最先进的模型相比,DARNet 的性能有了显著提高。实验结果验证了 DARNet 架构的有效性和效率,表明其具有实际应用的潜力。在未来的研究中,我们计划进一步探索 DARNet 在跨主体任务中的表现,以验证其通用性和鲁棒性。
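To make the spatial construction step concrete, the sketch below illustrates one common way to implement a spatial convolution that spans all EEG channels, in the spirit of the description above. It assumes a PyTorch implementation with hypothetical channel counts; the actual DARNet module may differ in its details.

```python
import torch
import torch.nn as nn

class SpatialConv(nn.Module):
    """Illustrative spatial convolution: mixes all EEG channels at every time step."""
    def __init__(self, eeg_channels: int = 64, spatial_filters: int = 16):
        super().__init__()
        # A kernel covering the full channel axis collapses the spatial dimension,
        # yielding `spatial_filters` spatial feature maps over time.
        self.conv = nn.Conv2d(1, spatial_filters, kernel_size=(eeg_channels, 1))
        self.act = nn.ELU()

    def forward(self, eeg: torch.Tensor) -> torch.Tensor:
        # eeg: (batch, eeg_channels, time)
        x = eeg.unsqueeze(1)           # (batch, 1, eeg_channels, time)
        x = self.act(self.conv(x))     # (batch, spatial_filters, 1, time)
        return x.squeeze(2)            # (batch, spatial_filters, time)

# Example: a batch of 1-second windows at 128 Hz with 64 electrodes.
# x = torch.randn(8, 64, 128); print(SpatialConv()(x).shape)  # torch.Size([8, 16, 128])
```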

Acknowledgments and Disclosure of Funding
致谢和资金披露

This work is supported by the STI 2030-Major Projects (No. 2021ZD0201500), the National Natural Science Foundation of China (NSFC) (No. 62201002, 6247077204), the Excellent Youth Foundation of Anhui Scientific Committee (No. 2408085Y034), the Distinguished Youth Foundation of Anhui Scientific Committee (No. 2208085J05), the Special Fund for Key Program of Science and Technology of Anhui Province (No. 202203a07020008), the Open Fund of Key Laboratory of Flight Techniques and Flight Safety, CAAC (No. FZ2022KF15), and Cloud Ginger XR-1.
本研究得到科技创新2030-重大项目(编号:2021ZD0201500)、国家自然科学基金(编号:62201002、6247077204)、安徽省科委优秀青年基金(编号:2408085Y034)、安徽省科委杰出青年基金(编号:2208085J05)、安徽省科技攻关计划专项基金(编号:202203a07020008)、中国民航飞行技术与飞行安全重点实验室开放基金(编号:FZ2022KF15)、云姜XR-1。

References 参考资料

[1] Cong Han, James O’Sullivan, Yi Luo, Jose Herrero, Ashesh D Mehta, and Nima Mesgarani. Speaker-independent auditory attention decoding without access to clean speech sources. Science advances, 5(5):eaav6134, 2019.
[1] Cong Han, James O'Sullivan, Yi Luo, Jose Herrero, Ashesh D Mehta, and Nima Mesgarani.无需访问干净语音源的说话者独立听觉注意力解码。Science advances, 5(5):eaav6134, 2019.

[2] Mohammad Jalilpour Monesi, Bernd Accou, Jair Montoya-Martinez, Tom Francart, and Hugo Van Hamme. An lstm based architecture to relate speech stimulus to eeg. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 941-945. IEEE, 2020.
[2] Mohammad Jalilpour Monesi、Bernd Accou、Jair Montoya-Martinez、Tom Francart 和 Hugo Van Hamme。基于 LSTM 架构的语音刺激与电子脑电图相关性研究。In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 941-945.IEEE, 2020.

[3] Siqi Cai, Enze Su, Longhan Xie, and Haizhou Li. Eeg-based auditory attention detection via frequency and channel neural attention. IEEE Transactions on Human-Machine Systems, 52(2): 256-266, 2021.
[3] 蔡思齐、苏恩泽、谢龙汉、李海洲。通过频率和通道神经注意进行基于 EEG 的听觉注意检测。IEEE Transactions on Human-Machine Systems, 52(2):256-266, 2021.

[4] E Colin Cherry. Some experiments on the recognition of speech, with one and with two ears. The Journal of the acoustical society of America, 25(5):975-979, 1953.
[4] E Colin Cherry.单耳和双耳识别语音的一些实验。美国声学学会杂志,25(5):975-979,1953 年。

[5] Nima Mesgarani and Edward F Chang. Selective cortical representation of attended speaker in multi-talker speech perception. Nature, 485(7397):233-236, 2012.
[5] Nima Mesgarani 和 Edward F Chang.多说话者语音感知中出席说话者的选择性皮层表征。自然》,485(7397):233-236,2012.

[6] Nai Ding and Jonathan Z Simon. Neural coding of continuous speech in auditory cortex during monaural and dichotic listening. Journal of neurophysiology, 107(1):78-89, 2012.
[6] Nai Ding 和 Jonathan Z Simon.单耳和二分听时听觉皮层对连续语音的神经编码。神经生理学杂志》,107(1):78-89,2012.

[7] Sahar Akram, Jonathan Z Simon, and Behtash Babadi. Dynamic estimation of the auditory temporal response function from meg in competing-speaker environments. IEEE Transactions on Biomedical Engineering, 64(8):1896-1905, 2016.
[7] Sahar Akram, Jonathan Z Simon, and Behtash Babadi.竞争扬声器环境下从 meg 动态估计听觉时间响应函数。IEEE Transactions on Biomedical Engineering,64(8):1896-1905,2016.

[8] Inyong Choi, Siddharth Rajaram, Lenny A Varghese, and Barbara G Shinn-Cunningham. Quantifying attentional modulation of auditory-evoked cortical responses from single-trial electroencephalography. Frontiers in human neuroscience, 7:115, 2013.
[8] Inyong Choi, Siddharth Rajaram, Lenny A Varghese, and Barbara G Shinn-Cunningham.从单次脑电图量化听觉诱发皮层反应的注意力调节。人类神经科学前沿》,7:115,2013。

[9] James A O’sullivan, Alan J Power, Nima Mesgarani, Siddharth Rajaram, John J Foxe, Barbara G Shinn-Cunningham, Malcolm Slaney, Shihab A Shamma, and Edmund C Lalor. Attentional selection in a cocktail party environment can be decoded from single-trial eeg. Cerebral cortex, 25(7):1697-1706, 2015.
[9] James A O'sullivan, Alan J Power, Nima Mesgarani, Siddharth Rajaram, John J Foxe, Barbara G Shinn-Cunningham, Malcolm Slaney, Shihab A Shamma, and Edmund C Lalor.鸡尾酒会环境中的注意选择可从单次脑电图中解码。大脑皮层》,25(7):1697-1706,2015。

[10] Faramarz Faghihi, Siqi Cai, and Ahmed A Moustafa. A neuroscience-inspired spiking neural network for eeg-based auditory spatial attention detection. Neural Networks, 152:555-565, 2022.
[10] Faramarz Faghihi、Siqi Cai 和 Ahmed A Moustafa。基于电子脑电图的听觉空间注意力检测的神经科学启发尖峰神经网络。神经网络》,152:555-565,2022 年。

[11] Simon Geirnaert, Tom Francart, and Alexander Bertrand. Riemannian geometry-based decoding of the directional focus of auditory attention using eeg. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1115-1119. IEEE, 2021 .
[11] Simon Geirnaert、Tom Francart 和 Alexander Bertrand。基于黎曼几何的电子脑听觉注意力方向焦点解码。In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1115-1119.IEEE, 2021 .

[12] Ali Aroudi, Daniel Marquardt, and Simon Doclo. Eeg-based auditory attention decoding using steerable binaural superdirective beamformer. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 851-855. IEEE, 2018.
[12] Ali Aroudi、Daniel Marquardt 和 Simon Daclo。使用可转向双耳超指向波束成形器的基于 Eeg 的听觉注意力解码。In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 851-855.IEEE, 2018.

[13] Neetha Das, Simon Van Eyndhoven, Tom Francart, and Alexander Bertrand. Eeg-based attentiondriven speech enhancement for noisy speech mixtures using n-fold multi-channel wiener filters. In 2017 25th European Signal Processing Conference (EUSIPCO), pages 1660-1664. IEEE, 2017.
[13] Neetha Das、Simon Van Eyndhoven、Tom Francart 和 Alexander Bertrand。使用 n 倍多通道维纳滤波器对基于 Eeg 的噪声语音混合物进行注意力驱动语音增强。2017 年第 25 届欧洲信号处理会议(EUSIPCO),第 1660-1664 页。IEEE, 2017.

[14] Enze Su, Siqi Cai, Longhan Xie, Haizhou Li, and Tanja Schultz. Stanet: A spatiotemporal attention network for decoding auditory spatial attention from eeg. IEEE Transactions on Biomedical Engineering, 69(7):2233-2242, 2022.
[14] Enze Su、Siqi Cai、Longhan Xie、Haizhou Li 和 Tanja Schultz。Stanet:从电子脑电图解码听觉空间注意力的时空注意力网络。IEEE Transactions on Biomedical Engineering,69(7):2233-2242,2022.

[15] Yifan Jiang, Ning Chen, and Jing Jin. Detecting the locus of auditory attention based on the spectro-spatial-temporal analysis of eeg. Journal of neural engineering, 19(5):056035, 2022.
[15] 江一帆,陈宁,金晶。基于电子脑电图时空谱分析的听觉注意位置检测。神经工程学报,19(5):056035,2022.

[16] Michael J Crosse, Giovanni M Di Liberto, Adam Bednar, and Edmund C Lalor. The multivariate temporal response function (mtrf) toolbox: a matlab toolbox for relating neural signals to continuous stimuli. Frontiers in human neuroscience, 10:604, 2016.
[16] Michael J Crosse、Giovanni M Di Liberto、Adam Bednar 和 Edmund C Lalor。多变量时间响应函数(mtrf)工具箱:将神经信号与连续刺激相关联的 Matlab 工具箱。人类神经科学前沿》,10:604,2016。

[17] Daniel DE Wong, Søren A Fuglsang, Jens Hjortkjær, Enea Ceolini, Malcolm Slaney, and Alain De Cheveigne. A comparison of regularization methods in forward and backward models for auditory attention decoding. Frontiers in neuroscience, 12:352049, 2018.
[17] Daniel DE Wong, Søren A Fuglsang, Jens Hjortkjær, Enea Ceolini, Malcolm Slaney, and Alain De Cheveigne.用于听觉注意力解码的前向和后向模型正则化方法比较》。神经科学前沿》,12:352049,2018.

[18] Sina Miran, Sahar Akram, Alireza Sheikhattar, Jonathan Z Simon, Tao Zhang, and Behtash Babadi. Real-time tracking of selective auditory attention from m/eeg: A bayesian filtering approach. Frontiers in neuroscience, 12:262, 2018.
[18] Sina Miran、Sahar Akram、Alireza Sheikhattar、Jonathan Z Simon、Tao Zhang 和 Behtash Babadi。从 m/eeg 实时跟踪选择性听觉注意:一种贝叶斯滤波方法。神经科学前沿》,12:262,2018.

[19] Qinke Ni, Hongyu Zhang, Cunhang Fan, Shengbing Pei, Chang Zhou, and Zhao Lv. DBPNet: Dual-branch parallel network with temporal-frequency fusion for auditory attention detection. In IJCAI, 2024.
[19] Qinke Ni,Hongyu Zhang,Cunhang Fan,Shengbing Pei,Chang Zhou,and Zhao Lv. DBPNet:用于听觉注意力检测的时频融合双分支并行网络。In IJCAI, 2024.

[20] Herbert Ramoser, Johannes Muller-Gerking, and Gert Pfurtscheller. Optimal spatial filtering of single trial eeg during imagined hand movement. IEEE transactions on rehabilitation engineering, 8(4):441-446, 2000.
[20] Herbert Ramoser, Johannes Muller-Gerking, and Gert Pfurtscheller.想象手部运动时单次试验电子脑电图的最佳空间过滤。IEEE 康复工程交易,8(4):441-446,2000.

[21] Benjamin Blankertz, Ryota Tomioka, Steven Lemm, Motoaki Kawanabe, and Klaus-Robert Muller. Optimizing spatial filters for robust eeg single-trial analysis. IEEE Signal processing magazine, 25(1):41-56, 2007.
[21] Benjamin Blankertz、Ryota Tomioka、Steven Lemm、Motoaki Kawanabe 和 Klaus-Robert Muller。优化空间滤波器,实现稳健的电子脑单次试验分析。IEEE信号处理杂志,25(1):41-56,2007.

[22] Mahnaz Arvaneh, Cuntai Guan, Kai Keng Ang, and Chai Quek. Optimizing the channel selection and classification accuracy in eeg-based bci. IEEE Transactions on Biomedical Engineering, 58(6):1865-1873, 2011.
[22] Mahnaz Arvaneh、Cuntai Guan、Kai Keng Ang 和 Chai Quek。优化基于电子脑电图的 BCC 中的信道选择和分类精度。IEEE 生物医学工程论文集,58(6):1865-1873,2011.

[23] Navid Mohammadi Foumani, Chang Wei Tan, Geoffrey I Webb, and Mahsa Salehi. Improving position encoding of transformers for multivariate time series classification. Data Mining and Knowledge Discovery, 38(1):22-48, 2024.
[23] Navid Mohammadi Foumani、Chang Wei Tan、Geoffrey I Webb 和 Mahsa Salehi。改进用于多元时间序列分类的变换器位置编码。数据挖掘与知识发现》,38(1):22-48,2024 年。

[24] Mari Riess Jones and Marilyn Boltz. Dynamic attending and responses to time. Psychological review, 96(3):459, 1989.
[24] Mari Riess Jones and Marilyn Boltz.动态关注和对时间的反应》。心理学评论》,96(3):459,1989 年。

[25] Mari Riess Jones, Heather Moynihan, Noah MacKenzie, and Jennifer Puente. Temporal aspects of stimulus-driven attending in dynamic arrays. Psychological science, 13(4):313-319, 2002.
[25] Mari Riess Jones, Heather Moynihan, Noah MacKenzie, and Jennifer Puente.动态阵列中刺激驱动注意力的时间方面。心理科学,13(4):313-319,2002。

[26] Klaus Linkenkaer-Hansen, Vadim V Nikouline, J Matias Palva, and Risto J Ilmoniemi. Longrange temporal correlations and scaling behavior in human brain oscillations. Journal of Neuroscience, 21(4):1370-1377, 2001.
[26] Klaus Linkenkaer-Hansen, Vadim V Nikouline, J Matias Palva, and Risto J Ilmoniemi.人脑振荡中的长程时间相关性和缩放行为。神经科学杂志》,21(4):1370-1377,2001 年。

[27] Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106-11115, 2021.
[27] Haoyi Zhou,Shanghang Zhang,Jieqi Peng,Shuai Zhang,Jianxin Li,Hui Xiong,and Wancai Zhang.信息器:超越长序列时间序列预测的高效转换器。In Proceedings of the AAAI conference on artificial intelligence, volume 35, pages 11106-11115, 2021.

[28] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated residual networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 472-480, 2017.
[28] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser.稀释残差网络。In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 472-480, 2017.

[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
[29] Ashish Vaswani、Noam Shazeer、Niki Parmar、Jakob Uszkoreit、Llion Jones、Aidan N Gomez、Łukasz Kaiser 和 Illia Polosukhin。注意力就是你所需要的一切。神经信息处理系统进展》,2017年30期。

[30] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
[30] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter.通过指数线性单元(elus)实现快速准确的深度网络学习。arXiv preprint arXiv:1511.07289, 2015.

[31] Quan-Sen Sun, Sheng-Gen Zeng, Yan Liu, Pheng-Ann Heng, and De-Shen Xia. A new method of feature fusion and its application in image recognition. Pattern Recognition, 38(12):2437-2448, 2005.
[31] Quan-Sen Sun、Sheng-Gen Zeng、Yan Liu、Pheng-Ann Heng 和 De-Shen Xia.一种新的特征融合方法及其在图像识别中的应用。模式识别》,38(12): 2437 2448 , 2005 2437 2448 , 2005 2437-2448,20052437-2448,2005

[32] Neetha Das, Wouter Biesmans, Alexander Bertrand, and Tom Francart. The effect of headrelated filtering and ear-specific decoding bias on auditory attention detection. Journal of neural engineering, 13(5):056014, 2016.
[32] Neetha Das, Wouter Biesmans, Alexander Bertrand, and Tom Francart.头部相关滤波和特定耳朵解码偏差对听觉注意力检测的影响。神经工程学报》,13(5):056014,2016.

[33] Neetha Das, Tom Francart, and Alexander Bertrand. Auditory attention detection dataset kuleuven. Zenodo, 2019.
[33] Neetha Das, Tom Francart, and Alexander Bertrand.听觉注意力检测数据集 kuleuven.Zenodo, 2019.

[34] Søren Asp Fuglsang, Torsten Dau, and Jens Hjortkjær. Noise-robust cortical tracking of attended speech in real-world acoustic scenes. NeuroImage, 156:435-444, 2017.
[34] Søren Asp Fuglsang, Torsten Dau, and Jens Hjortkjær.真实世界声学场景中被听语音的噪声稳健皮层跟踪。NeuroImage, 156:435-444, 2017.

[35] Søren A Fuglsang, DD Wong, and Jens Hjortkjær. Eeg and audio dataset for auditory attention decoding. Zenodo, 2018.
[35] Søren A Fuglsang, DD Wong, and Jens Hjortkjær.用于听觉注意力解码的脑电图和音频数据集。Zenodo, 2018.

[36] Siqi Cai, Pengcheng Sun, Tanja Schultz, and Haizhou Li. Low-latency auditory spatial attention detection based on spectro-spatial features from eeg. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 5812-5815. IEEE, 2021.
[36] Siqi Cai,Pengcheng Sun,Tanja Schultz,and Haizhou Li.基于电子脑电图光谱空间特征的低延迟听觉空间注意力检测。在 2021 年第 43 届 IEEE 医学与生物学工程学会(EMBC)国际年会上,第 5812-5815 页。IEEE, 2021.

[37] Siqi Cai, Peiwen Li, and Haizhou Li. A bio-inspired spiking attentional neural network for attentional selection in the listening brain. IEEE Transactions on Neural Networks and Learning Systems, 2023.
[37] Siqi Cai, Peiwen Li, and Haizhou Li.听觉大脑中用于注意力选择的生物启发尖峰注意力神经网络。IEEE Transactions on Neural Networks and Learning Systems,2023.

[38] Xiran Xu, Bo Wang, Yujie Yan, Xihong Wu, and Jing Chen. A densenet-based method for decoding auditory spatial attention with eeg. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1946-1950. IEEE, 2024.
[38] Xiran Xu,Bo Wang,Yujie Yan,Xihong Wu,and Jing Chen.基于电子脑的听觉空间注意力解码方法。In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1946-1950.IEEE, 2024.

[39] Siqi Cai, Tanja Schultz, and Haizhou Li. Brain topology modeling with eeg-graphs for auditory spatial attention detection. IEEE transactions on bio-medical engineering, 71(1):171-182, 2024.
[39] Siqi Cai、Tanja Schultz 和 Haizhou Li.听觉空间注意力检测中的电子脑图大脑拓扑建模。IEEE 生物医学工程交易,71(1):171-182,2024.

NeurIPS Paper Checklist NeurIPS 论文核对表

1. Claims 1.权利要求

Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
问题摘要和引言中提出的主要主张是否准确反映了论文的贡献和范围?

Answer: [Yes] 请回答:是
Justification: The abstract and introduction of the paper clearly state the main contributions and scope of the research. The claims made are aligned with the theoretical and experimental results presented in the paper.
理由:论文摘要和引言清楚地说明了论文的主要贡献和研究范围。提出的主张与论文中介绍的理论和实验结果一致。

Guidelines: 指导原则:

  • The answer NA means that the abstract and introduction do not include the claims made in the paper.
    答案 NA 表示摘要和引言不包括论文中提出的主张。
  • The abstract and/or introduction should clearly state the claims made, including the contributions made in the paper and important assumptions and limitations. A No or NA answer to this question will not be perceived well by the reviewers.
    摘要和/或引言应清楚地说明提出的主张,包括论文的贡献、重要假设和局限性。如果对该问题的回答是 "否 "或 "不知道",审稿人将无法接受。
  • The claims made should match theoretical and experimental results, and reflect how much the results can be expected to generalize to other settings.
    提出的主张应与理论和实验结果相匹配,并反映出这些结果在多大程度上可望推广到其他环境中。
  • It is fine to include aspirational goals as motivation as long as it is clear that these goals are not attained by the paper.
    只要明确这些目标不是本文所要达到的,那么将理想目标作为动力也是可以的。

2. Limitations 2.局限性

Question: Does the paper discuss the limitations of the work performed by the authors?
问题论文是否讨论了作者工作的局限性?

Answer: [Yes] 请回答:是

Justification: We outlined the limitations in the conclusion. Current AAD research primarily employs two experimental strategies: subject-dependent and subject-independent. Subject-dependent means that the training and evaluation procedures contain only samples from a single subject, while subject-independent uses samples from all subjects in the dataset. Our proposed model has been validated under the subject-dependent condition and has demonstrated exceptional results. However, further exploration and resolution of the issue of inter-subject variability are necessary to enable our model to be more widely applicable to real-world brain-computer interface applications.
理由:我们在结论中概述了局限性。目前的 AAD 研究主要采用了两种实验策略:主体依赖型和主体独立型。主体依赖型指的是训练和评估程序只包含来自单个主体的样本,而主体独立型则包含数据集中所有主体的样本。我们提出的模型已在依赖主体的条件下进行了验证,并取得了优异的成绩。不过,要使我们的模型更广泛地应用于现实世界的脑机接口应用,还需要进一步探索和解决受试者之间的变异性问题。

Guidelines: 指导原则:

  • The answer NA means that the paper has no limitation while the answer No means that the paper has limitations, but those are not discussed in the paper.
    答案 "NA "表示论文没有局限性,答案 "No "表示论文有局限性,但论文中没有讨论。
  • The authors are encouraged to create a separate “Limitations” section in their paper.
    我们鼓励作者在论文中单独设立一个 "限制 "部分。
  • The paper should point out any strong assumptions and how robust the results are to violations of these assumptions (e.g., independence assumptions, noiseless settings, model well-specification, asymptotic approximations only holding locally). The authors should reflect on how these assumptions might be violated in practice and what the implications would be.
    论文应指出任何强有力的假设,以及结果对违反这些假设的稳健性(如独立性假设、无噪声设置、模型规范、渐近近似仅在局部成立)。作者应思考在实践中如何违反这些假设及其影响。
  • The authors should reflect on the scope of the claims made, e.g., if the approach was only tested on a few datasets or with a few runs. In general, empirical results often depend on implicit assumptions, which should be articulated.
    作者应反思所提出主张的范围,例如,是否仅在少数数据集或少数运行中测试了该方法。一般来说,实证结果往往取决于隐含的假设,这些假设应予以阐明。
  • The authors should reflect on the factors that influence the performance of the approach. For example, a facial recognition algorithm may perform poorly when image resolution is low or images are taken in low lighting. Or a speech-to-text system might not be used reliably to provide closed captions for online lectures because it fails to handle technical jargon.
    作者应思考影响方法性能的因素。例如,当图像分辨率较低或在光线不足的情况下拍摄图像时,面部识别算法的性能可能会很差。或者,语音转文本系统可能无法可靠地为在线讲座提供封闭式字幕,因为它无法处理专业术语。
  • The authors should discuss the computational efficiency of the proposed algorithms and how they scale with dataset size.
    作者应讨论所提算法的计算效率,以及这些算法如何随数据集大小而扩展。
  • If applicable, the authors should discuss possible limitations of their approach to address problems of privacy and fairness.
    如果适用,作者应讨论其解决隐私和公平问题的方法可能存在的局限性。
  • While the authors might fear that complete honesty about limitations might be used by reviewers as grounds for rejection, a worse outcome might be that reviewers discover limitations that aren’t acknowledged in the paper. The authors should use their best
    虽然作者可能会担心审稿人会以完全诚实地说明局限性为由拒绝论文,但更糟糕的结果可能是审稿人发现了论文中没有承认的局限性。作者应尽最大努力

    judgment and recognize that individual actions in favor of transparency play an important role in developing norms that preserve the integrity of the community. Reviewers will be specifically instructed to not penalize honesty concerning limitations.
    判断,并认识到有利于透明度的个人行动在制定维护社区完整性的规范方面发挥着重要作用。我们将特别指示评审员不要惩罚有关限制的诚实行为。

3. Theory Assumptions and Proofs
3.理论假设与证明

Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
问题对于每个理论结果,论文是否提供了全套假设和完整(正确)的证明?
Answer: [Yes] 请回答:是
Justification: All the theorems, formulas, and proofs in the paper have been properly numbered and cross-referenced, fulfilling the guidelines provided.
说明理由:论文中的所有定理、公式和证明都有正确的编号和交叉引用,符合提供的指导原则。

Guidelines: 指导原则:

  • The answer NA means that the paper does not include theoretical results.
    答案 NA 表示论文不包含理论结果。
  • All the theorems, formulas, and proofs in the paper should be numbered and crossreferenced.
    论文中的所有定理、公式和证明都应编号并相互参照。
  • All assumptions should be clearly stated or referenced in the statement of any theorems.
    所有假设都应在定理陈述中明确说明或引用。
  • The proofs can either appear in the main paper or the supplemental material, but if they appear in the supplemental material, the authors are encouraged to provide a short proof sketch to provide intuition.
    证明既可以出现在正文中,也可以出现在补充材料中,但如果出现在补充材料中,我们鼓励作者提供简短的证明草图,以提供直观性。
  • Inversely, any informal proof provided in the core of the paper should be complemented by formal proofs provided in appendix or supplemental material.
    反之,论文核心部分提供的任何非正式证明都应由附录或补充材料中提供的正式证明加以补充。
  • Theorems and Lemmas that the proof relies upon should be properly referenced.
    证明所依据的定理和推理应适当注明出处。

4. Experimental Result Reproducibility
4.实验结果的再现性

Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
问题论文是否充分披露了重现论文主要实验结果所需的所有信息,以至于影响到论文的主要主张和/或结论(无论是否提供代码和数据)?

Answer: [Yes] 请回答:是

Justification: We have detailed our model and experimental setup thoroughly in the Methodology and Experiments sections, providing all necessary information to reproduce the main experimental results.
理由我们在 "方法论 "和 "实验 "部分详细介绍了我们的模型和实验装置,提供了重现主要实验结果的所有必要信息。

Guidelines: 指导原则:

  • The answer NA means that the paper does not include experiments.
    答案 NA 表示论文不包括实验。
  • If the paper includes experiments, a No answer to this question will not be perceived well by the reviewers: Making the paper reproducible is important, regardless of whether the code and data are provided or not.
    如果论文中包含实验,对这个问题的否定回答将不会被评审人看好:无论是否提供代码和数据,使论文具有可重复性都很重要。
  • If the contribution is a dataset and/or model, the authors should describe the steps taken to make their results reproducible or verifiable.
    如果提供的是数据集和/或模型,作者应说明为使其结果可重复或可验证而采取的步骤。
  • Depending on the contribution, reproducibility can be accomplished in various ways. For example, if the contribution is a novel architecture, describing the architecture fully might suffice, or if the contribution is a specific model and empirical evaluation, it may be necessary to either make it possible for others to replicate the model with the same dataset, or provide access to the model. In general. releasing code and data is often one good way to accomplish this, but reproducibility can also be provided via detailed instructions for how to replicate the results, access to a hosted model (e.g., in the case of a large language model), releasing of a model checkpoint, or other means that are appropriate to the research performed.
    根据贡献的不同,可复制性可以通过不同的方式实现。例如,如果贡献是一个新颖的架构,全面描述该架构可能就足够了;如果贡献是一个特定的模型和经验评估,则可能需要让其他人可以用相同的数据集复制该模型,或者提供对该模型的访问权限。一般来说,发布代码和数据通常是实现这一目标的好方法,但也可以通过详细说明如何复制结果、访问托管模型(如大型语言模型)、发布模型检查点或其他适合所开展研究的方法来提供可重复性。
  • While NeurIPS does not require releasing code, the conference does require all submissions to provide some reasonable avenue for reproducibility, which may depend on the nature of the contribution. For example
    虽然 NeurIPS 并不要求发布代码,但会议确实要求所有提交的论文提供一些合理的重现途径,这可能取决于论文的性质。例如

    (a) If the contribution is primarily a new algorithm, the paper should make it clear how to reproduce that algorithm.
    (a) 如果贡献主要是一种新算法,论文应明确说明如何复制该算法。

    (b) If the contribution is primarily a new model architecture, the paper should describe the architecture clearly and fully.
    (b) 如果贡献主要是一个新的模型架构,论文应清楚、全面地描述该架构。

    (c) If the contribution is a new model (e.g., a large language model), then there should either be a way to access this model for reproducing the results or a way to reproduce the model (e.g., with an open-source dataset or instructions for how to construct the dataset).
    如果贡献的是一个新模型(如大型语言模型),则应提供获取该模型以重现结果的方法或重现该模型的方法(如使用开源数据集或如何构建数据集的说明)。

    (d) We recognize that reproducibility may be tricky in some cases, in which case authors are welcome to describe the particular way they provide for reproducibility. In the case of closed-source models, it may be that access to the model is limited in some way (e.g., to registered users), but it should be possible for other researchers to have some path to reproducing or verifying the results.
    (d) 我们认识到,在某些情况下,可重复性可能很棘手,在这种情况下,欢迎作者描述他们 提供可重复性的特殊方式。对于封闭源模型,可能会以某种方式限制对模型的访问(例如,仅限于注册用户),但其他研究人员应该有可能通过某种途径复制或验证结果。

5. Open access to data and code
5.开放数据和代码访问

Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
问题如补充材料所述,论文是否提供了开放的数据和代码访问权限,以及忠实再现主要实验结果的充分说明?

Answer: [Yes] 请回答:是

Justification: The paper includes code as an attachment, facilitating the reproduction of the main experimental results.
理由:论文附有代码,便于复制主要实验结果。

Guidelines: 指导原则:
  • The answer NA means that paper does not include experiments requiring code.
    答案 NA 表示试卷不包括需要代码的实验。
  • Please see the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
    详情请参见 NeurIPS 代码和数据提交指南(https://nips.cc/ public/guides/CodeSubmissionPolicy)。
  • While we encourage the release of code and data, we understand that this might not be possible, so “No” is an acceptable answer. Papers cannot be rejected simply for not including code, unless this is central to the contribution (e.g., for a new open-source benchmark).
    虽然我们鼓励发布代码和数据,但我们也理解这可能无法实现,因此 "否 "是可以接受的答案。论文不能仅仅因为没有包含代码而被拒绝,除非这是论文的核心内容(例如,新的开源基准)。
  • The instructions should contain the exact command and environment needed to run to reproduce the results. See the NeurIPS code and data submission guidelines (https://nips.cc/public/guides/CodeSubmissionPolicy) for more details.
    说明应包含重现结果所需的确切命令和运行环境。详情请参见 NeurIPS 代码和数据提交指南(https: //nips.cc/public/guides/CodeSubmissionPolicy )。
  • The authors should provide instructions on data access and preparation, including how to access the raw data, preprocessed data, intermediate data, and generated data, etc.
    作者应提供数据获取和准备说明,包括如何获取原始数据、预处理数据、中间数据和生成数据等。
  • The authors should provide scripts to reproduce all experimental results for the new proposed method and baselines. If only a subset of experiments are reproducible, they should state which ones are omitted from the script and why.
    作者应提供脚本,以重现新方法和基线的所有实验结果。如果只有一部分实验结果可以重现,作者应说明脚本中省略了哪些实验结果以及省略的原因。
  • At submission time, to preserve anonymity, the authors should release anonymized versions (if applicable).
    在投稿时,为保持匿名性,作者应发布匿名版本(如适用)。
  • Providing as much information as possible in supplemental material (appended to the paper) is recommended, but including URLs to data and code is permitted.
    建议在补充材料(附在论文之后)中提供尽可能多的信息,但允许包含数据和代码的 URL。

6. Experimental Setting/Details
6.实验设置/细节

Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer, etc.) necessary to understand the results?
问题论文是否说明了理解结果所需的所有训练和测试细节(如数据分割、超参数、如何选择、优化器类型等)?

Answer: [Yes] 请回答:是
Justification: We have provided detailed specifications of our experimental settings in the “Methodology” and “Experiments” sections of the paper. This includes descriptions of the data splits, hyperparameters, selection criteria, and the type of optimizer used. The comprehensive documentation of these parameters ensures that our results can be understood and replicated by other researchers.
理由我们在论文的 "方法论 "和 "实验 "部分提供了实验设置的详细说明。其中包括数据分割、超参数、选择标准和所用优化器类型的说明。对这些参数的全面记录确保了其他研究人员能够理解和复制我们的结果。

Guidelines: 指导原则:

  • The answer NA means that the paper does not include experiments.
    答案 NA 表示论文不包括实验。
  • The experimental setting should be presented in the core of the paper to a level of detail that is necessary to appreciate the results and make sense of them.
    在论文的核心部分,应详细介绍实验环境,以便于理解实验结果并使其具有意义。
  • The full details can be provided either with the code, in appendix, or as supplemental material.
    完整的详细信息可以随代码、附录或作为补充材料提供。

7. Experiment Statistical Significance
7.实验统计意义

Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
问题论文是否适当、正确地定义了误差条,或报告了有关实验统计意义的其他适当信息?

Answer: [No]
Justification: The experimental results do not include confidence intervals or statistical significance tests.
理由:实验结果不包括置信区间或统计显著性检验。

Guidelines: 指导原则:
  • The answer NA means that the paper does not include experiments.
    答案 NA 表示论文不包括实验。
  • The authors should answer “Yes” if the results are accompanied by error bars, confidence intervals, or statistical significance tests, at least for the experiments that support the main claims of the paper.
    如果结果附有误差带、置信区间或统计显著性检验,作者应回答 "是",至少对于支持论文主要观点的实验是这样。
  • The factors of variability that the error bars are capturing should be clearly stated (for example, train/test split, initialization, random drawing of some parameter, or overall run with given experimental conditions).
    应明确说明误差条所反映的变异因素(例如,训练/测试分割、初始化、某些参数的随机绘制或在给定实验条件下的整体运行)。
  • The method for calculating the error bars should be explained (closed form formula, call to a library function, bootstrap, etc.)
    应解释误差条的计算方法(封闭式公式、调用库函数、引导等)。
  • The assumptions made should be given (e.g., Normally distributed errors).
    应给出所做的假设(例如,正态分布误差)。
  • It should be clear whether the error bar is the standard deviation or the standard error of the mean.
    应明确误差条是标准差还是平均值的标准误差。
  • It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified.
    报告 1-西格玛误差条是可以的,但应该说明。如果误差正态性假设没有得到验证,作者最好报告 2σ 误差条,而不是说明他们有 96 % CI 96 % CI 96%CI96 \% \mathrm{CI} 误差条。
  • For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates).
    对于非对称分布,作者应注意不要在表格或图中显示对称误差条,以免得出超出范围的结果(如负误差率)。
  • If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.
    如果表格或绘图中报告了误差条,作者应在文中解释误差条的计算方法,并在文中引用相应的图或表。

8. Experiments Compute Resources
8.实验计算资源

Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
问题对于每个实验,论文是否提供了重现实验所需计算机资源(计算工作者类型、内存、执行时间)的足够信息?

Answer: [No]

Justification: Our model contains only 0.08 million trainable parameters, making it lightweight and capable of running on most machines. Therefore, we did not provide detailed specifications of the compute resources required.
理由我们的模型仅包含 78 万个训练参数,因此非常轻便,能够在大多数机器上运行。因此,我们没有提供所需计算资源的详细规格。

Guidelines: 指导原则:

  • The answer NA means that the paper does not include experiments.
    答案 NA 表示论文不包括实验。
  • The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage.
    论文应说明计算工作者 CPU 或 GPU、内部集群或云提供商的类型,包括相关内存和存储。
  • The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute.
    论文应提供每个实验运行所需的计算量,并估算总计算量。
  • The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn’t make it into the paper).
    论文应披露整个研究项目是否需要比论文中报告的实验更多的计算量(例如,未写入论文的初步实验或失败实验)。

9. Code Of Ethics 9.道德守则

Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
问题论文中进行的研究是否在各个方面都符合《NeurIPS 道德准则》https://neurips.cc/public/EthicsGuidelines?

Answer: [Yes] 请回答:是
Justification: Yes, our research adheres to all ethical guidelines required by NeurIPS.
理由:是的,我们的研究符合 NeurIPS 要求的所有伦理准则。

Guidelines: 指导原则:
  • The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.
    答案 NA 表示作者没有阅读过《NeurIPS 职业道德规范》。
  • If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics.
    如果作者回答 "否",则应说明需要偏离《职业道德准则》的特殊情况。
  • The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).
    作者应确保匿名(例如,如果由于其管辖范围内的法律或法规而有特殊考虑)。

10. Broader Impacts 10.更广泛的影响

Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
问题论文是否讨论了工作可能产生的积极社会影响和消极社会影响?

Answer: [NA] 请回答:[不知道]
Justification: Our research does not directly produce societal impacts as it focuses on technical advancements in a specific field without direct societal applications.
理由:我们的研究不会直接产生社会影响,因为它侧重于特定领域的技术进步,没有直接的社会应用。

Guidelines: 指导原则:

  • The answer NA means that there is no societal impact of the work performed.
    答案 NA 表示所从事的工作没有社会影响。
  • If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact.
    如果作者回答 "NA "或 "No",则应解释其工作没有社会影响的原因或论文没有论及社会影响的原因。
  • Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.
    负面社会影响的例子包括潜在的恶意或意外用途(如造谣、生成虚假档案、监视)、公平性考虑(如部署的技术可能做出对特定群体产生不公平影响的决策)、隐私考虑和安全考虑。
  • The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster.
    会议希望许多论文都是基础性研究,与特定应用无关,更不用说部署了。但是,如果有直接通向任何负面应用的途径,作者就应该指出来。例如,指出生成模型质量的提高可被用于生成虚假信息的深度伪造是合理的。另一方面,无需指出优化神经网络的通用算法可以让人们更快地训练生成 Deepfakes 的模型。
  • The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology.
    作者应考虑当技术按预期使用并正常运行时可能产生的危害,当技术按预期使用但产生错误结果时可能产生的危害,以及(有意或无意)滥用技术造成的危害。
  • If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).
    如果存在负面社会影响,作者还可以讨论可能的缓解策略(例如,有限制地发布模型、提供攻击以外的防御措施、监测滥用的机制、监测系统如何随着时间的推移从反馈中学习的机制、提高 ML 的效率和可访问性)。

11. Safeguards 11.保障措施

Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?
问题文件是否描述了为负责任地发布具有高滥用风险的数据或模型(如预训练语言模型、图像生成器或废数据集)而采取的保障措施?

Answer: [NA] 请回答:[不知道]

Justification: This paper poses no such risks.
理由:本文件不存在此类风险。

Guidelines: 指导原则:
  • The answer NA means that the paper poses no such risks.
    答案 NA 表示该论文不存在此类风险。
  • Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters.
    对于滥用或双重用途风险较高的已发布模型,在发布时应采取必要的保障措施,以便对模型的使用进行控制,例如,要求用户遵守使用指南或访问模型的限制,或实施安全过滤。
  • Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images.
    从互联网上抓取的数据集可能会带来安全风险。作者应说明他们是如何避免发布不安全图像的。
  • We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.
    我们认识到,提供有效的保障措施具有挑战性,许多论文并不要求这样做,但我们鼓励作者考虑到这一点并尽最大努力。

12. Licenses for existing assets
12.现有资产的许可证

Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
问题论文中使用的资产(如代码、数据、模型)的创建者或原始所有者是否已适当注明,使用许可和条款是否已明确提及并得到适当尊重?

Answer: [Yes] 请回答:是
Justification: All datasets used in our paper are publicly available datasets, and we have cited the respective literature for each dataset. Any researcher can download these datasets from the provided sources.
理由:我们论文中使用的所有数据集都是可公开获取的数据集,我们已引用了每个数据集的相关文献。任何研究人员都可以从提供的来源下载这些数据集。

Guidelines: 指导原则:

  • The answer NA means that the paper does not use existing assets.
    答案 NA 表示论文不使用现有资产。
  • The authors should cite the original paper that produced the code package or dataset.
    作者应引用产生代码包或数据集的原始论文。
  • The authors should state which version of the asset is used and, if possible, include a URL.
    作者应说明所使用资产的版本,如有可能,还应提供 URL。
  • The name of the license (e.g., CC-BY 4.0) should be included for each asset.
    每项资产都应包含许可名称(如 CC-BY 4.0)。
  • For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided.
    对于从特定来源(如网站)搜刮的数据,应提供该来源的版权和服务条款。
  • If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset.
    如果资产已发布,则应提供软件包中的许可证、版权信息和使用条款。对于流行的数据集,paperswithcode.com/datasets对一些数据集的许可证进行了整理。他们的许可指南有助于确定数据集的许可。
  • For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.
    对于重新打包的现有数据集,应提供原始许可证和衍生资产的许可证(如果已发生变化)。
  • If this information is not available online, the authors are encouraged to reach out to the asset’s creators.
    如果网上没有这些信息,我们鼓励作者联系资产的创建者。

13. New Assets 13.新资产

Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
问题文件中引入的新资产是否有完备的文件,文件是否与资产一起提供?

Answer: [Yes] 请回答:是
Justification: The paper includes the submission of the model’s source code.
理由:论文包括提交模型的源代码。

Guidelines: 指导原则:
  • The answer NA means that the paper does not release new assets.
    答案 NA 表示该文件不释放新资产。
  • Researchers should communicate the details of the dataset/code/model as part of their submissions via structured templates. This includes details about training, license, limitations, etc.
    研究人员应通过结构化模板交流数据集/代码/模型的详细信息,作为提交材料的一部分。这包括培训、许可、限制等详细信息。
  • The paper should discuss whether and how consent was obtained from people whose asset is used.
    论文应讨论是否以及如何征得资产被使用人的同意。
  • At submission time, remember to anonymize your assets (if applicable). You can either create an anonymized URL or include an anonymized zip file.
    提交时,请记住对资产进行匿名处理(如适用)。您可以创建匿名 URL 或包含匿名压缩文件。

14. Crowdsourcing and Research with Human Subjects
14.众包与以人为对象的研究

Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
问题:对于以人为对象的众包实验和研究,论文是否包括给参与者的说明全文和截图(如适用),以及有关补偿的详细信息(如有)?

Answer: [NA] 请回答:[不知道]

Justification: The paper does not involve crowdsourcing nor research with human subjects.
理由:本文不涉及众包,也不涉及以人为对象的研究。

Guidelines: 指导原则:

  • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
    答案 NA 表示论文不涉及众包,也不涉及以人为对象的研究。
  • Including this information in the supplemental material is fine, but if the main contribution of the paper involves human subjects, then as much detail as possible should be included in the main paper.
    在补充材料中包含这些信息是可以的,但如果论文的主要贡献涉及人类受试者,则应在主要论文中包含尽可能多的细节。
  • According to the NeurIPS Code of Ethics, workers involved in data collection, curation, or other labor should be paid at least the minimum wage in the country of the data collector.
    根据《NeurIPS 职业道德准则》,从事数据收集、整理或其他工作的人员的工资至少应达到数据收集者所在国家的最低工资标准。

15. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects
15.机构审查委员会(IRB)对人体研究的批准或同等批准

Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
问题论文是否描述了研究参与者可能面临的风险,是否向受试者披露了这些风险,以及是否获得了机构审查委员会(IRB)的批准(或根据贵国或机构要求的同等批准/审查)?

Answer: [NA] 请回答:[不知道]
Justification: The paper does not involve crowdsourcing nor research with human subjects. Guidelines:
理由:论文不涉及众包,也不涉及以人为对象的研究。指导原则:
  • The answer NA means that the paper does not involve crowdsourcing nor research with human subjects.
    答案 NA 表示论文不涉及众包,也不涉及以人为对象的研究。
  • Depending on the country in which research is conducted, IRB approval (or equivalent) may be required for any human subjects research. If you obtained IRB approval, you should clearly state this in the paper.
    根据研究所在国家的不同,任何以人为对象的研究都可能需要获得 IRB 批准(或同等批准)。如果您已获得 IRB 批准,则应在论文中明确说明。
  • We recognize that the procedures for this may vary significantly between institutions and locations, and we expect authors to adhere to the NeurIPS Code of Ethics and the guidelines for their institution.
    我们认识到,不同机构和地点的程序可能会有很大不同,我们希望作者遵守《NeurIPS 职业道德准则》和所在机构的指导方针。
  • For initial submissions, do not include any information that would break anonymity (if applicable), such as the institution conducting the review.
    对于初次提交的材料,请勿包含任何会破坏匿名性的信息(如适用),如进行评审的机构。

* Equal contribution. *平等贡献。
† Corresponding author. Correspondence to cunhang.fan@ahu.edu.cn.
† 通讯作者。通讯作者:cunhang.fan@ahu.edu.cn