
PhysFormer: Facial Video-based Physiological Measurement with
Temporal Difference Transformer

Zitong Yu1, Yuming Shen2, Jingang Shi3, Hengshuang Zhao4,2, Philip Torr2, Guoying Zhao1

1CMVS, University of Oulu   2TVG, University of Oxford

3Xi’an Jiaotong University   4The University of Hong Kong
Corresponding author
Abstract

Remote photoplethysmography (rPPG), which aims at measuring heart activities and physiological signals from facial video without any contact, has great potential in many applications. Recent deep learning approaches focus on mining subtle rPPG clues using convolutional neural networks with limited spatio-temporal receptive fields, which neglect the long-range spatio-temporal perception and interaction needed for rPPG modeling. In this paper, we propose PhysFormer, an end-to-end video-transformer-based architecture, to adaptively aggregate both local and global spatio-temporal features for rPPG representation enhancement. As key modules in PhysFormer, the temporal difference transformers first enhance the quasi-periodic rPPG features with temporal-difference-guided global attention, and then refine the local spatio-temporal representation against interference. Furthermore, we also propose label distribution learning and a curriculum-learning-inspired dynamic constraint in the frequency domain, which provide elaborate supervision for PhysFormer and alleviate overfitting. Comprehensive experiments are performed on four benchmark datasets to show our superior performance on both intra- and cross-dataset testing. One highlight is that, unlike most transformer networks that require pretraining on large-scale datasets, the proposed PhysFormer can be easily trained from scratch on rPPG datasets, which makes it promising as a novel transformer baseline for the rPPG community. The code is available at https://github.com/ZitongYu/PhysFormer.

1 Introduction

Figure 1: The trajectories of rPPG signals around t1, t2, and t3 share similar properties (e.g., trends with a rising edge first then a falling edge, and relatively high magnitudes) induced by skin color changes. This inspires long-range spatio-temporal attention (e.g., the blue tube around t1 interacting with red tubes within and across frames) according to their local temporal difference features for quasi-periodic rPPG enhancement. Here 'tube' indicates the same region across short-time consecutive frames.

Physiological signals such as heart rate (HR), respiration frequency (RF), and heart rate variability (HRV) are important vital signs to be measured in many circumstances, especially for healthcare or medical purposes. Traditionally, electrocardiography (ECG) and photoplethysmography (PPG) are the two most common ways of measuring heart activities and corresponding physiological signals. However, both ECG and PPG sensors need to be attached to body parts, which may cause discomfort and is inconvenient for long-term monitoring. To counter this issue, remote photoplethysmography (rPPG) [66, 12, 36] methods, which aim to measure heart activity remotely without any contact, have been developing fast in recent years.

In earlier studies of facial rPPG measurement, most methods analyze subtle color changes on facial regions of interest (ROI) with classical signal processing approaches [57, 50, 49, 30, 55]. Besides, there are a few color subspace transformation methods [13, 59] which utilize all skin pixels for rPPG measurement. Based on the prior knowledge from traditional methods, a few learning based approaches [25, 51, 44, 45] are designed in a non-end-to-end fashion. ROI based preprocessed signal representations (e.g., time-frequency maps [25] and spatio-temporal maps [44, 45]) are generated first, and then learnable models capture rPPG features from these maps. However, these methods require a strict preprocessing procedure and neglect the global contextual clues outside the pre-defined ROIs. Meanwhile, more and more end-to-end deep learning based rPPG methods [53, 11, 65, 67, 34] are developed, which take facial video frames as input and predict rPPG and other physiological signals directly. However, pure end-to-end methods are easily influenced by complex scenarios (e.g., head movement and various illumination conditions), and rPPG-unrelated features cannot be ruled out during learning, resulting in a large performance drop [63] on realistic datasets (e.g., VIPL-HR [45]).

Recently, due to its excellent long-range attentional modeling capacity in solving sequence-to-sequence problems, the transformer [32, 22] has been successfully applied in many artificial intelligence tasks such as natural language processing (NLP) [56], image [15] and video [3] analysis. Similarly, rPPG measurement from facial videos can be treated as a video-sequence-to-signal-sequence problem, where long-range contextual clues should be exploited for semantic modeling. As shown in Fig. 1, rPPG clues from different skin regions and temporal locations (e.g., signal trajectories around t1, t2, and t3) share similar properties (e.g., trends with a rising edge first then a falling edge, and relatively high magnitudes), which can be utilized for long-range feature modeling and enhancement. However, different from most video tasks, which aim at large-motion representation, facial rPPG measurement focuses on capturing subtle skin color changes, which makes global spatio-temporal perception challenging. Furthermore, video-based rPPG measurement is usually a long-term monitoring task, and it is challenging to design and train transformers with long video sequence inputs.

Motivated by the discussions above, we propose an end-to-end video transformer architecture, namely PhysFormer, for remote physiological measurement. On one hand, the cascaded temporal difference transformer blocks in PhysFormer benefit rPPG feature enhancement via global spatio-temporal attention based on fine-grained temporal skin color differences. On the other hand, to alleviate the interference-induced overfitting issue and complement the weak temporal supervision signals, elaborate supervision in the frequency domain is designed, which helps PhysFormer learn more intrinsic rPPG-aware features.

The contributions of this work are as follows:

  • We propose PhysFormer, which mainly consists of a powerful video temporal difference transformer backbone. To the best of our knowledge, this is the first work to explore the long-range spatio-temporal relationship for reliable rPPG measurement.

  • We propose an elaborate recipe to supervise PhysFormer with label distribution learning and a curriculum-learning-guided dynamic loss in the frequency domain, which enables efficient learning and alleviates overfitting.


  • We conduct intra- and cross-dataset testing and show that the proposed PhysFormer achieves superior or on-par performance compared with state-of-the-art methods, without pretraining on large-scale datasets such as ImageNet-21K.



2 Related Work

Remote physiological measurement. An early study of rPPG-based physiological measurement was reported in [57]. Plenty of traditional hand-crafted approaches have been developed in this field since then. Selectively merging information from different color channels [50, 49, 30] or different ROIs [27, 30] has proven to be efficient for subtle rPPG signal recovery. To improve the signal-to-noise ratio of the recovered rPPG signals, several signal decomposition methods such as independent component analysis (ICA) [50, 49, 27] and matrix completion [55] have also been proposed. In recent years, deep learning based approaches dominate the field of rPPG measurement due to their strong spatio-temporal representation capabilities. On one hand, facial ROI based spatio-temporal signal maps [44, 47, 40, 46, 41] are developed, which alleviate the interference from non-skin regions. Based on these signal maps, 2D-CNNs are utilized for rPPG feature extraction. On the other hand, end-to-end spatial networks [53, 11] and spatio-temporal models [65, 67, 63, 34, 35, 48, 20] are developed, which recover rPPG signals from the facial video directly. However, previous methods only consider the spatio-temporal rPPG features from adjacent frames and neglect the long-range relationship among quasi-periodic rPPG features.

Figure 2: Framework of PhysFormer. It consists of a shallow stem, a tube tokenizer, several temporal difference transformers, and an rPPG predictor head. The temporal difference transformer is formed from the Temporal Difference Multi-head Self-attention (TD-MHSA) and Spatio-temporal Feed-forward (ST-FF) modules, which enhance the global and local spatio-temporal representations, respectively. 'TDC' is short for temporal difference convolution [63, 69].

Transformer for vision tasks. The transformer [32] was proposed in [56] to model sequential data in the field of NLP. The vision transformer (ViT) [15] was then proposed, feeding a transformer with sequences of image patches for image classification. Many other ViT variants [22, 26, 54, 38, 70, 60, 23, 8, 14] have been proposed since then, achieving promising performance compared with their CNN counterparts for image analysis tasks [6, 74, 24]. Recently, some works introduce the vision transformer for video understanding tasks such as action recognition [1, 16, 42, 21, 39, 4, 3], action detection [73, 37, 58, 62], video super-resolution [5], video inpainting [71, 33], and 3D animation [10, 9]. Some works [42, 21] conduct temporal contextual modeling with transformers based on single-frame features from pretrained 2D networks, while other works [3, 1, 39, 4, 16] mine the spatio-temporal attention via video transformers directly. Most of these works are incompatible with the long-video-sequence (>150 frames) signal regression task. There are two related works [64, 35] using ViT for rPPG feature representation. TransRPPG [64] extracts rPPG features from preprocessed signal maps via ViT for face 3D mask presentation attack detection [68]. Based on temporal shift networks [34, 31], EfficientPhys-T [35] adds several Swin transformer [38] layers for global spatial attention. Different from these two works, the proposed PhysFormer is an end-to-end video transformer, which is able to capture long-range spatio-temporal attentional rPPG features from facial video directly.

3 Methodology

We will first introduce the architecture of PhysFormer in Sec. 3.1, then introduce label distribution learning for rPPG measurement in Sec. 3.2, and at last present the curriculum learning guided dynamic supervision in Sec. 3.3.

3.1 PhysFormer

As illustrated in Fig. 2, PhysFormer consists of a shallow stem $\mathbf{E}_{\text{stem}}$, a tube tokenizer $\mathbf{E}_{\text{tube}}$, $N$ temporal difference transformer blocks $\mathbf{E}^{i}_{\text{trans}}$ ($i=1,...,N$) and an rPPG predictor head. Inspired by the study in [61], we adopt a shallow stem to extract coarse local spatio-temporal features, which benefits fast convergence and clearer subsequent global self-attention. Specifically, the stem is formed by three convolutional blocks with kernel sizes (1x5x5), (3x3x3) and (3x3x3), respectively. Each convolution operator is cascaded with batch normalization (BN), ReLU and MaxPool. Each pooling layer only halves the spatial dimensions. Therefore, given an RGB facial video input $X\in\mathbb{R}^{3\times T\times H\times W}$, the stem output is $X_{\text{stem}}=\mathbf{E}_{\text{stem}}(X)$, where $X_{\text{stem}}\in\mathbb{R}^{D\times T\times H/8\times W/8}$, and $D$, $T$, $W$, $H$ indicate channel, sequence length, width and height, respectively. Then $X_{\text{stem}}$ is partitioned into spatio-temporal tube tokens $X_{\text{tube}}\in\mathbb{R}^{D\times T^{\prime}\times H^{\prime}\times W^{\prime}}$ via the tube tokenizer $\mathbf{E}_{\text{tube}}$. Subsequently, the tube tokens are forwarded through the $N$ temporal difference transformer blocks to obtain the global-local refined rPPG feature $X_{\text{trans}}$, which has the same dimensions as $X_{\text{tube}}$. Finally, the rPPG predictor head temporally upsamples, spatially averages, and projects the feature $X_{\text{trans}}$ to a 1D signal $Y\in\mathbb{R}^{T}$.
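To make the stem concrete, below is a minimal PyTorch sketch of the three convolutional blocks; the intermediate channel widths and paddings are assumptions (the paper specifies only the kernel sizes, the BN/ReLU/MaxPool ordering, and the overall 8x spatial reduction).

```python
import torch.nn as nn

def make_stem(out_ch=96):
    """Shallow stem sketch: three conv blocks with kernels (1,5,5), (3,3,3),
    (3,3,3), each followed by BN, ReLU and a spatial-only MaxPool, so an input
    of H x W ends up at H/8 x W/8. Intermediate widths are assumptions."""
    def block(cin, cout, k, p):
        return nn.Sequential(
            nn.Conv3d(cin, cout, kernel_size=k, padding=p),
            nn.BatchNorm3d(cout), nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)))   # halves spatial dims only
    return nn.Sequential(
        block(3, out_ch // 4, (1, 5, 5), (0, 2, 2)),
        block(out_ch // 4, out_ch // 2, (3, 3, 3), (1, 1, 1)),
        block(out_ch // 2, out_ch, (3, 3, 3), (1, 1, 1)))

# x: (B, 3, T, H, W)  ->  make_stem()(x): (B, 96, T, H/8, W/8)
```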

Tube tokenization. Here the coarse feature $X_{\text{stem}}$ would be partitioned into non-overlapping tube tokens via $\mathbf{E}_{\text{tube}}(X_{\text{stem}})$, which aggregates the spatio-temporal neighbor semantics and reduces computational costs for the subsequent transformers. Specifically, with the targeted tube size $T_{s}\times H_{s}\times W_{s}$ (the same as the partition step size in the non-overlapping setting), the tube token map $X_{\text{tube}}\in\mathbb{R}^{D\times T^{\prime}\times H^{\prime}\times W^{\prime}}$ has length, height and width

$$T^{\prime}=\left\lfloor\frac{T}{T_{s}}\right\rfloor,\quad H^{\prime}=\left\lfloor\frac{H/8}{H_{s}}\right\rfloor,\quad W^{\prime}=\left\lfloor\frac{W/8}{W_{s}}\right\rfloor. \qquad (1)$$

Please note that there are no position embeddings after the tube tokenization, as the stem at the early stage already captures relative spatio-temporal positions.
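As an illustration, the non-overlapping tube partition with the default 4x4x4 tube size can be realized as a strided 3D convolution whose kernel equals its stride; the choice of a convolutional (rather than reshaping) tokenizer is an assumption for this sketch.

```python
import torch.nn as nn

# Tube tokenizer sketch: kernel size == stride gives non-overlapping tubes,
# so Eq. (1) yields T' = T // Ts, H' = (H/8) // Hs, W' = (W/8) // Ws.
tube_tokenizer = nn.Conv3d(96, 96, kernel_size=(4, 4, 4), stride=(4, 4, 4))

# x_stem: (B, 96, 160, 16, 16)  ->  x_tube: (B, 96, 40, 4, 4), i.e. 40x4x4 tokens
```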

Temporal difference multi-head self-attention. In the self-attention mechanism [56, 15], the relationship between the tokens is modeled by the similarity between the projected query-key pairs, yielding the attention score. Instead of point-wise linear projection, we utilize temporal difference convolution (TDC) [63, 69] for query ($Q$) and key ($K$) projection, which could capture fine-grained local temporal difference features for subtle color change description. TDC with learnable $w$ can be formulated as

$$\mathrm{TDC}(x)=\underbrace{\sum_{p_{n}\in\mathcal{R}}w(p_{n})\cdot x(p_{0}+p_{n})}_{\text{vanilla 3D convolution}}+\theta\cdot\underbrace{\Big(-x(p_{0})\cdot\sum_{p_{n}\in\mathcal{R}^{\prime}}w(p_{n})\Big)}_{\text{temporal difference term}}, \qquad (2)$$

where $p_{0}$, $\mathcal{R}$ and $\mathcal{R}^{\prime}$ indicate the current spatio-temporal location, the sampled local (3x3x3) neighborhood, and the sampled adjacent neighborhood, respectively. Then query and key are projected as

$$Q=\mathrm{BN}(\mathrm{TDC}(X_{\text{tube}})),\quad K=\mathrm{BN}(\mathrm{TDC}(X_{\text{tube}})). \qquad (3)$$
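For clarity, here is a minimal PyTorch sketch of TDC (Eq. 2) as used for the query/key projections in Eq. (3); treating the adjacent neighborhood $\mathcal{R}^{\prime}$ as the two temporally adjacent planes of a 3x3x3 kernel is an assumption in this sketch, and the class name is ours.

```python
import torch.nn as nn
import torch.nn.functional as F

class TDC(nn.Module):
    """Temporal Difference Convolution (Eq. 2), a sketch. The adjacent
    neighborhood R' is assumed to be the t-1 and t+1 planes of the kernel."""
    def __init__(self, in_ch, out_ch, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        self.theta = theta

    def forward(self, x):
        out_vanilla = self.conv(x)                    # vanilla 3D convolution
        # Temporal difference term: -x(p0) * sum of weights over R',
        # realized as a 1x1x1 convolution with the summed kernel weights.
        w = self.conv.weight                          # (C_out, C_in, 3, 3, 3)
        w_sum = w[:, :, [0, 2], :, :].sum(dim=(2, 3, 4))[:, :, None, None, None]
        out_diff = F.conv3d(x, w_sum)
        return out_vanilla - self.theta * out_diff

# Query/key projection of Eq. (3): Q = BN(TDC(X_tube)), K = BN(TDC(X_tube)).
```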

For the value ($V$) projection, point-wise linear projection without BN is utilized. Then $Q,K,V\in\mathbb{R}^{D\times T^{\prime}\times H^{\prime}\times W^{\prime}}$ are flattened into sequences and separated into $h$ heads ($D_{h}=D/h$ for each head). For the $i$-th head ($i\leq h$), the self-attention (SA) can be formulated as

$$\mathrm{SA}_{i}=\mathrm{Softmax}(Q_{i}K^{T}_{i}/\tau)V_{i}, \qquad (4)$$

where $\tau$ controls the sparsity. We find that the default setting $\tau=\sqrt{D_{h}}$ in [56, 15] performs poorly for rPPG measurement. According to the periodicity of rPPG features, we use a smaller $\tau$ value to obtain sparser attention activation. The corresponding study can be found in Table 6. The output of TD-MHSA is the concatenation of the SA from all heads, followed by a linear projection $U\in\mathbb{R}^{D\times D}$:

$$\text{TD-MHSA}=\mathrm{Concat}(\mathrm{SA}_{1};\mathrm{SA}_{2};...;\mathrm{SA}_{h})U. \qquad (5)$$

As illustrated in Fig. 2, a residual connection and layer normalization (LN) are applied after TD-MHSA.
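Putting Eqs. (3)-(5) together, a possible sketch of TD-MHSA is given below; the tensor layout and module composition (TDC followed by BN for Q/K, a point-wise linear layer for V) follow the description above, while the exact shapes and the handling of the residual connection and LN outside this module are assumptions.

```python
import torch.nn as nn

class TDMHSA(nn.Module):
    """Temporal Difference Multi-Head Self-Attention (Eqs. 3-5), a sketch.
    `TDC` is the temporal difference convolution sketched above; tau = 2.0
    replaces the usual sqrt(D_h) temperature."""
    def __init__(self, dim=96, heads=4, theta=0.7, tau=2.0):
        super().__init__()
        self.heads, self.dh, self.tau = heads, dim // heads, tau
        self.to_q = nn.Sequential(TDC(dim, dim, theta), nn.BatchNorm3d(dim))
        self.to_k = nn.Sequential(TDC(dim, dim, theta), nn.BatchNorm3d(dim))
        self.to_v = nn.Linear(dim, dim)     # point-wise projection, no BN
        self.proj = nn.Linear(dim, dim)     # U in Eq. (5)

    def forward(self, x):                   # x: (B, D, T', H', W')
        B, D, T, H, W = x.shape
        q = self.to_q(x).flatten(2).transpose(1, 2)          # (B, N, D)
        k = self.to_k(x).flatten(2).transpose(1, 2)
        v = self.to_v(x.flatten(2).transpose(1, 2))
        # split into heads: (B, h, N, D_h)
        q, k, v = [t.reshape(B, -1, self.heads, self.dh).transpose(1, 2)
                   for t in (q, k, v)]
        attn = (q @ k.transpose(-2, -1) / self.tau).softmax(dim=-1)  # Eq. (4)
        out = (attn @ v).transpose(1, 2).reshape(B, -1, D)           # concat heads
        out = self.proj(out)                                         # Eq. (5)
        return out.transpose(1, 2).reshape(B, D, T, H, W)
```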

Spatio-temporal feed-forward. The vanilla feed-forward network consists of two linear transformation layers, where the hidden dimension $D^{\prime}$ between the two layers is expanded to learn a richer feature representation. In contrast, we introduce a depthwise 3D convolution (with BN and nonlinear activation) between these two layers, with slight extra computational cost but remarkable performance improvement. The benefits are two-fold: 1) as a complement to TD-MHSA, ST-FF could refine the local inconsistency and parts of the noisy features; 2) richer locality provides TD-MHSA with sufficient relative position cues.
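A possible realization of ST-FF is sketched below, assuming a 3x3x3 depthwise kernel and an ELU nonlinearity (neither is specified above).

```python
import torch.nn as nn

class STFeedForward(nn.Module):
    """Spatio-temporal feed-forward sketch: a depthwise 3D convolution with BN
    and a nonlinearity between the two point-wise expansion/projection layers."""
    def __init__(self, dim=96, hidden=144):
        super().__init__()
        self.expand = nn.Conv3d(dim, hidden, kernel_size=1)
        self.dwconv = nn.Sequential(
            nn.Conv3d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),
            nn.BatchNorm3d(hidden), nn.ELU())
        self.project = nn.Conv3d(hidden, dim, kernel_size=1)

    def forward(self, x):                   # x: (B, D, T', H', W')
        return self.project(self.dwconv(self.expand(x)))
```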

3.2 Label Distribution Learning

Similar to the facial age estimation task [19, 18], where faces at close ages look quite similar, facial rPPG signals with close HR values usually have similar periodicity. Inspired by this observation, instead of considering each facial video as an instance with one label (HR), we regard each facial video as an instance associated with a label distribution. The label distribution covers a certain number of class labels, representing the degree to which each label describes the instance. In this way, one facial video can contribute to both the targeted HR value and its adjacent HRs.

To consider the similarity information among HR classes during the training stage, we model the rPPG-based HR estimation problem as a specific $L$-class multi-label classification problem, where $L=139$ in our case (each integer HR value within [42, 180] bpm is a class). A label distribution $\mathbf{p}=\{p_{1},p_{2},...,p_{L}\}\in\mathbb{R}^{L}$ is assigned to each facial video $X$. It is assumed that each entry of $\mathbf{p}$ is a real value in the range [0,1] such that $\sum_{k=1}^{L}p_{k}=1$. We consider the Gaussian distribution function, centred at the ground truth HR label $Y_{\text{HR}}$ with standard deviation $\sigma$, to construct the corresponding label distribution $\mathbf{p}$:

$$p_{k}=\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(k-(Y_{\text{HR}}-41))^{2}}{2\sigma^{2}}\right). \qquad (6)$$

The label distribution loss can be formulated as $\mathcal{L}_{\text{LD}}=\mathrm{KL}(\mathbf{p},\mathrm{Softmax}(\mathbf{\hat{p}}))$, where $\mathrm{KL}(\cdot)$ denotes the Kullback-Leibler (KL) divergence [17], and $\mathbf{\hat{p}}$ is the power spectral density (PSD) of the predicted rPPG signal.
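A minimal sketch of building the Gaussian label distribution of Eq. (6) and computing $\mathcal{L}_{\text{LD}}$ is shown below; the discrete distribution is renormalized to sum to 1, and the predicted PSD is assumed to be already binned to the 139 integer HR classes.

```python
import torch
import torch.nn.functional as F

def label_distribution_loss(psd_pred, hr_gt, sigma=1.0, num_classes=139):
    """L_LD sketch. psd_pred: (num_classes,) PSD of the predicted rPPG signal,
    one entry per integer HR class in [42, 180] bpm; hr_gt: ground-truth HR."""
    k = torch.arange(1, num_classes + 1, dtype=torch.float32)   # classes 1..139
    # Gaussian label distribution centred at the ground-truth class (Eq. 6),
    # discretized and renormalized so that it sums to 1.
    p = torch.exp(-(k - (hr_gt - 41)) ** 2 / (2 * sigma ** 2))
    p = p / p.sum()
    log_q = F.log_softmax(psd_pred, dim=-1)      # Softmax(p_hat), in log space
    return F.kl_div(log_q, p, reduction='sum')   # KL(p || Softmax(p_hat))
```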

Please note that the previous work [43] also considers distribution learning for HR estimation. However, it is totally different from our work: 1) the motivation in [43] is to smooth the temporal HR outliers caused by facial movements across continuous video clips, while our work is more generic, aiming at efficient feature learning across adjacent labels under limited-scale training data; 2) the technique in [43] is applied as a post-processing step after HR estimation from handcrafted rPPG signals, while our work designs a reasonable supervision signal $\mathcal{L}_{\text{LD}}$ for PhysFormer.

3.3 Curriculum Learning Guided Dynamic Loss

Curriculum learning [2], a major machine learning regime with an easy-to-hard curriculum philosophy, is utilized to train PhysFormer. In the rPPG measurement task, supervision signals from the temporal domain (e.g., mean square error loss [11], negative Pearson loss [65, 67]) and the frequency domain (e.g., cross-entropy loss [46, 63], signal-to-noise ratio loss [53]) provide different extents of constraints for model learning. The former gives signal-trend-level constraints, which is straightforward and easy for model convergence but leads to overfitting afterwards. In contrast, the latter puts strong constraints on the frequency domain and enforces the model to learn periodic features within the target frequency bands, which is hard to converge well due to realistic rPPG-irrelevant noise. Inspired by curriculum learning, we propose dynamic supervision to gradually enlarge the frequency constraints, which alleviates the overfitting issue and gradually benefits intrinsic rPPG-aware feature learning. Specifically, an exponential increment strategy is adopted, and a comparison with other dynamic strategies (e.g., linear increment) is shown in Table 7. The dynamic loss $\mathcal{L}_{\text{overall}}$ can be formulated as

$$\mathcal{L}_{\text{overall}}=\underbrace{\alpha\cdot\mathcal{L}_{\text{time}}}_{\text{temporal}}+\underbrace{\beta\cdot(\mathcal{L}_{\text{CE}}+\mathcal{L}_{\text{LD}})}_{\text{frequency}},\quad \beta=\beta_{0}\cdot\eta^{(\text{Epoch}_{\text{current}}-1)/\text{Epoch}_{\text{total}}}, \qquad (7)$$

where the hyperparameters $\alpha$, $\beta_{0}$ and $\eta$ equal 0.1, 1.0 and 5.0, respectively. The negative Pearson loss [65, 67] and the frequency cross-entropy loss [46, 63] are adopted as $\mathcal{L}_{\text{time}}$ and $\mathcal{L}_{\text{CE}}$, respectively. With the dynamic supervision, PhysFormer perceives the signal trend better at the beginning, and such a warm-up facilitates the gradually stronger frequency-domain knowledge learning later.
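The schedule in Eq. (7) amounts to a few lines of code; the sketch below uses the hyperparameter values above as defaults.

```python
def overall_loss(l_time, l_ce, l_ld, epoch, total_epochs,
                 alpha=0.1, beta0=1.0, eta=5.0):
    """Curriculum-guided dynamic loss (Eq. 7), a sketch: the temporal weight
    alpha stays fixed while beta grows exponentially from beta0 towards
    roughly beta0 * eta over the course of training."""
    beta = beta0 * (eta ** ((epoch - 1) / total_epochs))
    return alpha * l_time + beta * (l_ce + l_ld)
```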

4 Experimental Evaluation

Experiments on rPPG-based physiological measurement of three types of physiological signals, i.e., heart rate (HR), heart rate variability (HRV), and respiration frequency (RF), are conducted on four benchmark datasets (VIPL-HR [45], MAHNOB-HCI [52], MMSE-HR [55], and OBF [29]).

4.1 Datasets and Performance Metrics

VIPL-HR [45] is a large-scale dataset for remote physiological measurement under less-constrained scenarios. It contains 2,378 RGB videos of 107 subjects recorded with different head movements, lighting conditions and acquisition devices. MAHNOB-HCI [52] is one of the most widely used benchmarks for remote HR measurement evaluation. It includes 527 facial videos with 61 fps frame rate and 780x580 resolution from 27 subjects. MMSE-HR [55] is a dataset of 102 RGB videos from 40 subjects, with a raw resolution of 1040x1392 for each video. OBF [29] is a high-quality dataset for remote physiological signal measurement. It contains 200 five-minute-long RGB videos with 60 fps frame rate recorded from 100 healthy adults.

The average HR estimation task is evaluated on all four datasets, while the HRV and RF estimation tasks are evaluated on the high-quality OBF [29] dataset. Specifically, we follow existing methods [67, 46, 41] and report low frequency (LF), high frequency (HF), and the LF/HF ratio for HRV and RF estimation. We report the most commonly used performance metrics for evaluation, including the standard deviation (SD), mean absolute error (MAE), root mean square error (RMSE), and Pearson's correlation coefficient ($r$).
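For reference, these metrics can be computed over video-level HR estimates as sketched below; interpreting SD as the standard deviation of the estimation error is an assumption, following common rPPG practice.

```python
import numpy as np

def hr_metrics(hr_pred, hr_gt):
    """SD, MAE, RMSE and Pearson's r for HR estimates, a sketch."""
    hr_pred, hr_gt = np.asarray(hr_pred, float), np.asarray(hr_gt, float)
    err = hr_pred - hr_gt
    sd = err.std()                           # std of the estimation error
    mae = np.abs(err).mean()
    rmse = np.sqrt((err ** 2).mean())
    r = np.corrcoef(hr_pred, hr_gt)[0, 1]
    return sd, mae, rmse, r
```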

4.2 Implementation Details

Our proposed method is implemented with PyTorch. For each video clip, we use the MTCNN face detector [72] to crop the enlarged face area in the first frame and fix the region through the following frames. The videos in MAHNOB-HCI and OBF are downsampled to 30 fps for efficiency. The settings $N$=12, $h$=4, $D$=96, $D^{\prime}$=144 are used for PhysFormer, while $\theta$=0.7 and $\tau$=2.0 for TD-MHSA. The targeted tube size $T_{s}\times H_{s}\times W_{s}$ is 4x4x4. In the training stage, we randomly sample RGB face clips of size 160x128x128 ($T\times H\times W$) as model inputs. Random horizontal flipping and temporal up/down-sampling [63] are used for data augmentation. PhysFormer is trained with the Adam optimizer, with an initial learning rate of 1e-4 and weight decay of 5e-5. We find no obvious performance improvement using the AdamW optimizer. We train models for 25 epochs with a fixed setting $\alpha$=0.1 for the temporal loss and an exponentially increased parameter $\beta\in[1,5]$ for the frequency losses. We set $\sigma$=1.0 for label distribution learning. The batch size is 4 on one 32G V100 GPU. In the testing stage, similar to [45], we uniformly separate 30-second videos into three 10-second clips, and the video-level HR is calculated by averaging the HRs from the three short clips.
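A sketch of this test-time protocol is shown below; the periodogram-based per-clip HR extraction within [42, 180] bpm is an assumption for illustration, not necessarily the exact estimator used.

```python
import numpy as np

def video_level_hr(rppg_30s, fs=30):
    """Split a 30 s predicted rPPG signal into three 10 s clips, take the PSD
    peak within [42, 180] bpm per clip, and average the three clip-level HRs."""
    clip_len, hrs = 10 * fs, []
    for i in range(3):
        clip = rppg_30s[i * clip_len:(i + 1) * clip_len]
        freqs = np.fft.rfftfreq(len(clip), d=1.0 / fs) * 60.0   # in bpm
        psd = np.abs(np.fft.rfft(clip)) ** 2
        band = (freqs >= 42) & (freqs <= 180)
        hrs.append(freqs[band][np.argmax(psd[band])])
    return float(np.mean(hrs))
```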

Table 1: Intra-dataset testing results on VIPL-HR [45]. The symbols ▲, ◆ and ★ denote traditional, non-end-to-end learning based and end-to-end learning based methods, respectively. Best results are marked in bold and second best in underline.
| Method | SD ↓ (bpm) | MAE ↓ (bpm) | RMSE ↓ (bpm) | r ↑ |
|---|---|---|---|---|
| Tulyakov2016 [55] ▲ | 18.0 | 15.9 | 21.0 | 0.11 |
| POS [59] ▲ | 15.3 | 11.5 | 17.2 | 0.30 |
| CHROM [13] ▲ | 15.1 | 11.4 | 16.9 | 0.28 |
| RhythmNet [45] ◆ | 8.11 | 5.30 | 8.14 | 0.76 |
| ST-Attention [47] ◆ | 7.99 | 5.40 | 7.99 | 0.66 |
| NAS-HR [40] ◆ | 8.10 | 5.12 | 8.01 | 0.79 |
| CVD [46] ◆ | 7.92 | 5.02 | 7.97 | 0.79 |
| Dual-GAN [41] ◆ | 7.63 | 4.93 | 7.68 | 0.81 |
| I3D [7] ★ | 15.9 | 12.0 | 15.9 | 0.07 |
| PhysNet [65] ★ | 14.9 | 10.8 | 14.8 | 0.20 |
| DeepPhys [11] ★ | 13.6 | 11.0 | 13.8 | 0.11 |
| AutoHR [63] ★ | 8.48 | 5.68 | 8.68 | 0.72 |
| PhysFormer (Ours) ★ | 7.74 | 4.97 | 7.79 | 0.78 |
Table 2: Performance comparison of HR and RF measurement as well as HRV analysis on OBF [29].
| Method | HR (bpm) SD / RMSE / r | RF (Hz) SD / RMSE / r | LF (u.n.) SD / RMSE / r | HF (u.n.) SD / RMSE / r | LF/HF SD / RMSE / r |
|---|---|---|---|---|---|
| ROI_green [29] ▲ | 2.159 / 2.162 / 0.99 | 0.078 / 0.084 / 0.321 | 0.22 / 0.24 / 0.573 | 0.22 / 0.24 / 0.573 | 0.819 / 0.832 / 0.571 |
| CHROM [13] ▲ | 2.73 / 2.733 / 0.98 | 0.081 / 0.081 / 0.224 | 0.199 / 0.206 / 0.524 | 0.199 / 0.206 / 0.524 | 0.83 / 0.863 / 0.459 |
| POS [59] ▲ | 1.899 / 1.906 / 0.991 | 0.07 / 0.07 / 0.44 | 0.155 / 0.158 / 0.727 | 0.155 / 0.158 / 0.727 | 0.663 / 0.679 / 0.687 |
| CVD [46] ◆ | 1.257 / 1.26 / 0.996 | 0.058 / 0.058 / 0.606 | 0.09 / 0.09 / 0.914 | 0.09 / 0.09 / 0.914 | 0.453 / 0.453 / 0.877 |
| rPPGNet [67] ★ | 1.756 / 1.8 / 0.992 | 0.064 / 0.064 / 0.53 | 0.133 / 0.135 / 0.804 | 0.133 / 0.135 / 0.804 | 0.58 / 0.589 / 0.773 |
| PhysFormer (Ours) ★ | 0.804 / 0.804 / 0.998 | 0.054 / 0.054 / 0.661 | 0.085 / 0.086 / 0.912 | 0.085 / 0.086 / 0.912 | 0.389 / 0.39 / 0.896 |
Table 3: Intra-dataset results on MAHNOB-HCI [52].
| Method | SD ↓ (bpm) | MAE ↓ (bpm) | RMSE ↓ (bpm) | r ↑ |
|---|---|---|---|---|
| Poh2011 [49] ▲ | 13.5 | - | 13.6 | 0.36 |
| CHROM [13] ▲ | - | 13.49 | 22.36 | 0.21 |
| Li2014 [30] ▲ | 6.88 | - | 7.62 | 0.81 |
| Tulyakov2016 [55] ▲ | 5.81 | 4.96 | 6.23 | 0.83 |
| SynRhythm [44] ◆ | 10.88 | - | 11.08 | - |
| RhythmNet [45] ◆ | 3.99 | - | 3.99 | 0.87 |
| HR-CNN [53] ★ | - | 7.25 | 9.24 | 0.51 |
| rPPGNet [67] ★ | 7.82 | 5.51 | 7.82 | 0.78 |
| DeepPhys [11] ★ | - | 4.57 | - | - |
| AutoHR [63] ★ | 4.73 | 3.78 | 5.10 | 0.86 |
| Meta-rPPG [28] ★ | 4.9 | 3.01 | 3.68 | 0.85 |
| PhysFormer (Ours) ★ | 3.87 | 3.25 | 3.97 | 0.87 |

4.3 Intra-dataset Testing

HR estimation on VIPL-HR.  In these experiments, we follow [45] and use a subject-exclusive 5-fold cross-validation protocol on VIPL-HR. As shown in Table 1, all three traditional methods (Tulyakov2016 [55], POS [59] and CHROM [13]) perform poorly due to the complex scenarios (e.g., large head movements and various illumination) in the VIPL-HR dataset. Similarly, the existing end-to-end learning based methods (e.g., PhysNet [65], DeepPhys [11], and AutoHR [63]) predict unreliable HR values with large RMSE compared with non-end-to-end learning approaches (e.g., RhythmNet [45], ST-Attention [47], NAS-HR [40], CVD [46], and Dual-GAN [41]). Such a large performance margin might be caused by the coarse and overfitted rPPG features extracted by the end-to-end models. In contrast, all five non-end-to-end methods first extract fine-grained signal maps from multiple facial ROIs, and then more dedicated rPPG clues are extracted via the cascaded models. Without the strict and heavy preprocessing procedures used in [45, 47, 40, 46, 41], our proposed PhysFormer can be trained from scratch on facial videos directly, and achieves performance comparable with the state-of-the-art non-end-to-end learning based method Dual-GAN [41]. This indicates that PhysFormer is able to learn intrinsic and periodic rPPG-aware features automatically.

HR estimation on MAHNOB-HCI.  For the HR estimation task on MAHNOB-HCI, similar to [67], a subject-independent 9-fold cross-validation protocol is adopted. In consideration of the convergence difficulty caused by the low-illumination and highly compressed videos in MAHNOB-HCI, we finetune the VIPL-HR pretrained model on MAHNOB-HCI for a further 15 epochs. The HR estimation results are shown in Table 3. The proposed PhysFormer achieves the lowest SD (3.87 bpm) and highest $r$ (0.87) among the traditional, non-end-to-end learning, and end-to-end learning methods, which indicates the reliability of the rPPG features learned by PhysFormer under sufficient supervision. Our performance is on par with the latest end-to-end learning method Meta-rPPG [28] without transductive adaptation from target frames.

Table 4: Cross-dataset results on MMSE-HR [55].
| Method | SD ↓ (bpm) | MAE ↓ (bpm) | RMSE ↓ (bpm) | r ↑ |
|---|---|---|---|---|
| Li2014 [30] ▲ | 20.02 | - | 19.95 | 0.38 |
| CHROM [13] ▲ | 14.08 | - | 13.97 | 0.55 |
| Tulyakov2016 [55] ▲ | 12.24 | - | 11.37 | 0.71 |
| ST-Attention [47] ◆ | 9.66 | - | 10.10 | 0.64 |
| RhythmNet [45] ◆ | 6.98 | - | 7.33 | 0.78 |
| CVD [46] ◆ | 6.06 | - | 6.04 | 0.84 |
| PhysNet [65] ★ | 12.76 | - | 13.25 | 0.44 |
| TS-CAN [34] ★ | - | 3.85 | 7.21 | 0.86 |
| AutoHR [63] ★ | 5.71 | - | 5.87 | 0.89 |
| EfficientPhys-C [35] ★ | - | 2.91 | 5.43 | 0.92 |
| EfficientPhys-T1 [35] ★ | - | 3.48 | 7.21 | 0.86 |
| PhysFormer (Ours) ★ | 5.22 | 2.84 | 5.36 | 0.92 |

HR, HRV and RF estimation on OBF.  We also conduct experiments for three types of physiological signals, i.e., HR, RF, and HRV measurement on the OBF [29] dataset. Following [67, 46], we use a 10-fold subject-exclusive protocol for all experiments. All the results are shown in Table 2. From the results, we can see that the proposed approach outperforms the existing state-of-the-art traditional (ROI_green [29], CHROM [13], POS [59]) and end-to-end learning (rPPGNet [67]) methods by a large margin on all evaluation metrics for HR, RF and all HRV features. The proposed PhysFormer also gives more accurate estimation in terms of HR, RF, and LF/HF compared with the preprocessed signal map based non-end-to-end learning method CVD [46]. These results indicate that PhysFormer could not only handle the average HR estimation task but also give a promising prediction of the rPPG signal for RF measurement and HRV analysis, which shows its potential in many healthcare applications.

4.4 Cross-dataset Testing

Besides the intra-dataset testing on the VIPL-HR, MAHNOB-HCI, and OBF datasets, we also conduct cross-dataset testing on MMSE-HR [55] following the protocol of [45]. The models trained on VIPL-HR are directly tested on MMSE-HR. All the results of the proposed approach and the state-of-the-art methods are shown in Table 4. It is clear that the proposed PhysFormer generalizes well to the unseen domain. It is worth noting that PhysFormer achieves the lowest SD (5.22 bpm), MAE (2.84 bpm) and RMSE (5.36 bpm) as well as the highest $r$ (0.92) among the traditional, non-end-to-end learning and end-to-end learning based methods, indicating that 1) the predicted HRs are highly correlated with the ground truth HRs, and 2) the model learns domain-invariant intrinsic rPPG-aware features. Compared with the spatio-temporal transformer based EfficientPhys-T1 [35], our proposed PhysFormer is able to predict more accurate physiological signals, which indicates the effectiveness of the long-range spatio-temporal attention.

Table 5: Ablation of tube tokenization in PhysFormer. The three dimensions in tensors indicate length × height × width.
| Inputs | Stem | Feature Size | Tube Size | Token Numbers | RMSE ↓ (bpm) |
|---|---|---|---|---|---|
| 160×128×128 | ✗ | 160×128×128 | 4×32×32 | 40×4×4 | 10.62 |
| 160×128×128 | ✓ | 160×16×16 | 4×4×4 | 40×4×4 | 7.56 |
| 160×96×96 | ✓ | 160×12×12 | 4×4×4 | 40×3×3 | 8.03 |
| 160×128×128 | ✓ | 160×16×16 | 4×16×16 | 40×1×1 | 10.61 |
| 160×128×128 | ✓ | 160×16×16 | 2×4×4 | 80×4×4 | 7.81 |
Table 6: Ablation of TD-MHSA and ST-FF in PhysFormer.
| MHSA | τ | Feed-forward | RMSE (bpm) ↓ |
|---|---|---|---|
| - | - | ST-FF | 9.81 |
| TD-MHSA | √D_h ≈ 4.9 | ST-FF | 9.51 |
| TD-MHSA | 2.0 | ST-FF | 7.56 |
| vanilla MHSA | 2.0 | ST-FF | 10.43 |
| TD-MHSA | 2.0 | vanilla FF | 8.27 |
Table 7: Ablation of the dynamic loss in the frequency domain. The temporal loss $\mathcal{L}_{\text{time}}$ is used with fixed $\alpha$=0.1 here. 'CE' and 'LD' denote cross-entropy and label distribution, respectively.
| Frequency loss | β | Strategy | RMSE (bpm) ↓ |
|---|---|---|---|
| $\mathcal{L}_{\text{CE}}$ + $\mathcal{L}_{\text{LD}}$ | 1.0 | fixed | 8.48 |
| $\mathcal{L}_{\text{CE}}$ + $\mathcal{L}_{\text{LD}}$ | 5.0 | fixed | 8.86 |
| $\mathcal{L}_{\text{CE}}$ + $\mathcal{L}_{\text{LD}}$ | [1.0, 5.0] | linear | 8.37 |
| $\mathcal{L}_{\text{CE}}$ + $\mathcal{L}_{\text{LD}}$ | [1.0, 5.0] | exponential | 7.56 |
| $\mathcal{L}_{\text{CE}}$ | [1.0, 5.0] | exponential | 8.09 |
| $\mathcal{L}_{\text{LD}}$ | [1.0, 5.0] | exponential | 8.21 |
| $\mathcal{L}_{\text{LD}}$ (real distribution) | [1.0, 5.0] | exponential | 8.72 |

4.5 Ablation Study

We also provide the results of ablation studies for HR estimation on Fold-1 of the VIPL-HR [45] dataset.

Impact of tube tokenization. In the default setting of PhysFormer, a shallow stem cascaded with tube tokenization is used. In this ablation, we consider four other tokenization configurations with or without the stem. It can be seen from the first row in Table 5 that the stem helps PhysFormer see better [61]: the RMSE increases dramatically (+3.06 bpm) when the stem is removed. Then we investigate the impacts of the spatial and temporal domains in tube tokenization. It is clear that the result in the fourth row with full spatial projection is quite poor (RMSE=10.61 bpm), indicating the necessity of spatial attention. In contrast, tokenization with a smaller temporal tube size (e.g., 2×4×4) or smaller spatial inputs (e.g., 160×96×96) reduces performance only slightly.

Impact of TD-MHSA and ST-FF. As shown in Table 6, both TD-MHSA and ST-FF play vital roles in PhysFormer. The result in the first row shows that the performance degrades sharply without spatio-temporal attention. Moreover, it can be seen from the last two rows that without TD-MHSA/ST-FF, PhysFormer with vanilla MHSA/FF obtains 10.43/8.27 bpm RMSE. One important finding in this research is that the temperature $\tau$ influences the MHSA a lot. When $\tau=\sqrt{D_{h}}$ as in previous ViTs [15, 1], the predicted rPPG signals are unsatisfactory (RMSE=9.51 bpm). Regularizing $\tau$ with a smaller value enforces sparser spatio-temporal attention, which is effective for the quasi-periodic rPPG task.

Impact of label distribution learning. Besides the temporal loss $\mathcal{L}_{\text{time}}$ and the frequency cross-entropy loss $\mathcal{L}_{\text{CE}}$, the ablations with and without the label distribution loss $\mathcal{L}_{\text{LD}}$ are shown in the last four rows of Table 7. Although $\mathcal{L}_{\text{LD}}$ performs slightly worse (+0.12 bpm RMSE) than $\mathcal{L}_{\text{CE}}$, the best performance is achieved using both losses, indicating the effectiveness of explicit distribution constraints for extreme-frequency interference alleviation and adjacent-label knowledge propagation. It is interesting to find from the last two rows that using the real PSD distribution from ground-truth PPG signals as $\mathbf{p}$ leads to inferior performance, due to the lack of an obvious peak and the partial noise. We can also find from Fig. 4(a) that $\sigma$ values ranging from 0.9 to 1.2 for $\mathcal{L}_{\text{LD}}$ are suitable for achieving good performance.

Figure 3: Testing results of fixed and dynamic frequency supervisions on the Fold-1 of VIPL-HR.

Impact of dynamic supervision. Fig. 3 illustrates the testing performance on Fold-1 of VIPL-HR when training with fixed and dynamic supervision. It is clear that with the exponentially increased frequency loss, the models in the blue curve converge faster and achieve smaller RMSE. We also compare several kinds of fixed and dynamic strategies in Table 7. The results in the first four rows indicate that 1) using a fixed higher $\beta$ leads to poorer performance caused by convergence difficulty; 2) models with an exponentially increased $\beta$ perform better than those using a linear increment.

Impact of $\theta$ and layer/head numbers. The hyperparameter $\theta$ trades off the contribution of local temporal gradient information. As illustrated in Fig. 4(b), PhysFormer achieves smaller RMSE when $\theta$=0.4 and 0.7, indicating the importance of the normalized local temporal difference features for global spatio-temporal attention. We also investigate how the layer and head numbers influence the performance. As shown in Fig. 5(a), with deeper temporal transformer blocks, the RMSE is reduced progressively despite the heavier computational cost. In terms of the impact of head numbers, it is clear from Fig. 5(b) that PhysFormer with four heads performs the best, while fewer heads lead to sharp performance drops.

Figure 4: Impacts of the (a) $\sigma$ in label distribution learning and (b) $\theta$ in TD-MHSA.
Figure 5: Ablation of the (a) layers and (b) heads in PhysFormer.

4.6 Visualization and Discussion

We visualize the attention map from the last TD-MHSA module as well as one example of query-key interaction in Fig. 6. The x and y axes indicate the attention confidence from key and query tube tokens, respectively. From the attention map, we can easily find periodic or quasi-periodic responses along both axes, indicating the periodicity of the intrinsic rPPG features from PhysFormer. To be specific, given the 530th tube token (in blue), located at the forehead (spatial face domain) and at a peak (temporal signal domain), as a query, the corresponding key responses are illustrated along the blue line in the attention map. On one hand, it can be seen from the key responses that the dominant spatial attention focuses on the facial skin regions and discards unrelated background. On the other hand, the temporal locations of the key responses are around peak positions in the predicted rPPG signals. All these patterns are reasonable: 1) the forehead and cheek regions [57] have richer blood volume for rPPG measurement and are also reliable since these regions are less affected by facial muscle movements due to, e.g., facial expressions and talking; and 2) rPPG signals from healthy people are usually periodic.
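For reference, such an attention map can be recovered from the query/key tensors of a head as sketched below (layout as in the TDMHSA sketch of Sec. 3.1); the token index 530 and the 40x4x4 token grid follow the default settings, while the helper name is ours.

```python
import torch

@torch.no_grad()
def attention_of_head(q, k, tau=2.0, head=0):
    """q, k: (B, heads, N, D_h); returns the (N, N) attention map of one head."""
    attn = (q @ k.transpose(-2, -1) / tau).softmax(dim=-1)
    return attn[0, head]

# key_resp = attention_of_head(q, k)[530]      # key responses of query token 530
# key_resp.reshape(40, 4, 4)                   # back to (T', H', W') for viewing
```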

Figure 6: Visualization of the attention map from the 1st head in last TD-MHSA module. Given the 530th tube token in blue as a query, representative key responses are illustrated (the brighter, the more attentive). The predicted downsampled rPPG signals are shown for temporal attention understanding.

However, we also find two limitations of the spatio-temporal attention from Fig. 6. First, there are still some unexpected responses (e.g., continuous query tokens with similar key responses) in the attention map, which might introduce task-irrelevant noise and damage the performance. Second, the temporal attention is not always accurate, and some responses are coarse with phase shifts (e.g., the first vertical dotted line of the rPPG signals at the bottom of Fig. 6).

5 Conclusions and Future Work

In this paper, we propose an end-to-end video transformer architecture, namely PhysFormer, for remote physiological measurement. With the temporal difference transformer and elaborate supervision, PhysFormer achieves superior performance on benchmark datasets. The study of video-transformer-based physiological measurement is still at an early stage. Future directions include: 1) designing more efficient architectures; the proposed PhysFormer has 7.03M parameters and 47.01 GFLOPs, which is unfriendly for mobile deployment; 2) exploring more accurate yet efficient spatio-temporal self-attention mechanisms, especially for long-sequence rPPG monitoring.

Acknowledgment  This work was supported by the Academy of Finland for Academy Professor project EmotionAI (grants 336116, 345122), and ICT 2023 project (grant 328115), by Ministry of Education and Culture of Finland for AI forum project, the National Natural Science Foundation of China (Grant No. 62002283), the EPSRC grant: Turing AI Fellowship: EP/W002981/1, EPSRC/MURI grant EP/N019474/1. We would also like to thank the Royal Academy of Engineering and FiveAI. As well, the authors wish to acknowledge CSC-IT Center for Science, Finland, for computational resources.

References

  • [1] Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, and Cordelia Schmid. Vivit: A video vision transformer. ICCV, 2021.
  • [2] Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In ICML, 2009.
  • [3] Gedas Bertasius, Heng Wang, and Lorenzo Torresani. Is space-time attention all you need for video understanding? arXiv:2102.05095, 2021.
  • [4] Adrian Bulat, Juan-Manuel Perez-Rua, Swathikiran Sudhakaran, Brais Martinez, and Georgios Tzimiropoulos. Space-time mixing attention for video transformer. NeurIPS, 2021.
  • [5] Jiezhang Cao, Yawei Li, Kai Zhang, and Luc Van Gool. Video super-resolution transformer. arXiv:2106.06847, 2021.
  • [6] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
  • [7] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In CVPR, 2017.
  • [8] Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. ICCV, 2021.
  • [9] Haoyu Chen, Hao Tang, Nicu Sebe, and Guoying Zhao. Aniformer: Data-driven 3d animation with transformer. BMVC, 2021.
  • [10] Haoyu Chen, Hao Tang, Zitong Yu, Nicu Sebe, and Guoying Zhao. Geometry-contrastive transformer for generalized 3d pose transfer. In AAAI, 2022.
  • [11] Weixuan Chen and Daniel McDuff. Deepphys: Video-based physiological measurement using convolutional attention networks. In ECCV, 2018.
  • [12] Xun Chen, Juan Cheng, Rencheng Song, Yu Liu, Rabab Ward, and Z Jane Wang. Video-based heart rate measurement: Recent advances and future prospects. IEEE Transactions on Instrumentation and Measurement, 2018.
  • [13] Gerard De Haan and Vincent Jeanne. Robust pulse rate from chrominance-based rppg. IEEE Transactions on Biomedical Engineering, 2013.  
  • [14] Mingyu Ding, Xiaochen Lian, Linjie Yang, Peng Wang, Xiaojie Jin, Zhiwu Lu, and Ping Luo. Hr-nas: Searching efficient high-resolution neural architectures with lightweight transformers. In CVPR, 2021.  
  • [15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.  
  • [16] Haoqi Fan, Bo Xiong, Karttikeya Mangalam, Yanghao Li, Zhicheng Yan, Jitendra Malik, and Christoph Feichtenhofer. Multiscale vision transformers. ICCV, 2021.  
  • [17] Bin-Bin Gao, Chao Xing, Chen-Wei Xie, Jianxin Wu, and Xin Geng. Deep label distribution learning with label ambiguity. IEEE TIP, 2017.  
  • [18] Bin-Bin Gao, Hong-Yu Zhou, Jianxin Wu, and Xin Geng. Age estimation using expectation of label distribution learning. In IJCAI, 2018.  
  • [19]  Xin Geng, Chao Yin, and Zhi-Hua Zhou. Facial age estimation by learning from label distributions. IEEE TPAMI, 2013.  
  • [20]  John Gideon and Simon Stent. The way to my heart is through contrastive learning: Remote photoplethysmography from unlabelled video. In ICCV, 2021.  
  • [21]  Rohit Girdhar, Joao Carreira, Carl Doersch, and Andrew Zisserman. Video action transformer network. In CVPR, 2019.  
  • [22]  Kai Han, Yunhe Wang, Hanting Chen, Xinghao Chen, Jianyuan Guo, Zhenhua Liu, Yehui Tang, An Xiao, Chunjing Xu, Yixing Xu, et al. A survey on visual transformer. arXiv:2012.12556, 2020.  
  • [23]  Kai Han, An Xiao, Enhua Wu, Jianyuan Guo, Chunjing Xu, and Yunhe Wang. Transformer in transformer. arXiv:2103.00112, 2021.  
  • [24]  Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, and Wei Jiang. Transreid: Transformer-based object re-identification. ICCV, 2021.  
  • [25]  Gee-Sern Hsu, ArulMurugan Ambikapathi, and Ming-Shiang Chen. Deep learning with time-frequency representation for pulse estimation from facial videos. In IJCB, 2017.
  • [26]  Salman Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. arXiv:2101.01169, 2021.
  • [27] Antony Lam and Yoshinori Kuno. Robust heart rate measurement from video using select random patches. In ICCV, 2015.
  • [28] Eugene Lee, Evan Chen, and Chen-Yi Lee. Meta-rppg: Remote heart rate estimation using a transductive meta-learner. In ECCV, 2020.
  • [29] Xiaobai Li, Iman Alikhani, Jingang Shi, Tapio Seppanen, Juhani Junttila, Kirsi Majamaa-Voltti, Mikko Tulppo, and Guoying Zhao. The obf database: A large face video database for remote physiological signal measurement and atrial fibrillation detection. In FG, 2018.
  • [30] Xiaobai Li, Jie Chen, Guoying Zhao, and Matti Pietikainen. Remote heart rate measurement from face videos under realistic situations. In CVPR, 2014.
  • [31] Ji Lin, Chuang Gan, and Song Han. Tsm: Temporal shift module for efficient video understanding. In CVPR, 2019.  
  • [32] Tianyang Lin, Yuxin Wang, Xiangyang Liu, and Xipeng Qiu. A survey of transformers. arXiv:2106.04554, 2021.  
  • [33] Rui Liu, Hanming Deng, Yangyi Huang, Xiaoyu Shi, Lewei Lu, Wenxiu Sun, Xiaogang Wang, Jifeng Dai, and Hongsheng Li. Fuseformer: Fusing fine-grained information in transformers for video inpainting. In ICCV, 2021.  
  • [34] Xin Liu, Josh Fromm, Shwetak Patel, and Daniel McDuff. Multi-task temporal shift attention networks for on-device contactless vitals measurement. NeurIPS, 2020.  
  • [35] Xin Liu, Brian L Hill, Ziheng Jiang, Shwetak Patel, and Daniel McDuff. Efficientphys: Enabling simple, fast and accurate camera-based vitals measurement. arXiv:2110.04447, 2021.
  • [36] Xin Liu, Shwetak Patel, and Daniel McDuff. Camera-based physiological sensing: Challenges and future directions. arXiv:2110.13362, 2021.
  • [37] Xiaolong Liu, Qimeng Wang, Yao Hu, Xu Tang, Song Bai, and Xiang Bai. End-to-end temporal action detection with transformer. arXiv:2106.10271, 2021.
  • [38] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. ICCV, 2021.
  • [39] Ze Liu, Jia Ning, Yue Cao, Yixuan Wei, Zheng Zhang, Stephen Lin, and Han Hu. Video swin transformer. arXiv:2106.13230, 2021.
  • [40] Hao Lu and Hu Han. Nas-hr: Neural architecture search for heart rate estimation from face videos. Virtual Reality & Intelligent Hardware, 2021.
  • [41] Hao Lu, Hu Han, and S Kevin Zhou. Dual-gan: Joint bvp and noise modeling for remote physiological measurement. In CVPR, 2021.
  • [42] Daniel Neimark, Omri Bar, Maya Zohar, and Dotan Asselmann. Video transformer network. arXiv:2102.00719, 2021.
  • [43] Xuesong Niu, Hu Han, Shiguang Shan, and Xilin Chen. Continuous heart rate measurement from face: A robust rppg approach with distribution learning. In IJCB, 2017.
  • [44] Xuesong Niu, Hu Han, Shiguang Shan, and Xilin Chen. Synrhythm: Learning a deep heart rate estimator from general to specific. In ICPR, 2018.
  • [45] Xuesong Niu, Shiguang Shan, Hu Han, and Xilin Chen. Rhythmnet: End-to-end heart rate estimation from face via spatial-temporal representation. IEEE TIP, 2019.
  • [46] Xuesong Niu, Zitong Yu, Hu Han, Xiaobai Li, Shiguang Shan, and Guoying Zhao. Video-based remote physiological measurement via cross-verified feature disentangling. In ECCV, pages 295–310. Springer, 2020.
  • [47] Xuesong Niu, Xingyuan Zhao, Hu Han, Abhijit Das, Antitza Dantcheva, Shiguang Shan, and Xilin Chen. Robust remote heart rate estimation from face utilizing spatial-temporal attention. In FG, 2019.
  • [48] Ewa M Nowara, Daniel McDuff, and Ashok Veeraraghavan. The benefit of distraction: Denoising camera-based physiological measurements using inverse attention. In ICCV, 2021.
  • [49] Ming-Zher Poh, Daniel McDuff, and Rosalind Picard. Advancements in noncontact, multiparameter physiological measurements using a webcam. IEEE transactions on biomedical engineering, 2010.
  • [50] Ming-Zher Poh, Daniel J McDuff, and Rosalind W Picard. Non-contact, automated cardiac pulse measurements using video imaging and blind source separation. Optics express, 2010.
  • [51] Ying Qiu, Yang Liu, Juan Arteaga-Falconi, Haiwei Dong, and Abdulmotaleb El Saddik. Evm-cnn: Real-time contactless heart rate estimation from facial video. IEEE TMM, 2018.
  • [52] Mohammad Soleymani, Jeroen Lichtenauer, Thierry Pun, and Maja Pantic. A multimodal database for affect recognition and implicit tagging. IEEE transactions on affective computing, 2011.
  • [53] Radim Špetlík, Vojtech Franc, and Jirí Matas. Visual heart rate estimation with convolutional neural network. In BMVC, 2018.
  • [54] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, 2021.
  • [55] Sergey Tulyakov, Xavier Alameda-Pineda, Elisa Ricci, Lijun Yin, Jeffrey F Cohn, and Nicu Sebe. Self-adaptive matrix completion for heart rate estimation from face videos under realistic conditions. In CVPR, 2016.
  • [56] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • [57] Wim Verkruysse, Lars O Svaasand, and J Stuart Nelson. Remote plethysmographic imaging using ambient light. Optics express, 2008.
  • [58] Lining Wang, Haosen Yang, Wenhao Wu, Hongxun Yao, and Hujie Huang. Temporal action proposal generation with transformers. arXiv:2105.12043, 2021.
  • [59] Wenjin Wang, Albertus C den Brinker, Sander Stuijk, and Gerard de Haan. Algorithmic principles of remote ppg. IEEE Transactions on Biomedical Engineering, 2017.
  • [60] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. ICCV, 2021.
  • [61] Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. NeurIPS, 2021.
  • [62] Mingze Xu, Yuanjun Xiong, Hao Chen, Xinyu Li, Wei Xia, Zhuowen Tu, and Stefano Soatto. Long short-term transformer for online action detection. arXiv:2107.03377, 2021.
  • [63] Zitong Yu, Xiaobai Li, Xuesong Niu, Jingang Shi, and Guoying Zhao. Autohr: A strong end-to-end baseline for remote heart rate measurement with neural searching. IEEE SPL, 2020.
  • [64] Zitong Yu, Xiaobai Li, Pichao Wang, and Guoying Zhao. Transrppg: Remote photoplethysmography transformer for 3d mask face presentation attack detection. IEEE SPL, 2021.
  • [65] Zitong Yu, Xiaobai Li, and Guoying Zhao. Remote photoplethysmograph signal measurement from facial videos using spatio-temporal networks. In BMVC, 2019.
  • [66] Zitong Yu, Xiaobai Li, and Guoying Zhao. Facial-video-based physiological signal measurement: Recent advances and affective applications. IEEE Signal Processing Magazine, 2021.
  • [67] Zitong Yu, Wei Peng, Xiaobai Li, Xiaopeng Hong, and Guoying Zhao. Remote heart rate measurement from highly compressed facial videos: an end-to-end deep learning solution with video enhancement. In ICCV, 2019.
  • [68] Zitong Yu, Yunxiao Qin, Xiaobai Li, Chenxu Zhao, Zhen Lei, and Guoying Zhao. Deep learning for face anti-spoofing: a survey. arXiv:2106.14948, 2021.
  • [69] Zitong Yu, Benjia Zhou, Jun Wan, Pichao Wang, Haoyu Chen, Xin Liu, Stan Z Li, and Guoying Zhao. Searching multi-rate and multi-modal temporal enhanced networks for gesture recognition. IEEE TIP, 2021.
  • [70] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. arXiv:2101.11986, 2021.
  • [71] Yanhong Zeng, Jianlong Fu, and Hongyang Chao. Learning joint spatial-temporal transformations for video inpainting. In ECCV, 2020.
  • [72] Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE SPL, 2016.
  • [73] Jiaojiao Zhao, Xinyu Li, Chunhui Liu, Shuai Bing, Hao Chen, Cees GM Snoek, and Joseph Tighe. Tuber: Tube-transformer for action detection. arXiv:2104.00969, 2021.
  • [74] Sixiao Zheng, Jiachen Lu, Hengshuang Zhao, Xiatian Zhu, Zekun Luo, Yabiao Wang, Yanwei Fu, Jianfeng Feng, Tao Xiang, Philip HS Torr, et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In CVPR, 2021.