
Discrete Cosin TransFormer: Image Modeling From Frequency Domain

Publisher: IEEE

Abstract:


In this paper, we propose Discrete Cosin TransFormer (DCFormer), which directly learns semantics from a DCT-based frequency domain representation. We first show that transformer-based networks are able to learn semantics directly from the frequency domain representation based on the discrete cosine transform (DCT) without compromising performance. To achieve the desired efficiency-effectiveness trade-off, we then apply input information compression on the frequency domain representation, which highlights the visually significant signals, inspired by JPEG compression. We explore different frequency domain downsampling strategies and show that it is possible to preserve the semantically meaningful information by strategically dropping the high-frequency components. The proposed DCFormer is tested on various downstream tasks including image classification, object detection and instance segmentation; it achieves performance comparable to the state of the art with fewer FLOPs, and outperforms a commonly used backbone (e.g. SWIN) at similar FLOPs. Our ablation results also show that the proposed method generalizes well to different transformer backbones.
Published in: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
Date of Conference: 02-07 January 2023
Date Added to IEEE Xplore: 06 February 2023

DOI: 10.1109/WACV56688.2023.00543
Conference Location: Waikoloa, HI, USA

SECTION 1.

Introduction

Different types of image representations are often used for different types of downstream tasks. The RGB-based representation carries rich semantic information, and has thus become the mainstream choice for visual content understanding and the associated computer vision tasks, e.g. image classification [11], object detection [31], etc. The frequency domain representation better separates the information from different frequency bands, and is commonly used for image compression and image quality assessment [15], [16]. In this paper, we explore image modeling directly on the frequency representation, unlike conventional RGB-based image modeling, and furthermore efficient image modeling by dropping non-visually significant information directly from the frequency representation. Efficient image modeling directly in the frequency domain is often overlooked, because in downstream tasks that focus on semantics and content understanding, RGB-based modeling approaches generally yield better performance. There are two major challenges for efficient image modeling directly in the frequency domain: (1) how to model the frequency representation, since the adjacent pixels lack direct spatial associations; and (2) how to compress the non-visually significant information without compromising performance.

Figure 1: Image classification on ImageNet-1K. DCFormer (red lines) achieves a better efficiency/effectiveness balance: similar performance at lower FLOPs, and better accuracy at similar FLOPs. DCFormer-SW and -NA denote DCFormer with SWIN [33] and NAT [19] as backbones, respectively. Details in Section 4.

For frequency representation modeling, we find that the inverse-DCT transformation shares a similar mathematical form with a transformer (self-attention-based network), indicating that it is possible for transformer-based encoders to simulate the inverse-DCT process (details in Section 3). Therefore, we propose the Discrete Cosin TransFormer (DCFormer), which uses frequency domain representations for image modeling. To ensure the frequency representation works with conventional transformers (e.g. ViT [12], SWIN transformer [33], etc.), we propose a frequency embedding in addition to the positional embedding to maintain both spatial and frequency band information. We further empirically demonstrate that our DCFormer is able to capture semantics directly from the frequency representation without any performance compromise compared with RGB-based approaches.

As for the second challenge, strategically dropping information from the input is non-trivial, and it is particularly challenging for the RGB-based representation, as previous research shows that any type of pooling on the input harms performance [9]. Inspired by image compression approaches, we propose to strategically drop more high-frequency components and fewer low-frequency components to better maintain the semantic information. We also introduce a reconstruction auxiliary loss to help the training process.

We test our model on image classification with the ImageNet-1K dataset, and on object detection and instance segmentation with MS-COCO [31]. The proposed DCFormer generalizes well to different transformer backbones without performance compromise, including SWIN [33], ViT [12], and NAT [19]. With the help of the proposed frequency down-sampling strategy, DCFormer can take input images at different resolutions for a better efficiency vs. effectiveness trade-off (Figure 1). We further show that DCFormer is able to achieve performance comparable to commonly used RGB-based models at lower computational cost, demonstrating that frequency modeling is a promising direction for building efficient models. Our contributions are summarized as follows:

  1. DCFormer, a transformer that performs image modeling for various downstream tasks directly on the DCT-based frequency representation. DCFormer learns semantics directly from the DCT-based frequency representation without performance compromise.

  2. A study of different input down-sampling methods, and a zigzag-based hard-selection for DCT-based input compression. With the proposed input compression strategy, DCFormer achieves an even better efficiency-effectiveness trade-off.

  3. Detailed experimental results and ablations, which can be used for future reference.

SECTION 2.

Related Work

2.1. Image Modeling

As the foundation of many computer vision tasks, image classification has been studied for decades, moving from heavy reliance on manually-crafted features [59], [28], [38] to the deep neural network era [30]; deep learning has dominated image modeling since 2012 [29]. Over the past decade, networks have grown deeper, wider and more complex [44], [47], [21], [56] to fit various tasks including classification [21], detection [40], [32], segmentation [20], [8], pose estimation [45] and more. Besides the network architectures, convolution layers have also evolved from basic convolution to depth-wise convolution [57], non-local convolution [53] and deformable convolution [10]. In parallel with convolution networks, recent research shows that attention-based architectures, previously common for NLP tasks [51], transfer well to image modeling. The pioneering work ViT [12] and the following works DEiT [50], SWIN [33], CoaT [58], and the more recent Mixer [49] all achieve performance comparable or superior to convolution networks. The majority of works on image modeling focus on performance, while in this paper we focus on the efficiency-effectiveness trade-off.

2.2. Frequency-domain Learning

Frequency domain learning has received much less attention than RGB domain modeling over the past decades. Only a few works propose to make use of JPEG encoding for faster image classification [17], [14]. Although efficient, these works are less effective than the SOTA image classification models of their time. Some recent works try to move the frequency components from the DCT transformation into the channel dimension for better modeling [2], [1]; however, the effectiveness gap still exists. Besides, the frequency domain representation has also been used in compression [55], [35], pruning [35] and convnet compression [54], [13]. Although functional, frequency domain modeling generally suffers from low accuracy and low efficiency, which makes it less favored for many downstream image tasks. Our proposed DCFormer with image compression achieves performance on par with SOTA RGB networks at much lower computational cost, which sets it apart from previous frequency domain modeling works. The recent work WaveViT [60] achieves strong performance with a discrete wavelet transform based representation. We share a similar scope on frequency representation based modeling, but instead leverage the DCT-based representation because of its flexibility to support strategic down-sampling that improves efficiency without performance compromise.

2.3. Efficient Image Modeling

There are several attempts at efficient image modeling. Convolution kernel or network compression [18], [22] is a straightforward way to reduce model FLOPs but often leads to a noticeable performance drop. Later, carefully designed compact networks with very small memory footprints were proposed for edge devices, including SqueezeNet [24], MobileNets [23], [42], and ShuffleNets [61], [36]. More recently, neural architecture search has been widely used as a tool for finding efficient and accurate network architectures, e.g. ProxylessNAS [5] and EfficientNet [48]. Different from these approaches that try to build a smaller network, we propose to reduce computation based on frequency domain image compression.

SECTION 3.

Methodology

3.1. Frequency Domain Modeling

In this paper, we adopt DCT based frequency domain representation because it is commonly used for image compression [41], image encoding [43], and various computer vision tasks.

3.1.1 Domain Converting Preliminaries

For an RGB image $I_{RGB} \in \mathbb{R}^{W \times H}$, we first convert the image from the RGB to the YCrCb color space ($I_{YCrCb}$) and patchify the image as:

$P = [P_0, P_1, \ldots, P_i] = \mathrm{patchify}(I_{YCrCb}) \tag{1}$

where $P \in \mathbb{R}^{3 \times \frac{H}{ps} \times \frac{W}{ps} \times ps \times ps}$ is a sequence of patches and $ps$ denotes the side length of each patch. DCT [4] is applied to each patch to generate the frequency domain representation:

$D_i = \mathrm{DCT}(P_i) \tag{2}$

where the DCT map of each patch $D_i \in \mathbb{R}^{ps \times ps}$ has the same dimensions as the original patch $P_i$.

The patchify operation preserves the relative spatial information, while each point in a patch $D_i$ carries certain frequency information. Patch size selection involves a trade-off: smaller patches lead to higher spatial resolution but less fine-grained frequency information, and vice versa for larger patches. We empirically select 8 × 8 as the patch size for the best efficiency-effectiveness trade-off, the same as JPEG's encoding process. This design potentially allows us to obtain the DCT components directly from raw JPEG images for faster training and inference.
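
To make the conversion concrete, the following is a minimal NumPy/SciPy sketch of this preliminary step (our own illustration, not the authors' code): RGB to YCrCb conversion, splitting into 8 × 8 patches, and a per-patch 2D DCT.

import numpy as np
from scipy.fft import dctn

def rgb_to_ycbcr(img):
    # JPEG-style RGB -> YCbCr conversion; img is (H, W, 3) float in [0, 255]
    m = np.array([[ 0.299,     0.587,     0.114   ],
                  [-0.168736, -0.331264,  0.5     ],
                  [ 0.5,      -0.418688, -0.081312]])
    out = img @ m.T
    out[..., 1:] += 128.0
    return out

def patchify_dct(img_rgb, ps=8):
    # Split into ps x ps patches and apply a 2D DCT per patch (Eqs. 1-2).
    ycbcr = rgb_to_ycbcr(img_rgb.astype(np.float64))
    H, W, _ = ycbcr.shape
    Hc, Wc = H - H % ps, W - W % ps                     # crop to a multiple of ps
    x = ycbcr[:Hc, :Wc].reshape(Hc // ps, ps, Wc // ps, ps, 3)
    patches = x.transpose(4, 0, 2, 1, 3)                # (3, H/ps, W/ps, ps, ps)
    return dctn(patches, type=2, norm='ortho', axes=(-2, -1))

D = patchify_dct(np.random.rand(256, 256, 3) * 255)     # dummy image
print(D.shape)                                          # (3, 32, 32, 8, 8)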

3.1.2 Frequency Domain Encoder

For a compressed frequency map $S$, the pixels no longer hold spatial relationships inside each patch. Different from previous works that try to shift frequency components into channels for convolution-based modeling [1], [2], we propose to build a network that works directly on the frequency map. The 2D inverse DCT (IDCT) for each frequency patch can be mathematically formulated as:

$\mathrm{IDCT}(S_i) = A^{\top} S_i A \tag{3}$

where $A$ denotes the DCT transform matrix. The above equation can be further expanded as below, and is consistent with the transformer layer's formulation:

$\mathrm{IDCT}(S_i) = A^{\top} S_i A = (W_q \Lambda W_k)(W_v S_i)(W) \tag{4}$

where $W_k, W_q, W_v$ denote the learnable linear projections for key, query and value, and $W$ denotes the learnable weights of the linear layer after attention. Although it is not guaranteed that $W$ is strictly the transpose of $(W_q \Lambda W_k)$, it is possible for transformers to learn an approximation of the IDCT. The observation in Equation 4 makes the transformer architecture a good fit for our compressed image modeling. Note that it is possible for convolution networks to simulate the IDCT through carefully designed kernel sizes and strides, but this is less effective than transformer networks (ablations in Table 4d). Commonly used transformers, including sequence transformers (e.g. ViT [12]) and hierarchical vision transformers (e.g. SWIN [33], NAT [19]), work directly in the frequency domain with minor changes to the patch size and embedding.
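
As an illustration of Eq. (3), the sketch below (our own, not from the paper) builds an orthonormal DCT-II transform matrix A and verifies that the inverse transform is just two matrix products, the same algebraic shape a self-attention layer can mimic.

import numpy as np

def dct_matrix(n=8):
    # Orthonormal DCT-II transform matrix A (n x n): A @ A.T == I
    k = np.arange(n)[:, None]                     # frequency index
    x = np.arange(n)[None, :]                     # spatial index
    A = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    A[0, :] /= np.sqrt(2.0)                       # DC row scaling
    return A

A = dct_matrix(8)
P = np.random.rand(8, 8)                          # a spatial patch
S = A @ P @ A.T                                   # forward 2D DCT of the patch
P_rec = A.T @ S @ A                               # inverse DCT, Eq. (3)
assert np.allclose(P, P_rec)                      # exact: A is orthogonal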

Frequency Embedding: Each frequency point $S_{i,j}$ in $S_i$ carries relative positional information as well as certain frequency band information. To maintain the frequency information, we propose a frequency embedding (FE) in addition to the commonly used positional embedding:

$FE_{(j,2k)} = \sin\left(j / 10000^{2k/d_m}\right), \quad FE_{(j,2k+1)} = \sin\left(j / 10000^{(2k+1)/d_m}\right) \tag{5}$

where $j \in [0, ps^2]$ denotes the position in the down-sampled frequency patch $S_i$, and $k$ denotes the $k$-th of $d_m$ feature dimensions. We apply the frequency embedding by adding it to the frequency map $S$.
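
A minimal sketch of Eq. (5) follows (assumed token shapes, not the authors' code); writing the even/odd cases out per feature dimension collapses to one sinusoidal table indexed by position j and dimension k.

import numpy as np

def frequency_embedding(num_positions, d_m):
    # FE[j, k] = sin(j / 10000^(k / d_m)), matching Eq. (5) once the even/odd
    # cases are enumerated over the feature dimension k.
    j = np.arange(num_positions)[:, None]     # position inside the DCT patch
    k = np.arange(d_m)[None, :]               # feature dimension
    return np.sin(j / np.power(10000.0, k / d_m))

fe = frequency_embedding(num_positions=8 * 8, d_m=96)   # (64, 96)
# tokens: (batch, 64, 96) sequence for one patch -> tokens = tokens + fe[None]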

Classification: We unpatchify the compressed patches based on their relative locations as:

$\tilde{S} = \mathrm{unpatchify}(S) \tag{6}$

where the unpatchify operation reorganizes the sequence of compressed DCT maps based on their relative spatial locations. The DCFormer encoder extracts a feature embedding from the compressed frequency domain representation as:

$X_E = \phi(\tilde{S}) \tag{7}$

where $X_E$ stands for the DCFormer encoder feature map. Classification can be done by adding a [CLS] token [12] or using a linear layer [33].

3.2. Efficient Frequency Domain Modeling

3.2.1 Frequency Domain Compression

JPEG compression shows that compression can be achieved by discarding the non-visually significant values via quantization [52]. Following a similar intuition, we aim to keep only the informative frequency components from each DCT map $D_i$ for efficient image modeling. We explore three types of compression strategies, as follows:

Figure 2: An overview of our model, DCFormer. The model takes an RGB image as input and converts it to a DCT-based frequency representation, followed by an optional frequency compression module. The compression module (when τ > 1) offers a significant efficiency boost with a slight performance trade-off. The frequency-based representation is then augmented with positional and frequency embeddings and fed into a stack of DCFormer blocks. The frequency attention is compatible with various transformer attentions (e.g. SWIN attention, neighborhood attention). A linear projection with a CE loss is used for classification, and an MSE reconstruction loss can be used as an auxiliary loss when frequency compression is applied.

Averaging: an average pooling with a τ × τ kernel over the DCT map $D_i$:

$S_i\left(\frac{j}{\tau}, \frac{k}{\tau}\right) = \frac{1}{\tau^2}\left[D_i(j,k) + \cdots + D_i(j, k+\tau) + \cdots + D_i(j+\tau, k+\tau)\right] \tag{8}$

where $j$ and $k$ are the coordinates in $D_i$, $S_i$ denotes the compressed DCT map, and $\tau$ is the compression ratio.

Soft-selection: a cross-attention based soft-selection on $D_i$:

$S_i = \mathrm{MHCA}(\mathrm{Conv2D}(D_i), q_{emb}) \tag{9}$

where MHCA denotes the multi-head cross-attention module, $q_{emb} \in \mathbb{R}^{\frac{ps^2}{\tau^2} \times c}$ is the query embedding, and Conv2D is a 1 × 1 2D convolution that expands the channels of $D_i$ (e.g. $c = 128$) to support multi-head attention.
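
A PyTorch sketch of Eq. (9) follows; the dimensions and module names (SoftSelect, lift) are our assumptions for illustration, not the paper's implementation. ps²/τ² learned query tokens attend over the ps² DCT coefficients of a patch after a 1 × 1 convolution lifts each coefficient to c channels.

import torch
import torch.nn as nn

class SoftSelect(nn.Module):
    def __init__(self, ps=8, tau=2, c=128, heads=4):
        super().__init__()
        self.lift = nn.Conv2d(1, c, kernel_size=1)          # 1x1 conv: 1 -> c channels
        self.q_emb = nn.Parameter(torch.randn(ps * ps // tau**2, c))
        self.mhca = nn.MultiheadAttention(c, heads, batch_first=True)

    def forward(self, d):                                    # d: (B, ps, ps) DCT maps
        kv = self.lift(d.unsqueeze(1))                       # (B, c, ps, ps)
        kv = kv.flatten(2).transpose(1, 2)                   # (B, ps*ps, c)
        q = self.q_emb.unsqueeze(0).expand(d.size(0), -1, -1)
        s, _ = self.mhca(q, kv, kv)                          # (B, ps^2/tau^2, c)
        return s

s = SoftSelect()(torch.randn(4, 8, 8))
print(s.shape)                                               # torch.Size([4, 16, 128])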

Hard-selection: a hard selection following the zigzag pattern, focusing on the low-frequency components:

$S_i = [\psi(D_i)]_k, \quad k \in \left[1, \frac{ps^2}{\tau^2}\right] \tag{10}$

where $\psi$ is the zigzag encoding used in [37], and $\psi(D_i)$ is the sequence of frequency components after zigzag encoding, sorted by frequency band from low to high. To maintain the visually significant information, we hard-select the first $\frac{ps^2}{\tau^2}$ elements from $\psi(D_i)$, which covers the DC component, most of the low-frequency information, and part of the middle-frequency information.
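
Below is a small NumPy sketch of the zigzag hard-selection in Eq. (10) (our own helper functions, not the paper's code): keep the first ps²/τ² coefficients of each 8 × 8 DCT patch in JPEG zigzag order, i.e. the DC and lowest frequencies.

import numpy as np

def zigzag_indices(n=8):
    # (row, col) pairs of an n x n block in JPEG zigzag scan order
    order = sorted(((r, c) for r in range(n) for c in range(n)),
                   key=lambda rc: (rc[0] + rc[1],
                                   rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))
    return np.array(order)

def hard_select(D, tau=2):
    # D: (..., ps, ps) DCT maps -> keep the first ps^2 / tau^2 zigzag coefficients
    ps = D.shape[-1]
    idx = zigzag_indices(ps)[: ps * ps // tau ** 2]
    return D[..., idx[:, 0], idx[:, 1]]

D = np.random.rand(3, 32, 32, 8, 8)        # per-patch DCT maps from Eq. (2)
S = hard_select(D, tau=2)                  # keep 16 of 64 coefficients per patch
print(S.shape)                             # (3, 32, 32, 16)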

We empirically choose the zigzag-based hard-selection for compression, as it works best without introducing additional computation. The cross-attention based soft-selection is discarded since it is computationally heavy. Averaging-based approaches perform the worst, since averaging frequency responses across different bands is not meaningful.

3.2.2 Reconstruction Aux Loss

Because frequency compression causes information loss, we further introduce a reconstruction decoder with an auxiliary loss during training, which encourages the DCFormer encoder to generate comprehensive and semantics-related feature embeddings. The decoder is not used at inference and hence introduces no additional inference computation. It is worth mentioning that the decoder has to be lightweight with limited capacity, so that it relies on the encoded features as much as possible instead of learning new semantic features by itself. We use a simple eight-layer convolutional neural network with 3 × 3 kernels as the decoder. Because convolution is less effective at frequency-to-RGB domain conversion, the decoder has to rely on the semantic information generated by the DCFormer encoder to reconstruct the RGB image. The reconstruction process thus pushes the encoder to generate a more comprehensive representation during training. The decoder has four up-sampling stages; each stage has two convolution layers and a spatial up-sampling layer defined as:

$X_D^i = \begin{cases} X_E & i = 1 \\ \mathrm{conv2D}(\mathrm{conv2D}(U(X_D^{i-1}))) & i \in [2, 4] \end{cases} \tag{11}$

where $U$ denotes the 4× bilinear interpolation up-sampling and $X_D^i$ denotes the feature from the $i$-th decoder stage.
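
A minimal PyTorch sketch of such a lightweight decoder is given below; the channel widths and the final RGB head are our assumptions, not taken from the paper. Each stage applies up-sampling followed by two 3 × 3 convolutions, as in Eq. (11).

import torch
import torch.nn as nn

class ReconDecoder(nn.Module):
    def __init__(self, in_ch=768, widths=(256, 128, 64)):
        super().__init__()
        stages, prev = [], in_ch
        for w in widths:
            stages.append(nn.Sequential(
                nn.Upsample(scale_factor=4, mode='bilinear', align_corners=False),
                nn.Conv2d(prev, w, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(w, w, 3, padding=1), nn.ReLU(inplace=True)))
            prev = w
        self.stages = nn.ModuleList(stages)
        self.to_rgb = nn.Conv2d(prev, 3, 3, padding=1)

    def forward(self, x_e):                  # x_e: encoder feature map (B, C, H, W)
        x = x_e                              # stage 1: identity on X_E
        for stage in self.stages:            # later stages: upsample + two convs
            x = stage(x)
        return self.to_rgb(x)                # reconstructed RGB, used only for the aux loss

decoder = ReconDecoder()
rgb_hat = decoder(torch.randn(2, 768, 4, 4))   # -> (2, 3, 256, 256)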

3.2.3 Losses

We apply the categorical cross-entropy to the DCFormer encoder output as the classification loss ($\mathcal{L}_{cls}$). For the reconstructed images, we calculate the MSE loss as a measure of reconstruction quality ($\mathcal{L}_{MSE}$). We also adopt the perceptual loss ($\mathcal{L}_{perceptual}$), which is commonly used in image super-resolution tasks [25], to encourage the DCFormer encoder to generate semantics-related representations. The final loss is thus defined as:

$\mathcal{L} = \mathcal{L}_{cls} + \alpha \mathcal{L}_{MSE} + \beta \mathcal{L}_{perceptual} \tag{12}$

We empirically set $\alpha = 0.1$ and $\beta = 0.01$.
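
A PyTorch sketch of the combined objective in Eq. (12) is shown below. The perceptual term here uses pretrained VGG16 features in the spirit of [25]; the layer cut-off and the function name are our assumptions, not the paper's implementation.

import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG16 feature extractor for the perceptual term
vgg_feats = torchvision.models.vgg16(weights='IMAGENET1K_V1').features[:16].eval()
for p in vgg_feats.parameters():
    p.requires_grad_(False)

def dcformer_loss(logits, labels, rgb_hat, rgb, alpha=0.1, beta=0.01):
    l_cls = F.cross_entropy(logits, labels)            # classification loss
    l_mse = F.mse_loss(rgb_hat, rgb)                    # reconstruction quality
    l_perc = F.mse_loss(vgg_feats(rgb_hat), vgg_feats(rgb))  # perceptual loss
    return l_cls + alpha * l_mse + beta * l_perc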

Table 1: Frequency modeling results on the ImageNet-1K validation set. τ = 1 and τ = 2 denote frequency input without and with 4× compression, respectively.
SECTION 4.

Experiments

We conduct image classification on the ImageNet-1K dataset [11], and object detection and instance segmentation on the COCO dataset [31]. We first compare the proposed DCFormer with other frequency domain modeling approaches on each task, and then compare with SOTA RGB-based models. Finally, we present ablation analyses of the design choices and the generalizability of DCFormer. We use τ = 1 for DCFormer without frequency compression and τ = 2 for DCFormer with 4× frequency compression.

4.1. Frequency Domain Modeling Results

4.1.1 Classification on ImageNet

Setting: We follow [33] for ImageNet training with minor changes. We adopt the AdamW [27] optimizer with a cosine learning rate schedule. The training process starts with 30 epochs of linear warm-up, followed by 270 training epochs. A batch size of 1024, an initial learning rate of 0.001, and a weight decay of 0.05 are used, similar to [33]; the learning rate is scaled according to the batch size in different experiments. We follow the augmentation and regularization strategies of [50] in training, including color and size jittering, mixup, label smoothing, etc.
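
For reference, a sketch of an optimizer and schedule matching the stated recipe (AdamW, 30 warm-up epochs, cosine decay, lr 0.001, weight decay 0.05) is given below; the helper name and per-epoch stepping are placeholders, not the authors' training code.

import math
import torch
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, base_lr=1e-3, weight_decay=0.05,
                    warmup_epochs=30, total_epochs=300):
    opt = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    def lr_lambda(epoch):
        if epoch < warmup_epochs:                       # linear warm-up
            return (epoch + 1) / warmup_epochs
        t = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
        return 0.5 * (1.0 + math.cos(math.pi * t))      # cosine decay
    return opt, LambdaLR(opt, lr_lambda)                # call scheduler.step() once per epoch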

Results: We first demonstrate that the transformer is able to learn semantics directly from the frequency representation. We compare the image classification accuracy of SWIN [33], which takes RGB images as input, with our DCFormer, which takes the frequency-based representation as input. The results (Table 1, τ = 1) show that DCFormer achieves performance similar to the different RGB backbones. This confirms our intuition that transformers can learn semantics from frequency representations directly without performance compromise.

We further study the performance of the proposed frequency input compression on DCFormer (Table 1, τ = 2) and compare it with previous frequency modeling works [1], [2]. Under the same input resolution, DCFormer with the SWIN transformer backbone outperforms the most recent frequency domain modeling approach [1] with fewer FLOPs. The reduced FLOPs and better performance demonstrate that our zigzag compression maintains salient information better than the commonly used channel-wise convolution based frequency domain compression methods [2]. DCFormer-NAT-T also slightly outperforms the most recent image frequency modeling work based on the wavelet transform [60]. It is worth mentioning that WaveViT-B achieves strong performance with additional data, and is not a direct comparison to the other results. We also notice that increasing the input resolution while performing the proposed input compression increases performance (Table 1, τ = 2). DCFormer with τ = 2 and 4× input resolution runs at the same FLOPs as DCFormer taking uncompressed input, with comparable or slightly better performance (e.g., DCFormer-SW-T vs. SWIN-T).

4.1.2 Object Detection on COCO

Setting: We finetune DCFormer-SWIN with the Mask R-CNN [20] pipeline on the COCO 2017 dataset [31]. During fine-tuning, we use multi-scale training [46], [7], the AdamW optimizer [27] with a weight decay of 0.05, and the same learning rate decay schedule as [6], [34]. The pipeline follows [33]. Because our image compression leads to smaller feature maps at the final stage, we decrease the spatial scale factor of the RoIAlign and FPN layers to match the scale of the feature maps, e.g. scale = 2 for τ = 2.

Results: The proposed DCFormer outperforms other frequency domain modeling works [1], [2], [60] by a significant margin (Table 2). Under the Mask R-CNN setup, DCFormer-T outperforms [2] by 3.5% with half the FLOPs, and DCFormer-S outperforms [1] by 3.8% with only 60% of its FLOPs. DCFormer stands out in detection tasks among frequency domain based methods because: (1) instead of squeezing the frequency bands into the channels, DCFormer is able to maintain the feature map used for RoIAlign at a reasonable size, which is critical for detection tasks; (2) the transformer models the frequency domain representation better than stacked convolutions.

4.1.3 Instance Segmentation on COCO

Setting: We also evaluate the instance segmentation performance on the COCO dataset, and our training follows the same protocol used in the detection experiments.

Table 2: Comparison of COCO object detection and instance segmentation on the 5k validation set, with 800 × 1333 input images. DCFormer-SW and DCFormer-NA denote DCFormer with SWIN [33] and NAT [19] as backbones, respectively.
Table 3: Comparison with RGB-based works on image classification, object detection and instance segmentation. Detection and instance segmentation tasks run on 800 × 1333 input images; 1.5× denotes 1.5-times larger input images. DCFormer-SW/NA denotes DCFormer with SWIN [33] and NAT [19] as backbones, respectively.

Results: DCFormer achieves consistent performance improvements on instance segmentation (Table 2) compared with other frequency domain modeling approaches [1], [2].

4.2. Comparing with RGB-based SOTA

Classification: Table 3 (a) shows our results on the ImageNet-1K validation set compared with previous works based on convolutions [21], [34] and transformers [33], [50]. All the models listed are trained only on ImageNet-1K from scratch. We find that the proposed DCFormer is able to operate at a lower computational budget while maintaining performance similar to commonly used RGB models. For example, DCFormer-SWIN-S (τ = 2) achieves 80.9% top-1 accuracy with only 2.7G FLOPs, significantly more efficient than SWIN-T [33]. DCFormer-NA-T also achieves slightly better performance at lower FLOPs compared with the recent ConvNeXT-T [34] and SWIN-T [33]. Furthermore, DCFormer is able to achieve performance comparable to RGB SOTA at the same computational budget. For example, DCFormer-SWIN-T (τ = 2) with 512 × 512 input resolution slightly outperforms SWIN-T [33] at the same FLOPs. DCFormer-NA-B also achieves performance comparable to the SOTA NAT-B [19] with the same input resolution and FLOPs. These results demonstrate the outstanding efficiency and effectiveness of the proposed approach.

Table 4: Ablation studies on ImageNet-1K. All experiments use DCFormer-SW-T as the backbone with 256 × 256 images and no reconstruction decoder, unless specified.

Object Detection: To compare with SOTA RGB models, we train DCFormer with the SWIN backbone on the Cascade Mask R-CNN pipeline. The proposed DCFormer is able to reduce the FLOPs and latency on object detection tasks (Table 3 (b)). Given the same input image resolution, DCFormer-SWIN-S achieves slightly worse performance than the SOTA SWIN-T and ConvNeXt-T models but with 15% fewer FLOPs. Similar to image classification, the performance gap can be compensated by a higher input resolution without a significant FLOPs increase. By scaling up the input image by 1.5× (1200 × 1666), DCFormer-SWIN-S is able to achieve comparable performance with 11% fewer FLOPs.

Instance Segmentation: DCFormer with a large input resolution achieves performance similar to the SOTA SWIN [33] and ConvNeXT [34] based approaches (Table 3 (c)). We notice that DCFormer maintains good efficiency thanks to the proposed frequency compression. However, DCFormer performs slightly worse on instance segmentation tasks, probably due to the reduced feature map size and the lack of high-frequency (texture) information. Prior research shows that texture information helps instance segmentation [26]. Better preserving these textures during compression while maintaining low computation will be our future work.

4.3. Ablation Study

We justify the important design choices, effectiveness and generalization of the proposed model on the ImageNet-1K image classification task. All experiments use DCFormer-SWIN-T with 256 × 256 images, an 8 × 8 frequency patch size, and τ = 2, unless specified.

Building components breakdown. Table 4a analyzes the contribution of each proposed component by adding them one at a time to a standard SWIN transformer. We use a SWIN transformer on an RGB image down-sampled to 112 × 112 as the baseline (same FLOPs). Performing hard-selection on the frequency components boosts the performance by 0.34% without introducing additional computation. This also verifies our intuition that frequency domain down-sampling better preserves the visually significant information. The proposed frequency embedding slightly enhances the performance by 0.2%. Additionally, the reconstruction decoder brings a slight improvement of 0.16%, which shows that our decoder works as expected. Note that the decoder only introduces additional FLOPs in training and is not used during inference.

Compression ratio. Table 4b compares the classification performance under different frequency compression ratios. A larger compression ratio, e.g. τ = 4, leads to lower FLOPs but lower accuracy since more information is dropped, and vice versa. Based on the ablation, we choose τ = 2 for the best efficiency-effectiveness trade-off. It is also interesting to see that DCFormer with no frequency down-sampling (τ = 1) achieves the same accuracy and FLOPs as SWIN-T with RGB image input. This indicates that the frequency domain representation is as effective as the RGB representation and, as we argued, that the transformer is a good fit for frequency domain modeling.

DCT Patch Size. Table 4c studies the impact of using different DCT patch sizes on images of different resolutions. In general, a smaller DCT patch size, e.g. 8², works better on smaller images (e.g. 256², 384²). Further increasing the input resolution with the same DCT patch size does not consistently enhance the performance, because small DCT patches with limited DCT bases only contain limited information. Larger DCT patches convey more frequency information and yield better performance on high-resolution inputs. Dynamically adjusting the DCT patch size will be our future research.

Generalization. Table 4d explores the generalization of the proposed image compression to different backbones. Our approach generalizes to different backbones. Transformer-based encoders generally show a smaller performance drop when using frequency domain inputs, which supports our argument that attention can simulate the inverse-DCT operation more effectively. It is also worth mentioning that ViT-B has high FLOPs due to its lack of a multi-scale feature hierarchy.

Compression method. Table 4e compares different frequency compression methods. We first notice that average pooling, which is commonly used for spatial down-sampling, causes a significant performance drop when applied in the frequency domain. This is because averaging data points that belong to different frequency bands is not meaningful, as they do not have direct spatial associations. We then try to use cross-attention to learn a weighted-average based compression. However, cross-attention requires applying extra convolution layers on the input DCT map, which introduces additional computation and makes the compression less efficient, contradicting our motivation. The zigzag selection works best with no additional computation in our case. A similar approach is used in JPEG compression, and similar patterns have been observed [2].

Effectiveness. Table 4f compares and analyzes several alternatives to the proposed image modeling at similar FLOPs, including: directly down-sampling the input image by 2×; and, after the proposed frequency domain compression, reconstructing the RGB image with the IDCT and feeding it into a standard SWIN transformer. For better comparison, SWIN-T with 256 × 256 RGB input is used as the baseline. The results indicate that the proposed method is more effective than the alternatives, as the reconstruction may suffer from noise introduced during padding and conversion.

SECTION 5.

Visualization

To qualitatively show that the proposed frequency domain modeling learns semantics, we visualize the activations of DCFormer-SWIN with attention rollout [3].

Figure 3 visualizes and compares the features learned by the SWIN transformer and our DCFormer-SWIN-T (τ = 2). The activation maps are extracted from the last stage of the backbone and overlaid on the input image. In most cases, the attentions from the SWIN transformer and our DCFormer fall onto the same regions, which indicates that DCFormer learns the same semantic representations as RGB domain modeling (Figure 3, top). We notice that in a few cases the activation map from the SWIN transformer covers broader regions (Figure 3, bottom), probably because the SWIN transformer generates 4× larger feature maps than DCFormer for the same input image.

Figure 3: Activations from SWIN-T [33] and DCFormer-SWIN. Most cases share similar activations for both models (top); in some cases SWIN covers larger regions (bottom).

SECTION 6.

Conclusion

In this paper, we introduce DCFormer, which enables transformers to learn semantics directly from the DCT-based frequency domain representation. Based on DCFormer, we further introduce a frequency input down-sampling method. DCFormer achieves performance comparable to commonly used transformer backbones without compromise. With the proposed frequency input compression, DCFormer achieves a better efficiency-effectiveness trade-off than previous frequency modeling approaches. We hope these promising results will encourage research on efficient modeling from another perspective, as well as the application of the proposed approach to many downstream tasks. Exploring transformer-based frequency domain modeling with other frequency representations, e.g. the discrete wavelet transform, and refining the frequency compression for better performance will be our future work.

References

1.
"Fcanet: Frequency channel attention networks", CVPR 2021.
2.
"Learning in the frequency domain", CVPR 2020.
3.
Samira Abnar and Willem Zuidema, "Quantifying attention flow in transformers", 2020.
4.
Nasir Ahmed, T Natarajan and Kamisetty R Rao, "Discrete cosine transform", IEEE transactions on Computers, vol. 100, no. 1, pp. 90-93, 1974.
5.
Han Cai, Ligeng Zhu and Song Han, "Proxylessnas: Direct neural architecture search on target task and hardware", 2018.
6.
Hu Cao, Yueyue Wang, Joy Chen, Dongsheng Jiang, Xiaopeng Zhang, Qi Tian, et al., "Swin-unet: Unet-like pure transformer for medical image segmentation", 2021.
7.
Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov and Sergey Zagoruyko, "End-to-end object detection with transformers", European conference on computer vision, pp. 213-229, 2020.
8.
Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy and Alan L Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution and fully connected CRFs", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834-848, 2017.
9.
Tianshui Chen, Liang Lin, Wangmeng Zuo, Xiaonan Luo and Lei Zhang, "Learning a wavelet-like auto-encoder to accelerate deep neural networks", Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, 2018.
10.
Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, et al., "Deformable convolutional networks", Proceedings of the IEEE international conference on computer vision, pp. 764-773, 2017.
11.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei, "Imagenet: A large-scale hierarchical image database", 2009 IEEE conference on computer vision and pattern recognition, pp. 248-255, 2009.
12.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly et al., "An image is worth 16x16 words: Transformers for image recognition at scale", 2020.
13.
Adam Dziedzic, John Paparrizos, Sanjay Krishnan, Aaron Elmore and Michael Franklin, "Band-limited training and inference for convolutional neural networks", International Conference on Machine Learning, pp. 1745-1754, 2019.
14.
Max Ehrlich and Larry S Davis, "Deep residual learning in the jpeg transform domain", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3484-3493, 2019.
15.
Ahmet M Eskicioglu and Paul S Fisher, "Image quality measures and their performance", IEEE Transactions on communications, vol. 43, no. 12, pp. 2959-2965, 1995.
16.
Ge Gao, Pei You, Rong Pan, Shunyuan Han, Yuanyuan Zhang, Yuchao Dai, et al., "Neural image compression via attentional multi-scale back projection and frequency decomposition", Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14677-14686, 2021.
17.
Lionel Gueguen, Alex Sergeev, Ben Kadlec, Rosanne Liu and Jason Yosinski, "Faster neural networks straight from jpeg", Advances in Neural Information Processing Systems, vol. 31, 2018.
18.
Song Han, Huizi Mao and William J Dally, "Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding", 2015.
19.
Ali Hassani, Steven Walton, Jiachen Li, Shen Li and Humphrey Shi, "Neighborhood attention transformer", 2022.
20.
Kaiming He, Georgia Gkioxari, Piotr Dollár and Ross Girshick, "Mask r-cnn", Proceedings of the IEEE international conference on computer vision, pp. 2961-2969, 2017.
21.
Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, "Deep residual learning for image recognition", Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770-778, 2016.
22.
Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li and Song Han, "Amc: Automl for model compression and acceleration on mobile devices", Proceedings of the European conference on computer vision (ECCV), pp. 784-800, 2018.
23.
Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, et al., "Mobilenets: Efficient convolutional neural networks for mobile vision applications", 2017.
24.
Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally and Kurt Keutzer, "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size", 2016.
25.
Justin Johnson, Alexandre Alahi and Li Fei-Fei, "Perceptual losses for real-time style transfer and super-resolution", European conference on computer vision, pp. 694-711, 2016.
26.
Myeongjin Kim and Hyeran Byun, "Learning texture invariant representation for domain adaptation of semantic segmentation", Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 12975-12984, 2020.
27.
Diederik P Kingma and Jimmy Ba, "Adam: A method for stochastic optimization", 2014.
28.
Takumi Kobayashi, "Bfo meets hog: feature extraction based on histograms of oriented pdf gradients for image classification", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 747-754, 2013.
29.
Alex Krizhevsky, Ilya Sutskever and Geoffrey E Hinton, "Imagenet classification with deep convolutional neural networks", Advances in neural information processing systems, vol. 25, 2012.
30.
Yann LeCun, Yoshua Bengio and Geoffrey Hinton, "Deep learning", nature, vol. 521, no. 7553, pp. 436-444, 2015.