¹¹institutetext: University of California, Los Angeles
加州大学洛杉矶分校²²institutetext: University of Texas at Austin
² institutetext: 德克萨斯大学奥斯汀分校³³institutetext: Google Research
谷歌研究⁴⁴institutetext: University of California, Merced
加州大学默塞德分校

V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision Transformer
V2X-ViT：基于视觉 Transformer 的车辆到万物协同感知

Runsheng Xu 徐润生Equal contribution. ^† Corresponding author: jiaqima@ucla.edu11
同等贡献。 ^† 对应作者：jiaqima@ucla.edu11 Hao Xiang^⋆ 郝向 ^⋆ 11 Zhengzhong Tu^⋆ 图正中 ^⋆ 22 Xin Xia 辛霞11
Ming-Hsuan Yang 楊明煥3344 Jiaqi Ma^† 马家琪 ^† 11

Abstract 摘要

In this paper, we investigate the application of Vehicle-to-Everything (V2X) communication to improve the perception performance of autonomous vehicles. We present a robust cooperative perception framework with V2X communication using a novel vision Transformer. Specifically, we build a holistic attention model, namely V2X-ViT, to effectively fuse information across on-road agents (i.e., vehicles and infrastructure). V2X-ViT consists of alternating layers of heterogeneous multi-agent self-attention and multi-scale window self-attention, which captures inter-agent interaction and per-agent spatial relationships. These key modules are designed in a unified Transformer architecture to handle common V2X challenges, including asynchronous information sharing, pose errors, and heterogeneity of V2X components. To validate our approach, we create a large-scale V2X perception dataset using CARLA and OpenCDA. Extensive experimental results demonstrate that V2X-ViT sets new state-of-the-art performance for 3D object detection and achieves robust performance even under harsh, noisy environments. The code is available at https://github.com/DerrickXuNu/v2x-vit.
在这篇论文中，我们研究了将车辆到一切（V2X）通信应用于提高自动驾驶车辆的感知性能。我们提出了一种使用新颖的视觉 Transformer 的鲁棒合作感知框架，利用 V2X 通信。具体来说，我们构建了一个整体注意力模型，即 V2X-ViT，以有效地融合道路上的代理（即车辆和基础设施）之间的信息。V2X-ViT 由异构多代理自注意力和多尺度窗口自注意力的交替层组成，它捕捉了代理之间的交互和每个代理的空间关系。这些关键模块被设计在统一的 Transformer 架构中，以处理常见的 V2X 挑战，包括异步信息共享、姿态误差和 V2X 组件的异质性。为了验证我们的方法，我们使用 CARLA 和 OpenCDA 创建了一个大规模的 V2X 感知数据集。大量的实验结果表明，V2X-ViT 在 3D 目标检测方面达到了新的最先进性能，即使在恶劣、嘈杂的环境中也能实现鲁棒的性能。代码可在 https://github.com/DerrickXuNu/v2x-vit 上获取。

Keywords:

V2X, Vehicle-to-Everything, Cooperative perception, Autonomous driving, Transformer
关键词：V2X，车联网，协同感知，自动驾驶，Transformer

1 Introduction 1 引言

Perceiving the complex driving environment precisely is crucial to the safety of autonomous vehicles (AVs). With recent advancements of deep learning, the robustness of single-vehicle perception systems has demonstrated significant improvement in several tasks such as semantic segmentation [37, 11], depth estimation [61, 12], and object detection and tracking [22, 27, 55, 14]. Despite recent advancements, challenges remain. Single-agent perception system tends to suffer from occlusion and sparse sensor observation at a far distance, which can potentially cause catastrophic consequences [57]. The cause of such a problem is that an individual vehicle can only perceive the environment from a single perspective with limited sight-of-view. To address these issues, recent studies [45, 6, 50, 5, 54, 23] leverage the advantages of multiple viewpoints of the same scene by investigating Vehicle-to-Vehicle (V2V) collaboration, where visual information (e.g., detection outputs, raw sensory information, intermediate deep learning features, details see Sec. 2) from multiple nearby AVs are shared for a complete and accurate understanding of the environment.
感知复杂的驾驶环境对于自动驾驶汽车（AVs）的安全性至关重要。随着深度学习的近期进展，单车感知系统的鲁棒性在语义分割[37, 11]、深度估计[61, 12]以及目标检测和跟踪[22, 27, 55, 14]等任务中表现出显著的提升。尽管取得了近期进展，但挑战依然存在。单代理感知系统往往会在远距离遭受遮挡和稀疏传感器观测，这可能导致灾难性的后果[57]。这种问题的原因是单个车辆只能从单一视角以有限的视野感知环境。为了解决这些问题，近期研究[45, 6, 50, 5, 54, 23]通过研究车辆到车辆（V2V）协作，利用同一场景的多视角优势，共享多个附近 AVs 的视觉信息（例如，检测输出、原始感官信息、中间深度学习特征，详见第 2 节），以实现对环境的完整和准确理解。

Although V2V technologies have the prospect to revolutionize the mobility industry, it ignores a critical collaborator – roadside infrastructure. The presence of AVs is usually unpredictable, whereas the infrastructure can always provide supports once installed in key scenes such as intersections and crosswalks. Moreover, infrastructure equipped with sensors on an elevated position has a broader sight-of-view and potentially less occlusion. Despite these advantages, including infrastructure to deploy a robust V2X perception system is non-trivial. Unlike V2V collaboration, where all agents are homogeneous, V2X systems often involve a heterogeneous graph formed by infrastructure and AVs. The configuration discrepancies between infrastructure and vehicle sensors, such as types, noise levels, installation height, and even sensor attributes and modality, make the design of a V2X perception system challenging. Moreover, the GPS localization noises and the asynchronous sensor measurements of AVs and infrastructure can introduce inaccurate coordinate transformation and lagged sensing information. Failing to properly handle these challenges will make the system vulnerable.
尽管 V2V 技术有望彻底改变移动行业，但它忽略了一个关键的合作者——路边基础设施。自动驾驶汽车的存在通常是不可预测的，而一旦在交叉口和斑马线等关键场景中安装，基础设施总能提供支持。此外，安装在高位传感器的设施具有更广阔的视野和潜在的遮挡较少。尽管有这些优势，将基础设施纳入部署强大的 V2X 感知系统并非易事。与所有代理同质化的 V2V 协作不同，V2X 系统通常涉及由基础设施和自动驾驶汽车组成的异构图。基础设施和车辆传感器之间的配置差异，如类型、噪声水平、安装高度，甚至传感器属性和模式，使得 V2X 感知系统的设计具有挑战性。此外，GPS 定位噪声以及自动驾驶汽车和基础设施的异步传感器测量可以引入不准确的坐标转换和滞后感知信息。未能妥善处理这些挑战将使系统易受攻击。

In this paper, we introduce a unified fusion framework, namely V2X Vision Transformer or V2X-ViT, for V2X perception, that can jointly handle these challenges. Fig. 2 illustrates the entire system. AVs and infrastructure capture, encode, compress, and send intermediate visual features with each other, while the ego vehicle (i.e., receiver) employs V2X-Transformer to perform information fusion for object detection. We propose two novel attention modules to accommodate V2X challenges: 1) a customized heterogeneous multi-agent self-attention module that explicitly considers agent types (vehicles and infrastructure) and their connections when performing attentive fusion; 2) a multi-scale window attention module that can handle localization errors by using multi-resolution windows in parallel. These two modules will adaptively iteratively fuse visual features to capture inter-agent interaction and per-agent spatial relationship, correcting the feature misalignment caused by localization error and time delay. Moreover, we also integrate a delay-aware positional encoding to handle the time delay uncertainty further. Notably, all these modules are incorporated in a single transformer that learns to address these challenges end-to-end.
本文介绍了一种统一的融合框架，即 V2X 视觉 Transformer 或 V2X-ViT，用于 V2X 感知，能够联合处理这些挑战。图 2 展示了整个系统。自动驾驶汽车和基础设施相互捕捉、编码、压缩并发送中间视觉特征，而自我车辆（即接收器）采用 V2X-Transformer 进行信息融合以执行目标检测。我们提出了两个新颖的注意力模块以适应 V2X 挑战：1）一个定制的异构多智能体自注意力模块，在执行注意力融合时明确考虑智能体类型（车辆和基础设施）及其连接；2）一个多尺度窗口注意力模块，通过并行使用多分辨率窗口来处理定位误差。这两个模块将自适应迭代融合视觉特征以捕捉智能体间的交互和每个智能体的空间关系，纠正由定位误差和时间延迟引起的特征错位。此外，我们还集成了一种延迟感知的位置编码来进一步处理时间延迟的不确定性。显著的是，所有这些模块都被整合到一个单一的 Transformer 中，该 Transformer 学习端到端地解决这些挑战。

To evaluate our approach, we collect a new large-scale open dataset, namely V2XSet, that explicitly considers real-world noises during V2X communication using the high-fidelity simulator CARLA [10], and a cooperative driving automation simulation tool OpenCDA. Fig. 1 shows a data sample in the collected dataset. Experiments show that our proposed V2X-ViT significantly advances the performance on V2X LiDAR-based 3D object detection, achieving a 21.2% gain of AP compared to single-agent baseline and performing favorably against leading intermediate fusion methods by at least 7.3%. Our contributions are:
为了评估我们的方法，我们收集了一个新的大规模公开数据集，即 V2XSet，该数据集在 V2X 通信中使用高保真模拟器 CARLA [10]和合作驾驶自动化模拟工具 OpenCDA 明确考虑了现实世界的噪声。图 1 显示了收集的数据集中的一个数据样本。实验表明，我们提出的 V2X-ViT 在基于 V2X LiDAR 的 3D 目标检测性能上取得了显著进步，与单代理基线相比，AP 提升了 21.2%，并且至少比领先的中间融合方法表现好 7.3%。我们的贡献包括：

•

We present the first unified transformer architecture (V2X-ViT) for V2X perception, which can capture the heterogeneity nature of V2X systems with strong robustness against various noises. Moreover, the proposed model achieves state-of-the-art performance on the challenging cooperative detection task.
我们提出了第一个用于 V2X 感知的统一 Transformer 架构（V2X-ViT），该架构能够捕捉 V2X 系统的异构性特征，并对各种噪声具有强大的鲁棒性。此外，所提出的模型在具有挑战性的协同检测任务上实现了最先进的性能。
•

We propose a novel heterogeneous multi-agent attention module (HMSA) tailored for adaptive information fusion between heterogeneous agents.
我们提出了一种针对异构代理自适应信息融合的新型异构多代理注意力模块（HMSA）。
•

We present a new multi-scale window attention module (MSwin) that simultaneously captures local and global spatial feature interactions in parallel.
我们提出了一种新的多尺度窗口注意力模块（MSwin），该模块能够并行捕捉局部和全局空间特征交互。
•

We construct V2XSet, a new large-scale open simulation dataset for V2X perception, which explicitly accounts for imperfect real-world conditions.
我们构建了 V2XSet，一个针对 V2X 感知的新的大型开放式模拟数据集，该数据集明确考虑了不完美的现实世界条件。

2 Related work 2 相关工作

V2X perception. Cooperative perception studies how to efficiently fuse visual cues from neighboring agents. Based on its message sharing strategy, it can be divided into 3 categories: 1) early fusion [6] where raw data is shared and gathered to form a holistic view, 2) intermediate fusion [45, 50, 41, 5] where intermediate neural features are extracted based on each agent’s observation and then transmitted, and 3) late fusion [33, 34] where detection outputs (e.g., 3D bounding box position, confidence score) are circulated. As early fusion usually requires large transmission bandwidth and late fusion fails to provide valuable scenario context [45], intermediate fusion has attracted increasing attention because of its good balance between accuracy and transmission bandwidth. Several intermediate fusion methods have been proposed for V2V perception recently. OPV2V [50] implements a single-head self-attention module to fuse features, while F-Cooper employs maxout [15] fusion operation. V2VNet [45] proposes a spatial-aware message passing mechanism to jointly reason detection and prediction. To attenuate outlier messages, [41] regresses vehicles’ localization errors with consistent pose constraints. DiscoNet [25] leverages knowledge distillation to enhance training by constraining the corresponding features to the ones from the network for early fusion. However, intermediate fusion for V2X is still in its infancy. Most V2X methods explored late fusion strategies to aggregate information from infrastructure and vehicles. For example, a late fusion two-level Kalman filter is proposed by [31] for roadside infrastructure failure conditions. Xiangmo et al. [58] propose fusing the lane mark detection from infrastructure and vehicle sensors, leveraging Dempster-Shafer theory to model the uncertainty.
V2X 感知。协同感知研究如何高效融合邻近代理的视觉线索。基于其消息共享策略，可以分为 3 类：1）早期融合[6]，即共享和收集原始数据以形成整体视图；2）中间融合[45, 50, 41, 5]，基于每个代理的观察提取中间神经特征然后传输；3）晚期融合[33, 34]，检测输出（例如，3D 边界框位置、置信度分数）被循环。由于早期融合通常需要较大的传输带宽，而晚期融合无法提供有价值的场景上下文[45]，因此中间融合因其准确性和传输带宽之间的良好平衡而越来越受到关注。最近提出了几种中间融合方法用于 V2V 感知。OPV2V[50]实现了一个单头自注意力模块以融合特征，而 F-Cooper 采用 maxout[15]融合操作。V2VNet[45]提出了一种空间感知的消息传递机制以联合推理检测和预测。为了减弱异常消息，[41]通过一致的姿态约束对车辆的定位误差进行回归。DiscoNet [25]利用知识蒸馏通过约束对应特征到网络中的早期融合来增强训练。然而，V2X 的中级融合仍处于起步阶段。大多数 V2X 方法探索了晚期融合策略以聚合来自基础设施和车辆的信息。例如，[31]提出了一种晚期融合的两级卡尔曼滤波器，用于路边基础设施故障条件。Xiangmo 等人[58]提出融合来自基础设施和车辆传感器的车道标记检测，利用 Dempster-Shafer 理论来建模不确定性。

LiDAR-based 3D object detection. Numerous methods have been explored to extract features from raw points, voxels, bird-eye-view (BEV) images, and their mixtures. PointRCNN [36] proposes a two-stage strategy based on raw point clouds, which learns rough estimation in the first stage and then refines it with semantic attributes. The authors of [60, 51] propose to split the space into voxels and produce features per voxel. Despite having high accuracy, their inference speed and memory consumption are difficult to optimize due to reliance on 3D convolutions. To avoid computationally expensive 3D convolutions, [22, 52] propose an efficient BEV representation. To satisfy both computational and flexible receptive field requirements, [35, 59, 53] combine voxel-based and point-based approaches to detect 3D objects.
基于激光雷达的 3D 目标检测。已经探索了许多从原始点、体素、鸟瞰图（BEV）图像及其混合中提取特征的方法。PointRCNN [36] 提出了一种基于原始点云的两阶段策略，第一阶段学习粗略估计，然后通过语义属性进行细化。文献[60, 51]的作者提出将空间分割成体素，并为每个体素生成特征。尽管具有高精度，但由于依赖于 3D 卷积，它们的推理速度和内存消耗难以优化。为了避免计算昂贵的 3D 卷积，[22, 52] 提出了一种高效的 BEV 表示。为了满足计算和灵活感受野的要求，[35, 59, 53] 将基于体素和基于点的策略结合起来以检测 3D 目标。

Transformers in vision. The Transformer [43] is first proposed for machine translation [43], where multi-head self-attention and feed-forward layers are stacked to capture long-range interactions between words. Dosovitskiy et al. [9] present a Vision Transformer (ViT) for image recognition by regarding image patches as visual words and directly applying self-attention. The full self-attention in ViT [43, 9, 13], despite having global interaction, suffers from heavy computational complexity and does not scale to long-range sequences or high-resolution images. To ameliorate this issue, numerous methods have introduced locality into self-attention, such as Swin [30], CSwin [8], Twins [7], window [46, 39], and sparse attention [42, 40, 49]. A hierarchical architecture is usually adopted to progressively increase the receptive fields for capturing longer dependencies.
视觉中的 Transformer。Transformer [43] 首先被提出用于机器翻译 [43]，其中多头自注意力和前馈层被堆叠以捕捉词语之间的长距离交互。Dosovitskiy 等人 [9] 提出了一种视觉 Transformer（ViT）用于图像识别，通过将图像块视为视觉词并直接应用自注意力。尽管 ViT [43, 9, 13] 中的全自注意力具有全局交互，但计算复杂度高，且无法扩展到长距离序列或高分辨率图像。为了解决这个问题，许多方法引入了局部性到自注意力中，如 Swin [30]、CSwin [8]、Twins [7]、窗口 [46, 39] 和稀疏注意力 [42, 40, 49]。通常采用分层架构来逐步增加感受野以捕捉更长的依赖关系。

While these vision transformers have proven efficient in modeling homogeneous structured data, their efficacy to represent heterogeneous graphs has been less studied. One of the developments related to our work is the heterogeneous graph transformer (HGT) [18]. HGT was originally designed for web-scale Open Academic Graph where the nodes are text and attributes. Inspired by HGT, we build a customized heterogeneous multi-head self-attention module tailored for graph attribute-aware multi-agent 3D visual feature fusion, which is able to capture the heterogeneity of V2X systems.
尽管这些视觉转换器在建模同质结构化数据方面已被证明是高效的，但它们在表示异构图方面的有效性研究较少。与我们工作相关的一项发展是异构图转换器（HGT）[18]。HGT 最初是为网页规模的开放学术图设计的，其中节点是文本和属性。受 HGT 的启发，我们构建了一个定制的异构多头自注意力模块，专门用于图属性感知多智能体 3D 视觉特征融合，能够捕捉 V2X 系统的异构性。

3 Methodology 3 方法论

In this paper, we consider V2X perception as a heterogeneous multi-agent perception system, where different types of agents (i.e., smart infrastructure and AVs) perceive the surrounding environment and communicate with each other. To simulate real-world scenarios, we assume that all the agents have imperfect localization and time delay exists during feature transmission. Given this, our goal is to develop a robust fusion system to enhance the vehicle’s perception capability and handle these aforementioned challenges in a unified end-to-end fashion. The overall architecture of our framework is illustrated in Fig. 2, which includes five major components: 1) metadata sharing, 2) feature extraction, 3) compression and sharing, 4) V2X vision Transformer, and 5) a detection head.
在这篇论文中，我们将 V2X 感知视为一个异构多智能体感知系统，其中不同类型的智能体（即智能基础设施和自动驾驶汽车）感知周围环境并相互通信。为了模拟现实场景，我们假设所有智能体具有不完美的定位，并且在特征传输过程中存在时间延迟。鉴于这种情况，我们的目标是开发一个鲁棒的融合系统，以增强车辆的感知能力，并以统一端到端的方式处理上述挑战。我们的框架的整体架构如图 2 所示，包括五个主要组件：1）元数据共享，2）特征提取，3）压缩和共享，4）V2X 视觉 Transformer，以及 5）检测头。

3.1 Main architecture design
3.1 主要架构设计

V2X metadata sharing. During the early stage of collaboration, every agent $i\in\{1\dots N\}$ within the communication networks shares metadata such as poses, extrinsics, and agent type $c_{i}\in\{I,V\}$ (meaning infrastructure or vehicle) with each other. We select one of the connected AVs as the ego vehicle ( $e$ ) to construct a V2X graph around it where the nodes are either AVs or infrastructure and the edges represent directional V2X communication channels. To be more specific, we assume the transmission of metadata is well-synchronized, which means each agent $i$ can receive ego pose $x_{e}^{t_{i}}$ at the time $t_{i}$ . Upon receiving the pose of the ego vehicle, all the other connected agents nearby will project their own LiDAR point clouds to the ego-vehicle’s coordinate frame before feature extraction.
V2X 元数据共享。在协作的早期阶段，通信网络中的每个代理 $i\in\{1\dots N\}$ 相互共享元数据，如姿态、外参和代理类型 $c_{i}\in\{I,V\}$ （表示基础设施或车辆）。我们选择一个连接的 AV 作为自我车辆（ $e$ ），围绕它构建一个 V2X 图，其中节点是 AV 或基础设施，边代表方向性的 V2X 通信通道。更具体地说，我们假设元数据的传输是良好同步的，这意味着每个代理 $i$ 可以在时间 $t_{i}$ 接收到自我姿态 $x_{e}^{t_{i}}$ 。在接收到自我车辆的姿态后，所有其他附近的连接代理将投影自己的激光雷达点云到自我车辆的坐标系中，然后再进行特征提取。

Feature extraction. We leverage the anchor-based PointPillar method [22] to extract visual features from point clouds because of its low inference latency and optimized memory usage [50]. The raw point clouds will be converted to a stacked pillar tensor, then scattered to a 2D pseudo-image and fed to the PointPillar backbone. The backbone extracts informative feature maps $\mathbf{F}_{i}^{t_{i}}\in\mathbb{R}^{H\times W\times C}$ , denoting agent $i$ ’s feature at time $t_{i}$ with height $H$ , width $W$ , and channels $C$ .
特征提取。我们利用基于锚点的 PointPillar 方法[22]从点云中提取视觉特征，因为它具有低推理延迟和优化的内存使用[50]。原始点云将被转换为堆叠的柱状张量，然后散布到 2D 伪图像中，并输入到 PointPillar 骨干网络。骨干网络提取信息特征图 $\mathbf{F}_{i}^{t_{i}}\in\mathbb{R}^{H\times W\times C}$ ，表示代理 $i$ 在时间 $t_{i}$ 的高度 $H$ 、宽度 $W$ 和通道 $C$ 的特征。

Compression and sharing. To reduce the required transmission bandwidth, we utilize a series of $1\times 1$ convolutions to progressively compress the feature maps along the channel dimension. The compressed features with the size $(H,W,C^{\prime})$ (where $C^{\prime}\ll C$ ) are then transmitted to the ego vehicle ( $e$ ), on which the features are projected back to $(H,W,C)$ using $1\times 1$ convolutions.
压缩与共享。为了减少所需的传输带宽，我们利用一系列卷积来逐步压缩沿通道维度的特征图。然后，将尺寸为 $(H,W,C^{\prime})$ （其中 $C^{\prime}\ll C$ ）的压缩特征传输到 ego 车辆（ $e$ ），在该车辆上，使用 $1\times 1$ 卷积将特征投影回 $(H,W,C)$ 。

There exists an inevitable time gap between the time when the LiDAR data is captured by connected agents and when the extracted features are received by the ego vehicle (details see appendix). Thus, features collected from surrounding agents are often temporally misaligned with the features captured on the ego vehicle. To correct this delay-induced global spatial misalignment, we need to transform (i.e., rotate and translate) the received features to the current ego-vehicle’s pose. Thus, we leverage a spatial-temporal correction module (STCM), which employs a differential transformation and sampling operator $\mathbf{\Gamma_{\xi}}$ to spatially warp the feature maps [19]. An ROI mask is also calculated to prevent the network from paying attention to the padded zeros caused by the spatial warp.
存在连接代理捕获激光雷达数据与提取的特征被自我车辆接收之间的不可避免的时间差（详情见附录）。因此，从周围代理收集的特征通常与自我车辆上捕获的特征在时间上不一致。为了纠正由延迟引起的全局空间错位，我们需要将接收到的特征转换（即旋转和平移）到当前自我车辆的姿态。因此，我们利用一个时空校正模块（STCM），该模块采用微分变换和采样算子 $\mathbf{\Gamma_{\xi}}$ 对特征图进行空间扭曲[19]。还计算了一个感兴趣区域掩码，以防止网络关注由空间扭曲引起的填充零。

V2X-ViT. The intermediate features $\mathbf{H}_{i}=\mathbf{\Gamma_{\xi}}\left(\mathbf{F}_{i}^{t_{i}}\right)\in\mathbb{R}^{H\times W\times C}$ aggregated from connected agents are fed into the major component of our framework i.e., V2X-ViT to conduct an iterative inter-agent and intra-agent feature fusion using self-attention mechanisms. We maintain the feature maps in the same level of high resolution throughout the entire Transformer as we have observed that the absence of high-definition features greatly harms the objection detection performance. The details of our proposed V2X-ViT will be unfolded in Sec. 3.2.
V2X-ViT。从连接的代理中聚合的中间特征 $\mathbf{H}_{i}=\mathbf{\Gamma_{\xi}}\left(\mathbf{F}_{i}^{t_{i}}\right)\in\mathbb{R}^{H\times W\times C}$ 被输入到我们框架的主要组件，即 V2X-ViT，以使用自注意力机制进行迭代式代理间和代理内特征融合。我们在整个 Transformer 中保持特征图在同一级别的高分辨率，因为我们观察到缺乏高清特征会严重损害目标检测性能。我们提出的 V2X-ViT 的细节将在第 3.2 节中展开。

Detection head. After receiving the final fused feature maps, we apply two $1\times 1$ convolution layers for box regression and classification. The regression output is $(x,y,z,w,l,h,\theta)$ , denoting the position, size, and yaw angle of the predefined anchor boxes, respectively. The classification output is the confidence score of being an object or background for each anchor box. We use the smooth $\ell_{1}$ loss for regression and a focal loss [28] for classification.
检测头。在接收最终的融合特征图后，我们应用两个卷积层进行框回归和分类。回归输出为 $(x,y,z,w,l,h,\theta)$ ，分别表示预定义锚框的位置、大小和偏航角。分类输出是每个锚框作为对象或背景的置信度分数。我们使用平滑 $\ell_{1}$ 损失进行回归，并使用焦点损失[28]进行分类。

3.2 V2X-Vision Transformer
3.2V2X-视觉 Transformer

Our goal is to design a customized vision Transformer that can jointly handle the common V2X challenges. Firstly, to effectively capture the heterogeneous graph representation between infrastructure and AVs, we build a heterogeneous multi-agent self-attention module that learns different relationships based on node and edge types. Moreover, we propose a novel spatial attention module, namely multi-scale window attention (MSwin), that captures long-range interactions at various scales. MSwin uses multiple window sizes to aggregate spatial information, which greatly improves the detection robustness against localization errors. Lastly, these two attention modules are integrated into a single V2X-ViT block in a factorized manner (illustrated in Fig. 3a), enabling us to maintain high-resolution features throughout the entire process. We stack a series of V2X-ViT blocks to iteratively learn inter-agent interaction and per-agent spatial attention, leading to a robust aggregated feature representation for detection.
我们的目标是设计一种定制的视觉 Transformer，能够联合处理常见的 V2X 挑战。首先，为了有效地捕捉基础设施和自动驾驶汽车之间的异构图表示，我们构建了一个异构多智能体自注意力模块，该模块根据节点和边类型学习不同的关系。此外，我们提出了一种新颖的空间注意力模块，即多尺度窗口注意力（MSwin），它能够在各种尺度上捕捉长距离交互。MSwin 使用多个窗口大小来聚合空间信息，这大大提高了对定位错误的检测鲁棒性。最后，这两个注意力模块以分解的方式集成到一个单一的 V2X-ViT 块中（如图 3a 所示），使我们能够在整个过程中保持高分辨率特征。我们将一系列 V2X-ViT 块堆叠起来，迭代地学习智能体间的交互和每个智能体的空间注意力，从而为检测生成鲁棒的聚合特征表示。

3.2.1 Heterogeneous multi-agent self-attention
3.2.1 异构多智能体自注意力

The sensor measurements captured by infrastructure and AVs possibly have distinct characteristics. The infrastructure’s LiDAR is often installed at a higher position with less occlusion and different view angles. In addition, the sensors may have different levels of sensor noise due to maintenance frequency, hardware quality etc. To encode this heterogeneity, we build a novel heterogeneous multi-agent self-attention (HMSA) where we attach types to both nodes and edges in the directed graph. To simplify the graph structure, we assume the sensor setups among the same category of agents are identical. As shown in Fig. 3b, we have two types of nodes and four types of edges, i.e., node type $c_{i}\in\{I,V\}$ and edge type $\phi\left(e_{ij}\right)\in\{V\!-\!V,V\!-\!I,I\!-\!V,I\!-\!I\}$ . Note that unlike traditional attention where the node features are treated as a vector, we only reason the interaction of features in the same spatial position from different agents to preserve spatial cues. Formally, HSMA is expressed as:
传感器由基础设施和自动驾驶汽车捕获的测量可能具有不同的特征。基础设施的激光雷达通常安装在较高的位置，遮挡较少，视角不同。此外，由于维护频率、硬件质量等因素，传感器可能具有不同级别的噪声。为了编码这种异质性，我们构建了一个新颖的异构多智能体自注意力（HMSA）模型，其中我们在有向图中的节点和边都附加了类型。为了简化图结构，我们假设同一类智能体之间的传感器设置是相同的。如图 3b 所示，我们有两种节点类型和四种边类型，即节点类型 $c_{i}\in\{I,V\}$ 和边类型 $\phi\left(e_{ij}\right)\in\{V\!-\!V,V\!-\!I,I\!-\!V,I\!-\!I\}$ 。请注意，与传统注意力不同，其中节点特征被视为向量，我们只推理来自不同智能体的同一空间位置的特征交互，以保留空间线索。形式上，HSMA 表示为：

\mathbf{H}_{i}=\underset{\forall j\in N\left(i\right)}{\mathsf{Dense}_{c_{i}}}\left(\mathbf{ATT}\left(i,j\right)\cdot\mathbf{MSG}\left(i,j\right)\right)

(1)

which contains 3 operators: a linear aggregator $\mathsf{Dense}_{c_{i}}$ , attention weights estimator $\mathbf{ATT}$ , and message aggregator $\mathbf{MSG}$ . The $\mathsf{Dense}$ is a set of linear projectors indexed by the node type $c_{i}$ , aggregating multi-head information. $\mathbf{ATT}$ calculates the importance weights between pairs of nodes conditioned on the associated node and edge types:
包含 3 个算子：线性聚合器 $\mathsf{Dense}_{c_{i}}$ 、注意力权重估计器 $\mathbf{ATT}$ 和消息聚合器 $\mathbf{MSG}$ 。 $\mathsf{Dense}$ 是一组按节点类型 $c_{i}$ 索引的线性投影器，聚合多头信息。 $\mathbf{ATT}$ 计算在关联节点和边类型条件下的节点对之间的重要性权重：

$\displaystyle{\mathbf{ATT}}\left(i,j\right)$	$\displaystyle=\underset{\forall j\in N\left(i\right)}{\mathsf{softmax}}\left(\underset{m\in[1,h]}{\\|}\text{head}^{m}_{\mathrm{ATT}}\left(i,j\right)\right)$	(2)
$\displaystyle\text{head}^{m}_{\mathrm{ATT}}\left(i,j\right)$	$\displaystyle=\left(\mathbf{K}^{m}\left(j\right)\mathbf{W}_{\phi\left(e_{ij}\right)}^{m,\mathrm{ATT}}\mathbf{Q}^{m}\left(i\right)^{T}\right)\frac{1}{\sqrt{C}}$	(3)
$\displaystyle\mathbf{K}^{m}\left(j\right)$	$\displaystyle=\mathsf{Dense}^{m}_{c_{j}}\left({\mathbf{H}_{j}}\right)$	(4)
$\displaystyle\mathbf{Q}^{m}\left(i\right)$	$\displaystyle=\mathsf{Dense}^{m}_{c_{i}}\left({\mathbf{H}_{i}}\right)$	(5)

where $\|$ denotes concatenation, $m$ is the current head number and $h$ is the total number of heads. Notice that $\mathsf{Dense}$ here is indexed by both node type $c_{i/j}$ , and head number $m$ . The linear layers in K and Q have distinct parameters. To incorporate the semantic meaning of edges, we calculate the dot product between Query and Key vectors weighted by a matrix $\mathbf{W}_{\phi\left(e_{ij}\right)}^{m,\mathrm{ATT}}\in\mathbb{R}^{C\times C}$ . Similarly, when parsing messages from the neighboring agent, we embed infrastructure and vehicle’s features separately via $\mathsf{Dense}_{c_{j}}^{m}$ . A matrix $\mathbf{W}_{\phi\left(e_{ij}\right)}^{m,\mathrm{MSG}}$ is used to project the features based on the edge type between source node and target node:
$\|$ 表示连接， $m$ 是当前头编号， $h$ 是头总数。注意，此处 $\mathsf{Dense}$ 由节点类型 $c_{i/j}$ 和头编号 $m$ 双重索引。K 和 Q 中的线性层具有不同的参数。为了融入边的语义意义，我们计算由矩阵 $\mathbf{W}_{\phi\left(e_{ij}\right)}^{m,\mathrm{ATT}}\in\mathbb{R}^{C\times C}$ 加权的 Query 和 Key 向量的点积。同样，在解析来自邻近代理的消息时，我们通过 $\mathsf{Dense}_{c_{j}}^{m}$ 分别嵌入基础设施和车辆的特征。矩阵 $\mathbf{W}_{\phi\left(e_{ij}\right)}^{m,\mathrm{MSG}}$ 用于根据源节点和目标节点之间的边类型投影特征：

	$\displaystyle\mathbf{MSG}\left(i,j\right)$	$\displaystyle=\underset{m\in[1,h]}{\\|}\text{head}^{m}_{\mathrm{MSG}}\left(i,j\right)$		(6)
	$\displaystyle\text{head}^{m}_{\mathrm{MSG}}\left(i,j\right)$	$\displaystyle=\mathsf{Dense}^{m}_{c_{j}}\left(\mathbf{H}_{j}\right)\mathbf{W}_{\phi\left(e_{ij}\right)}^{m,\mathrm{MSG}}.$		(7)

3.2.2 Multi-scale window attention
3.2.2 多尺度窗口注意力

We present a new type of attention mechanism tailored for efficient long-range spatial interaction on high-resolution detection, called multi-scale window attention (MSwin). It uses a pyramid of windows, each of which caps a different attention range, as illustrated in Fig. 3c. The usage of variable window sizes can greatly improve the detection robustness of V2X-ViT against localization errors (see ablation study in Fig. 5b). Attention performed within larger windows can capture long-range visual cues to compensate for large localization errors, whereas smaller window branches perform attention at finer scales to preserve local context. Afterward, the split-attention module [56] is used to adaptively fuse information coming from multiple branches, empowering MSwin to handle a range of pose errors. Note that MSwin is applied on each agent independently without considering any inter-agent fusion; therefore we omit the agent subscript in this subsection for simplicity.
我们提出了一种针对高分辨率检测中高效长距离空间交互的新型注意力机制，称为多尺度窗口注意力（MSwin）。它使用一个窗口金字塔，每个窗口覆盖不同的注意力范围，如图 3c 所示。使用可变窗口大小可以显著提高 V2X-ViT 对定位错误的检测鲁棒性（参见图 5b 中的消融研究）。在较大窗口内执行的注意力可以捕捉长距离视觉线索以补偿大的定位错误，而较小的窗口分支则在更精细的尺度上执行注意力以保留局部上下文。随后，使用分割注意力模块[56]自适应地融合来自多个分支的信息，使 MSwin 能够处理一系列姿态错误。请注意，MSwin 独立应用于每个智能体，不考虑任何智能体间融合；因此，为了简化，本小节中省略了智能体下标。

Formally, let $\mathbf{H}\in\mathbb{R}^{H\times W\times C}$ be an input feature map of a single agent. In branch $j$ out of $k$ parallel branches, $\mathbf{H}$ is partitioned using window size $P_{j}\!\times\!P_{j}$ , into a tensor of shape $(\frac{H}{P_{j}}\times\frac{W}{P_{j}},P_{j}\times P_{j},C)$ , which represents a $\frac{H}{P_{j}}\times\frac{W}{P_{j}}$ grid of non-overlapping patches each with size $P_{j}\times P_{j}$ . We use $h_{j}$ number of heads to improve the attention power at $j$ -th branch. More detailed formulation can be found in Appendix. Following [30, 17], we also consider an additional relative positional encoding $\mathbf{B}$ that acts as a bias term added to the attention map. As the relative position along each axis lies in the range $[-P_{j}+1,P_{j}-1]$ , we take $\mathbf{B}$ from a parameterized matrix $\hat{\mathbf{B}}\in\mathbb{R}^{(2P_{j}-1)\times(2P_{j}-1)}$ .
形式上，令 $\mathbf{H}\in\mathbb{R}^{H\times W\times C}$ 为单个代理的输入特征图。在 $k$ 并行分支中的 $j$ 分支， $\mathbf{H}$ 使用窗口大小 $P_{j}\!\times\!P_{j}$ 进行划分，形成一个形状为 $(\frac{H}{P_{j}}\times\frac{W}{P_{j}},P_{j}\times P_{j},C)$ 的张量，该张量代表一个大小为 $P_{j}\times P_{j}$ 的非重叠补丁的 $\frac{H}{P_{j}}\times\frac{W}{P_{j}}$ 网格。我们使用 $h_{j}$ 个头在 $j$ 分支处提高注意力能力。更详细的公式可以在附录中找到。遵循[30, 17]，我们还考虑了一个额外的相对位置编码 $\mathbf{B}$ ，它作为添加到注意力图中的偏置项。由于每个轴上的相对位置位于范围 $[-P_{j}+1,P_{j}-1]$ 内，我们从参数化矩阵 $\hat{\mathbf{B}}\in\mathbb{R}^{(2P_{j}-1)\times(2P_{j}-1)}$ 中取 $\mathbf{B}$ 。

To attain per-agent multi-range spatial relationship, each branch partitions input tensor $\mathbf{H}$ with different window sizes i.e. $\{P_{j}\}_{j=1}^{k}=\{P,2P,...,kP\}$ . We progressively decrease the number of heads when using a larger window size to save memory usage. Finally, we fuse the features from all the branches by a Split-Attention module [56], yielding the output feature $\mathbf{Y}$ . The complexity of the proposed MSwin is linear to image size $HW$ , while enjoying long-range multi-scale receptive fields and adaptively fuses both local and (sub)-global visual hints in parallel. Notably, unlike Swin Transformer [30], our multi-scale window approach requires no masking, padding, or cyclic-shifting, making it more efficient in implementations while having larger-scale spatial interactions.
为了实现每个代理的多范围空间关系，每个分支使用不同的窗口大小对输入张量 $\mathbf{H}$ 进行划分，即 $\{P_{j}\}_{j=1}^{k}=\{P,2P,...,kP\}$ 。在采用更大的窗口大小时，我们逐步减少头数以节省内存使用。最后，通过 Split-Attention 模块[56]融合所有分支的特征，得到输出特征 $\mathbf{Y}$ 。所提出的 MSwin 的复杂度与图像大小 $HW$ 线性相关，同时享有长距离多尺度感受野，并能够并行自适应融合局部和（亚）全局视觉线索。值得注意的是，与 Swin Transformer[30]不同，我们的多尺度窗口方法不需要掩码、填充或循环移位，这使得它在实现上更加高效，同时具有更大尺度的空间交互。

3.2.3 Delay-aware positional encoding
3.2.3 延迟感知位置编码

Although the global misalignment is captured by the spatial warping matrix $\Gamma_{\xi}$ , another type of local misalignment, arising from object motions during the delay-induced time lag, also needs to be considered. To encode this temporal information, we leverage an adaptive delay-aware positional encoding (DPE), composed of a linear projection and a learnable embedding. We initialize it with sinusoid functions conditioned on time delay $\Delta t_{i}$ and channel $c\in[1,C]$ :
尽管全局错位被空间扭曲矩阵 $\Gamma_{\xi}$ 所捕捉，但还需要考虑另一种类型的局部错位，这种错位源于延迟诱导的时间滞后期间的对象运动。为了编码这种时间信息，我们利用自适应延迟感知位置编码（DPE），它由一个线性投影和一个可学习的嵌入组成。我们使用基于时间延迟 $\Delta t_{i}$ 和通道 $c\in[1,C]$ 的余弦函数对其进行初始化：

\mathbf{p}_{c}\left(\Delta t_{i}\right)=\begin{cases}\sin\left(\Delta t_{i}/10000^{\frac{2c}{C}}\right),&c=2k\\ \cos\left(\Delta t_{i}/10000^{\frac{2c}{C}}\right),&c=2k+1\\ \end{cases}

(8)

A linear projection $f:\mathbb{R}^{C}\rightarrow\mathbb{R}^{C}$ will further warp the learnable embedding so it can generalize better for unseen time delay [18]. We add this projected embedding to each agents’ feature $\mathbf{H}_{i}$ before feeding into the Transformer so that the features are temporally aligned beforehand.
线性投影 $f:\mathbb{R}^{C}\rightarrow\mathbb{R}^{C}$ 将进一步扭曲可学习的嵌入，使其能更好地泛化未见的时间延迟[18]。我们在将特征 $\mathbf{H}_{i}$ 输入到 Transformer 之前，将这个投影嵌入添加到每个代理的特征中，以便事先进行时间对齐。

	$\displaystyle\text{DPE}\left(\Delta t_{i}\right)$	$\displaystyle=f\left(\mathbf{p}\left(\Delta t_{i}\right)\right)$		(9)
	$\displaystyle\mathbf{H}_{i}$	$\displaystyle=\mathbf{H}_{i}+\text{DPE}\left(\Delta t_{i}\right)$		(10)

4 Experiments 4 实验

4.1 V2XSet: An open dataset for V2X cooperative perception
4.1V2XSet：一个用于 V2X 协同感知的开源数据集

To the best of our knowledge, no fully public V2X perception dataset exists suitable for investigating common V2X challenges such as localization error and transmission time delay. DAIR-V2X [2] is a large-scale real-world V2I dataset without V2V cooperation. V2X-Sim [24] is an open V2X simulated dataset but does not simulate noisy settings and only contains a single road type. OPV2V [50] contains more road types but are restricted to V2V cooperation. To this end, we collect a new large-scale dataset for V2X perception that explicitly considers these real-world noises during V2X collaboration using CARLA [10] and OpenCDA [48] together. In total, there are 11,447 frames in our dataset (33,081 samples if we count frames per agent in the same scene), and the train/validation/test splits are 6,694/1,920/2,833, respectively. Compared with existing datasets, V2XSet incorporates both V2X cooperation and realistic noise simulation. Please refer to the supplementary material for more details.
据我们所知，目前尚无适用于研究定位误差和传输时间延迟等常见 V2X 挑战的完全公开的 V2X 感知数据集。DAIR-V2X[2]是一个无 V2V 合作的规模化真实世界 V2I 数据集。V2X-Sim[24]是一个开放的 V2X 模拟数据集，但不模拟噪声环境，且仅包含单一道路类型。OPV2V[50]包含更多道路类型，但仅限于 V2V 合作。为此，我们收集了一个新的大规模 V2X 感知数据集，该数据集明确考虑了在 V2X 协作中使用 CARLA[10]和 OpenCDA[48]时出现的这些真实世界噪声。总共，我们的数据集中有 11,447 帧（如果按同一场景中每个代理的帧数计算，则为 33,081 个样本），训练/验证/测试分割分别为 6,694/1,920/2,833。与现有数据集相比，V2XSet 结合了 V2X 合作和现实噪声模拟。更多详情请参阅补充材料。

4.2 Experimental setup 4.2 实验设置

Th evaluation range in $x$ and $y$ direction are $[-140,140]$ m and $[-40,40]$ m respectively. We assess models under two settings: 1) Perfect Setting, where the pose is accurate, and everything is synchronized across agents; 2) Noisy Setting, where pose error and time delay are both considered. In the Noisy Setting, the positional and heading noises of the transmitter are drawn from a Gaussian distribution with a default standard deviation of 0.2 m and 0.2° respectively, following the real-world noise levels [47, 26, 1]. The time delay is set to 100 ms for all the evaluated models to have a fair comparison of their robustness against asynchronous message propagation.
评估范围在 $x$ 和 $y$ 方向分别为 $[-140,140]$ 米和 $[-40,40]$ 米。我们在两种设置下评估模型：1）完美设置，姿态准确，所有代理之间同步；2）噪声设置，考虑姿态误差和时间延迟。在噪声设置中，发射器的位置噪声和航向噪声分别从标准差为 0.2 米和 0.2°的高斯分布中抽取，遵循现实世界的噪声水平[47, 26, 1]。时间延迟设置为 100 毫秒，以确保所有评估模型在异步消息传播的鲁棒性方面有公平的比较。

Evaluation metrics. The detection performance is measured with Average Precisions (AP) at Intersection-over-Union (IoU) thresholds of 0.5 and 0.7. In this work, we focus on LiDAR-based vehicle detection. Vehicles hit by at least one LiDAR point from any connected agent will be included as evaluation targets.
评估指标。检测性能通过在交并比（IoU）阈值为 0.5 和 0.7 的平均精度（AP）进行衡量。在本工作中，我们专注于基于激光雷达的车辆检测。任何连接的代理至少被一个激光雷达点击中的车辆将被包括在评估目标中。

Implementation details. During training, a random AV is selected as the ego vehicle, while during testing, we evaluate on a fixed ego vehicle for all the compared models. The communication range of each agent is set as 70 m based on [20], whereas all the agents out of this broadcasting radius of ego vehicle is ignored. For the PointPillar backbone, we set the voxel resolution to 0.4 m for both height and width. The default compression rate is 32 for all intermediate fusion methods. Our V2X-ViT has 3 encoder layers with 3 window sizes in MSwin: 4, 8, and 16. We first train the model under the Perfect Setting, then fine-tune it for Noisy Setting. We adopt Adam optimizer [21] with an initial learning rate of $10^{-3}$ and steadily decay it every 10 epochs using a factor of 0.1. All models are trained on Tesla V100.
实施细节。在训练过程中，随机选择一个 AV 作为 ego 车辆，而在测试过程中，我们对所有比较的模型使用固定的 ego 车辆进行评估。根据[20]，每个代理的通信范围设置为 70 米，而所有超出 ego 车辆广播半径的代理都被忽略。对于 PointPillar 骨干网络，我们将体素分辨率设置为 0.4 米，用于高度和宽度。所有中间融合方法的默认压缩率为 32。我们的 V2X-ViT 在 MSwin 中有 3 个编码器层，3 个窗口大小：4、8 和 16。我们首先在完美设置下训练模型，然后对其进行微调以适应噪声设置。我们采用 Adam 优化器[21]，初始学习率为 $10^{-3}$ ，每 10 个 epoch 稳定衰减，衰减因子为 0.1。所有模型均在 Tesla V100 上训练。

Compared methods. We consider No Fusion as our baseline, which only uses ego-vehicle’s LiDAR point clouds. We also compare with Late Fusion, which gathers all detected outputs from agents and applies Non-maximum suppression to produce the final results, and Early Fusion, which directly aggregates raw LiDAR point clouds from nearby agents. For intermediate fusion strategy, we evaluate four state-of-the-art approaches: OPV2V [50], F-Cooper [5] V2VNet [45], and DiscoNet [25]. For a fair comparison, all the models use PointPillar as the backbone, and every compared V2V methods also receive infrastructure data, but they do not distinguish between infrastructure and vehicles.
与比较方法。我们将“无融合”作为基线，它仅使用 ego-vehicle 的 LiDAR 点云。我们还与“后期融合”进行比较，后者收集来自代理的所有检测输出并应用非极大值抑制以产生最终结果，以及“早期融合”，它直接聚合来自附近代理的原始 LiDAR 点云。对于中间融合策略，我们评估了四种最先进的方法：OPV2V [50]、F-Cooper [5] V2VNet [45]和 DiscoNet [25]。为了公平比较，所有模型都使用 PointPillar 作为骨干网络，并且每个比较的 V2V 方法也接收基础设施数据，但它们不区分基础设施和车辆。

Table 1: 3D detection performance comparison on V2XSet. We show Average Precision (AP) at IoU=0.5, 0.7 on Perfect and Noisy settings, respectively.
表 1：在 V2XSet 上的 3D 检测性能比较。我们分别展示了在完美和无噪声设置下，IoU=0.5 和 0.7 的平均精度（AP）。

	Perfect 完美		Noisy 嘈杂
Models 模型	AP0.5	AP0.7	AP0.5	AP0.7
No Fusion 无融合	0.606	0.402	0.606	0.402
Late Fusion 晚期融合	0.727	0.620	0.549	0.307
Early Fusion 早期融合	0.819	0.710	0.720	0.384
F-Cooper [5] F-Cooper [5]	0.840	0.680	0.715	0.469
OPV2V [50]	0.807	0.664	0.709	0.487
V2VNet [45] V2VNet [45]	0.845	0.677	0.791	0.493
DiscoNet [25] 迪斯科网[25]	0.844	0.695	0.798	0.541
V2X-ViT (Ours) V2X-ViT（ ours）	0.882	0.712	0.836	0.614

4.3 Quantitative evaluation
4.3 定量评估

Main performance comparison. Tab. 1 shows the performance comparisons on both Perfect and Noisy Setting. Under the Perfect Setting, all the cooperative methods significantly outperform No Fusion baseline. Our proposed V2X-ViT outperforms SOTA intermediate fusion methods by 3.8%/1.7% for AP@0.5/0.7. It is even higher than the ideal Early fusion by 0.2% AP@0.7, which receives complete raw information. Under noisy setting, when localization error and time delay are considered, the performance of Early Fusion and Late Fusion drastically drop to 38.4% and 30.7% in AP@0.7, even worse than single-agent baseline No Fusion. Although OPV2V [50], F-Cooper [5] V2VNet [45], and DiscoNet [25] are still higher than No fusion, their performance decrease by 17.7%, 21.1%, 18.4% and 15.4% in AP@0.7, respectively. In contrast, V2X-ViT performs favorably against the No fusion method by a large margin, i.e. 23% and 21.2% higher in AP@0.5 and AP@0.7. Moreover, when compared to the Perfect Setting, V2X-ViT only drops by less than 5% and 10% in AP@0.5 and AP@0.7 under Noisy Setting, demonstrating its robustness against normal V2X noises. The real-time performance of V2X-ViT is also shown in Tab. 4. The inference time of V2X-ViT is 57 ms, and by using only 1 encoder layer, V2X-ViT_S can still beat DiscoNet while reaching only 28 ms inference time, which achieves real-time performance.
主要性能比较。表 1 显示了在完美和噪声设置下的性能比较。在完美设置下，所有合作方法均显著优于无融合基线。我们提出的 V2X-ViT 在 AP@0.5/0.7 上优于 SOTA 中间融合方法 3.8%/1.7%。它甚至比理想的早期融合高出 0.2% AP@0.7，后者接收完整的原始信息。在噪声设置下，当考虑定位误差和时间延迟时，早期融合和晚期融合的性能在 AP@0.7 中急剧下降至 38.4%和 30.7%，甚至不如单代理基线无融合。尽管 OPV2V [50]、F-Cooper [5]、V2VNet [45]和 DiscoNet [25]仍然高于无融合，但它们在 AP@0.7 中的性能分别下降了 17.7%、21.1%、18.4%和 15.4%。相比之下，V2X-ViT 在 AP@0.5 和 AP@0.7 上对无融合方法有大幅度的优势，即分别高出 23%和 21.2%。此外，与完美设置相比，V2X-ViT 在噪声设置下 AP@0.5 和 AP@0.7 的下降幅度均不到 5%和 10%，这证明了其对正常 V2X 噪声的鲁棒性。V2X-ViT 的实时性能也在表 4 中显示。 V2X-ViT 的推理时间为 57 毫秒，仅使用 1 个编码器层，V2X-ViT _S 仍能击败 DiscoNet，同时达到仅 28 毫秒的推理时间，实现了实时性能。

Sensitivity to localization error. To assess the models’ sensitivity to pose error, we sample noises from Gaussian distribution with standard deviation $\sigma_{xyz}\in[0,0.5]$ m, $\sigma_{\mathrm{heading}}\in[0\degree,1.0\degree$ ]. As Fig. 4 depicts, when the positional and heading errors stay within a normal range (i.e., $\sigma_{xyz}\leq 0.2m,\sigma_{heading}\leq 0.4\degree$ [1, 26, 47]), the performance of V2X-ViT only drops by less than 3%, whereas other intermediate fusion methods decrease at least 6%. Moreover, the accuracy of Early Fusion and Late Fusion degrade by nearly 20% in AP@0.7. When the noise is massive (e.g., 0.5 m and 1°std), V2X-ViT can still stay around 60% detection accuracy while the performance of other methods significantly degrades, showing the robustness of V2X-ViT against pose errors.
对定位误差的敏感性。为了评估模型对姿态误差的敏感性，我们从高斯分布中采样标准差为 $\sigma_{xyz}\in[0,0.5]$ m， $\sigma_{\mathrm{heading}}\in[0\degree,1.0\degree$ ]的噪声。如图 4 所示，当位置和航向误差保持在正常范围内（即 $\sigma_{xyz}\leq 0.2m,\sigma_{heading}\leq 0.4\degree$ [ 1, 26, 47]），V2X-ViT 的性能仅下降不到 3%，而其他中间融合方法至少下降 6%。此外，在 AP@0.7 下，早期融合和晚期融合的准确性下降近 20%。当噪声大量存在（例如，0.5 m 和 1°std）时，V2X-ViT 仍能保持在约 60%的检测精度，而其他方法的性能显著下降，显示出 V2X-ViT 对姿态误差的鲁棒性。

Time delay analysis. We further investigate the impact of time delay with range $[0,400]$ ms. As Fig. 4c shows, the AP of Late Fusion drops dramatically below No Fusion with only 100 ms delay. Early Fusion and other intermediate fusion methods are relatively less sensitive, but they still drop rapidly when delay keeps increasing and are all below the baseline after 400 ms. Our V2X-ViT, in contrast, exceeds No Fusion by 6.8% in AP@0.7 even under 400 ms delay, which is much larger than usual transmission delay in real-world system[38]. This clearly demonstrates its great robustness against time delay.
时间延迟分析。我们进一步研究了时间延迟对范围 $[0,400]$ ms 的影响。如图 4c 所示，延迟仅为 100 ms 时，晚期融合的 AP 值就显著低于无融合。早期融合和其他中间融合方法相对不太敏感，但当延迟持续增加时，它们仍然会迅速下降，并在 400 ms 后都低于基线。相比之下，我们的 V2X-ViT 即使在 400 ms 的延迟下，AP@0.7 也超过了无融合 6.8%，这比现实世界系统中通常的传输延迟要大得多[38]。这清楚地表明其对时间延迟的强大鲁棒性。

Table 2: Component ablation study. MSwin, SpAttn, HMSA, DPE represent adding i) multi-scale window attention, ii) split attention, iii) heterogeneous multi-agent self-attention, and iv) delay-aware positional encoding, respectively.
表 2：组件消融研究。MSwin、SpAttn、HMSA、DPE 分别代表添加 i）多尺度窗口注意力、ii）分割注意力、iii）异构多智能体自注意力、iv）延迟感知位置编码。 MSwin SpAttn HMSA HMSA HMSA DPE DPE 翻译文本：DPE AP0.5 / AP0.7 0.719 / 0.478 ✓ 0.748 / 0.519 ✓ ✓ 0.786 / 0.548 ✓ ✓ ✓ 0.823 / 0.601 ✓ ✓ ✓ ✓ 0.836 / 0.614

Table 3: Effect of DPE w.r.t. time delay on AP@0.7.
表 3：DPE 相对于时间延迟对 AP@0.7 的影响。

Delay/Model 延迟/模型	w/o DPE 无 DPE	w/ DPE 与 DPE
100 ms 100 毫秒	0.639	0.650
200 ms 200 毫秒	0.558	0.572
300 ms 300 毫秒	0.496	0.514
400 ms 400 毫秒	0.458	0.478

Table 4: Inference time measured on GPU Tesla V100.
表 4：在 GPU Tesla V100 上测量的推理时间。

Model	Time 时间	AP0.7(prf/nsy)
V2X-ViT_S	28ms	0.696 / 0.591
V2X-ViT	57ms	0.712 / 0.614

Infrastructure vs. vehicles. To analyze the effect of infrastructure in the V2X system, we evaluate the performance between V2V, where only vehicles can share information, and V2X, where infrastructure can also transmit messages. We denote the number of agents as the total number of infrastructure and vehicles that can share information. As shown in Fig. 5a, both V2V and V2X have better performance when the number of agents increases. The V2X system has better APs compared with V2V in our collected scenes. We argue this is due to the better sight-of-view and less occlusion of infrastructure sensors, leading to more informative features for reasoning the environmental context.
基础设施与车辆。为了分析基础设施在 V2X 系统中的影响，我们评估了 V2V（只有车辆可以共享信息）和 V2X（基础设施也可以传输消息）之间的性能。我们将代理数量定义为可以共享信息的总基础设施和车辆数量。如图 5a 所示，当代理数量增加时，V2V 和 V2X 的性能都更好。在我们的收集场景中，V2X 系统的 APs 比 V2V 更好。我们认为这是由于基础设施传感器的视野更好和遮挡更少，导致推理环境上下文时具有更多信息性特征。

Effects of transmission size. The size of the transmitted message can significantly affect the transmission delay, thereby affecting the detection performance. Here we study the model’s detection performance with respect to transmitted data size. The data transmission time is calculated by $t_{c}=f_{s}/v$ , where $f_{s}$ denotes the feature size and transmission rate $v$ is set to 27 Mbps [3]. Following [32], we also include another system-wise asynchronous delay that follows a uniform distribution between 0 and 200 ms. See supplementary materials for more details. From Fig. 5c, we can observe: 1) Large bandwidth requirement can eliminate the advantages of cooperative perception quickly, e.g., Early Fusion drops to 28%, indicating the necessity of compression; 2) With the default compression rate (32x), our V2X-ViT outperforms other intermediate fusion methods substantially; 3) V2X-ViT is insensitive to large compression rate. Even under a 128x compression rate, our model can still maintain high performance.
传输大小的影响。传输消息的大小可以显著影响传输延迟，从而影响检测性能。在此，我们研究模型检测性能与传输数据大小的关系。数据传输时间通过 $t_{c}=f_{s}/v$ 计算，其中 $f_{s}$ 表示特征大小，传输速率 $v$ 设置为 27 Mbps [3]。遵循[32]，我们还包括另一个系统级异步延迟，该延迟在 0 到 200 ms 之间呈均匀分布。更多细节请参阅补充材料。从图 5c 中，我们可以观察到：1）大带宽需求可以迅速消除协同感知的优势，例如，早期融合下降到 28%，表明压缩的必要性；2）在默认压缩率（32x）下，我们的 V2X-ViT 在中间融合方法中表现优异；3）V2X-ViT 对大压缩率不敏感。即使在 128x 压缩率下，我们的模型仍能保持高性能。

4.4 Qualitative evaluation
4.4 定性评估

Detection visualization. Fig. 6 shows the detection visualization of OPV2V, V2VNet, DiscoNet, and V2X-ViT in two challenging scenarios under Noisy setting. Our model predicts highly accurate bounding boxes which are well-aligned with ground truths, while other approaches exhibit larger displacements. More importantly, V2X-ViT can identify more dynamic objects (more ground-truth bounding boxes have matches), which proves its capability of effectively fusing all sensing information from nearby agents. Please see Appendix for more results.
检测可视化。图 6 展示了在噪声设置下，OPV2V、V2VNet、DiscoNet 和 V2X-ViT 在两个具有挑战性的场景中的检测可视化。我们的模型预测的边界框非常准确，与真实值对齐良好，而其他方法表现出更大的位移。更重要的是，V2X-ViT 可以识别更多动态对象（更多真实值边界框有匹配），这证明了其有效融合附近代理所有感知信息的能力。请参阅附录以获取更多结果。

Attention map visualization. To understand the importance of infra, we also visualize the learned attention maps in Fig. 7, where brighter color means more attention ego pays. As shown in Fig. 7(a), several objects are largely occluded (circled in blue) from both ego and AV2’s perspectives, whereas infrastructure can still capture rich point clouds. Therefore, V2X-ViT pays much more attention to infra on occluded areas (Fig. 7(d)) than other agents (Figs. 7(b) and 7(c)), demonstrating the critical role of infra on occlusions. Moreover, the attention map for infra is generally brighter than the vehicles, indicating more importance on infra seen by the trained V2X-ViT model.
注意图可视化。为了理解基础设施的重要性，我们还可视化了图 7 中学习的注意图，其中较亮的颜色表示自我赋予更多的注意。如图 7(a)所示，从自我和 AV2 的角度看，有几个物体大部分被遮挡（用蓝色圈出），而基础设施仍然可以捕捉到丰富的点云。因此，V2X-ViT 在遮挡区域（图 7(d)）比其他代理（图 7(b)和 7(c)）更加关注基础设施，证明了基础设施在遮挡中的关键作用。此外，基础设施的注意图通常比车辆更亮，表明训练后的 V2X-ViT 模型对基础设施的重视程度更高。

4.5 Ablation studies 4.5 消融研究

Contribution of major components in V2X-ViT. Now we investigate the effectiveness of individual components in V2X-ViT. Our base model is PointPillars with naive multi-agent self-attention fusion, which treats vehicles and infrastructure equally. We evaluate the impact of each component by progressively adding i) MSwin, ii) split attention, iii) HMSA, and iv) DPE on the Noisy Setting. As Tab. 4 demonstrates, all the modules are beneficial to the performance gains, while our proposed MSwin and HMSA have the most significant contributions by increasing the AP@0.7 4.1% and 6.6%, respectively.
V2X-ViT 主要组件的贡献。现在我们研究 V2X-ViT 中各个组件的有效性。我们的基础模型是 PointPillars，采用朴素的多智能体自注意力融合，将车辆和基础设施同等对待。我们通过逐步添加 i) MSwin，ii) 分割注意力，iii) HMSA 和 iv) DPE 到噪声设置中，评估每个组件的影响。如表 4 所示，所有模块都对性能提升有益，而我们提出的 MSwin 和 HMSA 通过分别增加 AP@0.7 4.1%和 6.6%，贡献最为显著。

MSwin for localization error. To validate the effectiveness of the multi-scale design in MSwin on localization error, we compare three different window configurations: i) using a single small window branch (SW), ii) using a small and a middle window (SMW), and iii) using all three window branches (SMLW). We simulate the localization error by combining different levels of positional and heading noises. From Fig. 5b, we can clearly observe that using a large and small window in parallel remarkably increased its robustness against localization error, which validates the design benefits of MSwin.
MSwin 本地化错误。为了验证 MSwin 在本地化错误上的多尺度设计有效性，我们比较了三种不同的窗口配置：i）使用单个小型窗口分支（SW），ii）使用小型和中型窗口（SMW），和 iii）使用所有三个窗口分支（SMLW）。我们通过结合不同级别的位置和航向噪声来模拟本地化错误。从图 5b 中，我们可以清楚地观察到，并行使用大窗口和小窗口显著提高了其对本地化错误的鲁棒性，这验证了 MSwin 设计的优势。

DPE Performance under delay. Tab. 4 shows that DPE can improve the performance under various time delays. The AP gain increases as delay increases.
延迟下的 DPE 性能。表 4 显示，DPE 可以在各种时间延迟下提高性能。随着延迟的增加，AP 增益增加。

5 Conclusion 5 结论

In this paper, we propose a new vision transformer (V2X-ViT) for V2X perception. Its key components are two novel attention modules i.e. HMSA and MSwin, which can capture heterogeneous inter-agent interactions and multi-scale intra-agent spatial relationship. To evaluate our approach, we construct V2XSet, a new large-scale V2X perception dataset. Extensive experiments show that V2X-ViT can significantly boost cooperative 3D object detection under both perfect and noisy settings. This work focuses on LiDAR-based cooperative 3D vehicle detection, limited to single sensor modality and vehicle detection task. Our future work involves multi-sensor fusion for joint V2X perception and prediction.
本文提出了一种新的视觉 Transformer（V2X-ViT）用于 V2X 感知。其关键组件是两个新颖的注意力模块，即 HMSA 和 MSwin，能够捕捉异构的代理间交互和多尺度的代理内空间关系。为了评估我们的方法，我们构建了 V2XSet，一个新的大规模 V2X 感知数据集。大量实验表明，V2X-ViT 可以在完美和噪声环境下显著提升协同 3D 目标检测。本研究专注于基于 LiDAR 的协同 3D 车辆检测，限于单一传感器模态和车辆检测任务。我们的未来工作涉及多传感器融合以实现联合 V2X 感知和预测。

Broader impacts and limitations. The proposed model can be deployed to improve the performance and robustness of autonomous driving systems by incorporating V2X communication using a novel vision Transformer. However, for models trained on simulated datasets, there are known issues on data bias and generalization ability to real-world scenarios. Furthermore, although the design choice of our communication approach (i.e., project LiDAR to others at the beginning) has an advantage of accuracy (see supplementary for details), its scalability is limited. In addition, new concerns around privacy and adversarial robustness may arise during data capturing and sharing, which has not received much attention. This work facilitates future research on fairness, privacy, and robustness in visual learning systems for autonomous vehicles.
更广泛的影响和局限性。所提出的模型可以通过采用新颖的视觉 Transformer 结合 V2X 通信来部署，以提升自动驾驶系统的性能和鲁棒性。然而，对于在模拟数据集上训练的模型，已知存在数据偏差和泛化到现实场景的能力问题。此外，尽管我们通信方法的设计选择（即，最初将激光雷达投影到其他对象上）在准确性方面具有优势（详见补充材料），但其可扩展性有限。另外，在数据捕获和共享过程中可能会出现关于隐私和对抗鲁棒性的新问题，这些问题尚未得到足够的关注。这项工作促进了未来关于自动驾驶车辆视觉学习系统中的公平性、隐私和鲁棒性的研究。

Acknowledgement. This material is supported in part by the Federal Highway Administration Exploratory Advanced Research (EAR) Program, and by the US National Science Foundation through Grants CMMI # 1901998. We thank Xiaoyu Dong for her insightful discussions.
致谢。本材料部分得到联邦公路管理局探索性高级研究（EAR）计划的支持，以及美国国家科学基金会通过 CMMI # 1901998 项目资助。我们感谢董晓宇的深刻讨论。

Appendix 附录

In this supplementary material, we first discuss the design choice and scalability issue of our method (Appendix 0.A). Then, we provide more model details and analysis (Appendix 0.B) in regards to the proposed architecture, including the mathematical details of the spatial-temporal correction module (Sec. 0.B.1), details of the proposed MSwin (Sec. 0.B.2), and the overall architectural specifications (Sec. 0.B.3). Afterwords, additional information and visualizations of the proposed V2XSet dataset are shown in Appendix 0.C. In the end, we present more quantitative experiments, qualitative detection results, attention map visualizations, and details about the effects of the transmission size experiment in Appendix 0.D.
在此补充材料中，我们首先讨论了我们方法的设计选择和可扩展性问题（附录 0.A）。然后，我们提供了更多关于所提出架构的模型细节和分析（附录 0.B），包括空间-时间校正模块的数学细节（0.B.1 节）、所提出 MSwin 的细节（0.B.2 节）以及整体架构规范（0.B.3 节）。随后，在附录 0.C 中展示了所提出 V2XSet 数据集的附加信息和可视化。最后，在附录 0.D 中，我们展示了更多定量实验、定性检测结果、注意力图可视化以及关于传输大小实验影响细节。

Appendix 0.A Discussion of design choice
附录 0.A 设计选择讨论

Scalability of ego vehicles. Our approach can be scalable in two ways: 1) De-centralized: the ablation studies conducted in this paper (Fig. 5a) and OPV2V [50] indicate that, when the number of collaborators is larger than 4, the performance gain becomes marginal while the computation still increases linearly. In practice, each agent only needs to share features with a limited number of agents. For example, Who2Com [29] studies which agent to request/transmit data, largely reducing computation. Moreover, the computation of selected PointPillar backbone is efficient, e.g., around 4 ms for 4 agents with full parallelization and 16 ms in sequence computing on RTX3090. 2) Centralized: Within a certain communication range, only one ego agent is selected to aggregate all the features from neighbors to predict bounding boxes and share the results with other agents. This solution requires only one computation node for a group of agents, thus being scalable.
自动驾驶汽车的扩展性。我们的方法可以从两个方面进行扩展：1）去中心化：本文中进行的消融研究（图 5a）和 OPV2V [50] 表明，当协作者的数量超过 4 个时，性能提升变得微乎其微，而计算量仍然呈线性增长。在实践中，每个智能体只需要与有限数量的智能体共享特征。例如，Who2Com [29] 研究了请求/传输数据的智能体，大大减少了计算量。此外，所选 PointPillar 骨干的计算效率高，例如，在 RTX3090 上，4 个智能体全并行化时约为 4 毫秒，序列计算时为 16 毫秒。2）集中化：在一定的通信范围内，只选择一个自动驾驶智能体来聚合来自邻居的所有特征以预测边界框，并将结果与其他智能体共享。这种解决方案只需要一个计算节点来处理一组智能体，因此是可扩展的。

Table T0: Comparison between our design choice and broadcasting approach.
表 T0：我们的设计选择与广播方法比较。

	DiscoNet (broad. / ours) DiscoNet（广泛/我们的）	V2X-ViT (broad. / ours) V2X-ViT（广泛/我们的）
AP@0.7 (perfect) AP@0.7（完美）	0.610 / 0.695	0.623 / 0.712

Design choices for communication. Compared to the broadcasting approach (i.e., compute the features in each cav’s own space and transform the feature maps directly on the ego side), our approach has more advantages in terms of detection accuracy. Most LiDAR detection methods often largely crop the LiDAR range based on the evaluation range to reduce computation. As the figure below shows, the CAVs crop the LiDAR data based on their own evaluation range in the broadcasting method, which leads to redundant data. Our approach, on the contrary, always does cropping based on the ego’s evaluation range, thus guaranteeing more effective feature transmission. We further validate this by comparing our framework with the broadcasting approach. The Tab. T0 below shows that our design outperforms broadcasting by 8.5% and 8.9% for DiscoNet and V2X-ViT.
设计选择：通信。与广播方法（即在每个腔体自己的空间中计算特征，并在自我一侧直接转换特征图）相比，我们的方法在检测精度方面具有更多优势。大多数激光雷达检测方法通常基于评估范围大量裁剪激光雷达范围以减少计算。如图所示，在广播方法中，CAVs 根据其自身的评估范围裁剪激光雷达数据，这导致冗余数据。我们的方法相反，始终基于自我评估范围进行裁剪，从而保证更有效的特征传输。我们进一步通过与广播方法比较来验证这一点。下表 T0 显示，我们的设计在 DiscoNet 和 V2X-ViT 方面比广播方法分别提高了 8.5%和 8.9%。

Appendix 0.B Model Details and Analysis
附录 0.B 模型细节与分析

0.B.1 Spatial-Temporal Correction Module
0.B.1 空间-时间校正模块

During the early stage of collaboration, when each connected agent $i$ receives ego vehicle’s pose at time $t_{i}$ , the observed point clouds of agent $i$ will be projected to ego vehicle’s pose $x_{e}^{t_{i}}$ at time $t_{i}$ before feature extraction. However, due to the time delay, the ego vehicle observes the data at a different time $t_{e}$ . Thus, the received features from connected agents are centered around a delayed ego vehicle’s pose (i.e., $x_{e}^{t_{i}}$ ) while the ego vehicle’s features are centered around current pose (i.e., $x_{e}^{t_{e}}$ ), leading to a delay-induced spatial misalignment. To correct this misalignment between the received features and ego-vehicle’s features, a global transformation ${\xi}_{x_{e}^{t_{i}},x_{e}^{t_{e}}}\in\mathfrak{se}(3)$ from ego vehicle’s past pose $x_{e}^{t_{i}}$ to its current pose $x_{e}^{t_{e}}$ is required. To this end, we employ a differential 2D transformation ${\Gamma}_{\xi}\left(\cdot\right)$ to warp the intermediate features spatially [19]. To be more specific, we will transform features’ positions by using affine transformation:
在协作的早期阶段，当每个连接的智能体 $i$ 在时间 $t_{i}$ 接收到自我车辆的姿态时，智能体 $i$ 观察到的点云将在特征提取之前投影到时间 $t_{i}$ 的自我车辆姿态 $x_{e}^{t_{i}}$ 。然而，由于时间延迟，自我车辆在时间 $t_{e}$ 观察到数据。因此，从连接的智能体接收到的特征围绕延迟的自我车辆姿态（即 $x_{e}^{t_{i}}$ ）进行中心化，而自我车辆的特征围绕当前姿态（即 $x_{e}^{t_{e}}$ ）进行中心化，导致由延迟引起的空间错位。为了纠正接收到的特征与自我车辆特征之间的这种错位，需要从自我车辆的过去姿态 $x_{e}^{t_{i}}$ 到当前姿态 $x_{e}^{t_{e}}$ 的全局变换 ${\xi}_{x_{e}^{t_{i}},x_{e}^{t_{e}}}\in\mathfrak{se}(3)$ 。为此，我们采用微分二维变换 ${\Gamma}_{\xi}\left(\cdot\right)$ 在空间上扭曲中间特征[19]。更具体地说，我们将通过仿射变换来变换特征的位置：

\begin{bmatrix}x_{s}\\ y_{s}\end{bmatrix}={\Gamma}_{\xi}(\begin{bmatrix}x_{t}\\ y_{t}\\ 1\end{bmatrix})\\ =\begin{bmatrix}R_{11}&R_{12}&\delta x\\ R_{21}&R_{22}&\delta y\\ \end{bmatrix}\begin{bmatrix}x_{t}\\ y_{t}\\ 1\end{bmatrix}

(11)

where $(x_{s},y_{s})$ and $(x_{t},y_{t})$ are the source and target coordinates. As the calculated coordinates may not be integers, we use bilinear interpolation to sample input feature vectors. An ROI mask is also calculated to prevent the network from paying attention to the padded zeros caused by the spatial warp. This mask will be used in heterogeneous multi-agent self-attention to mask padded values’ attention weights as zeros.
$(x_{s},y_{s})$ 和 $(x_{t},y_{t})$ 分别是源坐标和目标坐标。由于计算出的坐标可能不是整数，我们使用双线性插值来采样输入特征向量。同时计算一个 ROI 掩码，以防止网络关注由空间变换引起的填充零。此掩码将在异构多智能体自注意力中用于将填充值的注意力权重掩码为零。

0.B.2 Multi-Scale Window Attention (MSwin)
0.B.2 多尺度窗口注意力（MSwin）

Detailed formulation. Let $\mathbf{H}\in\mathbb{R}^{H\times W\times C}$ be an input feature of a single agent. Let $h_{j}$ be the number of attention heads used in branch $j$ (i.e. head dimension $d_{h_{j}}=C/h_{j}$ ), applying self-attention within each non-overlapping window $P_{j}\times P_{j}$ for branch $j$ out of $k$ branches on feature $\mathbf{H}$ can be formulated as:
详细公式。令 $\mathbf{H}\in\mathbb{R}^{H\times W\times C}$ 为单个代理的输入特征。令 $h_{j}$ 为分支 $j$ （即头维度 $d_{h_{j}}=C/h_{j}$ ）中使用的注意力头数量，在分支 $j$ （在 $k$ 个分支中的分支 $j$ ）的每个非重叠窗口 $P_{j}\times P_{j}$ 内应用自注意力，可以表示为：

$\displaystyle\mathbf{H}$	$\displaystyle=[\mathbf{H}^{1},\mathbf{H}^{2},...,\mathbf{H}^{HW/(P_{j})^{2}}],\$	$\displaystyle\text{for branch}\ j$	(12)
$\displaystyle\hat{\mathbf{H}}^{i}_{m}$	$\displaystyle=\mathsf{Attention}(\mathbf{H}^{i}\mathbf{W}^{Q}_{m},\mathbf{H}^{i}\mathbf{W}^{K}_{m},\mathbf{H}^{i}\mathbf{W}^{V}_{m}),\$	$\displaystyle i=1,...,HW/(P_{j})^{2}$	(13)
$\displaystyle\mathbf{Y}_{m}$	$\displaystyle=[\hat{\mathbf{H}}^{1}_{m},\hat{\mathbf{H}}^{2}_{m},...,\hat{\mathbf{H}}^{HW/(P_{j})^{2}}_{m}],$	$\displaystyle m=1,...,h_{j}$	(14)
$\displaystyle\mathbf{Y}^{j}$	$\displaystyle=[\mathbf{Y}_{1},\mathbf{Y}_{2},...,\mathbf{Y}_{h_{j}}],$		(15)

where $\hat{\mathbf{H}}^{i}_{m}\in\mathbb{R}^{P_{j}^{2}\times d_{h_{j}}}$ and $\mathbf{W}^{Q}_{m},\mathbf{W}^{K}_{m},\mathbf{W}^{V}_{m}$ represent the query, key, and value projection matrices. $\mathbf{Y}_{m}$ is the output of the $m$ -th head for branch $j$ . Afterwards, the outputs for all heads $1,2,...,h_{j}$ are concatenated to obtain the final output $\mathbf{Y}^{j}$ . Here the $\mathsf{Attention}$ operation denotes the relative self-attention, similar to the usage in Swin [30]:
$\hat{\mathbf{H}}^{i}_{m}\in\mathbb{R}^{P_{j}^{2}\times d_{h_{j}}}$ 和 $\mathbf{W}^{Q}_{m},\mathbf{W}^{K}_{m},\mathbf{W}^{V}_{m}$ 分别代表查询、键和值投影矩阵。 $\mathbf{Y}_{m}$ 是第 $m$ 个头在分支 $j$ 的输出。之后，将所有头的输出 $1,2,...,h_{j}$ 连接起来以获得最终的输出 $\mathbf{Y}^{j}$ 。这里 $\mathsf{Attention}$ 操作表示相对自注意力，类似于 Swin [30] 中的用法：

\mathsf{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathsf{softmax}((\frac{\mathbf{Q}\mathbf{K}^{T}}{\sqrt{d}}+\mathbf{B})\mathbf{V})\vspace{-2mm}

(16)

where $\mathbf{Q},\mathbf{K},\mathbf{V}\in\mathbb{R}^{P_{j}^{2}\times d}$ denote the query, key, and value matrices. $d$ is the dimension of query/key, while $P_{j}^{2}$ denotes the window size for branch $j$ . Following [30, 17], we also consider an additional relative positional encoding $\mathbf{B}$ that acts as a bias term added to the attention map. As the relative position along each axis lies in the range $[-P_{j}+1,P_{j}-1]$ , we take $\mathbf{B}$ from a parameterized matrix $\hat{\mathbf{B}}\in\mathbb{R}^{(2P_{j}-1)\times(2P_{j}-1)}$ . To adaptively fuse features from all the $k$ branches, we adopt the split-attention module [56] for the parallel feature aggregation:
$\mathbf{Q},\mathbf{K},\mathbf{V}\in\mathbb{R}^{P_{j}^{2}\times d}$ 表示查询、键和值矩阵。 $d$ 是查询/键的维度，而 $P_{j}^{2}$ 表示分支 $j$ 的窗口大小。遵循[30, 17]，我们还考虑了一个额外的相对位置编码 $\mathbf{B}$ ，它作为添加到注意力图中的偏置项。由于每个轴上的相对位置位于 $[-P_{j}+1,P_{j}-1]$ 范围内，我们从参数化矩阵 $\hat{\mathbf{B}}\in\mathbb{R}^{(2P_{j}-1)\times(2P_{j}-1)}$ 中取 $\mathbf{B}$ 。为了自适应地融合所有 $k$ 分支的特征，我们采用 split-attention 模块[56]进行并行特征聚合：

\mathbf{Y}=\mathsf{SplitAttention}(\mathbf{Y}^{1},\mathbf{Y}^{2},...\mathbf{Y}^{k}),

(17)

Time complexity. As mentioned in the paper, we have $k$ parallel branches. Each branch has $P_{j}\times P_{j}$ window size and $h_{j}$ heads where $P_{j}=jP$ and $h_{j}=h/j$ . After partitioning, the input tensor $\mathbf{H}\in\mathbb{R}^{H\times W\times C}$ is split into $h_{j}$ features with shape $(\frac{H}{P_{j}}\times\frac{W}{P_{j}},P_{j}\times P_{j},C/h_{j})$ . Following [30], we focus on the computation for vector-matrix multiplication and attention weight calculation. Thus, the complexity of MSwin can be written as:
时间复杂度。如论文中所述，我们有 $k$ 个并行分支。每个分支有 $P_{j}\times P_{j}$ 个窗口大小和 $h_{j}$ 个头，其中 $P_{j}=jP$ 和 $h_{j}=h/j$ 。在分区后，输入张量 $\mathbf{H}\in\mathbb{R}^{H\times W\times C}$ 被分割成 $h_{j}$ 个具有形状 $(\frac{H}{P_{j}}\times\frac{W}{P_{j}},P_{j}\times P_{j},C/h_{j})$ 的特征。遵循 [30]，我们关注向量-矩阵乘法和注意力权重计算的计算。因此，MSwin 的复杂度可以表示为：

	$\displaystyle\mathcal{O}(\sum_{j=1}^{k}\frac{HW}{P_{j}^{2}}\times\frac{C}{h_{j}}\times(P_{j}\times P_{j})^{2}\times h_{j}+4\frac{HW}{P_{j}^{2}}\times P_{j}^{2}\times(\frac{C}{h_{j}})^{2}\times h_{j})$
	$\displaystyle=\mathcal{O}(\sum_{j=1}^{k}P_{j}^{2}HWC+\frac{4HWC^{2}}{h_{j}})=\mathcal{O}(\sum_{j=1}^{k}j^{2}P^{2}HWC+\frac{4HWC^{2}j}{h})$
	$\displaystyle=\mathcal{O}(\frac{1}{3}k^{3}P^{2}HWC+\frac{2HWC^{2}k^{2}}{h})$		(18)

where the first term corresponds to attention weight calculation, the second term is associated with vector-matrix multiplication, and the last equality is due to the fact that $\sum_{j=1}^{k}j^{2}=\mathcal{O}(\frac{k^{3}}{3})$ and $\sum_{j=1}^{k}j=\mathcal{O}(\frac{k^{2}}{2})$ . Thus the overall complexity is
第一项对应注意力权重计算，第二项与向量矩阵乘法相关，最后一个等式是因为 $\sum_{j=1}^{k}j^{2}=\mathcal{O}(\frac{k^{3}}{3})$ 和 $\sum_{j=1}^{k}j=\mathcal{O}(\frac{k^{2}}{2})$ 。因此，整体复杂度为

\mathrm{FLOPs}(\mathsf{MSwin})=\mathcal{O}((\frac{k^{3}P^{2}C}{3}+\frac{2k^{2}C^{2}}{h})HW)\sim\mathcal{O}(HW),

(19)

which is linear with respect to the image size. The comparison of time complexity of different types of transformers is shown in Tab. T1 where $N$ denotes the number of input pixels, or (here $N=HW$ ). Our MSwin obtains multi-scale spatial interactions with a linear complexity with respect to $N$ , while other long-range attention mechanisms like ViT [9], Axial [44], and CSwin [8] requires more than linear complexity, which are not scalable to high-resolution dense prediction tasks such as object detection and segmentation.
与图像大小呈线性关系。不同类型变换器的时间复杂度比较显示在表 T1 中，其中 $N$ 表示输入像素数，或（此处 $N=HW$ ）。我们的 MSwin 在 $N$ 方面实现了多尺度空间交互的线性复杂度，而其他如 ViT[9]、Axial[44]和 CSwin[8]等长距离注意力机制则需要超过线性复杂度，这些机制不适用于高分辨率密集预测任务，如目标检测和分割。

Table T1: Computational complexity comparisons of our proposed MSwin attention with (a) full attention in ViT [9], (b) Axial [44], (c) Swin [30], (d) CSwin [8].
表 T1：我们提出的 MSwin 注意力与（a）ViT 中的全注意力[9]、（b）Axial[44]、（c）Swin[30]、（d）CSwin[8]的计算复杂度比较。

Attention Models 注意模型	Complexity 复杂性
ViT [9] ViT [9]	$\mathcal{O}(4HWC^{2}+2(HW)^{2}C)\sim\mathcal{O}(N^{2})$
Axial [44] 轴向[44]	$\mathcal{O}(HWC(4C+H+W))\sim\mathcal{O}(N\sqrt{N})$
Swin [30] 斯温[30]	$\mathcal{O}(4HWC^{2}+2P^{2}HWC)\sim\mathcal{O}(N)$
CSwin [8] CSwin [8]	$\mathcal{O}(HWC(4C+sH+sW))\sim\mathcal{O}(N\sqrt{N})$
MSwin (ours) MSwin（ ours）	$\mathcal{O}(\frac{1}{3}k^{3}P^{2}HWC+\frac{2HWC^{2}k^{2}}{h})\sim\mathcal{O}(N)$

Effective receptive field. The comparisons of receptive fields between different transformers are shown in Fig. 8. Swin [30] enlarge the receptive fields by using shifted window but it requires sequential blocks to accumulate. Axial Transformer [44] conducts attention on both row-wise and column-wise directions. Similarly, CSwin [8] proposes to perform attention on horizontal and vertical stripes with asymmetrical receptive range in different directions, but requires polynomial time complexity– $\mathcal{O}(N^{1.5})$ . In contrast, our proposed MSwin can aggregate features from multi-scale branches to increase fields in parallel, which has more symmetrical receptive fields and linear complexity with respect to $N$ .
有效感受野。不同 Transformer 之间感受野的比较如图 8 所示。Swin [30] 通过使用移动窗口来扩大感受野，但它需要顺序块来累积。Axial Transformer [44] 在行和列方向上执行注意力。同样，CSwin [8] 提出在不同方向上对水平和垂直条纹进行注意力，但需要多项式时间复杂度– $\mathcal{O}(N^{1.5})$ 。相比之下，我们提出的 MSwin 可以通过多尺度分支聚合特征来并行增加感受野，它具有更对称的感受野，并且相对于 $N$ 具有线性复杂度。

0.B.3 Architectural Configurations
0.B.3 建筑配置

Given all these definitions, the entire V2X-ViT model can be formulated as:
考虑到所有这些定义，整个 V2X-ViT 模型可以表示为：

$\displaystyle{z}_{i}$	$\displaystyle=\text{PointPillar}({x}_{i}),\quad\quad\quad\quad\quad{x}_{i}\in\mathbb{R}^{P\times 4},{z_{i}}\in\mathbb{R}^{H\times W\times C}\$	$\displaystyle\text{for agent}\ i$	(20)
$\displaystyle\mathbf{z}_{0}$	$\displaystyle=\text{STCM}([z_{0},...,z_{M}])+\text{DPE}([\Delta t_{0},...,\Delta t_{M}]),$	for ego AV 为自我 AV	(21)
$\displaystyle\mathbf{z}^{\prime}_{\ell}$	$\displaystyle=\mathbf{z}_{\ell-1}+\text{MSwin}(\text{HSMA}(\text{LN}(\mathbf{z}_{0}))),\quad\mathbf{z}_{0}\in\mathbb{R}^{M\times H\times W\times C}$	$\displaystyle\ell=1,...,L$	(22)
$\displaystyle\mathbf{z}_{\ell}$	$\displaystyle=\mathbf{z}^{\prime}_{\ell}+\text{MLP}(\text{LN}(\mathbf{z}^{\prime}_{\ell})),$	$\displaystyle\ell=1,...,L$	(23)
$\displaystyle\mathbf{y}$	$\displaystyle=\text{Head}(\mathbf{z}_{L}),$		(24)

where the input ${x}_{i}$ denotes the raw LiDAR point clouds captured on each agent, which are fed into the PointPillar Encoder [22], yielding visually informative 2D features ${z}_{i}$ for each agent $i$ . These tensors are then compressed, shared, decompressed, and further fed into the spatial-temporal correction module (STCM) to spatially warp the features. Then, we add the delay-aware positional encoded (DPE) features conditioned on each agent’s time delay $\Delta t_{i}$ to the output of STCM. Afterwords, the gathered features from $M$ agents are processed using our proposed V2X-ViT, which consists of $L$ layers of V2X-ViT blocks. Each V2X-ViT block contains a HSMA, a MSwin, and a standard MLP network [9]. Following [9, 30], we use the Layer Normalization [4] before feeding into the Transformer/MLP module. We show the detailed specifications of V2X-ViT architecture in Table T2.
输入 ${x}_{i}$ 表示每个智能体捕获的原始激光雷达点云，这些点云被输入到 PointPillar 编码器[22]，为每个智能体生成视觉信息丰富的 2D 特征 ${z}_{i}$ 。然后，这些张量被压缩、共享、解压缩，并进一步输入到时空校正模块（STCM）以空间扭曲特征。然后，我们将基于每个智能体的时间延迟 $\Delta t_{i}$ 的条件延迟感知位置编码（DPE）特征添加到 STCM 的输出。之后，我们从 $M$ 智能体收集的特征使用我们提出的 V2X-ViT 进行处理，它由 $L$ 层 V2X-ViT 块组成。每个 V2X-ViT 块包含一个 HSMA、一个 MSwin 和一个标准 MLP 网络[9]。遵循[9, 30]，我们在输入到 Transformer/MLP 模块之前使用层归一化[4]。我们在表 T2 中展示了 V2X-ViT 架构的详细规格。

Table T2: Detailed architectural specifications for V2X-ViT.
表 T2：V2X-ViT 的详细建筑规范

Output size 输出尺寸

V2X-ViT framework V2X-ViT 框架

PointPillar 点柱 Encoder 编码器

M\times 352\times 96\times 256

\left[\begin{array}[]{c}\text{Voxel samp. reso. 0.4m, Scatter, 64}\end{array}\right]

\left[\begin{array}[]{c}\text{Conv3x3, 64, stride 2, BN, ReLU}\end{array}\right]\times 3

\left[\begin{array}[]{c}\text{Conv3x3, 128, stride 2, BN, ReLU}\end{array}\right]\times 5

\left[\begin{array}[]{c}\text{Conv3x3, 256, stride 2, BN, ReLU}\end{array}\right]\times 8

\left[\begin{array}[]{c}\text{ConvT3x3, 128, stride 1, BN, ReLU}\end{array}\right]\times 1

\left[\begin{array}[]{c}\text{ConvT3x3, 128, stride 2, BN, ReLU}\end{array}\right]\times 1

\left[\begin{array}[]{c}\text{ConvT3x3, 128, stride 4, BN, ReLU}\end{array}\right]\times 1

M\times 176\times 48\times 256

\left[\begin{array}[]{c}\text{Concat3, 384}\end{array}\right]

\left[\begin{array}[]{c}\text{Conv3x3, 256, stride 2, ReLU}\\ \text{Conv3x3, 256, stride 1, ReLU}\end{array}\right]\times 1

Delay-aware 感知延迟

Pos. Encoding 位置编码

M\times 176\times 48\times 256

\left[\begin{array}[]{c}\text{sin-cos pos. encoding}\end{array}\right]

\left[\begin{array}[]{c}\text{Linear, 256}\end{array}\right]\times 1

Transformer 变换器

Backbone 脊柱

M\times 176\times 48\times 256

\left[\begin{array}[]{c}\text{HSMA, dim 256, head 8}\\ \text{MSwin, dim 256, }\\ \text{head \{16, 8, 4\},}\\ \text{ws. }\{4\times 4,8\times 8,16\times 16\}\\ \text{MLP, dim 256}\end{array}\right]\times 3

Detection 检测

Head 头部

176\times 48\times 16

Cls. head:

\left[\begin{array}[]{c}\text{Conv1x1, 2, stride 1}\end{array}\right]

类标题：

\left[\begin{array}[]{c}\text{Conv1x1, 2, stride 1}\end{array}\right]

Regr. head:

\left[\begin{array}[]{c}\text{Conv1x1, 14, stride 1}\end{array}\right]

回归头：

\left[\begin{array}[]{c}\text{Conv1x1, 14, stride 1}\end{array}\right]

Appendix 0.C V2XSet Dataset
附录 0.CV2XSet 数据集

Statistics We gather 55 representative scenes covering 5 different roadway types and 8 towns in CARLA. Each scene is limited to 25 seconds, and in each scene, there are at least 2 and at most 7 intelligent agents that can communicate with each other. Each agent is equipped with 32-channel LiDAR and has 120 meters data range. We mount sensors on top of each AV while we only deploy infrastructure sensors in the intersection, mid-block, and entrance ramp at the height of 14 feet since these scenarios are typically more congested and challenging [16]. We record LiDAR point clouds at 10 Hz and save the corresponding positional data and timestamp.
統計我們在 CARLA 中收集了 55 個具有代表性的场景，涵盖了 5 種不同的道路类型和 8 个城镇。每个场景限制为 25 秒，每个场景中至少有 2 个最多 7 个智能代理可以相互通信。每个代理配备 32 通道激光雷达，具有 120 米的数据范围。我们在每辆自动驾驶车辆顶部安装传感器，而我们在交叉口、中间路段和入口坡道仅部署高度为 14 英尺的基础设施传感器，因为这些场景通常更为拥挤和具有挑战性[16]。我们以 10 赫兹的频率记录激光雷达点云，并保存相应的位置数据和时间戳。

Infrastructure deployment. The infrastructure sensors are installed on the traffic light poles or steet light poles at the intersection, mid-block, and entrance ramp at the height of 14 feet. For road type like rural curvy road, there is no infrastructure installed and only V2V collaboration exists.
基础设施部署。基础设施传感器安装在交叉口的交通信号杆或街灯杆上，以及中间路段和入口坡道的 14 英尺高度处。对于乡村弯曲道路等道路类型，没有安装基础设施，仅存在车与车之间的协作。

Dataset visualization. As Fig. 9 displays, there are 5 different roadway types in V2XSet dataset (i.e., straight segment, curvy segment, midblock, entrance ramp, and intersection), covering the most common driving scenarios in real life. We collect more intersection scenes than other types as it is usually more challenging due to the high traffic volume and severe occlusions. Data samples from different roadway types can be found in Fig. 10. From the figure, we can observe that the infrastructure sensors at the entrance ramp and intersection have different measurement patterns especially near its installation position compared with vehicle sensors. This is caused by the different installation heights between vehicle and infrastructure sensors. Such observation again shows the necessity of capturing the heterogeneity nature of V2X system.
数据集可视化。如图 9 所示，V2XSet 数据集中有 5 种不同的道路类型（即直线段、弯曲段、中间段、入口坡道和交叉口），涵盖了现实生活中最常见的驾驶场景。由于交通流量大和严重遮挡，我们收集的交叉口场景比其他类型更多。不同道路类型的数据样本可以在图 10 中找到。从图中可以看出，与车辆传感器相比，入口坡道和交叉口的基础设施传感器在安装位置附近具有不同的测量模式。这是由于车辆传感器和基础设施传感器安装高度的不同造成的。这样的观察再次表明了捕捉 V2X 系统异质性的必要性。

Appendix 0.D More Experimental Results
附录 0.D 更多实验结果

0.D.1 Performance for identifying dynamic objects
0.D.1 动态对象识别性能

We group the test set based on object speeds v (km/h) and compare AP@IoU=0.7 under the noisy setting for all intermediate fusion models. As shown in Tab. T3, V2X-ViT outperforms all other intermediate fusion methods under various speed range. It is noticeable that the objects with higher speed range generally have lower AP scores as the same time delay can produce more positional mis-alignments for the high-speed vehicles.
我们将测试集根据物体速度 v（千米/小时）进行分组，并在噪声环境下比较所有中间融合模型在 AP@IoU=0.7 下的表现。如表 T3 所示，V2X-ViT 在各种速度范围内均优于其他中间融合方法。值得注意的是，具有更高速度范围的物体通常 AP 得分较低，因为相同的时延对于高速车辆会产生更多的位置错位。

Table T3: Perception performance for objects with different speed (km/h), measured in AP@0.7 under noisy setting.
表 T3：不同速度（千米/小时）物体的感知性能，在噪声环境下以 AP@0.7 进行测量。

Model	$v<20$	$20\leq v\leq 40$	$v>40$
F-Cooper F-库珀	0.539	0.487	0.354
OPV2V	0.552	0.498	0.346
V2VNet	0.598	0.518	0.406
DiscoNet 迪斯科网	0.639	0.580	0.420
V2X-ViT	0.693	0.634	0.488

0.D.2 Performance for different road types
0.D.2 不同路面类型的性能

We also group the test scenes based on their road types and calculate the AP@IoU=0.7 scores under the noisy setting. As shown in Tab. T4, V2X-ViT ranks the first for all 5 road categories, demonstrating its detection robustness on different scenes.
我们还将测试场景根据其道路类型进行分组，并在噪声环境下计算 AP@IoU=0.7 的分数。如表 T4 所示，V2X-ViT 在所有 5 个道路类别中排名第一，展示了其在不同场景下的检测鲁棒性。

Table T4: Perception performance for different road types, measured in AP@0.7 under noisy setting.
表 T4：不同道路类型的感知性能，在噪声环境下以 AP@0.7 进行测量。

Model	Straight 笔直	Curvy 曲线	Intersection 交集	Midblock 中段	Entrance 入口
F-Cooper F-库珀	0.483	0.558	0.458	0.431	0.375
OPV2V	0.478	0.604	0.492	0.460	0.380
V2VNet	0.496	0.556	0.517	0.489	0.360
DiscoNet 迪斯科网	0.519	0.594	0.572	0.472	0.440
V2X-ViT	0.645	0.686	0.615	0.530	0.487

0.D.3 Qualitative results 0.D.3 定性结果

Figs. 11, 12 and 13 demonstrate more detection visualizations of V2VNet[45], OPV2V [50], F-Cooper [5], DiscoNet [25], and our V2X-ViT in different scenarios under Noisy Setting. V2X-ViT yields more robust performance in general with fewer regression displacements and fewer undetected objects. When the scenario is challenging with high-density traffic flow and more occlusions (e.g., Scene 7 in Fig. 13 ), our model can still identify most of the objects accurately.
图 11、12 和 13 展示了 V2VNet[45]、OPV2V[50]、F-Cooper[5]、DiscoNet[25]以及我们提出的 V2X-ViT 在噪声设置下不同场景下的更多检测可视化。V2X-ViT 在总体上表现出更鲁棒的性能，具有更少的回归位移和更少的未检测物体。当场景具有高密度交通流和更多遮挡（例如图 13 中的场景 7）时，我们的模型仍然可以准确识别大多数物体。

0.D.4 Attention visualization
0.D.4 注意力可视化

Fig. 14 shows more attention map visualizations of V2X-ViT under noisy setting. The LiDAR points of ego vehicle, the other connected autonomous vehicle (cav), and infrastructure are plotted in blue, green, and red respectively. The brighter color in the attention map means more attention ego vehicle pays. Generally, the color of infrastructure attention maps is brighter than others, especially for the occluded regions of other agents, indicating the more importance ego vehicle assigns to the infrastructure. This observation agrees with our intuition that the sensor observation of infrastructure has fewer occlusions, which leads to better feature representations.
图 14 显示了在噪声环境下 V2X-ViT 的更多注意力图可视化。自我车辆的激光雷达点、其他连接的自动驾驶车辆（cav）和基础设施分别用蓝色、绿色和红色表示。注意力图中颜色越亮表示自我车辆分配的注意力越多。一般来说，基础设施注意力图的颜色比其他颜色更亮，特别是对于其他代理的遮挡区域，这表明自我车辆赋予基础设施的重要性更高。这一观察结果与我们的直觉相符，即基础设施的传感器观测遮挡较少，这导致了更好的特征表示。

0.D.5 Explanation on effects of transmission size
0.D.5 传输尺寸影响说明

Here we provide more explanations of the data transmission size experiment in our paper. Different fusion strategies usually have distinct bandwidth requirements e.g., early fusion requires large bandwidth to transmit raw data, whereas late fusion only delivers minimal size of data. This communication volume will significantly influence the time delay, thus we need to simulate a more realistic time delay setting to study the effects of transmission size.
这里我们对我们论文中的数据传输大小实验进行更多解释。不同的融合策略通常具有不同的带宽需求，例如，早期融合需要大带宽来传输原始数据，而晚期融合仅提供最小数据量。这种通信量将显著影响延迟时间，因此我们需要模拟更真实的时间延迟设置来研究传输大小的影响。

Following [32], we decompose the total time delay into two parts: i) the data transmission time $t_{c}$ during broadcasting, ii) the idle time $t_{i}$ caused by the lack of synchronization between the perception system and communication system. The total time delay is calculated as
根据[32]，我们将总延迟分解为两部分：i) 广播期间的数据传输时间 $t_{c}$ ，ii) 由于感知系统和通信系统之间缺乏同步而引起的空闲时间 $t_{i}$ 。总延迟计算如下

t_{total}=t_{c}+t_{i}

(25)

As mentioned in the paper, the data transmission time has
如论文中所述，数据传输时间已有

t_{c}=f_{s}/v

(26)

where $f_{s}$ is the data size and $v$ is the transmission rate. Idle time $t_{i}$ can be further decoupled into the idle time on the sender side and the time on the receiver side i.e., $t_{i}=t_{i,1}+t_{i,2}$ . For $t_{i,1}$ , the worst case in terms of delay happens when the communication system just misses a perception cycle and needs to wait for the next round. Similarly, for $t_{i,2}$ , the worst case occurs when new data is received just after a new cycle of the perception system has started. Assume both perception system and communication system have the same rate of 10 $Hz$ , then $0\leavevmode\nobreak\ ms<t_{i}<200\leavevmode\nobreak\ ms$ . We employ a uniform distribution $\mathcal{U}\left(0,200\right)$ to model this uncertainty. In summary, we use the following equation to mimic the real-world time delay.
$f_{s}$ 为数据大小， $v$ 为传输速率。空闲时间 $t_{i}$ 可以进一步分解为发送端空闲时间和接收端时间，即 $t_{i}=t_{i,1}+t_{i,2}$ 。对于 $t_{i,1}$ ，在延迟方面最坏的情况发生在通信系统刚好错过感知周期，需要等待下一轮。同样，对于 $t_{i,2}$ ，最坏的情况发生在新数据在感知系统新周期开始后接收。假设感知系统和通信系统具有相同的速率 10 $Hz$ ，则 $0\leavevmode\nobreak\ ms<t_{i}<200\leavevmode\nobreak\ ms$ 。我们采用均匀分布 $\mathcal{U}\left(0,200\right)$ 来模拟这种不确定性。总之，我们使用以下方程来模拟现实世界的时间延迟。

t_{c}=f_{s}/v+\mathcal{U}\left(0,200\right)

(27)

which captures the influence of transmission size and asynchrony-caused uncertainty. In practice, we sample the time delay according to Eq. 26 and discretize it to the observed timestamps, which are discrete in a 10Hz update system.
捕捉传输大小和异步引起的不确定性的影响。在实践中，我们根据公式 26 采样时间延迟，并将其离散化到观察到的时戳，这些时戳在 10Hz 更新系统中是离散的。

References

[1] Rt3000. https://www.oxts.com/products/rt3000-v3, accessed: 2021-11-11
[2] Institue for AI Industry Research (AIR), T.U.: Vehicle-infrastructure cooperative autonomous driving: Dair-v2x dataset (2021)
[3] Arena, F., Pau, G.: An overview of vehicular communications. Future Internet 11(2), 27 (2019)
[4] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
[5] Chen, Q., Ma, X., Tang, S., Guo, J., Yang, Q., Fu, S.: F-cooper: Feature based cooperative perception for autonomous vehicle edge computing system using 3d point clouds. In: Proceedings of the 4th ACM/IEEE Symposium on Edge Computing. pp. 88–100 (2019)
[6] Chen, Q., Tang, S., Yang, Q., Fu, S.: Cooper: Cooperative perception for connected autonomous vehicles based on 3d point clouds. In: 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS). pp. 514–524. OPTorganization (2019)
[7] Chu, X., Tian, Z., Wang, Y., Zhang, B., Ren, H., Wei, X., Xia, H., Shen, C.: Twins: Revisiting the design of spatial attention in vision transformers. arXiv preprint arXiv:2104.13840 1(2), 3 (2021)
[8] Dong, X., Bao, J., Chen, D., Zhang, W., Yu, N., Yuan, L., Chen, D., Guo, B.: Cswin transformer: A general vision transformer backbone with cross-shaped windows. arXiv preprint arXiv:2107.00652 (2021)
[9] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
[10] Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., Koltun, V.: CARLA: An open urban driving simulator. In: Proceedings of the 1st Annual Conference on Robot Learning. pp. 1–16 (2017)
[11] El Madawi, K., Rashed, H., El Sallab, A., Nasr, O., Kamel, H., Yogamani, S.: Rgb and lidar fusion based 3d semantic segmentation for autonomous driving. In: 2019 IEEE Intelligent Transportation Systems Conference (ITSC). pp. 7–12. OPTorganization (2019)
[12] Fan, X., Zhou, Z., Shi, P., Xin, Y., Zhou, X.: Rafm: Recurrent atrous feature modulation for accurate monocular depth estimating. IEEE Signal Processing Letters pp. 1–5 (2022). https://doi.org/10.1109/LSP.2022.3189597
[13] Fan, Z., Song, Z., Liu, H., Lu, Z., He, J., Du, X.: Svt-net: Super light-weight sparse voxel transformer for large scale place recognition. AAAI (2022)
[14] Fan, Z., Zhu, Y., He, Y., Sun, Q., Liu, H., He, J.: Deep learning on monocular object pose detection and tracking: A comprehensive overview. ACM Computing Surveys (CSUR) (2021)
[15] Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. In: International Conference on Machine Learning. pp. 1319–1327. PMLR (2013)
[16] Guo, Y., Ma, J., Leslie, E., Huang, Z.: Evaluating the effectiveness of integrated connected automated vehicle applications applied to freeway managed lanes. IEEE Transactions on Intelligent Transportation Systems (2020)
[17] Hu, H., Zhang, Z., Xie, Z., Lin, S.: Local relation networks for image recognition. In: ICCV. pp. 3464–3473 (2019)
[18] Hu, Z., Dong, Y., Wang, K., Sun, Y.: Heterogeneous graph transformer. In: Proceedings of The Web Conference 2020. pp. 2704–2710 (2020)
[19] Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: NeurIPS (2015)
[20] Kenney, J.B.: Dedicated short-range communications (dsrc) standards in the united states. Proceedings of the IEEE 99(7), 1162–1182 (2011)
[21] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
[22] Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: Pointpillars: Fast encoders for object detection from point clouds. In: CVPR. pp. 12697–12705 (2019)
[23] Lei, Z., Ren, S., Hu, Y., Zhang, W., Chen, S.: Latency-aware collaborative perception. arXiv preprint arXiv:2207.08560 (2022)
[24] Li, Y., Ma, D., An, Z., Wang, Z., Zhong, Y., Chen, S., Feng, C.: V2x-sim: Multi-agent collaborative perception dataset and benchmark for autonomous driving. IEEE Robotics and Automation Letters (2022)
[25] Li, Y., Ren, S., Wu, P., Chen, S., Feng, C., Zhang, W.: Learning distilled collaboration graph for multi-agent perception. NeurIPS 34 (2021)
[26] Li, Y., Zhuang, Y., Hu, X., Gao, Z., Hu, J., Chen, L., He, Z., Pei, L., Chen, K., Wang, M., et al.: Toward location-enabled iot (le-iot): Iot positioning techniques, error sources, and error mitigation. IEEE Internet of Things Journal 8(6), 4035–4062 (2020)
[27] Liang, M., Yang, B., Wang, S., Urtasun, R.: Deep continuous fusion for multi-sensor 3d object detection. In: ECCV. pp. 641–656 (2018)
[28] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: ICCV. pp. 2980–2988 (2017)
[29] Liu, Y.C., Tian, J., Ma, C.Y., Glaser, N., Kuo, C.W., Kira, Z.: Who2com: Collaborative perception via learnable handshake communication. In: 2020 IEEE International Conference on Robotics and Automation (ICRA). pp. 6876–6883. IEEE (2020)
[30] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030 (2021)
[31] Mo, Y., Zhang, P., Chen, Z., Ran, B.: A method of vehicle-infrastructure cooperative perception based vehicle state information fusion using improved kalman filter. Multimedia Tools and Applications pp. 1–18 (2021)
[32] Rauch, A., Klanner, F., Dietmayer, K.: Analysis of v2x communication parameters for the development of a fusion architecture for cooperative perception systems. In: 2011 IEEE Intelligent Vehicles Symposium (IV). pp. 685–690. OPTorganization (2011)
[33] Rauch, A., Klanner, F., Rasshofer, R., Dietmayer, K.: Car2x-based perception in a high-level fusion architecture for cooperative perception systems. In: 2012 IEEE Intelligent Vehicles Symposium. pp. 270–275. OPTorganization (2012)
[34] Rawashdeh, Z.Y., Wang, Z.: Collaborative automated driving: A machine learning-based method to enhance the accuracy of shared information. In: 2018 21st International Conference on Intelligent Transportation Systems (ITSC). pp. 3961–3966. OPTorganization (2018)
[35] Shi, S., Guo, C., Jiang, L., Wang, Z., Shi, J., Wang, X., Li, H.: Pv-rcnn: Point-voxel feature set abstraction for 3d object detection. In: CVPR. pp. 10529–10538 (2020)
[36] Shi, S., Wang, X., Li, H.: Pointrcnn: 3d object proposal generation and detection from point cloud. In: CVPR. pp. 770–779 (2019)
[37] Treml, M., Arjona-Medina, J., Unterthiner, T., Durgesh, R., Friedmann, F., Schuberth, P., Mayr, A., Heusel, M., Hofmarcher, M., Widrich, M., et al.: Speeding up semantic segmentation for autonomous driving. In: NeurIPS workshop MLITS (2016)
[38] Tsukada, M., Oi, T., Ito, A., Hirata, M., Esaki, H.: Autoc2x: Open-source software to realize v2x cooperative perception among autonomous vehicles. In: 2020 IEEE 92nd Vehicular Technology Conference (VTC2020-Fall). pp. 1–6. OPTorganization (2020)
[39] Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y.: Maxim: Multi-axis mlp for image processing. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 5769–5780 (2022)
[40] Tu, Z., Talebi, H., Zhang, H., Yang, F., Milanfar, P., Bovik, A., Li, Y.: Maxvit: Multi-axis vision transformer. arXiv preprint arXiv:2204.01697 (2022)
[41] Vadivelu, N., Ren, M., Tu, J., Wang, J., Urtasun, R.: Learning to communicate and correct pose errors. arXiv preprint arXiv:2011.05289 (2020)
[42] Vaswani, A., Ramachandran, P., Srinivas, A., Parmar, N., Hechtman, B., Shlens, J.: Scaling local self-attention for parameter efficient visual backbones. In: CVPR. pp. 12894–12904 (2021)
[43] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NeurIPS. pp. 5998–6008 (2017)
[44] Wang, H., Zhu, Y., Green, B., Adam, H., Yuille, A., Chen, L.C.: Axial-deeplab: Stand-alone axial-attention for panoptic segmentation. In: European Conference on Computer Vision. pp. 108–126. Springer (2020)
[45] Wang, T.H., Manivasagam, S., Liang, M., Yang, B., Zeng, W., Urtasun, R.: V2vnet: Vehicle-to-vehicle communication for joint perception and prediction. In: ECCV. pp. 605–621. Springer (2020)
[46] Wang, Z., Cun, X., Bao, J., Liu, J.: Uformer: A general u-shaped transformer for image restoration. arXiv preprint arXiv:2106.03106 (2021)
[47] Xia, X., Hang, P., Xu, N., Huang, Y., Xiong, L., Yu, Z.: Advancing estimation accuracy of sideslip angle by fusing vehicle kinematics and dynamics information with fuzzy logic. IEEE Transactions on Vehicular Technology (2021)
[48] Xu, R., Guo, Y., Han, X., Xia, X., Xiang, H., Ma, J.: Opencda: an open cooperative driving automation framework integrated with co-simulation. In: 2021 IEEE International Intelligent Transportation Systems Conference (ITSC). pp. 1155–1162. OPTorganization (2021)
[49] Xu, R., Tu, Z., Xiang, H., Shao, W., Zhou, B., Ma, J.: Cobevt: Cooperative bird’s eye view semantic segmentation with sparse transformers. arXiv preprint arXiv:2207.02202 (2022)
[50] Xu, R., Xiang, H., Xia, X., Han, X., Liu, J., Ma, J.: Opv2v: An open benchmark dataset and fusion pipeline for perception with vehicle-to-vehicle communication. arXiv preprint arXiv:2109.07644 (2021)
[51] Yan, Y., Mao, Y., Li, B.: Second: Sparsely embedded convolutional detection. Sensors 18(10), 3337 (2018)
[52] Yang, B., Luo, W., Urtasun, R.: Pixor: Real-time 3d object detection from point clouds. In: CVPR. pp. 7652–7660 (2018)
[53] Yang, Z., Sun, Y., Liu, S., Shen, X., Jia, J.: Std: Sparse-to-dense 3d object detector for point cloud. In: CVPR. pp. 1951–1960 (2019)
[54] Yuan, Y., Cheng, H., Sester, M.: Keypoints-based deep feature fusion for cooperative vehicle detection of autonomous driving. IEEE Robotics and Automation Letters 7(2), 3054–3061 (2022)
[55] Zelin, Z., Ze, W., Yueqing, Z., Boxun, L., Jiaya, J.: Tracking objects as pixel-wise distributions. arXiv preprint arXiv:2207.05518 (2022)
[56] Zhang, H., Wu, C., Zhang, Z., Zhu, Y., Lin, H., Zhang, Z., Sun, Y., He, T., Mueller, J., Manmatha, R., et al.: Resnest: Split-attention networks. arXiv preprint arXiv:2004.08955 (2020)
[57] Zhang, Z., Fisac, J.F.: Safe occlusion-aware autonomous driving via game-theoretic active perception. arXiv preprint arXiv:2105.08169 (2021)
[58] Zhao, X., Mu, K., Hui, F., Prehofer, C.: A cooperative vehicle-infrastructure based urban driving environment perception method using a ds theory-based credibility map. Optik 138, 407–415 (2017)
[59] Zhong, Y., Zhu, M., Peng, H.: Vin: Voxel-based implicit network for joint 3d object detection and segmentation for lidars. arXiv preprint arXiv:2107.02980 (2021)
[60] Zhou, Y., Tuzel, O.: Voxelnet: End-to-end learning for point cloud based 3d object detection. In: CVPR. pp. 4490–4499 (2018)
[61] Zhou, Z., Fan, X., Shi, P., Xin, Y.: R-msfm: Recurrent multi-scale feature modulation for monocular depth estimating. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12777–12786 (2021)

	Scene 1 场景 1	Scene 2 场景 2	Scene 3 场景 3
F-Cooper [5] F-Cooper [5]
V2VNet [45] V2VNet [45]
OPV2V [50]
DiscoNet [25] 迪斯科网[25]
V2X-ViT (ours) V2X-ViT（ ours）

	Scene 4 场景 4	Scene 5 场景 5	Scene 6 场景 6
F-Cooper [5] F-Cooper [5]
V2VNet [45] V2VNet [45]
OPV2V [50]
DiscoNet [25] 迪斯科网[25]
V2X-ViT (ours) V2X-ViT（ ours）

	Scene 7 第七场	Scene 8 场景 8
F-Cooper [5] F-Cooper [5]
V2VNet [45] V2VNet [45]
OPV2V [50]
DiscoNet [25] 迪斯科网[25]
V2X-ViT (ours) V2X-ViT（ ours）

V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision TransformerV2X-ViT：基于视觉 Transformer 的车辆到万物协同感知

Abstract 摘要

Keywords:

1 Introduction 1 引言

2 Related work 2 相关工作

3 Methodology 3 方法论

3.1 Main architecture design3.1 主要架构设计

3.2 V2X-Vision Transformer3.2V2X-视觉 Transformer

3.2.1 Heterogeneous multi-agent self-attention3.2.1 异构多智能体自注意力

3.2.2 Multi-scale window attention3.2.2 多尺度窗口注意力

3.2.3 Delay-aware positional encoding3.2.3 延迟感知位置编码

4 Experiments 4 实验

4.1 V2XSet: An open dataset for V2X cooperative perception4.1V2XSet：一个用于 V2X 协同感知的开源数据集

4.2 Experimental setup 4.2 实验设置

4.3 Quantitative evaluation4.3 定量评估

4.4 Qualitative evaluation4.4 定性评估

4.5 Ablation studies 4.5 消融研究

5 Conclusion 5 结论

Appendix 附录

Appendix 0.A Discussion of design choice附录 0.A 设计选择讨论

Appendix 0.B Model Details and Analysis附录 0.B 模型细节与分析

0.B.1 Spatial-Temporal Correction Module0.B.1 空间-时间校正模块

0.B.2 Multi-Scale Window Attention (MSwin)0.B.2 多尺度窗口注意力（MSwin）

0.B.3 Architectural Configurations0.B.3 建筑配置

Appendix 0.C V2XSet Dataset附录 0.CV2XSet 数据集

Appendix 0.D More Experimental Results附录 0.D 更多实验结果

0.D.1 Performance for identifying dynamic objects0.D.1 动态对象识别性能

0.D.2 Performance for different road types0.D.2 不同路面类型的性能