
VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion

Yiming Li Zhiding Yu Christopher Choy Chaowei Xiao
Jose M. Alvarez Sanja Fidler Chen Feng Anima Anandkumar
NYU, NVIDIA, ASU, University of Toronto, Vector Institute, Caltech

Abstract

Humans can easily imagine the complete geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete volumetric semantics from only images. Our framework adopts a two-stage design where we start from a sparse set of visible and occupied voxel queries from depth estimation, followed by a densification stage that generates dense voxels from the sparse ones. A key idea of this design is that the visual features on images correspond only to the visible scene structures rather than to the occluded or empty spaces. Therefore, starting with the featurization and prediction of the visible structures is more reliable. Once we obtain the set of sparse queries, we apply a masked autoencoder design to propagate the information to all the voxels by self-attention. Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art with relative improvements in both geometry and semantics, and reduces GPU memory during training to less than 16GB. Our code is available at https://github.com/NVlabs/VoxFormer.

1. Introduction

Holistic 3D scene understanding is an important problem in autonomous vehicle (AV) perception. It directly affects downstream tasks such as planning and map construction. However, obtaining accurate and complete 3D information of the real world is difficult, since the task is challenged by the lack of sensing resolution and the incomplete observation due to the limited field of view and occlusions.
To tackle these challenges, semantic scene completion (SSC) [1] was proposed to jointly infer the complete scene geometry and semantics from limited observations. An SSC solution has to simultaneously address two subtasks: scene reconstruction for visible areas and scene hallucination for occluded regions. This task is further backed by the fact that humans can naturally reason about scene geometry and semantics from partial observations. However, there is still a significant performance gap between state-of-the-art SSC methods [2] and human perception in driving scenes.

Figure 1. (a) A diagram of VoxFormer for camera-based semantic scene completion, which predicts complete 3D geometry and semantics given only images. After obtaining voxel query proposals based on depth, VoxFormer generates semantic voxels via an MAE-like architecture [3]. (b) A comparison against the state-of-the-art MonoScene [4] at different ranges on SemanticKITTI [5]. VoxFormer performs much better in safety-critical short-range areas, while MonoScene performs similarly at all three distances. The relative gains are marked in red.
Most existing SSC solutions consider LiDAR a primary modality to enable accurate geometric measurement [6-9]. However, LiDAR sensors are expensive and less portable, while cameras are cheaper and provide richer visual cues of the driving scenes. This motivated the study of camera-based SSC solutions, as first proposed in the pioneering work of MonoScene [4]. MonoScene lifts 2D image inputs to 3D using dense feature projection. However, such a projection inevitably assigns features of visible regions to the empty or occluded voxels. For example, an empty voxel occluded by a car will still get the car's visual feature. As a result, the generated 3D features contain many ambiguities for subsequent geometric completion and semantic segmentation, resulting in unsatisfactory performance.
Our contributions. Unlike MonoScene, VoxFormer considers 3D-to-2D cross-attention to represent the sparse queries. The proposed design is motivated by two insights: (1) reconstruction-before-hallucination: the non-visible region's information can be better completed using the reconstructed visible areas as starting points; and (2) sparsity-in-3D-space: since a large volume of the 3D space is usually unoccupied, using a sparse representation instead of a dense one is certainly more efficient and scalable. Our contributions in this work can be summarized as follows:
  • A novel two-stage framework that lifts images into a complete 3D voxelized semantic scene.
  • A novel query proposal network based on 2D convolutions that generates reliable queries from image depth.
  • A novel Transformer similar to masked autoencoder (MAE) [3] that yields complete 3D scene representation.
  • VoxFormer sets a new state-of-the-art in camera-based SSC on SemanticKITTI [5], as shown in Fig. 1 (b).
VoxFormer consists of class-agnostic query proposal (stage-1) and class-specific semantic segmentation (stage-2), where stage-1 proposes a sparse set of occupied voxels, and stage-2 completes the scene representations starting from the proposals given by stage-1. Specifically, stage-1 has a lightweight 2D CNN-based query proposal network that uses the image depth to reconstruct the scene geometry. It then proposes a sparse set of voxels from predefined learnable voxel queries over the entire field of view. Stage-2 is based on a novel sparse-to-dense MAE-like architecture, as shown in Fig. 1 (a). It first strengthens the featurization of the proposed voxels by allowing them to attend to the image observations. Next, the non-proposed voxels are associated with a learnable mask token, and the full set of voxels is processed by self-attention to complete the scene representations for per-voxel semantic segmentation.
Extensive tests on the large-scale SemanticKITTI [5] show that VoxFormer achieves state-of-the-art performance in geometric completion and semantic segmentation. More importantly, the improvements are significant in safety-critical short-range areas, as shown in Fig. 1 (b).
2. Related Work

3D reconstruction and completion. 3D reconstruction aims to infer the 3D geometry of objects or scenes from single or multiple 2D images. This challenging problem receives extensive attention in both the traditional computer vision era [10] and the recent deep learning era [11]. 3D reconstruction can be divided into (1) single-view reconstruction by learning shape priors from massive data [12-15], and (2) multi-view reconstruction by leveraging different viewpoints [16,17]. Both explicit [12,14] and implicit representations [18-22] have been investigated for the object/scene. Unlike 3D reconstruction, 3D completion requires the model to hallucinate unseen structure, which relates it to single-view reconstruction, except that the input is 3D rather than 2D. 3D object shape completion is an active research topic that estimates the complete geometry from a partial shape in the format of points [23-25], voxels [26-28], distance fields [29], etc. In addition to object-level completion, scene-level 3D completion has also been investigated in both indoor [30] and outdoor scenes [31]: [30] proposes a sparse generative network to convert a partial RGB-D scan into a high-resolution 3D reconstruction with missing geometry, while [31] learns a neural network to convert each scan into a dense volume of truncated signed distance fields (TSDF).
Semantic segmentation. Human-level scene understanding for intelligent agents is typically advanced by semantic segmentation on images [32] or point clouds [33]. Researchers have significantly improved image segmentation performance with a variety of deep learning techniques, such as convolutional neural networks [34,35], vision transformers [36,37], and prototype learning [38,39]. For intelligent agents to interact with the 3D environment, thinking in 3D is essential because the physical world is not 2D but rather 3D. Thus, various 3D point cloud segmentation methods have been developed to address 3D semantic understanding [40-42]. However, real-world sensing in 3D is inherently sparse and incomplete. For holistic semantic understanding, it is insufficient to solely parse the sparse measurements while ignoring the unobserved scene structures.
3D semantic scene completion. Holistic 3D scene understanding is challenged by the limited sensing range, and researchers have proposed multi-agent collaborative perception to introduce more observations of the 3D scene [43-49]. Another line of research is 3D semantic scene completion, unifying scene completion and semantic segmentation, which were investigated separately at an early stage [50,51]. SSCNet [1] first defined the semantic scene completion task, where geometry and semantics are jointly inferred given an incomplete visual observation. In recent years, SSC in indoor scenes at a relatively small scale has been intensively studied [52-59]. Meanwhile, SSC in large-scale outdoor scenes has also started to receive attention after the release of the SemanticKITTI dataset [5]. Semantic scene completion from a sparse observation is a highly desirable capability for autonomous vehicles since it can generate a dense 3D voxelized semantic representation of the scene. Such a representation can aid 3D semantic map construction of the static environment and help perceive dynamic objects. Unfortunately, SSC for large-scale driving scenes is only at the preliminary development and exploration stage. Existing works commonly depend on 3D input such as LiDAR point clouds [6-9, 60]. In contrast, the recent MonoScene [4] has studied semantic scene completion from a monocular image. It proposes 2D-3D feature projections and uses successive 2D and 3D UNets to achieve camera-only 3D semantic scene completion. However, 2D-to-3D feature projection is prone to introducing false features for unoccupied 3D positions, and the heavy 3D convolutions lower the system's efficiency.
Camera-based 3D perception. Camera-based systems have received extensive attention in the autonomous driving community because the camera is low-cost, easy to deploy, and widely available. In addition, the camera can provide rich visual attributes of the scene to help vehicles achieve holistic scene understanding. Several works have recently been proposed for 3D object detection or map segmentation from RGB images. Inspired by DETR [62] in 2D detection, DETR3D [63] links learnable 3D object queries with images by camera projection matrices and enables end-to-end 3D bounding box prediction without non-maximum suppression (NMS). M2BEV [64] also investigates the viability of simultaneously running multi-task perception based on BEV features. BEVFormer [65] proposes a spatiotemporal transformer that aggregates BEV features from current and previous features via deformable attention [66]. Compared to object detection, semantic scene completion can provide occupancy for each small cell instead of assigning a fixed-size bounding box to an object. This can help identify irregularly-shaped objects such as an overhanging obstacle. Compared to a 2D BEV representation, a 3D voxelized scene representation carries more information, which is particularly helpful when vehicles are driving over bumpy roads. Hence, dense volumetric semantics can provide a more comprehensive 3D scene representation, yet how to create it with only cameras has received scarce attention.

3. Methodology

3.1. Preliminary

Problem setup. We aim to predict a dense semantic scene within a certain volume in front of the vehicle, given only RGB images. More specifically, we use as input the current and previous images, denoted by $\mathcal{I}_t$, and use as output a voxel grid $\mathbf{Y}_t \in \{c_0, c_1, \ldots, c_M\}^{H \times W \times Z}$ defined in the coordinate frame of the ego vehicle at timestamp $t$, where each voxel is either empty (denoted by $c_0$) or occupied by a certain semantic class in $\{c_1, \ldots, c_M\}$. Here $M$ denotes the total number of classes of interest, and $H$, $W$, $Z$ denote the length, width, and height of the voxel grid, respectively. In summary, the overall objective is to learn a neural network $\Theta$ whose predicted semantic voxel grid $\hat{\mathbf{Y}}_t = \Theta(\mathcal{I}_t)$ is as close to the ground truth $\mathbf{Y}_t$ as possible. Note that previous SSC works commonly consider 3D input [2]; the work most related to ours [4] considers a single image as input, which is a special case of our setting.
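As a concrete illustration of this input/output interface, a minimal sketch follows; the voxel grid dimensions, class count, and image size below follow a common SemanticKITTI-style SSC setup and are our assumptions, not values stated in this section.

```python
import torch

# Assumed SemanticKITTI-style SSC output volume: 256 x 256 x 32 voxels,
# 19 semantic classes of interest plus one "empty" class (c_0).
H, W, Z = 256, 256, 32
M = 19  # number of semantic classes of interest

# Input: current + previous RGB images (here: 2 frames of assumed size 3 x 370 x 1220).
images = torch.rand(2, 3, 370, 1220)

# Output: one class index per voxel; 0 denotes empty, 1..M are semantic classes.
pred = torch.randint(0, M + 1, (H, W, Z))
print(pred.shape)  # torch.Size([256, 256, 32])
```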

Design rationale. Motivated by reconstruction-before-hallucination and sparsity-in-3D-space, we build a two-stage framework: stage-1, based on a CNN, proposes a sparse set of voxel queries from the image depth to attend to the images, since the image features correspond to visible and occupied voxels rather than non-visible and empty ones; stage-2, based on a Transformer, uses an MAE-like architecture to first strengthen the featurization of the proposed voxels by voxel-to-image cross-attention, and then processes the full set of voxels with self-attention to enable voxel interactions.

3.2. Overall Architecture

We learn 3D voxel features from 2D images for SSC based on Transformer, as illustrated in Fig. 2: our architecture extracts 2D features from RGB images and then uses a sparse set of 3D voxel queries to index into these features, linking 3D positions to an image stream using camera projection matrices. Specifically, voxel queries are 3D-grid-shaped learnable parameters designed to query features inside the 3D volume from images via attention mechanisms [67]. Our framework is a two-stage cascade composed of class-agnostic proposals and class-specific segmentation similar to [68]: stage-1 generates class-agnostic query proposals, and stage-2 uses an MAE-like architecture to propagate information to all voxels. Ultimately, the voxel features will be up-sampled for semantic segmentation. A more specific procedure is as follows:
  • Extract 2D features $\mathbf{F}^{2D} \in \mathbb{R}^{H_f \times W_f \times d}$ from the RGB image using a ResNet-50 backbone [61], where $H_f \times W_f$ is the spatial resolution and $d$ is the feature dimension.
  • Generate class-agnostic query proposals $\mathbf{Q}_p$, a subset of the predefined voxel queries $\mathbf{Q}$, where $N_p$ and $N_q$ are the number of query proposals and the total number of voxel queries, respectively.
  • Refine the voxel features $\mathbf{F}^{3D}$ in two steps: (1) update the subset of voxels corresponding to the query proposals by using $\mathbf{Q}_p$ to attend to the image features $\mathbf{F}^{2D}$ via cross-attention, and (2) update all voxels by letting them attend to each other via self-attention.
  • Output the dense semantic map $\hat{\mathbf{Y}}_t$ by up-sampling and linear projection of the refined voxel features.
We will detail the voxel queries in Sec. 3.3, stage-1 in Sec. 3.4, stage-2 in Sec. 3.5, and training loss in Sec. 3.6.
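The listed steps can be summarized by the following high-level sketch of the two-stage pipeline; module names such as `backbone_2d`, `query_proposal`, `cross_attn`, `self_attn`, and `head` are hypothetical placeholders of ours, not identifiers from the released VoxFormer code.

```python
import torch

def voxformer_forward(images, backbone_2d, query_proposal, voxel_queries,
                      mask_token, cross_attn, self_attn, head):
    """High-level sketch of the two-stage VoxFormer pipeline (illustrative only)."""
    # 1) Extract 2D features from the RGB image(s) with a ResNet-50 backbone.
    feats_2d = backbone_2d(images)                          # (N_img, d, Hf, Wf)

    # 2) Stage-1: class-agnostic query proposal from depth-based occupancy.
    keep = query_proposal(images)                            # (h*w*z,) boolean mask
    q_all = voxel_queries                                    # (h*w*z, d) learnable queries
    q_prop = q_all[keep]                                     # (N_p, d) proposed queries

    # 3) Stage-2: refine proposed queries with image features (cross-attention),
    #    fill the remaining positions with mask tokens, then run self-attention.
    q_prop = cross_attn(q_prop, feats_2d)                    # (N_p, d)
    f3d = mask_token.expand(q_all.shape[0], -1).clone()      # (h*w*z, d)
    f3d[keep] = q_prop
    f3d = self_attn(f3d)                                     # (h*w*z, d)

    # 4) Upsample and project to per-voxel semantic logits.
    return head(f3d)                                         # (H, W, Z, M + 1)
```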

3.3. Predefined Parameters

Figure 2. Overall framework of VoxFormer. Given RGB images, 2D features are extracted by ResNet-50 [61] and the depth is estimated by an off-the-shelf depth predictor. The corrected depth enables the class-agnostic query proposal stage: queries located at occupied positions are selected to carry out deformable cross-attention with the image features. Afterwards, mask tokens are added to complete the voxel features by deformable self-attention. The refined voxel features are upsampled and projected to the output space for per-voxel semantic segmentation. Note that our framework supports the input of single or multiple images.

Voxel queries. We pre-define a total of $N_q$ voxel queries as a cluster of 3D-grid-shaped learnable parameters $\mathbf{Q} \in \mathbb{R}^{h \times w \times z \times d}$, as shown in the bottom left corner of Fig. 2, with a spatial resolution $h \times w \times z$ that is lower than the output resolution $H \times W \times Z$ to save computations. Note that $d$ denotes the feature dimension, which is equal to that of the image features. More specifically, a single voxel query $\mathbf{q} \in \mathbb{R}^d$ located at position $(i, j, k)$ of $\mathbf{Q}$ is responsible for the corresponding 3D voxel inside the volume, and each voxel corresponds to a fixed real-world size in meters. Meanwhile, the voxel queries are defined in the ego vehicle's coordinate frame, and learnable positional embeddings are added to the voxel queries for the attention stages, following existing works on 2D BEV feature learning [65].
Mask token. While some voxel queries are selected to attend to the images, the remaining voxels are associated with another learnable parameter in order to complete the 3D voxel features. We name this learnable parameter a mask token [3] for conciseness, since being unselected from $\mathbf{Q}$ is analogous to being masked out of $\mathbf{Q}$. Specifically, each mask token $\mathbf{m} \in \mathbb{R}^d$ is a learnable vector that indicates the presence of a missing voxel to be predicted. Positional embeddings are also added to help the mask tokens be aware of their 3D locations.
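In code, the voxel queries, mask token, and positional embeddings can be declared as plain learnable parameters; the sketch below assumes a low-resolution grid size and feature dimension that are not taken from the text above.

```python
import torch
import torch.nn as nn

class PredefinedParams(nn.Module):
    """Learnable voxel queries, mask token, and positional embeddings (sketch)."""
    def __init__(self, h=128, w=128, z=16, d=128):  # assumed sizes, not from the paper text
        super().__init__()
        self.h, self.w, self.z = h, w, z
        # One d-dim query per low-resolution voxel (3D-grid-shaped parameters).
        self.voxel_queries = nn.Parameter(torch.randn(h * w * z, d) * 0.02)
        # A single shared mask token for all non-proposed voxels.
        self.mask_token = nn.Parameter(torch.randn(1, d) * 0.02)
        # Learnable positional embedding per voxel location, added before attention.
        self.pos_embed = nn.Parameter(torch.randn(h * w * z, d) * 0.02)

    def forward(self):
        return self.voxel_queries + self.pos_embed, self.mask_token
```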

3.4. Stage-1: Class-Agnostic Query Proposal

Our stage-1 determines which voxels should be queried based on depth: the occupied voxels deserve careful attention, while the empty ones can be detached from the group. Given a 2D RGB observation, we first obtain a 3D representation of the scene based on depth estimation. Afterwards, we acquire the query positions by an occupancy prediction that helps correct the inaccurate image depth.
Depth estimation. We leverage off-the-shelf depth estimation models such as monocular depth [69] or stereo depth [70] to directly predict the depth $Z(u, v)$ of each image pixel $(u, v)$. Afterwards, the depth map is back-projected into a 3D point cloud: a pixel $(u, v)$ is transformed into a 3D point $(x, y, z)$ by:
$$x = \frac{(u - c_u)\, z}{f_u}, \quad y = \frac{(v - c_v)\, z}{f_v}, \quad z = Z(u, v),$$
where $(c_u, c_v)$ is the camera center and $f_u$ and $f_v$ are the horizontal and vertical focal lengths. However, the resulting 3D point cloud has low quality, especially in the long-range area, because the depth at the horizon is extremely inconsistent; only a few pixels determine the depth of a large area.
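The back-projection above amounts to a few lines of tensor code; in this sketch, `depth`, `fu`, `fv`, `cu`, and `cv` are assumed to come from the depth predictor and the camera intrinsics.

```python
import torch

def backproject_depth(depth, fu, fv, cu, cv):
    """Lift a depth map (H_img, W_img) to a 3D point cloud (N, 3) via the pinhole model."""
    H_img, W_img = depth.shape
    v, u = torch.meshgrid(torch.arange(H_img), torch.arange(W_img), indexing="ij")
    z = depth
    x = (u - cu) * z / fu   # horizontal axis
    y = (v - cv) * z / fv   # vertical axis
    return torch.stack([x, y, z], dim=-1).reshape(-1, 3)
```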
Depth correction. To obtain satisfactory query proposals, we employ a model $\Theta_{occ}$ to predict an occupancy map at a lower spatial resolution to help correct the image depth. Specifically, the synthetic point cloud is first converted into a binary voxel grid map $\mathbf{M}_{in} \in \{0, 1\}^{H \times W \times Z}$, where each voxel is marked as 1 if it is occupied by at least one point. We then predict the occupancy by $\mathbf{M}_{out} = \Theta_{occ}(\mathbf{M}_{in})$, where $\mathbf{M}_{out} \in \{0, 1\}^{h \times w \times z}$ has a lower resolution than the input $\mathbf{M}_{in}$, since a lower resolution is more robust to depth errors and compatible with the resolution of the voxel queries. $\Theta_{occ}$ is a lightweight UNet-like model adapted from [6], mainly using 2D convolutions for binary classification of each voxel.
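The voxelization of the synthetic point cloud into the binary input grid can look like the sketch below; the grid extents, origin, and voxel size are assumptions rather than values stated here.

```python
import torch

def voxelize_points(points, voxel_size=0.2, grid=(256, 256, 32),
                    origin=(0.0, -25.6, -2.0)):
    """Mark a voxel as occupied (1) if at least one back-projected point falls inside it."""
    occ = torch.zeros(grid, dtype=torch.uint8)
    idx = ((points - torch.tensor(origin)) / voxel_size).long()
    valid = ((idx >= 0) & (idx < torch.tensor(grid))).all(dim=1)
    idx = idx[valid]
    occ[idx[:, 0], idx[:, 1], idx[:, 2]] = 1
    return occ  # fed to the lightweight UNet-like occupancy network
```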
Query proposal. Following depth correction, we can select the voxel queries from $\mathbf{Q}$ based on the binary map $\mathbf{M}_{out}$:
$$\mathbf{Q}_p = \{\, \mathbf{q}_{(i,j,k)} \mid \mathbf{M}_{out}(i,j,k) = 1,\; \mathbf{q}_{(i,j,k)} \in \mathbf{Q} \,\},$$
where $\mathbf{Q}_p$ is the set of query proposals that will attend to the images later on. Our depth-based query proposal can: (1) save computation and memory by removing many empty spaces, and (2) ease attention learning by reducing ambiguities caused by erroneous 2D-to-3D correspondences.
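The selection itself is a simple boolean indexing operation on the predefined queries; a sketch assuming raw occupancy logits from the stage-1 network:

```python
import torch

def propose_queries(voxel_queries, occ_logits, threshold=0.5):
    """Keep only the queries whose low-resolution voxel is predicted occupied."""
    # voxel_queries: (h*w*z, d); occ_logits: (h, w, z) from the stage-1 network.
    occupied = (torch.sigmoid(occ_logits) > threshold).reshape(-1)  # (h*w*z,)
    query_proposals = voxel_queries[occupied]                       # (N_p, d)
    return query_proposals, occupied
```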

3.5. Stage-2: Class-Specific Segmentation

Following stage-1, we attend to the image features with the query proposals $\mathbf{Q}_p$ to learn rich visual features of the scene. For efficiency, we utilize deformable attention [66], which interacts with local regions of interest and only samples $N_s$ points around the reference point to compute the attention results. Mathematically, each query $\mathbf{q}$ is updated by the following general equation:
$$\mathrm{DA}(\mathbf{q}, \mathbf{p}, \mathbf{F}) = \sum_{s=1}^{N_s} \mathbf{A}_s \mathbf{W}_s \mathbf{F}(\mathbf{p} + \delta\mathbf{p}_s),$$
where $\mathbf{p}$ denotes the reference point, $\mathbf{F}$ represents the input features, and $s$ indexes the sampled point from a total of $N_s$ points. $\mathbf{W}_s$ denotes the learnable weights for value generation, and $\mathbf{A}_s$ is the learnable attention weight. $\delta\mathbf{p}_s$ is the predicted offset to the reference point $\mathbf{p}$, and $\mathbf{F}(\mathbf{p} + \delta\mathbf{p}_s)$ is the feature at location $\mathbf{p} + \delta\mathbf{p}_s$ extracted by bilinear interpolation. Note that we only show the formulation of single-head attention for conciseness.
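A minimal single-head, single-image version of this sampling scheme might look as follows; this is a sketch assuming one 2D feature map and reference points normalized to [0, 1], whereas the actual implementation in [66] is multi-head, multi-scale, and CUDA-optimized.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableAttention1Head(nn.Module):
    """Single-head deformable attention over one 2D feature map (illustrative sketch)."""
    def __init__(self, d=128, n_points=8):
        super().__init__()
        self.n_points = n_points
        self.offset = nn.Linear(d, n_points * 2)   # predicted offsets (delta p_s)
        self.weight = nn.Linear(d, n_points)       # attention weights (A_s)
        self.value = nn.Linear(d, d)               # value projection (W_s)

    def forward(self, queries, ref_points, feats):
        # queries: (N, d); ref_points: (N, 2) in [0, 1]; feats: (1, d, Hf, Wf)
        N, d = queries.shape
        offsets = self.offset(queries).view(N, self.n_points, 2)
        attn = self.weight(queries).softmax(dim=-1)                  # (N, n_points)
        value = self.value(feats.flatten(2).transpose(1, 2))         # (1, Hf*Wf, d)
        value = value.transpose(1, 2).reshape(feats.shape)           # (1, d, Hf, Wf)
        # Sampling locations mapped to [-1, 1] for grid_sample.
        loc = (ref_points.unsqueeze(1) + offsets) * 2.0 - 1.0        # (N, n_points, 2)
        sampled = F.grid_sample(value, loc.unsqueeze(0), align_corners=False)
        # sampled: (1, d, N, n_points) -> weighted sum over the sampled points.
        sampled = sampled.squeeze(0).permute(1, 2, 0)                # (N, n_points, d)
        return (attn.unsqueeze(-1) * sampled).sum(dim=1)             # (N, d)
```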
Deformable cross-attention. For each proposed query $\mathbf{q}_p$, we obtain its corresponding real-world location $\mathbf{p} = (x, y, z)$ based on the voxel resolution and the real size of the 3D volume of interest. Afterwards, we project the 3D point onto the image features based on the projection matrices. However, the projected 2D point can only fall on some of the images due to the limited field of view; we term these hit images $\mathcal{V}_{hit}$. After that, we regard these 2D points as the reference points of the query $\mathbf{q}_p$ and sample the features from the hit views around these reference points. Finally, we perform a weighted sum of the sampled features as the output of deformable cross-attention (DCA):
$$\mathrm{DCA}(\mathbf{q}_p, \mathbf{F}^{2D}) = \frac{1}{|\mathcal{V}_{hit}|} \sum_{n \in \mathcal{V}_{hit}} \mathrm{DA}\big(\mathbf{q}_p, \mathcal{P}(\mathbf{p}, n), \mathbf{F}^{2D}_n\big),$$
where $n$ indexes the images, and for each query proposal $\mathbf{q}_p$ located at $\mathbf{p}$, we use the camera projection function $\mathcal{P}(\mathbf{p}, n)$ to obtain the reference point on image $n$.
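Projecting a voxel query's 3D center to its 2D reference point on an image can be sketched as below; the projection matrix `P` is assumed to be the pre-computed camera intrinsics times the ego-to-camera extrinsics.

```python
import torch

def project_to_image(points_3d, P, img_h, img_w):
    """Project (N, 3) ego-frame voxel centers to normalized pixel reference points."""
    homo = torch.cat([points_3d, torch.ones_like(points_3d[:, :1])], dim=1)  # (N, 4)
    cam = homo @ P.T                                   # (N, 3): [u*z, v*z, z]
    z = cam[:, 2]
    uv = cam[:, :2] / z.clamp(min=1e-5).unsqueeze(1)   # pixel coordinates
    # A point "hits" this image only if it lies in front of the camera and inside the frame.
    hit = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < img_w) \
          & (uv[:, 1] >= 0) & (uv[:, 1] < img_h)
    ref = uv / torch.tensor([float(img_w), float(img_h)])  # normalized to [0, 1]
    return ref, hit
```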
Deformable self-attention. After several layers of deformable cross-attention, the query proposals are updated to $\hat{\mathbf{Q}}_p$. To get the complete voxel features, we combine the updated query proposals $\hat{\mathbf{Q}}_p$ and the mask tokens $\mathbf{m}$ to obtain the initial voxel features $\mathbf{F}^{3D}$. Then we use deformable self-attention (DSA) to get the refined voxel features $\hat{\mathbf{F}}^{3D}$:
$$\mathrm{DSA}(\mathbf{f}, \mathbf{F}^{3D}) = \mathrm{DA}(\mathbf{f}, \mathbf{p}, \mathbf{F}^{3D}),$$
where $\mathbf{f}$ could be either a mask token or an updated query proposal located at $\mathbf{p}$.
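Assembling the initial voxel features from the updated query proposals and the mask tokens before self-attention is essentially a scatter operation; a sketch:

```python
import torch

def assemble_voxel_features(updated_proposals, mask_token, occupied, pos_embed):
    """Place updated proposals at occupied positions and mask tokens elsewhere."""
    # updated_proposals: (N_p, d); mask_token: (1, d); occupied: (h*w*z,) bool
    n_vox = occupied.shape[0]
    f3d = mask_token.expand(n_vox, -1).clone()   # start from mask tokens everywhere
    f3d[occupied] = updated_proposals            # overwrite the proposed positions
    return f3d + pos_embed                       # positional embeddings for self-attention
```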
Output stage. After obtaining the refined voxel features $\hat{\mathbf{F}}^{3D}$, they are upsampled and projected to the output space to get the final output $\hat{\mathbf{Y}}_t \in \{c_0, c_1, \ldots, c_M\}^{H \times W \times Z}$, covering the $M$ semantic classes and one empty class.
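One way to realize this head is trilinear upsampling followed by a per-voxel linear classifier; the layer sizes below are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputHead(nn.Module):
    """Upsample refined voxel features and predict per-voxel class logits (sketch)."""
    def __init__(self, d=128, n_classes=20, scale=2):  # n_classes = M semantic + 1 empty
        super().__init__()
        self.scale = scale
        self.classifier = nn.Linear(d, n_classes)

    def forward(self, f3d, h, w, z):
        # f3d: (h*w*z, d) -> (1, d, h, w, z) for 3D upsampling.
        x = f3d.reshape(1, h, w, z, -1).permute(0, 4, 1, 2, 3)
        x = F.interpolate(x, scale_factor=self.scale, mode="trilinear", align_corners=False)
        x = x.permute(0, 2, 3, 4, 1)              # (1, H, W, Z, d)
        return self.classifier(x).squeeze(0)      # (H, W, Z, n_classes)
```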

3.6. Training Loss

We train stage-2 with a weighted cross-entropy loss. The ground truth $\mathbf{Y}_t$ defined at time $t$ represents a multi-class semantic voxel grid. Therefore, the loss can be computed by:
$$\mathcal{L}_{ce} = -\frac{1}{N_t} \sum_{n=1}^{N_t} \sum_{c=1}^{M+1} w_c\, \hat{y}_{n,c} \log\!\left(\frac{e^{y_{n,c}}}{\sum_{c'} e^{y_{n,c'}}}\right),$$
where $n$ is the voxel index, $N_t$ is the total number of voxels, $c$ indexes the class, $y_{n,c}$ is the predicted logit for the $n$-th voxel belonging to class $c$, and $\hat{y}_{n,c}$ is the $c$-th element of $\hat{\mathbf{y}}_n$ ($\hat{\mathbf{y}}_n$ is a one-hot vector if voxel $n$ belongs to class $c$). $w_c$ is a weight for each class set according to the inverse of the class frequency, as in [6]. We also use the scene-class affinity loss proposed in [4]. For stage-1, we employ a binary cross-entropy loss for occupancy prediction at a lower spatial resolution.
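The weighted cross-entropy term maps directly onto the standard PyTorch loss; in this sketch, the inverse-frequency class weights and the ignore index for unlabeled voxels are assumptions about the training setup.

```python
import torch
import torch.nn.functional as F

def ssc_ce_loss(logits, target, class_freq, ignore_index=255):
    """Weighted cross-entropy over voxels; weights are inverse class frequencies."""
    # logits: (H, W, Z, M+1); target: (H, W, Z) with class indices (0 = empty).
    weight = 1.0 / (class_freq + 1e-6)                       # inverse-frequency weighting
    return F.cross_entropy(logits.permute(3, 0, 1, 2).unsqueeze(0),
                           target.unsqueeze(0).long(),
                           weight=weight, ignore_index=ignore_index)
```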