
VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion

Yiming Li    Zhiding Yu    Christopher Choy    Chaowei Xiao
Jose M. Alvarez    Sanja Fidler    Chen Feng    Anima Anandkumar
NYU    NVIDIA    ASU    University of Toronto    Vector Institute    Caltech

Abstract

Humans can easily imagine the complete geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images. Our framework adopts a two-stage design where we start from a sparse set of visible and occupied voxel queries from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones. A key idea of this design is that the visual features on 2D images correspond only to the visible scene structures rather than the occluded or empty spaces. Therefore, starting with the featurization and prediction of the visible structures is more reliable. Once we obtain the set of sparse queries, we apply a masked autoencoder design to propagate the information to all the voxels by self-attention. Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art, with relative improvements in both geometry and semantics, and reduces GPU memory during training to less than 16 GB. Our code is available at https://github.com/NVlabs/VoxFormer.

1. Introduction

Holistic 3D scene understanding is an important problem in autonomous vehicle (AV) perception. It directly affects downstream tasks such as planning and map construction. However, obtaining accurate and complete 3D information of the real world is difficult, since the task is challenged by the lack of sensing resolution and the incomplete observation due to the limited field of view and occlusions.
To tackle these challenges, semantic scene completion (SSC) [1] was proposed to jointly infer the complete scene geometry and semantics from limited observations. An SSC solution has to simultaneously address two subtasks: scene reconstruction for visible areas and scene hallucination for occluded regions.
Figure 1. (a) A diagram of VoxFormer for camera-based semantic scene completion, which predicts complete 3D geometry and semantics given only 2D images. After obtaining voxel query proposals based on depth, VoxFormer generates semantic voxels via an MAE-like architecture [3]. (b) A comparison against the state-of-the-art MonoScene [4] at different ranges on SemanticKITTI [5]. VoxFormer performs much better in safety-critical short-range areas, while MonoScene performs similarly at all three distances. The relative gains are marked in red.
This task is further motivated by the fact that humans can naturally reason about scene geometry and semantics from partial observations. However, there is still a significant performance gap between state-of-the-art SSC methods [2] and human perception in driving scenes.
Most existing SSC solutions consider LiDAR the primary modality for accurate 3D geometric measurement [6-9]. However, LiDAR sensors are expensive and less portable, while cameras are cheaper and provide richer visual cues of the driving scenes. This motivated the study of camera-based SSC solutions, as first proposed in the pioneering work of MonoScene [4]. MonoScene lifts 2D image inputs to 3D using dense feature projection. However, such a projection inevitably assigns features of visible regions to empty or occluded voxels. For example, an empty voxel occluded by a car will still receive the car's visual features. As a result, the generated 3D features contain many ambiguities for subsequent geometric completion and semantic segmentation, resulting in unsatisfactory performance.
Our contributions. Unlike MonoScene, VoxFormer considers 3D-to-2D cross-attention to represent the sparse queries. The proposed design is motivated by two insights: (1) reconstruction-before-hallucination: the non-visible region's information can be better completed using the reconstructed visible areas as starting points; and (2) sparsity-in-3D-space: since a large volume of the 3D space is usually unoccupied, using a sparse representation instead of a dense one is certainly more efficient and scalable. Our contributions in this work can be summarized as follows:
  • A novel two-stage framework that lifts 2D images into a complete 3D voxelized semantic scene.
  • A novel query proposal network based on 2D convolutions that generates reliable queries from image depth.
  • A novel Transformer, similar to the masked autoencoder (MAE) [3], that yields a complete 3D scene representation.
  • VoxFormer sets a new state of the art in camera-based SSC on SemanticKITTI [5], as shown in Fig. 1 (b).
VoxFormer consists of a class-agnostic query proposal stage (stage-1) and a class-specific semantic segmentation stage (stage-2), where stage-1 proposes a sparse set of occupied voxels, and stage-2 completes the scene representation starting from the proposals given by stage-1. Specifically, stage-1 has a lightweight 2D-CNN-based query proposal network that uses the image depth to reconstruct the scene geometry. It then proposes a sparse set of voxels from predefined learnable voxel queries covering the entire field of view. Stage-2 is based on a novel sparse-to-dense MAE-like architecture, as shown in Fig. 1 (a). It first strengthens the featurization of the proposed voxels by allowing them to attend to the image observations. Next, the non-proposed voxels are associated with a learnable mask token, and the full set of voxels is processed by self-attention to complete the scene representation for per-voxel semantic segmentation.
Extensive tests on the large-scale SemanticKITTI [5] show that VoxFormer achieves state-of-the-art performance in geometric completion and semantic segmentation. More importantly, the improvements are significant in safety-critical short-range areas, as shown in Fig. 1 (b).
2. Related Work

3D reconstruction and completion. 3D reconstruction aims to infer the 3D geometry of objects or scenes from single or multiple 2D images. This challenging problem has received extensive attention in both the traditional computer vision era [10] and the recent deep learning era [11]. 3D reconstruction can be divided into (1) single-view reconstruction, which learns shape priors from massive data [12-15], and (2) multi-view reconstruction, which leverages different viewpoints [16, 17]. Both explicit [12, 14] and implicit representations [18-22] have been investigated for objects and scenes. Unlike 3D reconstruction, 3D completion requires the model to hallucinate unseen structure; it is related to single-view reconstruction, yet the input is in 3D instead of 2D. 3D object shape completion is an active research topic that estimates the complete geometry from a partial shape represented as points [23-25], voxels [26-28], distance fields [29], etc. In addition to object-level completion, scene-level 3D completion has also been investigated in both indoor [30] and outdoor scenes [31]: [30] proposes a sparse generative network to convert a partial RGB-D scan into a high-resolution 3D reconstruction with missing geometry, and [31] learns a neural network to convert each scan into a dense volume of truncated signed distance fields (TSDF).
Semantic segmentation. Human-level scene understanding for intelligent agents is typically advanced by semantic segmentation on images [32] or point clouds [33]. Researchers have significantly improved image segmentation performance with a variety of deep learning techniques, such as convolutional neural networks [34, 35], vision transformers [36, 37], prototype learning [38, 39], etc. To have intelligent agents interact with the 3D environment, thinking in 3D is essential because the physical world is not 2D but rather 3D. Thus, various 3D point cloud segmentation methods have been developed to address 3D semantic understanding [40-42]. However, real-world sensing in 3D is inherently sparse and incomplete. For holistic semantic understanding, it is insufficient to solely parse the sparse measurements while ignoring the unobserved scene structures.
3D semantic scene completion. Holistic 3D scene understanding is challenged by the limited sensing range, and researchers have proposed multi-agent collaborative perception to introduce more observations of the 3D scene [43-49]. Another line of research is 3D semantic scene completion, unifying scene completion and semantic segmentation, which were investigated separately at an early stage [50, 51]. SSCNet [1] first defined the semantic scene completion task, where geometry and semantics are jointly inferred given an incomplete visual observation. In recent years, SSC in indoor scenes with a relatively small scale has been intensively studied [52-59]. Meanwhile, SSC in large-scale outdoor scenes has also started to receive attention after the release of the SemanticKITTI dataset [5]. Semantic scene completion from a sparse observation is a highly desirable capability for autonomous vehicles since it can generate a dense 3D voxelized semantic representation of the scene. Such a representation can aid 3D semantic map construction of the static environment and help perceive dynamic objects. Unfortunately, SSC for large-scale driving scenes is only at the preliminary development and exploration stage. Existing works commonly depend on 3D input such as LiDAR point clouds [6-9, 60]. In contrast, the recent MonoScene [4] studied semantic scene completion from a monocular image. It proposes 2D-3D feature projections and uses successive 2D and 3D UNets to achieve camera-only 3D semantic scene completion. However, 2D-to-3D feature projection is prone to introducing false features at unoccupied 3D positions, and the heavy 3D convolutions lower the system's efficiency.
Camera-based 3D perception. Camera-based systems have received extensive attention in the autonomous driving community because the camera is low-cost, easy to deploy, and widely available. In addition, the camera can provide rich visual attributes of the scene to help vehicles achieve holistic scene understanding. Several works have recently been proposed for 3D object detection or map segmentation from RGB images. Inspired by DETR [62] in 2D detection, DETR3D [63] links learnable 3D object queries with 2D images via camera projection matrices and enables end-to-end 3D bounding box prediction without non-maximum suppression (NMS). M2BEV [64] also investigates the viability of simultaneously running multi-task perception based on BEV features. BEVFormer [65] proposes a spatiotemporal transformer that aggregates BEV features from current and previous features via deformable attention [66]. Compared to object detection, semantic scene completion can provide occupancy for each small cell instead of assigning a fixed-size bounding box to an object, which helps identify irregularly-shaped objects such as one with an overhanging obstacle. Compared to a 2D BEV representation, a 3D voxelized scene representation carries more information, which is particularly helpful when vehicles are driving over bumpy roads. Hence, dense volumetric semantics can provide a more comprehensive 3D scene representation, yet how to create it with only cameras has received scarce attention.

3. Methodology

3.1. Preliminary

Problem setup. We aim to predict a dense semantic scene within a certain volume in front of the vehicle, given only RGB images. More specifically, we use as input the current and previous images, denoted by I_t, I_{t-1}, ..., I_{t-k}, and as output a voxel grid Y_t defined in the coordinate frame of the ego-vehicle at timestamp t, where each voxel is either empty (denoted by c_0) or occupied by one of the semantic classes c_1, ..., c_M. Here M denotes the total number of classes of interest, and H, W, and Z denote the length, width, and height of the voxel grid, respectively. In summary, the overall objective is to learn a neural network Theta to generate a semantic voxel grid as close to the ground truth as possible. Note that previous SSC works commonly consider 3D input [2]. The most related work to ours [4] considers a single image as input, which is a special case of our setting.
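In symbols, the setup above can be summarized as follows (the notation here is a reconstruction for readability and may differ from the paper's original symbols):

```latex
\hat{\mathbf{Y}}_t = \Theta\!\left(I_t, I_{t-1}, \ldots, I_{t-k}\right),
\qquad
\hat{\mathbf{Y}}_t \in \{c_0, c_1, \ldots, c_M\}^{H \times W \times Z},
```

where c_0 is the empty class, c_1, ..., c_M are the M semantic classes of interest, and H x W x Z is the size of the voxel grid in the ego-vehicle frame at timestamp t.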

Design rationale. Motivated by reconstruction-before-hallucination and sparsity-in-3D-space, we build a two-stage framework: stage-1, based on a CNN, proposes a sparse set of voxel queries from image depth to attend to images, since image features correspond to visible and occupied voxels rather than non-visible and empty ones; stage-2, based on a Transformer, uses an MAE-like architecture to first strengthen the featurization of the proposed voxels by voxel-to-image cross-attention, and then processes the full set of voxels with self-attention to enable voxel interactions.

3.2. Overall Architecture

We learn 3D voxel features from 2D images for SSC based on a Transformer, as illustrated in Fig. 2: our architecture extracts 2D features from RGB images and then uses a sparse set of 3D voxel queries to index into these 2D features, linking 3D positions to the image stream using camera projection matrices. Specifically, voxel queries are 3D-grid-shaped learnable parameters designed to query features inside the 3D volume from images via attention mechanisms [67]. Our framework is a two-stage cascade composed of class-agnostic proposals and class-specific segmentation, similar to [68]: stage-1 generates class-agnostic query proposals, and stage-2 uses an MAE-like architecture to propagate information to all voxels. Ultimately, the voxel features are up-sampled for semantic segmentation. A more specific procedure is as follows:
  • Extract 2D features F_2D with feature dimension d from the RGB images using a ResNet-50 backbone [61].
  • Generate class-agnostic query proposals Q_p, a subset of the predefined voxel queries Q, where N_p and N_q denote the number of query proposals and the total number of voxel queries, respectively.
  • Refine the voxel features in two steps: (1) update the subset of voxels corresponding to the query proposals by letting Q_p attend to the image features F_2D via cross-attention, and (2) update all voxels by letting them attend to each other via self-attention.
  • Output the dense semantic map by up-sampling and linear projection of the refined voxel features (a shape-level sketch of the whole pipeline follows this list).
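The following PyTorch-style sketch summarizes how these pieces fit together at a shape level. It is not the released implementation: ordinary multi-head attention stands in for the deformable attention used in the paper, the occupancy mask is random, and all module names and sizes are illustrative assumptions.

```python
import torch

# Illustrative sizes (assumptions): a tiny query grid so the sketch runs quickly,
# feature dimension 128, 20 output classes.
Hq, Wq, Zq, C, num_classes = 16, 16, 4, 128, 20
N_total = Hq * Wq * Zq

# Predefined learnable parameters (Sec. 3.3): voxel queries Q and a single mask token.
voxel_queries = torch.nn.Parameter(torch.randn(N_total, C))
mask_token = torch.nn.Parameter(torch.randn(1, C))

# Stage-1 (placeholder): a binary occupancy mask over the low-resolution grid,
# which in VoxFormer comes from depth estimation plus a 2D-CNN occupancy network.
occupancy = torch.rand(N_total) > 0.7            # stand-in for the predicted occupancy
query_proposals = voxel_queries[occupancy]       # Q_p: (N_p, C), the sparse proposals

# Stage-2 (placeholders): cross-attention to image features, then self-attention
# over the full grid, approximated here with standard multi-head attention.
img_feats = torch.randn(1, 100, C)               # stand-in for 2D backbone features
cross_attn = torch.nn.MultiheadAttention(C, 4, batch_first=True)
self_attn = torch.nn.MultiheadAttention(C, 4, batch_first=True)

updated, _ = cross_attn(query_proposals.unsqueeze(0), img_feats, img_feats)

full = mask_token.expand(N_total, C).clone()     # non-proposed voxels get the mask token
full[occupancy] = updated.squeeze(0)             # proposed voxels get their refined features
refined, _ = self_attn(full.unsqueeze(0), full.unsqueeze(0), full.unsqueeze(0))

# Output stage: (up-sampling omitted) project to per-voxel class logits.
head = torch.nn.Linear(C, num_classes)
logits = head(refined).reshape(Hq, Wq, Zq, num_classes)
print(logits.shape)
```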
We will detail the voxel queries in Sec. 3.3, stage-1 in Sec. 3.4, stage-2 in Sec. 3.5, and training loss in Sec. 3.6.

3.3. Predefined Parameters

Voxel queries. We pre-define a total of N_q voxel queries as a cluster of 3D-grid-shaped learnable parameters Q, as shown in the bottom left corner of Fig. 2, with a spatial resolution that is lower than the output resolution to save computations.
Figure 2. Overall framework of VoxFormer. Given RGB images, 2D features are extracted by ResNet50 [61] and the depth is estimated by an off-the-shelf depth predictor. The estimated depth after correction enables the class-agnostic query proposal stage: the query located at an occupied position will be selected to carry out deformable cross-attention with image features. Afterwards, mask tokens will be added for completing voxel features by deformable self-attention. The refined voxel features will be upsampled and projected to the output space for per-voxel semantic segmentation. Note that our framework supports the input of single or multiple images.
Note that d denotes the feature dimension of the queries, which is equal to that of the image features. More specifically, a single voxel query located at position (x, y, z) of Q is responsible for the corresponding 3D voxel inside the volume, and each voxel corresponds to a fixed real-world size in meters. Meanwhile, the voxel queries are defined in the ego vehicle's coordinate frame, and learnable positional embeddings are added to the voxel queries for the attention stages, following existing works on 2D BEV feature learning [65].
Mask token. While some voxel queries are selected to attend to images, the remaining voxels are associated with another learnable parameter in order to complete the 3D voxel features. We name this learnable parameter the mask token [3] for conciseness, since a voxel query left unselected from Q is analogous to a masked patch in MAE. Specifically, each mask token is a learnable vector that indicates the presence of a missing voxel to be predicted. Positional embeddings are also added to help the mask tokens be aware of their 3D locations.

3.4. Stage-1: Class-Agnostic Query Proposal

Our stage-1 determines which voxels should be queried based on depth: the occupied voxels deserve careful attention, while the empty ones can be left out. Given a 2D RGB observation, we first obtain a 3D representation of the scene based on depth estimation. Afterwards, we acquire the query positions via an occupancy prediction that helps correct the inaccurate image depth.
Depth estimation. We leverage off-the-shelf depth estimation models, such as monocular depth [69] or stereo depth [70], to directly predict the depth Z(u, v) of each image pixel (u, v). Afterwards, the depth map is back-projected into a 3D point cloud: a pixel (u, v) with depth z = Z(u, v) is transformed to a 3D point (x, y, z) by

x = (u - c_u) * z / f_u,   y = (v - c_v) * z / f_v,

where (c_u, c_v) is the camera center and f_u and f_v are the horizontal and vertical focal lengths. However, the resulting 3D point cloud has low quality, especially in the long-range area, because the depth at the horizon is extremely inconsistent: only a few pixels determine the depth of a large area.
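The following NumPy sketch illustrates this back-projection under a pinhole camera model; the function name, the dummy depth map, and the intrinsics are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def backproject_depth(depth, fu, fv, cu, cv):
    """Back-project a dense depth map (H x W, in meters) into an N x 3 point
    cloud in the camera frame, following the pinhole model described above."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # pixel coordinates
    z = depth
    x = (u - cu) * z / fu
    y = (v - cv) * z / fv
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]                  # keep pixels with valid depth

# Example with a dummy depth map and KITTI-like intrinsics (illustrative values only).
depth = np.full((370, 1220), 10.0)
cloud = backproject_depth(depth, fu=707.0, fv=707.0, cu=610.0, cv=185.0)
print(cloud.shape)
```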
Depth correction. To obtain satisfactory query proposals, we employ a model to predict an occupancy map at a lower spatial resolution to help correct the image depth. Specifically, the synthetic point cloud is first converted into a binary voxel grid map M_in, where each voxel is marked as 1 if occupied by at least one point. Then we predict the occupancy as M_out = Theta_occ(M_in), where M_out has a lower resolution than the input M_in, since a lower resolution is more robust to depth errors and is compatible with the resolution of the voxel queries. Theta_occ is a lightweight UNet-like model adapted from [6], mainly using 2D convolutions for binary classification of each voxel.
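A sketch of the voxelization step is given below, assuming the ego-frame volume and voxel size of SemanticKITTI; the final max-pooling is only a crude stand-in for the learned 2D-CNN occupancy network Theta_occ, and the grid origin and query resolution are assumptions.

```python
import numpy as np

def voxelize(points, origin, voxel_size, grid_shape):
    """Convert an N x 3 point cloud (ego frame, meters) into a binary occupancy
    grid of shape grid_shape: a voxel is 1 if it contains at least one point."""
    grid = np.zeros(grid_shape, dtype=np.uint8)
    idx = np.floor((points - origin) / voxel_size).astype(int)
    valid = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    grid[tuple(idx[valid].T)] = 1
    return grid

# SemanticKITTI-style volume: 51.2 m x 51.2 m x 6.4 m at 0.2 m -> 256 x 256 x 32.
origin = np.array([0.0, -25.6, -2.0])                # assumed grid origin
points = np.random.rand(100000, 3) * np.array([51.2, 51.2, 6.4]) + origin
m_in = voxelize(points, origin, voxel_size=0.2, grid_shape=(256, 256, 32))

# Stand-in for the learned occupancy predictor: down-sample to an assumed
# query resolution of 128 x 128 x 16 by 2x2x2 max pooling.
m_out = m_in.reshape(128, 2, 128, 2, 16, 2).max(axis=(1, 3, 5))
print(m_in.shape, m_out.shape, m_out.sum())
```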
Query proposal. Following depth correction, we can select voxel queries from Q based on the binary map M_out: a query located at (x, y, z) is kept as a proposal if M_out(x, y, z) = 1, which yields the set of query proposals Q_p that attend to images later on. Our depth-based query proposal can (1) save computation and memory by removing many empty spaces, and (2) ease attention learning by reducing ambiguities caused by erroneous 2D-to-3D correspondences.
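Concretely, the selection amounts to boolean indexing of the query grid with the predicted occupancy. A short sketch with assumed shapes (a 128 x 128 x 16 query grid with 128-dimensional queries, and a random occupancy stand-in):

```python
import torch

C = 128                                                               # assumed query dimension
voxel_queries = torch.nn.Parameter(torch.randn(128, 128, 16, C))      # Q
m_out = torch.rand(128, 128, 16) > 0.8                                # predicted occupancy (stand-in)

query_proposals = voxel_queries[m_out]      # Q_p: (N_p, C), only the occupied positions
proposal_coords = m_out.nonzero()           # (N_p, 3) voxel indices, kept for projection later
print(query_proposals.shape, proposal_coords.shape)
```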

3.5. Stage-2: Class-Specific Segmentation

Following stage-1, we then attend to the image features with the query proposals Q_p to learn rich visual features of the scene. For efficiency, we utilize deformable attention [66], which interacts with local regions of interest and only samples N_s points around each reference point to compute the attention results. Mathematically, each query q is updated by the following general equation:

DA(q, p, F) = sum_{i=1}^{N_s} A_i W F(p + dp_i),

where p denotes the reference point, F represents the input features, and i indexes the sampled points out of a total of N_s points. W denotes the learnable weights for value generation, A_i is the learnable attention weight, dp_i is the predicted offset to the reference point p, and F(p + dp_i) is the feature at location p + dp_i extracted by bilinear interpolation. Note that we only show the formulation of single-head attention for conciseness.
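A minimal single-head sketch of this sampling scheme in PyTorch is given below, using F.grid_sample for the bilinear interpolation. The offset/weight predictors, the clamping of sampling locations, and all shapes are illustrative assumptions, not the implementation of [66].

```python
import torch
import torch.nn.functional as F

def deformable_attn_single_head(q, ref_pts, feat, offset_net, weight_net, value_proj, n_points=8):
    """q: (N, C) queries; ref_pts: (N, 2) reference points normalized to [0, 1];
    feat: (1, C, H, W) input feature map. Returns updated queries of shape (N, C)."""
    n, c = q.shape
    offsets = offset_net(q).view(n, n_points, 2)            # predicted offsets dp_i (normalized)
    weights = weight_net(q).view(n, n_points).softmax(-1)   # attention weights A_i
    value = value_proj(feat)                                 # W applied to the input features

    # Sampling locations p + dp_i, converted to grid_sample's [-1, 1] range.
    loc = (ref_pts.view(n, 1, 2) + offsets).clamp(0, 1) * 2 - 1
    sampled = F.grid_sample(value, loc.view(1, n, n_points, 2),
                            mode="bilinear", align_corners=False)   # (1, C, N, n_points)
    sampled = sampled.squeeze(0).permute(1, 2, 0)                   # (N, n_points, C)
    return (weights.unsqueeze(-1) * sampled).sum(dim=1)             # weighted sum over samples

# Toy usage with random data.
C, N, H, W = 128, 50, 24, 77
q, ref, feat = torch.randn(N, C), torch.rand(N, 2), torch.randn(1, C, H, W)
offset_net = torch.nn.Linear(C, 8 * 2)
weight_net = torch.nn.Linear(C, 8)
value_proj = torch.nn.Conv2d(C, C, kernel_size=1)
out = deformable_attn_single_head(q, ref, feat, offset_net, weight_net, value_proj)
print(out.shape)   # torch.Size([50, 128])
```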
Deformable cross-attention. For each proposed query q_p, we obtain its corresponding real-world location (x, y, z) based on the voxel resolution and the real size of the interested 3D volume. Afterwards, we project the 3D point onto the image features based on the camera projection matrices. However, the projected 2D point can only fall on some of the images due to the limited field of view; we denote the set of hit images as V_hit. We then regard these 2D points as the reference points of the query q_p and sample features from the hit views around these reference points. Finally, we perform a weighted sum of the sampled features as the output of deformable cross-attention (DCA):

DCA(q_p, F) = (1 / |V_hit|) * sum_{i in V_hit} DA(q_p, P(p, i), F_i),

where i indexes the images, and for each query proposal q_p located at p = (x, y, z), we use the camera projection function P(p, i) to obtain the reference point on image i.
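A sketch of a projection P(p, i) that yields such reference points is shown below, using a 3 x 4 pinhole projection matrix. The function name, the visibility ("hit") test, and the toy matrix and points are assumptions for illustration only.

```python
import numpy as np

def project_to_image(pts_3d, P, img_h, img_w):
    """Project N x 3 points with a 3 x 4 projection matrix P. Returns (u, v)
    reference points normalized to [0, 1] and a mask of points that actually
    fall inside the image (the 'hit' test used to build V_hit)."""
    pts_h = np.concatenate([pts_3d, np.ones((len(pts_3d), 1))], axis=1)   # homogeneous coords
    uvw = pts_h @ P.T                                                     # (N, 3)
    z = uvw[:, 2]
    u, v = uvw[:, 0] / z, uvw[:, 1] / z
    hit = (z > 0) & (u >= 0) & (u < img_w) & (v >= 0) & (v < img_h)
    ref = np.stack([u / img_w, v / img_h], axis=1)                        # normalized to [0, 1]
    return ref, hit

# Toy usage: camera-frame points in front of the camera and an illustrative matrix.
centers = np.random.rand(10, 3) * np.array([10.0, 5.0, 50.0]) - np.array([5.0, 2.5, 0.0])
P = np.array([[700.0, 0.0, 600.0, 0.0],
              [0.0, 700.0, 180.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])
ref, hit = project_to_image(centers, P, img_h=370, img_w=1220)
print(ref[hit].shape)
```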
Deformable self-attention. After several layers of deformable cross-attention, the query proposals are updated. To get the complete voxel features, we combine the updated query proposals and the mask tokens to obtain the initial voxel features F_3D. Then we use deformable self-attention to get the refined voxel features:

DSA(F_3D) = DA(f, p, F_3D),

where f could be either a mask token or an updated query proposal located at position p.
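The combination step is a simple scatter of the updated proposals into a grid that is otherwise filled with the mask token. A sketch with assumed shapes, continuing the earlier example:

```python
import torch

C, grid = 128, (128, 128, 16)
mask_token = torch.nn.Parameter(torch.randn(C))
m_out = torch.rand(*grid) > 0.8                        # occupancy mask from stage-1 (stand-in)
updated_proposals = torch.randn(int(m_out.sum()), C)   # updated Q_p after cross-attention (stand-in)

# Initial voxel features F_3D: mask token everywhere, proposals at occupied positions.
f3d = mask_token.expand(*grid, C).clone()
f3d[m_out] = updated_proposals
print(f3d.shape)   # torch.Size([128, 128, 16, 128])
# f3d is then flattened and refined by deformable self-attention (Sec. 3.5).
```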
Output Stage. After obtaining the refined voxel features, they are up-sampled and projected to the output space to obtain the final output, whose M + 1 output channels comprise the M semantic classes and one empty class.
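As a sketch, this output head can be realized as trilinear up-sampling followed by a per-voxel linear projection; the grid sizes below are reduced so the example runs quickly (in the described setting the up-sampling would go from the query grid to the 256 x 256 x 32 output grid), and 20 classes are assumed as in SemanticKITTI.

```python
import torch
import torch.nn.functional as F

C, num_classes = 128, 20
f3d = torch.randn(1, C, 64, 64, 8)                 # refined voxel features (B, C, X, Y, Z)

up = F.interpolate(f3d, scale_factor=2, mode="trilinear", align_corners=False)
head = torch.nn.Linear(C, num_classes)
logits = head(up.permute(0, 2, 3, 4, 1))           # per-voxel class logits, channels last
print(logits.shape)                                # torch.Size([1, 128, 128, 16, 20])
```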

3.6. Training Loss

We train stage-2 with a weighted cross-entropy loss. The ground truth Y_t defined at time t represents a multi-class semantic voxel grid. Therefore, the loss can be computed by:

L_ce = - sum_{i=1}^{N_v} sum_{c} w_c * yhat_{i,c} * log( exp(y_{i,c}) / sum_{c'} exp(y_{i,c'}) ),

where i is the voxel index, N_v is the total number of voxels, c indexes the classes, y_{i,c} is the predicted logit for the i-th voxel belonging to class c, and yhat_{i,c} is the c-th element of the one-hot vector indicating whether voxel i belongs to class c. w_c is a weight for each class set according to the inverse of the class frequency, as in [6]. We also use the scene-class affinity loss proposed in [4]. For stage-1, we employ a binary cross-entropy loss for occupancy prediction at a lower spatial resolution.
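In PyTorch this corresponds to a class-weighted cross-entropy over voxels. A sketch under simplifying assumptions: the inverse-frequency weights are estimated from one batch here (in practice they would be computed over the training set), and handling of ignored/unknown voxels is omitted.

```python
import torch
import torch.nn.functional as F

num_classes = 20                                    # 19 semantic classes + 1 empty
logits = torch.randn(2, num_classes, 64, 64, 8)     # (B, M+1, X, Y, Z) predictions
target = torch.randint(0, num_classes, (2, 64, 64, 8))

# Class weights from inverse class frequency (illustrative per-batch estimate).
counts = torch.bincount(target.flatten(), minlength=num_classes).float()
weights = 1.0 / (counts / counts.sum() + 1e-6)

loss = F.cross_entropy(logits, target, weight=weights)
print(loss.item())
```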

4. Experiments

4.1. Experimental Setup

Dataset. We verify VoxFormer on SemanticKITTI [5], which provides dense semantic annotations for each LiDAR sweep of the KITTI Odometry Benchmark [71], composed of 22 outdoor driving scenarios. The SemanticKITTI SSC benchmark considers a volume of 51.2 m ahead of the car, 25.6 m to each side, and 6.4 m in height. Voxelizing this volume with a voxel size of 0.2 m leads to a 3D voxel grid with a dimension of 256 x 256 x 32. The voxel grid is labelled with 20 classes (19 semantic classes and 1 free). Regarding the target output, SemanticKITTI provides the ground-truth semantic voxel grids by voxelizing the aggregation of consecutive registered semantic point clouds. Regarding the sparse input to an SSC model, it can be either a single voxelized LiDAR sweep or an RGB image. In this work, we investigate image-based SSC similar to [4], yet our input can be multiple images including temporal information.
Implementation details. Regarding stage-1, we employ MobileStereoNet [70] for direct depth estimation. Such depth helps generate a pseudo-LiDAR point cloud at a much lower cost based solely on stereo images. The occupancy prediction network for depth correction is adapted from LMSCNet [6], which is built on lightweight 2D CNNs. We directly utilize the depth predictor of [70] and train the occupancy predictor from scratch, using as input a voxelized pseudo point cloud and as output a lower-resolution occupancy map matching the resolution of the voxel queries. Regarding stage-2, we crop the RGB images of cam 2 and employ ResNet-50 [61] to extract image features; the features of the 3rd stage are then taken by an FPN [72] to produce a feature map whose spatial size is a fraction of the input image size (cf. Table 6).
|  | VoxFormer-T (Ours) | VoxFormer-S (Ours) | MonoScene [4] | LMSCNet* [6] | SSCNet* [1] |
|---|---|---|---|---|---|
| Range (m) | 12.8 / 25.6 / 51.2 | 12.8 / 25.6 / 51.2 | 12.8 / 25.6 / 51.2 | 12.8 / 25.6 / 51.2 | 12.8 / 25.6 / 51.2 |
| IoU (%) | 65.38 / 57.69 / 44.15 | 65.35 / 57.54 / 44.02 | 38.42 / 38.55 / 36.80 | 65.52 / 54.89 / 38.36 | 59.51 / 53.20 / 40.93 |
| Precision (%) | 76.54 / 69.95 / 62.06 | 77.65 / 70.85 / 62.32 | 51.22 / 51.96 / 52.10 | 86.51 / 82.21 / 77.60 | 65.38 / 59.13 / 48.77 |
| Recall (%) | 81.77 / 76.70 / 60.47 | 80.49 / 75.39 / 59.99 | 60.60 / 59.91 / 55.50 | 72.98 / 62.29 / 43.13 | 86.89 / 84.15 / 71.80 |
| mIoU (%) | 21.55 / 18.42 / 13.35 | 17.66 / 16.48 / 12.35 | 12.25 / 12.22 / 11.30 | 15.69 / 14.13 / 9.94 | 16.32 / 14.55 / 10.27 |
| car | 44.90 / 37.46 / 26.54 | 39.78 / 35.24 / 25.79 | 24.34 / 24.64 / 23.29 | 42.99 / 35.41 / 23.62 | 37.48 / 31.09 / 22.32 |
| bicycle | 5.22 / 2.87 / 1.28 | 3.04 / 1.48 / 0.59 | 0.07 / 0.23 / 0.28 | 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 |
| motorcycle | 2.98 / 1.24 / 0.56 | 2.84 / 1.10 / - | 0.05 / 0.20 / - | 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 |
| truck | 9.80 / 10.38 / 7.26 | 7.50 / 7.47 / 5.63 | 15.44 / 13.84 / 9.29 | 0.76 / 3.49 / 1.69 | 10.23 / 8.49 / 4.69 |
| other-vehicle | 17.21 / 10.61 / 7.81 | 8.71 / 4.98 / 3.77 | 1.18 / 2.13 / - | 0.00 / 0.00 / 0.00 | 7.60 / 4.55 / 2.43 |
| person | 4.44 / 3.50 / 1.93 | 4.10 / 3.31 / 1.78 | 0.90 / 1.37 / 2.00 | 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 |
| bicyclist | 2.65 / 3.92 / 1.97 | - / - / - | - / - / - | - / - / - | - / - / - |
| motorcyclist | 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 | 0.00 / 0.00 / 0.00 |
| road | 75.45 / 66.15 / 53.57 | 72.40 / 65.74 / 54.76 | 57.37 / 57.11 / 55.89 | 73.85 / 67.56 / 54.90 | 72.27 / 65.78 / 51.28 |
| parking | 21.01 / 23.96 / 19.69 | 10.79 / 18.49 / 20.04 | 18.60 / 14.75 / 15.63 | 13.22 / - / - | 15.55 / 13.35 / 9.07 |
| sidewalk | 45.39 / 34.53 / 26.52 | 39.35 / 33.20 / 26.35 | 27.81 / - / - | 42.29 / 34.20 / 25.43 | 40.88 / 32.84 / 22.38 |
| other-ground | 0.00 / 0.76 / 0.42 | 0.00 / 1.54 / - | - / - / - | 0.00 / 0.00 / - | 0.00 / 0.01 / 0.02 |
| building | 25.13 / 29.45 / 19.54 | 17.91 / 24.09 / 17.65 | 16.67 / 15.97 / - | 22.46 / 27.83 / - | 18.19 / 24.59 / 15.20 |
| fence | 16.17 / 11.15 / 7.31 | 12.98 / 10.63 / - | - / - / - | - / - / - | - / - / - |
| vegetation | 43.55 / 38.07 / 26.10 | 40.50 / 34.68 / 24.39 | 19.52 / 19.68 / 17.98 | 39.04 / 33.32 / 20.19 | 36.34 / 33.17 / 22.24 |
| trunk | 21.39 / 12.75 / 6.10 | 15.81 / 10.64 / 5.08 | 2.02 / 2.57 / 2.44 | 6.32 / 3.01 / 1.06 | 13.35 / 8.53 / 4.33 |
| terrain | 42.82 / 39.61 / 33.06 | 32.25 / 35.08 / 29.96 | 31.72 / 31.59 / 29.84 | 41.59 / 41.51 / 32.30 | 37.61 / 38.46 / 31.21 |
| pole | 20.66 / 15.56 / 9.15 | 14.47 / 11.95 / 7.11 | 3.10 / 3.79 / 3.91 | 7.28 / 4.43 / 2.04 | 11.36 / 8.33 / 4.83 |
| traffic sign | 10.63 / 8.09 / 4.94 | 6.19 / 6.29 / 4.18 | 3.69 / 2.54 / 2.43 | 0.00 / 0.00 / 0.00 | 3.86 / 2.65 / 1.49 |

Table 1. Quantitative comparison against state-of-the-art camera-based SSC methods. We report the performance within three ranges ahead of the car (12.8 m, 25.6 m, and 51.2 m); the first two are introduced for assessing SSC performance in safety-critical nearby locations. Per-class entries are semantic IoU (%).
The feature dimension is set to 128. The numbers of deformable attention layers for cross-attention and self-attention are 3 and 2, respectively. We use 8 sampling points around each reference point for each cross-/self-attention head. A linear layer projects the 128-dimensional features to the 20 output classes. We train stage-1 and stage-2 separately for 24 epochs each. Note that we provide two versions of VoxFormer: one takes only the current image as input (VoxFormer-S), and the other takes the current and the previous 4 images as input (VoxFormer-T).
Evaluation metrics. We employ the intersection over union (IoU) to evaluate the scene completion quality, regardless of the allocated semantic labels. Such a group of geometry-only voxel grids is in fact a binary occupancy map, which is crucial for obstacle avoidance in self-driving. We use the mean IoU (mIoU) over the 19 semantic classes to assess the performance of semantic segmentation. Note that there is a strong interaction between IoU and mIoU; e.g., a high mIoU can be achieved by naively decreasing the IoU. Therefore, the desired model should achieve excellent performance in both geometric completion and semantic segmentation. Meanwhile, we further propose to assess different ranges ahead of the car for a thorough evaluation: we individually report the IoU and mIoU within 12.8 m, 25.6 m, and 51.2 m (the full range) ahead of the car. Note that understanding the short-range area is more crucial, since it leaves the autonomous vehicle less time to react. In contrast, the understanding of a provisionally long-range area can be improved as the SDV gets closer to it and collects more observations. We report the results within different ranges on the validation set; the results within the full range on the hidden test set are in the supplementary material.
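A sketch of how these two metrics can be computed from predicted and ground-truth voxel grids is given below; the function names are ours, and the handling of invalid/unknown voxels is omitted for brevity.

```python
import numpy as np

def completion_iou(pred, gt, empty_class=0):
    """Geometric IoU: occupied vs. empty, regardless of the semantic label."""
    p, g = pred != empty_class, gt != empty_class
    return (p & g).sum() / max((p | g).sum(), 1)

def semantic_miou(pred, gt, num_classes=20, empty_class=0):
    """Mean IoU over the semantic classes (the empty class is excluded)."""
    ious = []
    for c in range(num_classes):
        if c == empty_class:
            continue
        p, g = pred == c, gt == c
        union = (p | g).sum()
        if union > 0:
            ious.append((p & g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

pred = np.random.randint(0, 20, (256, 256, 32))
gt = np.random.randint(0, 20, (256, 256, 32))
print(completion_iou(pred, gt), semantic_miou(pred, gt))
```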
Baseline methods. We compare VoxFormer against the state-of-the-art SSC methods with public resources: (1) a camera-based SSC method MonoScene [4] based on 2D-to-3D feature projection, (2) LiDAR-based SSC methods including JS3CNet [8], LMSCNet [6], and SSCNet [1], and (3) RGB-inferred baselines LMSCNet* [6] and SSCNet* [1] which take as input a pseudo LiDAR point cloud generated by the stereo depth [70].

4.2. Performance

4.2.1 Comparison against camera-based methods

3D-to-2D query outperforms 2D-to-3D projection. VoxFormer-S outperforms MonoScene by a large margin in terms of geometric completion (IoU); see Table 1. Such a large improvement stems from stage-1 with explicit depth estimation and correction, which removes a large number of empty spaces during the query process. In contrast, MonoScene, based on 2D-to-3D projection, associates many empty voxels with false features: e.g., if a free voxel is occluded by a car, it will be assigned the car's features when reprojected to the image, causing ambiguities during training. Meanwhile, the semantic score (mIoU) is also improved without sacrificing IoU.
Figure 3. Qualitative results of our method and others. VoxFormer better captures the scene layout in large-scale self-driving scenarios. Meanwhile, VoxFormer shows satisfactory performances in completing small objects such as trunks and poles.
| Methods | Modality | IoU (%) 12.8 / 25.6 / 51.2 m | mIoU (%) 12.8 / 25.6 / 51.2 m |
|---|---|---|---|
| MonoScene [4] | Camera | 38.42 / 38.55 / 36.80 | 12.25 / 12.22 / 11.30 |
| VoxFormer-T (Ours) | Camera | 65.38 / 57.69 / 44.15 | 21.55 / 18.42 / 13.35 |
| SSCNet [1] | LiDAR | - | 20.02 / - / - |
| LMSCNet [6] | LiDAR | - | 22.37 / - / 17.19 |
| JS3CNet [8] | LiDAR | 63.47 / 63.40 / 53.09 | - |

Table 2. Quantitative comparison against state-of-the-art LiDAR-based SSC methods. VoxFormer even performs on par with some LiDAR-based methods at close range.
Temporal information boosts semantic understanding. Despite the negligible difference in IoU, VoxFormer-T further improves the SSC performance over VoxFormer-S by exploiting temporal information: the mIoU is improved inside all three volumes (12.8 m, 25.6 m, and 51.2 m), as shown in Table 1. For example, the IoU scores of the building, parking, and terrain categories are all improved inside the full volume, because VoxFormer-S is restricted by an individual viewpoint while involving more viewpoints mitigates this issue.
Our superiority over others in short-range areas. Our method shows a significant improvement over other camera-based methods in safety-critical short-range areas, as shown in Table 1. VoxFormer-T achieves mIoU scores of 21.55 and 18.42 within 12.8 meters and 25.6 meters, outperforming the state-of-the-art MonoScene by a large relative margin at both ranges. Compared to MonoScene, whose performance is comparable at different distances (11.30 to 12.25), VoxFormer with much better short-range performance is more desirable in self-driving. The reason is that the currently insufficient understanding of a provisional long-range area can be gradually improved as the SDV moves forward and collects more close observations.
| Methods | Depth | IoU (%) 12.8 / 25.6 / 51.2 m | mIoU (%) 12.8 / 25.6 / 51.2 m |
|---|---|---|---|
| MonoScene [4] | - | 38.42 / 38.55 / 36.80 | 12.25 / 12.22 / 11.30 |
| VoxFormer-S (Ours) | Stereo [70] | 65.35 / 57.54 / 44.02 | 17.66 / 16.48 / 12.35 |
| VoxFormer-S (Ours) | Mono [69] | 57.41 / - / - | 14.62 / 14.01 / 10.67 |

Table 3. Ablation study for image depth. With monocular depth, VoxFormer-S still performs better than MonoScene in both geometry and semantics.
Our superiority over others for small objects. VoxFormer shows a large advancement in completing small objects compared to the main baseline MonoScene, e.g., for bicycle, motorcycle, bicyclist, trunk, pole, and traffic sign, as shown in Table 1. The gap is even larger compared to LMSCNet* and SSCNet*, which directly consume the pseudo point cloud, e.g., for bicycle, motorcycle, and person. Such major improvements come from the full exploitation of the visual attributes of the 3D scene.
Our superiority in size and memory. VoxFormer has about 58M parameters in total (cf. Table 6), which is more lightweight than MonoScene. Besides, VoxFormer needs less than 16 GB of GPU memory during training.

4.2.2 Comparison against LiDAR-based methods

As shown in Table 2, as the distance to the ego-vehicle decreases, the performance gap between our method and the state-of-the-art LiDAR-based methods becomes smaller.
Query: Dense / Random / Occupancy

| Ratio (%) | 100 | 90 | 80 | 70 | 60 | 50 | 40 | 30 | 20 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Memory (G) | 18.5 | 18.2 | 17.6 | 17.3 | 16.8 | 16.3 | 15.8 | 15.3 | 14.9 | 14.6 |
| IoU (%) | 34.6 | 34.5 | 34.1 | 34.0 | 34.2 | 33.9 | 24.5 | 34.0 | 33.5 | 24.6 |
| mIoU (%) | 10.1 | 9.9 | 9.9 | 9.8 | 9.6 | 9.5 | 3.8 | 9.3 | 8.9 | 3.8 |

Table 4. Ablation study for the query proposal. Our depth-based query proposal performs best.
Table 5. Ablation study for temporal input, where one setting additionally uses future frames (offline). Memory denotes training memory.
For example, compared to LiDAR-based LMSCNet, our mIoU is 13.35 versus 17.19 when considering the full 51.2 m area ahead of the ego-vehicle, while the pair becomes 21.55 versus 22.37 when only considering the area within 12.8 m. This observation is promising and inspiring for the self-driving community, since VoxFormer only needs cheap cameras during inference. More interestingly, our mIoU within 12.8 m is even better than that of LiDAR-based SSCNet (21.55 versus 20.02), and our IoU within 12.8 m is better than that of LiDAR-based JS3CNet (65.38 versus 63.47). In contrast, there is always a large gap between MonoScene and the LiDAR-based methods at all ranges.

4.2.3 Ablation studies 4.2.3 消融研究

Depth estimation. We compare the performance of VoxFormer using monocular [69] and stereo depth [70], as shown in Table 3. In general, stereo depth is more accurate than monocular depth, since the former exploits epipolar geometry while the latter relies on pattern recognition [73]. Hence, VoxFormer with stereo depth performs best. Note that our framework can be integrated with any state-of-the-art depth model, so using a stronger existing depth predictor [74-76] could enhance our SSC performance. Meanwhile, VoxFormer can be further improved along with the advancement of depth estimation.
Query mechanism. The ablation study for the query mechanism is reported in Table 4. We find that: (1) dense query (use all voxel queries in stage-2) is inefficient in memory consumption and performs worse than our occupancy-based query in both geometry and semantics; (2) the performance of random query (randomly proposing a subset from all voxel queries based on a specific ratio) is not stable, and there is a large gap between the random and occupancy-based query in both geometry and semantics; (3) our method achieves an excellent tradeoff between the memory consumption and the performance.
| Spatial resolution | IoU (%) | mIoU (%) | Params (M) |
|---|---|---|---|
| - | 44.26 | 10.24 | - |
| - | 44.38 | 11.33 | 57.84 |
| - | - | - | 57.90 |
| - | 44.19 | 12.29 | 58.04 |

Table 6. Ablation study for 2D image feature layers. Spatial resolution is relative to the input image size.
| Methods | IoU (%) | mIoU (%) |
|---|---|---|
| Ours | - | - |
| Ours w/o depth estimation | 34.64 | 10.14 |
| Ours w/o depth correction | 36.95 | 11.36 |
| Ours w/o cross-attention | 32.74 | 9.94 |
| Ours w/o self-attention | 43.73 | 10.70 |

Table 7. Ablation study for the architecture.
Temporal input. The ablation study for temporal information is shown in Table 5. The offline setting with more future observations largely boosts semantic segmentation: compared to the online setting with only previous and current images (mIoU 13.24), the mIoU is further improved. Note that involving more temporal input leads to more memory consumption.
Image features. The ablation study for 2D feature layers is shown in Table 6. We see that using different layers yields comparable IoU but different mIoU. Using an intermediate feature map whose size is a moderate fraction of the input image size achieves an excellent balance between the performance and the model size.
Architecture. We conduct an architecture ablation, as shown in Table 7. For stage-1, depth estimation and correction are both important, since a group of reasonable voxel queries sets a good basis for complete scene representation learning. For stage-2, self-attention and cross-attention help improve the performance by enabling voxel-to-voxel and voxel-to-image interactions.

4.2.4 Limitation and future work

Our performance at long range still needs to be improved, because the depth is very unreliable at the corresponding locations. Decoupling the long-range and short-range SSC is a potential solution to enhance the SSC far away from the ego vehicle. We leave this as our future work.

5. Conclusion

In this paper, we present VoxFormer, a strong camera-based 3D semantic scene completion (SSC) framework composed of (1) class-agnostic query proposal based on depth estimation and (2) class-specific segmentation with a sparse-to-dense MAE-like design. VoxFormer outperforms the state-of-the-art camera-based method and even performs on par with LiDAR-based methods at close range. We hope VoxFormer can motivate further research in camera-based SSC and its applications in AV perception.

Appendix

In the appendix, we mainly provide quantitative and qualitative results of our method and the state-of-the-art camera-based SSC method MonoScene [4] on the hidden test set of SemanticKITTI [5]. Since we do not have access to the ground truth of the test set, we can only report the performances within the full range.

A. Quantitative Comparison

Scene completion. As shown in Table I, VoxFormer outperforms MonoScene by a large gap in terms of geometric completion. Even without using historical observations, VoxFormer-S improves over MonoScene on IoU by a clear relative margin. Note that in autonomous driving, geometric occupancy is critical for obstacle avoidance, since a false negative could result in severe accidents. Therefore, our method is more desirable than MonoScene in safety-critical camera-based autonomous driving applications.
Semantic scene completion. As shown in Table I, VoxFormer also demonstrates a better semantic scene understanding. VoxFormer-S and VoxFormer-T both achieve a better mIoU than MonoScene, with VoxFormer-T showing a further relative improvement over the cutting-edge MonoScene. Note that the values of IoU and mIoU are intertwined, and some methods can naively increase mIoU by sacrificing IoU. In contrast, our method shows superior performance in terms of both geometry and semantics.
Short-range performances. Although short-range evaluations are not available on the hidden test set, we expect to see a similar trend (we perform much better than MonoScene in safety-critical short-range areas). The reason is that the scores of mIoU and IoU on the test set are comparable to those on the validation set inside the full volume. For example, VoxFormer-S achieves an mIoU of 12.35 on the validation set and 12.20 on the test set.

B. Qualitative Comparison

More visualizations are shown in Fig. I. We can see that our method performs much better than MonoScene in the short-range areas. There are some missing objects for MonoScene at close range, as shown in the first and last rows of Fig. I. Meanwhile, the long-range performance of our method can be further improved; e.g., the trunks in the long-range areas are not completed in the fourth row of Fig. I.

References

[1] Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser. Semantic scene completion from a single depth image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1746-1754, 2017. 1, 2, 6, 7

[2] Luis Roldao, Raoul De Charette, and Anne Verroust-Blondet. 3d semantic scene completion: a survey. International Journal of Computer Vision, pages 1-28, 2022. 1, 3
[3] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000-16009, 2022. 1, 2, 4
[4] Anh-Quan Cao and Raoul de Charette. Monoscene: Monocular 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3991-4001, 2022. 1, 3, 5, 6, 7, 9
[5] Jens Behley, Martin Garbade, Andres Milioto, Jan Quenzel, Sven Behnke, Cyrill Stachniss, and Jurgen Gall. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9297-9307, 2019. 1, 2, 5, 9
[6] Luis Roldao, Raoul de Charette, and Anne Verroust-Blondet. Lmscnet: Lightweight multiscale 3d semantic completion. In 2020 International Conference on 3D Vision (3DV), pages 111-119. IEEE, 2020. 1, 3, 4, 5, 6, 7
[7] Ran Cheng, Christopher Agia, Yuan Ren, Xinhai Li, and Liu Bingbing. S3cnet: A sparse semantic scene completion network for lidar point clouds. In Conference on Robot Learning, pages 2148-2161. PMLR, 2021. 1, 3
[8] Xu Yan, Jiantao Gao, Jie Li, Ruimao Zhang, Zhen Li, Rui Huang, and Shuguang Cui. Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 3101-3109, 2021.
[9] Pengfei Li, Yongliang Shi, Tianyu Liu, Hao Zhao, Guyue Zhou, and Ya-Qin Zhang. Semi-supervised implicit scene completion from sparse lidar, 2021. 1,3
[10] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge university press, 2003. 2
[11] Xian-Feng Han, Hamid Laga, and Mohammed Bennamoun. Image-based 3d object reconstruction: State-of-the-art and trends in the deep learning era. IEEE transactions on pattern analysis and machine intelligence, 43(5):1578-1604, 2019. 2
[12] Christopher B Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3d-r2n2: A unified approach for single and multi-view 3d object reconstruction. In European conference on computer vision, pages 628-644. Springer, 2016. 2
[13] Shubham Tulsiani, Tinghui Zhou, Alexei A Efros, and Jitendra Malik. Multi-view supervision for single-view reconstruction via differentiable ray consistency. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2626-2634, 2017. 2
| Method | IoU (%) | mIoU (%) |
|---|---|---|
| MonoScene [4] | 34.16 | 11.08 |
| VoxFormer-S (Ours) | 42.95 | 12.20 |
| VoxFormer-T (Ours) | - | - |

Table I. Quantitative results of VoxFormer and the state-of-the-art MonoScene on the hidden test set of SemanticKITTI.
Figure I. Qualitative results of our method and others on the hidden test set. VoxFormer better captures the scene layout in large-scale self-driving scenarios. Meanwhile, VoxFormer shows satisfactory performances in completing small objects such as trunks and poles.
[14] Haoqiang Fan, Hao Su, and Leonidas J Guibas. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 605-613, 2017. 2
[15] Xinchen Yan, Jimei Yang, Ersin Yumer, Yijie Guo, and Honglak Lee. Perspective transformer nets: Learning single-view 3d object reconstruction without 3d supervision. In Advances in Neural Information Processing Systems, volume 29, 2016. 2
[16] Richard A Newcombe, Shahram Izadi, Otmar Hilliges, David Molyneaux, David Kim, Andrew J Davison, Pushmeet Kohli, Jamie Shotton, Steve Hodges, and Andrew Fitzgibbon. Kinectfusion: Real-time dense surface mapping and tracking. In 2011 10th IEEE international symposium on mixed and augmented reality, pages 127-136. IEEE, 2011. 2
[17] Benjamin Ummenhofer, Huizhong Zhou, Jonas Uhrig, Nikolaus Mayer, Eddy Ilg, Alexey Dosovitskiy, and Thomas Brox. Demon: Depth and motion network for learning monocular stereo. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5038-5047, 2017. 2
[18] Lars Mescheder, Michael Oechsle, Michael Niemeyer, Sebastian Nowozin, and Andreas Geiger. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4460-4470, 2019. 2
[19] Songyou Peng, Michael Niemeyer, Lars Mescheder, Marc Pollefeys, and Andreas Geiger. Convolutional occupancy networks. In European Conference on Computer Vision, pages 523-540. Springer, 2020. 2
[20] Qiangeng Xu, Weiyue Wang, Duygu Ceylan, Radomir Mech, and Ulrich Neumann. Disn: Deep implicit surface network for high-quality single-view 3d reconstruction. In Advances in Neural Information Processing Systems, volume 32, 2019. 2
[21] Stefan Popov, Pablo Bauszat, and Vittorio Ferrari. Corenet: Coherent 3d scene reconstruction from a single rgb image. In European Conference on Computer Vision, pages 366-383. Springer, 2020. 2
[22] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R Oswald, and Marc Pollefeys. Nice-slam: Neural implicit scalable encoding for slam. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12786-12796, 2022. 2
[23] Wentao Yuan, Tejas Khot, David Held, Christoph Mertz, and Martial Hebert. Pcn: Point completion network. In 2018 International Conference on 3D Vision (3DV), pages 728-737. IEEE, 2018. 2
[24] Jiayuan Gu, Wei-Chiu Ma, Sivabalan Manivasagam, Wenyuan Zeng, Zihao Wang, Yuwen Xiong, Hao Su, and Raquel Urtasun. Weakly-supervised 3d shape completion in the wild. In European Conference on Computer Vision, pages 283-299. Springer, 2020. 2
[25] Xingguang Yan, Liqiang Lin, Niloy J Mitra, Dani Lischinski, Daniel Cohen-Or, and Hui Huang. Shapeformer: Transformer-based shape completion via sparse representation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6239-6249, 2022. 2
[26] Julian Chibane, Thiemo Alldieck, and Gerard Pons-Moll. Implicit functions in feature space for 3d shape reconstruction and completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6970-6981, 2020. 2
[27] Xiaogang Wang, Marcelo H Ang, and Gim Hee Lee. Voxel-based network for shape completion by leveraging edge generation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 13189-13198, 2021. 2
[28] Linqi Zhou, Yilun Du, and Jiajun Wu. 3d shape generation and completion through point-voxel diffusion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5826-5835, 2021. 2
[29] Angela Dai, Charles Ruizhongtai Qi, and Matthias Nießner. Shape completion using 3d-encoder-predictor cnns and shape synthesis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5868-5877, 2017. 2
[30] Angela Dai, Christian Diller, and Matthias Nießner. Sg-nn: Sparse generative neural networks for self-supervised scene completion of rgb-d scans. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 849-858, 2020. 2
[31] Ignacio Vizzo, Benedikt Mersch, Rodrigo Marcuzzi, Louis Wiesmann, Jens Behley, and Cyrill Stachniss. Make it dense: Self-supervised geometric scan completion of sparse 3d lidar scans in large outdoor environments. IEEE Robotics and Automation Letters, 7(3):8534-8541, 2022. 2
[32] Shervin Minaee, Yuri Boykov, Fatih Porikli, Antonio Plaza, Nasser Kehtarnavaz, and Demetri Terzopoulos. Image segmentation using deep learning: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3523-3542, 2022. 2
[33] Yulan Guo, Hanyun Wang, Qingyong Hu, Hao Liu, Li Liu, and Mohammed Bennamoun. Deep learning for 3d point clouds: A survey. IEEE transactions on pattern analysis and machine intelligence, 43(12):4338-4364, 2020. 2
[34] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431-3440, 2015. 2
[35] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), December 2015. 2
[36] Enze Xie, Wenhai Wang, Zhiding Yu, Anima Anandkumar, Jose M Alvarez, and Ping Luo. Segformer: Simple and efficient design for semantic segmentation with transformers. In Advances in Neural Information Processing Systems, volume 34, pages 12077-12090, 2021. 2
[37] Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per-pixel classification is not all you need for semantic segmentation. In Advances in Neural Information Processing Systems, volume 34, pages 17864-17875, 2021. 2
[38] Kaixin Wang, Jun Hao Liew, Yingtian Zou, Daquan Zhou, and Jiashi Feng. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9197-9206, 2019. 2
[39] Tianfei Zhou, Wenguan Wang, Ender Konukoglu, and Luc Van Gool. Rethinking semantic segmentation: A prototype view. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2582-2593, 2022. 2
[40] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652-660, 2017. 2
[41] Xun Xu and Gim Hee Lee. Weakly supervised semantic point cloud segmentation: Towards 10x fewer labels. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13706-13715, 2020. 2
[42] Qingyong Hu, Bo Yang, Linhai Xie, Stefano Rosa, Yulan Guo, Zhihua Wang, Niki Trigoni, and Andrew Markham. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11108-11117, 2020. 2
[43] Tsun-Hsuan Wang, Sivabalan Manivasagam, Ming Liang, Bin Yang, Wenyuan Zeng, and Raquel Urtasun. V2vnet: Vehicle-to-vehicle communication for joint perception and prediction. In European conference on computer vision, pages 605-621. Springer, 2020. 2
[44] Yiming Li, Dekun Ma, Ziyan An, Zixun Wang, Yiqi Zhong, Siheng Chen, and Chen Feng. V2x-sim: Multi-agent collaborative perception dataset and benchmark for autonomous driving. IEEE Robotics and Automation Letters, 7(4):10914-10921, 2022. 2
[45] Runsheng Xu, Hao Xiang, Zhengzhong Tu, Xin Xia, Ming-Hsuan Yang, and Jiaqi Ma. V2x-vit: Vehicle-to-everything cooperative perception with vision transformer. In European conference on computer vision, pages 107-124. Springer, 2022. 2
[46] Yiming Li, Shunli Ren, Pengxiang Wu, Siheng Chen, Chen Feng, and Wenjun Zhang. Learning distilled collaboration graph for multi-agent perception. In Advances in Neural Information Processing Systems, volume 34, pages 29541-29552, 2021. 2
[47] Runsheng Xu, Zhengzhong Tu, Hao Xiang, Wei Shao, Bolei Zhou, and Jiaqi Ma. Cobevt: Cooperative bird's eye view semantic segmentation with sparse transformers. In 6th Annual Conference on Robot Learning, 2022. 2
[48] Yiming Li, Juexiao Zhang, Dekun Ma, Yue Wang, and Chen Feng. Multi-robot scene completion: Towards task-agnostic collaborative perception. In 6th Annual Conference on Robot Learning, 2022. 2
[49] Sanbao Su, Yiming Li, Sihong He, Songyang Han, Chen Feng, Caiwen Ding, and Fei Miao. Uncertainty quantification of collaborative detection for self-driving. In IEEE International Conference on Robotics and Automation, 2023. 2
[50] Saurabh Gupta, Pablo Arbelaez, and Jitendra Malik. Perceptual organization and recognition of indoor scenes from rgb images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 564-571, 2013. 2
[51] S. Thrun and B. Wegbreit. Shape from symmetry. In Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, volume 2, pages 1824-1831 Vol. 2, 2005. 2
[52] Jiahui Zhang, Hao Zhao, Anbang Yao, Yurong Chen, Li Zhang, and Hongen Liao. Efficient semantic scene completion network with spatial group convolution. In Proceedings of the European Conference on Computer Vision (ECCV), pages 733-749, 2018. 2
[53] Shice Liu, Yu Hu, Yiming Zeng, Qiankun Tang, Beibei Jin, Yinhe Han, and Xiaowei Li. See and think: Disentangling semantic scene completion. In Advances in Neural Information Processing Systems, volume 31, 2018. 2
[54] Jie Li, Yu Liu, Xia Yuan, Chunxia Zhao, Roland Siegwart, Ian Reid, and Cesar Cadena. Depth based semantic scene completion with position importance aware loss. IEEE Robotics and Automation Letters, 5(1):219-226, 2019. 2
[55] Jie Li, Yu Liu, Dong Gong, Qinfeng Shi, Xia Yuan, Chunxia Zhao, and Ian Reid. Rgbd based dimensional decomposition residual network for 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7693-7702, 2019. 2
[56] Pingping Zhang, Wei Liu, Yinjie Lei, Huchuan Lu, and Xiaoyun Yang. Cascaded context pyramid for full-resolution 3d semantic scene completion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7801-7810, 2019. 2
[57] Jie Li, Kai Han, Peng Wang, Yu Liu, and Xia Yuan. Anisotropic convolutional networks for 3d semantic scene completion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3351-3359, 2020. 2
[58] Xiaokang Chen, Kwan-Yee Lin, Chen Qian, Gang Zeng, and Hongsheng Li. 3d sketch-aware semantic scene completion via semi-supervised structure prior. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4193-4202, 2020. 2
[59] Yingjie Cai, Xuesong Chen, Chao Zhang, Kwan-Yee Lin, Xiaogang Wang, and Hongsheng Li. Semantic scene completion via integrating instances and scene in-the-loop. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 324-333, 2021. 2
[60] Christoph B Rist, David Emmerichs, Markus Enzweiler, and Dariu M Gavrila. Semantic scene completion using local deep implicit functions on lidar data. IEEE transactions on pattern analysis and machine intelligence, 44(10):7205-7218, 2021. 3
[61] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770-778, 2016. 3, 4, 5
[62] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In European conference on computer vision, pages 213-229. Springer, 2020. 3
[63] Yue Wang, Vitor Campagnolo Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In Conference on Robot Learning, pages 180-191. PMLR, 2022. 3
[64] Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, and Jose M Alvarez. M²bev: Multi-camera joint 3d detection and segmentation with unified birds-eye view representation. arXiv preprint arXiv:2204.05088, 2022. 3
[65] Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. In European conference on computer vision, 2022. 3, 4
[66] Xizhou Zhu, Weijie Su, Lewei Lu, Bin Li, Xiaogang Wang, and Jifeng Dai. Deformable detr: Deformable transformers for end-to-end object detection. In International Conference on Learning Representations, 2020. 3, 5
[67] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, volume 30, 2017. 3
[68] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, volume 28, 2015. 3
[69] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. Adabins: Depth estimation using adaptive bins. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4009-4018, 2021. 4, 7, 8
[70] Faranak Shamsafar, Samuel Woerz, Rafia Rahim, and Andreas Zell. Mobilestereonet: Towards lightweight deep networks for stereo matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2417-2426, 2022. 4, 5, 6, 7, 8
[71] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In 2012 IEEE conference on computer vision and pattern recognition, pages 3354-3361. IEEE, 2012. 5
[72] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2117-2125, 2017. 5
[73] Hamid Laga, Laurent Valentin Jospin, Farid Boussaid, and Mohammed Bennamoun. A survey on deep learning techniques for stereo-based depth estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020. 8
[74] Zhenda Xie, Zigang Geng, Jingcheng Hu, Zheng Zhang, Han Hu, and Yue Cao. Revealing the dark secrets of masked image modeling. arXiv preprint arXiv:2205.13543, 2022. 8
[75] Vladimir Tankovich, Christian Hane, Yinda Zhang, Adarsh Kowdle, Sean Fanello, and Sofien Bouaziz. Hitnet: Hierarchical iterative tile refinement network for real-time stereo matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14362-14372, 2021. 8
[76] Weihao Yuan, Xiaodong Gu, Zuozhuo Dai, Siyu Zhu, and Ping Tan. Newcrfs: Neural window fully-connected crfs for monocular depth estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022. 8

• Corresponding author: Zhiding Yu (zhidingy@nvidia.com)