
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

Fig. 1: Considering the universal approximating ability of Gaussian mixtures [8, 11], we propose an object-centric 3D semantic Gaussian representation to describe the fine-grained structure of 3D scenes without the use of dense grids. We propose a GaussianFormer model consisting of sparse convolution and cross-attention to efficiently transform 2D images into 3D Gaussian representations. To generate dense 3D occupancy, we design a Gaussian-to-voxel splatting module that can be efficiently implemented with CUDA. With comparable performance, our GaussianFormer reduces the memory consumption of existing 3D occupancy prediction methods by 75.2%-82.2%.

Abstract

3D semantic occupancy prediction aims to obtain the 3D fine-grained geometry and semantics of the surrounding scene and is an important task for the robustness of vision-centric autonomous driving. Most existing methods employ dense grids such as voxels as scene representations, which ignore the sparsity of occupancy and the diversity of object scales and thus lead to an unbalanced allocation of resources. To address this, we propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians, where each Gaussian represents a flexible region of interest and its semantic features. We aggregate information from images through the attention mechanism and iteratively refine the properties of the 3D Gaussians, including position, covariance, and semantics. We then propose an efficient Gaussian-to-voxel splatting method to generate 3D occupancy predictions, which only aggregates the neighboring Gaussians for a certain position. We conduct extensive experiments on the widely adopted nuScenes and KITTI-360 datasets. Experimental results demonstrate that GaussianFormer achieves comparable performance with state-of-the-art methods with only 17.8%-24.8% of their memory consumption. Code is available at: https://github.com/huang-yh/GaussianFormer.

Keywords: 3D occupancy prediction • 3D Gaussian splatting • Autonomous Driving

1 Introduction

Whether to use LiDAR for 3D perception has long been the core debate among autonomous driving companies. While vision-centric systems share an economical advantage, their inability to capture obstacles of arbitrary shapes hinders driving safety and robustness [13, 17, 25, 26]. The emergence of 3D semantic occupancy prediction methods [4, 16, 18, 35, 50, 57] remedies this issue by predicting the occupancy status of each voxel in the surrounding 3D space, which facilitates a variety of newly rising tasks such as end-to-end autonomous driving [46], 4D occupancy forecasting [58], and self-supervised 3D scene understanding [15].
Despite the promising applications, the dense output space of 3D occupancy prediction poses a great challenge in how to efficiently and effectively represent the 3D scene. Voxel-based methods [23, 50] assign a feature vector to each voxel to obtain dense representations that describe the fine-grained structure of a 3D scene. They employ coarse-to-fine upsampling [46, 50] or voxel filtering [23, 33] to improve efficiency considering the sparse nature of the 3D space. As most of the voxel space is unoccupied [3], BEV-based methods [27, 56] compress the height dimension and employ the bird's eye view (BEV) as the scene representation, yet they usually require post-processing such as multi-scale fusion [56] to capture finer details. TPVFormer [16] generalizes BEV with two additional planes and achieves a better performance-complexity trade-off with the tri-perspective view (TPV). However, they are all grid-based methods and inevitably suffer from the redundancy of empty grids, resulting in more complexity for downstream tasks [46]. It is also more difficult to capture scene dynamics with grid-based representations since it is objects instead of grids that move in the 3D space [58].
In this paper, we propose the first object-centric representation for 3D semantic occupancy prediction. We employ a set of 3D semantic Gaussians to sparsely describe a 3D scene. Each Gaussian represents a flexible region of interest and consists of the mean, covariance, and its semantic category. We propose a GaussianFormer model to effectively obtain 3D semantic Gaussians from image inputs. We randomly initialize a set of queries to instantiate the 3D Gaussians and adopt the cross-attention mechanism to aggregate information from multiscale image features. We iteratively refine the properties of the 3D Gaussians for smoother optimizations. To efficiently incorporate interactions among 3D Gaussians, we treat them as point clouds located at the Gaussian means and leverage 3D sparse convolutions to process them. We then decode the properties of 3D semantic Gaussians from the updated queries as the scene representation.
Motivated by the 3D Gaussian splatting method in image rendering [20], we design an efficient Gaussian-to-voxel splatting module that aggregates neighboring Gaussians to generate the semantic occupancy for a certain 3D position. The proposed 3D Gaussian representation uses a sparse and adaptive set of features to describe a 3D scene but can still model the fine-grained structure due to the universal approximating ability of Gaussian mixtures [8, 11]. Based on the 3D Gaussian representation, GaussianFormer further employs sparse convolution and local-aggregation-based Gaussian-to-voxel splatting to achieve efficient 3D semantic occupancy prediction, as shown in Fig. 1. We conduct extensive experiments on the nuScenes and KITTI-360 datasets for 3D semantic occupancy prediction from surrounding and monocular cameras, respectively. GaussianFormer achieves comparable performance with existing state-of-the-art methods with only 17.8%-24.8% of their memory consumption. Our qualitative visualizations show that GaussianFormer is able to generate both a holistic and realistic perception of the scene.

2 Related Work

2.1 3D Semantic Occupancy Prediction

3D semantic occupancy prediction has garnered increasing attention in recent years due to its comprehensive description of the driving scenes, which involves predicting the occupancy and semantic states of all voxels within a certain range. Learning an effective representation constitutes the fundamental step for this challenging task. A straightforward approach discretizes the 3D space into regular voxels, with each voxel being assigned a feature vector [59]. The capacity to represent intricate 3D structures renders voxel-based representations advantageous for 3D semantic occupancy prediction [4, 6, 18, 21, 35, 43, 50, 51, 57]. However, due to the sparsity of driving scenes and the high resolution of the vanilla voxel representation, these approaches suffer from considerable computation and storage overhead. To improve efficiency, several methods propose to reduce the number of voxel queries through a coarse-to-fine upsampling strategy [49] or depth-guided voxel filtering [23]. Nonetheless, the upsampling process might not adequately compensate for the information loss in the coarse stage, and the filtering strategy ignores occluded areas and depends on the quality of depth estimation. Although OctreeOcc [33] introduces voxel queries of multiple granularities to enhance efficiency, it still conforms to a predefined regular partition pattern. Our 3D Gaussian representation resembles the voxel counterpart, but can flexibly adapt to varying object scales and region complexities in a deformable way, thus achieving better resource allocation and efficiency.
Another line of work utilizes the bird's-eye-view (BEV) representation for 3D perception in autonomous driving, which can be categorized into two types according to the view transformation paradigm. Approaches based on lift-splat-shoot (LSS) actively project image features into 3D space with depth guidance [14, 24, 25, 31, 39, 40], while query-based methods typically use BEV queries and deformable attention to aggregate information from image features [19, 26, 53]. Although BEV-based perception has achieved great success in 3D object detection, it is less employed for 3D occupancy prediction due to the information loss from height compression. FB-OCC [27] uses dense BEV features from backward projection to optimize the sparse voxel features from forward projection. FlashOcc [56] applies a complex multi-scale feature fusion module on BEV features for finer details. However, existing 3D occupancy prediction methods are based on grid representations, which inevitably suffer from the computation redundancy of empty grids. Differently, our GaussianFormer is based on an object-centric representation and can efficiently attend to flexible regions of interest.

2.2 3D Gaussian Splatting

The recent 3D Gaussian splatting (3D-GS) [20] uses multiple 3D Gaussians for radiance field rendering, demonstrating superior performance in rendering quality and speed. In contrast to prior explicit scene representations, such as meshes [41, 42, 52] and voxels [10, 37], 3D-GS is capable of modeling intricate shapes with fewer parameters. Compared with implicit neural radiance fields [1, 36], 3D-GS facilitates fast rendering through splat-based rasterization, which projects 3D Gaussians to the target 2D view and renders image patches with local 2D Gaussians. Recent advances in 3D-GS include adaptation to dynamic scenarios [34, 54], online 3D-GS generalizable to novel scenes [5, 61], and generative 3D-GS [7, 28, 45, 55].
Although our 3D Gaussian representation also adopts the physical form of 3D Gaussians (i.e. mean and covariance) and the multivariate Gaussian distribution, it differs from 3D-GS in significant ways, which imposes unique challenges in 3D semantic occupancy prediction: 1) Our 3D Gaussian representation is learned in an online manner as opposed to offline optimization in 3D-GS. 2) We generate 3D semantic occupancy predictions from the 3D Gaussian representation in contrast to rendering 2D RGB images in 3D-GS.

3 Proposed Approach

In this section, we present our method of 3D Gaussian splatting for 3D semantic occupancy prediction. We first introduce an object-centric 3D scene representation which adaptively describes the regions of interest with 3D semantic Gaussians (Sec. 3.1). We then explain how to effectively transform information from the image inputs to 3D Gaussians and elaborate on the model designs including self-encoding, image cross-attention, and property refinement (Sec. 3.2). Finally, we detail the Gaussian-to-voxel splatting module, which generates dense 3D occupancy predictions based on local aggregation and can be efficiently implemented with CUDA (Sec. 3.3).

3.1 Object-centric 3D Scene Representation

Vision-based 3D semantic occupancy prediction aims to predict dense occupancy states and semantics for each voxel grid with multi-view camera images as input.

Fig. 2: Comparisons of the proposed 3D Gaussian representation with existing grid-based scene representations (figures from TPVFormer [16]). The voxel representation [23, 50] assigns each voxel in the 3D space a feature and is redundant due to the sparse nature of the 3D space. BEV [26] and TPV [16] employ 2D planes to describe the 3D space but can only alleviate the redundancy issue. Differently, the proposed object-centric 3D Gaussian representation can adapt to flexible regions of interest yet can still describe the fine-grained structure of the 3D scene due to the strong approximating ability of mixing Gaussians [8, 11].
Formally, given a set of multi-view images $\mathcal{I}=\{\mathbf{I}_{i} \in \mathbb{R}^{3 \times H \times W} \mid i=1, \ldots, N\}$, and the corresponding intrinsics $\mathcal{K}=\{\mathbf{K}_{i} \in \mathbb{R}^{3 \times 3} \mid i=1, \ldots, N\}$ and extrinsics $\mathcal{T}=\{\mathbf{T}_{i} \in \mathbb{R}^{4 \times 4} \mid i=1, \ldots, N\}$, the objective is to predict the 3D semantic occupancy $\mathbf{O} \in \mathcal{C}^{X \times Y \times Z}$, where $N$, $\{H, W\}$, $\mathcal{C}$ and $\{X, Y, Z\}$ denote the number of views, the image resolution, the set of semantic classes, and the target volume resolution, respectively.
The autonomous driving scenes contain foreground objects of various scales (such as buses and pedestrians) and background regions of different complexities (such as road and vegetation). Dense voxel representations [4, 23, 48] neglect this diversity and process every 3D location with equal storage and computation resources, which often leads to intractable overhead because of unreasonable resource allocation. Planar representations, such as BEV [14, 26] and TPV [16], achieve 3D perception by first encoding 3D information into 2D feature maps for efficiency and then recovering 3D structures from the 2D features. Although planar representations are resource-friendly, they could cause a loss of details. Grid-based methods can hardly adapt to the regions of interest of different scenes and thus lead to representation and computation redundancy.
To address this, we propose an object-centric 3D representation for 3D semantic occupancy prediction where each unit describes a region of interest instead of fixed grids, as shown in Fig. 2. We represent an autonomous driving scene with a number of 3D semantic Gaussians, and each of them instantiates a semantic Gaussian distribution characterized by mean, covariance, and semantic logits. The occupancy prediction for a 3D location can be computed by summing up the values of the semantic Gaussian distributions evaluated at that location. Specifically, we use a set of $P$ 3D Gaussians $\mathcal{G}=\{\mathbf{G}_{i} \in \mathbb{R}^{d} \mid i=1, \ldots, P\}$ for each scene, and each 3D Gaussian is represented by a $d$-dimensional vector in the form of $(\mathbf{m} \in \mathbb{R}^{3}, \mathbf{s} \in \mathbb{R}^{3}, \mathbf{r} \in \mathbb{R}^{4}, \mathbf{c} \in \mathbb{R}^{|\mathcal{C}|})$, where $d = 10 + |\mathcal{C}|$, and $\mathbf{m}, \mathbf{s}, \mathbf{r}, \mathbf{c}$ denote the mean, scale, rotation vectors and semantic logits, respectively. Therefore, the value of a semantic Gaussian distribution $\mathbf{g}$ evaluated at a point $\mathbf{p}=(x, y, z)$ is
$$\mathbf{g}(\mathbf{p} ; \mathbf{m}, \mathbf{s}, \mathbf{r}, \mathbf{c})=\exp \left(-\frac{1}{2}(\mathbf{p}-\mathbf{m})^{T} \boldsymbol{\Sigma}^{-1}(\mathbf{p}-\mathbf{m})\right) \mathbf{c}, \tag{1}$$
$$\boldsymbol{\Sigma}=\mathbf{R} \mathbf{S} \mathbf{S}^{T} \mathbf{R}^{T}, \quad \mathbf{S}=\operatorname{diag}(\mathbf{s}), \quad \mathbf{R}=\operatorname{q2r}(\mathbf{r}), \tag{2}$$
where $\boldsymbol{\Sigma}$, $\operatorname{diag}(\cdot)$ and $\operatorname{q2r}(\cdot)$ represent the covariance matrix, the function that constructs a diagonal matrix from a vector, and the function that transforms a quaternion into a rotation matrix, respectively. Then the occupancy prediction result at point $\mathbf{p}$ can be formulated as the summation of the contributions of the individual Gaussians at the location $\mathbf{p}$:
$$\hat{o}(\mathbf{p} ; \mathcal{G})=\sum_{i=1}^{P} \mathbf{g}_{i}\left(\mathbf{p} ; \mathbf{m}_{i}, \mathbf{s}_{i}, \mathbf{r}_{i}, \mathbf{c}_{i}\right)=\sum_{i=1}^{P} \exp \left(-\frac{1}{2}\left(\mathbf{p}-\mathbf{m}_{i}\right)^{T} \boldsymbol{\Sigma}_{i}^{-1}\left(\mathbf{p}-\mathbf{m}_{i}\right)\right) \mathbf{c}_{i}. \tag{3}$$
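To make Eqs. (1)-(3) concrete, the following is a minimal PyTorch sketch (not the authors' CUDA implementation) that builds the covariance from the scale and rotation quaternion and evaluates the semantic Gaussian mixture at a set of query points; all tensor names, shapes, and the quaternion convention are illustrative assumptions.

```python
# Minimal sketch of Eqs. (1)-(3): evaluate the semantic Gaussian mixture at query points.
import torch

def quaternion_to_rotation(r):
    # r: (P, 4) normalized quaternions, assumed (w, x, y, z) convention.
    w, x, y, z = r.unbind(-1)
    return torch.stack([
        1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y),
        2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x),
        2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y),
    ], dim=-1).reshape(-1, 3, 3)                              # (P, 3, 3)

def evaluate_occupancy(p, m, s, r, c):
    """p: (Q, 3) query points; m: (P, 3) means; s: (P, 3) scales;
    r: (P, 4) quaternions; c: (P, C) semantic logits. Returns (Q, C) logits."""
    R = quaternion_to_rotation(torch.nn.functional.normalize(r, dim=-1))
    S = torch.diag_embed(s)                                   # (P, 3, 3)
    cov = R @ S @ S.transpose(-1, -2) @ R.transpose(-1, -2)   # Eq. (2)
    diff = p[:, None, :] - m[None, :, :]                      # (Q, P, 3)
    maha = torch.einsum('qpi,pij,qpj->qp', diff, torch.inverse(cov), diff)
    weight = torch.exp(-0.5 * maha)                           # Gaussian weight of Eq. (1)
    return weight @ c                                         # Eq. (3): sum over Gaussians
```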
Compared with the voxel representation, the mean and covariance properties allow the 3D Gaussian representation to adaptively allocate computation and storage resources according to object scales and region complexities. Therefore, we need fewer 3D Gaussians to model a scene for better efficiency while still maintaining expressiveness. Meanwhile, the 3D Gaussian representation takes 3D Gaussians as its basic unit and thus avoids the potential loss of details caused by dimension reduction in planar representations. Moreover, every 3D Gaussian has an explicit semantic meaning, making the transformation from the scene representation to occupancy predictions much easier than in other representations, which often involve decoding per-voxel semantics from high-dimensional features.

3.2 GaussianFormer: Image to Gaussians

Based on the 3D semantic Gaussian representation of the scene, we further propose a GaussianFormer model to learn meaningful 3D Gaussians from multi-view images. The overall pipeline is shown in Fig. 3. We first initialize the properties of the 3D Gaussians and their corresponding high-dimensional queries as learnable vectors. Then we iteratively refine the Gaussian properties within the $B$ blocks of GaussianFormer. Each block consists of a self-encoding module to enable interactions among 3D Gaussians, an image cross-attention module to aggregate visual information, and a refinement module to rectify the properties of the 3D Gaussians.
Gaussian Properties and Queries. We introduce two groups of features in GaussianFormer. The Gaussian properties $\mathcal{G}=\{\mathbf{G}_{i} \in \mathbb{R}^{d} \mid i=1, \ldots, P\}$ are the physical attributes as discussed in Section 3.1, and they are de facto the learning target of the model. On the other hand, the Gaussian queries $\mathcal{Q}=\{\mathbf{Q}_{i} \in \mathbb{R}^{m} \mid i=1, \ldots, P\}$ are the high-dimensional feature vectors that implicitly encode 3D information in the self-encoding and image cross-attention modules, and provide guidance for rectification in the refinement module. We initialize the Gaussian properties as learnable vectors, denoted by Initial Properties in Fig. 3.
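As a rough sketch of this initialization (the number of Gaussians, number of classes, and query dimension below are hypothetical values, not the released configuration), the properties and queries can be held as learnable parameters:

```python
# Sketch: Gaussian properties and queries as learnable vectors (sizes are assumptions).
import torch
import torch.nn as nn

class GaussianQueries(nn.Module):
    def __init__(self, num_gaussians=25600, num_classes=17, query_dim=128):
        super().__init__()
        prop_dim = 10 + num_classes                # mean (3) + scale (3) + rotation (4) + logits (|C|)
        self.properties = nn.Parameter(torch.randn(num_gaussians, prop_dim))   # G
        self.queries = nn.Parameter(torch.randn(num_gaussians, query_dim))     # Q
```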
Fig. 3: Framework of our GaussianFormer for 3D semantic occupancy prediction. We first extract multi-scale (M.S.) features from image inputs using an image backbone. We then randomly initialize a set of queries and properties (mean, covariance, and semantics) to represent 3D Gaussians and update them with interleaved self-encoding, image cross-attention, and property refinement. Having obtained the updated 3D Gaussians, we employ an efficient Gaussian-to-voxel splatting module to generate dense 3D occupancy via local aggregation of Gaussians.

Self-encoding Module. Methods with voxel or planar representations usually implement self-encoding modules with deformable attention for efficiency considerations, which is not well supported for the unstructured 3D Gaussian representation. Instead, we leverage 3D sparse convolution to allow interactions among 3D Gaussians, sharing the same linear computational complexity as deformable attention. Specifically, we treat each Gaussian as a point located at its mean $\mathbf{m}$, voxelize the generated point cloud (denoted by Voxelization in Fig. 3), and apply sparse convolution on the voxel grid. Since the number of 3D Gaussians $P$ is much smaller than $X \times Y \times Z$, sparse convolution can effectively take advantage of the sparsity of Gaussians.
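A minimal sketch of the voxelization step is given below, assuming the Gaussian queries serve as point features; the subsequent 3D sparse convolution would be applied by a library such as spconv or MinkowskiEngine, and all names here are illustrative rather than the authors' implementation.

```python
# Sketch: quantize Gaussian means to voxel coordinates for a sparse 3D convolution.
import torch

def voxelize_gaussians(means, queries, pc_range_min, voxel_size):
    """means: (P, 3) Gaussian means in world coordinates; queries: (P, m);
    pc_range_min: (3,) minimum corner of the volume; voxel_size: scalar resolution."""
    origin = torch.as_tensor(pc_range_min, dtype=means.dtype, device=means.device)
    coords = ((means - origin) / voxel_size).floor().long()        # (P, 3) voxel indices
    # If several Gaussians fall into the same voxel, average their queries (an assumption).
    unique_coords, inverse = torch.unique(coords, dim=0, return_inverse=True)
    feats = torch.zeros(unique_coords.shape[0], queries.shape[1],
                        dtype=queries.dtype, device=queries.device)
    feats.index_add_(0, inverse, queries)
    counts = torch.bincount(inverse, minlength=unique_coords.shape[0]).clamp(min=1)
    feats = feats / counts.unsqueeze(-1).to(feats.dtype)
    return unique_coords, feats        # consumed by a 3D sparse convolution layer
```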
Image Cross-attention Module. The image cross-attention module (ICA) is designed to extract visual information from images for our vision-based approach. To elaborate, for a 3D Gaussian $\mathbf{G}$, we first generate a set of 3D reference points $\mathcal{R}=\{\mathbf{m}+\Delta \mathbf{m}_{i} \mid i=1, \ldots, R\}$ by perturbing the mean $\mathbf{m}$ with offsets $\Delta \mathbf{m}$. We calculate the offsets according to the covariance of the Gaussian to reflect the shape of its distribution. Then we project the 3D reference points onto the image feature maps with the extrinsics $\mathcal{T}$ and intrinsics $\mathcal{K}$. Finally, we update the Gaussian query $\mathbf{Q}$ with the weighted sum of the retrieved image features:
$$\operatorname{ICA}(\mathcal{R}, \mathbf{Q}, \mathbf{F} ; \mathcal{T}, \mathcal{K})=\frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{R} \operatorname{DA}\left(\mathbf{Q}, \pi(\mathcal{R} ; \mathcal{T}, \mathcal{K}), \mathbf{F}_{n}\right), \tag{4}$$
where $\mathbf{F}$, $\operatorname{DA}(\cdot)$, $\pi(\cdot)$ denote the image feature maps, the deformable attention function, and the transformation from world to pixel coordinates, respectively.
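A hedged sketch of the projection $\pi(\cdot)$ used above is shown below, assuming the extrinsics are world-to-camera transforms and the projected coordinates are normalized for feature sampling; the deformable attention itself is omitted.

```python
# Sketch: project 3D reference points into each camera view (conventions are assumptions).
import torch

def project_points(ref_points, extrinsics, intrinsics, image_hw):
    """ref_points: (R, 3) world coordinates; extrinsics: (N, 4, 4) world-to-camera;
    intrinsics: (N, 3, 3); image_hw: (H, W). Returns (N, R, 2) normalized uv and
    a (N, R) mask of points in front of the camera and inside the image."""
    H, W = image_hw
    ones = torch.ones_like(ref_points[:, :1])
    homo = torch.cat([ref_points, ones], dim=-1)                     # (R, 4)
    cam = torch.einsum('nij,rj->nri', extrinsics, homo)[..., :3]     # (N, R, 3) camera coords
    pix = torch.einsum('nij,nrj->nri', intrinsics, cam)              # (N, R, 3)
    depth = pix[..., 2:3].clamp(min=1e-5)
    uv = pix[..., :2] / depth                                        # pixel coordinates
    uv_norm = uv / uv.new_tensor([W, H])                             # normalize to [0, 1] for sampling
    mask = (cam[..., 2] > 0) & (uv_norm >= 0).all(-1) & (uv_norm <= 1).all(-1)
    return uv_norm, mask
```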
Refinement Module. We use the refinement module to rectify the Gaussian properties with guidance from the corresponding Gaussian queries, which have aggregated sufficient 3D information in the prior self-encoding and image cross-attention modules. Specifically, we take inspiration from DETR [60] in object detection. For a 3D Gaussian $\mathbf{G}=(\mathbf{m}, \mathbf{s}, \mathbf{r}, \mathbf{c})$, we first decode the intermediate properties $\hat{\mathbf{G}}=(\hat{\mathbf{m}}, \hat{\mathbf{s}}, \hat{\mathbf{r}}, \hat{\mathbf{c}})$ from the Gaussian query $\mathbf{Q}$ with a multi-layer perceptron (MLP). When refining the old properties with the intermediate ones, we treat the intermediate mean $\hat{\mathbf{m}}$ as a residual and add it to the old mean $\mathbf{m}$, while we directly substitute the other intermediate properties $(\hat{\mathbf{s}}, \hat{\mathbf{r}}, \hat{\mathbf{c}})$ for the corresponding old properties:
$$\hat{\mathbf{G}}=(\hat{\mathbf{m}}, \hat{\mathbf{s}}, \hat{\mathbf{r}}, \hat{\mathbf{c}})=\operatorname{MLP}(\mathbf{Q}), \quad \mathbf{G}_{\text{new}}=(\mathbf{m}+\hat{\mathbf{m}}, \hat{\mathbf{s}}, \hat{\mathbf{r}}, \hat{\mathbf{c}}). \tag{5}$$
We refine the means of the Gaussians with residual connections in order to keep their coherence throughout the $B$ blocks of GaussianFormer. The direct replacement of the other properties is due to the concern about vanishing gradients from the sigmoid and softmax activations we apply to the covariance and semantic logits.
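The sketch below illustrates this refinement step under Eq. (5); the MLP width and layer count are assumptions for illustration only.

```python
# Sketch of Eq. (5): decode intermediate properties from the query; residual mean update.
import torch
import torch.nn as nn

class RefinementModule(nn.Module):
    def __init__(self, query_dim=128, num_classes=17):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(query_dim, query_dim), nn.ReLU(inplace=True),
            nn.Linear(query_dim, 10 + num_classes))

    def forward(self, query, mean, scale, rot, logits):
        delta = self.mlp(query)                                      # (P, 10 + |C|)
        d_mean, new_scale, new_rot, new_logits = torch.split(
            delta, [3, 3, 4, logits.shape[-1]], dim=-1)
        # Mean is refined residually; scale, rotation, and logits are replaced directly.
        return mean + d_mean, new_scale, new_rot, new_logits
```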

3.3 Gaussian-to-Voxel Splatting

Due to the universal approximating ability of Gaussian mixtures [8, 11], the 3D semantic Gaussians can efficiently represent the 3D scene and thus can be directly processed to perform downstream tasks such as motion planning and control. Specifically, to achieve 3D semantic occupancy prediction, we design a Gaussian-to-voxel splatting module that efficiently transforms the 3D Gaussian representation into 3D semantic occupancy predictions using only local aggregation operations.
Although (3) demonstrates the main idea of the transformation as a summation over the contributions of 3D Gaussians, it is infeasible to query all Gaussians for every voxel position due to the intractable computation and storage complexity ($O(XYZ \times P)$). Since the weight $\exp\left(-\frac{1}{2}(\mathbf{p}-\mathbf{m})^{T} \boldsymbol{\Sigma}^{-1}(\mathbf{p}-\mathbf{m})\right)$ in (1) decays exponentially with respect to the square of the Mahalanobis distance, it is negligible when the distance is large enough. Based on the locality of the semantic Gaussian distribution, we only consider 3D Gaussians within a neighborhood of each voxel position to improve efficiency.
As illustrated by Fig. 4, we first embed the 3D Gaussians into the target voxel grid of size $X \times Y \times Z$ according to their means $\mathbf{m}$. For each 3D Gaussian, we then calculate the radius of its neighborhood according to its scale property $\mathbf{s}$. We append both the index of the Gaussian and the index of each voxel inside the neighborhood as a tuple $(g, v)$ to a list. Then we sort the list according to the indices of the voxels to derive the indices of the 3D Gaussians that each voxel should attend to:
$$\operatorname{sort}_{vox}\left(\left[\left(g, v_{g_{1}}\right), \ldots,\left(g, v_{g_{k}}\right)\right]_{g=1}^{P}\right)=\left[\left(g_{v_{1}}, v\right), \ldots,\left(g_{v_{l}}, v\right)\right]_{v=1}^{XYZ}, \tag{6}$$
where $k$, $l$ denote the number of neighboring voxels of a certain Gaussian and the number of Gaussians that contribute to a certain voxel, respectively.

Fig. 4: Illustration of the Gaussian-to-voxel splatting method in 2D. We first voxelize the 3D Gaussians and record the affected voxels of each 3D Gaussian by appending their paired indices to a list. Then we sort the list according to the voxel indices to identify the neighboring Gaussians of each voxel, followed by a local aggregation to generate the occupancy prediction.

Finally, we can approximate (3) efficiently with only the neighboring Gaussians:
$$\hat{o}(\mathbf{p} ; \mathcal{G})=\sum_{i \in \mathcal{N}(\mathbf{p})} \mathbf{g}_{i}\left(\mathbf{p} ; \mathbf{m}_{i}, \mathbf{s}_{i}, \mathbf{r}_{i}, \mathbf{c}_{i}\right), \tag{7}$$
where $\mathcal{N}(\mathbf{p})$ denotes the set of 3D Gaussians in the neighborhood of position $\mathbf{p}$.
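As a rough illustration of Eqs. (6)-(7): the actual module is a fused CUDA kernel, so the dense Python loop below is only meant to show the local aggregation, with hypothetical argument names and a heuristic neighborhood radius.

```python
# Sketch: Gaussian-to-voxel splatting via local aggregation (cf. Eqs. (6)-(7)).
import torch

def gaussian_to_voxel_splatting(means, covs, logits, grid_shape, voxel_size, origin, sigma=3.0):
    """means: (P, 3); covs: (P, 3, 3); logits: (P, C); grid_shape: (X, Y, Z);
    origin: (3,) minimum corner of the volume. Returns occupancy logits (X, Y, Z, C)."""
    X, Y, Z = grid_shape
    occ = torch.zeros(X, Y, Z, logits.shape[-1], dtype=logits.dtype)
    cov_inv = torch.inverse(covs)
    bound = torch.tensor([X - 1, Y - 1, Z - 1])
    # Per-Gaussian neighborhood radius (in voxels) derived from the covariance diagonal.
    radius = (sigma * covs.diagonal(dim1=-2, dim2=-1).sqrt() / voxel_size).ceil().long()
    centers = ((means - origin) / voxel_size).long()
    for g in range(means.shape[0]):                    # record affected voxels, cf. Eq. (6)
        lo = torch.clamp(centers[g] - radius[g], min=0)
        hi = torch.minimum(centers[g] + radius[g], bound)
        xs, ys, zs = [torch.arange(int(l), int(h) + 1) for l, h in zip(lo, hi)]
        vx, vy, vz = torch.meshgrid(xs, ys, zs, indexing='ij')
        p = torch.stack([vx, vy, vz], dim=-1).reshape(-1, 3).to(means.dtype) * voxel_size + origin
        diff = p - means[g]
        maha = torch.einsum('ni,ij,nj->n', diff, cov_inv[g], diff)
        weight = torch.exp(-0.5 * maha)                # Gaussian weight of Eq. (1)
        occ[vx.reshape(-1), vy.reshape(-1), vz.reshape(-1)] += weight[:, None] * logits[g]  # Eq. (7)
    return occ
```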