
GaussianFormer: Scene as Gaussians for Vision-Based 3D Semantic Occupancy Prediction

Fig. 1: Considering the universal approximating ability of Gaussian mixtures [8, 11], we propose an object-centric 3D semantic Gaussian representation to describe the fine-grained structure of 3D scenes without the use of dense grids. We propose a GaussianFormer model consisting of sparse convolution and cross-attention to efficiently transform 2D images into 3D Gaussian representations. To generate dense 3D occupancy, we design a Gaussian-to-voxel splatting module that can be efficiently implemented with CUDA. With comparable performance, our GaussianFormer reduces the memory consumption of existing 3D occupancy prediction methods by 75.2%-82.2%.

Abstract

3D semantic occupancy prediction aims to obtain 3D fine-grained geometry and semantics of the surrounding scene and is an important task for the robustness of vision-centric autonomous driving. Most existing methods employ dense grids such as voxels as scene representations, which ignore the sparsity of occupancy and the diversity of object scales and thus lead to unbalanced allocation of resources. To address this, we propose an object-centric representation to describe 3D scenes with sparse 3D semantic Gaussians where each Gaussian represents a flexible region of interest and its semantic features. We aggregate information from images through the attention mechanism and iteratively refine the properties of 3D Gaussians including position, covariance, and semantics. We then propose an efficient Gaussian-to-voxel splatting method to generate 3D occupancy predictions, which only aggregates the neighboring Gaussians for a certain position. We conduct extensive experiments on the widely adopted nuScenes and KITTI-360 datasets. Experimental results demonstrate that GaussianFormer achieves comparable performance with state-of-the-art methods with only 17.8%-24.8% of their memory consumption. Code is available at: https://github.com/huang-yh/GaussianFormer.

Keywords: 3D occupancy prediction • 3D Gaussian splatting • Autonomous Driving

1 Introduction

Whether to use LiDAR for 3D perception has long been the core debate among autonomous driving companies. While vision-centric systems share an economical advantage, their inability to capture obstacles of arbitrary shapes hinders driving safety and robustness [13, 17, 25, 26]. The emergence of 3D semantic occupancy prediction methods [4, 16, 18, 35, 50, 57] remedies this issue by predicting the occupancy status of each voxel in the surrounding 3D space, which facilitates a variety of newly rising tasks such as end-to-end autonomous driving [46], 4D occupancy forecasting [58], and self-supervised 3D scene understanding [15].
Despite the promising applications, the dense output space of 3D occupancy prediction poses a great challenge in how to efficiently and effectively represent the 3D scene. Voxel-based methods [23,50] assign a feature vector to each voxel to obtain dense representations that describe the fine-grained structure of a 3D scene. They employ coarse-to-fine upsampling [46,50] or voxel filtering [23,33] to improve efficiency considering the sparse nature of the 3D space. As most of the voxel space is unoccupied [3], BEV-based methods [27,56] compress the height dimension and employ the bird's eye view (BEV) as the scene representation, yet they usually require post-processing such as multi-scale fusion [56] to capture finer details. TPVFormer [16] generalizes BEV with two additional planes and achieves a better performance-complexity trade-off with the tri-perspective view (TPV). However, these are all grid-based methods and inevitably suffer from the redundancy of empty grids, resulting in more complexity for downstream tasks [46]. It is also more difficult to capture scene dynamics with grid-based representations, since it is objects rather than grids that move in the 3D space [58].
In this paper, we propose the first object-centric representation for 3D semantic occupancy prediction. We employ a set of 3D semantic Gaussians to sparsely describe a 3D scene. Each Gaussian represents a flexible region of interest and consists of the mean, covariance, and its semantic category. We propose a GaussianFormer model to effectively obtain 3D semantic Gaussians from image inputs. We randomly initialize a set of queries to instantiate the 3D Gaussians and adopt the cross-attention mechanism to aggregate information from multiscale image features. We iteratively refine the properties of the 3D Gaussians for smoother optimizations. To efficiently incorporate interactions among 3D Gaussians, we treat them as point clouds located at the Gaussian means and leverage 3D sparse convolutions to process them. We then decode the properties of 3D semantic Gaussians from the updated queries as the scene representation.
Motivated by the 3D Gaussian splatting method in image rendering [20], we design an efficient Gaussian-to-voxel splatting module that aggregates neighboring Gaussians to generate the semantic occupancy for a certain 3D position. The proposed 3D Gaussian representation uses a sparse and adaptive set of features to describe a 3D scene but can still model the fine-grained structure due to the universal approximating ability of Gaussian mixtures [8, 11]. Based on the 3D Gaussian representation, GaussianFormer further employs sparse convolution and local-aggregation-based Gaussian-to-voxel splatting to achieve efficient 3D semantic occupancy prediction, as shown in Fig. 1. We conduct extensive experiments on the nuScenes and KITTI-360 datasets for 3D semantic occupancy prediction from surrounding and monocular cameras, respectively. GaussianFormer achieves comparable performance with existing state-of-the-art methods with only 17.8%-24.8% of their memory consumption. Our qualitative visualizations show that GaussianFormer is able to generate a both holistic and realistic perception of the scene.

2 Related Work

2.1 3D Semantic Occupancy Prediction

3D semantic occupancy prediction has garnered increasing attention in recent years due to its comprehensive description of the driving scenes, which involves predicting the occupancy and semantic states of all voxels within a certain range. Learning an effective representation constitutes the fundamental step for this challenging task. A straightforward approach discretizes the 3D space into regular voxels, with each voxel being assigned a feature vector [59]. The capacity to represent intricate 3D structures renders voxel-based representations advantageous for 3D semantic occupancy prediction [4, 6, 18, 21, 35, 43, 50, 51, 57]. However, due to the sparsity of driving scenes and the high resolution of the vanilla voxel representation, these approaches suffer from considerable computation and storage overhead. To improve efficiency, several methods propose to reduce the number of voxel queries by the coarse-to-fine upsampling strategy [49] or depth-guided voxel filtering [23]. Nonetheless, the upsampling process might not adequately compensate for the information loss in the coarse stage, and the filtering strategy ignores the occluded area and depends on the quality of depth estimation. Although OctreeOcc [33] introduces voxel queries of multiple granularities to enhance efficiency, it still conforms to a predefined regular partition pattern. Our 3D Gaussian representation resembles the voxel counterpart, but can flexibly adapt to varying object scales and region complexities in a deformable way, thus achieving better resource allocation and efficiency.
Another line of work utilizes the bird's-eye-view (BEV) representation for 3D perception in autonomous driving, which can be categorized into two types according to the view transformation paradigm. Approaches based on lift-splat-shoot (LSS) actively project image features into 3D space with depth guidance [14, 24, 25, 31, 39, 40], while query-based methods typically use BEV queries and deformable attention to aggregate information from image features [19, 26, 53]. Although BEV-based perception has achieved great success in 3D object detection, it is less employed for 3D occupancy prediction due to information loss from height compression. FB-OCC [27] uses dense BEV features from backward projection to optimize the sparse voxel features from forward projection. FlashOcc [56] applies a complex multi-scale feature fusion module on BEV features for finer details. However, existing 3D occupancy prediction methods are based on grid representations, which inevitably suffer from the computation redundancy of empty grids. Differently, our GaussianFormer is based on an object-centric representation and can efficiently tend to flexible regions of interest.

2.2 3D Gaussian Splatting

The recent 3D Gaussian splatting (3D-GS) [20] uses multiple 3D Gaussians for radiance field rendering, demonstrating superior performance in rendering quality and speed. In contrast to prior explicit scene representations, such as meshes [41, 42,52] and voxels [10,37], 3D-GS is capable of modeling intricate shapes with fewer parameters. Compared with implicit neural radiance field [1, 36], 3D-GS facilitates fast rendering through splat-based rasterization, which projects 3D Gaussians to the target 2D view and renders image patches with local 2D Gaussians. Recent advances in 3D-GS include adaptation to dynamic scenarios [34,54], online 3D-GS generalizable to novel scenes [5,61], and generative 3D-GS [7, 28, 45, 55].
Although our 3D Gaussian representation also adopts the physical form of 3D Gaussians (i.e. mean and covariance) and the multi-variant Gaussian distribution, it differs from 3D-GS in significant ways, which imposes unique challenges in 3D semantic occupancy prediction: 1) Our 3D Gaussian representation is learned in an online manner as opposed to offline optimization in 3D-GS. 2) We generate 3D semantic occupancy predictions from the 3D Gaussian representation in contrast to rendering 2D RGB images in 3D-GS.

3 Proposed Approach

In this section, we present our method of 3D Gaussian splatting for 3D semantic occupancy prediction. We first introduce an object-centric 3D scene representation which adaptively describes the regions of interest with 3D semantic Gaussians (Sec. 3.1). We then explain how to effectively transform information from the image inputs to 3D Gaussians and elaborate on the model designs including self-encoding, image cross-attention, and property refinement (Sec. 3.2). At last, we detail the Gaussian-to-voxel splatting module, which generates dense 3D occupancy predictions based on local aggregation and can be efficiently implemented with CUDA (Sec. 3.3).

3.1 Object-centric 3D Scene Representation

Vision-based 3D semantic occupancy prediction aims to predict dense occupancy states and semantics for each voxel grid with multi-view camera images as input.

Fig. 2: Comparisons of the proposed 3D Gaussian representation with existing grid-based scene representations (figures from TPVFormer [16]). The voxel representation [23,50] assigns each voxel in the 3D space a feature and is redundant due to the sparse nature of the 3D space. BEV [26] and TPV [16] employ 2D planes to describe 3D space but can only alleviate the redundancy issue. Differently, the proposed object-centric 3D Gaussian representation can adapt to flexible regions of interest yet can still describe the fine-grained structure of the 3D scene due to the strong approximating ability of mixing Gaussians [8, 11].
Formally, given a set of multi-view images $\mathcal{I}=\{\mathbf{I}_i \in \mathbb{R}^{3 \times H \times W} \mid i=1, \ldots, N\}$, and corresponding intrinsics $\mathcal{K}=\{\mathbf{K}_i \in \mathbb{R}^{3 \times 3} \mid i=1, \ldots, N\}$ and extrinsics $\mathcal{T}=\{\mathbf{T}_i \in \mathbb{R}^{4 \times 4} \mid i=1, \ldots, N\}$, the objective is to predict the 3D semantic occupancy $\mathbf{O} \in \mathcal{C}^{X \times Y \times Z}$, where $N$, $\{H, W\}$, $\mathcal{C}$ and $\{X, Y, Z\}$ denote the number of views, the image resolution, the set of semantic classes and the target volume resolution.
The autonomous driving scenes contain foreground objects of various scales (such as buses and pedestrians), and background regions of different complexities (such as road and vegetation). Dense voxel representation [4, 23, 48] neglects this diversity and processes every 3D location with equal storage and computation resources, which often leads to intractable overhead because of unreasonable resource allocation. Planar representations, such as BEV [14, 26] and TPV [16], achieve 3D perception by first encoding 3D information into 2D feature maps for efficiency and then recovering 3D structures from 2D features. Although planar representations are resource-friendly, they could cause a loss of details. The grid-based methods can hardly adapt to regions of interest for different scenes and thus lead to representation and computation redundancy.
To address this, we propose an object-centric 3D representation for 3D semantic occupancy prediction where each unit describes a region of interest instead of fixed grids, as shown in Fig. 2. We represent an autonomous driving scene with a number of 3D semantic Gaussians, and each of them instantiates a semantic Gaussian distribution characterized by mean, covariance, and semantic logits. The occupancy prediction for a 3D location can be computed by summing up the values of the semantic Gaussian distributions evaluated at that location. Specifically, we use a set of $P$ 3D Gaussians $\mathcal{G}=\{\mathbf{G}_i \in \mathbb{R}^{d} \mid i=1, \ldots, P\}$ for each scene, and each 3D Gaussian is represented by a $d$-dimensional vector in the form of $(\mathbf{m} \in \mathbb{R}^{3}, \mathbf{s} \in \mathbb{R}^{3}, \mathbf{r} \in \mathbb{R}^{4}, \mathbf{c} \in \mathbb{R}^{|\mathcal{C}|})$, where $d = 10 + |\mathcal{C}|$, and $\mathbf{m}, \mathbf{s}, \mathbf{r}, \mathbf{c}$ denote the mean, scale, rotation vectors and semantic logits, respectively. Therefore, the value of a semantic Gaussian distribution $\mathbf{g}$ evaluated at point $\mathbf{p}=(x, y, z)$ is

$$
\mathbf{g}(\mathbf{p}; \mathbf{m}, \mathbf{s}, \mathbf{r}, \mathbf{c}) = \exp\left(-\frac{1}{2}(\mathbf{p}-\mathbf{m})^{T} \boldsymbol{\Sigma}^{-1}(\mathbf{p}-\mathbf{m})\right) \mathbf{c}, \quad (1)
$$

$$
\boldsymbol{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^{T}\mathbf{R}^{T}, \quad \mathbf{S} = \operatorname{diag}(\mathbf{s}), \quad \mathbf{R} = \operatorname{q2r}(\mathbf{r}), \quad (2)
$$
where $\boldsymbol{\Sigma}$, $\operatorname{diag}(\cdot)$ and $\operatorname{q2r}(\cdot)$ represent the covariance matrix, the function that constructs a diagonal matrix from a vector, and the function that transforms a quaternion into a rotation matrix, respectively. Then the occupancy prediction result at point $\mathbf{p}$ can be formulated as the summation of the contributions of the individual Gaussians at location $\mathbf{p}$:

$$
\hat{o}(\mathbf{p}; \mathcal{G}) = \sum_{i=1}^{P} \mathbf{g}_i(\mathbf{p}; \mathbf{m}_i, \mathbf{s}_i, \mathbf{r}_i, \mathbf{c}_i) = \sum_{i=1}^{P} \exp\left(-\frac{1}{2}(\mathbf{p}-\mathbf{m}_i)^{T} \boldsymbol{\Sigma}_i^{-1}(\mathbf{p}-\mathbf{m}_i)\right) \mathbf{c}_i. \quad (3)
$$
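To make (1)-(3) concrete, below is a minimal NumPy sketch (not the paper's implementation) that evaluates the semantic occupancy of a query point from a set of Gaussian properties; the quaternion convention (w, x, y, z), the function names, and the toy values are our own assumptions.

```python
import numpy as np

def quat_to_rot(r):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix (the q2r function in eq. (2))."""
    w, x, y, z = r / np.linalg.norm(r)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def gaussian_contribution(p, m, s, r, c):
    """Eq. (1)-(2): weight the semantic logits c by the Gaussian density evaluated at point p."""
    R = quat_to_rot(r)
    S = np.diag(s)
    cov = R @ S @ S.T @ R.T                      # Sigma = R S S^T R^T
    d = p - m
    weight = np.exp(-0.5 * d @ np.linalg.inv(cov) @ d)
    return weight * c

def occupancy(p, means, scales, rots, sems):
    """Eq. (3): sum the contributions of all P Gaussians at point p."""
    return sum(gaussian_contribution(p, m, s, r, c)
               for m, s, r, c in zip(means, scales, rots, sems))

# Toy usage: P = 2 Gaussians, |C| = 3 semantic classes.
rng = np.random.default_rng(0)
means = rng.uniform(-1.0, 1.0, (2, 3))
scales = rng.uniform(0.5, 1.0, (2, 3))
rots = np.tile(np.array([1.0, 0.0, 0.0, 0.0]), (2, 1))   # identity rotations
sems = rng.uniform(0.0, 1.0, (2, 3))
print(occupancy(np.zeros(3), means, scales, rots, sems))  # per-class occupancy values at the origin
```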
Compared with the voxel representation, the mean and covariance properties allow the 3D Gaussian representation to adaptively allocate computation and storage resources according to object scales and region complexities. Therefore, we need fewer 3D Gaussians to model a scene for better efficiency while still maintaining expressiveness. Meanwhile, the 3D Gaussian representation takes 3D Gaussians as its basic unit, and thus avoids the potential loss of details from dimension reduction in planar representations. Moreover, every 3D Gaussian has explicit semantic meaning, making the transformation from the scene representation to occupancy predictions much easier than in other representations, which often involve decoding per-voxel semantics from high-dimensional features.

3.2 GaussianFormer: Image to Gaussians

Based on the 3D semantic Gaussian representation of the scene, we further propose a GaussianFormer model to learn meaningful 3D Gaussians from multi-view images. The overall pipeline is shown in Fig. 3. We first initialize the properties of 3D Gaussians and their corresponding high-dimensional queries as learnable vectors. Then we iteratively refine the Gaussian properties within the $B$ blocks of GaussianFormer. Each block consists of a self-encoding module to enable interactions among 3D Gaussians, an image cross-attention module to aggregate visual information, and a refinement module to rectify the properties of 3D Gaussians.
Gaussian Properties and Queries. We introduce two groups of features in GaussianFormer. The Gaussian properties $\mathcal{G}=\{\mathbf{G}_i \in \mathbb{R}^{d} \mid i=1, \ldots, P\}$ are the physical attributes as discussed in Section 3.1, and they are de facto the learning target of the model. On the other hand, the Gaussian queries $\mathcal{Q}=\{\mathbf{Q}_i \in \mathbb{R}^{m} \mid i=1, \ldots, P\}$ are the high-dimensional feature vectors that implicitly encode 3D information in the self-encoding and image cross-attention modules, and provide guidance for rectification in the refinement module. We initialize the Gaussian properties as learnable vectors denoted by Initial Properties in Fig. 3.
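As a rough sketch of how these two groups of features could be instantiated in PyTorch (the module name, the query width, and the random initialization scheme are illustrative assumptions; only the dimensions $P$ and $d = 10 + |\mathcal{C}|$ follow the text):

```python
import torch
import torch.nn as nn

class GaussianQueries(nn.Module):
    """Learnable 3D Gaussian properties and their high-dimensional queries."""
    def __init__(self, num_gaussians: int, num_classes: int, query_dim: int):
        super().__init__()
        prop_dim = 10 + num_classes  # d = 10 + |C|: mean (3) + scale (3) + rotation (4) + semantic logits (|C|)
        self.properties = nn.Parameter(torch.randn(num_gaussians, prop_dim))  # "Initial Properties" in Fig. 3
        self.queries = nn.Parameter(torch.randn(num_gaussians, query_dim))    # implicit 3D feature vectors Q

    def split_properties(self):
        # Split the property vector back into (m, s, r, c).
        num_classes = self.properties.shape[1] - 10
        return torch.split(self.properties, [3, 3, 4, num_classes], dim=-1)
```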
Fig. 3: Framework of our GaussianFormer for 3D semantic occupancy prediction. We first extract multi-scale (M.S.) features from image inputs using an image backbone. We then randomly initialize a set of queries and properties (mean, covariance, and semantics) to represent 3D Gaussians and update them with interleaved self-encoding, image cross-attention, and property refinement. Having obtained the updated 3D Gaussians, we employ an efficient Gaussian-to-voxel splatting module to generate dense 3D occupancy via local aggregation of Gaussians.

Self-encoding Module. Methods with voxel or planar representations usually implement self-encoding modules with deformable attention for efficiency considerations, which is not well supported for the unstructured 3D Gaussian representation. Instead, we leverage 3D sparse convolution to allow interactions among 3D Gaussians, sharing the same linear computational complexity as deformable attention. Specifically, we treat each Gaussian as a point located at its mean $\mathbf{m}$, voxelize the generated point cloud (denoted by Voxelization in Fig. 3), and apply sparse convolution on the voxel grid. Since the number of 3D Gaussians $P$ is much smaller than $X \times Y \times Z$, sparse convolution can effectively take advantage of the sparsity of Gaussians.
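A minimal sketch of the voxelization step that precedes the sparse convolution, assuming the Gaussian means are given in metric coordinates and a uniform voxel size; the sparse convolution itself (e.g. with a library such as spconv) is omitted here:

```python
import numpy as np

def voxelize_means(means, pc_range, voxel_size):
    """Map Gaussian means (P, 3) to integer voxel coordinates for sparse convolution.

    means: (P, 3) Gaussian centers in meters.
    pc_range: (xmin, ymin, zmin, xmax, ymax, zmax) of the scene.
    voxel_size: edge length of a voxel in meters.
    """
    origin = np.asarray(pc_range[:3], dtype=np.float64)
    coords = np.floor((means - origin) / voxel_size).astype(np.int64)
    # Clip to the grid so Gaussians drifting outside the range still land in a valid cell.
    grid = np.floor((np.asarray(pc_range[3:]) - origin) / voxel_size).astype(np.int64)
    return np.clip(coords, 0, grid - 1)

# Example with a nuScenes-style range of [-50, 50] m in x/y and [-5, 3] m in z and 0.5 m voxels.
means = np.array([[10.0, -3.2, 0.5], [49.9, 49.9, 2.9]])
print(voxelize_means(means, (-50, -50, -5, 50, 50, 3), 0.5))
```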
Image Cross-attention Module. The image cross-attention module (ICA) is designed to extract visual information from images for our vision-based approach. To elaborate, for a 3D Gaussian $\mathbf{G}$, we first generate a set of 3D reference points $\mathcal{R}=\{\mathbf{m}+\Delta \mathbf{m}_i \mid i=1, \ldots, R\}$ by perturbing the mean $\mathbf{m}$ with offsets $\Delta \mathbf{m}$. We calculate the offsets according to the covariance of the Gaussian to reflect the shape of its distribution. Then we project the 3D reference points onto the image feature maps with the extrinsics $\mathcal{T}$ and intrinsics $\mathcal{K}$. Finally, we update the Gaussian query $\mathbf{Q}$ with the weighted sum of the retrieved image features:
$$
\operatorname{ICA}(\mathcal{R}, \mathbf{Q}, \mathbf{F}; \mathcal{T}, \mathcal{K}) = \frac{1}{N} \sum_{n=1}^{N} \sum_{i=1}^{R} \operatorname{DA}\left(\mathbf{Q}, \pi(\mathcal{R}; \mathcal{T}, \mathcal{K}), \mathbf{F}_n\right)
$$

where $\mathbf{F}$, $\operatorname{DA}(\cdot)$, $\pi(\cdot)$ denote the image feature maps, the deformable attention function and the transformation from world to pixel coordinates, respectively.
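The projection $\pi(\cdot)$ from world to pixel coordinates can be sketched as follows; this plain NumPy illustration assumes extrinsics that map world coordinates to the camera frame and masks out points behind the camera, while the deformable attention lookup itself is omitted:

```python
import numpy as np

def project_points(points, T, K):
    """pi(.): project 3D reference points (R, 3) into the pixel coordinates of one camera.

    T: (4, 4) extrinsic matrix mapping world coordinates to the camera frame.
    K: (3, 3) camera intrinsic matrix.
    Returns (R, 2) pixel coordinates and a boolean mask of points with positive depth.
    """
    homo = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)  # (R, 4) homogeneous points
    cam = (T @ homo.T).T[:, :3]                           # points in the camera frame
    valid = cam[:, 2] > 1e-5                              # keep points in front of the camera
    pix = (K @ cam.T).T
    pix = pix[:, :2] / np.clip(pix[:, 2:3], 1e-5, None)   # perspective division
    return pix, valid

# Reference points around a Gaussian mean m, offset along its principal axes (toy values).
m = np.array([1.0, 0.5, 10.0])
offsets = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [0.0, 0.5, 0.0]])
T = np.eye(4)
K = np.array([[1000.0, 0.0, 800.0], [0.0, 1000.0, 450.0], [0.0, 0.0, 1.0]])
pix, valid = project_points(m + offsets, T, K)
```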
Refinement Module. We use the refinement module to rectify the Gaussian properties with guidance from the corresponding Gaussian queries, which have aggregated sufficient 3D information in the prior self-encoding and image cross-attention modules. Specifically, we take inspiration from DETR [60] in object detection. For a 3D Gaussian $\mathbf{G}=(\mathbf{m}, \mathbf{s}, \mathbf{r}, \mathbf{c})$, we first decode the intermediate properties $\hat{\mathbf{G}}=(\hat{\mathbf{m}}, \hat{\mathbf{s}}, \hat{\mathbf{r}}, \hat{\mathbf{c}})$ from the Gaussian query $\mathbf{Q}$ with a multi-layer perceptron (MLP). When refining the old properties with the intermediate ones, we treat the intermediate mean $\hat{\mathbf{m}}$ as a residual and add it to the old mean $\mathbf{m}$, while we directly substitute the other intermediate properties $(\hat{\mathbf{s}}, \hat{\mathbf{r}}, \hat{\mathbf{c}})$ for the corresponding old properties:

$$
\hat{\mathbf{G}} = (\hat{\mathbf{m}}, \hat{\mathbf{s}}, \hat{\mathbf{r}}, \hat{\mathbf{c}}) = \operatorname{MLP}(\mathbf{Q}), \quad \mathbf{G}_{\text{new}} = (\mathbf{m} + \hat{\mathbf{m}}, \hat{\mathbf{s}}, \hat{\mathbf{r}}, \hat{\mathbf{c}}).
$$

We refine the means of the Gaussians with residual connections in order to keep their coherence throughout the $B$ blocks of GaussianFormer. The direct replacement of the other properties is due to the concern about vanishing gradients from the sigmoid and softmax activations we apply to the covariance and semantic logits.
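A schematic PyTorch version of this update rule (the two-layer MLP and the tensor layout are our assumptions; only the residual update of the mean versus the direct replacement of the other properties follows the text):

```python
import torch
import torch.nn as nn

class RefinementModule(nn.Module):
    """Decode intermediate properties from Gaussian queries and rectify the old ones."""
    def __init__(self, query_dim: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(query_dim, query_dim), nn.ReLU(),
                                 nn.Linear(query_dim, 10 + num_classes))

    def forward(self, queries, mean, scale, rot, sem):
        m_hat, s_hat, r_hat, c_hat = torch.split(
            self.mlp(queries), [3, 3, 4, sem.shape[-1]], dim=-1)
        new_mean = mean + m_hat                            # residual update keeps means coherent across blocks
        new_scale, new_rot, new_sem = s_hat, r_hat, c_hat  # other properties are replaced directly
        return new_mean, new_scale, new_rot, new_sem
```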

3.3 Gaussian-to-Voxel Splatting

Due to the universal approximating ability of Gaussian mixtures [8,11], the 3D semantic Gaussians can efficiently represent the 3D scene and thus can be directly processed to perform downstream tasks like motion planning and control. Specifically, to achieve 3D semantic occupancy prediction, we design an efficient Gaussian-to-voxel splatting module that transforms the 3D Gaussian representation into 3D semantic occupancy predictions using only local aggregation operations.
Although (3) demonstrates the main idea of the transformation as a summation over the contributions of 3D Gaussians, it is infeasible to query all Gaussians for every voxel position due to the intractable computation and storage complexity ($O(XYZ \times P)$). Since the weight $\exp\left(-\frac{1}{2}(\mathbf{p}-\mathbf{m})^{T} \boldsymbol{\Sigma}^{-1}(\mathbf{p}-\mathbf{m})\right)$ in (1) decays exponentially with respect to the square of the Mahalanobis distance, it should be negligible when the distance is large enough. Based on the locality of the semantic Gaussian distribution, we only consider 3D Gaussians within a neighborhood of each voxel position to improve efficiency.
Fig. 4: Illustration of the Gaussian-to-voxel splatting method in 2D. We first voxelize the 3D Gaussians and record the affected voxels of each 3D Gaussian by appending their paired indices to a list. Then we sort the list according to the voxel indices to identify the neighboring Gaussians of each voxel, followed by a local aggregation to generate the occupancy prediction.

As illustrated by Fig. 4, we first embed the 3D Gaussians into the target voxel grid of size $X \times Y \times Z$ according to their means $\mathbf{m}$. For each 3D Gaussian, we then calculate the radius of its neighborhood according to its scale property $\mathbf{s}$. We append both the index of the Gaussian and the index of each voxel inside the neighborhood as a tuple $(g, v)$ to a list. Then we sort the list according to the indices of voxels to derive the indices of the 3D Gaussians that each voxel should attend to:

$$
\operatorname{sort}_{vox}\left(\left[(g, v_{g_1}), \ldots, (g, v_{g_k})\right]_{g=1}^{P}\right) = \left[(g_{v_1}, v), \ldots, (g_{v_l}, v)\right]_{v=1}^{XYZ}
$$

where $k$ and $l$ denote the number of neighboring voxels of a certain Gaussian and the number of Gaussians that contribute to a certain voxel, respectively. Finally, we can approximate (3) efficiently with only the neighboring Gaussians:

$$
\hat{o}(\mathbf{p}; \mathcal{G}) = \sum_{i \in \mathcal{N}(\mathbf{p})} \mathbf{g}_i(\mathbf{p}; \mathbf{m}_i, \mathbf{s}_i, \mathbf{r}_i, \mathbf{c}_i)
$$

where $\mathcal{N}(\mathbf{p})$ represents the set of neighboring Gaussians of the point at $\mathbf{p}$. Considering the dynamic neighborhood sizes of 3D Gaussians, the implementation of Gaussian-to-voxel splatting is nontrivial. To fully exploit the parallel computation ability of the GPU, we implement it with the CUDA programming language to achieve better acceleration.
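A CPU-side sketch of the sort-based local aggregation described above (the real module is a CUDA kernel; here the neighborhood radius is a fixed multiple of the largest scale and the rotation is ignored in the covariance, both simplifications of ours):

```python
import numpy as np

def splat_to_voxels(means, scales, sems, grid_shape, voxel_size, k=3.0):
    """Accumulate each Gaussian only into the voxels of its local neighborhood."""
    X, Y, Z = grid_shape
    occ = np.zeros((X, Y, Z, sems.shape[1]))
    # Voxel centers in the same metric coordinates as the means (origin at the grid corner).
    centers = (np.stack(np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z),
                                    indexing="ij"), axis=-1) + 0.5) * voxel_size

    pairs = []                                            # (gaussian index g, flattened voxel index v)
    for g, (m, s) in enumerate(zip(means, scales)):
        radius = int(np.ceil(k * s.max() / voxel_size))   # neighborhood radius from the scale property
        center = np.floor(m / voxel_size).astype(int)
        lo = np.maximum(center - radius, 0)
        hi = np.minimum(center + radius + 1, [X, Y, Z])
        for x in range(lo[0], hi[0]):
            for y in range(lo[1], hi[1]):
                for z in range(lo[2], hi[2]):
                    pairs.append((g, np.ravel_multi_index((x, y, z), grid_shape)))

    # Sorting by voxel index groups the contributing Gaussians of each voxel contiguously,
    # which is what allows a parallel kernel to process one voxel segment per thread.
    pairs.sort(key=lambda t: t[1])

    for g, v in pairs:
        x, y, z = np.unravel_index(v, grid_shape)
        d = centers[x, y, z] - means[g]
        cov_inv = np.diag(1.0 / scales[g] ** 2)           # simplification: axis-aligned covariance
        occ[x, y, z] += np.exp(-0.5 * d @ cov_inv @ d) * sems[g]
    return occ
```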
The overall GaussianFormer model can be trained efficiently in an end-to-end manner. For training, we use the cross-entropy loss $L_{ce}$ and the lovasz-softmax [2] loss $L_{lov}$ following TPVFormer [16]. To refine the Gaussian properties in an iterative manner, we apply supervision on the output of each refinement module. The overall loss function is $L = \sum_{i=1}^{B} L_{ce}^{i} + L_{lov}^{i}$, where $i$ denotes the $i$-th block.
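A hedged sketch of how the per-block supervision could be assembled; `lovasz_softmax` stands for an assumed third-party implementation of the lovasz-softmax loss, and this is not the authors' training code:

```python
import torch
import torch.nn.functional as F

def total_loss(block_logits, target, lovasz_softmax):
    """Sum cross-entropy + lovasz-softmax over the outputs of all B refinement blocks.

    block_logits: list of (N, num_classes) voxel logits, one entry per refinement block.
    target: (N,) ground-truth semantic labels per voxel.
    lovasz_softmax: callable implementing the lovasz-softmax loss (assumed helper).
    """
    loss = 0.0
    for logits in block_logits:                            # supervision on every block output
        loss = loss + F.cross_entropy(logits, target)
        loss = loss + lovasz_softmax(F.softmax(logits, dim=-1), target)
    return loss
```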

4 Experiments

In this paper, we propose a 3D semantic Gaussian representation to effectively and efficiently describe the 3D scene and devise a GaussianFormer model to perform 3D occupancy prediction. We conducted experiments on the nuScenes [3] dataset and the KITTI-360 [29] dataset for 3D semantic occupancy prediction with surrounding and monocular cameras, respectively.

4.1 Datasets

NuScenes [3] consists of 1000 sequences of various driving scenes collected in Boston and Singapore, which are officially split into 700/150/150 sequences for training, validation and testing, respectively. Each sequence lasts 20 seconds with RGB images collected by 6 surrounding cameras, and the keyframes are annotated at 2 Hz. We leverage the dense semantic occupancy annotations from SurroundOcc [50] for supervision and evaluation. The annotated voxel grid spans $[-50\,\mathrm{m}, 50\,\mathrm{m}]$ along the X and Y axes, and $[-5\,\mathrm{m}, 3\,\mathrm{m}]$ along the Z axis, with a resolution of $200 \times 200 \times 16$. Each voxel is labeled with one of 18 classes (16 semantic, 1 empty and 1 unknown).
KITTI-360 [29] is a large-scale dataset covering a driving distance of 73.7 km, corresponding to over 320k images and 100k laser scans. We use the dense semantic occupancy annotations from SSCBench-KITTI-360 [22] for supervision and evaluation. It provides ground truth labels for 9 long sequences with a total of 12865 key frames, which are officially split into 7/1/1 sequences with 8487/1812/2566 key frames for training, validation and testing, respectively. The voxel grid spans $51.2 \times 51.2 \times 6.4\,\mathrm{m}^3$ in front of the ego car with a resolution of $256 \times 256 \times 32$, and each voxel is labeled with one of 19 classes (18 semantic and 1 empty). We use the RGB images from the left camera as input to our model.

4.2 Evaluation Metrics

Following common practice [4], we use the mean Intersection-over-Union (mIoU) and Intersection-over-Union (IoU) to evaluate the performance of our model:

$$
\mathrm{mIoU} = \frac{1}{|\mathcal{C}'|} \sum_{i \in \mathcal{C}'} \frac{TP_i}{TP_i + FP_i + FN_i}, \quad \mathrm{IoU} = \frac{TP_{\neq c_0}}{TP_{\neq c_0} + FP_{\neq c_0} + FN_{\neq c_0}}
$$

where $\mathcal{C}'$, $c_0$, $TP$, $FP$, $FN$ denote the nonempty classes, the empty class, and the numbers of true positive, false positive and false negative predictions, respectively.
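The two metrics can be computed directly from confusion counts; a small NumPy sketch, assuming label 0 denotes the empty class (an illustrative convention):

```python
import numpy as np

def occupancy_metrics(pred, gt, num_classes, empty_class=0):
    """Compute mIoU over nonempty classes and the class-agnostic occupancy IoU.

    pred, gt: integer label arrays of the same shape (e.g. flattened X*Y*Z voxels).
    """
    ious = []
    for c in range(num_classes):
        if c == empty_class:
            continue
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:                                  # guard: skip classes absent from both pred and gt
            ious.append(tp / denom)
    miou = float(np.mean(ious)) if ious else 0.0

    occ_pred, occ_gt = pred != empty_class, gt != empty_class   # geometry-only occupancy
    tp = np.sum(occ_pred & occ_gt)
    fp = np.sum(occ_pred & ~occ_gt)
    fn = np.sum(~occ_pred & occ_gt)
    iou = tp / max(tp + fp + fn, 1)
    return miou, float(iou)
```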

4.3 Implementation Details

We set the resolutions of the input images to $900 \times 1600$ for nuScenes [3] and $376 \times 1408$ for KITTI-360 [29]. We employ ResNet101-DCN [12] initialized from the FCOS3D [47] checkpoint as the image backbone for nuScenes, and ResNet50 [12] pretrained on ImageNet [9] for KITTI-360. We use the feature pyramid network [30] (FPN) to generate multi-scale image features with downsample rates of 4, 8, 16 and 32. We set the number of Gaussians to 144000 and 38400 for nuScenes and KITTI-360, respectively, and use 4 transformer blocks in GaussianFormer to refine the properties of the Gaussians. For optimization, we utilize the AdamW [32] optimizer with a weight decay of 0.01. The learning rate warms up in the first 500 iterations to a maximum value of 2e-4 and decreases according to a cosine schedule. We train our models for 20 epochs with a batch size of 8, and employ random flip and photometric distortion augmentations.
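The optimization recipe maps to standard PyTorch components roughly as follows (a sketch under the assumption that an off-the-shelf linear-warmup-plus-cosine schedule is acceptable; the authors' training framework may differ):

```python
import torch
from torch.optim.lr_scheduler import LinearLR, CosineAnnealingLR, SequentialLR

def build_optimizer(model, total_iters, warmup_iters=500, max_lr=2e-4, weight_decay=0.01):
    """AdamW with a 500-iteration linear warmup followed by cosine decay."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr, weight_decay=weight_decay)
    warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_iters)   # ramp up to max_lr
    cosine = CosineAnnealingLR(optimizer, T_max=total_iters - warmup_iters)     # cosine decay afterwards
    scheduler = SequentialLR(optimizer, [warmup, cosine], milestones=[warmup_iters])
    return optimizer, scheduler
```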

4.4 Results and Analysis

Surrounding-Camera 3D semantic occupancy prediction. In Table 1, we present a comprehensive quantitative comparison of various methods for
Table 1: 3D semantic occupancy prediction results on the nuScenes validation set. While the original TPVFormer [16] is trained with LiDAR segmentation labels, TPVFormer* is supervised by dense occupancy annotations. Our method achieves comparable performance with state-of-the-art methods.
| Method | SC IoU | SSC mIoU |
| :--- | :---: | :---: |
| MonoScene [4] | 23.96 | 7.31 |
| Atlas [38] | 28.66 | 15.00 |
| BEVFormer [26] | 30.50 | 16.75 |
| TPVFormer [16] | 11.51 | 11.66 |
| TPVFormer* [16] | 30.86 | 17.10 |
| OccFormer [57] | 31.39 | 19.03 |
| SurroundOcc [50] | 31.49 | 20.30 |
| Ours | 29.83 | 19.10 |
Table 2: 3D semantic occupancy prediction results on the SSCBench-KITTI-360 validation set. Our method achieves performance on par with state-of-the-art methods, excelling at some smaller and general categories (i.e. motorcycle, other-veh.).
| Method | Input (L: LiDAR, C: camera) | SC IoU | SSC mIoU |
| :--- | :---: | :---: | :---: |
| LMSCNet [43] | L | 47.53 | 13.65 |
| SSCNet [44] | L | 53.58 | 16.95 |
| MonoScene [4] | C | 37.87 | 12.31 |
| VoxFormer [23] | C | 38.76 | 11.91 |
| Symphonies [18] | C | 44.12 | 18.58 |


multi-view 3D semantic occupancy prediction on the nuScenes validation set, with dense annotations from SurroundOcc [50]. Our GaussianFormer achieves notable improvements over methods based on planar representations, such as BEVFormer [26] and TPVFormer [16]. Even compared with dense grid representations, GaussianFormer performs on par with OccFormer [57] and SurroundOcc [50]. These observations prove the valuable application of the 3D Gaussians for semantic occupancy prediction. This is because the 3D Gaussian representation better exploits the sparse nature of the driving scenes and the diversity of object scales with flexible properties of position and covariance.
Monocular 3D semantic occupancy prediction. Table 2 compares the performance of GaussianFormer with other methods for monocular 3D semantic occupancy prediction on SSCBench-KITTI-360. Notably, GaussianFormer achieves comparable performance with state-of-the-art models, excelling at some smaller categories such as motorcycle and general categories such as other-vehicle. This is because 3D Gaussians can adaptively change their positions and covariance to match the boundaries of small objects in images, in contrast to rigid grid projections onto images, which might be misleading. Furthermore, the flexibility of 3D Gaussians also benefits the predictions for general objects (i.e.
Table 3: Efficiency comparison of different representations on nuScenes. The latency and memory consumption for GaussianFormer are tested on one NVIDIA 4090 GPU with batch size one, while the results for other methods are reported in OctreeOcc [33] tested on one NVIDIA A100 GPU. Our method demonstrates significantly reduced memory usage compared to other representations.
| Methods | Query Form | Query Resolution | Latency $\downarrow$ | Memory $\downarrow$ |
| :--- | :---: | :---: | :---: | :---: |
| BEVFormer [26] | 2D BEV | $200 \times 200$ | $\mathbf{302}$ ms | 25100 M |
| TPVFormer [16] | 2D tri-plane | $200 \times(200+16+16)$ | 341 ms | 29000 M |
| PanoOcc [49] | 3D voxel | $100 \times 100 \times 16$ | 502 ms | 35000 M |
| FB-OCC [27] | 3D voxel & 2D BEV | $200 \times 200 \times 16$ & $200 \times 200$ | 463 ms | 31000 M |
| OctreeOcc [33] | Octree query | 91200 | 386 ms | 26500 M |
| GaussianFormer | 3D Gaussian | 144000 | 372 ms | $\mathbf{6229}$ M |
Table 4: Ablation on the components of GaussianFormer. Deep Supervision means supervising the output of each refinement module. Residual Refine indicates which Gaussian properties are updated residually rather than substituted.
| Deep Supervision | Sparse Conv. | Residual Refine | mIoU | IoU |
| :---: | :---: | :---: | :---: | :---: |
|  | $\checkmark$ | mean | 16.36 | 29.32 |
| $\checkmark$ |  | mean | 15.93 | 28.99 |
| $\checkmark$ | $\checkmark$ | none | - | - |
| $\checkmark$ | $\checkmark$ | all except semantics | 16.24 | 29.30 |
| $\checkmark$ | $\checkmark$ | mean | $\mathbf{16.41}$ | $\mathbf{29.37}$ |
categories with the other- prefix), which often have distinct shapes and appearances compared with normal categories.
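To make the contrast with rigid grid projection concrete, the sketch below shows a standard pinhole projection of 3D query positions (voxel centers or Gaussian means) into an image. The camera intrinsics, extrinsics, and query coordinates are illustrative assumptions rather than values from the paper; the point is that a fixed voxel center always lands on the same pixel, whereas a refined Gaussian mean can shift to better cover a small object.

```python
import numpy as np

def project_to_image(points_xyz, intrinsic, extrinsic):
    """Pinhole projection of 3D query positions (voxel centers or Gaussian
    means) into pixel coordinates; returns (u, v) and a visibility mask."""
    n = points_xyz.shape[0]
    homo = np.concatenate([points_xyz, np.ones((n, 1))], axis=1)   # (n, 4)
    cam = (extrinsic @ homo.T).T[:, :3]                            # world -> camera frame
    in_front = cam[:, 2] > 1e-3                                    # keep points in front of the camera
    uv = (intrinsic @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-3, None)               # perspective divide
    return uv, in_front

# Illustrative camera: 1600x900 image, ~1000 px focal length, identity pose.
K = np.array([[1000.0, 0.0, 800.0],
              [0.0, 1000.0, 450.0],
              [0.0, 0.0, 1.0]])
E = np.eye(4)[:3]                       # 3x4 world-to-camera transform
queries = np.array([[2.0, 0.5, 10.0],   # a rigid voxel center
                    [2.3, 0.4, 10.0]])  # a nearby, refined Gaussian mean
pixels, visible = project_to_image(queries, K, E)
```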
Efficiency comparisons with existing methods. We provide the efficiency comparisons of different scene representations in Table 3. Notably, GaussianFormer surpasses all existing competitors with significantly reduced memory consumption. The memory efficiency of GaussianFormer originates from its object-centric nature which assigns explicit semantic meaning to each 3D Gaussian, and thus greatly simplifies the transformation from the scene representation to occupancy predictions, getting rid of the expensive decoding process from high dimensional features. While being slightly slower ( 70 ms 70 ms ∼70ms\sim 70 \mathrm{~ms} ) than methods based on planar representations [16, 26], GaussianFormer achieves the lowest latency among dense grid representations. Notably, our method is faster than OctreeOcc [33] even with more queries.
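As a rough illustration of why this decoding step is cheap, the following NumPy sketch implements the local-aggregation idea behind Gaussian-to-voxel splatting: each Gaussian writes its semantics, weighted by its density, only into the voxels around its center. The function name, the fixed neighborhood radius, and the toy shapes are assumptions for illustration, not the paper's CUDA implementation.

```python
import numpy as np

def gaussian_to_voxel_splat(means, covs, logits, grid_min, voxel_size, grid_shape, radius=3):
    """Accumulate per-voxel semantics from nearby Gaussians only.

    means:  (N, 3) Gaussian centers, covs: (N, 3, 3) covariances,
    logits: (N, C) semantic logits. Returns an (X, Y, Z, C) array.
    """
    X, Y, Z = grid_shape
    out = np.zeros((X, Y, Z, logits.shape[1]), dtype=np.float32)
    inv_covs = np.linalg.inv(covs)

    for mean, inv_cov, logit in zip(means, inv_covs, logits):
        center_idx = np.floor((mean - grid_min) / voxel_size).astype(int)
        lo = np.maximum(center_idx - radius, 0)
        hi = np.minimum(center_idx + radius + 1, grid_shape)
        # Enumerate only the neighboring voxels instead of the full grid.
        xs, ys, zs = [np.arange(l, h) for l, h in zip(lo, hi)]
        ix, iy, iz = np.meshgrid(xs, ys, zs, indexing="ij")
        centers = (np.stack([ix, iy, iz], -1) + 0.5) * voxel_size + grid_min
        d = centers - mean                                    # offsets to voxel centers
        maha = np.einsum("...i,ij,...j->...", d, inv_cov, d)  # squared Mahalanobis distance
        w = np.exp(-0.5 * maha)[..., None]                    # Gaussian weight
        out[ix, iy, iz] += w * logit                          # weighted semantic contribution
    return out

# Toy usage: 2 Gaussians, 3 classes, a 20^3 grid with 0.5 m voxels.
occ = gaussian_to_voxel_splat(
    means=np.array([[1.0, 1.0, 1.0], [4.0, 4.0, 2.0]]),
    covs=np.stack([np.eye(3) * 0.25, np.eye(3) * 0.5]),
    logits=np.array([[2.0, 0.1, 0.1], [0.1, 3.0, 0.1]]),
    grid_min=np.zeros(3), voxel_size=0.5, grid_shape=(20, 20, 20))
semantics = occ.argmax(-1)  # per-voxel class prediction
```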
Analysis of components of GaussianFormer. In Table 4, we provide a comprehensive analysis of the components of GaussianFormer to validate their effectiveness. We conduct these experiments on nuScenes and set the number of 3D Gaussians to 51200. The strategy for refining the Gaussian properties has a notable influence on the performance. The all-substitution strategy (denoted by none), where we directly replace the old properties with new ones, collapses during training, so we do not report its result. This is because the positions of the Gaussians are sensitive to noise and quickly converge to a trivial solution without regularization for coherence during refinement. Rectifying all properties except semantics in a residual style (denoted by all except semantics) is also harmful, because the sigmoid activation used for the covariance is prone to vanishing
Table 5: Ablation on the number of Gaussians. The latency and memory are tested on an NVIDIA 4090 GPU with batch size one during inference. The performance improves consistently with more Gaussians while taking up more time and memory.
| Number of Gaussians | Latency | Memory | mIoU | IoU |
| :---: | :---: | :---: | :---: | :---: |
| 25600 | $\mathbf{227}$ ms | $\mathbf{4850}$ M | 16.00 | 28.72 |
| 38400 | 249 ms | 4856 M | 16.04 | 28.72 |
| 51200 | 259 ms | 4866 M | 16.41 | 29.37 |
| 91200 | 293 ms | 5380 M | 18.31 | 27.48 |
| 144000 | 372 ms | 6229 M | $\mathbf{19.10}$ | $\mathbf{29.83}$ |
gradient. Moreover, the 3D sparse convolution in the self-encoding module is crucial for the performance because it is responsible for the interactions among 3D Gaussians. On the other hand, the deep supervision strategy also contributes to the overall performance by ensuring that every intermediate refinement step benefits the 3D perception.
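For concreteness, here is a minimal PyTorch-style sketch of one refinement step consistent with the best setting in Table 4: the mean is updated residually, while the other properties are substituted with fresh predictions. The module name, feature dimension, and the scale-plus-quaternion parameterization of the covariance are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class GaussianRefinement(nn.Module):
    """One refinement step: residual update for the mean, direct
    substitution for covariance parameters (scale + rotation) and semantics."""

    def __init__(self, embed_dim=128, num_classes=18):
        super().__init__()
        # 3 (delta mean) + 3 (scale) + 4 (rotation quaternion) + num_classes
        self.head = nn.Linear(embed_dim, 3 + 3 + 4 + num_classes)

    def forward(self, query, mean, scale, rot, sem):
        out = self.head(query)
        d_mean, new_scale, new_rot, new_sem = torch.split(
            out, [3, 3, 4, sem.shape[-1]], dim=-1)
        mean = mean + d_mean                                   # residual refinement of position
        scale = torch.sigmoid(new_scale)                       # substituted, bounded scale
        rot = nn.functional.normalize(new_rot, dim=-1)         # substituted unit quaternion
        sem = new_sem                                          # substituted semantic logits
        return mean, scale, rot, sem

# Toy usage with 51200 Gaussian queries.
B, N, D, C = 1, 51200, 128, 18
layer = GaussianRefinement(D, C)
mean, scale, rot, sem = layer(
    torch.randn(B, N, D), torch.rand(B, N, 3), torch.rand(B, N, 3),
    torch.randn(B, N, 4), torch.randn(B, N, C))
```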
Effect of the number of Gaussians. Table 5 presents the ablation study on the number of Gaussians, where we analyze its influence on efficiency and performance. The mIoU increases consistently when the number of 3D Gaussians is greater than 38400, owing to the enhanced ability to represent finer details with more Gaussians. The latency and memory consumption also correlate linearly with the number of Gaussians, offering flexibility for deployment.
Visualization results. We provide qualitative visualization results in Fig. 5. Our GaussianFormer can generate a holistic and realistic perception of the scene. Specifically, the 3D Gaussians adjust their covariance matrices to capture the fine details of object shapes such as the flat-shaped Gaussians at the surface of the road and walls (e.g. in the third row). Additionally, the density is higher in regions with vehicles and pedestrians compared with the road surface (e.g. in the first row), which proves that the 3D Gaussians cluster around foreground objects with iterative refinement for a reasonable allocation of resources. Furthermore, our GaussianFormer even successfully predicts objects that are not in the ground truth and barely visible in the images, such as the truck in the front left input image and the upper right corner of the 3D visualizations in the fourth row.
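The ellipsoid rendering used for these visualizations (centered at the Gaussian means, with semi-axes derived from the covariance matrices, as in Fig. 5) can, in principle, be obtained from a simple eigendecomposition of each covariance. The sketch below is purely illustrative and not tied to any particular plotting library.

```python
import numpy as np

def covariance_to_ellipsoid(cov, n_std=1.0):
    """Decompose a 3x3 covariance into ellipsoid semi-axis lengths and directions.

    The eigenvectors give the ellipsoid orientation; the square roots of the
    eigenvalues give the semi-axis lengths (scaled by n_std)."""
    eigvals, eigvecs = np.linalg.eigh(cov)                 # symmetric PSD -> real eigenpairs
    semi_axes = n_std * np.sqrt(np.maximum(eigvals, 0.0))  # guard against tiny negatives
    return semi_axes, eigvecs

# Example: a flat, road-like Gaussian with small vertical extent.
cov = np.diag([0.09, 0.09, 0.0025])
axes, rotation = covariance_to_ellipsoid(cov)
print(axes)  # approximately [0.05, 0.3, 0.3] (ascending order)
```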

5 Conclusion and Discussions

In this paper, we have proposed an efficient object-centric 3D Gaussian representation for 3D semantic occupancy prediction to better exploit the sparsity of occupancy and the diversity of object scales. We describe the driving scenes with sparse 3D Gaussians each of which is characterized by its position, covariance and semantics and represents a flexible region of interest. Based on the 3D Gaussian representation, we have designed the GaussianFormer to effectively learn 3D Gaussians from input images through attention mechanism and iterative refinement. To efficiently generate voxelized occupancy predictions from 3D Gaussians, we have proposed an efficient Gaussian-to-voxel splatting method, which only aggregates the neighboring Gaussians for each voxel. GaussianFormer has achieved comparable performance with state-of-the-art methods

Fig. 5: Visualization results for 3D semantic occupancy prediction on nuScenes. We visualize the 3D Gaussians by treating them as ellipsoids centered at the Gaussian means with semi-axes determined by the Gaussian covariance matrices. Our GaussianFormer not only achieves reasonable allocation of resources, but also captures the fine details of object shapes.

on the nuScenes and KITTI-360 datasets, and significantly reduced the memory consumption by more than $75\%$. Our ablation study has shown that the performance of GaussianFormer scales well with the number of Gaussians. Additionally, the visualizations have demonstrated the ability of 3D Gaussians to capture the details of object shapes and to reasonably allocate computation and storage resources.
Limitations. The performance of GaussianFormer is still inferior to state-of-the-art methods despite its much lower memory consumption. This might result from the inaccuracy of the 3D semantic Gaussian representation or simply from suboptimal hyperparameters, since we did not perform extensive hyperparameter tuning. GaussianFormer also requires a large number of Gaussians to achieve satisfactory performance, perhaps because the current 3D semantic Gaussians include empty as one category and can thus still be redundant. It would be interesting to model only solid objects to further improve performance and speed.

Fig. 6: Visualizations of the proposed GaussianFormer method for 3D semantic occupancy prediction on the nuScenes [3] validation set. We visualize the six input surrounding images and the corresponding predicted semantic occupancy in the upper part. The lower row shows the predicted 3D Gaussians (left), the predicted semantic occupancy in the global view (middle) and the bird’s eye view (right).

A Video Demonstration

Fig. 6 shows a sampled image from the video demo ${ }^{1}$ for 3D semantic occupancy prediction on the nuScenes [3] validation set. GaussianFormer successfully predicts both realistic and holistic semantic occupancy results with only surround images as input. The similarity between the 3D Gaussians and the predicted occupancy suggests the expressiveness of the efficient 3D Gaussian representation.

B Additional Visualizations

In Fig. 7, we provide the visualizations of GaussianFormer for 3D semantic occupancy prediction on the KITTI-360 [29] validation set. Similar to the visualizations for nuScenes [3] in Fig. 5, we observe that the density of the 3D Gaussians is higher in the presence of vehicles (e.g. the 4th row), which demonstrates the object-centric nature of the 3D Gaussian representation and further benefits resource allocation. In addition to the overall structure, GaussianFormer also captures intricate details such as poles in the scenes (e.g. the 1st row). The discrepancy in the density and scale of the 3D Gaussians between Fig. 5 and Fig. 7 arises because we set the number of Gaussians to 144000 / 38400 and the max scale of Gaussians to 0.3 m / 0.5 m for nuScenes and KITTI-360, respectively. We also visualize the output 3D semantic Gaussians of each refinement layer in Fig. 8. In Fig. 9, we provide a qualitative comparison between our GaussianFormer and SurroundOcc [50].
Fig. 7: Visualizations of the proposed GaussianFormer method for 3D semantic occupancy prediction on the KITTI-360 [29] validation set. GaussianFormer is able to capture both the overall structures and intricate details of the driving scenes in a monocular setting.

C Additional Ablation Study

We conduct additional ablation studies on the number of Gaussians, photometric supervision, the splitting & pruning strategy, and initialization schemes in Table 6. We observe consistent improvement as the number of Gaussians increases.
We also experiment with an additional photometric loss to reconstruct the input images in Table 6, which does not demonstrate a significant improvement. This is because a photometric loss on nuScenes, where the surrounding cameras share little overlapping view, cannot provide further structural information. In addition, we experiment with the splitting and pruning strategy prevalent in the offline 3D-GS literature and observe that it improves performance compared with the baseline.
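For reference, the sketch below mimics a 3D-GS-style splitting and pruning pass adapted to semantic Gaussians. The thresholds, the use of the non-empty probability as a confidence score, and the assumption that the last semantic channel is the empty class are all illustrative choices, not details confirmed by the paper.

```python
import numpy as np

def split_and_prune(means, scales, sems, conf_thresh=0.05, scale_thresh=0.4, rng=None):
    """Prune low-confidence Gaussians and split overly large ones (sketch)."""
    rng = np.random.default_rng(0) if rng is None else rng
    # Softmax over semantic logits; assume the last channel is the empty class.
    probs = np.exp(sems - sems.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    conf = 1.0 - probs[:, -1]

    keep = conf > conf_thresh                      # pruning step
    means, scales, sems = means[keep], scales[keep], sems[keep]

    big = scales.max(-1) > scale_thresh            # split Gaussians with any large axis
    offsets = rng.normal(scale=scales[big] / 2.0)  # sample children near the parent
    new_means = np.concatenate([means[big] + offsets, means[big] - offsets])
    new_scales = np.tile(scales[big] / 1.6, (2, 1))  # children are smaller than the parent
    new_sems = np.tile(sems[big], (2, 1))

    means = np.concatenate([means[~big], new_means])
    scales = np.concatenate([scales[~big], new_scales])
    sems = np.concatenate([sems[~big], new_sems])
    return means, scales, sems

# Toy usage with 1000 random Gaussians and 18 semantic classes.
m, s, c = split_and_prune(np.random.rand(1000, 3) * 100,
                          np.random.rand(1000, 3) * 0.6,
                          np.random.randn(1000, 18))
```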
In the main paper, we employ learnable properties as the initialization for our online model to learn a prior from different scenes. We further experimented with

Fig. 8: Visualization of the outputs of the refinement layers and the corresponding mIoU on nuScenes.

Fig. 9: Qualitative comparison between our method and SurroundOcc on nuScenes.

different initialization strategies, including uniform distribution, predicted pseudo point cloud, and ground-truth point cloud. We see that initialization is important to the performance, and depth information is especially crucial.
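The initialization variants compared in Table 6 can be sketched as follows; the function names and the nuScenes-style perception range are assumptions, and the point-cloud variant stands in for both the pseudo-point and ground-truth-point settings.

```python
import numpy as np

def init_means_uniform(num_gaussians, pc_range, rng=None):
    """Uniformly scatter Gaussian centers inside the perception range."""
    rng = np.random.default_rng(0) if rng is None else rng
    lo, hi = np.asarray(pc_range[:3], float), np.asarray(pc_range[3:], float)
    return rng.uniform(lo, hi, size=(num_gaussians, 3))

def init_means_from_points(num_gaussians, points, rng=None):
    """Sample Gaussian centers from a (pseudo or ground-truth) point cloud,
    so that Gaussians start near observed surfaces."""
    rng = np.random.default_rng(0) if rng is None else rng
    idx = rng.choice(len(points), size=num_gaussians,
                     replace=len(points) < num_gaussians)
    return points[idx]

# Example: nuScenes-style range [-50, -50, -5, 50, 50, 3] with 51200 Gaussians.
means = init_means_uniform(51200, [-50, -50, -5, 50, 50, 3])
```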

Table 6: Analysis on different design choices.

| Gaussian Number | Photometric | Split & Prune | Initialization | mIoU | IoU |
| :---: | :---: | :---: | :---: | :---: | :---: |
| 144000 | $\times$ | $\times$ | learnable | 19.10 | 29.83 |
| 192000 | $\times$ | $\times$ | learnable | 19.65 | 30.37 |
| 256000 | $\times$ | $\times$ | learnable | $\mathbf{19.76}$ | $\mathbf{30.51}$ |
| 51200 | $\times$ | $\times$ | learnable | 16.41 | 29.37 |
| 51200 | $\checkmark$ | $\times$ | learnable | 16.24 | $\mathbf{29.43}$ |
| 51200 | $\times$ | $\checkmark$ | learnable | $\mathbf{16.50}$ | 29.41 |
| 51200 | $\times$ | $\times$ | uniform | 16.27 | 29.23 |
| 51200 | $\times$ | $\times$ | pseudo points | 18.99 | 28.84 |
| 51200 | $\times$ | $\times$ | G.T. points | $\mathbf{26.78}$ | $\mathbf{41.81}$ |

References

1. Barron, J.T., Mildenhall, B., Tancik, M., Hedman, P., Martin-Brualla, R., Srinivasan, P.P.: Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields. In: ICCV. pp. 5855-5864 (2021) 4
2. Berman, M., Triki, A.R., Blaschko, M.B.: The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In: CVPR. pp. 4413-4421 (2018) 9
3. Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: CVPR (2020) 2, 9, 10, 15
4. Cao, A.Q., de Charette, R.: Monoscene: Monocular 3d semantic scene completion. In: CVPR. pp. 3991-4001 (2022) 2, 3, 5, 10, 11
5. Charatan, D., Li, S., Tagliasacchi, A., Sitzmann, V.: pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. arXiv preprint arXiv:2312.12337 (2023) 4
6. Chen, X., Lin, K.Y., Qian, C., Zeng, G., Li, H.: 3d sketch-aware semantic scene completion via semi-supervised structure prior. In: CVPR (2020) 3
7. Chen, Z., Wang, F., Liu, H.: Text-to-3d using gaussian splatting. arXiv preprint arXiv:2309.16585 (2023) 4
8. Dalal, S., Hall, W.: Approximating priors by mixtures of natural conjugate priors. Journal of the Royal Statistical Society 45(2), 278-286 (1983) 1, 3, 5, 8
9. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: CVPR. pp. 248-255. IEEE (2009) 10
10. Fridovich-Keil, S., Yu, A., Tancik, M., Chen, Q., Recht, B., Kanazawa, A.: Plenoxels: Radiance fields without neural networks. In: CVPR. pp. 5501-5510 (2022) 4
11. Goodfellow, I., Bengio, Y., Courville, A.: Deep learning (2016) 1, 3, 5, 8
12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016) 10
13. Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., et al.: Goal-oriented autonomous driving. arXiv preprint arXiv:2212.10156 (2022) 2
14. Huang, J., Huang, G., Zhu, Z., Du, D.: Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. arXiv preprint arXiv:2112.11790 (2021) 3, 5
15. Huang, Y., Zheng, W., Zhang, B., Zhou, J., Lu, J.: Selfocc: Self-supervised vision-based 3d occupancy prediction. In: CVPR (2024) 2
16. Huang, Y., Zheng, W., Zhang, Y., Zhou, J., Lu, J.: Tri-perspective view for vision-based 3d semantic occupancy prediction. In: CVPR. pp. 9223-9232 (2023) 2, 5, 9, 11, 12
17. Jiang, B., Chen, S., Xu, Q., Liao, B., Chen, J., Zhou, H., Zhang, Q., Liu, W., Huang, C., Wang, X.: Vad: Vectorized scene representation for efficient autonomous driving. arXiv preprint arXiv:2303.12077 (2023) 2
18. Jiang, H., Cheng, T., Gao, N., Zhang, H., Liu, W., Wang, X.: Symphonize 3d semantic scene completion with contextual instance queries. arXiv preprint arXiv:2306.15670 (2023) 2, 3, 11
19. Jiang, Y., Zhang, L., Miao, Z., Zhu, X., Gao, J., Hu, W., Jiang, Y.G.: Polarformer: Multi-camera 3d object detection with polar transformer. In: AAAI. vol. 37, pp. 1042-1050 (2023) 3
20. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics 42(4) (2023) 3, 4
21. Li, J., Han, K., Wang, P., Liu, Y., Yuan, X.: Anisotropic convolutional networks for 3d semantic scene completion. In: CVPR (2020) 3
22. Li, Y., Li, S., Liu, X., Gong, M., Li, K., Chen, N., Wang, Z., Li, Z., Jiang, T., Yu, F., et al.: Sscbench: A large-scale 3d semantic scene completion benchmark for autonomous driving. arXiv preprint arXiv:2306.09001 (2023) 10
23. Li, Y., Yu, Z., Choy, C., Xiao, C., Alvarez, J.M., Fidler, S., Feng, C., Anandkumar, A.: Voxformer: Sparse voxel transformer for camera-based 3d semantic scene completion. In: CVPR. pp. 9087-9098 (2023) 2, 3, 5, 11
24. Li, Y., Bao, H., Ge, Z., Yang, J., Sun, J., Li, Z.: Bevstereo: Enhancing depth estimation in multi-view 3d object detection with temporal stereo. In: AAAI. vol. 37, pp. 1486-1494 (2023) 3
25. Li, Y., Ge, Z., Yu, G., Yang, J., Wang, Z., Shi, Y., Sun, J., Li, Z.: Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. arXiv preprint arXiv:2206.10092 (2022) 2, 3
26. Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Yu, Q., Dai, J.: Bevformer: Learning bird's-eye-view representation from multi-camera images via spatiotemporal transformers. arXiv preprint arXiv:2203.17270 (2022) 2, 3, 5, 11, 12
27. Li, Z., Yu, Z., Austin, D., Fang, M., Lan, S., Kautz, J., Alvarez, J.M.: Fb-occ: 3d occupancy prediction based on forward-backward view transformation. arXiv preprint arXiv:2307.01492 (2023) 2, 4, 12
28. Liang, Y., Yang, X., Lin, J., Li, H., Xu, X., Chen, Y.: Luciddreamer: Towards high-fidelity text-to-3d generation via interval score matching. arXiv preprint arXiv:2311.11284 (2023) 4
29. Liao, Y., Xie, J., Geiger, A.: KITTI-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d. PAMI (2022) 9, 10, 15, 16
30. Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR (2017) 10
31. Liu, Z., Tang, H., Amini, A., Yang, X., Mao, H., Rus, D.L., Han, S.: Bevfusion: Multi-task multi-sensor fusion with unified bird's-eye view representation. In: ICRA. pp. 2774-2781. IEEE (2023) 3
32. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017) 10
33. Lu, Y., Zhu, X., Wang, T., Ma, Y.: Octreeocc: Efficient and multi-granularity occupancy prediction using octree queries. arXiv preprint arXiv:2312.03774 (2023) 2, 3, 12
34. Luiten, J., Kopanas, G., Leibe, B., Ramanan, D.: Dynamic 3d gaussians: Tracking by persistent dynamic view synthesis. arXiv preprint arXiv:2308.09713 (2023) 4
35. Miao, R., Liu, W., Chen, M., Gong, Z., Xu, W., Hu, C., Zhou, S.: Occdepth: A depth-aware method for 3d semantic scene completion. arXiv preprint arXiv:2302.13540 (2023) 2, 3
36. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM 65(1), 99-106 (2021) 4
37. Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ToG 41(4), 1-15 (2022) 4
38. Murez, Z., Van As, T., Bartolozzi, J., Sinha, A., Badrinarayanan, V., Rabinovich, A.: Atlas: End-to-end 3d scene reconstruction from posed images. In: ECCV. pp. 414-431. Springer (2020) 11
39. Park, J., Xu, C., Yang, S., Keutzer, K., Kitani, K., Tomizuka, M., Zhan, W.: Time will tell: New outlooks and a baseline for temporal multi-view 3d object detection. arXiv preprint arXiv:2210.02443 (2022) 3
40. Philion, J., Fidler, S.: Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In: ECCV (2020) 3
41. Qiao, Y.L., Gao, A., Xu, Y., Feng, Y., Huang, J.B., Lin, M.C.: Dynamic mesh-aware radiance fields. In: ICCV. pp. 385-396 (2023) 4
42. Rakotosaona, M.J., Manhardt, F., Arroyo, D.M., Niemeyer, M., Kundu, A., Tombari, F.: Nerfmeshing: Distilling neural radiance fields into geometrically-accurate 3d meshes. arXiv preprint arXiv:2303.09431 (2023) 4
43. Roldão, L., de Charette, R., Verroust-Blondet, A.: Lmscnet: Lightweight multiscale 3d semantic completion. In: ThreeDV (2020) 3, 11
44. Song, S., Yu, F., Zeng, A., Chang, A.X., Savva, M., Funkhouser, T.: Semantic scene completion from a single depth image. In: CVPR. pp. 1746-1754 (2017) 11
45. Tang, J., Ren, J., Zhou, H., Liu, Z., Zeng, G.: Dreamgaussian: Generative gaussian splatting for efficient 3d content creation. arXiv preprint arXiv:2309.16653 (2023) 4
46. Tong, W., Sima, C., Wang, T., Chen, L., Wu, S., Deng, H., Gu, Y., Lu, L., Luo, P., Lin, D., et al.: Scene as occupancy. In: ICCV. pp. 8406-8415 (2023) 2
47. Wang, T., Zhu, X., Pang, J., Lin, D.: Fcos3d: Fully convolutional one-stage monocular 3d object detection. In: ICCV. pp. 913-922 (2021) 10
48. Wang, X., Zhu, Z., Xu, W., Zhang, Y., Wei, Y., Chi, X., Ye, Y., Du, D., Lu, J., Wang, X.: Openoccupancy: A large scale benchmark for surrounding semantic occupancy perception. arXiv preprint arXiv:2303.03991 (2023) 5
49. Wang, Y., Chen, Y., Liao, X., Fan, L., Zhang, Z.: Panoocc: Unified occupancy representation for camera-based 3d panoptic segmentation. arXiv preprint arXiv:2306.10013 (2023) 3, 12
50. Wei, Y., Zhao, L., Zheng, W., Zhu, Z., Zhou, J., Lu, J.: Surroundocc: Multi-camera 3d occupancy prediction for autonomous driving. In: ICCV. pp. 21729-21740 (2023) 2, 3, 5, 10, 11, 15
51. Yan, X., Gao, J., Li, J., Zhang, R., Li, Z., Huang, R., Cui, S.: Sparse single sweep lidar point cloud segmentation via learning contextual shape priors from scene completion. In: AAAI. vol. 35, pp. 3101-3109 (2021) 3
52. Yang, B., Bao, C., Zeng, J., Bao, H., Zhang, Y., Cui, Z., Zhang, G.: Neumesh: Learning disentangled neural mesh-based implicit field for geometry and texture editing. In: ECCV. pp. 597-614. Springer (2022) 4
53. Yang, C., Chen, Y., Tian, H., Tao, C., Zhu, X., Zhang, Z., Huang, G., Li, H., Qiao, Y., Lu, L., et al.: Bevformer v2: Adapting modern image backbones to bird's-eye-view recognition via perspective supervision. arXiv preprint arXiv:2211.10439 (2022) 3
54. Yang, Z., Gao, X., Zhou, W., Jiao, S., Zhang, Y., Jin, X.: Deformable 3d gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101 (2023) 4
55. Yi, T., Fang, J., Wu, G., Xie, L., Zhang, X., Liu, W., Tian, Q., Wang, X.: Gaussiandreamer: Fast generation from text to 3d gaussian splatting with point cloud priors. arXiv preprint arXiv:2310.08529 (2023) 4
56. Yu, Z., Shu, C., Deng, J., Lu, K., Liu, Z., Yu, J., Yang, D., Li, H., Chen, Y.: Flashocc: Fast and memory-efficient occupancy prediction via channel-to-height plugin. arXiv preprint arXiv:2311.12058 (2023) 2, 4
57. Zhang, Y., Zhu, Z., Du, D.: Occformer: Dual-path transformer for vision-based 3d semantic occupancy prediction. arXiv preprint arXiv:2304.05316 (2023) 2, 3, 11
58. Zheng, W., Chen, W., Huang, Y., Zhang, B., Duan, Y., Lu, J.: Occworld: Learning a 3d occupancy world model for autonomous driving. arXiv preprint arXiv:2311.16038 (2023) 2
59. Zhou, Y., Tuzel, O.: Voxelnet: End-to-end learning for point cloud based 3d object detection. In: CVPR. pp. 4490-4499 (2018) 3
60. Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020) 8
61. Zou, Z.X., Yu, Z., Guo, Y.C., Li, Y., Liang, D., Cao, Y.P., Zhang, S.H.: Triplane meets gaussian splatting: Fast and generalizable single-view 3d reconstruction with transformers. arXiv preprint arXiv:2312.09147 (2023) 4
