
BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection

Junjie Huang*  Guan Huang
PhiGent Robotics
junjie.huang@ieee.org, guan.huang@phigent.ai

Abstract

Single frame data contains finite information, which limits the performance of the existing vision-based multi-camera 3D object detection paradigms. To fundamentally push the performance boundary in this area, a novel paradigm dubbed BEVDet4D is proposed to lift the scalable BEVDet paradigm from the spatial-only 3D space to the spatial-temporal 4D space. We upgrade the naive BEVDet framework with a few modifications just for fusing the feature from the previous frame with the corresponding one in the current frame. In this way, with a negligible additional computing budget, we enable BEVDet4D to access the temporal cues by querying and comparing the two candidate features. Beyond this, we simplify the task of velocity prediction by removing the factors of ego-motion and time from the learning target. As a result, BEVDet4D with robust generalization performance reduces the velocity error by up to 62.9%. This makes the vision-based methods, for the first time, become comparable with those relying on LiDAR or radar in this aspect. On the challenging benchmark nuScenes, we report a new record of 54.5% NDS with the high-performance configuration dubbed BEVDet4D-Base, which surpasses the previous leading method BEVDet-Base by +7.3% NDS.

1. Introduction

Recently, autonomous driving has drawn great attention in both the research and industry communities. The perception tasks in this scene include 3D object detection, BEV semantic segmentation, motion prediction, and so on. Most of them can be partly solved in the spatial-only 3D space with a single frame of data. However, with respect to time-relevant targets like velocity, current vision-based paradigms with merely a single frame of data perform far poorer than those with sensors like LiDAR or radar. For example, the velocity error of the recently leading method BEVDet [14] in vision-based 3D object detection is 3 times
Figure 1: The inference speed and performance of different paradigms on the nuScenes val set.

that of the LiDAR-based method CenterPoint [44] and 2 times that of the radar-based method CenterFusion [25]. To close this gap, we propose a novel paradigm dubbed BEVDet4D in this paper and pioneer the exploitation of vision-based autonomous driving in the spatial-temporal 4D space.
As illustrated in Fig. 2, BEVDet4D makes the first attempt at accessing the rich information in the temporal domain. It simply extends the naive BEVDet [14] by retaining the intermediate BEV features of the previous frame. Then it fuses the retained feature with the corresponding one in the current frame just by a spatial alignment operation and a concatenation operation. Other than that, we keep most details of the framework unchanged. In this way, we place just a negligible extra computational budget on the inference process while enabling the paradigm to access the temporal cues by querying and comparing the two candidate features. Though the framework of BEVDet4D is simple to construct, it is nontrivial to achieve robust performance with it. The spatial alignment operation and the learning targets should be carefully designed to cooperate with the elegant framework so that the velocity prediction task can be simplified and superior generalization performance can be achieved with BEVDet4D.

Figure 2: The framework of the proposed BEVDet4D paradigm. BEVDet4D retains the intermediate BEV feature of the previous frame and concatenates it with the one generated by the current frame. Before that, spatial alignment in the flat plane is conducted to partially simplify the velocity prediction task.

We conduct comprehensive experiments on the challenging benchmark nuScenes [2] to verify the feasibility of BEVDet4D and discuss its characteristics. Fig. 1 illustrates the trade-off between inference speed and performance of different paradigms. Without bells and whistles, the BEVDet4D-Tiny configuration reduces the velocity error by 62.9%, from 0.909 mAVE to 0.337 mAVE. Besides, the proposed paradigm also offers significant improvements in the other indicators like detection score (+2.6% mAP), orientation error (-12.0% mAOE), and attribute error (-25.1% mAAE). As a result, BEVDet4D-Tiny exceeds the baseline by +8.4% on the composite indicator NDS. The high-performance configuration dubbed BEVDet4D-Base scores as high as 42.1% mAP and 54.5% NDS, which surpasses all published results. Last but not least, BEVDet4D achieves the aforementioned superiority at just a negligible cost in inference latency, which is meaningful in the scenario of autonomous driving.

2. Related Work

2.1. Vision-based 3D Object Detection

3D object detection is a pivotal perception task in autonomous driving. In the last few years, fueled by the KITTI [10] benchmark, monocular 3D object detection has witnessed rapid development [24, 21, 45, 51, 47, 29, 37, 36, 15]. However, the limited data and the single view prevent it from being extended to more complicated tasks. Recently, some large-scale benchmarks [2, 33] have been proposed with sufficient data and surrounding views, offering new perspectives toward paradigm development in the field of 3D object detection.

Based on these benchmarks, some multi-camera 3D object detection paradigms have been developed with competitive performance. For example, inspired by the success of FCOS [34] in 2D detection, FCOS3D [38] treats the 3D object detection problem as a 2D object detection problem and conducts perception just in the image view. Benefiting from the strong spatial correlation between the targets' attributes and the image appearance, it works well in predicting attributes but is relatively poor at perceiving the targets' translation, velocity, and orientation. PGD [39] further develops the FCOS3D paradigm by searching for and resolving its outstanding shortcoming (i.e., the prediction of the targets' depth). This offers a remarkable accuracy improvement over the baseline, but at the cost of a larger computational budget and additional inference latency. Following DETR [3], DETR3D [40] proposes to detect 3D objects in an attention pattern, which has similar accuracy to FCOS3D. Although DETR3D requires just half the computational budget, the complex calculation pipeline slows down its inference speed to the same level as FCOS3D. PETR [20] further improves the performance of this paradigm by introducing 3D coordinate generation and position encoding. Besides, it also exploits strong data augmentation strategies, just as BEVDet [14] does. As a novel paradigm, BEVDet [14] makes the first attempt at applying a strong data augmentation strategy in vision-based 3D object detection. As it explicitly encodes features in the BEV space, it

is scalable in multiple aspects, including multi-task learning, multi-sensor fusion, and temporal fusion. BEVDet4D is the temporal extension of BEVDet [14].
So far, few works have exploited the temporal cues in vision-based 3D object detection. Thus, the existing paradigms [38, 39, 40, 20, 14] perform relatively poorly in predicting time-relevant targets like velocity compared with the LiDAR-based [44] or radar-based [25] methods. To the best of our knowledge, [1] is the only work from this perspective. However, it predicts the results based on a single frame and exploits a 3D Kalman filter to update the results for temporal consistency across the image sequence. The temporal cues are thus exploited in the post-processing phase instead of in the end-to-end learning framework. In contrast, we make the first attempt at exploiting the temporal cues in the end-to-end learning framework BEVDet4D, which is elegant, powerful, and still scalable.

2.2. Object Detection in Video

Video object detection, mainly fueled by the ImageNet VID dataset [31], is analogous to the well-known task of common object detection [17], which performs and evaluates object detection in the image-view space. The difference is that detecting objects in video can access the temporal cues for improving detection accuracy. The methods in this area access the temporal cues mainly through two kinds of mediums: the prediction results or the intermediate features. The former [43] is analogous to [1] in 3D object detection, which optimizes the prediction results in a tracking pattern. The latter reutilizes the features from the previous frame based on some special architectures, like LSTM [13] for feature distillation [19, 18, 23], the attention mechanism [35] for feature querying [5, 7, 8], and optical flow [9] for feature alignment [50, 49]. Specific to the scenario of autonomous driving, BEVDet4D is analogous to the flow-based methods in mechanism, but it accesses the spatial correlation according to the ego-motion and conducts feature aggregation in the 3D space. Besides, BEVDet4D mainly focuses on the prediction of the velocity targets, which is not within the scope of the common video object detection literature.

3. Methodology

3.1. Network Structure

As illustrated in Fig. 2, the overall framework of BEVDet4D is built upon the BEVDet [14] baseline, which consists of four kinds of modules: an image-view encoder, a view transformer, a BEV encoder, and a task-specific head. All implementation details of these modules are kept unchanged. To exploit the temporal cues, BEVDet4D extends the baseline by retaining the BEV features generated by the view transformer in the previous frame.

Then the retained feature is merged with the one in the current frame. Before that, an alignment operation is conducted to simplify the learning targets, which will be detailed in the following subsection. We apply a simple concatenation operation to merge the features for verifying the BEVDet4D paradigm. More complicated fusing strategies have not been exploited in this paper.
Besides, the feature generated by the view transformer is sparse, which is too coarse for the subsequent modules to exploit temporal cues. An extra BEV encoder is thus applied to adjust the candidate features before the temporal fusion. In practice, the extra BEV encoder consists of two naive residual units [12], whose channel number is set the same as that of the input feature.
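To make the data flow concrete, below is a minimal PyTorch-style sketch of this fusion step. It is our own illustration rather than the released BEVDet4D code; the class names (`ResidualUnit`, `TemporalFusion`), the placement of BatchNorm/ReLU, and the channel bookkeeping are assumptions, and the alignment of the previous feature is sketched separately at the end of Section 3.2.

```python
import torch
import torch.nn as nn


class ResidualUnit(nn.Module):
    """A naive residual unit: two 3x3 convs with a skip connection."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)


class TemporalFusion(nn.Module):
    """Adjust the sparse BEV feature, then fuse it with the aligned feature
    retained from the previous frame by simple concatenation."""

    def __init__(self, channels):
        super().__init__()
        # Extra BEV encoder: two residual units, channel number unchanged.
        self.extra_bev_encoder = nn.Sequential(
            ResidualUnit(channels), ResidualUnit(channels)
        )

    def forward(self, bev_curr, bev_prev_aligned):
        # bev_curr:         (B, C, H, W) sparse BEV feature of the current frame.
        # bev_prev_aligned: (B, C, H, W) adjusted BEV feature of the previous
        #                   frame, already warped into the current ego frame
        #                   (see the alignment sketch in Section 3.2).
        curr = self.extra_bev_encoder(bev_curr)
        fused = torch.cat([curr, bev_prev_aligned], dim=1)  # (B, 2C, H, W)
        # `curr` is retained to serve as the previous-frame feature next time.
        return fused, curr
```

In this sketch the concatenated 2C-channel feature would then enter the unchanged BEV encoder and task-specific head of the BEVDet baseline, matching the data flow in Fig. 2.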

3.2. Simplify the Velocity Learning Task

Symbol Definition Following nuScenes [2], we denote the global coordinate system as $O_{g}\text{-}XYZ$, the ego coordinate system as $O_{e(T)}\text{-}XYZ$, and the targets' coordinate system as $O_{t(T)}\text{-}XYZ$. As illustrated in Fig. 3, we construct a virtual scene with a moving ego vehicle and two target vehicles. One of the targets is static in the global coordinate system (i.e., $O_{s}\text{-}XYZ$, painted green), while the other one is moving (i.e., $O_{m}\text{-}XYZ$, painted blue). The objects in two adjacent frames (i.e., frame $T-1$ and frame $T$) are distinguished with different transparencies. The position of an object is formulated as $\mathbf{P}^{x}(t)$, where $x \in \{g, e(T), e(T-1)\}$ denotes the coordinate system in which the position is defined, and $t \in \{T, T-1\}$ denotes the time when the position is recorded. We use $\mathbf{T}_{src}^{dst}$ to denote the transformation from the source coordinate system into the destination coordinate system.
Instead of directly predicting the velocity of the targets, we tend to predict the translation of the targets between the two adjacent frames. In this way, the learning task can be simplified, as the time factor is removed and the positional shift can be measured just according to the difference between the two BEV features. Besides, we tend to learn the positional shift that is irrelevant to the ego-motion. In this way, the learning task can also be simplified, as the ego-motion would make the distribution of the targets' positional shift more complicated.
For example, a static object (i.e., the green box in Fig. 3) in the global coordinate system is changed into a moving object in the ego coordinate system due to the ego-motion. More specifically, the receptive field of the BEV features is symmetrically defined around the ego. Considering the two features generated by the view transformer in the two adjacent frames, their receptive fields in the global coordinate system also differ due to the ego-motion. Given a static object, its position in the global coordinate system is denoted as $\mathbf{P}_{s}^{g}(T)$ and $\mathbf{P}_{s}^{g}(T-1)$ in the two adjacent frames. Their positions in the candidate features are

Figure 3: Illustrating the effect of the feature alignment operation. Without the alignment operation (i.e., the first row), the following modules are required to learn a more complicated distribution of the object motion, which is relevant to the ego-motion. By applying the alignment operation as in the second row, the learning targets can be simplified.

different:
$$
\begin{aligned}
\mathbf{P}_{s}^{e(T)}(T)-\mathbf{P}_{s}^{e(T-1)}(T-1)
&= \mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T)-\mathbf{T}_{g}^{e(T-1)} \mathbf{P}_{s}^{g}(T-1) \\
&= \mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T)-\mathbf{T}_{e(T)}^{e(T-1)} \mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T-1)
\end{aligned}
\tag{1}
$$
According to Eq. 1, if we directly concatenate the two features, the learning target (i.e., the positional shift of the target object) of the following modules will be relevant to the ego-motion (i.e., $\mathbf{T}_{e(T)}^{e(T-1)}$). To avoid this, we shift the feature from the adjacent frame by $\mathbf{T}_{e(T-1)}^{e(T)}$ to remove the ego-motion component:
$$
\begin{aligned}
\mathbf{P}_{s}^{e(T)}(T)-\mathbf{T}_{e(T-1)}^{e(T)} \mathbf{P}_{s}^{e(T-1)}(T-1)
&= \mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T)-\mathbf{T}_{e(T-1)}^{e(T)} \mathbf{T}_{e(T)}^{e(T-1)} \mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T-1) \\
&= \mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T)-\mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T-1) \\
&= \mathbf{P}_{s}^{e(T)}(T)-\mathbf{P}_{s}^{e(T)}(T-1)
\end{aligned}
\tag{2}
$$
According to Eq. 2, the learning target is set as the object motion in the current frame's ego coordinate system, which is irrelevant to the ego-motion.
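As a concrete illustration (our own toy example, with numbers chosen only for clarity), consider a static object 10 m ahead of the ego in global coordinates while the ego translates 2 m along $X$ between the two frames, so that $\mathbf{P}_{s}^{g}(T-1)=\mathbf{P}_{s}^{g}(T)=(10,0)$, $\mathbf{P}_{s}^{e(T-1)}(T-1)=(10,0)$, and $\mathbf{P}_{s}^{e(T)}(T)=(8,0)$:

$$
\begin{aligned}
\text{Eq. 1 (no alignment):}\quad
& \mathbf{P}_{s}^{e(T)}(T)-\mathbf{P}_{s}^{e(T-1)}(T-1) = (8,0)-(10,0) = (-2,0), \\
\text{Eq. 2 (aligned):}\quad
& \mathbf{P}_{s}^{e(T)}(T)-\mathbf{T}_{e(T-1)}^{e(T)}\mathbf{P}_{s}^{e(T-1)}(T-1) = (8,0)-(8,0) = (0,0).
\end{aligned}
$$

Without the alignment, the static object inherits the negated ego translation as its learning target; with the alignment, its target is zero, as expected.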
In practice, the alignment operation in Eq. 2 is achieved by feature alignment. Given the candidate features of the

previous frame $\mathcal{F}\left(T-1, \mathbf{P}^{e(T-1)}\right)$ and the current frame $\mathcal{F}\left(T, \mathbf{P}^{e(T)}\right)$, the aligned feature can be obtained by:
$$
\mathcal{F}^{\prime}\left(T-1, \mathbf{P}^{e(T)}\right)=\mathcal{F}\left(T-1, \mathbf{T}_{e(T)}^{e(T-1)} \mathbf{P}^{e(T)}\right)
\tag{3}
$$
Along with Eq. 3, bilinear interpolation is applied, as $\mathbf{T}_{e(T)}^{e(T-1)} \mathbf{P}^{e(T)}$ may not be a valid position in the sparse feature $\mathcal{F}\left(T-1, \mathbf{P}^{e(T-1)}\right)$. The interpolation is a suboptimal method that leads to precision degeneration, whose magnitude is negatively correlated with the resolution of the BEV features. A more precise method is to adjust the coordinates of the point cloud generated by the lifting operation in the view transformer [28]. However, it is deprecated in this paper as it would destroy the precondition of the acceleration method proposed in the naive BEVDet [14]. The magnitude of the precision degeneration is quantitatively estimated in the ablation study in Section 4.3.2.
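In code, Eq. 3 amounts to warping the previous BEV feature onto the current frame's grid with bilinear sampling. The sketch below shows one way to do this with `torch.nn.functional.grid_sample`; the metric BEV range, the convention that the 4x4 matrix `T_curr2prev` maps current-frame ego coordinates into previous-frame ego coordinates (i.e., $\mathbf{T}_{e(T)}^{e(T-1)}$), the function name `align_prev_to_curr`, and the assumption that BEV $x$ lies along the feature width are ours, not fixed by the paper.

```python
import torch
import torch.nn.functional as F


def align_prev_to_curr(bev_prev, T_curr2prev, bev_range=(-51.2, 51.2)):
    """Warp the previous BEV feature into the current ego frame (Eq. 3).

    bev_prev:     (B, C, H, W) BEV feature of frame T-1, defined on a grid
                  centered at the previous ego position.
    T_curr2prev:  (B, 4, 4) homogeneous transform mapping current-frame ego
                  coordinates into previous-frame ego coordinates.
    bev_range:    metric extent of the BEV grid along x and y (assumed square).
    """
    B, C, H, W = bev_prev.shape
    lo, hi = bev_range

    # Metric (x, y) coordinates of every cell of the *current* BEV grid.
    ys, xs = torch.meshgrid(
        torch.linspace(lo, hi, H, device=bev_prev.device),
        torch.linspace(lo, hi, W, device=bev_prev.device),
        indexing="ij",
    )
    ones = torch.ones_like(xs)
    zeros = torch.zeros_like(xs)
    pts = torch.stack([xs, ys, zeros, ones], dim=-1).view(1, -1, 4)  # (1, H*W, 4)

    # Map current-frame grid points into the previous ego frame.
    pts_prev = (T_curr2prev @ pts.transpose(1, 2)).transpose(1, 2)   # (B, H*W, 4)

    # Normalize metric coordinates to [-1, 1] for grid_sample.
    gx = (pts_prev[..., 0] - lo) / (hi - lo) * 2 - 1
    gy = (pts_prev[..., 1] - lo) / (hi - lo) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)

    # Bilinear interpolation; cells outside the previous grid become zero.
    return F.grid_sample(bev_prev, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

The zero padding mirrors the fact that regions outside the previous frame's receptive field carry no information; cells that fall there simply contribute empty features to the concatenation in the fusion sketch of Section 3.1.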

4. Experiment

4.1. Experimental Settings

Dataset We conduct comprehensive experiments on a large-scale dataset, nuScenes [2]. The nuScenes dataset includes
Table 1: Comparison of different paradigms on the nuScenes val set. † initialized from a FCOS3D backbone. § with test-time augmentation. # with model ensemble.
| Methods | Image Size | #param. | GFLOPs | Modality | mAP↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | NDS↑ | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CenterFusion [25] | 800×450 | 20.4M | - | Camera & Radar | 0.332 | 0.649 | 0.263 | 0.535 | 0.540 | 0.142 | 0.453 | - |
| VoxelNet [44] | - | 8.8M | - | LiDAR | 0.563 | 0.292 | 0.253 | 0.316 | 0.287 | 0.191 | 0.648 | 14.1 |
| PointPillar [44] | - | 6.0M | - | LiDAR | 0.487 | 0.315 | 0.260 | 0.368 | 0.323 | 0.203 | 0.597 | 17.9 |
| CenterNet [46] | - | - | - | Camera | 0.306 | 0.716 | 0.264 | 0.609 | 1.426 | 0.658 | 0.328 | - |
| FCOS3D [38] | 1600×900 | 52.5M | 2,008.2 | Camera | 0.295 | 0.806 | 0.268 | 0.511 | 1.315 | 0.170 | 0.372 | 1.7 |
| DETR3D [40] | 1600×900 | 51.3M | 1,016.8 | Camera | 0.303 | 0.860 | 0.278 | 0.437 | 0.967 | 0.235 | 0.374 | 2.0 |
| PGD [39] | 1600×900 | 53.6M | 2,223.0 | Camera | 0.335 | 0.732 | 0.263 | 0.423 | 1.285 | 0.172 | 0.409 | 1.4 |
| PETR-R50 [20] | 1056×384 | - | - | Camera | 0.313 | 0.768 | 0.278 | 0.564 | 0.923 | 0.225 | 0.381 | 10.7 |
| PETR-R101 [20] | 1408×512 | - | - | Camera | 0.357 | 0.710 | 0.270 | 0.490 | 0.885 | 0.224 | 0.421 | 3.4 |
| PETR-Tiny [20] | 1408×512 | - | - | Camera | 0.361 | 0.732 | 0.273 | 0.497 | 0.808 | 0.185 | 0.431 | - |
| BEVDet-Tiny [14] | 704×256 | 52.6M | 215.3 | Camera | 0.312 | 0.691 | 0.272 | 0.523 | 0.909 | 0.247 | 0.392 | 15.6 |
| BEVDet-Base [14] | 1600×640 | 126.6M | 2,962.6 | Camera | 0.393 | 0.608 | 0.259 | 0.366 | 0.822 | 0.191 | 0.472 | 1.9 |
| BEVDet4D-Tiny | 704×256 | 53.6M | 222.0 | Camera | 0.338 | … | … | … | … | … | … | … |