
BEVDet4D: Exploit Temporal Cues in Multi-camera 3D Object Detection

Junjie Huang*, Guan Huang
PhiGent Robotics
junjie.huang@ieee.org, guan.huang@phigent.ai

Abstract

Single frame data contains finite information, which limits the performance of the existing vision-based multi-camera 3D object detection paradigms. To fundamentally push the performance boundary in this area, a novel paradigm dubbed BEVDet4D is proposed to lift the scalable BEVDet paradigm from the spatial-only 3D space to the spatial-temporal 4D space. We upgrade the naive BEVDet framework with a few modifications just for fusing the feature from the previous frame with the corresponding one in the current frame. In this way, with negligible additional computing budget, we enable BEVDet4D to access the temporal cues by querying and comparing the two candidate features. Beyond this, we simplify the task of velocity prediction by removing the factors of ego-motion and time in the learning target. As a result, BEVDet4D with robust generalization performance reduces the velocity error by up to -62.9%. This makes the vision-based methods, for the first time, become comparable with those relying on LiDAR or radar in this aspect. On the challenging benchmark nuScenes, we report a new record of 54.5% NDS with the high-performance configuration dubbed BEVDet4D-Base, which surpasses the previous leading method BEVDet-Base by +7.3% NDS.

1. Introduction

Recently, autonomous driving has drawn great attention in both the research and the industry community. The perception tasks in this scene include 3D object detection, BEV semantic segmentation, motion prediction, and so on. Most of them can be partly solved in the spatial-only 3D space with a single frame of data. However, with respect to time-relevant targets like velocity, current vision-based paradigms with merely a single frame of data perform far more poorly than those with sensors like LiDAR or radar. For example, the velocity error of the recently leading method BEVDet [14] in vision-based 3D object detection is 3 times
Figure 1: The inference speed and performance of different paradigms on the nuScenes val set.

that of the LiDAR-based method CenterPoint [44] and 2 times that of the radar-based method CenterFusion [25]. To close this gap, we propose a novel paradigm dubbed BEVDet4D in this paper and pioneer the exploitation of vision-based autonomous driving in the spatial-temporal 4D space.
As illustrated in Fig. 2, BEVDet4D makes the first attempt at accessing the rich information in the temporal domain. It simply extends the naive BEVDet [14] by retaining the intermediate BEV features of the previous frames. Then it fuses the retained feature with the corresponding one in the current frame just by a spatial alignment operation and a concatenation operation. Other than that, we keep most other details of the framework unchanged. In this way, we place just a negligible extra computational budget on the inference process while enabling the paradigm to access the temporal cues by querying and comparing the two candidate features. Though the framework of BEVDet4D is simple to construct, it is nontrivial to build its robust performance. The spatial alignment operation and the learning targets should be carefully designed to cooperate with the elegant framework so that the velocity prediction task can be simplified and superior generalization performance can be achieved with BEVDet4D.
We conduct comprehensive experiments on the

Figure 2: The framework of the proposed BEVDet4D paradigm. BEVDet4D retains the intermediate BEV feature of the previous frame and concatenates it with the one generated in the current frame. Before that, spatial alignment in the flat plane is conducted to partially simplify the velocity prediction task.

challenging benchmark nuScenes [2] to verify the feasibility of BEVDet4D and discuss its characteristics. Fig. 1 illustrates the trade-off between inference speed and performance of different paradigms. Without bells and whistles, the BEVDet4D-Tiny configuration reduces the velocity error by 62.9%, from 0.909 mAVE to 0.337 mAVE. Besides, the proposed paradigm also offers significant improvement in the other indicators like detection score (+2.6% mAP), orientation error (-12.0% mAOE), and attribute error (-25.1% mAAE). As a result, BEVDet4D-Tiny exceeds the baseline by +8.4% on the composite indicator NDS. The high-performance configuration dubbed BEVDet4D-Base scores as high as 42.1% mAP and 54.5% NDS, which has surpassed all published results. Last but not least, BEVDet4D achieves the aforementioned superiority at just a negligible cost in inference latency, which is meaningful in the scenario of autonomous driving.

2. Related Work

2.1. Vision-based 3D object detection

3D object detection is a pivotal perception task in autonomous driving. In the last few years, fueled by the KITTI [10] benchmark, monocular 3D object detection has witnessed rapid development [24, 21, 45, 51, 47, 29, 37, 36, 15]. However, the limited data and the single view prevent it from supporting more complicated tasks. Recently, some large-scale benchmarks [2, 33] have been proposed with sufficient data and surrounding views, offering new perspectives

toward paradigm development in the field of 3D object detection. Based on these benchmarks, some multi-camera 3D object detection paradigms have been developed with competitive performance. For example, inspired by the success of FCOS [34] in 2D detection, FCOS3D [38] treats the 3D object detection problem as a 2D object detection problem and conducts perception just in the image view. Benefitting from the strong spatial correlation of the targets' attributes with the image appearance, it works well in predicting these attributes but is relatively poor in perceiving the targets' translation, velocity, and orientation. PGD [39] further develops the FCOS3D paradigm by searching for and resolving its outstanding shortcoming (i.e., the prediction of the targets' depth). This offers a remarkable accuracy improvement over the baseline, but at the cost of a larger computational budget and additional inference latency. Following DETR [3], DETR3D [40] proposes to detect 3D objects in an attention pattern, which has similar accuracy to FCOS3D. Although DETR3D requires just half the computational budget, the complex calculation pipeline slows down its inference speed to the same level as FCOS3D. PETR [20] further develops the performance of this paradigm by introducing 3D coordinate generation and position encoding. Besides, it also exploits strong data augmentation strategies just as BEVDet [14] does. As a novel paradigm, BEVDet [14] makes the first attempt at applying a strong data augmentation strategy in vision-based 3D object detection. As it explicitly encodes features in the BEV space, it

is scalable in multiple aspects including multi-tasks learning, multi-sensors fusion, and temporal fusion. BEVDet4D is the temporal extension of BEVDet [14].
So far, few works have exploited the temporal cues in vision-based 3D object detection. Thus, the existing paradigms [38, 39, 40, 20, 14] perform relatively poorly in predicting time-relevant targets like velocity compared with LiDAR-based [44] or radar-based [25] methods. To the best of our knowledge, [1] is the only work from this perspective. However, they predict the results based on a single frame and exploit a 3D Kalman filter to update the results for temporal consistency between image sequences. The temporal cues are exploited in the post-processing phase instead of the end-to-end learning framework. Differently, we make the first attempt at exploiting the temporal cues in the end-to-end learning framework BEVDet4D, which is elegant, powerful, and still scalable.

2.2. Object Detection in Video

Video object detection, mainly fueled by the ImageNet VID dataset [31], is analogous to the well-known task of common object detection [17], which performs and evaluates object detection in the image-view space. The difference is that detecting objects in video can access temporal cues for improving detection accuracy. The methods in this area access the temporal cues mainly through two kinds of mediums: the prediction results or the intermediate features. The former [43] is analogous to [1] in 3D object detection, which optimizes the prediction results in a tracking pattern. The latter reutilizes the features from the previous frame based on some special architectures like LSTM [13] for feature distillation [19, 18, 23], attention mechanisms [35] for feature querying [5, 7, 8], and optical flow [9] for feature alignment [50, 49]. Specific to the scene of autonomous driving, BEVDet4D is analogous to the flow-based methods in mechanism but accesses the spatial correlation according to the ego-motion and conducts feature aggregation in the 3D space. Besides, BEVDet4D mainly focuses on the prediction of velocity targets, which is not in the scope of the common video object detection literature.

3. Methodology 3. 方法论

3.1. Network Structure 3.1. 网络结构

As illustrated in Fig. 2, the overall framework of BEVDet4D is built upon the BEVDet [14] baseline, which consists of four kinds of modules: an image-view encoder, a view transformer, a BEV encoder, and a task-specific head. All implementation details of these modules are kept unchanged. To exploit the temporal cues, BEVDet4D extends the baseline by retaining the BEV features generated by the view transformer in the previous frame. Then the retained

feature is merged with the one in the current frame. Before that, an alignment operation is conducted to simplify the learning targets which will be detailed in the following subsection. We apply a simple concatenation operation to merge the features for verifying the BEVDet4D paradigm. More complicated fusing strategies have not been exploited in this paper.
Besides, the feature generated by the view transformer is sparse, which is too coarse for the subsequent modules to exploit temporal cues. An extra BEV encoder is thus applied to adjust the candidate features before the temporal fusion. In practice, the extra BEV encoder consists of two naive residual units [12], whose channel number is set the same as that of the input feature.
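To make the fusion step concrete, the following is a minimal PyTorch-style sketch of the extra BEV encoder (two naive residual units) followed by the concatenation fusion. The module and argument names (`ResidualUnit`, `TemporalFusion`, `curr_bev`, `prev_bev_aligned`, `channels=64`) are illustrative assumptions rather than the official implementation, and the previous feature is assumed to be already spatially aligned.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    """A naive residual unit: two 3x3 convs with BN/ReLU and an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)

class TemporalFusion(nn.Module):
    """Adjust each candidate BEV feature with an extra BEV encoder, then fuse by concatenation."""
    def __init__(self, channels):
        super().__init__()
        # Extra BEV encoder: two residual units, channel number equal to the input feature.
        self.extra_bev_encoder = nn.Sequential(ResidualUnit(channels), ResidualUnit(channels))

    def forward(self, curr_bev, prev_bev_aligned):
        # Both inputs: (B, C, H, W) BEV features; prev_bev_aligned is already
        # warped into the current frame's ego coordinate system.
        curr = self.extra_bev_encoder(curr_bev)
        prev = self.extra_bev_encoder(prev_bev_aligned)
        return torch.cat([curr, prev], dim=1)  # (B, 2C, H, W), passed to the BEV encoder

# Usage (channel number 64 is an assumption):
# fused = TemporalFusion(channels=64)(curr_bev, prev_bev_aligned)
```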

3.2. Simplify the Velocity Learning Task

Symbol Definition Following nuScenes [2], we denote the global coordinate system as $O_{g}\text{-}XYZ$, the ego coordinate system as $O_{e(T)}\text{-}XYZ$, and the targets' coordinate system as $O_{t(T)}\text{-}XYZ$. As illustrated in Fig. 3, we construct a virtual scene with a moving ego vehicle and two target vehicles. One of the targets is static in the global coordinate system (i.e., $O_{s}\text{-}XYZ$, painted green), while the other one is moving (i.e., $O_{m}\text{-}XYZ$, painted blue). The objects in two adjacent frames (i.e., frame $T-1$ and frame $T$) are distinguished with different transparency. The position of an object is formulated as $\mathbf{P}^{x}(t)$, where $x \in \{g, e(T), e(T-1)\}$ denotes the coordinate system in which the position is defined, and $t \in \{T, T-1\}$ denotes the time when the position is recorded. We use $\mathbf{T}_{src}^{dst}$ to denote the transformation from the source coordinate system into the target coordinate system.
Instead of directly predicting the velocity of the targets, we predict the translation of the targets between the two adjacent frames. In this way, the learning task is simplified, as the time factor is removed and the positional shift can be measured just according to the difference between the two BEV features. Besides, we learn the positional shift that is irrelevant to the ego-motion. This also simplifies the learning task, as the ego-motion would make the distribution of the targets' positional shift more complicated.
For example, a static object (i.e., the green box in Fig. 3) in the global coordinate system will become a moving object in the ego coordinate system due to the ego-motion. More specifically, the receptive field of the BEV features is symmetrically defined around the ego. Considering the two features generated by the view transformer in the two adjacent frames, their receptive fields in the global coordinate system are also diverse due to the ego-motion. Given a static object, its position in the global coordinate system is denoted as $\mathbf{P}_{s}^{g}(T)$ and $\mathbf{P}_{s}^{g}(T-1)$ in the two adjacent frames. Their positions in the candidate features are

Figure 3: Illustrating the effect of the feature alignment operation. Without the alignment operation (i.e. the first row), the following modules are required to study a more complicated distribution of the object motion, which is relevant to the ego-motion. By applying alignment operation in the second row, the learning targets can be simplified.

different:
$$
\begin{aligned}
& \mathbf{P}_{s}^{e(T)}(T)-\mathbf{P}_{s}^{e(T-1)}(T-1) \\
=\; & \mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T)-\mathbf{T}_{g}^{e(T-1)} \mathbf{P}_{s}^{g}(T-1) \\
=\; & \mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T)-\mathbf{T}_{e(T)}^{e(T-1)} \mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T-1)
\end{aligned}
\tag{1}
$$
According to Eq. 1, if we directly concatenate the two features, the learning target of the following modules (i.e., the positional shift of the target object) will be relevant to the ego-motion (i.e., $\mathbf{T}_{e(T)}^{e(T-1)}$). To avoid this, we shift the feature from the adjacent frame by $\mathbf{T}_{e(T-1)}^{e(T)}$ to remove the ego-motion component:
$$
\begin{aligned}
& \mathbf{P}_{s}^{e(T)}(T)-\mathbf{T}_{e(T-1)}^{e(T)} \mathbf{P}_{s}^{e(T-1)}(T-1) \\
=\; & \mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T)-\mathbf{T}_{e(T-1)}^{e(T)} \mathbf{T}_{e(T)}^{e(T-1)} \mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T-1) \\
=\; & \mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T)-\mathbf{T}_{g}^{e(T)} \mathbf{P}_{s}^{g}(T-1) \\
=\; & \mathbf{P}_{s}^{e(T)}(T)-\mathbf{P}_{s}^{e(T)}(T-1)
\end{aligned}
\tag{2}
$$
According to Eq. 2, the learning target is set as the object motion in the current frame's ego coordinate system, which is irrelevant to the ego-motion.
In practice, the alignment operation in Eq. 2 is achieved by feature alignment. Given the candidate features of the

previous frame $\mathcal{F}\left(T-1, \mathbf{P}^{e(T-1)}\right)$ and the current frame $\mathcal{F}\left(T, \mathbf{P}^{e(T)}\right)$, the aligned feature can be obtained by:
$$
\mathcal{F}^{\prime}\left(T-1, \mathbf{P}^{e(T)}\right)=\mathcal{F}\left(T-1, \mathbf{T}_{e(T)}^{e(T-1)} \mathbf{P}^{e(T)}\right)
\tag{3}
$$
Along with Eq. 3, bilinear interpolation is applied, as $\mathbf{T}_{e(T)}^{e(T-1)} \mathbf{P}^{e(T)}$ may not be a valid position in the sparse feature $\mathcal{F}\left(T-1, \mathbf{P}^{e(T-1)}\right)$. The interpolation is a suboptimal method that leads to precision degeneration, and the magnitude of this degeneration is negatively correlated with the resolution of the BEV features. A more precise method is to adjust the coordinates of the point cloud generated by the lifting operation in the view transformer [28]. However, it is deprecated in this paper as it would destroy the precondition of the acceleration method proposed in the naive BEVDet [14]. The magnitude of the precision degeneration is quantitatively estimated in the ablation study in Section 4.3.2.
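For concreteness, the alignment of Eq. 3 can be sketched as warping a dense grid of current-frame BEV cell centers by the ego-motion transform and sampling the previous feature bilinearly via `torch.nn.functional.grid_sample`. The function name `align_prev_bev`, the 4x4 homogeneous transform argument, and the BEV range/resolution defaults below are assumptions for illustration, not the official API.

```python
import torch
import torch.nn.functional as F

def align_prev_bev(prev_bev, ego_T_prev_from_curr, bev_range=51.2, bev_res=0.8):
    """Warp the previous frame's BEV feature into the current ego coordinate system.

    prev_bev: (B, C, H, W) BEV feature of frame T-1, defined in ego(T-1) coordinates.
    ego_T_prev_from_curr: (B, 4, 4) transform T_{e(T)}^{e(T-1)} mapping current-ego
        coordinates into previous-ego coordinates, as in Eq. 3.
    """
    B, C, H, W = prev_bev.shape
    device = prev_bev.device

    # Metric (x, y) coordinates of every BEV cell center in the current ego frame.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device), torch.arange(W, device=device), indexing="ij")
    x = (xs.float() + 0.5) * bev_res - bev_range
    y = (ys.float() + 0.5) * bev_res - bev_range
    pts = torch.stack([x, y, torch.zeros_like(x), torch.ones_like(x)], dim=-1)  # (H, W, 4)
    pts = pts.view(1, -1, 4).expand(B, -1, -1)                                  # (B, H*W, 4)

    # Map current-frame cell centers into the previous ego frame.
    pts_prev = pts @ ego_T_prev_from_curr.transpose(1, 2)                       # (B, H*W, 4)

    # Convert metric coordinates to continuous grid indices, then normalize to [-1, 1].
    ix = (pts_prev[..., 0] + bev_range) / bev_res - 0.5
    iy = (pts_prev[..., 1] + bev_range) / bev_res - 0.5
    gx = ix / (W - 1) * 2 - 1
    gy = iy / (H - 1) * 2 - 1
    grid = torch.stack([gx, gy], dim=-1).view(B, H, W, 2)

    # Bilinear interpolation; cells falling outside the previous frame are zero-padded.
    return F.grid_sample(prev_bev, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```

Under this sketch, the bilinear sampling is exactly the source of the precision degeneration discussed above, and the zero padding marks regions outside the previous frame's receptive field.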

4. Experiment

4.1. Experimental Settings

Dataset We conduct comprehensive experiments on a large-scale dataset, nuScenes [2]. nuScenes dataset includes
Table 1: Comparison of different paradigms on the nuScenes val set. † initialized from a FCOS3D backbone. § with test-time augmentation. # with model ensemble.
| Methods | Image Size | #param. | GFLOPs | Modality | mAP↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | NDS↑ | FPS |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| CenterFusion [25] | 800×450 | 20.4 M | - | Camera & Radar | 0.332 | 0.649 | 0.263 | 0.535 | 0.540 | 0.142 | 0.453 | - |
| VoxelNet [44] | - | 8.8 M | - | LiDAR | 0.563 | 0.292 | 0.253 | 0.316 | 0.287 | 0.191 | 0.648 | 14.1 |
| PointPillar [44] | - | 6.0 M | - | LiDAR | 0.487 | 0.315 | 0.260 | 0.368 | 0.323 | 0.203 | 0.597 | 17.9 |
| CenterNet [46] | - | - | - | Camera | 0.306 | 0.716 | 0.264 | 0.609 | 1.426 | 0.658 | 0.328 | - |
| FCOS3D [38] | 1600×900 | 52.5 M | 2,008.2 | Camera | 0.295 | 0.806 | 0.268 | 0.511 | 1.315 | 0.170 | 0.372 | 1.7 |
| DETR3D [40] | 1600×900 | 51.3 M | 1,016.8 | Camera | 0.303 | 0.860 | 0.278 | 0.437 | 0.967 | 0.235 | 0.374 | 2.0 |
| PGD [39] | 1600×900 | 53.6 M | 2,223.0 | Camera | 0.335 | 0.732 | 0.263 | 0.423 | 1.285 | 0.172 | 0.409 | 1.4 |
| PETR-R50 [20] | 1056×384 | - | - | Camera | 0.313 | 0.768 | 0.278 | 0.564 | 0.923 | 0.225 | 0.381 | 10.7 |
| PETR-R101 [20] | 1408×512 | - | - | Camera | 0.357 | 0.710 | 0.270 | 0.490 | 0.885 | 0.224 | 0.421 | 3.4 |
| PETR-Tiny [20] | 1408×512 | - | - | Camera | 0.361 | 0.732 | 0.273 | 0.497 | 0.808 | 0.185 | 0.431 | - |
| BEVDet-Tiny [14] | 704×256 | 52.6 M | 215.3 | Camera | 0.312 | 0.691 | 0.272 | 0.523 | 0.909 | 0.247 | 0.392 | 15.6 |
| BEVDet-Base [14] | 1600×640 | 126.6 M | 2,962.6 | Camera | 0.393 | 0.608 | 0.259 | 0.366 | 0.822 | 0.191 | 0.472 | 1.9 |
| BEVDet4D-Tiny | 704×256 | 53.6 M | 222.0 | Camera | 0.338 | 0.672 | 0.274 | 0.460 | 0.337 | 0.185 | 0.476 | 15.5 |
| BEVDet4D-Base | 1600×640 | 127.6 M | 2,989.2 | Camera | 0.421 | 0.579 | 0.258 | 0.329 | 0.301 | 0.191 | 0.545 | 1.9 |
| FCOS3D†§# [38] | 1600×900 | - | - | Camera | 0.343 | 0.725 | 0.263 | 0.422 | 1.292 | 0.153 | 0.415 | - |
| DETR3D† [40] | 1600×900 | 51.3 M | - | Camera | 0.349 | 0.716 | 0.268 | 0.379 | 0.842 | 0.200 | 0.434 | - |
| PGD†§ [39] | 1600×900 | 53.6 M | - | Camera | 0.369 | 0.683 | 0.260 | 0.439 | 1.268 | 0.185 | 0.428 | - |
| PETR-R101† [20] | 1600×900 | - | - | Camera | 0.370 | 0.711 | 0.267 | 0.383 | 0.865 | 0.201 | 0.442 | - |
| BEVDet-Base§ [14] | 1600×640 | 126.6 M | - | Camera | 0.397 | 0.595 | 0.257 | 0.355 | 0.818 | 0.188 | 0.477 | - |
| BEVDet4D-Base§ | 1600×640 | 126.6 M | - | Camera | 0.426 | 0.560 | 0.254 | 0.317 | 0.289 | 0.186 | 0.552 | - |
1000 scenes with images from 6 surrounding-view cameras, and points from 5 radars and 1 LiDAR. It is the up-to-date popular benchmark for 3D object detection [38, 40, 39, 27] and BEV semantic segmentation [30, 28, 26, 42]. The scenes are officially split into 700/150/150 scenes for training/validation/testing. There are up to 1.4M annotated 3D bounding boxes for 10 classes: car, truck, bus, trailer, construction vehicle, pedestrian, motorcycle, bicycle, barrier, and traffic cone. Following CenterPoint [44], we define the region of interest (ROI) within 51.2 meters in the ground plane with a resolution of 0.8 meters by default.
Evaluation Metrics For 3D object detection, we report the official predefined metrics: mean Average Precision (mAP), Average Translation Error (ATE), Average Scale Error (ASE), Average Orientation Error (AOE), Average Velocity Error (AVE), Average Attribute Error (AAE), and NuScenes Detection Score (NDS). The mAP is analogous to that in 2D object detection [17] for measuring the precision and recall, but defined based on the match by 2D center distance on the ground plane instead of the Intersection over Union (IOU) [2]. NDS is the composite of the other indicators for comprehensively judging the detection capacity. The remaining metrics are designed for calculating the positive results’ precision on the corresponding aspects (e.g., translation, scale, orientation, velocity, and attribute).
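For reference, the composite score is defined in the benchmark [2] as a weighted combination of mAP and the five true-positive metrics; the formula below is recalled from that benchmark definition rather than restated in this paper:

$$
\mathrm{NDS}=\frac{1}{10}\Big[5\,\mathrm{mAP}+\sum_{\mathrm{mTP}\in\mathbb{TP}}\big(1-\min(1,\mathrm{mTP})\big)\Big],
\qquad
\mathbb{TP}=\{\mathrm{mATE},\mathrm{mASE},\mathrm{mAOE},\mathrm{mAVE},\mathrm{mAAE}\}.
$$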
Training Parameters Following BEVDet [14], models are trained with the AdamW [22] optimizer, in which gradient clipping is exploited, with a learning rate of 2e-4 and a total batch size of 64 on 8 NVIDIA GeForce RTX 3090 GPUs. Sublinear memory cost [4] is used for GPU memory management.
We apply a cyclic policy [41], which linearly increases the learning rate from 2e-4 to 1e-3 in the first 40% of the schedule and linearly decreases it from 1e-3 to 0 in the remaining epochs. By default, the total schedule terminates within 20 epochs.
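As a concrete reading of this schedule, the snippet below computes the learning rate at a given fraction of training progress; the function name `cyclic_lr` and the plain linear interpolation are a sketch of the described policy, not the exact implementation used in training.

```python
def cyclic_lr(progress, base_lr=2e-4, peak_lr=1e-3, warm_frac=0.4):
    """Learning rate under the cyclic policy described above.

    progress: training progress in [0, 1] (e.g., current_step / total_steps).
    Rises linearly from base_lr to peak_lr over the first `warm_frac` of training,
    then decays linearly from peak_lr to 0 for the remainder.
    """
    if progress < warm_frac:
        return base_lr + (peak_lr - base_lr) * (progress / warm_frac)
    return peak_lr * (1.0 - (progress - warm_frac) / (1.0 - warm_frac))

# Example: the rate peaks at 1e-3 at 40% of the schedule and reaches 0 at the end.
assert abs(cyclic_lr(0.4) - 1e-3) < 1e-12
assert cyclic_lr(1.0) == 0.0
```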
Data Processing We keep all data processing settings the same as BEVDet [14]. Specifically, we use $W_{in} \times H_{in}$ to denote the width and height of the input image. By default in the training process, the source images with $1600 \times 900$ resolution [2] are processed by random flipping, random scaling with a range of $s \in [W_{in}/1600-0.06,\, W_{in}/1600+0.11]$, random rotating with a range of $r \in [-5.4^{\circ}, 5.4^{\circ}]$, and finally cropping to a size of $W_{in} \times H_{in}$. The cropping is conducted randomly in the horizontal direction but is fixed in the vertical direction (i.e., $(y_1, y_2) = (\max(0,\, s \times 900 - H_{in}),\, y_1 + H_{in})$, where $y_1$ and $y_2$ are the upper bound and the lower bound of the target region). In the BEV space, the input feature and the 3D object detection targets are augmented by random flipping, random rotating with a range of $[-22.5^{\circ}, 22.5^{\circ}]$, and random scaling with a range of $[0.95, 1.05]$. Following CenterPoint [44], all models are trained with CBGS [48]. At test time, the input image is scaled by a factor of $s = W_{in}/1600 + 0.04$ and cropped to $W_{in} \times H_{in}$ resolution with a region defined as $(x_1, x_2, y_1, y_2) = (0.5 \times (s \times 1600 - W_{in}),\, x_1 + W_{in},\, s \times 900 - H_{in},\, y_1 + H_{in})$.
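To illustrate the test-time preprocessing arithmetic, the small helper below computes the scale factor and crop box for a given input resolution; the function name and return convention are hypothetical and it only mirrors the formulas quoted above.

```python
def test_time_scale_and_crop(w_in, h_in, src_w=1600, src_h=900):
    """Return the resize scale and the (x1, x2, y1, y2) crop box used at test time."""
    s = w_in / src_w + 0.04                 # scale factor s = W_in / 1600 + 0.04
    x1 = 0.5 * (s * src_w - w_in)           # center crop horizontally
    x2 = x1 + w_in
    y1 = s * src_h - h_in                   # keep the bottom of the scaled image
    y2 = y1 + h_in
    return s, (x1, x2, y1, y2)

# Example for the BEVDet4D-Tiny input size of 704 x 256:
# s = 0.48, scaled image 768 x 432, crop box x: [32, 736], y: [176, 432].
print(test_time_scale_and_crop(704, 256))
```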
Inference Speed We conduct all experiments based on MMDetection3D [6]. The inference speed is the average upon 6019 validation samples [2]. For monocular paradigms like FCOS3D [38] and PGD [39], the inference
Table 2: Comparison with the state-of-the-art methods on the nuScenes test set. † pre-trained on DDAD [11].
| Methods | Modality | mAP↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | NDS↑ |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| PointPillars (Light) [16] | LiDAR | 0.305 | 0.517 | 0.290 | 0.500 | 0.316 | 0.368 | 0.453 |
| CenterFusion [25] | Camera & Radar | 0.326 | 0.631 | 0.261 | 0.516 | 0.614 | 0.115 | 0.449 |
| CenterPoint [44] | Camera & LiDAR & Radar | 0.671 | 0.249 | 0.236 | 0.350 | 0.250 | 0.136 | 0.714 |
| MonoDIS [32] | Camera | 0.304 | 0.738 | 0.263 | 0.546 | 1.553 | 0.134 | 0.384 |
| CenterNet [46] | Camera | 0.338 | 0.658 | 0.255 | 0.629 | 1.629 | 0.142 | 0.400 |
| FCOS3D [38] | Camera | 0.358 | 0.690 | 0.249 | 0.452 | 1.434 | 0.124 | 0.428 |
| PGD [39] | Camera | 0.386 | 0.626 | 0.245 | 0.451 | 1.509 | 0.127 | 0.448 |
| PETR [20] | Camera | 0.434 | 0.641 | 0.248 | 0.437 | 0.894 | 0.143 | 0.481 |
| BEVDet [14] | Camera | 0.422 | 0.529 | **0.236** | 0.395 | 0.979 | 0.152 | 0.482 |
| BEVDet4D | Camera | **0.451** | **0.511** | 0.241 | **0.386** | **0.301** | **0.121** | **0.569** |
| DD3D† [27] | Camera | 0.418 | 0.572 | 0.249 | 0.368 | 1.014 | 0.124 | 0.477 |
| DETR3D† [40] | Camera | 0.386 | 0.626 | 0.245 | 0.394 | 0.845 | 0.133 | 0.479 |
| PETR† [20] | Camera | 0.441 | 0.593 | 0.249 | 0.383 | 0.808 | 0.132 | 0.504 |
speeds are divided by a factor of 6 (i.e. the number of images in a single sample [2]), as they take each image as an independent sample. By default, the inference acceleration method proposed in BEVDet [14] is applied.

4.2. Benchmark Results 4.2. 基准测试结果

4.2.1 nuScenes val set 4.2.1 nuScenes 验证集

We comprehensively compare the proposed BEVDet4D with the baseline method BEVDet [14] and other paradigms like FCOS3D [38], PGD [39], DETR3D [40], and PETR [20]. Their numbers of parameters, computational budgets, inference speeds, and accuracy on the nuScenes val set are all listed in Tab. 1. Some state-of-the-art methods with other sensors are also listed for comparison, such as the LiDAR-based method CenterPoint [44] and the radar-based method CenterFusion [25].
The high-speed version dubbed BEVDet4D-Tiny scores 47.6% NDS on the nuScenes val set, which exceeds the baseline (i.e., BEVDet [14] with 39.2% NDS) by a large margin of +8.4% NDS. The improvement in the composite indicator NDS mainly derives from the reduction of the orientation error, the velocity error, and the attribute error. Specifically, benefitting from the well-designed BEVDet4D paradigm, the velocity error is significantly decreased by 62.9%, from BEVDet-Tiny's 0.909 mAVE to BEVDet4D-Tiny's 0.337 mAVE. For the first time, the precision of velocity prediction in camera-based methods notably exceeds that of CenterFusion [25] with 0.540 mAVE, which relies on multi-sensor fusion of camera and radar for high precision in this aspect. Besides, with a close inference speed, the velocity precision of BEVDet4D-Tiny is also comparable with that of the LiDAR-based method PointPillar [16] (i.e., 17.9 FPS and 0.323 mAVE) implemented in [44]. With respect to orientation prediction, the proposed method also reduces the error in this aspect by -12.0%, from BEVDet-Tiny's 0.523 mAOE to BEVDet4D-Tiny's 0.460 mAOE. This is because the orientation and velocity of the targets are strongly coupled.

Analogously, the attribute error is reduced by -25.1%, from BEVDet-Tiny's 0.247 mAAE to BEVDet4D-Tiny's 0.185 mAAE.
When upgrading the paradigm to BEVDet4D-Base, analogous to BEVDet-Base [14], the improvement over the baseline narrows slightly to +7.3% on the composite indicator NDS, from BEVDet-Base's 47.2% NDS to BEVDet4D-Base's 54.5% NDS. With test-time augmentation, we further push the performance boundary to 55.2% NDS. It is worth noting that, thanks to the few framework adjustments, BEVDet4D achieves the aforementioned performance improvement at the cost of negligible extra inference latency.

4.2.2 nuScenes test set 4.2.2 nuScenes 测试集

For the nuScenes test set, we train the BEVDet4D-Base configuration on the train and val sets. A single model with test-time augmentation is adopted. As listed in Tab. 2, BEVDet4D ranks first on the nuScenes vision-based 3D object detection leaderboard with a score of 56.9% NDS, substantially surpassing the previous leading method BEVDet [14] by +8.7% NDS. This also significantly exceeds those relying on additional data for pre-training like DD3D [27], DETR3D [40], and PETR [20]. With respect to the ability of generalization, the previous leading method BEVDet [14] has merely +0.5% performance growth from the val set (47.7% NDS) to the test set (48.2% NDS). However, with the same configuration, the performance growth of BEVDet4D is +1.7% NDS, from the val set (55.2% NDS) to the test set (56.9% NDS). This indicates that exploiting temporal cues in BEVDet4D can also help improve the models' generalization performance.

4.3. Ablation Studies 4.3. 消融研究

4.3.1 Road Map of Building BEVDet4D

In this subsection, we empirically show how the robust performance of BEVDet4D is built. BEVDet-Tiny [14] without acceleration is adopted as a baseline. In other words,
Table 3: Results of the ablation study on the nuScenes val set. The align operation includes rotation (R) and translation (T). Extra denotes the extra BEV encoder. Aug. denotes the augmentation in the time dimension when selecting the adjacent frame.
| Methods | Align | Target | Extra | Weight | Aug. | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | #param. | GFLOPs | FPS |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BEVDet | | Speed | | 0.2 | | 0.312 | 0.392 | 0.691 | 0.272 | 0.523 | 0.909 | 0.247 | 52.5 M | 215.3 | 7.8 |
| A | | Speed | | 0.2 | | 0.296 | 0.376 | 0.711 | 0.274 | 0.501 | 1.544 | 0.234 | 52.6 M | 215.9 | 7.8 |
| B | T | Speed | | 0.2 | | 0.321 | 0.393 | 0.672 | 0.272 | 0.525 | 1.186 | 0.215 | 52.6 M | 215.9 | 7.8 |
| C | T | Offset | | 0.2 | | 0.320 | 0.440 | 0.697 | 0.274 | 0.514 | 0.479 | 0.229 | 52.6 M | 215.9 | 7.8 |
| D | T | Offset | ✓ | 0.2 | | 0.323 | 0.449 | 0.680 | 0.274 | 0.480 | 0.479 | 0.209 | 53.6 M | 222.0 | 7.7 |
| E | T | Offset | ✓ | 1.0 | | 0.322 | 0.452 | 0.685 | 0.271 | 0.496 | 0.435 | 0.201 | 53.6 M | 222.0 | 7.7 |
| F | R&T | Offset | ✓ | 1.0 | | 0.321 | 0.461 | 0.681 | 0.272 | 0.461 | 0.376 | 0.203 | 53.6 M | 222.0 | 7.7 |
| G | R&T | Offset | ✓ | 1.0 | ✓ | 0.340 | 0.481 | 0.660 | 0.270 | 0.453 | 0.328 | 0.176 | 53.6 M | 222.0 | 7.7 |
the spatial alignment operation in BEVDet4D is conducted within the view transformer by adjusting the pseudo point cloud [28]. The results of different configurations are listed in Tab. 3. Some key factors are discussed one by one in the following.
Directly concatenating the current frame feature with the previous one in configuration Tab. 3 (A), the overall performance drops from 39.2% NDS to 37.6% NDS by -1.6%. This modification degrades the model's performance, especially on the translation and velocity aspects. We conjecture that, due to the ego-motion, the positional shift of the same static object between the two candidate features will confuse the following modules' judgment on the object position. With respect to a moving object, it is more complicated for the modules to infer the velocity target defined in the current frame's ego coordinate system [14] from the positional shift between the two candidate features, which is described in Eq. 1. To this end, the module needs to remove the ego-motion component from this positional shift and consider the time factor.
By conducting a translation-only align operation in configuration Tab. 3 (B), we enable the modules to utilize the position-aligned candidate features to make better perceptions of the static targets. Besides, the velocity prediction task is simplified by removing the ego-motion component. As a result, the translation error is reduced by -5.4% to 0.672, which surpasses the baseline configuration with a translation error of 0.691. Moreover, the velocity error is also reduced by -23.2%, from configuration Tab. 3 (A) 1.544 mAVE to configuration Tab. 3 (B) 1.186 mAVE. However, this velocity error is still larger than that of the baseline configuration. We conjecture that the distribution of the positional shift is far from that of the velocity due to the inconsistent time duration between the two adjacent frames.
Further removing the time factor in configuration Tab. 3 (C), we let the module directly predict the targets' positional shift between the two candidate features. This modification successfully simplifies the learning targets and makes the trained module more robust on the validation set. The velocity error is thus further reduced by a large margin of -59.6% to 0.479 mAVE, which is just 52.7% of that of the naive BEVDet [14].

Figure 4: Ablation on the time interval between the current frame and the reference one. Points drawn in the same color are in the same training configuration.
In configuration Tab. 3 (D), we apply an extra BEV encoder before concatenating the two candidate features. This slightly enlarges the computational budget by 2.8%, while the change in inference speed is negligible. However, this modification offers comprehensive improvement over the baseline (i.e., Tab. 3 (C)). The overall performance is thus improved by +0.9% NDS, from 44.0% to 44.9%. By adjusting the loss weight of velocity prediction in the training process, configuration Tab. 3 (E) further reduces the velocity error to 0.435.
By further considering the rotational component of the ego pose in the align operation, configuration Tab. 3 (F) reduces the velocity error by another 13.6%, from 0.435 (i.e., Tab. 3 (E)) to 0.376. This indicates that a more precise align operation helps increase the precision of velocity prediction.
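For concreteness, the following minimal sketch warps the previous frame's BEV feature into the current ego coordinate system using the full relative ego pose, i.e., rotation as well as translation. The BEV range, axis convention, and bilinear sampling with `grid_sample` are assumptions of this sketch, not the released implementation.

```python
# Minimal sketch of an ego-pose-aware align operation: warp the previous BEV
# feature into the current ego frame with bilinear interpolation.
import torch
import torch.nn.functional as F

def align_prev_bev(bev_prev: torch.Tensor,          # (B, C, H, W), previous ego frame
                   T_curr_from_prev: torch.Tensor,  # (B, 4, 4), previous -> current ego
                   bev_range: float = 51.2) -> torch.Tensor:
    B, C, H, W = bev_prev.shape
    device = bev_prev.device
    # metric coordinates of the current-frame BEV grid (x along W, y along H)
    ys, xs = torch.meshgrid(
        torch.linspace(-bev_range, bev_range, H, device=device),
        torch.linspace(-bev_range, bev_range, W, device=device),
        indexing="ij",
    )
    pts = torch.stack([xs, ys, torch.zeros_like(xs), torch.ones_like(xs)], dim=-1)
    # express the current-frame grid locations in the previous ego frame
    T_prev_from_curr = torch.inverse(T_curr_from_prev)                  # (B, 4, 4)
    pts_prev = torch.einsum("bij,hwj->bhwi", T_prev_from_curr, pts)[..., :2]
    # normalize to [-1, 1] and sample the previous feature bilinearly
    grid = pts_prev / bev_range                                         # (B, H, W, 2)
    return F.grid_sample(bev_prev, grid, mode="bilinear",
                         padding_mode="zeros", align_corners=True)
```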
To search for the optimal test-time interval between the current frame and the reference one, we use the unlabeled camera sweeps at 12 Hz instead of the annotated camera frames at 2 Hz in configuration Tab. 3 (G). The time interval between two adjacent camera sweeps is denoted as T ≈ 0.083 s. We select three different time intervals in each training configuration and judge the adjusting direction by comparing them at test time. In this way, we avoid training disturbance while searching for this hyper-parameter. According to Fig. 4, the optimal interval is around 15T.
Table 4: Ablation study for the precision degeneration of the interpolation operation on the nuScenes val set.
| Configuration | BEV Resolution | Align | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ | #param. | GFLOPs | FPS |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| A | 0.8 m × 0.8 m | Within | 0.320 | 0.440 | 0.697 | 0.274 | 0.514 | 0.479 | 0.229 | 52.6 M | 215.9 | 7.8 |
| B | 0.8 m × 0.8 m | After | 0.313 | 0.438 | 0.685 | 0.273 | 0.529 | 0.499 | 0.205 | 52.6 M | 215.9 | 15.6 |
| C | 0.4 m × 0.4 m | Within | 0.326 | 0.452 | 0.651 | 0.273 | 0.455 | 0.516 | 0.216 | 52.6 M | 440.8 | - |
| D | 0.4 m × 0.4 m | After | 0.327 | 0.453 | 0.657 | 0.271 | 0.463 | 0.497 | 0.219 | 52.6 M | 440.8 | 10.0 |
Table 5: Ablation study for the position of temporal fusion in BEVDet4D on the nuScenes val set.
| Methods | Configuration | Position | mAP↑ | NDS↑ | mATE↓ | mASE↓ | mAOE↓ | mAVE↓ | mAAE↓ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| BEVDet [14] | - | - | 0.312 | 0.392 | 0.691 | **0.272** | 0.523 | 0.909 | 0.247 |
| BEVDet4D | A | Before Extra BEV Encoder | 0.320 | 0.442 | 0.687 | 0.277 | 0.519 | 0.480 | 0.214 |
| BEVDet4D | B | After Extra BEV Encoder | **0.323** | **0.453** | **0.674** | **0.272** | **0.503** | **0.429** | **0.208** |
| BEVDet4D | C | After BEV Encoder | 0.311 | 0.394 | 0.720 | 0.274 | 0.536 | 0.838 | 0.245 |
This interval is set as the default test-time interval in this paper. During training, we conduct data augmentation by randomly sampling time intervals within [3T, 27T]. As a result, configuration Tab. 3 (G) further reduces the velocity error by 12.8%, from 0.376 in configuration Tab. 3 (F) to 0.328.
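The interval schedule described above can be summarized with the small sketch below; the helper name is illustrative and the exact sampling distribution is an assumption.

```python
# Minimal sketch of the interval strategy in Tab. 3 (G): sample the reference
# sweep uniformly from [3T, 27T] during training, use a fixed 15T at test time.
import random

SWEEP_T = 0.083  # seconds between adjacent camera sweeps (12 Hz)

def sample_reference_offset(training: bool) -> int:
    """Return how many sweeps behind the current frame the reference is taken."""
    if training:
        return random.randint(3, 27)      # interval augmentation
    return 15                             # default test-time interval (~1.25 s)
```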

4.3.2 Precision Degeneration of the Interpolation Operation

We use configuration Tab. 3 (C) to investigate the precision degeneration caused by the interpolation operation. Several ablation configurations are constructed in Tab. 4 to study factors such as the BEV resolution and the interpolation operation. When a low BEV resolution of 0.8 m × 0.8 m is applied, we observe a slight drop in velocity precision, from 0.479 mAVE in configuration Tab. 4 (A) to 0.499 mAVE in (B). This indicates that aligning the feature map after the view transformation with the interpolation operation introduces a systematic error. However, the precondition for the acceleration technique in BEVDet [14] is maintained in configuration Tab. 4 (B). Benefiting from this acceleration, the inference speed scales up to 15.6 FPS, which is twice that of configuration Tab. 4 (A).
When a high BEV resolution of 0.4 m × 0.4 m is applied, the performance difference between aligning within the view transformation and aligning after the view transformation with the interpolation operation is negligible (i.e., 45.2% NDS for Tab. 4 (C) vs. 45.3% NDS for Tab. 4 (D)). A high BEV resolution thus helps reduce the precision degeneration caused by the interpolation operation. Besides, from the perspective of inference acceleration, we deprecate conducting the align operation within the view transformation in this paper.

4.3.3 The Position of the Temporal Fusion

It is not trivial to select the position of the temporal fusion in the BEVDet4D framework. We compare several typical positions in Tab. 5 to study this problem.

Among all configurations, conducting the temporal fusion after the extra BEV encoder in configuration Tab. 5 (B) is the most applicable one, with the lowest velocity error of 0.429 mAVE. When the temporal fusion is brought forward in configuration Tab. 5 (A), the velocity error increases by +11.9% to 0.480 mAVE. This indicates that the BEV feature generated by the view transformer is too coarse to be applied directly; an extra BEV encoder before the temporal fusion helps alleviate this problem. When we postpone the temporal fusion to after the BEV encoder in configuration Tab. 5 (C), the overall performance degenerates to 39.4% NDS, which is close to the 39.2% NDS of the baseline BEVDet [14]. More precisely, the feature from the previous frame slightly reduces the velocity error from 0.909 mAVE to 0.838 mAVE but increases the translation error from 0.691 mATE to 0.720 mATE. This indicates that the BEV encoder plays an important role in effectuating the proposed BEVDet4D paradigm: it resists the positional misleading from the previous frame feature and estimates the velocity according to the difference between the two candidate features.
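To summarize the preferred design, the sketch below (with hypothetical module names) places the temporal concatenation after the extra BEV encoder and before the main BEV encoder and the detection head, as in Tab. 5 (B). In practice the previous frame's BEV feature could be cached instead of being recomputed from images.

```python
# Minimal sketch (hypothetical module names) of the fusion position preferred
# in Tab. 5 (B): concatenate after the extra BEV encoder, before the main
# BEV encoder and the detection head.
import torch
import torch.nn as nn

class BEVDet4DSketch(nn.Module):
    def __init__(self, view_transformer: nn.Module, extra_bev_encoder: nn.Module,
                 bev_encoder: nn.Module, head: nn.Module):
        super().__init__()
        self.view_transformer = view_transformer    # image features -> coarse BEV feature
        self.extra_bev_encoder = extra_bev_encoder  # refines each frame's BEV feature
        self.bev_encoder = bev_encoder              # main BEV encoder on the fused feature
        self.head = head                            # detection head

    def forward(self, imgs_curr, imgs_prev, align):
        bev_curr = self.extra_bev_encoder(self.view_transformer(imgs_curr))
        bev_prev = self.extra_bev_encoder(self.view_transformer(imgs_prev))
        fused = torch.cat([bev_curr, align(bev_prev)], dim=1)   # Tab. 5 (B) position
        return self.head(self.bev_encoder(fused))
```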

5. Conclusion

In this paper, we pioneer the exploitation of vision-based autonomous driving in the spatial-temporal 4D space by proposing BEVDet4D, which lifts the scalable BEVDet [14] from the spatial-only 3D space to the spatial-temporal 4D space. BEVDet4D retains the elegance of BEVDet while substantially pushing the performance in multi-camera 3D object detection, particularly in the velocity prediction aspect. This is achieved by enabling the models to access temporal cues through the BEV features of two adjacent frames and by removing the factors of ego-motion and time from the velocity prediction target. Future work will focus on designing frameworks and paradigms that mine the temporal cues more actively. Besides, multi-task learning in the spatial-temporal 4D space with BEVDet4D is another promising direction for substantially pushing the performance in this area.

References

[1] Garrick Brazil, Gerard Pons-Moll, Xiaoming Liu, and Bernt Schiele. Kinematic 3D Object Detection in Monocular Video. In Proceedings of the European Conference on Computer Vision, pages 135-152. Springer, 2020.

[2] Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11621-11631, 2020.

[3] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-toEnd Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, pages 213229. Springer, 2020.

[4] Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. Training Deep Nets with Sublinear Memory Cost. arXiv preprint arXiv:1604.06174, 2016.

[5] Yihong Chen, Yue Cao, Han Hu, and Liwei Wang. Memory Enhanced Global-Local Aggregation for Video Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10337-10346, 2020.

[6] MMDetection3D Contributors. MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d, 2020.

[7] Hanming Deng, Yang Hua, Tao Song, Zongpu Zhang, Zhengui Xue, Ruhui Ma, Neil Robertson, and Haibing Guan. Object Guided External Memory Network for Video Object Detection. In Proceedings of the International Conference on Computer Vision, pages 6678-6687, 2019.

[8] Jiajun Deng, Yingwei Pan, Ting Yao, Wengang Zhou, Houqiang Li, and Tao Mei. Relation Distillation Networks for Video Object Detection. In Proceedings of the International Conference on Computer Vision, pages 7023-7032, 2019.

[9] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning Optical Flow with Convolutional Networks. In Proceedings of the International Conference on Computer Vision, pages 2758-2766, 2015.

[10] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2012.

[11] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3D Packing for Self-Supervised Monocular Depth Estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2485-2494, 2020.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.

[13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735-1780, 1997.

[14] Junjie Huang, Guan Huang, Zheng Zhu, and Dalong Du. BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View. arXiv preprint arXiv:2112.11790, 2021.

[15] Abhinav Kumar, Garrick Brazil, and Xiaoming Liu. GrooMeD-NMS: Grouped Mathematically Differentiable NMS for Monocular 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8973-8983, 2021.

[16] Alex H Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. PointPillars: Fast Encoders for Object Detection from Point Clouds. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 12697-12705, 2019.

[17] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, pages 740-755. Springer, 2014.

[18] Mason Liu and Menglong Zhu. Mobile Video Object Detection with Temporally-Aware Feature Maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5686-5695, 2018.

[19] Mason Liu, Menglong Zhu, Marie White, Yinxiao Li, and Dmitry Kalenichenko. Looking Fast and Slow: MemoryGuided Mobile Video Object Detection. arXiv preprint arXiv:1903.10172, 2019.

[20] Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. PETR: Position Embedding Transformation for Multi-View 3D Object Detection. arXiv preprint arXiv:2203.05625, 2022.

[21] Zongdai Liu, Dingfu Zhou, Feixiang Lu, Jin Fang, and Liangjun Zhang. AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection. In Proceedings of the International Conference on Computer Vision, pages 15641-15650, 2021.

[22] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations, 2019.

[23] Yongyi Lu, Cewu Lu, and Chi-Keung Tang. Online Video Object Detection using Association LSTM. In Proceedings of the International Conference on Computer Vision, pages 2344-2352, 2017.

[24] Yan Lu, Xinzhu Ma, Lei Yang, Tianzhu Zhang, Yating Liu, Qi Chu, Junjie Yan, and Wanli Ouyang. Geometry Uncertainty Projection Network for Monocular 3D Object Detection. In Proceedings of the International Conference on Computer Vision, pages 3111-3121, 2021.

[25] Ramin Nabati and Hairong Qi. CenterFusion: Center-based Radar and Camera Fusion for 3D Object Detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527-1536, 2021.

[26] Bowen Pan, Jiankai Sun, Ho Yin Tiga Leung, Alex Andonian, and Bolei Zhou. Cross-View Semantic Segmentation for Sensing Surroundings. IEEE Robotics and Automation Letters, 5(3):4867-4873, 2020.

[27] Dennis Park, Rares Ambrus, Vitor Guizilini, Jie Li, and Adrien Gaidon. Is Pseudo-Lidar needed for Monocular 3D Object detection? In Proceedings of the International Conference on Computer Vision, pages 3142-3152, 2021.

[28] Jonah Philion and Sanja Fidler. Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. In Proceedings of the European Conference on Computer Vision, pages 194-210. Springer, 2020.

[29] Cody Reading, Ali Harakeh, Julia Chae, and Steven L Waslander. Categorical Depth Distribution Network for Monocular 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8555-8564, 2021.

[30] Thomas Roddick and Roberto Cipolla. Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11138-11147, 2020.

[31] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision, 115(3):211-252, 2015.

[32] Andrea Simonelli, Samuel Rota Bulo, Lorenzo Porzi, Manuel López-Antequera, and Peter Kontschieder. Disentangling Monocular 3D Object Detection. In Proceedings of the International Conference on Computer Vision, pages 1991-1999, 2019.

[33] Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in Perception for Autonomous Driving: Waymo Open Dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2446-2454, 2020.

[34] Zhi Tian, Chunhua Shen, Hao Chen, and Tong He. FCOS: Fully Convolutional One-Stage Object Detection. In Proceedings of the International Conference on Computer Vision, pages 9627-9636, 2019.

[35] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. Advances in Neural Information Processing Systems, 30, 2017.

[36] Li Wang, Liang Du, Xiaoqing Ye, Yanwei Fu, Guodong Guo, Xiangyang Xue, Jianfeng Feng, and Li Zhang. Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 454-463, 2021.

[37] Li Wang, Li Zhang, Yi Zhu, Zhi Zhang, Tong He, Mu Li, and Xiangyang Xue. Progressive Coordinate Transforms for Monocular 3D Object Detection. In Advances in Neural Information Processing Systems, 2021.

[38] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection. arXiv preprint arXiv:2104.10956, 2021.

[39] Tai Wang, Xinge Zhu, Jiangmiao Pang, and Dahua Lin. Probabilistic and Geometric Depth: Detecting Objects in Perspective. arXiv preprint arXiv:2107.14160, 2021.

[40] Yue Wang, Vitor Guizilini, Tianyuan Zhang, Yilun Wang, Hang Zhao, and Justin Solomon. DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries. arXiv preprint arXiv:2110.06922, 2021.

[41] Yan Yan, Yuxing Mao, and Bo Li. SECOND: Sparsely Embedded Convolutional Detection. Sensors, 18(10):3337, 2018.

[42] Weixiang Yang, Qi Li, Wenxi Liu, Yuanlong Yu, Yuexin Ma, Shengfeng He, and Jia Pan. Projecting Your View Attentively: Monocular Road Scene Layout Estimation via CrossView Transformation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 15536-15545, 2021.

[43] Wenfei Yang, Bin Liu, Weihai Li, and Nenghai Yu. Tracking Assisted Faster Video Object Detection. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 1750-1755, 2019.

[44] Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Centerbased 3D Object Detection and Tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11784-11793, 2021.

[45] Yunpeng Zhang, Jiwen Lu, and Jie Zhou. Objects are Different: Flexible Monocular 3D Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3289-3298, 2021.

[46] Xingyi Zhou, Dequan Wang, and Philipp Krähenbühl. Objects as Points. arXiv preprint arXiv:1904.07850, 2019.

[47] Yunsong Zhou, Yuan He, Hongzi Zhu, Cheng Wang, Hongyang Li, and Qinhong Jiang. Monocular 3D Object Detection: An Extrinsic Parameter Free Approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7556-7566, 2021.

[48] Benjin Zhu, Zhengkai Jiang, Xiangxin Zhou, Zeming Li, and Gang Yu. Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection. arXiv preprint arXiv:1908.09492, 2019.

[49] Xizhou Zhu, Jifeng Dai, Lu Yuan, and Yichen Wei. Towards High Performance Video Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7210-7218, 2018.

[50] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Flow-Guided Feature Aggregation for Video Object Detection. In Proceedings of the International Conference on Computer Vision, pages 408-417, 2017.

[51] Zhikang Zou, Xiaoqing Ye, Liang Du, Xianhui Cheng, Xiao Tan, Li Zhang, Jianfeng Feng, Xiangyang Xue, and Errui Ding. The Devil Is in the Task: Exploiting Reciprocal Appearance-Localization Features for Monocular 3D Object Detection. In Proceedings of the International Conference on Computer Vision, pages 2713-2722, 2021.

*Corresponding author.