Object Detection in 20 Years: A Survey

Zhengxia Zou; Keyan Chen; Zhenwei Shi; Yuhong Guo; Jieping Ye

Abstract: Object detection, as one of the most fundamental and challenging problems in computer vision, has received great attention in recent years. Over the past two decades, we have witnessed the rapid technological evolution of object detection and its profound impact on the entire field of computer vision. If we consider today's object detection technique as a revolution driven by deep learning, then, back in the 1990s, we would see the ingenious thinking and long-term perspective design of early computer vision. This article extensively reviews this fast-moving research field in the light of technical evolution, spanning over a quarter-century (from the 1990s to 2022). A number of topics are covered, including the milestone detectors in history, detection datasets, metrics, fundamental building blocks of detection systems, speedup techniques, and the recent state-of-the-art detection methods.

Published in: Proceedings of the IEEE (Volume: 111, Issue: 3, March 2023)

Page(s): 257 - 276

Date of Publication: 27 January 2023

DOI: 10.1109/JPROC.2023.3238524

Publisher: IEEE

SECTION I.

Introduction


Object detection is an important computer vision task that deals with detecting instances of visual objects of a certain class (such as humans, animals, or cars) in digital images. The objective of object detection is to develop computational models and techniques that provide one of the most basic pieces of information needed by computer vision applications: what objects are where? The two most important metrics of object detection are accuracy (including classification accuracy and localization accuracy) and speed.


Object detection serves as a basis for many other computer vision tasks, such as instance segmentation [1], [2], [3], [4], image captioning [5], [6], [7], and object tracking [8]. In recent years, the rapid development of deep learning techniques [9] has greatly promoted the progress of object detection, leading to remarkable breakthroughs and propelling it to an unprecedented research hotspot. Object detection is now widely used in many real-world applications, such as autonomous driving, robot vision, and video surveillance. Fig. 1 shows the growing number of publications that are associated with "object detection" over the past two decades.

Fig. 1. Increasing number of publications in object detection from 1998 to 2021. (Data from Google Scholar advanced search: allintitle: "object detection" or "detecting objects.")


Different detection tasks have totally different objectives and constraints, so their difficulties may vary from each other. In addition to some common challenges in other computer vision tasks, such as objects under different viewpoints, illuminations, and intraclass variations, the challenges in object detection include, but are not limited to, the following aspects: object rotation and scale changes (e.g., small objects), accurate object localization, dense and occluded object detection, speedup of detection, and so on. In Section IV, we give a more detailed analysis of these topics.


This survey is intended to provide newcomers with a complete grasp of object detection techniques from multiple perspectives, with an emphasis on their evolution. Its key features are threefold: a comprehensive review in the light of technical evolution, an in-depth exploration of the key technologies and the recent state of the art, and a comprehensive analysis of detection speedup techniques. The main clue focuses on the past, the present, and the future, supplemented by some other necessary components in object detection, such as datasets, metrics, and speedup techniques. This survey stands on the highway of technical evolution, aiming to present the evolution of the relevant technologies so that readers can grasp the essential concepts and find potential future directions while ignoring their technical details.


The rest of this article is organized as follows. In Section II, we review the 20-year evolution of object detection. In Section III, we review the speedup techniques in object detection. Section IV reviews the state-of-the-art detection methods of the recent three years. Section V concludes this article and makes a deep analysis of further research directions.

SECTION II.

Object Detection in 20 Years


In this section, we review the history of object detection from multiple perspectives, including milestone detectors, datasets, metrics, and the evolution of key techniques.


A. Roadmap of Object Detection


Over the past two decades, it is widely accepted that the progress of object detection has generally gone through two historical periods: the "traditional object detection period (before 2014)" and the "deep learning-based detection period (after 2014)," as shown in Fig. 2. In the following, we summarize the milestone detectors of these two periods, taking their time of appearance and performance as the main clues, and highlight the driving forces behind the techniques, as shown in Fig. 3.

Fig. 2. Road map of object detection. Milestone detectors in this figure: VJ Det. [10], [11], HOG Det. [12], DPM [13], [14], [15], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], YOLO [20], [21], [22], SSD [23], FPN [24], RetinaNet [25], CornerNet [26], CenterNet [27], and DETR [28].

Fig. 3. Accuracy improvement of object detection on VOC07, VOC12, and MS-COCO datasets. Detectors in this figure: DPM-v1 [13], DPM-v5 [37], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], SSD [23], FPN [24], RetinaNet [25], RefineDet [38], TridentNet [39], CenterNet [40], FCOS [41], HTC [42], YOLOv4 [22], Deformable DETR [43], and Swin Transformer [44].

1) Milestones: Traditional Detectors:

If we consider today's object detection technique as a revolution driven by deep learning, then, back in the 1990s, we would see the ingenious design and long-term perspective of early computer vision. Most of the early object detection algorithms were built based on handcrafted features. Due to the lack of effective image representations at that time, people had to design sophisticated feature representations and a variety of speedup techniques.

Viola Jones Detectors: In 2001, Viola and Jones [10], [11] achieved real-time detection of human faces for the first time without any constraints (e.g., skin color segmentation). Running on a 700-MHz Pentium III CPU, the detector was tens or even hundreds of times faster than other algorithms of its time under comparable detection accuracy. The VJ detector follows the most straightforward way of detection, i.e., sliding windows: to go through all possible locations and scales in an image to see if any window contains a human face. Although it seems to be a very simple process, the computation behind it was far beyond the computing power of its time. The VJ detector dramatically improved its detection speed by incorporating three important techniques: "integral image," "feature selection," and "detection cascades" (to be introduced in Section III).
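To make the sliding-window paradigm concrete, here is a minimal NumPy/scikit-image sketch, not the VJ implementation itself: `classify` is a hypothetical stand-in for the boosted cascade, and the window size, stride, and pyramid factor are illustrative.

```python
import numpy as np
from skimage.transform import rescale

def sliding_window_detect(image, classify, win=24, stride=4, pyr=1.25):
    """Exhaustive sliding-window scan over all locations and scales.

    `classify` maps a win x win gray patch to a confidence score; it
    stands in for the boosted cascade of the VJ detector.
    """
    detections, scale = [], 1.0
    while min(image.shape) >= win:
        for y in range(0, image.shape[0] - win + 1, stride):
            for x in range(0, image.shape[1] - win + 1, stride):
                score = classify(image[y:y + win, x:x + win])
                if score > 0:  # positive margin -> candidate face
                    # map the window back to original-image coordinates
                    detections.append((x * scale, y * scale, win * scale, score))
        # shrink the image so the fixed-size window covers larger faces
        image = rescale(image, 1 / pyr)
        scale *= pyr
    return detections
```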

HOG Detector: In 2005, Dalal and Triggs [12] proposed the histogram of oriented gradients (HOG) feature descriptor. HOG can be considered an important improvement of the scale-invariant feature transform [29], [30] and shape contexts [31] of its time. To balance the feature invariance (including translation, scale, illumination, and so on) and the nonlinearity, the HOG descriptor is designed to be computed on a dense grid of uniformly spaced cells and use overlapping local contrast normalization (on “blocks”). Although HOG can be used to detect a variety of object classes, it was motivated primarily by the problem of pedestrian detection. To detect objects of different sizes, the HOG detector rescales the input image multiple times while keeping the size of a detection window unchanged. The HOG detector has been an important foundation of many object detectors [13], [14], [32] and a large variety of computer vision applications for many years.
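As an illustration, the HOG descriptor of a 64 × 128 detection window can be computed with scikit-image; the parameters below follow the commonly cited Dalal-Triggs configuration (9 orientation bins, 8 × 8-pixel cells, overlapping 2 × 2-cell blocks with L2-Hys normalization), though the exact settings here are illustrative rather than taken from this survey.

```python
import numpy as np
from skimage import data, feature, transform

# Gray test image resized to the canonical 64x128 pedestrian window.
patch = transform.resize(data.camera(), (128, 64))

descriptor = feature.hog(
    patch,
    orientations=9,          # 9 gradient-orientation bins
    pixels_per_cell=(8, 8),  # dense grid of uniformly spaced cells
    cells_per_block=(2, 2),  # overlapping local contrast normalization
    block_norm="L2-Hys",
)
print(descriptor.shape)  # (3780,) = 7 x 15 blocks x 2 x 2 cells x 9 bins
```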

Deformable Part-Based Model (DPM): DPM, as the winner of the VOC-07, -08, and -09 detection challenges, was the epitome of the traditional object detection methods. DPM was originally proposed by Felzenszwalb et al. [13] in 2008 as an extension of the HOG detector. It follows the detection philosophy of "divide and conquer," where the training can be simply considered as the learning of a proper way of decomposing an object, and the inference can be considered as an ensemble of detections on different object parts. For example, the problem of detecting a "car" can be decomposed into the detection of its window, body, and wheels. This part of the work, a.k.a. the "star-model," was introduced by Felzenszwalb et al. [13]. Later on, Girshick [14], [15], [33], [34] further extended the star-model to the "mixture models" to deal with real-world objects under more significant variations and made a series of other improvements.

Although today’s object detectors have far surpassed DPM in detection accuracy, many of them are still deeply influenced by its valuable insights, e.g., mixture models, hard negative mining (HNM), bounding box regression, and context priming. In 2010, Felzenszwalb and Girshick were awarded the “lifetime achievement” by PASCAL VOC.

2) Milestones: CNN-Based Two-Stage Detectors:

As the performance of handcrafted features became saturated, the research of object detection reached a plateau after 2010. In 2012, the world saw the rebirth of convolutional neural networks (CNNs) [35]. As a deep convolutional network is able to learn robust and high-level feature representations of an image, a natural question arises: can we introduce it to object detection? Girshick et al. [16], [36] took the lead in breaking the deadlock in 2014 by proposing the regions with CNN features (RCNN). Since then, object detection started to evolve at an unprecedented speed. There are two groups of detectors in the deep learning era: "two-stage detectors" and "one-stage detectors," where the former frames the detection as a "coarse-to-fine" process, while the latter frames it as "completing in one step."

RCNN: The idea behind RCNN is simple. It starts with the extraction of a set of object proposals (object candidate boxes) by selective search [45]. Then, each proposal is rescaled to a fixed-size image and fed into a CNN model pretrained on ImageNet (say, AlexNet [35]) to extract features. Finally, linear SVM classifiers are used to predict the presence of an object within each region and to recognize object categories. RCNN yields a significant performance boost on VOC07, with a large improvement of mean average precision (mAP) from 33.7% (DPM-v5 [46]) to 58.5%. Although RCNN has made great progress, its drawbacks are obvious: the redundant feature computations on a large number of overlapped proposals (over 2000 boxes from one image) lead to an extremely slow detection speed (14 s per image with a GPU). Later in the same year, SPPNet [17] was proposed and solved this problem.
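A hedged sketch of this pipeline is given below. `selective_search` and `svm_classifiers` are hypothetical placeholders for the proposal generator and the per-class linear SVMs; only the pretrained AlexNet feature extractor is a real torchvision component, and the input is assumed to be a CHW float tensor.

```python
import torch
import torchvision.transforms as T
from torchvision.models import alexnet

# Pretrained AlexNet convolutional trunk as the feature extractor.
backbone = alexnet(weights="IMAGENET1K_V1").features.eval()
resize = T.Resize((224, 224))

@torch.no_grad()
def rcnn_detect(image, selective_search, svm_classifiers):
    """RCNN-style pipeline: proposals -> warped patches -> CNN -> SVMs."""
    detections = []
    for (x1, y1, x2, y2) in selective_search(image):      # ~2k boxes
        patch = resize(image[:, y1:y2, x1:x2]).unsqueeze(0)  # warp to fixed size
        feat = backbone(patch).flatten(1)                    # CNN features
        label, score = svm_classifiers(feat)                 # linear SVMs
        if label is not None:
            detections.append((x1, y1, x2, y2, label, score))
    return detections  # per-class NMS would follow
```

Note how every proposal passes through the backbone independently; this redundancy is exactly the bottleneck that SPPNet removes next.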

SPPNet: In 2014, He et al. [17] proposed the spatial pyramid pooling network (SPPNet). Previous CNN models required a fixed-size input, e.g., a 224 × 224 image for AlexNet [35]. The main contribution of SPPNet is the introduction of a spatial pyramid pooling (SPP) layer, which enables a CNN to generate a fixed-length representation regardless of the size of the image/region of interest without rescaling it. When using SPPNet for object detection, the feature maps are computed from the entire image only once, and then, fixed-length representations of arbitrary regions are generated for training the detectors, which avoids repeatedly computing the convolutional features. SPPNet is more than 20 times faster than RCNN without sacrificing any detection accuracy (VOC07 mAP = 59.2%). Although SPPNet effectively improved the detection speed, it still has some drawbacks: first, the training is still multistage; second, SPPNet only fine-tunes its fully connected layers while simply ignoring all previous layers. In the following year, Fast RCNN [18] was proposed and solved these problems.
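The SPP idea can be sketched in a few lines of PyTorch using adaptive max pooling; the pyramid levels (1, 2, 4) below are illustrative assumptions, not SPPNet's exact configuration.

```python
import torch
import torch.nn.functional as F

def spatial_pyramid_pool(feat, levels=(1, 2, 4)):
    """Max-pool a feature map of any spatial size into fixed n x n grids
    and concatenate, yielding a fixed-length vector regardless of the
    input resolution."""
    n = feat.shape[0]
    pooled = [F.adaptive_max_pool2d(feat, l).view(n, -1) for l in levels]
    return torch.cat(pooled, dim=1)  # length = channels * sum(l * l)

# Two inputs of different spatial sizes map to the same output length.
f1 = torch.randn(1, 256, 13, 13)
f2 = torch.randn(1, 256, 20, 27)
print(spatial_pyramid_pool(f1).shape, spatial_pyramid_pool(f2).shape)
# both: torch.Size([1, 5376])  (256 * (1 + 4 + 16))
```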

Fast RCNN: In 2015, Girshick [18] proposed the Fast RCNN detector, which is a further improvement of RCNN and SPPNet [16], [17]. Fast RCNN enables us to simultaneously train a detector and a bounding box regressor under the same network configuration. On the VOC07 dataset, Fast RCNN increased the mAP from 58.5% (RCNN) to 70.0%, with a detection speed over 200 times faster than RCNN. Although Fast RCNN successfully integrates the advantages of RCNN and SPPNet, its detection speed is still limited by the proposal detection (see Section II-C1 for more details). Then, a question naturally arises: "can we generate object proposals with a CNN model?" Later, Faster RCNN [19] answered this question.

Faster RCNN: In 2015, Ren et al. [19], [47] proposed the Faster RCNN detector shortly after Fast RCNN. Faster RCNN is the first near-real-time deep learning detector (COCO mAP@.5 = 42.7%, VOC07 mAP = 73.2%, and 17 fps with ZF-Net [48]). The main contribution of Faster RCNN is the introduction of a region proposal network (RPN) that enables nearly cost-free region proposals. From RCNN to Faster RCNN, most individual blocks of an object detection system, e.g., proposal detection, feature extraction, and bounding box regression, have been gradually integrated into a unified, end-to-end learning framework. Although Faster RCNN breaks through the speed bottleneck of Fast RCNN, there is still computational redundancy at the subsequent detection stage. Later on, a variety of improvements were proposed, including RFCN [49] and Light-Head RCNN [50] (see more details in Section III).

Feature Pyramid Networks (FPNs): In 2017, Lin et al. [24] proposed FPN. Before FPN, most deep learning-based detectors ran detection only on the feature maps of the network's top layer. Although the features in the deeper layers of a CNN are beneficial for category recognition, they are not conducive to localizing objects. To this end, a top-down architecture with lateral connections is developed in FPN for building high-level semantics at all scales. Since a CNN naturally forms a feature pyramid through its forward propagation, FPN shows great advances in detecting objects of a wide variety of scales. Using FPN in a basic Faster RCNN system achieves state-of-the-art single-model detection results on the COCO dataset without bells and whistles (COCO mAP@.5 = 59.1%). FPN has now become a basic building block of many of the latest detectors.
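A minimal PyTorch sketch of the top-down pathway with lateral connections follows; the channel widths match a typical ResNet backbone, but all dimensions here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Minimal top-down pathway with lateral connections (FPN-style)."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        # 1x1 convs project each backbone stage to a common width.
        self.lateral = nn.ModuleList(
            nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convs smooth the merged maps.
        self.smooth = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in in_channels)

    def forward(self, feats):  # feats: high-res -> low-res (C2..C5)
        laterals = [l(f) for l, f in zip(self.lateral, feats)]
        # Top-down: upsample the coarser map and add it to the lateral.
        for i in range(len(laterals) - 2, -1, -1):
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [s(x) for s, x in zip(self.smooth, laterals)]  # P2..P5

feats = [torch.randn(1, c, s, s) for c, s in
         zip((256, 512, 1024, 2048), (64, 32, 16, 8))]
pyramid = SimpleFPN()(feats)
print([p.shape[-1] for p in pyramid])  # [64, 32, 16, 8]
```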

3) Milestones: CNN-Based One-Stage Detectors:

Most two-stage detectors follow a coarse-to-fine processing paradigm. The coarse stage strives to improve recall, while the fine stage refines the localization on the basis of the coarse detections and places more emphasis on discrimination. They can easily attain high precision without any bells and whistles but are rarely employed in engineering due to their slow speed and large complexity. On the contrary, one-stage detectors retrieve all objects in a single inference step. They are favored on mobile devices for their real-time and easy-to-deploy properties, but their performance suffers noticeably when detecting dense and small objects.

You Only Look Once (YOLO): YOLO was proposed by Redmon et al. [20] in 2015. It was the first one-stage detector in the deep learning era [20]. YOLO is extremely fast: a fast version of YOLO runs at 155 fps with a VOC07 mAP of 52.7%, while its enhanced version runs at 45 fps with a VOC07 mAP of 63.4%. YOLO follows a totally different paradigm from two-stage detectors: it applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region simultaneously. In spite of its great improvement in detection speed, YOLO suffers from a drop in localization accuracy compared with two-stage detectors, especially for some small objects. YOLO's subsequent versions [21], [22], [51] and the later proposed SSD [23] have paid more attention to this problem. Recently, YOLOv7 [52], a follow-up work from the YOLOv4 team, has been proposed. It outperforms most existing object detectors in terms of speed and accuracy (ranging from 5 to 160 fps) by introducing optimized structures, such as dynamic label assignment and model structure reparameterization.

Single-Shot Multibox Detector (SSD): SSD was proposed by Liu et al. [23] in 2015. The main contribution of SSD is the introduction of multireference and multiresolution detection techniques (to be introduced in Section II-C1), which significantly improve the detection accuracy of a one-stage detector, especially for some small objects. SSD has advantages in terms of both detection speed and accuracy (COCO mAP@.5 = 46.5%; a fast version runs at 59 fps). The main difference between SSD and previous detectors is that SSD detects objects of different scales on different layers of the network, while the previous ones only run detection on their top layers.

RetinaNet: Despite their high speed and simplicity, one-stage detectors trailed the accuracy of two-stage detectors for years. Lin et al. [25] explored the reasons behind this and proposed RetinaNet in 2017. They found that the extreme foreground-background class imbalance encountered during the training of dense detectors is the central cause. To this end, a new loss function named "focal loss" was introduced in RetinaNet by reshaping the standard cross entropy loss so that the detector puts more focus on hard, misclassified examples during training. Focal loss enables one-stage detectors to achieve comparable accuracy to two-stage detectors while maintaining a very high detection speed (COCO mAP@.5 = 59.1%).
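A common PyTorch formulation of the binary focal loss is sketched below; alpha = 0.25 and gamma = 2 are the values usually quoted for RetinaNet, but treat this as a sketch rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).

    One entry per anchor-class pair; the (1 - p_t)^gamma factor
    down-weights easy examples so the rare, hard foreground anchors
    dominate the gradient.
    """
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).sum()
```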

CornerNet: Previous methods primarily used anchor boxes to provide classification and regression references. Since objects exhibit large variation in number, location, scale, ratio, and so on, these methods have to set up a large number of reference boxes to better match the ground truths and achieve high performance. However, the network then suffers from further category imbalance, many hand-designed hyperparameters, and a long convergence time. To address these problems, Law and Deng [26] discarded the previous detection paradigm and viewed the task as a keypoint (box corner) prediction problem. After obtaining the keypoints, CornerNet decouples and regroups the corner points using extra embedding information to form the bounding boxes. CornerNet outperformed most one-stage detectors at that time (COCO mAP@.5 = 57.8%).

CenterNet: Zhou et al. [40] proposed CenterNet in 2019. It also follows a keypoint-based detection paradigm but eliminates costly postprocessing, such as group-based keypoint assignment (in CornerNet [26], ExtremeNet [53], and so on) and NMS, resulting in a fully end-to-end detection network. CenterNet considers an object to be a single point (the object's center) and regresses all of its attributes (such as size, orientation, location, and pose) with reference to the center point. The model is simple and elegant, and it can integrate 3-D object detection, human pose estimation, optical flow learning, depth estimation, and other tasks into a single framework. Despite using such a concise detection concept, CenterNet achieves competitive detection results (COCO mAP@.5 = 61.1%).

DETR: In recent years, Transformers have deeply affected the entire field of deep learning, particularly computer vision. Transformers discard the traditional convolution operator in favor of attention-only calculation in order to overcome the limitations of CNNs and obtain a global-scale receptive field. In 2020, Carion et al. [28] proposed DETR, which views object detection as a set prediction problem and builds an end-to-end detection network with Transformers. Since then, object detection has entered a new era in which objects can be detected without the use of anchor boxes or anchor points. Later, Zhu et al. [43] proposed Deformable DETR to address DETR's long convergence time and limited performance in detecting small objects. It achieves state-of-the-art performance on the MS-COCO dataset (COCO mAP@.5 = 71.9%).

B. Object Detection Datasets and Metrics

1) Datasets:

Building larger datasets with less bias is essential for developing advanced detection algorithms. A number of well-known detection datasets have been released in the past ten years, including the datasets of PASCAL VOC Challenges [54], [55] (e.g., VOC2007, VOC2012), the ImageNet Large Scale Visual Recognition Challenge (e.g., ILSVRC2014) [56], the MS-COCO Detection Challenge [57], the Open Images Dataset [58], [59], Objects365 [60], and so on. The statistics of these datasets are given in Table 1. Fig. 4 shows some image examples of these datasets. Fig. 3 shows the improvements in detection accuracy on VOC07, VOC12, and MS-COCO datasets from 2008 to 2021.

Table 1. Some Well-Known Object Detection Datasets and Their Statistics
Fig. 4. Some example images and annotations in (a) PASCAL-VOC07, (b) ILSVRC, (c) MS-COCO, and (d) Open Images.

Pascal VOC: The PASCAL Visual Object Classes (VOC) Challenge (from 2005 to 2012) [54], [55] was one of the most important competitions in the early computer vision community. Two versions of Pascal-VOC are mostly used in object detection: VOC07 and VOC12, where the former consists of 5k training images with 12k annotated objects, and the latter consists of 11k training images with 27k annotated objects. Twenty classes of objects that are common in everyday life are annotated in these two datasets, e.g., "person," "cat," "bicycle," and "sofa."

ILSVRC: The ILSVRC [56] pushed forward the state of the art in generic object detection. ILSVRC was organized annually from 2010 to 2017. It contains a detection challenge using ImageNet images [61]. The ILSVRC detection dataset contains 200 classes of visual objects. The number of its images/object instances is two orders of magnitude larger than VOC.

MS-COCO: MS-COCO [57] is one of the most challenging object detection datasets available today. The annual competition based on the MS-COCO dataset has been held since 2015. It has fewer object categories than ILSVRC but more object instances. For example, MS-COCO-17 contains 164k images and 897k annotated objects from 80 categories. Compared with VOC and ILSVRC, the biggest progress of MS-COCO is that, apart from the bounding box annotations, each object is further labeled using per-instance segmentation to aid in precise localization. In addition, MS-COCO contains more small objects (whose area is smaller than 1% of the image) and more densely located objects. Just like ImageNet in its time, MS-COCO has become the de facto standard for the object detection community.

Open Images: The year 2018 saw the introduction of the Open Images detection (OID) challenge [62], following MS-COCO but at an unprecedented scale. There are two tasks in Open Images: 1) standard object detection and 2) visual relationship detection, which detects paired objects in particular relations. For the standard detection task, the dataset consists of 1910k images with 15440k annotated bounding boxes on 600 object categories.

2) Metrics:

How can we evaluate the accuracy of a detector? This question may have different answers at different times. In early detection research, there were no widely accepted evaluation metrics for detection accuracy. For example, in the early research on pedestrian detection [12], the "miss rate versus false positives per window (FPPW)" was commonly used as the metric. However, the per-window measurement can be flawed and fails to predict full-image performance [63]. In 2009, the Caltech pedestrian detection benchmark was introduced [63], [64], and since then, the evaluation metric has changed from FPPW to false positives per image (FPPI).

In recent years, the most frequently used evaluation for detection is “average precision (AP),” which was originally introduced in VOC2007. AP is defined as the average detection precision under different recalls and is usually evaluated in a category-specific manner. The mAP averaged over all categories is usually used as the final metric of performance. To measure the object localization accuracy, the intersection over union (IoU) between the predicted box and the ground truth is used to verify whether it is greater than a predefined threshold, say, 0.5. If yes, the object will be identified as “detected,” otherwise, “missed.” The 0.5-IoU mAP has then become the de facto metric for object detection.
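The IoU test behind this protocol is simple to state in code; the following NumPy-style sketch (the boxes and threshold are illustrative) shows how a prediction is matched to a ground truth under the 0.5-IoU criterion.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

pred, gt = (50, 50, 150, 150), (60, 60, 160, 160)
print(iou(pred, gt))        # ~0.68
print(iou(pred, gt) > 0.5)  # True -> counted as "detected" at 0.5-IoU
```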

After 2014, due to the introduction of MS-COCO datasets, researchers started to pay more attention to the accuracy of object localization. Instead of using a fixed IoU threshold, MS-COCO AP is averaged over multiple IoU thresholds between 0.5 and 0.95, which encourages more accurate object localization and may be of great importance for some real-world applications (e.g., imagine there is a robot trying to grasp a spanner).

C. Technical Evolution in Object Detection

In this section, we introduce some important building blocks of a detection system and their technical evolution. First, we describe multiscale detection and context priming in model design, followed by the sample selection strategy and the design of the loss function in the training process and, finally, nonmaximum suppression at inference. The time stamps in the charts and text are given by the publication times of the papers. The evolution order shown in the figures is primarily to assist readers in understanding, and there may be temporal overlap.

1) Technical Evolution of Multiscale Detection:

Multiscale detection of objects with “different sizes” and “different aspect ratios” is one of the main technical challenges in object detection. In the past 20 years, multiscale detection has gone through multiple historical periods, as shown in Fig. 5.

Fig. 5. Evolution of multiscale detection techniques in object detection. Detectors in this figure: VJ Det. [10], HOG Det. [12], DPM [13], Exemplar SVM [32], Overfeat [65], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], DNN Det. [66], YOLO [20], SSD [23], Unified Det. [67], FPN [24], RetinaNet [25], RefineDet [38], Cascade R-CNN [68], Swin Transformer [44], FCOS [41], YOLOv4 [22], CornerNet [26], CenterNet [40], Reppoints [69], and DETR [28].

Feature Pyramids + Sliding Windows: After the VJ detector, researchers started to pay more attention to a more intuitive way of detection, i.e., building "feature pyramids + sliding windows." From 2004, a number of milestone detectors were built based on this paradigm, including the HOG detector, DPM, and so on. They typically slide a fixed-size detection window over the image, paying little attention to "different aspect ratios." To detect objects with a more complex appearance, Girshick et al. began to seek better solutions beyond the feature pyramid. The "mixture model" [15] was one solution at that time, i.e., to train multiple detectors for objects of different aspect ratios. Apart from this, exemplar-based detection [32], [70] provided another solution by training individual models for every object instance (exemplar).

Detection With Object Proposals: Object proposals refer to a group of class-agnostic reference boxes that are likely to contain any objects. Detection with object proposals helps to avoid the exhaustive sliding-window search across an image. We refer readers to the following papers for a comprehensive review of this topic [71], [72]. Early proposal detection methods followed a bottom-up detection philosophy [73], [74]. After 2014, with the popularity of deep CNNs in visual recognition, top-down, learning-based approaches began to show more advantages in this problem [19], [75], [76]. Proposal detection has gradually slipped out of sight since the rise of one-stage detectors.

Deep Regression and Anchor-Free Detection: In recent years, with the increase of GPUs' computing power, multiscale detection has been handled in an increasingly straightforward and brute-force way. The idea of using deep regression to solve multiscale problems is simple, i.e., to directly predict the coordinates of a bounding box based on the deep learning features [20], [66]. After 2018, researchers began to think about the object detection problem from the perspective of keypoint detection. These methods often follow two ideas: one is the group-based method that detects keypoints (corners, centers, or representative points) and then conducts objectwise grouping [26], [53], [69], [77]; the other is the group-free method that regards an object as one/many points and then regresses the object attributes (size, ratio, and so on) with reference to those points [40], [41].

Multireference/Multiresolution Detection: Multireference detection is now the most widely used method for multiscale detection [19], [22], [23], [41], [47], [51]. Its main idea is to first define a set of references (a.k.a. anchors, including boxes and points) at every location of an image and then predict the detection box based on these references. Another popular technique is multiresolution detection [23], [24], [44], [67], [68], i.e., detecting objects of different scales at different layers of the network. Multireference and multiresolution detection have now become two basic building blocks in state-of-the-art object detection systems.
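A minimal sketch of multireference detection's first step, tiling anchors over a feature map, is given below; the stride, scales, and aspect ratios are illustrative assumptions, not any particular detector's configuration.

```python
import numpy as np

def make_anchors(fm_h, fm_w, stride, scales=(64, 128, 256),
                 ratios=(0.5, 1.0, 2.0)):
    """Tile a set of reference boxes (anchors) at every feature-map cell.

    Returns an (fm_h * fm_w * len(scales) * len(ratios), 4) array of
    boxes in (cx, cy, w, h) image coordinates.
    """
    anchors = []
    for y in range(fm_h):
        for x in range(fm_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                for r in ratios:
                    # keep the anchor area s*s while varying aspect ratio
                    w, h = s * np.sqrt(r), s / np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

print(make_anchors(50, 50, stride=16).shape)  # (22500, 4)
```

The network then predicts, for each anchor, a class score and offsets that deform the reference into the final detection box.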

2) Technical Evolution of Context Priming:

Visual objects are usually embedded in a typical context with their surrounding environments. Our brain takes advantage of the associations among objects and environments to facilitate visual perception and cognition [96]. Context priming has long been used to improve detection. Fig. 6 shows the evolution of context priming in object detection.

Fig. 6. Evolution of context priming in object detection. Detectors in this figure: Face Det. [78], MultiPath [79], GBDNet [80], [81], CC-Net [82], MultiRegion-CNN [83], CoupleNet [84], DPM [14], [15], StructDet [85], ION [86], RFCN++ [87], RBFNet [88], TridentNet [39], Non-Local [89], DETR [28], CtxSVM [90], PersonContext [91], SMN [92], RelationNet [93], SIN [94], and RescoringNet [95].

Detection With Local Context: Local context refers to the visual information in the area that surrounds the object to detect. It has long been acknowledged that local context helps improve object detection. In the early 2000s, Sinha and Torralba [78] found that the inclusion of local contextual regions, such as the facial bounding contour, substantially improves face detection performance. Dalal and Triggs [12] also found that incorporating a small amount of background information improves the accuracy of pedestrian detection. Recent deep learning-based detectors can also be improved with local context by simply enlarging the networks’ receptive field or the size of object proposals [79], [80], [81], [82], [83], [84], [97].

Detection With Global Context: Global context exploits scene configuration as an additional source of information for object detection. For early detectors, a common way of integrating global context was to integrate a statistical summary of the elements that comprise the scene, such as Gist [96]. For recent detectors, there are two methods to integrate the global context. The first is to take advantage of deep convolution, dilated convolution, deformable convolution, and pooling operations [39], [87], [88] to obtain a large receptive field (even larger than the input image); more recently, researchers have explored attention-based mechanisms (Non-Local, Transformers, and so on) to achieve a full-image receptive field, with great success [28], [89]. The second is to treat the global context as a kind of sequential information and to learn it with recurrent neural networks [86], [98].

Context Interactive: Context interaction refers to the constraints and dependencies conveyed between visual elements. Some recent studies suggested that modern detectors can be improved by considering context interactions. These improvements can be grouped into two categories, where the first is to explore the relationships between individual objects [15], [85], [90], [92], [93], [95], and the second is to explore the dependencies between objects and scenes [91], [94].

3) Technical Evolution of Hard Negative Mining:

The training of a detector is essentially an imbalanced learning problem. In the case of sliding window-based detectors, the imbalance between backgrounds and objects could be as extreme as 10^7:1 [71]. In this case, using all backgrounds will be harmful to training as the vast number of easy negatives will overwhelm the learning process. HNM aims to overcome this problem. The technical evolution of HNM is shown in Fig. 7.

Fig. 7. Evolution of HNM techniques in object detection. Detectors in this figure: Face Det. [99], Haar Det. [100], VJ Det. [10], HOG Det. [12], DPM [13], [15], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [23], FasterPed [101], OHEM [102], RetinaNet [25], RefineDet [38], FCOS [41], and YOLOv4 [22].

Bootstrap: Bootstrap in object detection refers to a group of training techniques in which the training starts with a small part of the background samples and then iteratively adds new misclassified samples. In early detectors, bootstrap was commonly used with the purpose of reducing the training computations over millions of background windows [10], [99], [100]. Later, it became a standard technique in DPM and HOG detectors [12], [13] for solving the data imbalance problem.

HNM in Deep Learning-Based Detectors: In the deep learning era, due to the increase of computing power, bootstrap was briefly discarded in object detection during 2014-2016 [16], [17], [18], [19], [20]. To ease the data imbalance problem during training, detectors such as Faster RCNN and YOLO simply balance the weights between the positive and negative windows. However, researchers later noticed that this cannot completely solve the imbalance problem [25]. To this end, bootstrap was reintroduced to object detection after 2016 [23], [38], [101], [102]. An alternative improvement is to design new loss functions [25] by reshaping the standard cross entropy loss so that it puts more focus on hard, misclassified examples.

4) Technical Evolution of Loss Function:

The loss function measures how well the model matches the data (i.e., the deviation of the predictions from the true labels). Calculating the loss yields the gradients of the model weights, which can subsequently be updated by backpropagation to better suit the data. Classification loss and localization loss make up the supervision of the object detection problem [see (1)]. A general form of the loss function can be written as follows:

$$L(p, p^*, t, t^*) = L_{cls.}(p, p^*) + \beta I(t) L_{loc.}(t, t^*)$$

$$I(t) = \begin{cases} 1, & \text{IoU}\{a, a^*\} > \eta \\ 0, & \text{else} \end{cases} \tag{1}$$

where $t$ and $t^*$ are the locations of the predicted and ground-truth bounding boxes, and $p$ and $p^*$ are their category probabilities. $\text{IoU}\{a, a^*\}$ is the IoU between the reference box/point $a$ and its ground truth $a^*$. $\eta$ is an IoU threshold, say, 0.5. If an anchor box/point does not match any object, its localization loss does not count in the final loss.

Classification Loss: It is used to evaluate the divergence of the predicted category from the actual category. It was not thoroughly investigated in previous work; for example, YOLOv1 [20] and YOLOv2 [51] employ mean square error (MSE)/L2 loss. Later, the cross-entropy (CE) loss became the typical choice [21], [23], [47]. L2 loss is a measure in Euclidean space, whereas CE loss measures distribution differences (a form of likelihood). Since the classification prediction is a probability, CE loss is preferable to L2 loss, with a greater misclassification cost and a weaker gradient vanishing effect. To improve categorization efficiency, label smoothing has been proposed to enhance the model's generalization ability and address overconfidence on noisy labels [103], [104], and focal loss is designed to address category imbalance and differences in classification difficulty [25].
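As a small illustration, recent PyTorch releases expose label smoothing directly through the cross-entropy API (the `label_smoothing` argument, available in newer versions); the class count and smoothing factor below are arbitrary examples. A focal loss sketch appears earlier, under RetinaNet.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 21)            # e.g., 20 VOC classes + background
targets = torch.randint(0, 21, (8,))

ce = F.cross_entropy(logits, targets)  # standard CE loss
# Label smoothing: spread a little probability mass over non-target
# classes to curb overconfidence on noisy labels.
ce_smooth = F.cross_entropy(logits, targets, label_smoothing=0.1)
```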

Localization Loss: It is used to optimize position and size deviations. L2 loss was prevalent in early research [16], [20], [51], but it is highly affected by outliers and prone to gradient explosion. Combining the benefits of L1 loss and L2 loss, researchers proposed the Smooth L1 loss [18], as illustrated in the following formula:

$$\text{Smooth}_{L1}(x) = \begin{cases} 0.5x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{else} \end{cases} \tag{2}$$

where $x$ denotes the difference between the target and predicted values. When calculating the error, the above losses treat the four numbers $(x, y, w, h)$ representing a bounding box as independent variables; however, a correlation exists between them. Moreover, IoU is utilized to determine whether the prediction box corresponds to the actual ground-truth box in evaluation. Equal Smooth L1 values will have totally different IoU values; hence, the IoU loss [105] is introduced as follows:

$$\text{IoU loss} = -\log(\text{IoU}). \tag{3}$$

Following that, several algorithms improved the IoU loss. Generalized IoU (G-IoU) [106] improved the case where the IoU loss cannot optimize nonoverlapping bounding boxes, i.e., IoU = 0. According to Distance-IoU [107], a successful detection regression loss should consider three geometric factors: overlap area, center point distance, and aspect ratio. Thus, based on the IoU loss and the G-IoU loss, the distance IoU (DIoU) [107] adds a penalty on the distance between the center points of the prediction and the ground truth, and the complete IoU (CIoU) [107] further considers the aspect ratio difference on the basis of DIoU.
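A sketch of the G-IoU loss in PyTorch follows (the box layout and mean reduction are assumptions); the enclosing-box penalty is what keeps the gradient alive when the two boxes do not overlap.

```python
import torch

def giou_loss(pred, gt):
    """G-IoU loss for boxes (x1, y1, x2, y2); both inputs are (N, 4)."""
    # intersection
    ix1 = torch.max(pred[:, 0], gt[:, 0]); iy1 = torch.max(pred[:, 1], gt[:, 1])
    ix2 = torch.min(pred[:, 2], gt[:, 2]); iy2 = torch.min(pred[:, 3], gt[:, 3])
    inter = (ix2 - ix1).clamp(0) * (iy2 - iy1).clamp(0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    union = area_p + area_g - inter
    iou = inter / union
    # smallest axis-aligned box enclosing both boxes
    ex1 = torch.min(pred[:, 0], gt[:, 0]); ey1 = torch.min(pred[:, 1], gt[:, 1])
    ex2 = torch.max(pred[:, 2], gt[:, 2]); ey2 = torch.max(pred[:, 3], gt[:, 3])
    enclose = (ex2 - ex1) * (ey2 - ey1)
    giou = iou - (enclose - union) / enclose  # penalize wasted enclosing area
    return (1 - giou).mean()
```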

5) Technical Evolution of Nonmaximum Suppression:

As the neighboring windows usually have similar detection scores, the nonmaximum suppression is used as a postprocessing step to remove the replicated bounding boxes and obtain the final detection result. In the early times of object detection, NMS was not always integrated [121]. This is because the desired output of an object detection system was not entirely clear at that time. Fig. 8 shows the evolution of NMS in the past 20 years.

Fig. 8. Evolution of nonmax suppression (NMS) techniques in object detection from 1994 to 2021: 1) greedy selection; 2) bounding box aggregation; 3) learning to NMS; and 4) NMS-free detection. Detectors in this figure: Face Det. [108], HOG Det. [12], DPM [13], [15], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [23], FPN [24], RetinaNet [25], FCOS [41], StrucDet [85], MAP-Det [109], LearnNMS [110], RelationNet [93], Learn2Rank [111], SoftNMS [112], FitnessNMS [113], SofterNMS [114], AdaptiveNMS [115], DIoUNMS [107], Overfeat [65], APC-NMS [116], MAPC [117], WBF [118], ClusterNMS [119], CenterNet [40], DETR [28], and POTO [120].

Greedy Selection: It is an old-fashioned but still the most popular way to perform NMS. The idea behind it is simple and intuitive: for a set of overlapped detections, the bounding box with the maximum detection score is selected, while its neighboring boxes are removed according to a predefined overlap threshold. Although greedy selection has now become the de facto method for NMS, there is still room for improvement. First, the top-scoring box may not be the best fit. Second, it may suppress nearby objects. Finally, it does not suppress false positives [116]. Many works have been proposed to solve these problems [107], [112], [114], [115].
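For reference, greedy selection fits in a short NumPy function; this sketch assumes (x1, y1, x2, y2) boxes of a single class, with the 0.5 overlap threshold as an illustrative default.

```python
import numpy as np

def greedy_nms(boxes, scores, iou_thresh=0.5):
    """Classic greedy NMS: keep the top-scoring box, drop its neighbors
    whose IoU exceeds `iou_thresh`, and repeat with the remainder.
    boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,)."""
    order = scores.argsort()[::-1]  # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # vectorized IoU of box i against all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= iou_thresh]  # suppress heavy overlaps
    return keep
```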

Bounding Box Aggregation: BB aggregation is another group of techniques for NMS [10], [65], [116], [117] with the idea of combining or clustering multiple overlapped bounding boxes into one final detection. The advantage of this type of method is that it takes full consideration of object relationships and their spatial layout [118], [119]. Some well-known detectors use this method, such as the VJ detector [10] and the Overfeat (winner of ILSVRC-13 localization task) [65].

Learning-Based NMS: A group of NMS improvements that have recently received much attention is learning-based NMS [85], [93], [109], [110], [111], [122]. The main idea is to think of NMS as a filter to rescore all raw detections and to train the NMS as part of a network in an end-to-end fashion, or to train a network to imitate NMS's behavior. These methods have shown promising results in improving occluded and dense object detection over traditional handcrafted NMS methods.

NMS-Free Detector: To get rid of NMS and achieve a fully end-to-end detection network, researchers have developed a series of one-to-one label assignment methods (a.k.a. one object with just one prediction box) [28], [40], [120]. These methods frequently adhere to a rule of using the highest-quality box for training in order to be NMS-free. NMS-free detectors are more similar to the human visual perception system and are also a possible way to the future of object detection.

SECTION III.

Speedup of Detection

The acceleration of a detector has long been a challenging problem. The speedup techniques in object detection can be divided into three levels of groups: the speedup of “detection pipeline,” “detector backbone,” and “numerical computation,” as shown in Fig. 9. Refer to [123] for a more detailed version.

Fig. 9. Overview of the speedup techniques in object detection.

A. Feature Map Shared Computation

Among the different computational stages of a detector, feature extraction usually dominates the amount of computation. The most commonly used idea to reduce the feature computation redundancy is to compute the feature map of the whole image only once [18], [19], [124], which has achieved tens or even hundreds of times of acceleration.

B. Cascaded Detection

Cascaded detection is a commonly used technique [10], [125]. It takes a coarse-to-fine detection philosophy: to filter out most of the simple background windows using simple calculations and then to process those more difficult windows with complex ones. In recent years, cascaded detection has been especially applied to those detection tasks of “small objects in large scenes,” e.g., face detection [126], [127] and pedestrian detection [101], [124], [128].

C. Network Pruning and Quantification

“Network pruning” and “network quantification” are two commonly used methods to speed up a CNN model. The former refers to pruning the network structure or weights, and the latter refers to reducing their code length. The research of “network pruning” can be traced back to as early as the 1980s [129]. The recent network pruning methods usually take an iterative training and pruning process, i.e., to remove only a small group of unimportant weights after each stage of training, and to repeat those operations [130]. The recent works on network quantification mainly focus on network binarization, which aims to compress a network by quantifying its activations or weights to binary variables (say, 0/1) so that the floating-point operation is converted to logical operations.
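As a small illustration of one pruning stage, PyTorch's pruning utilities can zero out the lowest-magnitude weights of a layer; the layer shape and 20% sparsity below are arbitrary choices, and a real pipeline interleaves such stages with retraining.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

conv = nn.Conv2d(64, 128, kernel_size=3)

# One pruning stage: zero out the 20% of weights with the smallest
# L1 magnitude; repeating this after each round of retraining gives
# the iterative train-and-prune process described above.
prune.l1_unstructured(conv, name="weight", amount=0.2)
sparsity = (conv.weight == 0).float().mean().item()
print(f"sparsity after one stage: {sparsity:.2f}")  # ~0.20

# Make the pruning permanent (fold the mask into the weight tensor).
prune.remove(conv, "weight")
```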

D. Lightweight Network Design

The last group of methods to speed up a CNN-based detector is to directly design lightweight networks. In addition to some general designing principles, such as “fewer channels and more layers” [131], some other methods have been proposed in recent years [132], [133], [134], [135], [136].

1) Factorizing Convolutions:

Factorizing convolutions is the most straightforward way to build a lightweight CNN model. There are two groups of factorizing methods. The first group is to factorize a large convolution filter into a set of small ones [50], [87], [137], as shown in Fig. 10(b). For example, one can factorize a 7 × 7 filter into three 3 × 3 filters, where they share the same receptive field, but the latter one is more efficient. The second group is to factorize convolutions in their channel dimension [138], [139], as shown in Fig. 10(c).

Fig. 10. Overview of speedup methods of a CNN's convolutional layer and the comparison of their computational complexity. (a) Standard convolution: $\mathcal{O}(dk^2c)$. (b) Factoring convolutional filters ($k \times k \rightarrow (k' \times k')^2$ or $1 \times k$, $k \times 1$): $\mathcal{O}(dk'^2c)$ or $\mathcal{O}(dkc)$. (c) Factoring convolutional channels: $\mathcal{O}(d'k^2c) + \mathcal{O}(dk^2d')$. (d) Group convolution (#groups $= m$): $\mathcal{O}(dk^2c/m)$. (e) Depthwise separable convolution: $\mathcal{O}(ck^2) + \mathcal{O}(dc)$.

2) Group Convolution:

Group convolution aims to reduce the number of parameters in a convolution layer by dividing the feature channels into different groups and then convolving on each group independently [140], [141], as shown in Fig. 10(d). If we evenly divide the features into m groups, without changing other configurations, the computation will be theoretically reduced to 1/m of that before.

3) Depthwise Separable Convolution:

Depthwise separable convolution [142], as shown in Fig. 10(e), can be viewed as a special case of group convolution where the number of groups is set equal to the number of channels. Usually, a number of 1 × 1 filters are used to make a dimension transform so that the final output will have the desired number of channels. By using depthwise separable convolution, the computation can be reduced from $\mathcal{O}(dk^2c)$ to $\mathcal{O}(ck^2) + \mathcal{O}(dc)$. This idea has recently been applied to object detection and fine-grained classification [143], [144], [145].
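The parameter saving is easy to verify in PyTorch; this sketch (channel counts are illustrative) contrasts a standard 3 × 3 convolution with its depthwise separable counterpart, which also covers group convolution as the special case groups = in_channels.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv (groups = in_channels) followed by a 1x1
    pointwise conv that mixes channels and sets the output width."""

    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

std = nn.Conv2d(256, 256, 3, padding=1)
sep = DepthwiseSeparableConv(256, 256)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(std), count(sep))  # ~590k vs ~68k parameters
```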

4) Bottle-Neck Design:

A bottleneck layer in a neural network contains few nodes compared to the previous layers. In recent years, the bottle-neck design has been widely used for designing lightweight networks [50], [133], [146], [147], [148]. Among these methods, the input layer of a detector can be compressed to reduce the amount of computation from the very beginning of the detection [133], [146], [147]. One can also compress the feature map to make it thinner so as to speed up subsequent detection [50], [148].

5) Detection With NAS:

Deep learning-based detectors are becoming increasingly sophisticated, relying heavily on handcrafted network architecture and training parameters. Neural architecture search (NAS) is primarily concerned with defining the proper space of candidate networks, improving strategies for searching quickly and accurately, and validating the searching results at a low cost. When designing a detection model, NAS can reduce the need for human intervention in the design of the network backbone and anchor boxes [149], [150], [151], [152], [153], [154], [155].

E. Numerical Acceleration

Numerical acceleration aims to accelerate object detectors from the bottom of their implementations.

1) Speedup With Integral Image:

The integral image is an important method in image processing. It helps to rapidly calculate summations over image subregions. The essence of the integral image is the integral-differential separability of convolution in signal processing:

$$f(x) * g(x) = \left(\int f(x)\,dx\right) * \frac{dg(x)}{dx} \tag{4}$$

where, if $dg(x)/dx$ is a sparse signal, then the convolution can be accelerated by the right part of this equation [10], [156]. The integral image can also be used to speed up more general features in object detection, e.g., the color histogram and gradient histogram [124], [157], [158], [159]. A typical example is to speed up HOG by computing integral HOG maps [124], [157], as shown in Fig. 11. The integral HOG map has been used in pedestrian detection and has achieved dozens of times' acceleration without losing any accuracy [124].

Fig. 11. Illustration of how to compute the "Integral HOG Map" [124]. With integral image techniques, we can efficiently compute the histogram feature of any location and any size with constant computational complexity.
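A minimal NumPy sketch of the summed-area table and its constant-time box-sum query, the trick illustrated in Fig. 11, follows; the image and box coordinates are arbitrary examples.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero border: ii[y, x] = sum of img[:y, :x]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = img.cumsum(0).cumsum(1)
    return ii

def box_sum(ii, y1, x1, y2, x2):
    """Sum over img[y1:y2, x1:x2] with four lookups, O(1) per box."""
    return ii[y2, x2] - ii[y1, x2] - ii[y2, x1] + ii[y1, x1]

img = np.random.rand(480, 640)
ii = integral_image(img)
print(np.isclose(box_sum(ii, 100, 200, 164, 264),
                 img[100:164, 200:264].sum()))  # True
```

An integral HOG map simply keeps one such table per orientation bin, so any rectangular cell's histogram is obtained in constant time.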

2) Speedup in Frequency Domain:

Convolution is an important type of numerical operation in object detection. As the detection with a linear detector can be viewed as the windowwise inner product between the feature map and the detector's weights, which can be implemented by convolutions, the Fourier transform is a very practical way to speed up convolutions. The theoretical basis is the convolution theorem in signal processing: under suitable conditions, the Fourier transform of the convolution of two signals $I * W$ is the pointwise product in their Fourier space

$$I * W = \mathcal{F}^{-1}\big(\mathcal{F}(I) \odot \mathcal{F}(W)\big) \tag{5}$$

where $\mathcal{F}$ is the Fourier transform, $\mathcal{F}^{-1}$ is the inverse Fourier transform, and $\odot$ is the pointwise product. The above calculation can be accelerated by using the fast Fourier transform (FFT) and the inverse FFT (IFFT) [160], [161], [162], [163].
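The convolution theorem can be checked numerically in a few lines of NumPy (the signal and kernel sizes are arbitrary); the same pointwise-product route is what FFT-based detector implementations exploit.

```python
import numpy as np

# Full linear convolution of a feature-map row with a filter, computed
# once directly and once as a pointwise product in Fourier space
# (both signals zero-padded to the full output length n).
signal = np.random.rand(4096)
kernel = np.random.rand(127)
n = len(signal) + len(kernel) - 1

direct = np.convolve(signal, kernel)  # O(N * K)
fft = np.fft.irfft(np.fft.rfft(signal, n) * np.fft.rfft(kernel, n), n)

print(np.allclose(direct, fft))  # True; the FFT route costs O(N log N)
```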

3) Vector Quantization:

Vector quantization (VQ) is a classical quantization method in signal processing that aims to approximate the distribution of a large group of data by a small set of prototype vectors. It can be used for data compression and accelerating the inner product operation in object detection [164], [165].

SECTION IV.

Recent Advances in Object Detection

The continual appearance of new technologies over the past two decades has had a considerable influence on object detection, while its fundamental principles and underlying logic have remained unchanged. The above sections introduced the evolution of these technologies over a long time scale to help readers comprehend object detection as a whole; this section focuses on the state-of-the-art algorithms of recent years over a much shorter time scale. Some are expansions of previously discussed techniques (e.g., Sections IV-A-IV-E), while others are novel crossovers that mix concepts (e.g., Sections IV-F-IV-H).

A. Beyond Sliding Window Detection

Since an object in an image can be uniquely determined by the upper left and lower right corners of its ground-truth box, the detection task can be equivalently framed as a pairwise keypoint localization problem. One recent implementation of this idea is to predict a heat map for the corners [26]. Some other methods follow this idea and utilize more keypoints (corner and center [77], extreme and center points [53], and representative points [69]) to obtain better performance. Another paradigm views an object as one or more points and directly predicts the object's attributes (e.g., height and width) without grouping. The advantage of this approach is that it can be implemented under a semantic segmentation framework, and there is no need to design multiscale anchor boxes. Furthermore, by viewing object detection as a set prediction problem, DETR [28], [43] completely liberates it from the reference-based framework.

B. Robust Detection of Rotation and Scale Changes

In recent years, efforts have been made toward the robust detection of rotation and scale changes.

1) Rotation Robust Detection:

Object rotation is commonly seen in face detection, text detection, and remote sensing object detection. The most straightforward solution to this problem is to perform data augmentation so that an object in any orientation can be well covered by the augmented data distribution [166] or to train independent detectors separately for each orientation [167], [168]. Designing rotation-invariant loss functions is a recently popular solution, where a constraint is added to the detection loss so that the features of rotated objects remain unchanged [169], [170], [171]. Another recent solution is to learn geometric transformations of the object candidates [172], [173], [174], [175]. In two-stage detectors, ROI pooling aims to extract a fixed-length feature representation for an object proposal with any location and size. Since feature pooling is usually performed in Cartesian coordinates, it is not invariant to rotation transforms. A recent improvement is to perform ROI pooling in polar coordinates so that the features can be robust to rotation changes [167].

2) Scale Robust Detection:

Recent studies address scale-robust detection at both the training and detection stages.

Scale Adaptive Training: Modern detectors usually rescale input images to a fixed size and back propagate the loss of the objects in all scales. A drawback of doing this is that there will be a “scale imbalance” problem. Building an image pyramid during detection could alleviate this problem but not fundamentally [49], [178]. A recent improvement is Scale Normalization for Image Pyramids (SNIP) [176], which builds image pyramids at both training and detection stages and only backpropagates the loss of some selected scales, as shown in Fig. 12. Some researchers have further proposed a more efficient training strategy: SNIP with Efficient Resampling (SNIPER) [177], i.e., to crop and rescale an image to a set of subregions so as to benefit from large batch training.

Fig. 12. Different training strategies for multiscale object detection. (a) Training on a single-resolution image, backpropagating objects of all scales [17], [18], [19], [23]. (b) Training on multiresolution images (image pyramid), backpropagating objects of the selected scale. If an object is too large or too small, its gradient is discarded [39], [176], [177].

Scale Adaptive Detection: In CNN-based detectors, the size and the aspect ratio of anchors are usually carefully designed. A drawback of doing this is that the configurations cannot be adaptive to unexpected scale changes. To improve the detection of small objects, some “adaptive zoom-in” techniques are proposed in some recent detectors to adaptively enlarge the small objects into the “larger ones” [179], [180]. Another recent improvement is to predict the scale distribution of objects in an image and then adaptively rescale the image according to it [181], [182].

C. Detection With Better Backbones

The accuracy/speed of a detector depends heavily on the feature extraction network, a.k.a. the backbone, e.g., ResNet [178], CSPNet [183], Hourglass [184], and Swin Transformer [44]. For a detailed introduction to some important detection backbones in the deep learning era, we refer readers to the following survey [185]. Fig. 13 shows the detection accuracy of three well-known detection systems, Faster RCNN [19], R-FCN [49], and SSD [23], with different backbones [186]. Object detection has recently benefited from the powerful feature extraction capabilities of Transformers: on the COCO dataset, the top-ten detection methods are all Transformer-based, and the performance gap between Transformers and CNNs has gradually widened.

Fig. 13. Comparison of detection accuracy of three detectors: Faster RCNN [19], R-FCN [49], and SSD [23] on the MS-COCO dataset with different detection backbones. Image from CVPR 2017 [186].

D. Improvements of Localization

To improve localization accuracy, recent detectors adopt two groups of methods: 1) bounding box refinement and 2) new loss functions for accurate localization.

1) Bounding Box Refinement:

The most intuitive way to improve localization accuracy is bounding box refinement, which can be considered postprocessing of the detection results. One recent method is to iteratively feed the detection results into a bounding box (BB) regressor until the predictions converge to a correct location and size [187], [188], [189]. However, some researchers have claimed that this method does not guarantee monotonic improvement of localization accuracy [187] and may even degrade localization if the refinement is applied many times.
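The iteration loop itself is simple; below is a hedged sketch in which `bb_regressor`, the tolerance, and the toy regressor are all illustrative placeholders for the detector's real regression head.

```python
# A hedged sketch of iterative bounding-box refinement; `bb_regressor` stands
# in for the detector's regression head applied to re-pooled features.
import torch

def iterative_refine(boxes, bb_regressor, num_iters=3, tol=1.0):
    """Repeatedly regress boxes until predictions stop moving (in pixels).
    As noted in [187], more iterations do not guarantee better localization,
    so the iteration count is usually kept small."""
    for _ in range(num_iters):
        refined = bb_regressor(boxes)
        if (refined - boxes).abs().max() < tol:  # converged to a stable box
            return refined
        boxes = refined
    return boxes

# Toy regressor that nudges boxes toward a fixed target, for illustration only.
target = torch.tensor([[50., 50., 150., 150.]])
toy_regressor = lambda b: b + 0.5 * (target - b)
out = iterative_refine(torch.tensor([[0., 0., 100., 100.]]), toy_regressor)
```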

2) New Loss Functions for Accurate Localization:

In most modern detectors, object localization is formulated as a coordinate regression problem. The drawbacks of this paradigm are obvious. First, the regression loss does not correspond to the final evaluation of localization, especially for objects with very large aspect ratios. Second, the traditional BB regression method does not provide a confidence measure for localization; when multiple bounding boxes overlap with each other, this may lead to failures in nonmaximum suppression. These problems can be alleviated by designing new loss functions. The most intuitive improvement is to directly use IoU as the localization loss [105], [106], [107], [190]. In addition, some researchers have tried to improve localization under a probabilistic inference framework [191]. Different from previous methods that directly predict the box coordinates, this method predicts the probability distribution of a bounding box's location.
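The generic idea behind IoU-based losses can be written in a few lines; the sketch below illustrates the common form shared by [105], [106], [107], and [190] rather than any single paper's implementation.

```python
# A minimal IoU loss sketch for axis-aligned boxes given as (x1, y1, x2, y2).
import torch

def iou_loss(pred, target, eps=1e-7):
    # Intersection rectangle.
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter
    iou = inter / (union + eps)
    return (1.0 - iou).mean()  # loss directly matches the evaluation metric

pred = torch.tensor([[10., 10., 60., 60.]], requires_grad=True)
gt = torch.tensor([[20., 20., 70., 70.]])
loss = iou_loss(pred, gt)
loss.backward()  # gradients flow through the IoU itself
```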

E. Learning With Segmentation Loss

Object detection and semantic segmentation are two fundamental tasks in computer vision. Recent studies suggest that object detection can be improved by learning with semantic segmentation losses.

To improve detection with segmentation, the simplest way is to treat the segmentation network as a fixed feature extractor and integrate it into a detector as auxiliary features [83], [192], [193]. The advantage of this approach is that it is easy to implement; the disadvantage is that the segmentation network brings additional computation.

Another way is to introduce an additional segmentation branch on top of the original detector and train this model with multitask loss functions (seg. + det.) [4], [42], [192]. The advantage is that the segmentation branch is removed at the inference stage, so the detection speed is not affected. The disadvantage is that training requires pixel-level image annotations.
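The multitask setup can be sketched as follows; the module shapes, the stand-in heads, and the loss weight are all illustrative assumptions.

```python
# A hedged sketch of multitask training with an auxiliary segmentation branch;
# `det_head`, `seg_head`, and the loss weight are illustrative placeholders.
import torch
import torch.nn as nn

class DetWithSegBranch(nn.Module):
    def __init__(self, backbone, det_head, seg_head):
        super().__init__()
        self.backbone, self.det_head, self.seg_head = backbone, det_head, seg_head

    def forward(self, x):
        feats = self.backbone(x)
        return self.det_head(feats), self.seg_head(feats)

def multitask_loss(det_loss, seg_logits, seg_labels, seg_weight=0.5):
    # The segmentation term requires pixel-level labels at training time,
    # but the segmentation branch can be dropped entirely at inference.
    seg_loss = nn.functional.cross_entropy(seg_logits, seg_labels)
    return det_loss + seg_weight * seg_loss

backbone = nn.Conv2d(3, 8, 3, padding=1)   # stand-ins for real modules
det_head = nn.AdaptiveAvgPool2d(1)
seg_head = nn.Conv2d(8, 21, 1)             # 21-class pixel logits
model = DetWithSegBranch(backbone, det_head, seg_head)
det_out, seg_logits = model(torch.rand(2, 3, 64, 64))
labels = torch.randint(0, 21, (2, 64, 64))
loss = multitask_loss(det_out.mean(), seg_logits, labels)
```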

F. Adversarial Training

The generative adversarial network (GAN), introduced by Goodfellow et al. [194] in 2014, has received great attention in many tasks, such as image generation [194], [195], image style transfer [196], and image super-resolution [197].

Recently, adversarial training has also been applied to object detection, especially to improve the detection of small and occluded objects. For small object detection, GAN can be used to enhance the features of small objects by narrowing the representation gap between small and large objects [198], [199]. To improve the detection of occluded objects, one recent idea is to generate occlusion masks by using adversarial training [200]. Instead of generating examples in pixel space, the adversarial network directly modifies the features to mimic occlusion.
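A hedged sketch of feature-space occlusion in the spirit of [200] is given below; the mask network, the top-k dropping rule, and all shapes are illustrative assumptions.

```python
# A hedged sketch of adversarial occlusion in feature space: a small network
# predicts which ROI feature cells to drop; names here are illustrative.
import torch
import torch.nn as nn

roi_feats = torch.rand(4, 256, 7, 7)              # pooled ROI features
mask_net = nn.Sequential(nn.Conv2d(256, 1, 1), nn.Sigmoid())

occ_prob = mask_net(roi_feats)                    # per-cell occlusion probability
# Hard-drop the k most "adversarial" cells (here: highest predicted probability).
k = 10
flat = occ_prob.flatten(1)
idx = flat.topk(k, dim=1).indices
drop = torch.zeros_like(flat).scatter_(1, idx, 1.0).view_as(occ_prob)
occluded = roi_feats * (1.0 - drop)               # features with occlusion applied
# The detector is then trained on `occluded`, while mask_net is updated to
# *increase* the detection loss, yielding progressively harder occlusions.
```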

G. Weakly Supervised Object Detection

Training a deep learning-based object detector usually requires a large amount of manually labeled data. Weakly supervised object detection (WSOD) aims at easing the reliance on data annotation by training a detector with only image-level annotations instead of bounding boxes [201].

Multi-instance learning is a group of supervised learning algorithms that has seen widespread application in WSOD [202], [203], [204], [205], [206], [207], [208], [209]. Instead of learning from a set of individually labeled instances, a multi-instance learning model receives a set of labeled bags, each containing many instances. If we consider the object candidates in an image as a bag and the image-level annotation as its label, then WSOD can be formulated as a multi-instance learning process.
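This bag-level formulation can be sketched as follows, in the general spirit of the multi-instance methods above (not any one cited paper); the branch names, shapes, and class count are illustrative.

```python
# A minimal sketch of WSOD as multi-instance learning: proposal scores are
# aggregated into an image-level prediction trained with image-level labels.
import torch
import torch.nn as nn

num_proposals, feat_dim, num_classes = 100, 512, 20
proposal_feats = torch.rand(num_proposals, feat_dim)   # one bag of instances

cls_branch = nn.Linear(feat_dim, num_classes)          # "what" each proposal is
det_branch = nn.Linear(feat_dim, num_classes)          # "which" proposals matter

cls_scores = torch.softmax(cls_branch(proposal_feats), dim=1)   # over classes
det_scores = torch.softmax(det_branch(proposal_feats), dim=0)   # over proposals
proposal_scores = cls_scores * det_scores

# Bag-level (image-level) score: sum over the instances in the bag.
image_scores = proposal_scores.sum(dim=0).clamp(1e-6, 1 - 1e-6)
image_labels = torch.zeros(num_classes); image_labels[3] = 1.0  # e.g., "cat" present
loss = nn.functional.binary_cross_entropy(image_scores, image_labels)
```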

Class activation mapping is another recent group of methods for WSOD [210], [211]. Research on CNN visualization has shown that the convolutional layers of a CNN behave as object detectors even though there is no supervision on the location of objects. Class activation mapping sheds light on how to endow a CNN with localization capability despite being trained only on image-level labels [212].
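The computation at the heart of class activation mapping [212] fits in a few lines; the shapes and the target class index below are illustrative.

```python
# A minimal class activation mapping (CAM) sketch: the last conv feature maps
# are weighted by the classifier weights of a target class to localize it.
import torch

conv_feats = torch.rand(1, 512, 14, 14)     # last conv layer output
fc_weights = torch.rand(1000, 512)          # classifier weights (GAP -> FC)
target_class = 285

# Weighted sum of feature maps using the target class's FC weights.
w = fc_weights[target_class].view(512, 1, 1)
cam = (w * conv_feats[0]).sum(dim=0)        # (14, 14) activation map
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-6)  # normalize to [0, 1]
# Upsampling `cam` to image size and thresholding it yields a coarse object box.
```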

In addition to the above approaches, some researchers have considered WSOD as a proposal ranking process, selecting the most informative regions and then training on these regions with image-level annotations [213]. Others have proposed masking out different parts of the image; if the detection score drops sharply, the masked region likely contains an object [214]. More recently, generative adversarial training has also been used for WSOD [215].

H. Detection With Domain Adaptation

The training process of most object detectors can essentially be viewed as likelihood estimation under the assumption of independent and identically distributed (i.i.d.) data. Object detection with non-i.i.d. data, especially in real-world applications, remains a challenge. Aside from collecting more data or applying proper data augmentation, domain adaptation offers the possibility of narrowing the gap between domains. To obtain domain-invariant feature representations, feature regularization and adversarial training-based methods have been explored at the image, category, or object level [216], [217], [218], [219], [220], [221]. Cycle-consistent transformation [222] has also been applied to bridge the gap between source and target domains [223], [224]. Some other methods incorporate both ideas [225] to achieve better performance.
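One common way to implement the adversarial alignment mentioned above is a gradient reversal layer; the sketch below shows this general mechanism (not any specific cited detector), with illustrative shapes and module names.

```python
# A hedged sketch of domain-adversarial feature alignment via gradient reversal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None   # flip gradients for the backbone

domain_clf = nn.Sequential(nn.Linear(256, 64), nn.ReLU(), nn.Linear(64, 2))

feats = torch.rand(8, 256, requires_grad=True)   # detector features
domain_labels = torch.randint(0, 2, (8,))        # 0: source, 1: target
logits = domain_clf(GradReverse.apply(feats, 1.0))
loss = nn.functional.cross_entropy(logits, domain_labels)
loss.backward()
# The classifier learns to tell domains apart, while the reversed gradient
# pushes the feature extractor toward domain-invariant representations.
```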

SECTION V.

Conclusion and Future Directions

Remarkable achievements have been made in object detection over the past 20 years. This article extensively reviews milestone detectors, key technologies, speedup methods, datasets, and metrics in its 20-year history. Promising future directions may include, but are not limited to, the following aspects, which we hope give readers insights beyond the schemes discussed above.

Lightweight Object Detection: This direction aims to speed up detection inference so that it can run on low-power edge devices. Important applications include mobile augmented reality, autonomous driving, smart cities, smart cameras, face verification, and so on. Although great effort has been made in recent years, the speed gap between machines and human eyes remains large, especially for detecting small objects or detecting with multisource information [226], [227].

End-to-End Object Detection: Although some methods have been developed to detect objects in a fully end-to-end manner (image to box in a network) using one-to-one label assignment training, the majority still use a one-to-many label assignment method where the nonmaximum suppression operation is separately designed. Future research on this topic may focus on designing end-to-end pipelines that maintain both high detection accuracy and efficiency [228].

Small Object Detection: Detecting small objects in large scenes has long been a challenge. Potential applications of this research direction include counting people in crowds or animals in the open air and detecting military targets from satellite images. Further directions may include the integration of visual attention mechanisms and the design of high-resolution lightweight networks [229], [230].

3-D Object Detection: Despite recent advances in 2-D object detection, applications such as autonomous driving rely on access to objects' locations and poses in the 3-D world. Future object detection research will pay more attention to the 3-D world and to the utilization of multisource and multiview data (e.g., RGB images and 3-D LiDAR points from multiple sensors) [231], [232].

Detection in Videos: Real-time object detection and tracking in HD videos is of great importance for video surveillance and autonomous driving. Traditional object detectors are usually designed for imagewise detection and simply ignore the correlations between video frames. Improving detection by exploiting spatial and temporal correlations under computational constraints is an important research direction [233], [234].

Cross-Modality Detection: Object detection with multiple sources/modalities of data, e.g., RGB-D images, LiDAR, flow, sound, text, and video, is of great importance for building a more accurate detection system that performs like human perception. Open questions include how to migrate well-trained detectors to different modalities of data, how to fuse multimodal information to improve detection, and so on [235], [236].

Toward Open-World Detection: Out-of-domain generalization, zero-shot detection, and incremental detection are emerging topics in object detection. Most existing methods devise ways to reduce catastrophic forgetting or to utilize supplemental information. Humans have the instinct to discover objects of unknown categories in the environment; when the corresponding knowledge (labels) is given, humans learn from it and retain the learned patterns. However, it is difficult for current object detection algorithms to detect unknown classes of objects. Open-world object detection aims at discovering unknown categories of objects when supervision signals are not given or only partially given, which holds great promise in applications such as robotics and autonomous driving [237], [238].

Standing on the highway of technical evolutions, we believe that this article will help readers to build a complete road map of object detection and find future directions of this fast-moving research field.
