Vision-centric Semantic Occupancy Prediction for Autonomous Driving | by Patrick Langechuan Liu | May, 2023

Here I will first summarize the explosion of research studies during the past year on a high level, and then follow up with a summary of the various technical details. Below is a diagram summarizing the overall development thread of the work to be reviewed. It is worth noting that the field is still rapidly evolving, and has yet to converge to a universally accepted dataset and evaluation metric.

The development timeline for the field of semantic occupancy prediction (source: created by the author)

MonoScene (CVPR 2022), first vision-input attempt

MonoScene is the first work to reconstruct outdoor scenes using only RGB images as inputs, as opposed to lidar point clouds that previous studies used. It is a single-camera solution, focusing on the front-camera-only SemanticKITTI dataset.

The architecture of MonoScene (source: MonoScene)

The paper proposes many ideas, but only one design choice seems critical — FLoSP (Feature Line of Sight Projection). This idea is similar to the idea of feature propagation along the line of sight, also adopted by OFT (BMVC 2019) or Lift-Splat-Shoot (ECCV 2020). Other novelties such as Context Relation Prior and unique losses inspired by directly optimizing the metrics seem not that useful according to the ablation study.

VoxFormer (CVPR 2023), significantly improved monoScene

The key insight of VoxFormer is that SOP/SSC has to address two issues simultaneously: scene reconstruction for visible areas and scene hallucination for occluded regions. VoxFormer proposes a reconstruct-and-densify approach. In the first reconstruction stage, the paper lifts RGB pixels to a pseudo-LiDAR point cloud with monodepth methods, and then voxelize it into initial query proposals. In the second densification stage, these sparse queries are enhanced with image features and use self-attention for label propagation to generate a dense prediction. VoxFormer significantly outperformed MonoScene on SemanticKITTI and is still a single-camera solution. The image feature enhancement architecture heavily borrows the deformable attention idea from BEVFormer.

The architecture of VoxFormer (source: VoxFormer)

TPVFormer (CVPR 2023), the first multi-camera attempt

TPVFormer is the first work to generalize 3D semantic occupancy prediction to a multi-camera setup and extends the idea of SOP/SSC from semanticKITTI to NuScenes.

The architecture of TPVFormer (source: TPVFormer)

TPVFormer extends the idea of BEV to three orthogonal axes. This allows the modeling of 3D without suppressing any axes and avoids cubic complexity. Concretely TPVFormer proposes two steps of attention to generating TPV features. First, it uses image cross-attention (ICA) to get TPV features. This essentially borrows the idea of BEVFormer and extends to the other two orthogonal directions to form a TriPlane View feature. Then it uses cross-view hybrid attention (CVHA) to enhance each TPV feature by attending to the other two.

The prediction is denser than supervision in TPVFormer, but still has gaps and holes (source: TPVFormer)

TPVFormer uses supervision from sparse lidar points from the vanilla NuScenes dataset, without any multiframe densification or reconstruction. It claimed that the model can predict denser and more consistent volume occupancy for all voxels at inference time, despite the sparse supervision at training time. However, the denser prediction is still not as dense as compared to later studies such as SurroundOcc which uses densified NuScenes dataset.

SurroundOcc (Arxiv 2023/03) and OpenOccupancy (Arxiv 2023/03), the first attempts at dense label supervision

SurroundOcc argues that dense prediction requires dense labels. The paper successfully demonstrated that denser labels can significantly improve the performance of previous methods, such as TPVFormer, by almost 3x. Its most significant contribution is a pipeline for generating dense occupancy ground truth without the need for costly human annotation.

GT generation pipeline of SurroundOcc (source: SurroundOcc)

The generation of dense occupancy labels involves two steps: multiframe data aggregation and densification. First, multi-frame lidar points of dynamic objects and static scenes are stitched separately. The accumulated data is denser than a single frame measurement, but it still has many holes and requires further densification. The densification is performed by Poisson Surface Reconstruction of a triangular mesh, and Nearest Neighbor (NN) to propagate the labels to newly filled voxels.

OpenOccupancy is contemporary to and similar in spirit to SurroundOcc. Like SurroundOcc, OpenOccupancy also uses a pipeline that first aggregates multiframe lidar measurements for dynamic objects and static scenes separately. For further densification, instead of Poisson Reconstruction adopted by SurroundOcc, OpenOccupancy uses an Augment-and-Purify (AAP) approach. Concretely, a baseline model is trained with the aggregated raw label, and its prediction result is used to fuse with the original label to generate a denser label (aka “augment”). The denser label is roughly 2x denser, and manually refined by human labelers (aka “purify”). A total of 4000 human hours were invested to refine the label for nuScenes, roughly 4 human hours per 20-second clip.

The architecture of SurroundOcc (source: SurroundOcc)
The architecture of CONet (source: OpenOccupancy)

Compared to the contribution in the creation of the dense label generation pipeline, the network architecture of SurroundOcc and OpenOccupancy are not as innovative. SurroundOcc is largely based on BEVFormer, with a coarse-to-fine step to enhance 3D features. OpenOccupancy proposes CONet (cascaded occupancy network) which uses an approach similar to that of Lift-Splat-Shoot to lift 2D features to 3D and then enhances 3D features through a cascaded scheme.

Occ3D (Arxiv 2023/04), the first attempt at occlusion reasoning

Occ3D also proposed a pipeline to generate dense occupancy labels, which includes point cloud aggregation, point labeling, and occlusion handling. It is the first paper that explicitly handles the visibility and occlusion reasoning of the dense label. Visibility and occlusion reasoning are critically important for the onboard deployment of SOP models. Special treatment on occlusion and visibility is necessary during training to avoid false positives from over-hallucination about the unobservable scene.

It is noteworthy that lidar visibility is different from camera visibility. Lidar visibility describes the completeness of the dense label, as some voxels are not observable even after multiframe data aggregation. It is consistent across the whole sequence. Meanwhile, camera visibility focuses on the possibility of detection of onboard sensors without hallucination and differs at each timestamp. Eval is only performed on the “visible” voxels in both the LiDAR and camera views.

In the preparation of dense labels, Occ3D only relies on the multiframe data aggregation and does not have the second densification stage as in SurroundOcc and OpenOccupancy. The authors claimed that for the Waymo dataset, the label is already quite dense without densification. For nuScenes, although the annotation still does have holes after point cloud aggregation, Poisson Reconstruction leads to inaccurate results, therefore no densification step is performed. Maybe the Augment-and-Purify approach by OpenOccupancy is more practical in this setting.

The architecture of CTF-Occ in Occ3D (source: Occ3D)

Occ3D also proposed a neural network architecture Coarse-to-Fine Occupancy (CTF-Occ). The coarse-to-fine idea is largely the same as that in OpenOccupancy and SurroundOcc. CTF-Occ proposed incremental token selection to reduce the computation burden. It also proposed an implicit decoder to output the semantic label of any given point, similar to the idea of Occupancy Networks.

The Semantic Occupancy Prediction studies reviewed above are summarized in the following table, in terms of network architecture, training losses, evaluation metrics, and detection range and resolution.

Source link

Leave a Comment