Autonomous driving paper index
CLS-3D: Content-Wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection in Autonomous Vehicles
One-line summary
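CLS-3D (Content-wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection) fuses LiDAR and camera features in a single multi-modal backbone, reweights the fused features slot-wise, and regresses boxes with an I3C-IoU loss.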
Engineering notes
On the KITTI and nuScenes benchmarks, the authors report state-of-the-art performance, with 89.52% 3D mAP and 94.08% BEV mAP. The abstract does not say which benchmark or split these two figures refer to.
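Chinese explanation
Chinese explanation pending: this site prioritizes adding Chinese-language notes for high-value papers on end-to-end autonomous driving, BEV perception, 3D object detection, trajectory prediction, path planning, and LiDAR perception.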
Original abstract
Accurate 3D object detection is vital in autonomous driving. Single-modal detectors, using either camera or LiDAR, struggle with issues like limited depth perception or difficulty in distinguishing semantically similar objects. While multimodal approaches aim to address these limitations by combining LiDAR and camera data, they often face complexities in integrating sparse and uneven point cloud distributions, resulting in inefficient feature fusion. To tackle these challenges, we propose CLS-3D (Content-wise LiDAR-Camera Fusion and Slot Reweighting Transformer for 3D Object Detection). This novel framework fuses LiDAR and camera features using a single multi-modal backbone and augments them with the semantic probabilities obtained from the image stream. Our method captures local and global spatial relationships through a slot reweighting mechanism and incorporates I3C-IoU loss for precise box regression. The semantically augmented features, via a single multi-modal backbone, are embedded using a content-based transformer and processed through a slot-wise auto-encoder structure with channel-wise positional embeddings and a feed-forward MLP network. Our model improves temporal consistency and detection accuracy by dynamically adjusting feature relevance through slot-wise reweighting. We further define a I3C-IoU metric, considering centre, overlap, and scale for enhanced box regression accuracy. This mechanism allows the model to focus on significant temporal information, enhancing its ability to learn complex sequences and improving the overall performance of 3D object detection, especially in challenging scenarios such as occlusion and long-range detection. Extensive experiments on the KITTI and nuScenes benchmark demonstrate that CLS-3D achieves state-of-the-art performance, with 89.52% 3D mAP and 94.08% BEV mAP, outperforming existing methods.
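The abstract says the I3C-IoU metric considers centre, overlap, and scale, but it gives no formula. The sketch below is therefore not the paper's definition. It only illustrates how a box-regression loss can combine those three ingredients, assuming axis-aligned 3D boxes (yaw is ignored), a DIoU-style centre penalty swapped in for the unknown centre term, and a simple relative size difference for the scale term. The names `Box3D`, `iou_3d`, `composite_iou_loss`, and both weight parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box3D:
    """Axis-aligned 3D box: centre (cx, cy, cz) and positive sizes (l, w, h)."""
    cx: float
    cy: float
    cz: float
    l: float
    w: float
    h: float

    def center(self):
        return (self.cx, self.cy, self.cz)

    def size(self):
        return (self.l, self.w, self.h)

    def bounds(self):
        """Return (min_corner, max_corner) as two 3-tuples."""
        lo = tuple(c - s / 2.0 for c, s in zip(self.center(), self.size()))
        hi = tuple(c + s / 2.0 for c, s in zip(self.center(), self.size()))
        return lo, hi


def iou_3d(a: Box3D, b: Box3D) -> float:
    """Volume intersection-over-union of two axis-aligned boxes."""
    (a_lo, a_hi), (b_lo, b_hi) = a.bounds(), b.bounds()
    inter = 1.0
    for axis in range(3):
        overlap = min(a_hi[axis], b_hi[axis]) - max(a_lo[axis], b_lo[axis])
        if overlap <= 0.0:
            return 0.0  # boxes are disjoint along this axis
        inter *= overlap
    vol_a = a.l * a.w * a.h
    vol_b = b.l * b.w * b.h
    return inter / (vol_a + vol_b - inter)


def composite_iou_loss(pred: Box3D, target: Box3D,
                       center_weight: float = 1.0,
                       scale_weight: float = 1.0) -> float:
    """Illustrative loss: (1 - IoU) + centre penalty + scale penalty.

    ASSUMPTION: this is a generic combination, not the paper's I3C-IoU.
    """
    (p_lo, p_hi), (t_lo, t_hi) = pred.bounds(), target.bounds()

    # Centre term (DIoU-style): squared centre distance divided by the
    # squared diagonal of the smallest box enclosing both boxes.
    center_sq = sum((p - t) ** 2
                    for p, t in zip(pred.center(), target.center()))
    diag_sq = sum((max(ph, th) - min(pl, tl)) ** 2
                  for pl, ph, tl, th in zip(p_lo, p_hi, t_lo, t_hi))
    center_term = center_sq / diag_sq

    # Scale term: mean relative size mismatch per axis, in [0, 1).
    scale_term = sum(abs(p - t) / max(p, t)
                     for p, t in zip(pred.size(), target.size())) / 3.0

    overlap_term = 1.0 - iou_3d(pred, target)
    return overlap_term + center_weight * center_term + scale_weight * scale_term
```

Under these assumptions the loss is zero only when the predicted and target boxes coincide. Unlike a plain IoU loss, it still varies with the prediction when the boxes do not overlap, because the centre and scale terms remain non-zero.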