Autonomous driving paper index

End-to-End LiDAR-Camera Calibration via Multi-Modal Correspondences Estimation and Explicit BEV Alignment

2026-07-01 · International Journal of Computer Vision

autonomous driving · bev · end-to-end · lidar · nuscenes · kitti · prediction

One-line summary

In this work, we present a Bird’s Eye View (BEV) Alignment approach for the LiDAR-Camera calibration task.

Engineering notes

Our method significantly outperforms previous point-to-pixel matching methods, achieving state-of-the-art calibration accuracy. On the KITTI and nuScenes benchmarks, our method reduces the Relative Rotation Error (RRE) by 74% and 79%, and the Relative Translation Error (RTE) by 90% and 95%, respectively, compared to previous methods.
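The two metrics quoted above can be made concrete with a small sketch. It assumes one common convention: RRE as the geodesic angle between the ground-truth and predicted rotations, and RTE as the Euclidean distance between the translations. The abstract does not state the exact definitions, so the paper's formulas may differ (some works sum per-axis Euler-angle errors instead). The function names and the `rot_z` helper are illustrative only.

```python
import math

def rot_z(deg):
    """3x3 rotation about the z-axis by `deg` degrees (hypothetical helper for the demo)."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def relative_rotation_error_deg(R_gt, R_pred):
    """Geodesic angle, in degrees, of the residual rotation R_gt^T @ R_pred.

    Assumed definition; the paper may use a different one.
    """
    # trace(R_gt^T @ R_pred) equals the element-wise product summed over all entries.
    trace = sum(R_gt[i][j] * R_pred[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def relative_translation_error(t_gt, t_pred):
    """Euclidean distance between translations, in the units of the inputs."""
    return math.dist(t_gt, t_pred)

if __name__ == "__main__":
    # Toy extrinsics: prediction is off by 2 degrees of yaw and 5 cm in the xy-plane.
    print(relative_rotation_error_deg(rot_z(10.0), rot_z(12.0)))
    print(relative_translation_error((0.0, 0.0, 0.0), (0.03, 0.04, 0.0)))
```

Under this convention, a percentage reduction in RRE or RTE compares the metric averaged over a benchmark's test samples against the same average for a baseline method.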

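Chinese explanation

Chinese explanation pending. This site prioritizes adding Chinese-language notes for high-value papers on end-to-end autonomous driving, BEV perception, 3D object detection, trajectory prediction, path planning, and LiDAR perception.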

Original abstract

In this work, we present a Bird’s Eye View (BEV) Alignment approach for the LiDAR-Camera calibration task. Building upon previous BEV-based work, we extract sensor-wise BEV features from each input modality using domain-specific architectures. Then, we employ a CNN-based encoder to align the two BEVs and estimate the calibration matrix. However, corresponding 2D and 3D features may be spatially distant in BEV space, and as a consequence the encoder alone might struggle to learn the height dimension and estimate the correct registration matrix. To address this, we introduce an implicit alignment step to cross-attend the downsampled 3D features with those from RGB for computing point-to-pixel correspondences and estimating a coarse calibration matrix. To improve the implicit alignment, we also enforce the prediction of correct point-to-pixel correspondences by direct supervision of the similarity matrix computed into the cross attention module. Then, the coarsely aligned 3D features and the RGB features are fed to the BEV Alignment step, in which the CNN-based encoder refines the coarse estimate into a final, more accurate calibration matrix. Notably, both the steps are optimized in an end-to-end fashion. Our method significantly outperforms previous point-to-pixel matching methods, achieving state-of-the-art calibration accuracy. On the KITTI and nuScenes benchmarks, our method reduces the Relative Rotation Error (RRE) by 74% and 79%, and the Relative Translation Error (RTE) by 90% and 95%, respectively, compared to previous methods.
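The supervised similarity matrix in the implicit alignment step can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not give the attention formulation or the loss, so this assumes scaled dot-product similarity with a row-wise softmax over pixels, and a cross-entropy loss against known point-to-pixel indices. The names `similarity_matrix` and `correspondence_loss`, and the tiny hand-written features, are hypothetical.

```python
import math

def similarity_matrix(point_feats, pixel_feats):
    """Row-wise softmax of scaled dot products.

    S[i][j] is read as the probability that 3D point i corresponds to pixel j.
    Assumed formulation; the paper's cross-attention module may differ.
    """
    d = len(point_feats[0])
    S = []
    for p in point_feats:
        logits = [sum(a * b for a, b in zip(p, q)) / math.sqrt(d) for q in pixel_feats]
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        S.append([e / z for e in exps])
    return S

def correspondence_loss(S, gt_pixel_idx):
    """Mean cross-entropy between each row of S and its ground-truth pixel index."""
    eps = 1e-12  # avoids log(0)
    total = sum(math.log(S[i][j] + eps) for i, j in enumerate(gt_pixel_idx))
    return -total / len(gt_pixel_idx)

if __name__ == "__main__":
    # Two downsampled 3D point features and three RGB pixel features (toy values).
    points = [[4.0, 0.0], [0.0, 4.0]]
    pixels = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    S = similarity_matrix(points, pixels)
    print(S)
    print(correspondence_loss(S, [0, 1]))  # point 0 -> pixel 0, point 1 -> pixel 1
```

In a full pipeline, the predicted correspondences would feed a pose solver or regression head to produce the coarse calibration matrix, which the BEV alignment stage then refines. That part is omitted here because the abstract does not describe it in enough detail to sketch faithfully.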

Engineering value: 6.0
Research novelty: 8.0
Business relevance: 5.0

Links and sources

