Autonomous driving paper index

Long-Term Temporal Hierarchical Fusion Bird’s-Eye View Perception Method Based on Multiple Position Encodings

2025-12-31 · SAE technical paper series

autonomous driving · BEV · 3D detection · nuScenes · perception

One-line summary

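A Transformer-based, camera-only BEV 3D detection method that combines multi-perspective position encoding, cyclic interaction attention, and a long-term temporal fusion framework with a cross-time guidance module, evaluated on nuScenes.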

Engineering notes

To handle target occlusion in dynamic scenes, the method adds a long-term temporal perception framework that fuses multi-frame information, plus a cross-time guidance module that injects historical geometric constraints to make target localization more robust. The abstract reports that experiments on the nuScenes dataset show gains in both spatial perception accuracy and temporal modeling capability, but it does not state specific metrics.
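Multi-frame fusion of this kind generally requires historical observations to be expressed in the current ego frame before they can act as geometric constraints. The sketch below shows only that generic alignment step on the ground plane. It is not the paper's cross-time guidance module, and the function names `se2_pose` and `to_current_frame` are hypothetical, as is the assumption that ego poses are available as (x, y, yaw) in a shared world frame.

```python
import numpy as np

def se2_pose(x, y, yaw):
    """Return a 3x3 homogeneous ego-to-world transform on the ground plane."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, x],
                     [s,  c, y],
                     [0.0, 0.0, 1.0]])

def to_current_frame(points_prev, pose_prev, pose_curr):
    """Map (N, 2) points from a past ego frame into the current ego frame.

    Assumption: both poses are ego-to-world transforms in the same world frame.
    """
    # Past ego frame -> world -> current ego frame.
    prev_to_curr = np.linalg.inv(pose_curr) @ pose_prev
    homog = np.hstack([points_prev, np.ones((len(points_prev), 1))])
    return (homog @ prev_to_curr.T)[:, :2]

# Example: the ego vehicle has driven 2 m forward since the past frame.
past_points = np.array([[10.0, 0.0]])
aligned = to_current_frame(past_points, se2_pose(0, 0, 0), se2_pose(2, 0, 0))
```

In this example a point that was 10 m ahead of the vehicle in the past frame ends up 8 m ahead in the current frame. How the aligned history is then fused with current features is specific to the paper and is not covered here.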

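Chinese explanation

Chinese explanation pending: this site prioritizes adding Chinese-language explanations for high-value papers on end-to-end autonomous driving, BEV perception, 3D object detection, trajectory prediction, path planning, and LiDAR perception.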

Original abstract

With the rapid development of autonomous driving technology, environmental perception, as its core module, has attracted much attention. Within this field, camera-only bird's-eye-view (BEV) 3D detection has become a research hotspot owing to its high spatial resolution and strong semantic recognition ability in specific scenarios. Existing methods mainly use a Transformer encoder to perform position encoding in the BEV domain and thereby achieve the 3D perspective transformation, but they often fail to fully exploit multi-perspective image information. To address this, the paper proposes an improved Transformer-based visual BEV vehicle perception method that enhances perception performance by deeply fusing BEV-domain and image-domain information. It designs a multi-perspective position encoding mechanism that decouples camera parameters to learn the mapping from images to 3D space more efficiently, and it introduces a cyclic interaction attention mechanism that strengthens the fine-grained association and fusion of pixel-level features, improving feature discriminability. To handle challenges such as target occlusion in dynamic scenes, the method further proposes a long-term temporal perception framework that fuses multi-frame temporal information, together with a cross-time guidance module that improves the robustness of target localization by injecting historical geometric constraints. Experiments on the nuScenes dataset verify the method's effectiveness: the results show strong performance in both spatial perception accuracy and temporal modeling capability, offering a practical solution for autonomous driving environmental perception.
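The abstract does not specify how the multi-perspective position encoding decouples camera parameters. As a rough illustration of the geometry that camera-aware position encodings commonly build on, the sketch below computes a unit viewing ray for every pixel, applying the intrinsics and the extrinsics as two separate steps. The function name `pixel_rays`, the matrix `K`, and the `cam_to_ego` convention are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def pixel_rays(K, cam_to_ego, height, width):
    """Return unit ray directions in the ego frame for every pixel, shape (H, W, 3).

    Assumptions: K is a 3x3 pinhole intrinsics matrix and cam_to_ego is a
    4x4 camera-to-ego transform; both names are hypothetical.
    """
    # Pixel centers in homogeneous image coordinates (u, v, 1).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    # Intrinsics step: back-project pixels to camera-frame directions.
    dirs_cam = pix @ np.linalg.inv(K).T
    # Extrinsics step: rotate into the ego frame (translation does not change directions).
    dirs_ego = dirs_cam @ cam_to_ego[:3, :3].T
    return dirs_ego / np.linalg.norm(dirs_ego, axis=-1, keepdims=True)

# Example: a front camera whose optical axis (camera z) points along ego x.
K = np.array([[100.0, 0.0, 50.5],
              [0.0, 100.0, 50.5],
              [0.0, 0.0, 1.0]])
cam_to_ego = np.eye(4)
cam_to_ego[:3, :3] = np.array([[0.0, 0.0, 1.0],
                               [-1.0, 0.0, 0.0],
                               [0.0, -1.0, 0.0]])
rays = pixel_rays(K, cam_to_ego, 100, 100)
```

Keeping the two steps separate means the per-pixel directions in the camera frame depend only on the intrinsics, while the camera mounting enters through a single rotation. A learned encoding would typically pass such rays through a small network, which is outside the scope of this sketch.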

Engineering value: 5.5

Research novelty: 7.0

Business relevance: 5.0

Links and sources

Official / arXiv page
