Autonomous driving paper index

SF3D-MDA: A Road-User Behavior Annotation Framework and Recognition Method Based on Vehicle-Mounted Visual Information

2026-05-15 · IEEE Sensors Journal

autonomous driving system · autonomous driving · bev · waymo · perception

One-line summary

The paper proposes SF3D-MDA, a SlowFast-based 3-D bidirectional feature pyramid network (FPN) with multidimensional attention mechanisms.

Engineering notes

On the ROAD-Waymo dataset, the authors report video-level v-mAP@0.2 gains of 7.4% (agent), 1.8% (action), and 2.7% (location). The code is available at https://github.com/Hilbert-space007/SF3D-MDA
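For readers unfamiliar with the metric, v-mAP@0.2 counts a predicted tube (one road user's boxes tracked across frames) as correct when its spatiotemporal overlap with a ground-truth tube reaches 0.2. The sketch below assumes the convention common in tube-based action detection benchmarks, where tube IoU is the temporal IoU multiplied by the mean box IoU over shared frames. The names `box_iou`, `tube_iou`, and `is_match` are hypothetical, and the repository's evaluation code may differ in detail.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
Tube = Dict[int, Box]                    # frame index -> box for one road user


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def tube_iou(pred: Tube, gt: Tube) -> float:
    """Assumed definition: temporal IoU times mean spatial IoU on shared frames."""
    shared = pred.keys() & gt.keys()
    if not shared:
        return 0.0
    temporal_iou = len(shared) / len(pred.keys() | gt.keys())
    spatial_iou = sum(box_iou(pred[f], gt[f]) for f in shared) / len(shared)
    return temporal_iou * spatial_iou


def is_match(pred: Tube, gt: Tube, threshold: float = 0.2) -> bool:
    """True when the predicted tube overlaps the ground truth enough to count."""
    return tube_iou(pred, gt) >= threshold
```

Under this definition the 0.2 threshold is lenient: a tube with perfect boxes that covers only a third of the combined time span still matches, whereas it would fail at a stricter threshold such as 0.5.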

Original abstract

Addressing the critical challenge of spatiotemporal semantic disjunction caused by conventional bird’s-eye view (BEV) trajectory modeling methods in ego-vehicle perspective road-user behavior recognition tasks, this article proposes a dedicated ego-vehicle-centric road-user behavior recognition methodology as a systematic solution. The methodology introduces the behavior–location joint annotation (BLJA) framework to address the inadequacy of the existing annotation framework in accommodating multimodal interaction dynamics and the scarcity of dedicated ego-vehicle perspective road-user behavior datasets. Furthermore, we construct our ROAD-Waymo-trans dataset using this framework, semantically enriched with fine-grained behavioral and positional labels to align with autonomous driving perception requirements. We propose a SlowFast-based 3-D bidirectional feature pyramid network (FPN) with multidimensional attention (SF3D-MDA) mechanisms. The model employs a dual-path 3-D bidirectional FPN (3-DBi-FPN): the slow pathway embeds channel–spatial attention (CSA) to capture pixel-level spatial details, while the fast pathway integrates channel–temporal attention (CTA) to model cross-frame spatiotemporal dependencies. Experimental results on the ROAD dataset show frame-level f-mAP@0.5 improvements of 1.7% (agent), 4% (action), and 3% (location). On the ROAD-Waymo dataset, video-level v-mAP@0.2 achieves gains of 7.4% (agent), 1.8% (action), and 2.7% (location). Ablation studies confirm that multidimensional attention mechanisms contribute 3.59%–4.37% performance boosts across tasks. Experimental results demonstrate that the improved model exhibits robustness in modeling multiagent interactions within complex traffic scenarios and produces accurate detection results. The proposed method successfully bridges the gap between perception and decision-making through context-aware annotations and spatiotemporally coherent architecture design. The results validate its utility for reliable autonomous driving systems in real-world environments. The executable codebase central to this research is available at the following repository: https://github.com/Hilbert-space007/SF3D-MDA
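The dual-path design follows the SlowFast idea: the slow pathway sees a few widely spaced frames (suited to spatial detail), while the fast pathway sees alpha times as many frames (suited to motion). The sketch below shows only that frame-sampling step. The function name `slowfast_indices` is hypothetical, and the default stride and alpha are illustrative values in the style of the original SlowFast work, not settings confirmed for SF3D-MDA.

```python
from typing import List, Tuple


def slowfast_indices(
    num_frames: int, slow_stride: int = 16, alpha: int = 8
) -> Tuple[List[int], List[int]]:
    """Return (slow, fast) frame indices for a raw clip of num_frames frames.

    The fast pathway samples alpha times more densely than the slow pathway.
    All parameter values here are assumptions for illustration.
    """
    if slow_stride % alpha != 0:
        raise ValueError("slow_stride must be divisible by alpha")
    fast_stride = slow_stride // alpha
    slow = list(range(0, num_frames, slow_stride))
    fast = list(range(0, num_frames, fast_stride))
    return slow, fast


slow, fast = slowfast_indices(64)
print(len(slow), len(fast))
```

With a 64-frame clip and these defaults, the slow pathway receives 4 frames and the fast pathway 32, so every slow frame is also a fast frame. In the paper's description, CSA would then operate on the slow features and CTA on the fast features.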

Engineering value: 7.5

Research novelty: 7.0

Business relevance: 6.5

Links and sources

Official / arXiv page
