Autonomous driving paper index

Sce2DriveX: A Generalized MLLM Framework for Scene-to-Drive Learning

2025-02-19 · IEEE Robotics and Automation Letters · arXiv: 2502.14917

End-to-End Autonomous Driving · BEV Perception · Path Planning · Autonomous Driving Simulation

end-to-end autonomous driving · autonomous driving · bev · end-to-end driving · end-to-end · motion planning · carla · large language model · perception · planning · control

One-line summary

Sce2DriveX is a multimodal large language model (MLLM) framework for human-like chain-of-thought (CoT) driving reasoning. It learns progressively, from multi-view scene understanding through behavior analysis and motion planning to vehicle control.

Engineering notes

Extensive experiments demonstrate that Sce2DriveX achieves state-of-the-art performance across tasks from scene understanding to end-to-end driving, as well as robust generalization in handling diverse driving scenes on the CARLA Bench2Drive benchmark.

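Chinese explanation

Chinese explanation pending. This site prioritizes adding Chinese explanations for high-value papers on end-to-end autonomous driving, BEV perception, 3D object detection, trajectory prediction, path planning, and LiDAR perception.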

Original abstract

End-to-end autonomous driving, which directly maps raw sensor inputs to low-level vehicle controls, is a crucial part of Embodied AI. Despite successes in applying Multimodal Large Language Models (MLLMs) for high-level traffic scene semantic understanding, it remains challenging to effectively translate these conceptual semantic understandings into low-level motion control commands and achieve cross-scene driving generalization and consensus. We propose Sce2DriveX, a human-like chain-of-thought (CoT) driving reasoning MLLM framework, designed to achieve progressive learning from multi-view scene understanding to behavior analysis, motion planning, and vehicle control driving process. Sce2DriveX utilizes multimodal joint learning of local scene videos and global Bird’s Eye View (BEV) maps to deeply understand long-range spatiotemporal relationships and road topology, enhancing its 3D dynamic/static scene perception and reasoning capabilities and achieving cross-scene generalization. Meanwhile, it reconstructs the implicit cognitive chain inherent in human driving, further enhancing the consensus between autonomous driving and human thought. To improve model performance, we construct the first comprehensive Visual Question Answering (VQA) driving instruction dataset, which is tailored for 3D spatial understanding and long-axis task reasoning, and introduce a task-oriented three-stage training pipeline to support supervised fine-tuning. Extensive experiments demonstrate that Sce2DriveX achieves state-of-the-art performance across tasks from scene understanding to end-to-end driving, as well as robust generalization in handling diverse driving scenes on the CARLA Bench2Drive benchmark.
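
The progressive chain described in the abstract (scene understanding, then behavior analysis, then motion planning, then vehicle control) can be pictured as an ordered sequence of reasoning steps. The sketch below is illustrative only: the class name CoTSample, the function build_cot_prompt, and the "Step N (...)" text layout are assumptions made for this example, not the paper's dataset format or code.

```python
from dataclasses import dataclass, field

# The four reasoning steps named in the abstract, in order.
STAGES = (
    "scene understanding",
    "behavior analysis",
    "motion planning",
    "vehicle control",
)


@dataclass
class CoTSample:
    """Hypothetical container for one chain-of-thought driving sample."""
    # Maps a stage name from STAGES to free-text answer for that stage.
    answers: dict = field(default_factory=dict)


def build_cot_prompt(sample: CoTSample) -> str:
    """Join the answered stages in order, stopping at the first missing stage.

    Stopping early keeps the chain progressive: a later step is never
    emitted without the earlier steps it is meant to build on.
    """
    lines = []
    for index, stage in enumerate(STAGES, start=1):
        if stage not in sample.answers:
            break
        lines.append(f"Step {index} ({stage}): {sample.answers[stage]}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up answers, for illustration only.
    demo = CoTSample(answers={
        "scene understanding": "Wet road, pedestrian near the crosswalk ahead.",
        "behavior analysis": "The ego vehicle should yield.",
    })
    print(build_cot_prompt(demo))
```

Under these assumptions, a sample that lacks an earlier stage produces a shorter chain rather than a chain with gaps, which mirrors the idea that each step depends on the one before it.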

Engineering value: 5.5

Research novelty: 8.5

Business relevance: 5.0
