Autonomous driving paper index

POST-TRAINING OF VISION-LANGUAGE AGENTS FOR DECENTRALIZED AUTONOMOUS VEHICLE COORDINATION USING GENERALIZABLE MULTI-AGENT REWARDS

2026-07-03 · Digital Repository at the University of Maryland (University of Maryland College Park)

autonomous driving, autonomous vehicle, trajectory prediction, reinforcement learning, perception, prediction

One-line summary

Decentralized coordination at unsignalized intersections remains a persistent failure mode for modern autonomous driving policies when vehicle-to-everything (V2X) communication is unavailable.

Engineering notes

We evaluate the proposed AR1+ELIGN post-training in a multi-agent simulation benchmark of symmetric four-way arrival scenarios in AlpaSim and compare against an ego-centric AR1 baseline as well as standard multi-agent reinforcement learning baselines (PPO and MAPPO). Quantitative benchmark results are pending completion of GRPO training; pre-training reward validation and closed-loop baseline evaluation confirm that the reward pipeline is stable and well-calibrated, with the social term contributing a mean penalty of approximately −0.02 per step under the deployed weighting while the trajectory-fidelity signal remains dominant.
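The abstract describes a composite reward that combines a trajectory-fidelity L2 term, a comfort score, and the ELIGN social penalty under a gate. The following is a minimal Python sketch of one possible reading of that design, not the thesis's implementation. The weights, the gate threshold, the hard on/off form of the gate, the unweighted mix of units in the ELIGN distance, and all function names are illustrative assumptions.

```python
import math
from dataclasses import dataclass
from typing import Sequence, Tuple

# A waypoint in the shared observation space: (x, y, heading psi, speed v).
Waypoint = Tuple[float, float, float, float]


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical values chosen for illustration only.
    w_traj: float = 1.0     # weight on the trajectory-fidelity L2 term
    w_comfort: float = 0.1  # weight on the comfort score
    w_social: float = 0.05  # weight on the ELIGN social penalty
    gate_l2: float = 2.0    # mean L2 error (m) above which the social term is dropped


def wrap_angle(angle: float) -> float:
    """Wrap an angle in radians to [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


def mean_l2_error(pred: Sequence[Waypoint], ref: Sequence[Waypoint]) -> float:
    """Mean Euclidean (x, y) distance between predicted and reference waypoints."""
    if not ref or len(pred) != len(ref):
        raise ValueError("pred and ref must be non-empty and of equal length")
    total = sum(math.hypot(p[0] - r[0], p[1] - r[1]) for p, r in zip(pred, ref))
    return total / len(ref)


def elign_penalty(expected: Waypoint, realized: Waypoint) -> float:
    """Non-positive penalty: negative distance between the waypoint a neighbor
    model expected and the waypoint actually realized. Mixing metres, radians,
    and m/s without scaling is a simplification made for this sketch."""
    dx = expected[0] - realized[0]
    dy = expected[1] - realized[1]
    dpsi = wrap_angle(expected[2] - realized[2])
    dv = expected[3] - realized[3]
    return -math.sqrt(dx * dx + dy * dy + dpsi * dpsi + dv * dv)


def composite_reward(
    pred: Sequence[Waypoint],
    ref: Sequence[Waypoint],
    comfort: float,
    expected_next: Waypoint,
    realized_next: Waypoint,
    w: RewardWeights = RewardWeights(),
) -> float:
    """Trajectory term plus comfort, with the social term added only when the
    trajectory error is below the gate (assumed gating rule)."""
    l2 = mean_l2_error(pred, ref)
    reward = -w.w_traj * l2 + w.w_comfort * comfort
    if l2 <= w.gate_l2:
        reward += w.w_social * elign_penalty(expected_next, realized_next)
    return reward
```

With a gate of this kind, a rollout whose trajectory error exceeds the threshold is scored on trajectory fidelity and comfort alone, so the social term cannot change the ranking of large trajectory failures. A small social weight relative to the trajectory weight is consistent with the source's remark that the trajectory-fidelity signal remains dominant.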

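Chinese explanation

Chinese explanation pending: this site will prioritize adding Chinese-language notes for high-value papers on end-to-end autonomous driving, BEV perception, 3D object detection, trajectory prediction, path planning, and LiDAR perception.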

Original abstract

Decentralized coordination at unsignalized intersections remains a persistent failure mode for modern autonomous driving policies when vehicle-to-everything (V2X) communication is unavailable. Policies trained primarily with ego-centric objectives (e.g., collision avoidance, comfort, and action consistency) can be overly conservative in symmetric interactions, leading to deadlocks, or can make conflicting commitments, leading to unsafe near-collisions. This thesis addresses this gap by introducing a social post-training method for Alpamayo-R1 (AR1) that explicitly rewards behavior that is predictable to neighboring agents. We extend AR1’s Group Relative Policy Optimization (GRPO) post-training by augmenting the reward with Expectation Alignment (ELIGN), an intrinsic social term that penalizes mismatch between a learned neighbor-expectation model and the realized shared next observation. To make ELIGN applicable to AR1’s continuous trajectory outputs, we define the shared observation space over low-dimensional kinematic waypoints (x, y, ψ, v) rather than high-dimensional perception features, and we learn a compact trajectory prediction model offline before fine-tuning. The composite reward combines a trajectory-fidelity L2 term, a comfort score, and the ELIGN social penalty under a gated formulation that prevents the social term from masking large trajectory failures. Post-training is implemented using the cosmos rl framework with ReasoningVLAGRPOTrainer and vLLM-accelerated rollout generation, representing the first application of GRPO to a production-scale VLA model in a neural-rendered closed-loop driving simulator. We evaluate the proposed AR1+ELIGN post-training in a multi-agent simulation benchmark of symmetric four-way arrival scenarios in AlpaSim and compare against an ego-centric AR1 baseline as well as standard multi-agent reinforcement learning baselines (PPO and MAPPO). Performance is measured by collision rate (as a hard safety constraint), deadlock rate, intersection clearance time, and jerk variance as an indicator of indecision. Finally, we study zero-shot social generalization by testing whether ELIGN-fine-tuned agents coordinate effectively with novel partner agents not encountered during training. Quantitative benchmark results are pending completion of GRPO training; pre-training reward validation and closed-loop baseline evaluation confirm that the reward pipeline is stable and well-calibrated, with the social term contributing a mean penalty of approximately −0.02 per step under the deployed weighting while the trajectory-fidelity signal remains dominant.
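The abstract uses jerk variance as an indicator of indecision. As a rough illustration, the Python sketch below estimates longitudinal jerk from a uniformly sampled speed trace by finite differences and returns its population variance. The function name, the use of speed alone (ignoring lateral motion), and the fixed sampling interval are assumptions made for this example; the source does not state how the metric is computed.

```python
from statistics import pvariance
from typing import Sequence


def jerk_variance(speeds: Sequence[float], dt: float) -> float:
    """Population variance of longitudinal jerk for a speed trace in m/s
    sampled every dt seconds. Jerk is approximated as the second finite
    difference of speed divided by dt squared."""
    if dt <= 0:
        raise ValueError("dt must be positive")
    if len(speeds) < 3:
        raise ValueError("need at least three speed samples to estimate jerk")
    # First difference: acceleration between consecutive samples.
    accel = [(b - a) / dt for a, b in zip(speeds, speeds[1:])]
    # Second difference: jerk between consecutive acceleration estimates.
    jerk = [(b - a) / dt for a, b in zip(accel, accel[1:])]
    return pvariance(jerk)


# A smooth, constant-acceleration approach has zero jerk variance, while a
# stop-and-go trace, as might occur when an agent hesitates, has a large one.
smooth = jerk_variance([0.0, 1.0, 2.0, 3.0, 4.0], dt=1.0)
hesitant = jerk_variance([0.0, 1.0, 0.0, 1.0, 0.0], dt=1.0)
```

Under this definition a higher value means the vehicle repeatedly reverses its acceleration, which matches the reading of the metric as a proxy for indecision at the intersection.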

Engineering value: 5.0
Research novelty: 8.0
Business relevance: 6.0
