Autonomous driving paper index
V2X-MAE: Decoder-free masked autoencoding with multi-view distillation for cooperative perception
One-line summary
V2X-MAE is a self-supervised pretraining framework for V2X cooperative perception that targets two problems: the heavy reliance on large-scale annotated 3D point cloud data and the geometric shortcut problem.
Engineering notes
Extensive experiments on three representative datasets (V2X-Real, V2V4Real, and OPV2V) demonstrate that our method achieves substantial improvements over baseline approaches in most settings, with particularly strong gains under high sensor heterogeneity and occlusion-heavy scenarios, while maintaining competitive performance in infrastructure-centric configurations where baselines already approach saturation.
Chinese explanation / 中文解读
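Chinese explanation pending: this site will prioritize adding Chinese explanations for high-value papers on end-to-end autonomous driving, BEV perception, 3D object detection, trajectory prediction, path planning, and LiDAR perception.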
Original abstract
Vehicle-to-Everything (V2X) cooperative perception has emerged as a transformative paradigm for autonomous driving by enabling connected vehicles and infrastructure to share sensing information, thereby extending perception range and mitigating occlusions. However, the development of effective cooperative perception models faces two critical challenges: the heavy reliance on large-scale annotated 3D point cloud data and the geometric shortcut problem, where models over-rely on superficial spatial cues rather than learning robust semantic representations. To address these challenges, we propose a self-supervised pretraining framework specifically designed for V2X cooperative perception. Our method leverages large-scale unlabelled multi-agent point cloud data through a novel multi-view generation strategy that creates global cooperative views, local single-agent views, and masked cooperative views with varying levels of spatial completeness. By employing a decoder-free architecture with teacher–student self-distillation, our framework explicitly mitigates the geometric shortcut problem prevalent in multi-agent scenarios while promoting cross-agent semantic consistency. The pretrained encoder integrates into existing cooperative perception pipelines with minimal overhead, requiring only a lightweight feature adaptation layer to connect with the downstream detection head. Extensive experiments on three representative datasets (V2X-Real, V2V4Real, and OPV2V) demonstrate that our method achieves substantial improvements over baseline approaches in most settings, with particularly strong gains under high sensor heterogeneity and occlusion-heavy scenarios, while maintaining competitive performance in infrastructure-centric configurations where baselines already approach saturation. Our method yields detection improvements of up to +3.8 mAP@0.5, with notable advantages in sensor-heterogeneous settings and data efficiency. Notably, our approach surpasses baseline performance using only 50% of training data and shows robust resilience in challenging scenarios involving occlusions.
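The abstract names three view types (global cooperative, local single-agent, masked cooperative) and a decoder-free teacher–student self-distillation objective, but gives no implementation details. The following is a minimal sketch of how those pieces might fit together, not the authors' method. Every name and default here (`make_views`, `distillation_loss`, `ema_update`, `mask_ratio`, `voxel`, the temperatures, the momentum) is hypothetical. Masking whole BEV voxels, using a DINO-style cross-entropy between teacher and student outputs, and updating the teacher by exponential moving average are assumptions swapped in for illustration.

```python
import math
import random


def make_views(agent_clouds, ego_id, mask_ratio=0.6, voxel=4.0, seed=0):
    """Build the three view types named in the abstract.

    agent_clouds maps an agent id to a list of (x, y, z) points, assumed to
    be already transformed into one shared coordinate frame.
    """
    rng = random.Random(seed)

    # Global cooperative view: the union of every agent's points.
    global_view = [p for cloud in agent_clouds.values() for p in cloud]

    # Local single-agent view: only the ego agent's own points.
    local_view = list(agent_clouds[ego_id])

    # Masked cooperative view (assumption: mask at BEV-voxel granularity).
    def voxel_of(p):
        return (math.floor(p[0] / voxel), math.floor(p[1] / voxel))

    occupied = sorted({voxel_of(p) for p in global_view})
    masked = set(rng.sample(occupied, int(len(occupied) * mask_ratio)))
    masked_view = [p for p in global_view if voxel_of(p) not in masked]

    return global_view, local_view, masked_view


def log_softmax(logits, temperature):
    """Numerically stable log-softmax of temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(s - m) for s in scaled))
    return [s - log_z for s in scaled]


def distillation_loss(student_logits, teacher_logits,
                      t_student=0.1, t_teacher=0.04):
    """Cross-entropy H(teacher, student); no reconstruction decoder needed."""
    log_p_teacher = log_softmax(teacher_logits, t_teacher)
    log_p_student = log_softmax(student_logits, t_student)
    return -sum(math.exp(lt) * ls
                for lt, ls in zip(log_p_teacher, log_p_student))


def ema_update(teacher_params, student_params, momentum=0.996):
    """Move each teacher parameter slowly toward the student parameter."""
    return [momentum * t + (1.0 - momentum) * s
            for t, s in zip(teacher_params, student_params)]
```

In a setup like this, the teacher would typically encode the global cooperative view while the student encodes the local and masked views, so the student has to match the teacher's output from spatially incomplete input. Which views feed which network in the actual paper is not stated in the abstract.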