Autonomous driving paper index

T2LDM++: A Self-Conditioned Representation Guided Diffusion Model for Realistic Text-to-LiDAR Scene Generation

2026-06-29 · arXiv (Cornell University)

BEV Perception LiDAR Perception Sensor Fusion

autonomous drivingbevlidarcontrol

One-line summary

In this paper, we rethink the limitations of Text-LiDAR generation task, focusing on alleviating insufficient training priors and constructing controllable Text-LiDAR data.

Engineering notes

Furthermore, we construct two high-quality Text-LiDAR benchmarks ($>$100K samples) using a generalized strategy of geometric annotations, along with a controllability metric. With effective prior modeling and high-quality Text-LiDAR benchmarks, T2LDM++ can generate realistic LiDAR scenes with rich geometric details in unconditional and conditional settings.

Chinese explanation / 中文解读

中文解读待补充：本站会优先为端到端自动驾驶、BEV感知、3D目标检测、轨迹预测、路径规划、LiDAR感知等高价值论文补充中文说明。

Original abstract

Recent progress in Text-to-Image generation benefits from large-scale Text-Image pairs. However, the scarcity of Text-LiDAR pairs often causes over-smoothed scenes and limited controllability. In this paper, we rethink the limitations of Text-LiDAR generation task, focusing on alleviating insufficient training priors and constructing controllable Text-LiDAR data. We propose a \textbf{T}ext-\textbf{to}-\textbf{L}iDAR \textbf{D}iffusion \textbf{M}odel for LiDAR scene generation, T2LDM++, with a Self-Conditioned Representation Guidance (SCRG). Specifically, to alleviate object over-smoothing, SCRG employs a Guidance Network (GN) to provide reconstruction-based soft supervision to the Denoising Network (DN). This enables DN to learn geometry-aware representations through reconstruction guidance, leading to more accurate denoising in DDPMs. Meanwhile, through analysis and design, SCRG exhibits more effective and lightweight, while decoupled in inference, avoiding computational overhead. Furthermore, we construct two high-quality Text-LiDAR benchmarks ($>$100K samples) using a generalized strategy of geometric annotations, along with a controllability metric. Moreover, a directional position prior is designed to mitigate street distortion, further improving scene fidelity. Additionally, T2LDM++ supports multiple conditions, including (Semantic, Box, BEV, Camera)-to-LiDAR, Sparse-to-Dense, and Dense-to-Sparse generation, by learning a control encoder via frozen DN. With effective prior modeling and high-quality Text-LiDAR benchmarks, T2LDM++ can generate realistic LiDAR scenes with rich geometric details in unconditional and conditional settings.

5.0Engineering value

7.0Research novelty

5.0Business relevance

Links and sources

PDF from original source

Need this topic turned into a technical roadmap?

Full Self Driving can prepare a custom autonomous driving literature review, code map, dataset map, and B2B technology assessment.

Request B2B research

Comments

No comments yet. Be the first to share your thoughts on this paper.