Autonomous driving paper index
Q-GeoMem: Question-Guided Geometric Memory for Video Spatial Reasoning
One-line summary
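Q-GeoMem is a question-guided geometric memory framework for video spatial reasoning. It pairs a bank of recent dense features with a compact long-range evidence bank, and keeps only frames that are both relevant to the question and novel relative to what is already stored.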
Engineering notes
Engineering notes will be added by the Full Self Driving editorial team.
Chinese explanation / 中文解读
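Chinese explanation pending: this site will prioritize adding Chinese explanations for high-value papers on end-to-end autonomous driving, BEV perception, 3D object detection, trajectory prediction, path planning, LiDAR perception, and related topics.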
Original abstract
Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining information useful to the question being asked. Existing spatial video-language models improve geometric perception and long-range context modeling, but often treat memory as a generic temporal cache, which can introduce redundant or irrelevant geometry and weaken long-horizon reasoning. We propose Q-GeoMem, a question-guided geometric memory framework for video spatial reasoning. Q-GeoMem injects camera-conditioned geometry into visual tokens and maintains two complementary memories: a Fine-Grained Context Bank for recent dense features and camera states, and a Semantic-Geometric Evidence Bank for compact long-range evidence. Each candidate frame is scored by the product of Q-Former-based question relevance and novelty with respect to the retained bank; this score is stored and reused during reading, while a capacity-based replacement rule keeps the bank compact. During reasoning, both memories are read before update and adaptively fused with the current frame representation. Experiments on VSI-Bench and VSTI-Bench show that Q-GeoMem achieves state-of-the-art performance among evaluated spatial reasoning models, validating the effectiveness of question-guided geometric memory. Ablations further verify the contribution of the proposed evidence scoring mechanism.
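The evidence scoring described in the abstract (relevance times novelty, a stored score reused at read time, and a capacity-based replacement rule) can be illustrated with a minimal sketch. This is not the authors' implementation. The class name EvidenceBank and its methods are hypothetical, cosine similarity stands in for the Q-Former-based relevance, novelty is assumed to be one minus the maximum similarity to retained entries, and the replacement rule is assumed to evict the lowest-scoring entry when a better candidate arrives.

```python
import math
from dataclasses import dataclass, field


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


@dataclass
class EvidenceBank:
    """Hypothetical compact long-range memory holding (feature, score) pairs."""
    capacity: int
    entries: list = field(default_factory=list)

    def novelty(self, feat):
        # Assumption: novelty = 1 - max similarity to anything already retained.
        if not self.entries:
            return 1.0
        return 1.0 - max(0.0, max(cosine(feat, f) for f, _ in self.entries))

    def score(self, feat, question):
        # Stand-in for Q-Former relevance: cosine similarity clipped at zero.
        relevance = max(0.0, cosine(feat, question))
        return relevance * self.novelty(feat)

    def update(self, feat, question):
        s = self.score(feat, question)
        if len(self.entries) < self.capacity:
            self.entries.append((feat, s))
        else:
            # Assumed replacement rule: evict the weakest entry if the
            # candidate scores strictly higher.
            i = min(range(len(self.entries)), key=lambda k: self.entries[k][1])
            if s > self.entries[i][1]:
                self.entries[i] = (feat, s)
        return s

    def read(self):
        # Reuse the stored scores as read weights (score-weighted mean).
        total = sum(s for _, s in self.entries)
        if total <= 0:
            return None
        dim = len(self.entries[0][0])
        return [sum(f[d] * s for f, s in self.entries) / total for d in range(dim)]


question = [1.0, 0.0]  # toy question embedding
bank = EvidenceBank(capacity=2)
for frame in ([1.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]):
    context = bank.read()  # read before update, as the abstract describes
    bank.update(frame, question)
```

In this toy run the repeated frame gets a score of zero because it adds nothing new, so it is the first entry replaced once the bank is full. The last frame is irrelevant to the question and is never admitted.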