Autonomous driving paper index
MonoVAN: Visual Attention for Self-Supervised Monocular Depth Estimation
One-line summary
In this paper, we propose a novel fully convolutional network for monocular depth estimation, called MonoVAN, which incorporates a visual attention mechanism and applies super-resolution techniques in the decoder to better capture fine-grained details in depth maps.
Engineering notes
Our experiments on the outdoor KITTI benchmark and the indoor NYUv2 dataset show that our approach outperforms the most advanced self-supervised methods, including state-of-the-art models such as the transformer-based VTDepth from ISMAR’22 and the hybrid convolutional-transformer MonoFormer from AAAI’23, while using a comparable or even smaller number of parameters than these competitors. Code and weights are available at https://github.com/IlyaInd/MonoVAN.
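As an illustration of how a comparison like the one above is usually scored, here is a minimal sketch of common depth-estimation error metrics. The abstract does not list the metrics or the evaluation protocol, so the choice of measures (Abs Rel, Sq Rel, RMSE, RMSE log, and the δ < 1.25 accuracy), the 80 m depth cap, the median scaling step, and the function name `depth_metrics` are all assumptions made for this example, not details taken from the MonoVAN code.

```python
import math
from statistics import median


def depth_metrics(pred, gt, min_depth=1e-3, max_depth=80.0, median_scale=True):
    """Score predicted depths against ground truth (flat sequences, in metres).

    Hypothetical helper for illustration only. Pixels whose ground truth
    lies outside (min_depth, max_depth) are ignored; the 80 m default is an
    assumed outdoor cap, not a value stated in the source.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if min_depth < g < max_depth]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")

    if median_scale:
        # Monocular self-supervised models recover depth only up to an
        # unknown scale, so predictions are often rescaled by the ratio of
        # medians before scoring (assumed here, not confirmed by the source).
        ratio = median(g for _, g in pairs) / median(p for p, _ in pairs)
        pairs = [(p * ratio, g) for p, g in pairs]

    # Clamp predictions to the valid depth range before computing errors.
    pairs = [(min(max(p, min_depth), max_depth), g) for p, g in pairs]
    n = len(pairs)

    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )
    # Fraction of pixels whose prediction is within a factor 1.25 of truth.
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in pairs) / n

    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
        "delta1": delta1,
    }
```

With median scaling enabled, a prediction that is correct up to a constant factor scores zero error, which is why scale-ambiguous monocular methods are often compared this way. Passing `median_scale=False` scores the raw predictions instead.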
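Chinese explanation
Chinese explanation pending: this site prioritizes adding Chinese-language notes for high-value papers on end-to-end autonomous driving, BEV perception, 3D object detection, trajectory prediction, path planning, and LiDAR perception.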
Original abstract
Depth estimation is crucial in various computer vision applications, including autonomous driving, robotics, and virtual and augmented reality. An accurate scene depth map is beneficial for localization, spatial registration, and tracking. It converts 2D images into precise 3D coordinates for accurate positioning, seamlessly aligns virtual and real objects in applications like AR, and enhances object tracking by distinguishing distances. The self-supervised monocular approach is particularly promising, as it eliminates the need for complex and expensive data acquisition setups and relies solely on a standard RGB camera. Recently, transformer-based architectures have become popular for this problem, but to reach high quality they incur high computational cost, and they perceive small details poorly because they focus more on global information. In this paper, we propose a novel fully convolutional network for monocular depth estimation, called MonoVAN, which incorporates a visual attention mechanism and applies super-resolution techniques in the decoder to better capture fine-grained details in depth maps. To the best of our knowledge, this work pioneers the use of convolutional visual attention in the context of depth estimation. Our experiments on the outdoor KITTI benchmark and the indoor NYUv2 dataset show that our approach outperforms the most advanced self-supervised methods, including state-of-the-art models such as the transformer-based VTDepth from ISMAR’22 and the hybrid convolutional-transformer MonoFormer from AAAI’23, while using a comparable or even smaller number of parameters than these competitors. We also validate the impact of each proposed improvement in isolation, providing evidence of its significant contribution. Code and weights are available at https://github.com/IlyaInd/MonoVAN.