Autonomous driving paper index
Domain-robust vision transformer with hierarchical swin encoding for explainable low-latency driver drowsiness detection
One-line summary
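Light-VTD, an Enhanced Multi-path Token-Fusion Vision Transformer (ViT), is a lightweight architecture for accurate and explainable fatigue detection from facial images. It targets three weaknesses of traditional fatigue detection systems: poor generalization across conditions, lack of model transparency, and inability to run on low-power platforms.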
Engineering notes
Light-VTD reports an intra-dataset accuracy of 99.2% and at least 93% in cross-dataset transfer, outperforming current ViT baselines by 2-4%. Reported inference latency is under 120 ms on a Raspberry Pi 4.
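The two accuracy figures measure different things, and a small sketch may help keep them apart. The code below is an illustration only: it assumes "at least 93%" is the worst case over (train set, test set) pairs where the two sets differ, which the abstract does not spell out. The function names `accuracy` and `summarize_transfer`, the dataset names "A" and "B", and the toy labels are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("labels and predictions must be non-empty and equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def summarize_transfer(results):
    """Summarize intra- and cross-dataset accuracy.

    results maps (train_set, test_set) -> (y_true, y_pred).
    A pair counts as intra-dataset when train_set == test_set,
    and as cross-dataset otherwise (an assumption of this sketch).
    """
    intra, cross = [], []
    for (train_set, test_set), (y_true, y_pred) in results.items():
        acc = accuracy(y_true, y_pred)
        (intra if train_set == test_set else cross).append(acc)
    return {
        "intra_mean": mean(intra) if intra else None,
        "cross_min": min(cross) if cross else None,  # the "at least" figure
    }


# Toy, made-up labels (1 = drowsy, 0 = alert); not results from the paper.
toy_results = {
    ("A", "A"): ([1, 0, 1, 0], [1, 0, 1, 0]),
    ("A", "B"): ([1, 0, 1, 0], [1, 0, 0, 0]),
    ("B", "A"): ([1, 1, 0, 0], [1, 1, 0, 0]),
}
summary = summarize_transfer(toy_results)
print(summary)
```

Reporting the minimum rather than the mean over cross-dataset pairs is the more conservative choice, since one weak transfer direction cannot hide behind the others.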
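Chinese explanation
Chinese explanation pending: this site prioritizes adding Chinese-language notes for high-value papers on end-to-end autonomous driving, BEV perception, 3D object detection, trajectory prediction, path planning, and LiDAR perception.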
Original abstract
Driver drowsiness is a significant cause of road accidents worldwide, often leading to fatalities due to delayed reaction times and loss of vehicle control. Traditional fatigue detection systems face several limitations, including poor generalization across various conditions, a lack of model transparency, and an inability to operate on low-power platforms. To address these issues, we propose Enhanced Multi-path Token-Fusion Vision Transformer (ViT), Light-VTD, a novel lightweight architecture designed for accurate and explainable fatigue detection from facial images. Unlike existing transformer-based methods, Light-VTD features a fused-MBConv block for efficient local feature extraction, a position-aware token mixer for capturing global context, and a multi-path token fusion mechanism that enhances spatial consistency across different resolutions. Furthermore, we adapt Gradient-weighted Class Activation Mapping (Grad-CAM) to generate interpretable attention heatmaps that align with clinically relevant fatigue indicators. Our experimental models are trained on a large-scale dataset comprising over 167,000 labeled images, compiled from four diverse public datasets: MRL Eye, nthuDDD2, and UTA-RLDD. The Light-VTD model achieves an intra-dataset accuracy of 99.2% and at least 93% in cross-dataset transfer, outperforming current ViT baselines by 2-4%. It supports inference with <120 ms latency on low-power devices (Raspberry Pi 4). Moreover, we have integrated the model within a web-based application that provides live predictions and visual explanations for fatigue monitoring. This study contributes a robust, interpretable, and deployable solution with potential applications in fleet-wide Advanced Driver Assistance Systems (ADAS), workplace safety platforms, and portable telehealth diagnostics.
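The abstract says Grad-CAM is adapted to produce heatmaps but does not describe the adaptation. As a rough reference, the standard Grad-CAM computation can be sketched in plain Python as follows: average each channel's gradient to get a channel weight, take the weighted sum of the activation maps, apply ReLU, and normalize. This is the textbook formulation, not the authors' variant; the function name `grad_cam` and the toy inputs are hypothetical, and a real pipeline would obtain activations and gradients from a deep-learning framework.

```python
def grad_cam(activations, gradients):
    """Standard Grad-CAM heatmap from one layer's feature maps.

    activations: list of K feature maps, each an H x W nested list.
    gradients:   same shape, d(class score) / d(activation).
    Returns an H x W heatmap scaled to [0, 1].
    """
    k = len(activations)
    h, w = len(activations[0]), len(activations[0][0])

    # Channel weights: global average pooling of the gradients.
    alphas = [sum(sum(row) for row in g) / (h * w) for g in gradients]

    # Weighted sum over channels, then ReLU to keep positive evidence only.
    cam = [
        [max(0.0, sum(alphas[c] * activations[c][i][j] for c in range(k)))
         for j in range(w)]
        for i in range(h)
    ]

    # Normalize by the peak so the map can be overlaid on the input image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy 2-channel, 2x2 example (made-up numbers, not model outputs).
toy_activations = [
    [[1.0, 0.0], [0.0, 2.0]],
    [[0.0, 3.0], [1.0, 0.0]],
]
toy_gradients = [
    [[1.0, 1.0], [1.0, 1.0]],      # channel 0 supports the class
    [[-1.0, -1.0], [-1.0, -1.0]],  # channel 1 argues against it
]
heatmap = grad_cam(toy_activations, toy_gradients)
print(heatmap)
```

In the toy case only the locations where the supporting channel fires survive the ReLU, which is the behavior that lets a heatmap highlight regions such as the eyes or mouth when those drive the prediction.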