Autonomous driving paper index

Domain-robust vision transformer with hierarchical swin encoding for explainable low-latency driver drowsiness detection

2026-06-19 · Scientific Reports

autonomous driving · vision transformer · adas · prediction · control

One-line summary

Light-VTD, an Enhanced Multi-path Token-Fusion Vision Transformer (ViT), is a lightweight architecture designed for accurate and explainable driver fatigue detection from facial images.

Engineering notes

The Light-VTD model achieves an intra-dataset accuracy of 99.2% and at least 93% in cross-dataset transfer, outperforming current ViT baselines by 2-4%.

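Chinese explanation

Chinese explanation pending: this site prioritizes adding Chinese-language notes for high-value papers on end-to-end autonomous driving, BEV perception, 3D object detection, trajectory prediction, path planning, and LiDAR perception.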

Original abstract

Driver drowsiness is a significant cause of road accidents worldwide, often leading to fatalities due to delayed reaction times and loss of vehicle control. Traditional fatigue detection systems face several limitations, including poor generalization across various conditions, a lack of model transparency, and an inability to operate on low-power platforms. To address these issues, we propose Enhanced Multi-path Token-Fusion Vision Transformer (ViT), Light-VTD, a novel lightweight architecture designed for accurate and explainable fatigue detection from facial images. Unlike existing transformer-based methods, Light-VTD features a fused-MBConv block for efficient local feature extraction, a position-aware token mixer for capturing global context, and a multi-path token fusion mechanism that enhances spatial consistency across different resolutions. Furthermore, we adapt Gradient-weighted Class Activation Mapping (Grad-CAM) to generate interpretable attention heatmaps that align with clinically relevant fatigue indicators. Our experimental models are trained on a large-scale dataset comprising over 167,000 labeled images, compiled from four diverse public datasets: MRL Eye, nthuDDD2, and UTA-RLDD. The Light-VTD model achieves an intra-dataset accuracy of 99.2% and at least 93% in cross-dataset transfer, outperforming current ViT baselines by 2-4%. It supports inference with <120 ms latency on low-power devices (Raspberry Pi 4). Moreover, we have integrated the model within a web-based application that provides live predictions and visual explanations for fatigue monitoring. This study contributes a robust, interpretable, and deployable solution with potential applications in fleet-wide Advanced Driver Assistance Systems (ADAS), workplace safety platforms, and portable telehealth diagnostics.
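The abstract mentions adapting Grad-CAM to produce heatmaps. As a rough illustration only, the sketch below shows the standard Grad-CAM computation on a stack of feature maps: each channel is weighted by the spatial mean of its gradient, the weighted maps are summed, and a ReLU keeps the positive evidence. It is not the paper's adapted variant, which the abstract does not describe. The function name `grad_cam` and the toy inputs are hypothetical, and in practice the activations and gradients would come from a framework's forward and backward passes.

```python
def grad_cam(activations, gradients):
    """Standard Grad-CAM on K feature maps (illustrative sketch).

    activations: list of K maps, each an H x W nested list (layer output).
    gradients:   same shape, d(class score) / d(activation).
    Returns an H x W heatmap scaled to [0, 1].
    """
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weights: global average of each channel's gradient map.
    weights = [
        sum(sum(row) for row in grad_map) / (height * width)
        for grad_map in gradients
    ]

    # Weighted sum of the activation maps across channels.
    cam = [[0.0] * width for _ in range(height)]
    for alpha, act_map in zip(weights, activations):
        for i in range(height):
            for j in range(width):
                cam[i][j] += alpha * act_map[i][j]

    # ReLU: keep only regions that push the class score up.
    cam = [[max(0.0, value) for value in row] for row in cam]

    # Scale to [0, 1] for display; skip if the map is all zeros.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy inputs (made up): two 2x2 channels with opposite-sign gradients.
toy_activations = [[[1, 2], [3, 4]], [[4, 3], [2, 1]]]
toy_gradients = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
heatmap = grad_cam(toy_activations, toy_gradients)
```

In a real pipeline the heatmap would be upsampled to the input resolution and overlaid on the face image. Which layer the maps are taken from is a design choice that the abstract does not specify.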

5.0 Engineering value

8.0 Research novelty

5.5 Business relevance

Links and sources

PDF from original source
