MAE- and DINOv2-Powered DETR++: A Hybrid, Transformer-Based Self-Supervised Framework for Accurate Object Detection
Received: 24 June 2025 | Revised: 1 August 2025 | Accepted: 14 August 2025 | Online: 5 November 2025
Corresponding author: Yogesh H. Bhosale
Abstract
Object detection remains a cornerstone task in computer vision, with wide-ranging applications in autonomous driving, surveillance, and medical imaging. However, traditional methods rely heavily on large annotated datasets, limiting their adaptability in low-resource environments. We propose Hybrid Masked Autoencoder-Detection Transformer++ (HybridMAE-DETR++), a novel self-supervised object detection framework that combines Masked Autoencoders (MAEs) and DINOv2 for Vision Transformer (ViT) pretraining. Integrated with a Swin-ViT hybrid backbone and an enhanced DETR++ detection head, the framework significantly reduces dependence on annotated data while improving detection accuracy for small and occluded objects. Evaluated on COCO 2017 and Cityscapes, HybridMAE-DETR++ achieves 47.5% mean Average Precision (mAP) and 68.0% Intersection over Union (IoU) on COCO, and 53.2% mAP and 72.1% IoU on Cityscapes, outperforming DETR and other transformer-based baselines. Ablation and sensitivity analyses confirm the robustness of the hybrid pretraining strategy, and visualizations using Layer-weighted Class Activation Mapping (LayerCAM) and Gradient-weighted Class Activation Mapping++ (Grad-CAM++) validate model interpretability. Despite a moderate increase in training time, the precision gains justify the computational cost. This framework sets a new benchmark for label-efficient, interpretable object detection in real-world scenarios.
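The hybrid pretraining described above pairs MAE-style masked reconstruction with DINOv2 self-distillation. As a minimal illustration of the MAE half only, the random patch-masking step can be sketched in plain Python; the 75% mask ratio and the 14×14 patch grid for a 224×224 image follow the original MAE setup, not details taken from this paper:

```python
import random

def mae_random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into visible and masked sets, MAE-style.

    During pretraining, the ViT encoder processes only the visible
    patches; a lightweight decoder reconstructs the masked ones.
    """
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    # With a 75% mask ratio, only 25% of patches stay visible.
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(ids[:num_keep])
    masked = sorted(ids[num_keep:])
    return visible, masked

# A 224x224 image with 16x16 patches yields a 14x14 = 196 patch grid.
visible, masked = mae_random_masking(196)
```

The high mask ratio is what makes MAE pretraining cheap: the encoder sees only a quarter of the tokens, so most of the compute is spent on informative, non-redundant patches.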
Keywords:
object detection, self-supervised learning, Vision Transformer (ViT), Masked Autoencoder (MAE), DINOv2, Detection Transformer (DETR), Swin Transformer, Layer-weighted Class Activation Mapping (LayerCAM)
License
Copyright (c) 2025 D. Anil, Ravinder Singh Kuntal, Sudhanshu Maurya, S. Pooja Ahuja, Savitha Hiremath, Basavaraj N. Hiremath, G. S. Girisha, Yogesh H. Bhosale

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
