MAE- and DINOv2-Powered DETR++: A Hybrid, Transformer-Based Self-Supervised Framework for Accurate Object Detection
Received: 24 June 2025 | Revised: 1 August 2025 | Accepted: 14 August 2025 | Online: 5 November 2025
Corresponding author: Yogesh H. Bhosale
Abstract
Object detection remains a cornerstone task in computer vision, with wide-ranging applications in autonomous driving, surveillance, and medical imaging. However, traditional methods rely heavily on large annotated datasets, limiting their adaptability in low-resource environments. We propose Hybrid Masked Autoencoder-Detection Transformer++ (HybridMAE-DETR++), a novel self-supervised object detection framework that combines Masked Autoencoders (MAEs) and DINOv2 for Vision Transformer (ViT) pretraining. Integrated with a Swin-ViT hybrid backbone and an enhanced DETR++ detection head, the framework significantly reduces dependence on annotated data while improving detection accuracy for small and occluded objects. Evaluated on COCO 2017 and Cityscapes, HybridMAE-DETR++ achieves 47.5% mean Average Precision (mAP) and 68.0% Intersection over Union (IoU) on COCO, and 53.2% mAP and 72.1% IoU on Cityscapes, outperforming DETR and other transformer-based baselines. Ablation and sensitivity analyses confirm the robustness of the hybrid pretraining strategy, and visualizations using Layer-weighted Class Activation Mapping (LayerCAM) and Gradient-weighted Class Activation Mapping++ (Grad-CAM++) validate model interpretability. Despite a moderate increase in training time, the precision gains justify the computational cost. This framework sets a new benchmark for label-efficient, interpretable object detection in real-world scenarios.
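The hybrid pretraining described above pairs MAE-style masked reconstruction with DINOv2 self-distillation. As a minimal illustration of the MAE half only, the random patch-masking step can be sketched in plain Python; the 75% mask ratio and the 14×14 patch grid for a 224×224 image follow the original MAE setup, not details taken from this paper:

```python
import random

def mae_random_masking(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into visible and masked sets, MAE-style.

    During pretraining, the ViT encoder processes only the visible
    patches; a lightweight decoder reconstructs the masked ones.
    """
    rng = random.Random(seed)
    ids = list(range(num_patches))
    rng.shuffle(ids)
    # With a 75% mask ratio, only 25% of patches stay visible.
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(ids[:num_keep])
    masked = sorted(ids[num_keep:])
    return visible, masked

# A 224x224 image with 16x16 patches yields a 14x14 = 196 patch grid.
visible, masked = mae_random_masking(196)
```

The high mask ratio is what makes MAE pretraining cheap: the encoder sees only a quarter of the tokens, so most of the compute is spent on informative, non-redundant patches.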
Keywords:
object detection, self-supervised learning, Vision Transformer (ViT), Masked Autoencoder (MAE), DINOv2, Detection Transformer (DETR), Swin Transformer, Layer-weighted Class Activation Mapping (LayerCAM)
License
Copyright (c) 2025 D. Anil, Ravinder Singh Kuntal, Sudhanshu Maurya, S. Pooja Ahuja, Savitha Hiremath, Basavaraj N. Hiremath, G. S. Girisha, Yogesh H. Bhosale

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
