Comparative Development of BioMedCLIP for Enhanced Biomedical Data Integration
Received: 30 September 2025 | Revised: 2 November 2025 | Accepted: 12 November 2025 | Online: 1 December 2025
Corresponding author: Sofia Singh
Abstract
Advances in multimodal learning have produced many foundation models for biomedical Artificial Intelligence (AI). However, applying these models to specialized clinical tasks still requires careful fine-tuning and validation. BioMedCLIP is one such state-of-the-art model. This paper presents a comparative study of adapting BioMedCLIP and evaluating its performance across datasets and fine-tuning approaches. Several fine-tuning methods were explored, and the model was evaluated on two large, challenging datasets: the National Institutes of Health (NIH) Chest X-ray collection for multi-label disease classification, and HAM10000, a large collection of multi-source dermatoscopic images of skin lesions. The primary objective was to assess different fine-tuning strategies and develop a specialized model with enhanced capabilities for integrating image and textual data. Comparing three approaches across the NIH Chest X-ray and HAM10000 datasets, the study demonstrates improved text–image modality integration, 55.6% macro accuracy, and reduced overfitting. This work validates the effectiveness of domain-specific adaptation and delivers a fine-tuned model ready for deployment in advanced healthcare applications.
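To make the fine-tuning setup concrete, the following minimal sketch illustrates one way to adapt BioMedCLIP for the multi-label chest X-ray task described above. It is not the paper's exact pipeline: the Hugging Face hub ID, the 512-dimensional embedding size, and the 14-finding NIH label set are assumptions based on the publicly released BioMedCLIP checkpoint, and the linear probe is a hypothetical stand-in for the fine-tuning heads compared in the study.

    import torch
    import torch.nn as nn
    import open_clip

    # Public BioMedCLIP checkpoint on the Hugging Face hub (assumption: this
    # is the checkpoint variant the paper starts from).
    HUB_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
    model, preprocess = open_clip.create_model_from_pretrained(HUB_ID)

    class MultiLabelProbe(nn.Module):
        """Hypothetical multi-label head: a linear probe over BioMedCLIP's
        image embedding, one logit per NIH Chest X-ray finding."""
        def __init__(self, backbone, embed_dim=512, num_labels=14):
            super().__init__()
            self.backbone = backbone
            self.head = nn.Linear(embed_dim, num_labels)

        def forward(self, images):
            feats = self.backbone.encode_image(images)  # (batch, embed_dim)
            return self.head(feats)                     # raw logits

    clf = MultiLabelProbe(model)
    criterion = nn.BCEWithLogitsLoss()   # independent sigmoid per finding
    optimizer = torch.optim.AdamW(clf.parameters(), lr=1e-5)

    # One illustrative step on a stand-in batch (real training would iterate
    # over preprocessed NIH Chest X-ray images and their label vectors).
    images = torch.randn(2, 3, 224, 224)
    labels = torch.zeros(2, 14)
    loss = criterion(clf(images), labels)
    loss.backward()
    optimizer.step()

Because each chest X-ray can carry several findings at once, the sketch uses a per-label sigmoid with binary cross-entropy rather than a softmax over classes; this is the standard formulation for multi-label classification on this dataset.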
Keywords:
BioMedCLIP, fine-tuning, multimodal Artificial Intelligence (AI), Contrastive Language–Image Pretraining (CLIP), fusion, cross-attention transformer
License
Copyright (c) 2025 Praveen Pandey, Hiyaa Malik, Sofia Singh, Dipti Theng, Urvashi Agrawal, Raj Kumar, Sanjay Balwani, Anoop Kumar Shukla

This work is licensed under a Creative Commons Attribution 4.0 International License.
