Comparative Development of BioMedCLIP for Enhanced Biomedical Data Integration
Received: 30 September 2025 | Revised: 2 November 2025 | Accepted: 12 November 2025 | Online: 1 December 2025
Corresponding author: Sofia Singh
Abstract
Advances in multimodal learning have produced many foundation models for biomedical Artificial Intelligence (AI). However, applying these models to specialized clinical tasks still requires careful fine-tuning and validation. BioMedCLIP is one such state-of-the-art model. This paper presents a comparative study of adapting BioMedCLIP and evaluating its performance across datasets and fine-tuning approaches. Several fine-tuning methods were explored, and the model was evaluated on two large, challenging datasets: the National Institutes of Health (NIH) Chest X-ray collection for multi-label disease classification, and HAM10000, a large collection of multi-source dermatoscopic images of skin lesions. The primary objective was to assess different fine-tuning strategies and develop a specialized model with enhanced capabilities for integrating image and textual data. Comparing three approaches across the NIH Chest X-ray and HAM10000 datasets, the study demonstrates improved text–image modality integration, 55.6% macro accuracy, and reduced overfitting. This work validates the effectiveness of domain-specific adaptation and delivers a fine-tuned model ready for deployment in advanced healthcare applications.
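To make the fine-tuning setup concrete, the following minimal sketch illustrates one way to adapt BioMedCLIP for the multi-label chest X-ray task described above. It is not the paper's exact pipeline: the Hugging Face hub ID, the 512-dimensional embedding size, and the 14-finding NIH label set are assumptions based on the publicly released BioMedCLIP checkpoint, and the linear probe is a hypothetical stand-in for the fine-tuning heads compared in the study.

    import torch
    import torch.nn as nn
    import open_clip

    # Public BioMedCLIP checkpoint on the Hugging Face hub (assumption: this
    # is the checkpoint variant the paper starts from).
    HUB_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
    model, preprocess = open_clip.create_model_from_pretrained(HUB_ID)

    class MultiLabelProbe(nn.Module):
        """Hypothetical multi-label head: a linear probe over BioMedCLIP's
        image embedding, one logit per NIH Chest X-ray finding."""
        def __init__(self, backbone, embed_dim=512, num_labels=14):
            super().__init__()
            self.backbone = backbone
            self.head = nn.Linear(embed_dim, num_labels)

        def forward(self, images):
            feats = self.backbone.encode_image(images)  # (batch, embed_dim)
            return self.head(feats)                     # raw logits

    clf = MultiLabelProbe(model)
    criterion = nn.BCEWithLogitsLoss()   # independent sigmoid per finding
    optimizer = torch.optim.AdamW(clf.parameters(), lr=1e-5)

    # One illustrative step on a stand-in batch (real training would iterate
    # over preprocessed NIH Chest X-ray images and their label vectors).
    images = torch.randn(2, 3, 224, 224)
    labels = torch.zeros(2, 14)
    loss = criterion(clf(images), labels)
    loss.backward()
    optimizer.step()

Because each chest X-ray can carry several findings at once, the sketch uses a per-label sigmoid with binary cross-entropy rather than a softmax over classes; this is the standard formulation for multi-label classification on this dataset.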
Keywords:
BioMedCLIP, fine-tuning, multimodal Artificial Intelligence (AI), Contrastive Language–Image Pretraining (CLIP), fusion, cross-attention transformer
License
Copyright (c) 2025 Praveen Pandey, Hiyaa Malik, Sofia Singh, Dipti Theng, Urvashi Agrawal, Raj Kumar, Sanjay Balwani, Anoop Kumar Shukla

This work is licensed under a Creative Commons Attribution 4.0 International License.
