A Cross-Modal Retrieval Framework for Radiology Reports and Chest X-Ray Images
Received: 24 December 2025 | Revised: 8 January 2026 and 31 January 2026 | Accepted: 2 February 2026 | Online: 19 February 2026
Corresponding author: Yogesh H. Bhosale
Abstract
Accurate retrieval of relevant radiology reports and images is essential for clinical decision support and large-scale medical data management. This study proposes MedFuse-CLIP, a domain-tuned cross-modal retrieval framework that aligns chest X-ray images with corresponding radiology reports using a dual-encoder OpenCLIP ViT-B/32 backbone. The model introduces two key innovations: adaptive semantic hard-negative mining, which enhances discriminative learning across visually similar pathologies, and a retrieval-aware margin contrastive loss, which stabilizes alignment within the embedding space. Experiments on the MIMIC-CXR-JPG dataset demonstrate strong performance, achieving Recall@10 scores of 87.5% for image→text retrieval and 83.4% for text→image retrieval under fine-tuned cross-modal alignment, along with an average AUROC of 0.879 for zero-shot disease classification across 14 thoracic pathologies. Dimensionality reduction analysis confirms that compact 256-dimensional embeddings preserve more than 98% of retrieval accuracy while halving storage requirements. The results indicate that MedFuse-CLIP matches or exceeds existing radiology vision–language models such as BioViL and GLoRIA while operating efficiently on a single consumer GPU.
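To make the abstract's two central quantities concrete, the following is a minimal, illustrative sketch of (a) a symmetric contrastive loss with an additive margin on the matched pair, in the spirit of the "retrieval-aware margin contrastive loss" described above, and (b) the Recall@K metric used to report retrieval performance. This is not the authors' implementation: the margin and temperature values, the function names, and the use of a plain similarity matrix are all assumptions made for illustration.

```python
import math

def margin_contrastive_loss(sim, margin=0.2, temperature=0.07):
    """Symmetric InfoNCE-style loss over an n x n image-text similarity
    matrix, with an additive margin subtracted from the positive
    (diagonal) logit so matched pairs must win by a gap.
    Hyperparameter values here are illustrative assumptions."""
    n = len(sim)
    total = 0.0
    for direction in ("image_to_text", "text_to_image"):
        for i in range(n):
            # Row i of sim for image->text; column i for text->image.
            row = [sim[i][j] if direction == "image_to_text" else sim[j][i]
                   for j in range(n)]
            logits = [(v - margin if j == i else v) / temperature
                      for j, v in enumerate(row)]
            # Numerically stable log-sum-exp for the softmax normalizer.
            m = max(logits)
            log_z = m + math.log(sum(math.exp(v - m) for v in logits))
            total += log_z - logits[i]  # cross-entropy on the positive
    return total / (2 * n)

def recall_at_k(sim, k=10):
    """Fraction of queries (rows) whose true match (same index)
    appears among the top-k most similar candidates."""
    n = len(sim)
    hits = 0
    for i in range(n):
        ranked = sorted(range(n), key=lambda j: sim[i][j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / n

# Toy 2x2 similarity matrix: each query is most similar to its match.
sim = [[1.0, 0.1],
       [0.2, 0.9]]
print(recall_at_k(sim, k=1))        # → 1.0 (both matches ranked first)
print(margin_contrastive_loss(sim))  # small positive loss
```

In the paper's setting, `sim` would be the cosine-similarity matrix between image and report embeddings in a batch, and hard-negative mining would reweight or resample the off-diagonal terms; here the loss is left uniform for brevity.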
Keywords:
radiology retrieval, cross-modal learning, vision–language model, contrastive learning, CLIP, medical imaging AI
References
A. E. W. Johnson et al., "MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports," Scientific Data, vol. 6, no. 1, Dec. 2019, Art. no. 317.
K. You et al., "CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2023, vol. 14221, H. Greenspan, A. Madabhushi, P. Mousavi, S. Salcudean, J. Duncan, T. Syeda-Mahmood, and R. Taylor, Eds. Springer Nature Switzerland, 2023, pp. 101–111.
H. N. T. Al-Azzawi et al., "Utilization of a Deep Convolutional Neural Network for the Binary Classification of Chest X-Ray Pneumonia," Engineering, Technology & Applied Science Research, vol. 15, no. 1, pp. 20471–20483, Feb. 2025.
Y. Zhang, H. Jiang, Y. Miura, C. D. Manning, and C. P. Langlotz, "Contrastive Learning of Medical Visual Representations from Paired Images and Text," in Proceedings of the 7th Machine Learning for Healthcare Conference, Dec. 2022, pp. 2–25.
S. C. Huang, L. Shen, M. P. Lungren, and S. Yeung, "GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2021, pp. 3922–3931.
B. Boecking et al., "Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing," in Computer Vision – ECCV 2022, vol. 13696, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds. Springer Nature Switzerland, 2022, pp. 1–21.
Z. Wang, Z. Wu, D. Agarwal, and J. Sun, "MedCLIP: Contrastive Learning from Unpaired Medical Images and Text," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3876–3887.
J. Jang, D. Kyung, S. H. Kim, H. Lee, K. Bae, and E. Choi, "Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders," Scientific Reports, vol. 14, no. 1, Oct. 2024, Art. no. 23199.
M. Jahanian, A. Karimi, N. O. Eraghi, and F. Zarafshan, "Multimodal transformers for joint diagnosis and retrieval from chest X-rays and radiology reports," Informatics in Medicine Unlocked, vol. 58, 2025, Art. no. 101672.
K. Schall, K. U. Barthel, N. Hezel, and K. Jung, "Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment," in Similarity Search and Applications, vol. 15268, E. Chávez, B. Kimia, J. Lokoč, M. Patella, and J. Sedmidubsky, Eds. Springer Nature Switzerland, 2025, pp. 97–110.
A. Nirala, A. Joshi, S. Sarkar, and C. Hegde, "Fast Certification of Vision-Language Models Using Incremental Randomized Smoothing," in 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), Apr. 2024, pp. 252–271.
A. Johnson et al., "MIMIC-CXR-JPG - chest radiographs with structured labels." PhysioNet.
F. Ashfaq, N. Jhanjhi, N. A. Khan, D. Javed, M. Masud, and M. Shorfuzzaman, "Enhancing ECG Report Generation With Domain-Specific Tokenization for Improved Medical NLP Accuracy," IEEE Access, vol. 13, pp. 85493–85506, 2025.
W. Dong, S. Shen, Y. Han, T. Tan, J. Wu, and H. Xu, "Generative Models in Medical Visual Question Answering: A Survey," Applied Sciences, vol. 15, no. 6, Mar. 2025, Art. no. 2983.
W. Huang, X. Hu, S. Abousamra, P. Prasanna, and C. Chen, "Hard Negative Sample Mining for Whole Slide Image Classification," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, vol. 15004, M. G. Linguraru, Q. Dou, A. Feragen, S. Giannarou, B. Glocker, K. Lekadir, and J. A. Schnabel, Eds. Springer Nature Switzerland, 2024, pp. 144–154.
License
Copyright (c) 2026 Vijayalaxmi Mekali, M. P. Sowbhagya, H. D. Aparna, Deepa K. Mathew, Vandana Singh, Nitesh N. Nikam, Ganesh B. Dongre, Yogesh H. Bhosale

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
