A Cross-Modal Retrieval Framework for Radiology Reports and Chest X-Ray Images

Vijayalaxmi Mekali; M. P. Sowbhagya; H. D. Aparna; Deepa K. Mathew; Vandana Singh; Nitesh N. Nikam; Ganesh B. Dongre; Yogesh H. Bhosale

doi:10.48084/etasr.17163

Authors

Vijayalaxmi Mekali, Department of Computer Science and Engineering, K. S. Institute of Technology, Bengaluru, India
M. P. Sowbhagya, Department of Computer Science and Engineering, K. S. Institute of Technology, Bengaluru, India
H. D. Aparna, Department of Computer Science and Business Systems, Dayananda Sagar College of Engineering, Bengaluru, India
Deepa K. Mathew, Artificial Intelligence and Machine Learning Department, Dayananda Sagar Academy of Technology & Management (DSATM), Bangalore, India
Vandana Singh, Department of Computer Science & Engineering, Amity School of Engineering and Technology, Ranchi, India
Nitesh N. Nikam, Department of Electrical Engineering, CSMSS Chh. Shahu College of Engineering, Chhatrapati Sambhajinagar, India
Ganesh B. Dongre, Department of Electronics and Computer Engineering, CSMSS Chh. Shahu College of Engineering, Chhatrapati Sambhajinagar, India
Yogesh H. Bhosale, Department of Computer Science & Engineering, CSMSS Chh. Shahu College of Engineering, Kanchanwadi, Chhatrapati Sambhajinagar, India

Volume: 16 | Issue: 2 | Pages: 33196-33201 | April 2026 | https://doi.org/10.48084/etasr.17163

Received: 24 December 2025 | Revised: 8 January 2026 and 31 January 2026 | Accepted: 2 February 2026 | Online: 19 February 2026

Corresponding author: Yogesh H. Bhosale

Abstract

Accurate retrieval of relevant radiology reports and images is essential for clinical decision support and large-scale medical data management. This study proposes MedFuse-CLIP, a domain-tuned cross-modal retrieval framework that aligns chest X-ray images with their corresponding radiology reports using a dual-encoder OpenCLIP ViT-B/32 backbone. The model introduces two key innovations: adaptive semantic hard-negative mining, which enhances discriminative learning across visually similar pathologies, and a retrieval-aware margin contrastive loss, which stabilizes alignment within the embedding space. Experiments on the MIMIC-CXR-JPG dataset demonstrate strong performance: Recall@10 scores of 87.5% for image→text retrieval and 83.4% for text→image retrieval under fine-tuned cross-modal alignment, along with an average AUROC of 0.879 for zero-shot disease classification across 14 thoracic pathologies. Dimensionality reduction analysis confirms that compact 256-dimensional embeddings preserve more than 98% of retrieval accuracy while halving storage requirements. The results indicate that MedFuse-CLIP matches or exceeds existing radiology vision-language models such as BioViL and GLoRIA while operating efficiently on a single consumer GPU.
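To make the reported metric and the general idea of margin-based training with hard negatives concrete, the following is a minimal NumPy sketch. It is an illustration only and not the authors' implementation: the function names `recall_at_k` and `hardest_negative_margin_loss`, the margin value of 0.2, and the choice of the single hardest in-batch negative are all assumptions made for this example. The abstract does not specify how the adaptive semantic mining or the retrieval-aware margin is defined, so a generic hinge loss over in-batch negatives is used as a stand-in. Row i of the image embeddings is assumed to be paired with row i of the text embeddings.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def recall_at_k(query_emb, gallery_emb, k=10):
    """Fraction of queries whose paired gallery item (same row index) is in the top k."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    # Indices of the k most similar gallery items for each query.
    topk = np.argsort(-sims, axis=1)[:, :k]
    targets = np.arange(sims.shape[0])[:, None]
    return float((topk == targets).any(axis=1).mean())


def hardest_negative_margin_loss(img_emb, txt_emb, margin=0.2):
    """Generic hinge loss using the hardest in-batch negative in each direction.

    Assumption: this is a stand-in for illustration, not the loss from the paper.
    """
    sims = l2_normalize(img_emb) @ l2_normalize(txt_emb).T
    pos = np.diag(sims)                 # similarity of each true image-report pair
    neg = sims.copy()
    np.fill_diagonal(neg, -np.inf)      # exclude the true pairs from the negatives
    hardest_i2t = neg.max(axis=1)       # most confusing report for each image
    hardest_t2i = neg.max(axis=0)       # most confusing image for each report
    loss = (np.maximum(0.0, margin + hardest_i2t - pos)
            + np.maximum(0.0, margin + hardest_t2i - pos))
    return float(loss.mean())
```

With embeddings produced by the two encoders, `recall_at_k(image_emb, text_emb, k=10)` would correspond to image→text Recall@10, and swapping the arguments would give the text→image direction. The loss is zero only when every true pair beats its hardest in-batch negative by at least the margin in both directions.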

Keywords:

radiology retrieval, cross-modal learning, vision–language model, contrastive learning, CLIP, medical imaging AI

References

A. E. W. Johnson et al., "MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports," Scientific Data, vol. 6, no. 1, Dec. 2019, Art. no. 317. DOI: https://doi.org/10.1038/s41597-019-0322-0

K. You et al., "CXR-CLIP: Toward Large Scale Chest X-ray Language-Image Pre-training," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2023, vol. 14221, H. Greenspan, A. Madabhushi, P. Mousavi, S. Salcudean, J. Duncan, T. Syeda-Mahmood, and R. Taylor, Eds. Springer Nature Switzerland, 2023, pp. 101–111.

H. N. T. Al-Azzawi et al., "Utilization of a Deep Convolutional Neural Network for the Binary Classification of Chest X-Ray Pneumonia," Engineering, Technology & Applied Science Research, vol. 15, no. 1, pp. 20471–20483, Feb. 2025. DOI: https://doi.org/10.48084/etasr.9788

Y. Zhang, H. Jiang, Y. Miura, C. D. Manning, and C. P. Langlotz, "Contrastive Learning of Medical Visual Representations from Paired Images and Text," in Proceedings of the 7th Machine Learning for Healthcare Conference, Dec. 2022, pp. 2–25.

S. C. Huang, L. Shen, M. P. Lungren, and S. Yeung, "GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition," in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Oct. 2021, pp. 3922–3931. DOI: https://doi.org/10.1109/ICCV48922.2021.00391

B. Boecking et al., "Making the Most of Text Semantics to Improve Biomedical Vision–Language Processing," in Computer Vision – ECCV 2022, vol. 13696, S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, and T. Hassner, Eds. Springer Nature Switzerland, 2022, pp. 1–21.

Z. Wang, Z. Wu, D. Agarwal, and J. Sun, "MedCLIP: Contrastive Learning from Unpaired Medical Images and Text," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3876–3887. DOI: https://doi.org/10.18653/v1/2022.emnlp-main.256

J. Jang, D. Kyung, S. H. Kim, H. Lee, K. Bae, and E. Choi, "Significantly improving zero-shot X-ray pathology classification via fine-tuning pre-trained image-text encoders," Scientific Reports, vol. 14, no. 1, Oct. 2024, Art. no. 23199. DOI: https://doi.org/10.1038/s41598-024-73695-z

M. Jahanian, A. Karimi, N. O. Eraghi, and F. Zarafshan, "Multimodal transformers for joint diagnosis and retrieval from chest X-rays and radiology reports," Informatics in Medicine Unlocked, vol. 58, 2025, Art. no. 101672. DOI: https://doi.org/10.1016/j.imu.2025.101672

K. Schall, K. U. Barthel, N. Hezel, and K. Jung, "Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment," in Similarity Search and Applications, vol. 15268, E. Chávez, B. Kimia, J. Lokoč, M. Patella, and J. Sedmidubsky, Eds. Springer Nature Switzerland, 2025, pp. 97–110. DOI: https://doi.org/10.1007/978-3-031-75823-2_9

A. Nirala, A. Joshi, S. Sarkar, and C. Hegde, "Fast Certification of Vision-Language Models Using Incremental Randomized Smoothing," in 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), Apr. 2024, pp. 252–271. DOI: https://doi.org/10.1109/SaTML59370.2024.00019

A. Johnson et al., "MIMIC-CXR-JPG - chest radiographs with structured labels." PhysioNet.

F. Ashfaq, N. Jhanjhi, N. A. Khan, D. Javed, M. Masud, and M. Shorfuzzaman, "Enhancing ECG Report Generation With Domain-Specific Tokenization for Improved Medical NLP Accuracy," IEEE Access, vol. 13, pp. 85493–85506, 2025. DOI: https://doi.org/10.1109/ACCESS.2025.3567566

W. Dong, S. Shen, Y. Han, T. Tan, J. Wu, and H. Xu, "Generative Models in Medical Visual Question Answering: A Survey," Applied Sciences, vol. 15, no. 6, Mar. 2025, Art. no. 2983. DOI: https://doi.org/10.3390/app15062983

W. Huang, X. Hu, S. Abousamra, P. Prasanna, and C. Chen, "Hard Negative Sample Mining for Whole Slide Image Classification," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, vol. 15004, M. G. Linguraru, Q. Dou, A. Feragen, S. Giannarou, B. Glocker, K. Lekadir, and J. A. Schnabel, Eds. Springer Nature Switzerland, 2024, pp. 144–154. DOI: https://doi.org/10.1007/978-3-031-72083-3_14