Improving File-Level Bug Localization Using Pre-Trained Code Models: A Comprehensive Review

Al-Anzi Tuqa Emad Hussein; Aldabbagh Mohammad A. Taha

doi:10.48084/etasr.18559

Authors

Al-Anzi Tuqa Emad Hussein, Department of Software, College of Computer Science and Mathematics, University of Mosul, Iraq
Aldabbagh Mohammad A. Taha, Department of Software, College of Computer Science and Mathematics, University of Mosul, Iraq, https://orcid.org/0000-0003-2240-9643

Volume: 16 | Issue: 3 | Pages: 36058-36063 | June 2026 | https://doi.org/10.48084/etasr.18559

Received: 6 March 2026 | Revised: 25 April 2026 | Accepted: 30 April 2026 | Online: 10 May 2026

Corresponding author: Al-Anzi Tuqa Emad Hussein

Abstract

Bug localization, the automatic identification of the source code files that contain a given defect, is a fundamental problem in software maintenance. Pre-trained models such as CodeBERT, GraphCodeBERT, UniXcoder, and CodeT5 can effectively bridge the semantic gap between natural-language bug reports and source code. However, existing studies use inconsistent datasets and evaluation protocols, which makes their results non-comparable and non-reproducible. This review focuses on file-level bug localization using pre-trained models. It goes beyond prior surveys by identifying cross-study inconsistencies, highlighting a structural gap in LLM-based file-level evaluation, and providing a critical analysis of existing approaches. The key contributions are: a five-category taxonomy covering information retrieval (IR), machine learning (ML), deep learning, pre-trained language model, and large language model (LLM) based approaches, the first taxonomy to target file-level granularity across all paradigms; a cross-study analysis suggesting that identical models on identical benchmarks report Top-10 accuracy values that differ by up to 20 percentage points as a result of undisclosed experimental differences; and a structured framework that maps eight open research gaps to seven evidence-supported research directions. The review also covers the most recent advances in the field, including studies published up to 2026.

Keywords:

bug localization, file-level bug localization, pre-trained code models, CodeBERT, GraphCodeBERT, UniXcoder, CodeT5, software maintenance, deep learning
