Enhancing Vulnerability Detection Precision through Ensemble Learning with Large Language Models

Hussein Al-Ofeishat; Azhar Hussain; Muhammad Rehan Faheem; Syed Asim Ali Shah; Hannan Adeel; Muzammil Hussain

doi:10.48084/etasr.18412

Authors

Hussein Al-Ofeishat Department of Computer Science, Faculty of Information Technology, Al-Ahliyya Amman University, Amman, Jordan | Faculty of Engineering, Al-Balqa Applied University, Al-Salt, Jordan
Azhar Hussain Department of Computer Science, The Islamia University of Bahawalpur, Pakistan
Muhammad Rehan Faheem Fakulti Kecerdasan Buatan dan Keselamatan Siber, Universiti Teknikal Malaysia Melaka, Melaka, Malaysia
Syed Asim Ali Shah Fakulti Pengurusan Teknologi Dan Teknousahawanan, Kampus Teknologi, Universiti Teknikal Malaysia Melaka, Melaka, Malaysia
Hannan Adeel Fakulti Kecerdasan Buatan dan Keselamatan Siber, Universiti Teknikal Malaysia Melaka, Melaka, Malaysia
Muzammil Hussain Department of Software Engineering, Faculty of Information Technology, Al-Ahliyya Amman University, Amman, Jordan

Volume: 16 | Issue: 4 | Pages: 37565-37570 | August 2026 | https://doi.org/10.48084/etasr.18412

Received: 27 February 2026 | Revised: 19 May 2026 | Accepted: 27 May 2026 | Online: 28 June 2026

Corresponding author: Muhammad Rehan Faheem

Abstract

This study investigates the use of ensemble learning with Large Language Models (LLMs) to improve the accuracy of software vulnerability prediction, following a structured experimental approach to assess whether combining multiple models enhances performance. Three baseline models, CodeBERT, GraphCodeBERT, and CodeT5, were trained and evaluated on the Devign dataset, which provides a large collection of labeled source code snippets. Their outputs were then combined using three ensemble techniques: Majority Voting, Weighted Voting, and Stacking. Performance was measured with precision, recall, and F1-score. The ensemble approaches outperformed all standalone models. In particular, Majority Voting increased precision from 0.601 (CodeBERT) to 0.690, a relative improvement of 14.81%. While keeping detection accuracy in view, the study focused on reducing false positives. The results show that ensemble techniques are a practical way to boost the precision of LLMs in vulnerability detection. Ensemble learning can mitigate the limitations of standalone models by reducing false positives and improving the overall balance between accuracy and reliability. The study suggests that ensemble methods hold considerable promise for advancing software security analysis.
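To make the voting rules and metrics named in the abstract concrete, the following Python sketch shows how per-sample predictions from three base models could be combined by majority and weighted voting, and how precision, recall, F1, and the relative precision gain are computed. It is a minimal illustration under stated assumptions, not the authors' implementation: label 1 means "vulnerable", the model weights and the toy data are hypothetical, and the hypothetical helper names (`majority_vote`, `weighted_vote`, `precision_recall_f1`, `relative_gain`) are introduced only for this example. Stacking, which trains a separate meta-classifier on the base models' outputs, is not sketched here.

```python
from typing import Sequence

# Assumed convention: label 1 = vulnerable, 0 = not vulnerable.


def majority_vote(labels: Sequence[Sequence[int]]) -> list[int]:
    """labels[m][i] is model m's 0/1 prediction for sample i.
    A sample is flagged only if a strict majority of models flag it."""
    n_models = len(labels)
    return [1 if 2 * sum(column) > n_models else 0 for column in zip(*labels)]


def weighted_vote(probs: Sequence[Sequence[float]], weights: Sequence[float],
                  threshold: float = 0.5) -> list[int]:
    """probs[m][i] is model m's probability that sample i is vulnerable.
    Computes a weighted average of the probabilities, then thresholds it."""
    total = sum(weights)
    result = []
    for column in zip(*probs):
        score = sum(w * p for w, p in zip(weights, column)) / total
        result.append(1 if score >= threshold else 0)
    return result


def precision_recall_f1(y_true: Sequence[int],
                        y_pred: Sequence[int]) -> tuple[float, float, float]:
    """Binary metrics with the vulnerable class (1) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed vulnerabilities
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def relative_gain(old: float, new: float) -> float:
    """Relative improvement in percent, e.g. precision 0.601 -> 0.690."""
    return (new - old) / old * 100


if __name__ == "__main__":
    # Hypothetical toy data, not taken from the Devign experiments.
    y_true = [1, 0, 1, 0, 1, 0]
    probs = [
        [0.9, 0.7, 0.8, 0.2, 0.4, 0.6],   # model A
        [0.8, 0.3, 0.4, 0.6, 0.7, 0.3],   # model B
        [0.4, 0.2, 0.9, 0.1, 0.8, 0.55],  # model C
    ]
    # Hard labels for majority voting: threshold each probability at 0.5.
    labels = [[1 if p >= 0.5 else 0 for p in row] for row in probs]
    weights = [0.2, 0.3, 0.5]  # hypothetical, e.g. tuned on a validation split

    print("model A alone:", precision_recall_f1(y_true, labels[0]))
    print("majority vote:", precision_recall_f1(y_true, majority_vote(labels)))
    print("weighted vote:", precision_recall_f1(y_true, weighted_vote(probs, weights)))
    print(f"relative precision gain: {relative_gain(0.601, 0.690):.2f}%")
```

In this toy case, model A alone raises two false alarms (precision 0.5), whereas the majority vote raises only one (precision 0.75) without missing any vulnerable sample, which illustrates how combining models can raise precision by suppressing false positives. The last line reproduces the abstract's arithmetic: (0.690 − 0.601) / 0.601 ≈ 14.81%, a relative rather than an absolute gain.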

Keywords:

ensemble learning, large language models, vulnerability detection, precision, majority voting, weighted voting, stacking
