A Sequential Data Preprocessing Pipeline for Diabetes Prediction: A Data Leakage Prevention and Dual-Validation Approach

Ahmed Majid AbdulAbbas; Rafid Alkanany; Yasir Ali Khalid Al-Nuaimi; Zahraa Mehssen Agheeb Al-Hamdawee

doi:10.48084/etasr.14155

Authors

Ahmed Majid AbdulAbbas Department of Electrical Engineering, College of Engineering, University of Misan, Amarah, Iraq
Rafid Alkanany Department of Computer Techniques Engineering, Imam Alkadhim University College (IKU), Baghdad, Iraq
Yasir Ali Khalid Al-Nuaimi Department of Electrical Engineering, College of Engineering, University of Misan, Amarah, Iraq
Zahraa Mehssen Agheeb Al-Hamdawee Department of Electrical Engineering, College of Engineering, University of Misan, Amarah, Iraq

Volume: 15 | Issue: 6 | Pages: 30059-30066 | December 2025 | https://doi.org/10.48084/etasr.14155

Received: 19 August 2025 | Revised: 20 September 2025 | Accepted: 6 October 2025 | Online: 29 October 2025

Corresponding author: Ahmed Majid AbdulAbbas

Abstract

Machine learning approaches for diabetes prediction face methodological challenges, including data leakage caused by preprocessing before data splitting, inconsistent handling of missing values and class imbalance, and varying validation methods. This study presents a systematic approach that prevents data leakage and establishes standardized benchmarks for diabetes prediction. Using the PIMA Indian Diabetes Dataset (768 patients), the study applied a sequential preprocessing pipeline: Multivariate Imputation by Chained Equations (MICE) for missing values (652 missing, 9.43% of data), the Synthetic Minority Over-sampling Technique (SMOTE) for class balance (500 nondiabetic vs. 268 diabetic cases), and z-score normalization for feature scaling. Two feature selection methods identified six important clinical variables: Glucose, Pregnancies, Glucose_BMI, Glucose_Age, BMI, and BloodPressure. Two validation approaches, a single 80:20 split and 5-fold cross-validation, were used to compare five machine learning algorithms: Random Forest (RF), Multi-Layer Perceptron (MLP), XGBoost, Support Vector Machine (SVM), and Logistic Regression (LR). Experimental results showed that RF achieved the highest accuracy (79.79%) in single-split testing, whereas MLP performed best in cross-validation (77.81% accuracy, 0.8443 ROC-AUC). All algorithms achieved ROC-AUC scores above 0.80. Cross-validation analysis revealed that RF showed consistent performance across data splits, whereas MLP demonstrated better adaptability to different data conditions.
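The leakage-prevention idea in the abstract (split first, then fit every preprocessing step on the training portion only) can be illustrated with a minimal sketch. It is not the authors' implementation. It uses synthetic two-feature data with the abstract's class counts, omits MICE imputation, and replaces SMOTE with a simplified nearest-neighbour interpolation. All function names are hypothetical, and the assumption that the 9.43% figure is computed over 9 columns (8 predictors plus the outcome) is inferred only because it reproduces the reported value.

```python
import random
import statistics


def stratified_split(rows, labels, test_frac=0.2, seed=0):
    """Split per class so both parts keep the original class ratio."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])

    def take(ids):
        return [rows[i] for i in ids], [labels[i] for i in ids]

    return take(train_idx), take(test_idx)


def oversample_minority(rows, labels, minority=1, seed=0):
    """Simplified SMOTE-like step (assumption): interpolate between a
    minority row and its nearest minority neighbour until classes match."""
    rng = random.Random(seed)
    minority_rows = [r for r, y in zip(rows, labels) if y == minority]
    n_needed = (len(rows) - len(minority_rows)) - len(minority_rows)
    new_rows = []
    for _ in range(max(n_needed, 0)):
        base = rng.choice(minority_rows)
        neighbour = min(
            (r for r in minority_rows if r is not base),
            key=lambda r: sum((a - b) ** 2 for a, b in zip(r, base)),
        )
        t = rng.random()
        new_rows.append([a + t * (b - a) for a, b in zip(base, neighbour)])
    return rows + new_rows, labels + [minority] * len(new_rows)


def fit_zscore(rows):
    """Learn per-feature mean and standard deviation from these rows only."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return means, stds


def apply_zscore(rows, means, stds):
    """Scale rows with previously learned statistics."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]


# Synthetic stand-in data (NOT the PIMA dataset): 500 negatives, 268 positives.
rng = random.Random(42)
X = [[rng.gauss(110, 25), rng.gauss(30, 6)] for _ in range(500)]
X += [[rng.gauss(140, 30), rng.gauss(35, 7)] for _ in range(268)]
y = [0] * 500 + [1] * 268

# 1) Split BEFORE any preprocessing so the test rows stay unseen.
(X_train, y_train), (X_test, y_test) = stratified_split(X, y, test_frac=0.2)

# 2) Balance the training rows only; the test set keeps its natural ratio.
X_bal, y_bal = oversample_minority(X_train, y_train)

# 3) Fit scaling statistics on training rows, then reuse them on the test rows.
means, stds = fit_zscore(X_bal)
X_train_s = apply_zscore(X_bal, means, stds)
X_test_s = apply_zscore(X_test, means, stds)

# Share of missing cells, assuming 768 rows x 9 columns (assumption).
missing_share = 652 / (768 * 9)

print(len(X_train), len(X_test), len(X_train_s))
print(f"missing share: {missing_share:.2%}")
```

For 5-fold cross-validation, the same three steps would be repeated inside each fold, with the statistics and synthetic samples derived from that fold's training part only.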

Keywords:

machine learning, diabetes prediction, data preprocessing, cross-validation, data leakage prevention

