A Sequential Data Preprocessing Pipeline for Diabetes Prediction: A Data Leakage Prevention and Dual-Validation Approach
Received: 19 August 2025 | Revised: 20 September 2025 | Accepted: 6 October 2025 | Online: 29 October 2025
Corresponding author: Ahmed Majid AbdulAbbas
Abstract
Machine learning approaches for diabetes prediction face several methodological challenges: data leakage caused by preprocessing before data splitting, inconsistent handling of missing values and class imbalance, and varying validation methods. This study presents a systematic approach that prevents data leakage and establishes standardized benchmarks for diabetes prediction. Using the PIMA Indian Diabetes Dataset (768 patients), the study applied a sequential preprocessing pipeline: Multivariate Imputation by Chained Equations (MICE) for missing values (652 missing, 9.43% of the data), the Synthetic Minority Over-sampling Technique (SMOTE) for class balance (500 nondiabetic vs. 268 diabetic cases), and z-score normalization for feature scaling. Two feature selection methods identified six important clinical variables: Glucose, Pregnancies, Glucose_BMI, Glucose_Age, BMI, and BloodPressure. Two validation approaches, a single split (80:20) and 5-fold cross-validation, were used to compare five machine learning algorithms: Random Forest (RF), Multi-Layer Perceptron (MLP), XGBoost, Support Vector Machine (SVM), and Logistic Regression (LR). RF achieved the highest accuracy (79.79%) in single-split testing, whereas MLP performed best in cross-validation (77.81% accuracy, 84.43% ROC-AUC). All algorithms achieved ROC-AUC scores above 80%. Cross-validation analysis revealed that RF showed consistent performance across data splits, whereas MLP demonstrated better adaptability to different data conditions.
Keywords:
machine learning, diabetes prediction, data preprocessing, cross-validation, data leakage prevention
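The leakage-prevention idea in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, the data are synthetic (only the 500/268 class counts are taken from the abstract), and only the z-score step is shown. The same pattern would presumably apply to imputation and oversampling, with each step fitted on the training portion only and oversampling never applied to held-out data.

```python
import random
import statistics


def stratified_folds(labels, k=5, seed=0):
    """Assign sample indices to k folds so each fold keeps a similar class ratio."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def fit_zscore(rows):
    """Learn per-column mean and standard deviation from the given rows only."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) or 1.0 for c in cols]  # avoid dividing by zero
    return means, stds


def apply_zscore(rows, means, stds):
    """Scale rows with previously learned statistics."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def leakage_free_folds(X, y, k=5, seed=0):
    """Yield (X_train, y_train, X_test, y_test); scaling is fitted per training fold."""
    folds = stratified_folds(y, k, seed)
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        # Statistics come from the training rows only; the test rows never influence them.
        means, stds = fit_zscore([X[i] for i in train_idx])
        yield (
            apply_zscore([X[i] for i in train_idx], means, stds),
            [y[i] for i in train_idx],
            apply_zscore([X[i] for i in test_idx], means, stds),
            [y[i] for i in test_idx],
        )


# Synthetic stand-in data (NOT the PIMA dataset): two made-up numeric features.
rng = random.Random(42)
y = [0] * 500 + [1] * 268
X = [[rng.gauss(110 + 30 * label, 25), rng.gauss(32, 7)] for label in y]

splits = list(leakage_free_folds(X, y))
for fold_no, (X_tr, y_tr, X_te, y_te) in enumerate(splits, start=1):
    print(fold_no, len(X_tr), len(X_te))
```

Each of the five folds holds out roughly 20% of the samples, so a single fold also corresponds to one 80:20 split. Scaling the whole dataset before splitting would let test-set statistics leak into the training features, which is the kind of leakage the abstract describes.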
License
Copyright (c) 2025 Ahmed Majid AbdulAbbas, Rafid Alkanany, Yasir Ali Khalid Al-Nuaimi, Zahraa Mehssen Agheeb Al-Hamdawee

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
