Enhanced Sepsis Prediction Using Ensemble Learning with SMOTE-Based Data Balancing and Stratified Validation
Received: 14 August 2025 | Revised: 14 October 2025 and 28 October 2025 | Accepted: 31 October 2025 | Online: 9 February 2026
Corresponding author: N. Smitha
Abstract
Sepsis, a critical condition triggered by an abnormal immune response to infection, requires rapid and accurate identification to reduce the risk of mortality. In clinical settings, datasets often suffer from severe class imbalance, with sepsis cases significantly underrepresented, which complicates early prediction efforts. This study explores and compares the effectiveness of traditional and ensemble-based machine learning algorithms for sepsis detection. Initially, models such as Random Forest (RF), Support Vector Machine (SVM), XGBoost, and K-Nearest Neighbors (KNN) were trained on imbalanced datasets using median imputation. To improve model reliability and address class imbalance, advanced ensemble techniques, such as soft-voting and stacking, were incorporated, along with cost-sensitive learning and stratified k-fold validation. The Synthetic Minority Oversampling Technique (SMOTE) was later applied to balance the dataset, and the models were reassessed. Evaluation based on metrics such as ROC-AUC, PR-AUC, sensitivity, specificity, balanced accuracy, and Brier score revealed that the stacking ensemble, applied to SMOTE-processed data, delivered superior performance with a ROC-AUC of 0.9979. These results show how ensemble approaches and data balancing strategies can improve the precision and dependability of sepsis prediction models used in clinical decision-making.
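The SMOTE balancing step named in the abstract can be illustrated with a minimal, standard-library-only sketch of the interpolation idea from Chawla et al.: each synthetic minority sample is placed at a random point on the line segment between a minority sample and one of its k nearest minority neighbours. This is not the authors' implementation. The function name `smote_oversample`, its parameters, and the toy two-feature records are assumptions made for illustration, and a real pipeline would more likely rely on an established library.

```python
import math
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Return n_new synthetic samples built by SMOTE-style interpolation.

    minority: list of feature vectors (lists of floats) from the rare class.
    k: number of nearest minority neighbours considered for each base sample.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours of the base sample
        # (Euclidean distance, excluding the base sample itself).
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: math.dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment base -> other, in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical toy data: two already-imputed features per record.
majority = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.2, 0.4], [0.4, 0.1]]
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]

# Create enough synthetic minority samples to match the majority class size.
synthetic = smote_oversample(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + synthetic
```

Because every synthetic record is a convex combination of two real minority records, it always lies within the range of the observed minority feature values, which is what distinguishes SMOTE from simple duplication of minority cases.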
Keywords:
ensemble learning, sepsis detection, stacking, SMOTE, ROC, AUC
License
Copyright (c) 2025 N. Smitha, R. Tanuja, S. H. Manjula

This work is licensed under a Creative Commons Attribution 4.0 International License.
