A Modified SMOTE with Noise Filtering and Manhattan Distance Metric Approach to Address Imbalanced Health Datasets
Received: 5 May 2025 | Revised: 26 May 2025, 9 June 2025, 15 June 2025, and 18 June 2025 | Accepted: 21 June 2025 | Online: 29 June 2025
Corresponding author: Triyanna Widiyaningtyas
Abstract
Imbalanced class distribution remains a significant challenge in healthcare data analysis, particularly in disease-related datasets where minority classes representing critical conditions, such as diabetes, are severely underrepresented. This disproportionate representation often biases predictive models toward the majority class, reducing their sensitivity to minority classes and leading to suboptimal diagnostic accuracy, poor generalizability, and overfitting. The Synthetic Minority Oversampling Technique (SMOTE) is a frequently used method for addressing data imbalance. However, recent SMOTE variants remove only minority-class outliers as noise and overlook minority samples that lie close to the majority class, which should also be treated as noise. This research modified SMOTE with KNN-based filtering and a modified Manhattan distance metric to reduce the generation of noisy synthetic minority samples and to minimize class overlap. The proposed method, called NR-Modified SMOTE, balances the data in two stages: (i) filtering, in which minority samples close to the majority class (noise) are removed using the K-Nearest Neighbors (KNN) method, and (ii) SMOTE oversampling with the modified Manhattan distance metric. Experiments were carried out on two health datasets, Pima and Haberman, applying NR-Modified SMOTE followed by classification with Random Forest, SVM, and Naive Bayes under 10-fold cross-validation. The proposed method achieved higher accuracy for all classifiers than NR-SMOTE without the distance metric modification.
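To make the two-stage procedure concrete, the following is a minimal Python sketch of the idea described above, not the authors' NR-Modified SMOTE implementation: the function names `knn_noise_filter` and `manhattan_smote`, the neighborhood size `k`, and the `majority_ratio` threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def knn_noise_filter(X, y, minority_label, k=5, majority_ratio=0.6):
    """Drop minority samples whose k nearest neighbors are mostly majority-class."""
    nn = NearestNeighbors(n_neighbors=k + 1, metric="manhattan").fit(X)
    _, idx = nn.kneighbors(X)                      # idx[:, 0] is the sample itself
    keep = np.ones(len(X), dtype=bool)
    for i in np.where(y == minority_label)[0]:
        neighbor_labels = y[idx[i, 1:]]            # labels of the k nearest neighbors
        if np.mean(neighbor_labels != minority_label) >= majority_ratio:
            keep[i] = False                        # flagged as noise and removed
    return X[keep], y[keep]


def manhattan_smote(X, y, minority_label, n_synthetic, k=5, seed=None):
    """SMOTE-style interpolation using the Manhattan (L1) distance for neighbor search."""
    if n_synthetic <= 0:
        return X, y
    rng = np.random.default_rng(seed)
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1, metric="manhattan").fit(X_min)
    _, idx = nn.kneighbors(X_min)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))               # random minority seed sample
        j = idx[i, rng.integers(1, k + 1)]         # one of its k minority neighbors
        gap = rng.random()                         # interpolation factor in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    X_new = np.vstack([X, np.asarray(synthetic)])
    y_new = np.concatenate([y, np.full(n_synthetic, minority_label)])
    return X_new, y_new
```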
Keywords:
SMOTE modification, distance metric modification, filtering approach, noise reduction
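The experimental setup summarized in the abstract (10-fold cross-validation with Random Forest, SVM, and Naive Bayes) can be sketched as follows. This is a hedged illustration rather than the authors' exact protocol: it reuses the illustrative `knn_noise_filter` and `manhattan_smote` functions sketched above, applies resampling to the training folds only so that synthetic samples do not leak into the test folds, assumes `X` and `y` are NumPy arrays (e.g., loaded from the Pima dataset), and uses default classifier hyperparameters.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC


def evaluate(X, y, minority_label, n_splits=10, seed=42):
    """Mean 10-fold CV accuracy of three classifiers after resampling the training folds."""
    classifiers = {
        "Random Forest": RandomForestClassifier(random_state=seed),
        "SVM": SVC(random_state=seed),
        "Naive Bayes": GaussianNB(),
    }
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = {name: [] for name in classifiers}
    for train_idx, test_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        X_te, y_te = X[test_idx], y[test_idx]
        # Stage 1: remove noisy minority samples near the majority class.
        X_tr, y_tr = knn_noise_filter(X_tr, y_tr, minority_label)
        # Stage 2: oversample the minority class up to the majority-class count.
        deficit = int(np.sum(y_tr != minority_label) - np.sum(y_tr == minority_label))
        X_tr, y_tr = manhattan_smote(X_tr, y_tr, minority_label, deficit, seed=seed)
        for name, clf in classifiers.items():
            clf.fit(X_tr, y_tr)
            scores[name].append(accuracy_score(y_te, clf.predict(X_te)))
    return {name: float(np.mean(vals)) for name, vals in scores.items()}
```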
License
Copyright (c) 2025 Triyanna Widiyaningtyas, Hairani Hairani, Didik Dwi Prasetya, Utomo Pujianto, Wahyu Caesarendra

This work is licensed under a Creative Commons Attribution 4.0 International License.