Hybrid Statistical-Spectral Sparse Feature Selection with Optimization for Robust and Generalizable Lung Adenocarcinoma Classification
Received: 9 April 2025 | Revised: 28 April 2025 and 12 May 2025 | Accepted: 15 May 2025 | Online: 2 August 2025
Corresponding author: Sara Haddou Bouazza
Abstract
High dimensionality, redundant features, and poor cross-dataset generalization hinder Lung Adenocarcinoma (LUAD) classification using gene expression data. This study proposes Hybrid Statistical-Spectral Sparse Feature Selection with Optimization (HS3FS+), a novel framework that integrates Mutual Information (MI) and Kullback-Leibler (KL) divergence for feature ranking, Kernel Principal Component Analysis (KPCA) for nonlinear transformation, pathway-guided filtering for biological validation, and Genetic Algorithm (GA)-based optimization for feature selection. The framework was validated on four independent datasets: The Cancer Genome Atlas (TCGA)-LUAD, Gene Expression Omnibus (GEO) datasets GSE19188 and GSE37745, and TCGA-Lung Squamous Cell Carcinoma (TCGA-LUSC), ensuring robust cross-platform evaluation. HS3FS+ achieved classification accuracies of 98.3% on TCGA-LUAD, 97.1% on GSE19188, 96.0% on GSE37745, and 94.8% on TCGA-LUSC. The selected gene signatures exhibit strong concordance with established LUAD biomarkers, supporting both biological relevance and model interpretability. Additionally, the method demonstrated a fivefold reduction in computational time compared to Deep Learning (DL)-based feature selection approaches. These findings confirm HS3FS+ as a robust, interpretable, and scalable solution for LUAD classification, with potential applications in biomarker discovery and precision oncology.
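The MI and KL divergence ranking stage named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the equal-width binning, the smoothing constant `eps`, the binary labels 0 and 1, and the way the two scores are summed are all assumptions made for illustration, as are the function names and the toy data.

```python
import math
from collections import Counter

def discretize(values, bins=4):
    """Map real values to equal-width bin indices 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for constant features
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(xb, y):
    """MI (in nats) between a discretized feature and the class labels."""
    n = len(y)
    pxy, px, py = Counter(zip(xb, y)), Counter(xb), Counter(y)
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def class_histogram(xb, y, label, bins, eps=1e-6):
    """Smoothed histogram of a feature's bins within one class.

    eps keeps every bin non-zero so the KL terms stay finite; the
    resulting KL values depend on this (assumed) constant.
    """
    counts = [eps] * bins
    for b, c in zip(xb, y):
        if c == label:
            counts[b] += 1
    total = sum(counts)
    return [c / total for c in counts]

def rank_features(X, y, bins=4):
    """Return feature indices sorted by MI + symmetric KL, best first.

    Summing the two scores is an assumption; the abstract does not
    state how MI and KL divergence are combined.
    """
    scores = []
    for j in range(len(X[0])):
        xb = discretize([row[j] for row in X], bins)
        p = class_histogram(xb, y, 0, bins)
        q = class_histogram(xb, y, 1, bins)
        sym_kl = 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
        scores.append((mutual_information(xb, y) + sym_kl, j))
    return [j for _, j in sorted(scores, reverse=True)]

# Toy data (hypothetical): feature 0 separates the classes, feature 1 does not.
X = [[0.10, 5.0], [0.20, 1.0], [0.15, 4.0],
     [0.90, 1.5], [1.00, 4.5], [0.95, 2.0]]
y = [0, 0, 0, 1, 1, 1]
ranking = rank_features(X, y)
print(ranking)
```

In a full pipeline of the kind the abstract describes, the top-ranked features would then be passed to the KPCA, pathway-filtering, and GA stages, none of which are sketched here.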
Keywords:
machine learning, cancer classification, data mining, pattern recognition, feature selection
License
Copyright (c) 2025 Sara Haddou Bouazza

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
