Integrative Transcriptomic Analysis of Breast Cancer Subtypes Using Consensus Gene Expression Modeling and Biological Pathway Decoding
Received: 17 December 2025 | Revised: 21 January 2026 and 13 March 2026 | Accepted: 15 March 2026 | Online: 23 May 2026
Corresponding author: Sumendra Yogarayan
Abstract
Breast cancer is among the three leading causes of cancer-related mortality in women, which highlights the importance of accurate molecular subtyping for personalized treatment. Despite major advances in transcriptomics, the analysis of gene expression data remains challenging due to high dimensionality and biological variability. This study proposes a reproducible computational framework for robust diagnosis of breast cancer subtypes based on gene expression profiles, using a subset of the GSE45827 dataset comprising 120 tumor and normal tissue samples that represent six molecular subtypes. The proposed pipeline incorporated rigorous preprocessing; consensus-based feature selection; comprehensive benchmarking across classical machine learning, ensemble learning, and deep learning models; and feature selection combining Shapley Additive Explanations (SHAP)-based importance analysis, Boruta, Tabular Network (TabNet) attention masks, and stability selection to identify biologically relevant and reproducible biomarkers. To minimize information leakage, nested cross-validation was employed throughout model development, and external validation on the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) cohort was used to evaluate generalizability to an independent dataset. Among the evaluated approaches, the stacking ensemble classifier achieved the best overall performance, reaching mean accuracy and macro-F1 scores of 96.7% and 96.8%, respectively, on the GSE45827 dataset, and 94.2% and 94.0% on the METABRIC cohort. These results surpassed those obtained using TabNet, Autoencoder + Logistic Regression (AE+LR) pipelines, and Prediction Analysis of Microarray 50 (PAM50) baseline models.
Moreover, the interpretability-based analysis identified i) several biologically significant genes, including Breast Cancer 1 (BRCA1), Tumor Protein p53 (TP53), and Phosphatidylinositol-4,5-Bisphosphate 3-Kinase Catalytic Subunit Alpha (PIK3CA), as key contributors to subtype discrimination, and ii) the pathways in which these genes are involved, such as Phosphatidylinositol 3-Kinase/Ak strain transforming (PI3K-Akt) signaling and Deoxyribonucleic Acid (DNA) repair. Overall, the proposed framework provides a reproducible, interpretable, and high-performing methodology for reliable feature identification in gene expression data.
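The abstract does not state how agreement across the feature selectors is turned into a consensus set. As a minimal sketch, assuming a simple vote-counting rule (a gene is kept when at least `min_votes` selectors choose it), the idea could look like the following. The function name `consensus_features`, the `min_votes` threshold, and the per-selector gene lists are all hypothetical and serve only as illustration; they are not results or code from the study.

```python
from collections import Counter


def consensus_features(selections, min_votes):
    """Keep features chosen by at least `min_votes` selectors.

    `selections` maps a selector name to the list of features it chose.
    Returns the consensus list (most votes first, ties broken by name)
    and the full vote count per feature.
    """
    votes = Counter()
    for chosen in selections.values():
        # A set ensures each selector contributes at most one vote per gene.
        votes.update(set(chosen))
    kept = [gene for gene, count in votes.items() if count >= min_votes]
    kept.sort(key=lambda gene: (-votes[gene], gene))
    return kept, votes


# Hypothetical selector outputs: placeholder lists, not findings of the paper.
selections = {
    "shap": ["BRCA1", "TP53", "PIK3CA", "GENE_A"],
    "boruta": ["BRCA1", "TP53", "GENE_B"],
    "tabnet": ["TP53", "PIK3CA", "BRCA1", "GENE_C"],
    "stability": ["BRCA1", "PIK3CA", "TP53"],
}

consensus, votes = consensus_features(selections, min_votes=3)
print(consensus)
```

To stay consistent with the leakage control described above, a rule like this would have to be applied inside each outer training fold of the nested cross-validation rather than once on the full dataset.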
Keywords:
breast cancer subtypes, consensus feature selection, gene expression profiling, multi-omic integration, pathway enrichment analysis, SHAP-based interpretability, transcriptomic classification
License
Copyright (c) 2026 Garima Shukla, Vanshaj Awasthi, Sakshi Nipane, Tanisha Hedaoo, Dipak Raskar, Balamurugan Balusamy, Sumendra Yogarayan

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
