Integrative Transcriptomic Analysis of Breast Cancer Subtypes Using Consensus Gene Expression Modeling and Biological Pathway Decoding

Authors

  • Garima Shukla, Department of Computer Science and Engineering, Amity School of Engineering and Technology, Amity University Mumbai, Maharashtra, India
  • Vanshaj Awasthi, Department of Computer Science and Engineering, Amity School of Engineering and Technology, Amity University Mumbai, Maharashtra, India
  • Sakshi Nipane, Department of Computer Science and Engineering, Amity School of Engineering and Technology, Amity University Mumbai, Maharashtra, India
  • Tanisha Hedaoo, Department of Computer Science and Engineering, Amity School of Engineering and Technology, Amity University Mumbai, Maharashtra, India
  • Dipak Raskar, Department of Computer Science and Engineering, Amity School of Engineering and Technology, Amity University Mumbai, Maharashtra, India
  • Balamurugan Balusamy, School of Engineering and IT, Manipal Academy of Higher Education, Dubai, United Arab Emirates
  • Sumendra Yogarayan, Faculty of Information Science and Technology (FIST), Multimedia University (MMU), Jalan Ayer Keroh Lama, Melaka, Malaysia
Volume: 16 | Issue: 3 | Pages: 36738-36746 | June 2026 | https://doi.org/10.48084/etasr.17000

Abstract

Breast cancer is among the three leading causes of cancer-related mortality in women, which makes accurate molecular subtyping important for personalized treatment. Despite major advances in transcriptomics, gene expression data remain difficult to analyze because of their high dimensionality and biological variability. This study proposes a reproducible computational framework for robust diagnosis of breast cancer subtypes based on gene expression profiles, using a subset of the GSE45827 dataset comprising 120 tumor and normal tissue samples that represent six molecular subtypes. The proposed pipeline incorporated rigorous preprocessing, consensus-based feature selection, and comprehensive benchmarking across classical machine learning, ensemble learning, and deep learning models. It further combined feature selection with Shapley Additive Explanations (SHAP)-based importance analysis, Boruta, Tabular Network (TabNet) attention masks, and stability selection to identify biologically relevant and reproducible biomarkers. To minimize information leakage, nested cross-validation was employed throughout model development, and external validation was conducted on the Molecular Taxonomy of Breast Cancer International Consortium (METABRIC) cohort to evaluate generalizability to an independent dataset. Among the evaluated approaches, the stacking ensemble classifier achieved the best overall performance, with mean accuracy and macro-F1 scores of 96.7% and 96.8%, respectively, on the GSE45827 dataset, and 94.2% and 94.0% on the METABRIC cohort. These results surpassed those obtained using TabNet, Autoencoder + Logistic Regression (AE+LR) pipelines, and Prediction Analysis of Microarray 50 (PAM50) baseline models. Moreover, the interpretability-based analysis identified (i) several biologically significant genes, including Breast Cancer 1 (BRCA1), Tumor Protein p53 (TP53), and Phosphatidylinositol-4,5-Bisphosphate 3-Kinase Catalytic Subunit Alpha (PIK3CA), as key contributors to subtype discrimination, and (ii) the pathways in which these genes are involved, such as Phosphatidylinositol 3-kinase - Ak strain transforming (PI3K-Akt) signaling and Deoxyribonucleic Acid (DNA) repair. Overall, the proposed framework provides a reproducible, interpretable, and high-performing methodology for reliable feature identification in gene expression data.
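
The leakage-avoiding evaluation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the data are synthetic (120 samples, six classes, 200 features) rather than GSE45827, a univariate SelectKBest filter stands in for the consensus feature selection, and the base learners, grid values, and the names make_toy_data, build_pipeline, and nested_cv_scores are assumptions chosen for illustration. The point it shows is that scaling and feature selection sit inside the pipeline, so they are refit on each training fold, while an inner loop tunes the number of features and an outer loop reports accuracy and macro-F1.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def make_toy_data(seed=0):
    """Synthetic stand-in for an expression matrix (assumption, not GSE45827)."""
    return make_classification(
        n_samples=120, n_features=200, n_informative=10, n_redundant=0,
        n_classes=6, n_clusters_per_class=1, class_sep=2.0, random_state=seed,
    )


def build_pipeline(seed=0):
    """Scaling, univariate feature selection, then a stacking ensemble."""
    stack = StackingClassifier(
        estimators=[
            ("svm", SVC(kernel="linear")),
            ("rf", RandomForestClassifier(n_estimators=50, random_state=seed)),
        ],
        final_estimator=LogisticRegression(max_iter=1000),
        cv=3,  # out-of-fold base predictions feed the meta-learner
    )
    return Pipeline([
        ("scale", StandardScaler()),
        ("select", SelectKBest(f_classif)),  # placeholder for consensus selection
        ("stack", stack),
    ])


def nested_cv_scores(X, y, seed=0):
    """Inner loop tunes the number of features; outer loop estimates performance."""
    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    search = GridSearchCV(
        build_pipeline(seed),
        param_grid={"select__k": [20, 50]},
        scoring="f1_macro",
        cv=inner,
    )
    # Each outer fold refits the whole search, so test folds never influence
    # scaling, feature selection, or hyperparameter choice.
    return cross_validate(search, X, y, cv=outer, scoring=["accuracy", "f1_macro"])


if __name__ == "__main__":
    X, y = make_toy_data()
    scores = nested_cv_scores(X, y)
    print("accuracy per outer fold:", scores["test_accuracy"])
    print("macro-F1 per outer fold:", scores["test_f1_macro"])
```

Running the feature selection on the full dataset before splitting would let test samples influence which features are kept, which is the kind of leakage nested cross-validation is meant to prevent. Scores from this synthetic example say nothing about the figures reported in the abstract.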

Keywords:

breast cancer subtypes, consensus feature selection, gene expression profiling, multi-omic integration, pathway enrichment analysis, SHAP-based interpretability, transcriptomic classification

How to Cite

G. Shukla et al., “Integrative Transcriptomic Analysis of Breast Cancer Subtypes Using Consensus Gene Expression Modeling and Biological Pathway Decoding”, Eng. Technol. Appl. Sci. Res., vol. 16, no. 3, pp. 36738–36746, Jun. 2026.
