An Efficient and Interpretable Machine Learning Model for Classifying Breast Cancer Subtypes Using Gene Expression Profiles

Authors

  • Tareque Mohmud Chowdhury, Computer Science and Engineering Department, Islamic University of Technology, Dhaka, Bangladesh
  • Abu Raihan Mostofa Kamal, Computer Science and Engineering Department, Islamic University of Technology, Dhaka, Bangladesh
Volume: 15 | Issue: 4 | Pages: 24196-24203 | August 2025 | https://doi.org/10.48084/etasr.11179

Abstract

Breast Cancer (BRCA) is a complex and heterogeneous disease, and this heterogeneity affects the gene expression patterns and molecular activity of its subtypes in different ways. Identifying the BRCA subtype is critical for prognosis and treatment decisions. Advances in transcriptomic profiling and Machine Learning (ML) have enabled more accurate classification of BRCA subtypes, yet most classification models lack interpretability, which limits their clinical applicability. This study proposes an interpretable ML framework for classifying BRCA subtypes from high-dimensional RNA-sequencing data. The framework was evaluated on a publicly available transcriptomic dataset from The Cancer Genome Atlas (TCGA) by applying dimensionality reduction techniques and tuning the ML models through grid search. Shapley Additive Explanations (SHAP) values are used to identify the transcriptomic markers that contribute most to subtype classification, which provides insight into the gene sets associated with the molecular mechanisms of each subtype. The experimental results show that the proposed method outperforms existing works in accuracy, precision, F1-score, and interpretability. Finally, gene set enrichment analysis highlights key pathways associated with BRCA and its subtypes.
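
To make the interpretability step concrete, the sketch below computes exact Shapley values for a single sample by enumerating every feature coalition, which is the quantity that SHAP approximates efficiently for real models. It is an illustration only and not the authors' implementation: the scoring function `toy_model`, its `WEIGHTS`, the `sample`, and the `baseline` are all hypothetical, and brute-force enumeration is feasible only for a handful of features, not for a full RNA-sequencing profile.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values for one sample by enumerating feature coalitions.

    Features outside a coalition are replaced by their baseline value.
    """
    n = len(x)
    features = range(n)

    def value(coalition):
        # Coalition features come from x, all others from the baseline.
        mixed = [x[i] if i in coalition else baseline[i] for i in features]
        return model(mixed)

    phi = [0.0] * n
    for i in features:
        others = [j for j in features if j != i]
        for size in range(n):
            # Shapley weight for coalitions of this size (excluding feature i).
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                # Marginal contribution of feature i to this coalition.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical linear scorer over three made-up expression features.
WEIGHTS = [2.0, -1.0, 0.25]


def toy_model(features):
    return sum(w * v for w, v in zip(WEIGHTS, features)) + 0.1


sample = [1.5, 0.2, 3.0]    # hypothetical expression values for one sample
baseline = [1.0, 1.0, 1.0]  # hypothetical reference (e.g. cohort mean)

phi = shapley_values(toy_model, sample, baseline)

# Rank features by absolute contribution, most influential first.
ranked = sorted(range(len(phi)), key=lambda i: abs(phi[i]), reverse=True)
print(phi, ranked)
```

For a linear scorer, each Shapley value reduces to the weight times the feature's deviation from the baseline, and the values sum to the difference between the model output for the sample and for the baseline. Ranking features by absolute Shapley value is the idea behind selecting important transcriptomic markers, although in practice a library approximation would be used in place of full enumeration.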

Keywords:

Breast Cancer (BRCA), BRCA subtypes, interpretable AI, ML, subtype classification

How to Cite

T. M. Chowdhury and A. R. M. Kamal, “An Efficient and Interpretable Machine Learning Model for Classifying Breast Cancer Subtypes Using Gene Expression Profiles”, Eng. Technol. Appl. Sci. Res., vol. 15, no. 4, pp. 24196–24203, Aug. 2025.
