Comparison of Multiple Regression and Model Averaging Model-Building Approach for Missing Data with Multiple Imputation
Received: 3 September 2024 | Revised: 5 October 2024 and 14 October 2024 | Accepted: 16 October 2024 | Online: 5 November 2024
Corresponding author: Oyebayo Ridwan Olaniran
Abstract
Model construction is central to extracting information from datasets and predicting responses from predictor variables. This study compares the Multiple Regression (MR) and model averaging approaches in the context of missing data and validates the effectiveness of the Multiple Imputation (MI) method used to handle the missing values. Results obtained from the multiple-imputed data were compared with those derived from the Complete Case (CC) data, using a diabetes dataset from Hospital Besar Alor Setar. Before MI and model building, k-fold cross-validation was used to partition the dataset: 90% of the data, with incomplete covariates, formed the training set, and 10% of the data, with complete covariates, formed the test set. MI was then applied to the 90% training dataset. Model M115, derived from the multiple-imputed data, was identified as the optimal model for MR. In the model averaging approach, two models were identified as optimal: Model 1 (without interaction variables) and Model 2 (with interaction variables). Model 1 exhibited the lowest values of Mean Square Error (MSE), Root Mean Square Error (RMSE), and Mean Absolute Error (MAE). These results indicate that model averaging, specifically Model 1, is the superior model-building approach for this study, outperforming MR and supporting the effectiveness of the MI method.
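The comparison described above rests on three error measures computed on the held-out test set. The following is a minimal sketch of those measures together with a simple form of model averaging, not the authors' implementation. It assumes equal-weight averaging of predictions from two candidate models; the abstract does not state the weighting scheme used in the study. The function names, the two prediction lists, and the test responses are all hypothetical.

```python
import math

def mse(y_true, y_pred):
    """Mean Square Error: average of squared prediction errors."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Square Error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean Absolute Error: average of absolute prediction errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def average_predictions(predictions, weights=None):
    """Combine per-model predictions into one weighted average per observation.

    Assumption: equal weights when none are given. Weights based on
    information criteria are another common choice, not shown here.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    norm = [w / total for w in weights]
    n_obs = len(predictions[0])
    return [
        sum(w * model[i] for w, model in zip(norm, predictions))
        for i in range(n_obs)
    ]

# Hypothetical test responses and predictions from two candidate models.
y_test = [5.0, 6.0, 7.0, 8.0]
pred_a = [4.0, 6.0, 8.0, 8.0]
pred_b = [6.0, 6.0, 6.0, 9.0]

avg = average_predictions([pred_a, pred_b])

for name, pred in [("model A", pred_a), ("model B", pred_b), ("averaged", avg)]:
    print(name, mse(y_test, pred), rmse(y_test, pred), mae(y_test, pred))
```

In this toy data the two models err in opposite directions on some observations, so the averaged prediction has lower error than either model alone. That is an artifact of the made-up numbers and says nothing about the study's results.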
Keywords:
statistical modeling, regression analysis, model averaging, missing data, multiple imputation
License
Copyright (c) 2024 Mohd Asrul Affendi Abdullah, Lai Jesintha, Gopal Pillay Khuneswari, Siti Afiqah Muhamad Jamil, Oyebayo Ridwan Olaniran
This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.