Serialization-Induced Prediction Drift
Received: 18 February 2026 | Revised: 6 March 2026, 22 March 2026, and 2 April 2026 | Accepted: 10 April 2026 | Online: 6 June 2026
Corresponding author: Khudran M. Alzhrani
Abstract
Tabular Machine Learning (ML) workflows often export and reload numeric features through formats such as CSV and Parquet, sometimes rounding values or casting between floating-point precisions (e.g., float64 to float32). Although commonly treated as engineering details, these steps can introduce systematic numerical perturbations that propagate into model predictions. This study presents a methodology to quantify how routine data-representation changes affect prediction drift and performance. Starting from a float64 Parquet baseline, CSV round-trip variants with 6, 3, and 1 decimal places and a float32 Parquet variant are generated. Fixed train-validation-test splits are reused across treatments, and two scenarios are evaluated: train-on-variant (the model is trained and tested on the perturbed data) and evaluation-only (the model is trained on the baseline and tested on the perturbed data). Value-level drift, prediction drift (score drift, rank correlation, and classification churn), and performance deltas are measured, and the results are aggregated across three random seeds with bootstrap confidence intervals and Wilcoxon signed-rank tests. Experiments on the Breast Cancer Wisconsin (Diagnostic) classification dataset and the Diabetes and California Housing regression datasets, using multiple model families, show that mild perturbations (CSV with 6 or 3 decimal places, and float32) generally yield negligible drift and no meaningful performance change, whereas rounding to 1 decimal place triggers a sharp onset of instability, including threshold-crossing effects in classification and marked drift amplification in the most sensitive regression settings. Sensitivity varies by model family under aggressive rounding, and an additional analysis of representative linear models shows that 1-decimal rounding perturbs the internal linear score and can also change the coefficient structure learned during retraining.
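As a rough illustration of the evaluation-only scenario described above, the following sketch builds CSV round-trip variants at 6, 3, and 1 decimal places plus a float32 variant, then measures value-level drift, mean score drift, and classification churn against a fixed model. It is not the study's code: the feature rows, the logistic weights and bias, and all function names are hypothetical, and the float32 cast is emulated with `struct` packing rather than written to Parquet, so that the example needs only the Python standard library.

```python
import csv
import io
import math
import struct

# Hypothetical float64 baseline features (not data from the study).
BASELINE = [[0.54, 0.46], [2.123456789, 0.5], [0.1, 1.987654321]]
# Hypothetical fixed "baseline-trained" logistic model.
WEIGHTS, BIAS = [1.0, -1.0], -0.05


def csv_round_trip(rows, decimals):
    """Write rows to an in-memory CSV with fixed decimals, then parse them back."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for row in rows:
        writer.writerow([f"{v:.{decimals}f}" for v in row])
    buf.seek(0)
    return [[float(cell) for cell in row] for row in csv.reader(buf)]


def float32_cast(rows):
    """Emulate a float64 -> float32 -> float64 cast via IEEE 754 packing."""
    return [[struct.unpack("<f", struct.pack("<f", v))[0] for v in row]
            for row in rows]


def max_abs_drift(baseline, variant):
    """Largest absolute difference between corresponding feature values."""
    return max(abs(a - b)
               for rb, rv in zip(baseline, variant)
               for a, b in zip(rb, rv))


def predict_scores(rows, weights, bias):
    """Logistic scores from a linear model applied to each row."""
    scores = []
    for row in rows:
        z = sum(w * x for w, x in zip(weights, row)) + bias
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores


def mean_abs_score_drift(base_scores, variant_scores):
    """Average absolute change in predicted score."""
    diffs = [abs(a - b) for a, b in zip(base_scores, variant_scores)]
    return sum(diffs) / len(diffs)


def churn(base_scores, variant_scores, threshold=0.5):
    """Fraction of rows whose predicted class flips at the threshold."""
    flips = sum((a >= threshold) != (b >= threshold)
                for a, b in zip(base_scores, variant_scores))
    return flips / len(base_scores)


variants = {
    "csv_6": csv_round_trip(BASELINE, 6),
    "csv_3": csv_round_trip(BASELINE, 3),
    "csv_1": csv_round_trip(BASELINE, 1),
    "float32": float32_cast(BASELINE),
}

base_scores = predict_scores(BASELINE, WEIGHTS, BIAS)
report = {}
for name, rows in variants.items():
    scores = predict_scores(rows, WEIGHTS, BIAS)
    report[name] = {
        "value_drift": max_abs_drift(BASELINE, rows),
        "score_drift": mean_abs_score_drift(base_scores, scores),
        "churn": churn(base_scores, scores),
    }

for name, metrics in report.items():
    print(name, metrics)
```

In this toy setup the first row sits close to the decision boundary, so only the 1-decimal variant flips its predicted class, while the milder variants leave all labels unchanged. Rank correlation, bootstrap confidence intervals, and the Wilcoxon signed-rank test mentioned in the abstract are omitted here for brevity.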
Keywords:
prediction drift, data serialization, numerical precision, machine learning pipelines, tabular data
License
Copyright (c) 2026 Khudran M. Alzhrani

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
