A Deep Learning–Based Framework for Dataset Creation and Sentiment Classification of English–Bengali Code-Mixed Texts
Received: 12 October 2025 | Revised: 1 November 2025 and 14 November 2025 | Accepted: 17 November 2025 | Online: 9 February 2026
Corresponding author: Dalia Barua
Abstract
Datasets are the foundation of most Natural Language Processing (NLP) tasks, such as sentiment analysis, summarization, and translation. The lack of large and diverse datasets is a critical problem for low-resource languages like Bengali. The problem is aggravated in code-mixed settings, where users intermix Bengali and English on web platforms such as social media, video comments, and e-commerce product reviews. Sentiment analysis of such Bengali–English code-mixed reviews is difficult because no high-quality, large-scale datasets are available. To address this challenge, this study introduces a novel synthetic English–Bengali code-mixed product review dataset, curated specifically for sentiment analysis in the Bangladeshi e-commerce domain. The dataset was generated through a new deep learning–driven pipeline that integrates part-of-speech tagging, translation, transliteration, and a two-tier ensemble-based sentiment annotation framework to ensure linguistic diversity and contextual realism in code-mixed expressions. The dataset, En–Bn–Code–Mixed–Two–Class–Sentiment–Dataset–, is available on the Hugging Face repository and contains 100,000 reviews, balanced between positive and negative sentiment. The quality and reliability of the generated dataset were validated through quantitative, linguistic, and statistical analyses, including evaluation metrics for translation, transliteration, and sentiment analysis. The proposed two-tier ensemble sentiment annotation approach, implemented with deep learning models, meta-learning models, and majority voting, achieved 93.00% accuracy, 92.16% precision, 94.00% recall, and a 93.07% F1-score. This work provides a useful resource for code-mixed Bengali–English NLP and establishes a scalable methodology for low-resource and multilingual text processing, opening the way for more comprehensive and inclusive sentiment analysis studies.
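As a rough illustration of the final majority-voting step and of how the reported metrics relate to each other, consider the minimal Python sketch below. It is not the authors' implementation: the function names `majority_vote` and `f1_from` and the toy predictions are assumptions made for illustration, and the abstract does not specify the individual models or how the two tiers are wired together.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by most voters.

    `labels` holds one predicted label per model (hypothetical voters).
    An odd number of voters avoids ties in the two-class setting.
    """
    return Counter(labels).most_common(1)[0][0]


def f1_from(precision, recall):
    """F1-score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Toy predictions from three hypothetical models for a single review.
votes = ["positive", "negative", "positive"]
final_label = majority_vote(votes)

# Consistency check using the precision and recall reported in the abstract.
f1_percent = round(f1_from(0.9216, 0.9400) * 100, 2)
print(final_label, f1_percent)
```

The harmonic mean of the reported precision (92.16%) and recall (94.00%) reproduces the reported F1-score of 93.07%, so the three figures are mutually consistent.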
Keywords:
code-mixed, sentiment analysis, low-resource, dataset, deep-learning, ensemble, majority voting
License
Copyright (c) 2025 Dalia Barua, Tarandeep Singh Walia

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
