Performance Analysis of Effective Retrieval of Kannada Translations in Code-Mixed Sentences using BERT and MPnet

H. P. Rohith; Lava Kumar; Sooda Kavitha; Rai B. Karunakara; K. P. Inchara

doi:10.48084/etasr.9013

Authors

H. P. Rohith, Department of ISE, Nitte Meenakshi Institute of Technology, Bangalore, India
Lava Kumar, Department of CSE, B.M.S. College of Engineering, Bangalore, India
Sooda Kavitha, Department of CSE, B.M.S. College of Engineering, Bangalore, India
Rai B. Karunakara, Department of Electronics & Communication, Nitte Meenakshi Institute of Technology, Bangalore, India
K. P. Inchara, Department of CSE, B.M.S. College of Engineering, Bangalore, India

Volume: 15 | Issue: 1 | Pages: 19109-19114 | February 2025 | https://doi.org/10.48084/etasr.9013

Received: 17 September 2024 | Revised: 8 October 2024 | Accepted: 16 November 2024 | Online: 2 February 2025

Corresponding author: Lava Kumar

Abstract

Translating Kannada-English (Kn-En) code-mixed text is challenging because Kannada language resources are scarce and the dataset is inherently complex. This study evaluates sentence transformer models built on the pre-trained multilingual MPNet and Bidirectional Encoder Representations from Transformers (BERT) architectures for generating sentence embeddings that improve translation accuracy. Each model encodes 2000 code-mixed sentences and their corresponding Kannada translations into high-dimensional embeddings, and cosine similarity then maps each input sentence to its closest translation. MPNet proved more effective, achieving an accuracy of 98% compared to BERT's 88%. MPNet also outperformed BERT on the Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores, attaining 85.0 and 80.0, respectively, whereas BERT scored 65.3 and 58.7. These results highlight the advanced capabilities of MPNet in translating code-mixed languages and its potential applicability to a broader range of multilingual Natural Language Processing (NLP) tasks.
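
The retrieval step described in the abstract (embed each sentence, then pick the Kannada translation with the highest cosine similarity) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the three-dimensional vectors, and the gold indices are hypothetical stand-ins, and in practice the embeddings would come from a multilingual sentence transformer encoder rather than being written by hand.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query: Sequence[float],
             candidates: List[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, score) of the candidate most similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def top1_accuracy(queries: List[Sequence[float]],
                  candidates: List[Sequence[float]],
                  gold: List[int]) -> float:
    """Fraction of queries whose best-scoring candidate is the gold one."""
    hits = sum(retrieve(q, candidates)[0] == g for q, g in zip(queries, gold))
    return hits / len(gold)


# Hypothetical toy embeddings: one row per sentence. Real sentence
# embeddings are high-dimensional and produced by the encoder model.
code_mixed_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
kannada_embeddings = [[1.0, 0.0, 0.1], [0.0, 1.0, 0.1], [0.1, 0.1, 1.0]]
gold_indices = [0, 1, 2]  # assumed: i-th query pairs with i-th translation

accuracy = top1_accuracy(code_mixed_embeddings, kannada_embeddings, gold_indices)
print(f"top-1 retrieval accuracy: {accuracy:.2f}")
```

On these toy vectors every query retrieves its own translation, so the computed accuracy is 1.0. Whether the accuracy figures reported in the abstract were computed as top-1 retrieval accuracy in this way is an assumption of the sketch.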

Keywords:

BERT, Kannada-English code-mix, MPNet, multilingual, natural language processing, translation
