Performance Analysis of Effective Retrieval of Kannada Translations in Code-Mixed Sentences using BERT and MPnet
Received: 17 September 2024 | Revised: 8 October 2024 | Accepted: 16 November 2024 | Online: 2 February 2025
Corresponding author: Lava Kumar
Abstract
Translating Kannada-English (Kn-En) code-mixed text is challenging because Kannada language resources are scarce and code-mixed datasets are inherently complex. This study evaluates sentence transformer models built on pre-trained multilingual MPNet and Bidirectional Encoder Representations from Transformers (BERT) architectures for generating sentence embeddings that improve translation accuracy. The approach encodes 2000 code-mixed sentences and their corresponding Kannada translations into high-dimensional embeddings with each model, then uses cosine similarity to map every input sentence to its closest translation. MPNet proved more effective, achieving an accuracy of 98% compared with BERT's 88%. MPNet also outperformed BERT on Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores, attaining 85.0 and 80.0, respectively, against BERT's 65.3 and 58.7. These results highlight MPNet's capabilities in translating code-mixed languages and its potential applicability to a broader range of multilingual Natural Language Processing (NLP) tasks.
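The retrieval step described in the abstract can be illustrated with a minimal sketch, assuming the sentence embeddings have already been computed. The three-dimensional vectors, the placeholder translation strings, and the function names below are hypothetical and chosen only for illustration; they are not data or code from the study, where the vectors would instead come from an MPNet or BERT sentence encoder.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as dissimilar
    return dot / (norm_u * norm_v)


def retrieve_translation(query_embedding, translation_embeddings, translations):
    """Pick the stored translation whose embedding is closest to the query."""
    scores = [cosine_similarity(query_embedding, e) for e in translation_embeddings]
    best = max(range(len(scores)), key=scores.__getitem__)
    return translations[best], scores[best]


# Hypothetical toy data standing in for encoder output (assumption, not study data).
translations = [
    "kannada_translation_0",
    "kannada_translation_1",
    "kannada_translation_2",
]
translation_embeddings = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
query_embedding = [0.1, 0.9, 0.0]  # embedding of a code-mixed input sentence

best_translation, best_score = retrieve_translation(
    query_embedding, translation_embeddings, translations
)
print(best_translation, round(best_score, 4))
```

Cosine similarity compares only the direction of the vectors, so the ranking does not depend on embedding magnitude; for a corpus the size of the one in the study, a linear scan like this one is likely sufficient.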
Keywords:
BERT, Kannada-English code-mix, MPNet, multilingual, natural language processing, translation
License
Copyright (c) 2024 H. P. Rohith, Lava Kumar, Sooda Kavitha, Rai B. Karunakara, K. P. Inchara

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.