Enhancing Hate Speech Detection in Low-Resource Code-Mixed Indonesian Tweets via GPT-Based Data Augmentation
Received: 27 August 2025 | Revised: 29 September 2025 | Accepted: 6 October 2025 | Online: 21 November 2025
Corresponding author: Endang Wahyu Pamungkas
Abstract
Automatic hate speech detection in low-resource, code-mixed settings, such as Indonesian social media, is challenging because annotated data are scarce and code-mixing introduces considerable linguistic variability. Yet the growing prevalence of hate speech on social media makes robust detection systems necessary. This study investigates the effectiveness of data augmentation strategies, specifically Generative Pretrained Transformer (GPT)-based paraphrasing and aggressive text transformation, in enhancing the performance of hate speech detection models for Indonesian code-mixed tweets. To assess the impact of these strategies, we employed traditional machine learning models, Recurrent Neural Network (RNN)-based models, and transformer-based models. Our findings reveal that GPT-generated data improve model performance, including for transformer-based models such as Indonesian Bidirectional Encoder Representations from Transformers (IndoBERT) and the Cross-lingual Language Model Robustly Optimized BERT Pretraining approach (XLM-RoBERTa).
Keywords:
hate speech detection, low-resource language, code-mixed language, data augmentation, large language model
License
Copyright (c) 2025 Endang Wahyu Pamungkas, Dian Purworini, Widi Widayat, Divi Galih Prasetyo Putri, Ikhlasul Amal

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
