Enhancing Hate Speech Detection in Low-Resource Code-Mixed Indonesian Tweets via GPT-Based Data Augmentation

Endang Wahyu Pamungkas; Dian Purworini; Widi Widayat; Divi Galih Prasetyo Putri; Ikhlasul Amal

doi:10.48084/etasr.14342

Authors

Endang Wahyu Pamungkas, Department of Informatics Engineering, Universitas Muhammadiyah Surakarta, Surakarta, Indonesia | Social Informatics Research Center, Universitas Muhammadiyah Surakarta, Surakarta, Indonesia
Dian Purworini, Department of Communication Science, Universitas Muhammadiyah Surakarta, Surakarta, Indonesia | Social Informatics Research Center, Universitas Muhammadiyah Surakarta, Surakarta, Indonesia
Widi Widayat, Department of Informatics Engineering, Universitas Muhammadiyah Surakarta, Surakarta, Indonesia | Social Informatics Research Center, Universitas Muhammadiyah Surakarta, Surakarta, Indonesia
Divi Galih Prasetyo Putri, Department of Electrical Engineering and Informatics, Vocational College, Universitas Gadjah Mada, Yogyakarta, Indonesia
Ikhlasul Amal, Department of Artificial Intelligence, Universitas Gadjah Mada, Yogyakarta, Indonesia

Volume: 15 | Issue: 6 | Pages: 30649-30656 | December 2025 | https://doi.org/10.48084/etasr.14342

Received: 27 August 2025 | Revised: 29 September 2025 | Accepted: 6 October 2025 | Online: 21 November 2025

Corresponding author: Endang Wahyu Pamungkas

Abstract

Automatic hate speech detection in low-resource, code-mixed settings, such as Indonesian social media, presents significant challenges due to the scarcity of annotated data and the linguistic variability introduced by code-mixing. Yet the growing prevalence of hate speech on social media makes robust detection systems necessary. This study investigates the effectiveness of data augmentation strategies, specifically Generative Pretrained Transformer (GPT)-based paraphrasing and aggressive text transformation, in enhancing the performance of hate speech detection models for Indonesian code-mixed tweets. To assess the impact of these augmentation strategies, we employed traditional machine learning models, Recurrent Neural Network (RNN)-based models, and transformer-based models. Our findings reveal that GPT-generated data improve model performance, with transformer-based models, including Indonesian Bidirectional Encoder Representations from Transformers (IndoBERT) and the Cross-lingual Language Model Robustly Optimized BERT Pretraining approach (XLM-RoBERTa).

Keywords:

hate speech detection, low-resource language, code-mixed language, data augmentation, large language model
