
Robust and Efficient Indonesian Span-Based Named Entity Recognition via Compact GLiNER

Towards Enhanced Retrieval-Augmented Generation

Authors

  • Mukhlish Fuadi Department of Electrical Engineering, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia https://orcid.org/0000-0002-7595-7646
  • Adhi Dharma Wibawa Department of Electrical Engineering, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia | Department of Medical Technology, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
  • Surya Sumpeno Department of Electrical Engineering, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia | Department of Computer Engineering, Institut Teknologi Sepuluh Nopember, Surabaya, Indonesia
Volume: 16 | Issue: 3 | Pages: 36225-36232 | June 2026 | https://doi.org/10.48084/etasr.18482

Abstract

Efficient Named Entity Recognition (NER) integration is crucial for improving retrieval precision and fact verification in Retrieval-Augmented Generation (RAG) systems. However, Indonesian NER development still faces challenges, including dataset heterogeneity, inefficient tokenization, and trade-offs between NER performance and inference efficiency. Although span-based frameworks such as Generalist and Lightweight Named Entity Recognition (GLiNER) offer greater flexibility than conventional Beginning-Inside-Outside (BIO) approaches, available GLiNER models are generally multilingual and not optimized for Indonesian. This research proposes a compact GLiNER model tailored to Indonesian, using a pruned mDeBERTa with a 30k vocabulary as the encoder backbone. We built a large-scale Indonesian GLiNER training corpus by combining heterogeneous NER datasets and augmenting them with controlled translations, yielding 56,210 training, 6,707 validation, and 9,411 test samples. Experimental results show that the proposed model achieves an F1 score of 76.58%, surpassing the IndoBERT-based GLiNER baseline (74.70%) at max_len = 192 tokens, with consistent improvements on major entity types. Deployment-oriented evaluation shows up to 8× faster CPU inference (482 ms vs. 3,897 ms per sample). In long-context evaluation (max_len = 384), the proposed model outperforms multilingual GLiNER by more than 12 F1 points while maintaining a significantly lower memory footprint on both GPU and CPU. These advantages validate the model's potential as a reliable metadata-filter component for RAG architectures.
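The span-based decoding that the abstract contrasts with BIO tagging can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: a real GLiNER-style model scores every (span, entity-type) pair with a Transformer encoder, whereas here the `TOY_SCORES` lookup stands in for that learned scorer, and the names `predict_entities`, `max_width`, and `threshold` are chosen for illustration. Only the structure is meaningful: enumerate candidate spans up to a width limit, score each against the requested labels, then greedily keep non-overlapping high scorers.

```python
# Illustrative sketch of span-based NER decoding in the GLiNER style.
# TOY_SCORES is a placeholder for the model's learned span-label scorer.
TOY_SCORES = {
    ("institut teknologi sepuluh nopember", "organization"): 0.93,
    ("sepuluh nopember", "organization"): 0.55,
    ("surabaya", "location"): 0.88,
}

def score(span_text, label):
    """Stand-in for the Transformer's span-vs-label similarity score."""
    return TOY_SCORES.get((span_text.lower(), label), 0.0)

def predict_entities(tokens, labels, max_width=4, threshold=0.6):
    """Enumerate spans up to max_width tokens, score each against every
    candidate label, then greedily keep non-overlapping high scorers."""
    candidates = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + max_width, len(tokens)) + 1):
            text = " ".join(tokens[i:j])
            for label in labels:
                s = score(text, label)
                if s >= threshold:
                    candidates.append((s, i, j, text, label))
    candidates.sort(reverse=True)          # highest score first
    chosen, used = [], set()
    for s, i, j, text, label in candidates:
        if not used & set(range(i, j)):    # skip spans overlapping a winner
            chosen.append({"text": text, "label": label, "score": s})
            used |= set(range(i, j))
    return chosen

tokens = "Institut Teknologi Sepuluh Nopember terletak di Surabaya".split()
ents = predict_entities(tokens, ["organization", "location"])
print([(e["text"], e["label"]) for e in ents])
# [('Institut Teknologi Sepuluh Nopember', 'organization'), ('Surabaya', 'location')]
```

Note how the labels are free-form strings supplied at inference time rather than a fixed tag set baked into the output layer; this is the flexibility that distinguishes span-based prompting from conventional BIO classification heads.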

Keywords:

Named Entity Recognition (NER), GLiNER, span-based NER, Retrieval-Augmented Generation (RAG), Transformer models, efficient NLP


References

Z. Wang, H. Chen, G. Xu, and M. Ren, "A novel large-language-model-driven framework for named entity recognition," Information Processing & Management, vol. 62, no. 3, May 2025, Art. no. 104054.

Y. Huang and J. X. Huang, "A Survey on Retrieval-Augmented Text Generation for Large Language Models," ACM Computing Surveys, Apr. 2026.

T. Fan, J. Wang, X. Ren, and C. Huang, "MiniRAG: Towards Extremely Simple Retrieval-Augmented Generation." arXiv, Jan. 26, 2025.

I. Budi and R. R. Suryono, "Application of named entity recognition method for Indonesian datasets: a review," Bulletin of Electrical Engineering and Informatics, vol. 12, no. 2, pp. 969–978, Apr. 2023.

N. Kaur, A. Saha, M. Swami, M. Singh, and R. Dalal, "BERT-NER: A Transformer-Based Approach for Named Entity Recognition," in 2024 15th International Conference on Computing Communication and Networking Technologies, Kamand, India, 2024, pp. 1–7.

T. E. Moussaoui, C. Loqman, and J. Boumhidi, "Exploring the Impact of Annotation Schemes on Arabic Named Entity Recognition across General and Specific Domains," Engineering, Technology & Applied Science Research, vol. 15, no. 2, pp. 21918–21924, Apr. 2025.

W. L. Seow, I. Chaturvedi, A. Hogarth, R. Mao, and E. Cambria, "A review of named entity recognition: from learning methods to modelling paradigms and tasks," Artificial Intelligence Review, vol. 58, no. 10, July 2025, Art. no. 315.

J. Yu, B. Ji, S. Li, J. Ma, H. Liu, and H. Xu, "S-NER: A Concise and Efficient Span-Based Model for Named Entity Recognition," Sensors, vol. 22, no. 8, Apr. 2022, Art. no. 2852.

J. Fu, X. Huang, and P. Liu, "SpanNER: Named Entity Re-/Recognition as Span Prediction," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Online, 2021, pp. 7183–7195.

U. Zaratiana, N. Tomeh, P. Holat, and T. Charnois, "Named Entity Recognition as Structured Span Prediction," in Proceedings of the Workshop on Unimodal and Multimodal Induction of Linguistic Structures, Abu Dhabi, United Arab Emirates (Hybrid), 2022, pp. 1–10.

U. Zaratiana, N. Tomeh, P. Holat, and T. Charnois, "GLiNER: Generalist Model for Named Entity Recognition using Bidirectional Transformer," in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Mexico City, Mexico, 2024, pp. 5364–5376.

M. Fuadi, A. D. Wibawa, and S. Sumpeno, "Adaptation of Multilingual T5 Transformer for Indonesian Language," in 2023 IEEE 9th Information Technology International Seminar, Batu Malang, Indonesia, 2023, pp. 1–6.

F. Koto, A. Rahimi, J. H. Lau, and T. Baldwin, "IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP," in Proceedings of the 28th International Conference on Computational Linguistics, Barcelona, Spain (Online), 2020, pp. 757–770.

B. Wilie et al., "IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding," in Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, Suzhou, China, 2020, pp. 843–857.

"muchad/gliner-id." Hugging Face. https://huggingface.co/muchad/gliner-id.

S. O. Khairunnisa, A. Imankulova, and M. Komachi, "Towards a Standardized Dataset on Indonesian Named Entity Recognition," in Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing: Student Research Workshop, Suzhou, China, 2020, pp. 64–71.

H. Lovenia et al., "SEACrowd: A Multilingual Multimodal Data Hub and Benchmark Suite for Southeast Asian Languages," in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 2024, pp. 5155–5203.

Y. Gultom and W. C. Wibowo, "Automatic open domain information extraction from Indonesian text," in 2017 International Workshop on Big Data and Information Security, Jakarta, Indonesia, 2017, pp. 23–30.

M. Fachri, "Named entity recognition for Indonesian text using hidden Markov model," B.S. thesis, Universitas Gadjah Mada, Yogyakarta, Indonesia, 2014.

D. Hoesen and A. Purwarianti, "Investigating Bi-LSTM and CRF with POS Tag Embedding for Indonesian Named Entity Tagger," in 2018 International Conference on Asian Language Processing, Bandung, Indonesia, 2018, pp. 35–38.

"urchade/pile-mistral-v0.1." Datasets at Hugging Face, Aug. 21, 2024. [Online]. Available: https://huggingface.co/datasets/urchade/pile-mistral-v0.1.

M. Fuadi, A. D. Wibawa, and S. Sumpeno, "Efficient Transformer Models via Language-Aware Frequency-Based Vocabulary Pruning," IEEE Access, vol. 14, pp. 50993–51006, 2026.

P. He, J. Gao, and W. Chen, "DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing." arXiv, Nov. 18, 2021.


How to Cite

[1]
M. Fuadi, A. D. Wibawa, and S. Sumpeno, “Robust and Efficient Indonesian Span-Based Named Entity Recognition via Compact GLiNER: Towards Enhanced Retrieval-Augmented Generation”, Eng. Technol. Appl. Sci. Res., vol. 16, no. 3, pp. 36225–36232, Jun. 2026.
