Robust and Efficient Indonesian Span-Based Named Entity Recognition via Compact GLiNER
Towards Enhanced Retrieval-Augmented Generation
Received: 2 March 2026 | Revised: 18 April 2026 | Accepted: 1 May 2026 | Online: 9 May 2026
Corresponding author: Adhi Dharma Wibawa
Abstract
Efficient Named Entity Recognition (NER) integration is crucial for improving retrieval precision and fact verification in Retrieval-Augmented Generation (RAG) systems. However, the development of Indonesian NER still faces challenges, including dataset heterogeneity, inefficient tokenization, and trade-offs between NER performance and inference efficiency. Although span-based frameworks such as Generalist and Lightweight Named Entity Recognition (GLiNER) offer greater flexibility than conventional Beginning-Inside-Outside (BIO) approaches, available GLiNER models are generally multilingual and have not been optimized for Indonesian characteristics. This research proposes a compact GLiNER model specifically for Indonesian, utilizing a pruned mDeBERTa with a 30k vocabulary as the encoder backbone. We built a large-scale Indonesian GLiNER training corpus by combining heterogeneous NER datasets and augmenting them with controlled translations, resulting in 56,210 training, 6,707 validation, and 9,411 test samples. Experimental results show that the proposed model achieves an F1 score of 76.58%, surpassing the IndoBERT-based GLiNER baseline (74.70%) at max_len = 192 tokens, with consistent improvements on major entity types. Deployment-oriented evaluation shows up to an 8× speed-up in CPU inference (482 ms vs. 3,897 ms per sample). In long-context evaluation (max_len = 384), the proposed model outperforms multilingual GLiNER by more than 12 F1 points while maintaining a significantly lower memory footprint on both GPU and CPU. These advantages validate the model's potential as a reliable metadata filter component for RAG architectures.
Keywords:
Named Entity Recognition (NER), GLiNER, span-based NER, Retrieval-Augmented Generation (RAG), Transformer models, efficient NLP
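To illustrate the span-based formulation that the abstract contrasts with BIO tagging, the following minimal Python sketch (hypothetical helper, not code from the paper) enumerates the candidate spans that a GLiNER-style model would score against entity-type embeddings. Because every span is an independent candidate, nested and overlapping mentions are straightforward to represent, unlike in per-token BIO schemes:

```python
def enumerate_spans(tokens, max_width=4):
    """Enumerate all candidate (start, end, text) spans up to max_width tokens.

    Span-based NER frameworks such as GLiNER classify each candidate span
    against entity-type representations instead of assigning BIO tags to
    individual tokens; this enumeration step produces those candidates.
    """
    spans = []
    n = len(tokens)
    for start in range(n):
        for width in range(1, max_width + 1):
            end = start + width
            if end > n:
                break
            spans.append((start, end, " ".join(tokens[start:end])))
    return spans

# Example Indonesian sentence: "Joko Widodo visited Surabaya".
tokens = "Joko Widodo mengunjungi Surabaya".split()
spans = enumerate_spans(tokens, max_width=2)
# Each tuple is a candidate entity mention, e.g. (0, 2, "Joko Widodo"),
# to be scored against label embeddings such as "person" or "location".
```

The number of candidates grows roughly as `n × max_width`, which is why span-based models cap the maximum span width; the paper's efficiency gains come instead from the pruned 30k-vocabulary mDeBERTa encoder.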
License
Copyright (c) 2026 Mukhlish Fuadi, Adhi Dharma Wibawa, Surya Sumpeno

This work is licensed under a Creative Commons Attribution 4.0 International License.
