Automating Named Entity Recognition for Indonesian Diplomas via Template-Based Synthetic Data Generation
Received: 6 December 2025 | Revised: 12 February 2026 | Accepted: 23 February 2026 | Online: 4 March 2026
Corresponding author: Ruddy J. Suhatril
Abstract
Automated information extraction from administrative documents, such as higher education diplomas, is critical to streamlining public service verification. However, the development of robust Named Entity Recognition (NER) models for this domain is hindered by the scarcity of publicly available datasets and the high cost of manual annotation. This study addresses the gap by proposing an automated pipeline for generating domain-specific NER datasets for Indonesian higher education diplomas. The dataset was constructed from 36 distinct higher education diploma templates and 3,460 alumni records, and was synthesized using Direct Label Generation, which draws on the PDDikti higher education database and a template-rendering mechanism. The generated dataset was evaluated by benchmarking two Pre-trained Language Models (PLMs): the Indonesian Bidirectional Encoder Representations from Transformers (IndoBERT) achieved an overall F1-score of 0.8213, outperforming the 522M baseline (F1 = 0.8048). To validate real-world utility, a pilot evaluation on held-out, noisy real-world diplomas yielded an F1-score of 0.7491 and a recall of 0.8460. These results confirm that the synthetic generation method effectively bridges the data gap, enabling high-performance extraction even on raw, noisy documents.
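The abstract does not spell out how Direct Label Generation works, so the following is only a minimal sketch of the general idea: when a template is filled from a structured record, the entity label of every inserted token is already known, and no manual annotation pass is needed. The template text, the field names (`NAME`, `STUDENT_ID`, `PROGRAM`), the sample record, and the BIO tagging scheme are all illustrative assumptions, not details taken from the paper or from PDDikti.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical diploma template; each {FIELD} placeholder doubles as the entity type.
TEMPLATE = "Diberikan kepada {NAME} NIM {STUDENT_ID} Program Studi {PROGRAM}"


def render_with_labels(template: str, record: Dict[str, str]) -> List[Tuple[str, str]]:
    """Fill a template from a record and emit (token, BIO label) pairs."""
    tagged: List[Tuple[str, str]] = []
    for piece in template.split():
        match = re.fullmatch(r"\{(\w+)\}", piece)
        if match is None:
            # Fixed template wording is never an entity.
            tagged.append((piece, "O"))
            continue
        field = match.group(1)
        # The inserted value is an entity by construction: first token B-, rest I-.
        for i, token in enumerate(record[field].split()):
            prefix = "B" if i == 0 else "I"
            tagged.append((token, f"{prefix}-{field}"))
    return tagged


# Made-up record for illustration only; not drawn from any real dataset.
record = {
    "NAME": "Budi Santoso",
    "STUDENT_ID": "12345678",
    "PROGRAM": "Teknik Informatika",
}
pairs = render_with_labels(TEMPLATE, record)
for token, label in pairs:
    print(f"{token}\t{label}")
```

Repeating this over every combination of template and record would produce a token-classification corpus in the usual two-column layout; whether the authors' pipeline works this way is not stated in the abstract.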
Keywords:
Named Entity Recognition (NER), dataset generation, higher education diplomas, template rendering, IndoBERT, low-resource domains
License
Copyright (c) 2026 Ali Isra, Sarifuddin Madenda, Ruddy J. Suhatril

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
