A Novel Multi-Stage Rule-Based Information Extraction Framework for Disease Outbreak Detection with Enhanced Geographical Granularity
Received: 19 September 2025 | Revised: 7 October 2025, 19 October 2025, and 21 October 2025 | Accepted: 22 October 2025 | Online: 19 November 2025
Corresponding author: Manju Joy
Abstract
A critical limitation of modern event-based bio-surveillance systems is their reliance on headline-level data rather than systematic analysis of the underlying news narratives. Although such systems provide general awareness, they often lack the geographical granularity needed to generate actionable insights and support location-specific public health interventions. To address this limitation, this study presents a method for extracting epidemic-related information from unstructured text corpora on digital platforms. By converting that text into a structured format using NLP techniques, the approach supports visualization of outbreak information on Google Maps with enhanced spatial detail. Kerala is recognized as the "Alarm bell of India" for consistently reporting the first instances of many outbreaks that have emerged throughout the country, including the first Nipah outbreak in 2018, the first COVID-19 case in 2020, and the index Monkeypox case in 2022. This pattern highlights Kerala's pivotal role in the early detection of emerging infectious diseases and underscores the need for a more advanced and efficient bio-surveillance system to enhance public health preparedness. The proposed approach identifies outbreak-related entities using an optimized DistilBERT model, combined with a multi-stage rule-based method for automated information extraction from unstructured text corpora, and achieves an overall precision of 94.86%. Experimental results indicate that the proposed framework is effective in tracking diseases that severely impact public health and disrupt socio-economic stability.
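The general idea of staged, rule-based extraction over tagged entities can be illustrated with a minimal sketch. It is not the authors' implementation: a simple gazetteer lookup stands in for the DistilBERT entity recognizer, the three rules and the names `DISEASES`, `DISTRICTS`, `OutbreakRecord`, `tag_entities`, and `extract_records` are hypothetical, and the mapping step (geocoding and display on Google Maps) is omitted.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical gazetteers; they stand in for the output of a trained NER model.
DISEASES = ("nipah", "dengue", "cholera")
DISTRICTS = ("kozhikode", "malappuram", "ernakulam")


@dataclass(frozen=True)
class OutbreakRecord:
    disease: str
    location: str
    cases: Optional[int]


def tag_entities(sentence: str) -> Tuple[List[str], List[str], Optional[int]]:
    """Stage 1: tag diseases, locations, and a case count in one sentence."""
    lower = sentence.lower()
    diseases = [d for d in DISEASES if re.search(rf"\b{d}\b", lower)]
    locations = [loc for loc in DISTRICTS if re.search(rf"\b{loc}\b", lower)]
    match = re.search(r"\b(\d+)\s+(?:new\s+)?cases?\b", lower)
    cases = int(match.group(1)) if match else None
    return diseases, locations, cases


def extract_records(text: str) -> List[OutbreakRecord]:
    """Turn free text into structured (disease, location, cases) records."""
    records = []
    current_disease = None  # most recently mentioned disease
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        diseases, locations, cases = tag_entities(sentence)
        # Stage 3 (crude stand-in for co-reference resolution): a sentence
        # that names no disease inherits the last one mentioned.
        if diseases:
            current_disease = diseases[0]
        if current_disease is None:
            continue
        # Stage 2: pair the disease with every location in the sentence.
        for loc in locations:
            records.append(OutbreakRecord(current_disease, loc.title(), cases))
    return records


SAMPLE = (
    "Health officials confirmed a Nipah outbreak in Kozhikode. "
    "Three days later, 4 cases were reported in Malappuram."
)

if __name__ == "__main__":
    for record in extract_records(SAMPLE):
        print(record)
```

In this sketch the second sentence names no disease, so the carry-forward rule attributes the Malappuram cases to Nipah. A real system would replace the gazetteers with model predictions and the carry-forward rule with proper co-reference resolution.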
Keywords:
information extraction, named entity recognition, co-reference resolution, outbreak detection, epidemic surveillance
License
Copyright (c) 2025 Manju Joy, M. Krishnaveni

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
