Photometric Ligature Extraction Technique for Urdu Optical Character Recognition
Received: 4 November 2021 | Revised: 20 November 2021 | Accepted: 25 November 2021 | Online: 11 December 2021
Corresponding author: M. Kazmi
Abstract
Urdu Optical Character Recognition (OCR) based on character-level recognition (the analytical approach) is less popular than ligature-level recognition (the holistic approach) because of its added complexity and the overlapping of characters and strokes. This paper presents a holistic Urdu ligature extraction technique. The proposed Photometric Ligature Extraction (PLE) technique is independent of font size and column layout and can handle non-overlapping ligatures as well as all inter- and intra-overlapping ligatures. It uses a customized photometric filter, together with X-shearing, padding, and connected component analysis, to extract complete ligatures instead of extracting primary and secondary ligatures separately. Approximately 267,800 ligatures were extracted from scanned images of printed Urdu Nastaliq text with an accuracy of 99.4%. The proposed framework thus outperforms existing Urdu Nastaliq text extraction and segmentation algorithms. The PLE framework can also be applied to other languages written in the Nastaliq script style, such as Arabic, Persian, Pashto, and Sindhi.
Keywords:
ligature, holistic, Urdu OCR, Nastaliq, photometric filter, Urdu printed text images
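The abstract names X-shearing, padding, and connected component analysis but does not say how they are parameterized or combined, and it does not define the customized photometric filter. The following is therefore only a minimal sketch of the three generic operations on a toy binary image, not the authors' PLE implementation. The function names (pad_image, x_shear, connected_components, column_extents), the shear factor, and the toy image are all assumptions made for illustration; the photometric filter is left out entirely. The sketch shows one reason shearing may help with slanted text: two strokes whose column ranges overlap occupy separate column ranges after the shear.

```python
from collections import deque

import numpy as np


def pad_image(binary, pad):
    """Surround the image with `pad` background (0) pixels on every side."""
    return np.pad(binary, pad, mode="constant", constant_values=0)


def x_shear(binary, factor):
    """Shift row r to the right by round(factor * r) pixels (factor >= 0).

    The output is widened so that no foreground pixel is clipped.
    """
    h, w = binary.shape
    shifts = np.rint(factor * np.arange(h)).astype(int)
    out = np.zeros((h, w + int(shifts.max())), dtype=binary.dtype)
    for r, s in enumerate(shifts):
        out[r, s:s + w] = binary[r]
    return out


def connected_components(binary):
    """Label 8-connected foreground regions with a breadth-first search."""
    h, w = binary.shape
    labels = np.zeros((h, w), dtype=int)
    count = 0
    for r in range(h):
        for c in range(w):
            if binary[r, c] and labels[r, c] == 0:
                count += 1
                labels[r, c] = count
                queue = deque([(r, c)])
                while queue:
                    y, x = queue.popleft()
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and binary[ny, nx]
                                    and labels[ny, nx] == 0):
                                labels[ny, nx] = count
                                queue.append((ny, nx))
    return labels, count


def column_extents(labels, count):
    """Return (min_col, max_col) for each labelled component, in label order."""
    extents = []
    for k in range(1, count + 1):
        cols = np.nonzero(labels == k)[1]
        extents.append((int(cols.min()), int(cols.max())))
    return extents


# Toy image (an assumption): two slanted strokes that both touch column 3.
img = np.zeros((4, 8), dtype=np.uint8)
for r in range(4):
    img[r, 3 - r] = 1  # stroke A
    img[r, 6 - r] = 1  # stroke B

before_labels, before_n = connected_components(img)
print(column_extents(before_labels, before_n))  # [(0, 3), (3, 6)]

sheared = x_shear(pad_image(img, 1), 1.0)
after_labels, after_n = connected_components(sheared)
print(column_extents(after_labels, after_n))  # [(5, 5), (8, 8)]
```

In this toy case both strokes remain separate connected components before and after the shear; what changes is that their column ranges no longer overlap. A real page would first need binarization and line segmentation, and the shear direction and factor would have to be chosen for the actual slant of the text.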
License
Copyright (c) 2021 M. Kazmi, F. Yasir, S. Habib, M. S. Hayat, S. A. Qazi
This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.