Photometric Ligature Extraction Technique for Urdu Optical Character Recognition

M. Kazmi; F. Yasir; S. Habib; M. S. Hayat; S. A. Qazi

doi:10.48084/etasr.4596

Authors

M. Kazmi Faculty of Electrical and Computer Engineering, NED University of Engineering & Technology, Pakistan
F. Yasir Faculty of Electrical and Computer Engineering, NED University of Engineering & Technology, Pakistan
S. Habib Neurocomputation Lab, National Centre of Artificial Intelligence, NED University of Engineering and Technology, Pakistan
M. S. Hayat Department of Electrical Engineering, NED University of Engineering and Technology, Pakistan
S. A. Qazi Neurocomputation Lab, National Centre of Artificial Intelligence, NED University of Engineering and Technology, Pakistan

Volume: 11 | Issue: 6 | Pages: 7968-7973 | December 2021 | https://doi.org/10.48084/etasr.4596

Received: 4 November 2021 | Revised: 20 November 2021 | Accepted: 25 November 2021 | Online: 11 December 2021

Corresponding author: M. Kazmi

Abstract

Urdu Optical Character Recognition (OCR) based on character-level recognition (the analytical approach) is less popular than ligature-level recognition (the holistic approach) because of the added complexity caused by overlapping characters and strokes. This paper presents a holistic technique for Urdu ligature extraction. The proposed Photometric Ligature Extraction (PLE) technique is independent of font size and column layout and can handle non-overlapping ligatures as well as all inter- and intra-overlapping ligatures. It uses a customized photometric filter together with X-shearing, padding, and connected component analysis to extract complete ligatures, instead of extracting primary and secondary ligatures separately. A total of approximately 267,800 ligatures were extracted from scanned images of printed Urdu Nastaliq text with an accuracy of 99.4%. The proposed framework thus outperforms the existing Urdu Nastaliq text extraction and segmentation algorithms. The PLE framework can also be applied to other languages written in the Nastaliq script style, such as Arabic, Persian, Pashto, and Sindhi.
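
As a rough illustration of two of the generic building blocks named in the abstract, the following Python sketch shows a horizontal (X) shear and a connected component analysis on a small binary image. It is not the authors' PLE implementation: the photometric filter and the padding step are not reproduced, and the function names `x_shear` and `connected_components`, the shear factor, the 8-connectivity, and the toy image are all assumptions chosen for illustration. It shows one plausible reason shearing can help: two strokes whose column ranges overlap can be moved apart horizontally while remaining separate components.

```python
from collections import deque


def x_shear(image, factor):
    """Shift each row horizontally in proportion to its height above the bottom row.

    `image` is a list of equal-length rows of 0/1 values. The output is widened
    so that no pixel is lost. (Hypothetical helper, not from the paper.)
    """
    height, width = len(image), len(image[0])
    shifts = [int(round(factor * (height - 1 - y))) for y in range(height)]
    offset = -min(shifts)                      # make every shift non-negative
    out_width = width + max(shifts) - min(shifts)
    sheared = [[0] * out_width for _ in range(height)]
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            sheared[y][x + shifts[y] + offset] = value
    return sheared


def connected_components(image):
    """Return bounding boxes (top, left, bottom, right) of 8-connected foreground blobs."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(y, x)])
            seen[y][x] = True
            top, left, bottom, right = y, x, y, x
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < height and 0 <= nx < width
                                and image[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


# Toy image (an assumption): two strokes that share column 2.
toy = [
    [1, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]
boxes_before = connected_components(toy)
boxes_after = connected_components(x_shear(toy, -1))
```

In this toy case the two boxes overlap in column 2 before shearing and occupy disjoint column ranges afterwards, which is the kind of separation a column-based segmentation step would need. How the paper combines shearing, padding, and the photometric filter is described in the full text, not here.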

Keywords:

ligature, holistic, Urdu OCR, Nastaliq, photometric filter, Urdu printed text images
