D-CNN: A New Model for Generating Image Captions with Text Extraction Using Deep Learning for Visually Challenged Individuals
Received: 24 January 2022 | Revised: 14 February 2022 | Accepted: 22 February 2022 | Online: 9 April 2022
Corresponding author: M. Bhalekar
Abstract
Automatically describing the content of an image in well-formed sentences is a difficult task in any language. However, it can have a significant impact by helping visually challenged individuals better understand their surroundings. This paper proposes an image captioning system that generates detailed captions and, when an image contains text, extracts that text and uses it as part of the caption to describe the image more precisely. The proposed model uses Convolutional Neural Networks (CNNs) to extract image features, followed by a Long Short-Term Memory (LSTM) network that generates the corresponding sentences from the learned features. The text extraction module then adds any extracted text to the image description, and the captions are presented in audio form. Publicly available benchmark datasets for image captioning, such as MS COCO, Flickr-8k, and Flickr-30k, contain a wide variety of images but very few with textual information. Because these datasets are insufficient for the proposed model, a new image caption dataset containing images with textual content was created. Using this dataset, the experimental results of the proposed model are compared with those of the existing pre-trained model. The results show that the proposed model is as effective as the existing one at image captioning and provides more insight into the image by performing text extraction.
Keywords:
image captioning, text extraction, convolutional model, long short-term memory, deep learning
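The flow described in the abstract (caption generation, optional text extraction, merged description) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `compose_caption` and `describe_image`, the sentence template, and the stub components are all assumptions made for illustration. The `captioner` argument stands in for the CNN + LSTM model and `text_extractor` for the text extraction module; the audio step is omitted.

```python
from typing import Any, Callable, Optional


def compose_caption(visual_caption: str, extracted_text: Optional[str]) -> str:
    """Merge a generated caption with text found in the image, if any.

    The sentence template below is a hypothetical choice; the abstract
    does not specify how the extracted text is worded in the caption.
    """
    base = visual_caption.strip().rstrip(".")
    # Collapse newlines and repeated spaces that text extraction often yields.
    text = " ".join((extracted_text or "").split())
    if not text:
        # No text in the image: return the visual caption unchanged.
        return base + "."
    return f'{base}. The text in the image reads: "{text}".'


def describe_image(
    image: Any,
    captioner: Callable[[Any], str],
    text_extractor: Callable[[Any], Optional[str]],
) -> str:
    """Run both components on the same image and merge their outputs."""
    caption = captioner(image)     # stand-in for the CNN + LSTM captioning model
    text = text_extractor(image)   # stand-in for the text extraction module
    return compose_caption(caption, text)


# Stub components so the sketch runs without any model or image file.
def fake_captioner(image: Any) -> str:
    return "a red sign on the side of a road"


def fake_text_extractor(image: Any) -> Optional[str]:
    return "STOP\n"


if __name__ == "__main__":
    print(describe_image(None, fake_captioner, fake_text_extractor))
```

Passing the two components as callables keeps the merging logic independent of any particular captioning model or text extraction engine, so either can be replaced without touching the other.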
License
Copyright (c) 2022 M. Bhalekar, M. Bedekar
This work is licensed under a Creative Commons Attribution 4.0 International License.