Improving Pre-trained CNN-LSTM Models for Image Captioning with Hyper-Parameter Optimization

Authors

  • Nuha M. Khassaf, Informatics Institute for Postgraduate Studies, Iraqi Commission for Computers & Informatics, Iraq
  • Nada Hussein M. Ali, Department of Computer Science, College of Science, University of Baghdad, Iraq
Volume: 14 | Issue: 5 | Pages: 17337-17343 | October 2024 | https://doi.org/10.48084/etasr.8455

Abstract

Image captioning, the automatic generation of text that describes an image's visual content, has become feasible with advances in object recognition and image classification. Deep learning has received much interest from the scientific community and can be very useful in real-world applications. The proposed image captioning approach combines pre-trained Convolutional Neural Network (CNN) models with Long Short-Term Memory (LSTM) to generate image captions. The process includes two stages. In the first stage, the CNN-LSTM models are trained with baseline hyper-parameters; in the second stage, they are trained again with the hyper-parameters of the first stage optimized and adjusted. Improvements include the use of a new activation function, regular parameter tuning, and an improved learning rate in the later stages of training. Experimental results on the Flickr8k dataset showed a clear improvement in the second stage across the evaluation metrics BLEU-1 to BLEU-4, METEOR, and ROUGE-L. This improvement confirmed the effectiveness of the changes and highlighted the importance of hyper-parameter tuning for the performance of CNN-LSTM models in image captioning tasks.
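
As an illustration of the BLEU-1 to BLEU-4 metrics named in the abstract, the following minimal sketch computes a sentence-level BLEU score from clipped n-gram precision and a brevity penalty. It is not the authors' evaluation code: the function names, the uniform n-gram weights, the absence of smoothing, and the example captions are all assumptions made for illustration, and published scores are usually aggregated over a whole test corpus rather than computed per sentence.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Clipped n-gram precision: a candidate n-gram is credited at most as
    many times as it occurs in the reference that contains it most often."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n):
    """Sentence-level BLEU-max_n with uniform weights and no smoothing
    (illustrative assumption; returns 0.0 if any precision is zero)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length (ties go to the shorter).
    r = min((len(ref) for ref in references),
            key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_mean)


# Hypothetical captions, not taken from Flickr8k.
candidate = "a dog runs on the grass".split()
references = [
    "a dog is running on the grass".split(),
    "the dog runs across the grass".split(),
]
for n in range(1, 5):
    print(f"BLEU-{n}: {bleu(candidate, references, n):.4f}")
```

Because the score is a geometric mean of the n-gram precisions, a single missing higher-order n-gram match drives the unsmoothed sentence-level score to zero, which is one reason corpus-level aggregation or smoothing is commonly used in practice.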

Keywords:

CNN pre-trained models, LSTM, activation function, hyper-parameters, overfitting

How to Cite

[1]
Khassaf, N.M. and Ali, N.H.M. 2024. Improving Pre-trained CNN-LSTM Models for Image Captioning with Hyper-Parameter Optimization. Engineering, Technology & Applied Science Research. 14, 5 (Oct. 2024), 17337–17343. DOI:https://doi.org/10.48084/etasr.8455.
