Improving the Recognition Performance of Lip Reading Using the Concatenated Three Sequence Keyframe Image Technique
Published online first on April 6, 2021.
This paper proposes a lip reading method based on convolutional neural networks applied to the Concatenated Three Sequence Keyframe Image (C3-SKI), which consists of (a) the Start-Lip Image (SLI), (b) the Middle-Lip Image (MLI), and (c) the End-Lip Image (ELI), which marks the end of the pronunciation of the syllable. The lip area in each image frame was reduced to 32×32 pixels, and three keyframes were concatenated to represent one syllable as a single 96×32 pixel image for visual speech recognition. The three keyframes representing each syllable are selected based on the relative maxima and relative minima of the open lip's width and height. On the THDigits dataset, the model achieved accuracy, validation accuracy, loss, and validation loss values of 95.06%, 86.03%, 4.61%, and 9.04%, respectively. The C3-SKI technique was also applied to the AVDigits dataset, where it reached 85.62% accuracy. In conclusion, the C3-SKI technique can be applied to lip reading recognition.
Keywords: concatenated frame images, convolutional neural network, keyframe reduction, keyframe sequence, lip reading
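The keyframe selection and concatenation described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the made-up lip-opening values, and the rule used here (first frame as SLI, the relative extremum with the widest opening as MLI, last frame as ELI) are assumptions for illustration, as is the choice to place the three 32×32 frames side by side to obtain the 96×32 image. Plain Python lists are used to keep the sketch dependency-free; in practice an array library would be the usual choice.

```python
from typing import List, Sequence, Tuple

Frame = List[List[int]]  # grayscale image stored as rows of pixel values


def relative_extrema(signal: Sequence[float]) -> List[int]:
    """Return indices where the signal has a strict relative maximum or minimum."""
    return [
        i
        for i in range(1, len(signal) - 1)
        if (signal[i] - signal[i - 1]) * (signal[i + 1] - signal[i]) < 0
    ]


def pick_keyframes(opening: Sequence[float]) -> Tuple[int, int, int]:
    """Pick (start, middle, end) frame indices for one syllable.

    Assumption (not from the paper): start and end are the first and last
    frames, and the middle keyframe is the relative extremum with the
    largest lip opening, falling back to the central frame if none exists.
    """
    extrema = relative_extrema(opening)
    middle = max(extrema, key=lambda i: opening[i]) if extrema else len(opening) // 2
    return 0, middle, len(opening) - 1


def concat_keyframes(sli: Frame, mli: Frame, eli: Frame) -> Frame:
    """Join three equally sized frames side by side, row by row."""
    return [a + b + c for a, b, c in zip(sli, mli, eli)]


# Hypothetical per-frame lip-opening heights for one syllable (made-up values).
opening = [2.0, 5.0, 9.0, 6.0, 3.0]
start, middle, end = pick_keyframes(opening)

# Dummy 32x32 frames filled with their frame index, standing in for lip crops.
frames = [[[k] * 32 for _ in range(32)] for k in range(len(opening))]
c3ski = concat_keyframes(frames[start], frames[middle], frames[end])

# Width and height of the concatenated image.
print((start, middle, end), len(c3ski[0]), len(c3ski))
```

The resulting image has 32 rows of 96 pixels each, so a single 2D convolutional network can see the start, middle, and end lip shapes of a syllable in one input.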
Copyright (c) 2021 Authors
This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.