A Grammar-Aware Multimodal Transformer for Structured ASL-to-English Translation

Enshirah Altarawneh; Jawdat S. Alkasassbeh; Esraa Alshdaifat; Aws Al-Qaisi; Maen Takruri

doi:10.48084/etasr.18203

Authors

Enshirah Altarawneh, Department of Computer Engineering, Faculty of Engineering, The Hashemite University, Zarqa, Jordan
Jawdat S. Alkasassbeh, Department of Electrical Engineering, Faculty of Engineering Technology, Al-Balqa Applied University, Amman, Jordan
Esraa Alshdaifat, Department of Information Technology, Faculty of Prince Al Hussein Bin Abdallah II for Information Technology, The Hashemite University, Zarqa, Jordan
Aws Al-Qaisi, College of Engineering and Technology, American University of the Middle East, Egaila, Kuwait
Maen Takruri, College of Engineering and Technology, American University of the Middle East, Egaila, Kuwait

Volume: 16 | Issue: 3 | Pages: 36574-36583 | June 2026 | https://doi.org/10.48084/etasr.18203

Received: 16 February 2026 | Revised: 4 March 2026 and 17 March 2026 | Accepted: 18 March 2026 | Online: 16 May 2026
Corresponding author: Enshirah Altarawneh

Abstract

Automatic translation of American Sign Language (ASL) into English remains challenging because signs are articulated in both space and time and because ASL has its own grammatical structure. This study presents an ASL-to-English translation framework that incorporates explicit grammatical representations. Rather than treating ASL as a sequence of unconnected signs or motions, the method uses a formalized grammatical model and treats ASL as a structured language comparable to spoken languages. The system captures spatial and temporal relations by processing multimodal input, namely Red Green Blue (RGB) video frames, 2D/3D body pose keypoints, and hand landmarks, with a transformer-based architecture. We also introduce ASL grammar tokens, intermediate representations that encode properties such as whether a sign expresses negation or whether a subject has been explicitly referenced. These tokens ease the transition from the ASL representation to the corresponding English representation. The method was evaluated on two publicly accessible benchmark datasets: Word-Level American Sign Language (WLASL) and Microsoft American Sign Language (MS-ASL). It outperformed current state-of-the-art methods, improving Bilingual Evaluation Understudy (BLEU-4) scores by +5.6 on WLASL and +5.0 on MS-ASL relative to the baselines. Additional metrics used to assess the gains in lexical accuracy and semantic coherence included Recall-Oriented Understudy for Gisting Evaluation (ROUGE-L), Metric for Evaluation of Translation with Explicit Ordering (METEOR), and Consensus-based Image Description Evaluation (CIDEr). The grammar token prediction module reached an accuracy of 95.1%, which further supports combining structural linguistic modeling with multimodal feature fusion in the translation pipeline. These results suggest that combining multimodal feature fusion with grammar-aware representations substantially improves on earlier ASL-to-English translation methods and provides a foundation for future ASL-to-English translation systems.
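
For readers unfamiliar with the headline metric, the following is a minimal sketch of how a BLEU-4 score can be computed for a single candidate translation against a single reference. It is not the authors' evaluation code: the function names `ngrams` and `bleu4` are hypothetical, the example sentences are invented, and the sketch is sentence-level and unsmoothed, whereas published results are normally corpus-level and often reported on a 0 to 100 scale rather than the 0 to 1 range returned here.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu4(candidate, reference):
    """Sentence-level BLEU-4 for one candidate and one reference.

    Illustrative assumption: uniform weights over 1- to 4-gram clipped
    precisions, the standard brevity penalty, and no smoothing, so any
    n-gram order with zero matches gives a score of 0.
    """
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, 5):
        cand_counts = ngrams(candidate, n)
        ref_counts = ngrams(reference, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precision_sum += 0.25 * math.log(matched / total)
    # Brevity penalty: candidates no longer than the reference are penalized.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1.0 - r / c)
    return brevity_penalty * math.exp(log_precision_sum)


# Invented example sentences, not taken from WLASL or MS-ASL.
reference = "the cat sat on a mat".split()
candidate = "the cat sat on the mat".split()
score = bleu4(candidate, reference)
```

Because the score is a geometric mean of four precisions, a single wrong word lowers every n-gram order that spans it, which is why BLEU-4 is sensitive to word order as well as word choice.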

Keywords

American Sign Language (ASL), RGB video, grammar-aware translation, multimodal learning, transformer, CNN-LSTM, pose estimation, Bilingual Evaluation Understudy (BLEU)
