A Multimodal Generative Storytelling Framework for Sustainable Ecotourism Using Text and Image Fusion

Listra Frigia Missianes Horhoruw; Lintang Yuniar Banowosari; Diana Ikasari

doi:10.48084/etasr.16742

Authors

Listra Frigia Missianes Horhoruw, Doctoral Program in Information Technology, Gunadarma University, Indonesia
Lintang Yuniar Banowosari, Department of Informatics, Gunadarma University, Indonesia
Diana Ikasari, Faculty of Computer Science and Information Technology, Gunadarma University, Indonesia

Volume: 16 | Issue: 3 | Pages: 35006-35012 | June 2026 | https://doi.org/10.48084/etasr.16742

Received: 6 December 2025 | Revised: 13 February 2026, 28 February 2026, and 13 March 2026 | Accepted: 24 March 2026 | Online: 6 April 2026

Corresponding author: Listra Frigia Missianes Horhoruw

Abstract

This study presents a system-level multimodal generative storytelling approach designed to support sustainable ecotourism through narrative-oriented multimodal conditioning. The proposed architecture integrates the Indonesian Text-to-Text Transfer Transformer (IndoT5) as the textual encoder–decoder and Bootstrapped Language-Image Pretraining (BLIP) as the visual encoder, employing an early fusion strategy to align semantic and visual representations. The model was trained on a curated dataset from Indonesian Super Priority Tourism Destinations (SPTDs) and optimized to generate coherent, natural, and emotionally expressive narratives. Performance was evaluated using two lexical metrics, Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Metric for Evaluation of Translation with Explicit Ordering (METEOR), together with a semantic similarity metric (BERTScore). The results indicate that the proposed IndoT5-based approach achieves higher average performance, with scores of 0.66 (ROUGE-L), 0.70 (METEOR), and 0.89 (BERTScore). In addition, expert-based qualitative evaluation demonstrated strong agreement within the defined narrative assessment scope, resulting in a Scale-Level Content Validity Index Averaged Across Items (S-CVI/Ave) of 1.00. The proposed approach bridges visual perception and linguistic generation, offering a scalable solution for automated tourism storytelling that preserves contextual and emotional nuances.
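To make two of the reported quantities concrete, the following is a minimal sketch of how a ROUGE-L score and an S-CVI/Ave value are commonly computed. It is illustrative only and is not the authors' evaluation code. Several details are assumptions not stated in the abstract: whitespace tokenization, the balanced F1 form of ROUGE-L, a 4-point expert relevance scale on which ratings of 3 or 4 count as "relevant", and all function names and sample data (which are hypothetical).

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """ROUGE-L as a balanced F1 over LCS (assumption: whitespace tokens)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def s_cvi_ave(ratings: List[List[int]], relevant_min: int = 3) -> float:
    """Mean of item-level CVIs; ratings[i][j] is expert j's rating of item i.

    Assumption: a 4-point scale where ratings >= relevant_min are 'relevant'.
    """
    i_cvis = [
        sum(r >= relevant_min for r in item) / len(item) for item in ratings
    ]
    return sum(i_cvis) / len(i_cvis)


if __name__ == "__main__":
    # Hypothetical generated narrative vs. reference (not from the dataset).
    print(rouge_l_f1("danau toba sangat indah", "danau toba indah sekali"))
    # Hypothetical ratings: two items, three experts each.
    print(s_cvi_ave([[4, 4, 3], [3, 4, 4]]))
```

Under these assumptions, an S-CVI/Ave of 1.00 means every expert rated every item as relevant, since any single rating below the threshold would pull at least one item-level index below 1.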

Keywords:

ecotourism storytelling, generative model, Indonesian language, text generation
