A Multimodal Generative Storytelling Framework for Sustainable Ecotourism Using Text and Image Fusion
Received: 6 December 2025 | Revised: 13 February 2026, 28 February 2026, and 13 March 2026 | Accepted: 24 March 2026 | Online: 6 April 2026
Corresponding author: Listra Frigia Missianes Horhoruw
Abstract
This study presents a system-level multimodal generative storytelling approach designed to support sustainable ecotourism through narrative-oriented multimodal conditioning. The proposed architecture integrates the Indonesian Text-to-Text Transfer Transformer (IndoT5) as the textual encoder–decoder and Bootstrapping Language-Image Pre-training (BLIP) as the visual encoder, employing an early fusion strategy to align semantic and visual representations. The model was trained on a curated dataset from Indonesian Super Priority Tourism Destinations (SPTDs) and optimized to generate coherent, natural, and emotionally expressive narratives. Performance was evaluated using lexical overlap metrics, namely Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Metric for Evaluation of Translation with Explicit Ordering (METEOR), together with a semantic similarity metric (BERTScore). The results indicate that the proposed IndoT5-based approach achieves higher average performance, with a ROUGE-L of 0.66, a METEOR of 0.70, and a BERTScore of 0.89. In addition, expert-based qualitative evaluation demonstrated strong agreement within the defined narrative assessment scope, resulting in a Scale-Level Content Validity Index Averaged Across Items (S-CVI/Ave) of 1.00. The proposed approach bridges visual perception and linguistic generation, offering a scalable solution for automated tourism storytelling that preserves contextual and emotional nuances.
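To make two of the reported quantities concrete, the following is a minimal sketch of how a sentence-level ROUGE-L F1 score and an S-CVI/Ave value are commonly computed. It is not the authors' evaluation code. The whitespace tokenization, the balanced F1 weighting, the 4-point relevance scale with ratings of 3 or 4 counted as "relevant", and all function names and sample data are assumptions introduced here for illustration.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Sentence-level ROUGE-L F1 (assumes lowercased whitespace tokens)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of candidate tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


def s_cvi_ave(ratings: List[List[int]], relevant_from: int = 3) -> float:
    """S-CVI/Ave: mean of the item-level CVIs (I-CVI).

    ratings[i][j] is expert j's rating of item i; a rating at or above
    `relevant_from` counts as "relevant" (assumed 4-point scale).
    """
    i_cvis = [
        sum(score >= relevant_from for score in item) / len(item)
        for item in ratings
    ]
    return sum(i_cvis) / len(i_cvis)


# Hypothetical data, for illustration only.
reference = "danau toba sangat indah"
candidate = "danau toba indah"
print(round(rouge_l_f1(reference, candidate), 3))

expert_ratings = [[4, 4, 3], [3, 4, 4]]  # two items, three experts
print(s_cvi_ave(expert_ratings))
```

Under these assumptions, an S-CVI/Ave of 1.00 means that every expert rated every item as relevant, which is the situation the abstract reports.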
Keywords:
ecotourism storytelling, generative model, Indonesian language, text generation
License
Copyright (c) 2026 Listra Frigia Missianes Horhoruw, Lintang Yuniar Banowosari, Diana Ikasari

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors who publish with this journal agree to the following terms:
- Authors retain the copyright and grant the journal the right of first publication with the work simultaneously licensed under a Creative Commons Attribution License that allows others to share the work with an acknowledgement of the work's authorship and initial publication in this journal.
- Authors are able to enter into separate, additional contractual arrangements for the non-exclusive distribution of the journal's published version of the work (e.g., post it to an institutional repository or publish it in a book), with an acknowledgement of its initial publication in this journal.
- Authors are permitted and encouraged to post their work online (e.g., in institutional repositories or on their website) after its publication in ETASR with an acknowledgement of its initial publication in this journal.
