
Multimodal Contrastive Learning for Zero-Shot Instruction-Following Robot with Synthetic Data

Authors

W. Kamadi, J. G. Njiri, S. Kangwagye, and S. Aoki

Volume: 16 | Issue: 3 | Pages: 35769-35779 | June 2026 | https://doi.org/10.48084/etasr.18291

Abstract

Robot trajectory prediction depends heavily on large-scale real-world demonstrations, which limits scalability, raises data acquisition costs, and ultimately prevents zero-shot generalization. To address this limitation, this paper introduces Zero-Shot Task Learning (ZSTL), a multimodal framework that uses structurally aligned synthetic data and contrastive learning to enable instruction-based trajectory generation without reliance on real-world demonstrations. ZSTL jointly encodes natural-language instructions, depth observations, LiDAR-derived spatial representations, and action trajectories in a joint embedding space, enabling cross-modal alignment and conditional behavior synthesis. The proposed architecture preserves modality structure prior to fusion by representing depth inputs as spatial tokens and LiDAR observations as temporal tokens. Together with a text token, these form a 101-token multimodal context that a Transformer decoder attends over to predict full 50-step trajectories with Gaussian uncertainty estimates. The system integrates a pretrained Bidirectional Encoder Representations from Transformers (BERT) language encoder, a ResNet-18 depth backbone, a one-dimensional convolutional LiDAR sequence encoder, and a two-layer Transformer decoder, totaling approximately 125M parameters. Training was conducted entirely on a procedurally generated synthetic dataset of 5,000 samples for 50 epochs. The results demonstrate stable convergence, with the trajectory negative log-likelihood decreasing from 3.465 to -0.695 on validation data and the combined loss reaching -0.540 at epoch 18 under cosine-annealed learning-rate scheduling. The contrastive objective (InfoNCE, τ = 0.07) stabilized near 1.55, indicating consistent cross-modal alignment. Trajectory evaluation yielded an average final-position error of 9.97 cm, a collision-free execution rate of 65.9%, and a task success rate of 59.2%, showing that structured synthetic supervision can support physically meaningful motion generation.
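To make the described pipeline concrete, the following minimal PyTorch sketch shows how the 101-token multimodal context, the Gaussian trajectory head, and the InfoNCE objective could fit together. It is an illustrative reconstruction from the abstract only: the token split (1 text + 64 depth spatial + 36 LiDAR temporal tokens), embedding width, encoder stand-ins, and all module names are assumptions, not the authors' implementation.

```python
# Minimal sketch of a ZSTL-style fusion and training objective. Only the 101-token
# total, the 50-step trajectory, the two-layer decoder, and tau = 0.07 come from the
# abstract; every other dimension and name below is an illustrative assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256          # shared embedding width (assumption)
T_STEPS = 50     # trajectory length, per the abstract
N_TOKENS = 101   # 1 text + 64 depth + 36 LiDAR tokens (assumed split)

class ZSTLSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-ins for the BERT text encoder, ResNet-18 depth backbone,
        # and 1-D convolutional LiDAR encoder named in the abstract.
        self.text_proj = nn.Linear(768, D)                  # pooled BERT embedding -> 1 text token
        self.depth_proj = nn.Conv2d(512, D, kernel_size=1)  # ResNet-18 feature map -> 8x8 spatial tokens
        self.lidar_conv = nn.Conv1d(1, D, kernel_size=5, stride=10, padding=2)  # 360 beams -> 36 temporal tokens
        dec_layer = nn.TransformerDecoderLayer(d_model=D, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)  # two-layer decoder, per the abstract
        self.queries = nn.Parameter(torch.randn(T_STEPS, D))           # one learned query per trajectory step
        self.head = nn.Linear(D, 4)  # per-step (mu_x, mu_y, log_var_x, log_var_y)

    def forward(self, text_emb, depth_feat, lidar):
        # text_emb: (B, 768), depth_feat: (B, 512, 8, 8), lidar: (B, 1, 360)
        txt = self.text_proj(text_emb).unsqueeze(1)                   # (B, 1, D)
        dep = self.depth_proj(depth_feat).flatten(2).transpose(1, 2)  # (B, 64, D) spatial tokens
        lid = self.lidar_conv(lidar).transpose(1, 2)                  # (B, 36, D) temporal tokens
        ctx = torch.cat([txt, dep, lid], dim=1)                       # (B, 101, D) multimodal context
        q = self.queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        out = self.head(self.decoder(q, ctx))                         # (B, 50, 4)
        mu, log_var = out[..., :2], out[..., 2:]
        return ctx, mu, log_var

def gaussian_nll(mu, log_var, target):
    # Per-step diagonal-Gaussian negative log-likelihood over the 50-step trajectory.
    return 0.5 * (log_var + (target - mu) ** 2 / log_var.exp()).mean()

def info_nce(a, b, tau=0.07):
    # Symmetric InfoNCE with temperature 0.07, aligning two modality embeddings in-batch.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))
```

In such a setup, the "combined loss" reported above would correspond to the trajectory NLL plus the contrastive term, optimized under a cosine-annealed learning rate; the exact loss weighting and token split used by the authors are not stated in the abstract.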

Keywords:

multimodal learning, zero-shot task, synthetic data, sentence transformers




How to Cite

[1] W. Kamadi, J. G. Njiri, S. Kangwagye, and S. Aoki, “Multimodal Contrastive Learning for Zero-Shot Instruction-Following Robot with Synthetic Data”, Eng. Technol. Appl. Sci. Res., vol. 16, no. 3, pp. 35769–35779, Jun. 2026.
