LEGS-POMDP: Language and Gesture-Guided Object Search in Partially Observable Environments

ACM/IEEE International Conference on Human-Robot Interaction (HRI) 2026
Ivy He1, Stefanie Tellex1, Jason Xinyu Liu1
1Brown University
LEGS-POMDP Teaser

LEGS-POMDP integrates language, gesture, and visual observations for open-world object search. The robot maintains a joint belief over target object identity and location, enabling principled decision-making under uncertainty.


Abstract

To assist humans in open-world environments, robots must accurately interpret ambiguous instructions to locate desired objects. Foundation model-based approaches excel at referring expression grounding and multimodal instruction understanding, but lack a principled mechanism for modeling uncertainty in long-horizon tasks. Conversely, Partially Observable Markov Decision Processes (POMDPs) provide a systematic framework for planning under uncertainty, but typically handle only a narrow set of modalities and rely on restrictive environment assumptions.

To achieve the best of both worlds, we introduce LanguagE and GeSture-Guided Object Search in Partially Observable Environments (LEGS-POMDP), a modular POMDP system that integrates language, gesture, and visual observations for open-world object search. Unlike prior work, LEGS-POMDP explicitly models two sources of partial observability: uncertainty over the target object's identity and its spatial location.
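
A minimal sketch of this joint belief (not the paper's implementation; the object names, 3x3 search grid, and detector error rates below are hypothetical) shows how a single visual observation updates both dimensions at once:

from itertools import product

# Hypothetical candidate identities and a 3x3 grid of search locations.
candidate_objects = ["red_mug", "blue_mug", "water_bottle"]
grid_cells = [(x, y) for x in range(3) for y in range(3)]

# Uniform joint prior b(identity, location).
belief = {s: 1.0 / (len(candidate_objects) * len(grid_cells))
          for s in product(candidate_objects, grid_cells)}

def detection_likelihood(detected, query_class, obj, cell, looked_at_cell,
                         p_true_pos=0.8, p_false_pos=0.05):
    """P(detector fires for query_class | target is obj at cell, camera aimed at looked_at_cell)."""
    target_in_view = (cell == looked_at_cell) and (obj == query_class)
    p_fire = p_true_pos if target_in_view else p_false_pos
    return p_fire if detected else 1.0 - p_fire

def update_belief(belief, detected, query_class, looked_at_cell):
    """Bayes update of the joint belief after one detection attempt."""
    posterior = {
        (obj, cell): prob * detection_likelihood(detected, query_class, obj, cell, looked_at_cell)
        for (obj, cell), prob in belief.items()
    }
    z = sum(posterior.values())
    return {k: v / z for k, v in posterior.items()}

# Example: the robot looks at cell (1, 1) for "red_mug" and sees nothing,
# so probability mass shifts away from ("red_mug", (1, 1)).
belief = update_belief(belief, detected=False, query_class="red_mug", looked_at_cell=(1, 1))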

Simulation results show that multimodal fusion significantly outperforms unimodal baselines, achieving an average success rate of 89% ± 7% across challenging environments and object categories. We demonstrate the full system on a Boston Dynamics Spot quadruped mobile manipulator, where real-world experiments qualitatively validate robust multimodal perception and uncertainty reduction under ambiguous human instructions.


Video Demonstration


System Overview

System Diagram

The system architecture, showing the multimodal perception modules (language, gesture, vision), the fusion component, the POMDP belief update, and planning with a PO-UCT solver.
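
As a rough, hypothetical illustration of the fusion and belief-update stages in the diagram (the grounding scores, pointing-cone model, and all names below are assumptions, not the paper's code), a language cue can reweight the identity dimension of the joint belief while a pointing gesture reweights the location dimension:

import math
from itertools import product

# Same hypothetical joint belief as in the earlier sketch.
candidate_objects = ["red_mug", "blue_mug", "water_bottle"]
grid_cells = [(x, y) for x in range(3) for y in range(3)]
belief = {s: 1.0 / (len(candidate_objects) * len(grid_cells))
          for s in product(candidate_objects, grid_cells)}

def language_likelihood(utterance_scores, obj):
    """Per-identity likelihood, e.g. normalized scores from a grounding model for the utterance."""
    return utterance_scores.get(obj, 1e-3)

def gesture_likelihood(cell, pointing_origin, pointing_dir, kappa=4.0):
    """Cone-shaped spatial likelihood: cells closer to the pointing ray score higher."""
    dx, dy = cell[0] - pointing_origin[0], cell[1] - pointing_origin[1]
    dist = math.hypot(dx, dy) or 1e-6
    cos_bearing = (dx * pointing_dir[0] + dy * pointing_dir[1]) / dist
    return math.exp(kappa * cos_bearing)  # von Mises-style weighting of the bearing

def fuse_cues(belief, utterance_scores, pointing_origin, pointing_dir):
    """Reweight the joint belief by the product of language and gesture likelihoods, then renormalize."""
    posterior = {
        (obj, cell): prob
        * language_likelihood(utterance_scores, obj)
        * gesture_likelihood(cell, pointing_origin, pointing_dir)
        for (obj, cell), prob in belief.items()
    }
    z = sum(posterior.values())
    return {k: v / z for k, v in posterior.items()}

# Example: "grab that red one" scores red_mug highest; the person points in the +x direction.
belief = fuse_cues(
    belief,
    utterance_scores={"red_mug": 0.7, "blue_mug": 0.2, "water_bottle": 0.1},
    pointing_origin=(0.0, 1.0),
    pointing_dir=(1.0, 0.0),  # assumed unit vector
)

After a fusion step like this, the PO-UCT planner in the diagram selects the next action by Monte Carlo tree search over the updated belief.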

Citation

If you find this work useful in your research, please consider citing:

@inproceedings{he2026legs,
  title     = {LEGS-POMDP: Language and Gesture-Guided Object Search
               in Partially Observable Environments},
  author    = {He, Ivy and Tellex, Stefanie and Liu, Jason Xinyu},
  booktitle = {Proceedings of the ACM/IEEE International Conference on
               Human-Robot Interaction (HRI 2026)},
  year      = {2026},
  address   = {Edinburgh, Scotland, UK}
}

Acknowledgments

This work was supported in part by the National Science Foundation under Award No. 2433429 through the AI Research Institute on Interaction for AI Assistants (ARIA) and the Long-Term Autonomy for Ground and Aquatic Robotics program (Grant No. GR5250131), and by the Office of Naval Research under Agreement No. N000142412784.