LOVON: Legged Open-Vocabulary Object Navigator

1 The Hong Kong University of Science and Technology (Guangzhou)
2 Beijing Innovation Center of Humanoid Robotics
3 The Hong Kong University of Science and Technology
Introduction to the LOVON System

LOVON is a novel system that integrates LLMs for hierarchical task planning with open-vocabulary visual detection and legged robot mobility.

Abstract

Object navigation in open-world environments remains a critical challenge for robotic systems. Despite advancements in large language models (LLMs) for task planning, open-vocabulary vision models for object detection, and versatile legged robots capable of traversing complex terrains, existing approaches lack a unified navigation framework to execute composite long-range missions. We propose LOVON, a novel system that integrates LLMs for hierarchical task planning with open-vocabulary visual detection and legged robot mobility. To address real-world challenges including visual jittering, blind zones, and temporary target loss, we design dedicated solutions such as Laplacian Variance Filtering for visual stabilization. Extensive evaluations on Go2, B2, and H1-2 legged platforms demonstrate successful completion of long-sequence tasks involving real-time detection, search, and navigation toward open-vocabulary dynamic targets. To the best of our knowledge, this work presents the first operational system achieving such capabilities in unstructured environments.

Video

LOVON Pipeline

The LOVON system is designed to integrate LLMs for hierarchical task planning with open-vocabulary visual detection and legged robot mobility. The pipeline consists of several key components, including visual stabilization, object detection, and navigation planning. The system is capable of operating in various environments, including indoor and outdoor settings, and can handle dynamic objects in real-time.
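The abstract names Laplacian Variance Filtering as the visual stabilization step but does not spell out how it works. The sketch below shows the generic variance-of-Laplacian sharpness measure that the name suggests, in pure Python. It is an illustration under stated assumptions, not LOVON's implementation: the function names, the threshold value, and the policy of falling back to the last detection from a sharp frame are all hypothetical.

```python
from statistics import pvariance


def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the image interior.

    `gray` is a list of equal-length rows of numbers (at least 3x3).
    Blurry frames have weak edges, so the variance is low.
    """
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    return pvariance(responses)


def is_sharp(gray, threshold=100.0):
    # The threshold is a hypothetical value; it would need tuning per camera.
    return laplacian_variance(gray) >= threshold


def filter_detections(frames_with_detections, threshold=100.0):
    """Keep detections from sharp frames; for blurry frames, reuse the
    last detection that came from a sharp frame (an assumed policy)."""
    last_good = None
    filtered = []
    for gray, detection in frames_with_detections:
        if is_sharp(gray, threshold):
            last_good = detection
        filtered.append(last_good)
    return filtered
```

With this policy, a detection produced during a shaky stride is ignored and the controller keeps acting on the most recent trustworthy one, which is one plausible way to damp visual jitter on a legged platform.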

Simulation Results

As shown in the table, our method, LOVON, outperforms several baseline approaches, achieving a perfect success rate (SR) of 1.00 across most environments, including ParkingLot, UrbanCity, and SnowVillage. Compared to EVT, LOVON demonstrates superior tracking performance, e.g., 500/1.00 vs. 484/0.92 in ParkingLot. Even when compared to the state-of-the-art TrackVLA, which achieves 1.00 SR but requires 360 hours of training, LOVON stands out with a training time of just 1.5 hours (240 times shorter), offering both high accuracy and significant efficiency.

LOVON Baseline

Multi-Embodiment

LOVON is designed to be a multi-embodiment system, capable of operating on various legged robots. Here we show examples of LOVON running on the H1-2, a bipedal humanoid robot, and on the Go2 and B2, which are quadruped robots. The robots navigate through a complex environment, detecting and tracking objects in real time.

Open-world Object Seeking

LOVON is capable of operating in various environments, both indoor and outdoor. Here we show examples of LOVON running in multiple environments: indoors, such as an office, a library, a tea room, and stairs; and outdoors, such as a parking area, a playground, and wild grass. The robot is able to traverse sand and grass, detecting and tracking a backpack on the playground in real time.

Robustness

Recapture the Lost Target

LOVON is robust to visual disturbances such as motion blur, occlusion, and dynamic state changes. Here we show an example in which the umbrella in the scene is blocked from view, disrupting visual detection. LOVON recovers from this disturbance and continues approaching the target.
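One common way to structure recovery from a temporarily lost target is a small state machine: keep approaching while the target is detected, coast for a few frames when it disappears, then rotate toward the side where it was last seen. The sketch below illustrates that idea only. The class, the mode names, the gains, and the patience value are hypothetical and are not taken from LOVON.

```python
from dataclasses import dataclass


@dataclass
class Recapture:
    patience: int = 5         # hypothetical: missed frames tolerated before searching
    turn_gain: float = 0.5    # hypothetical: steering gain while approaching
    search_rate: float = 0.3  # hypothetical: turn rate while searching
    missed: int = 0
    last_side: int = 1        # +1: last seen right of image centre, -1: left

    def step(self, detection_x):
        """detection_x is the target's horizontal offset in [-1, 1],
        or None when the detector returns nothing this frame.
        Returns (mode, turn); positive turn steers toward the image's right."""
        if detection_x is not None:
            self.missed = 0
            if detection_x != 0:
                self.last_side = 1 if detection_x > 0 else -1
            return "approach", self.turn_gain * detection_x
        self.missed += 1
        if self.missed <= self.patience:
            # Brief dropouts (blur, short occlusion): hold heading and wait.
            return "coast", 0.0
        # Longer loss: rotate toward the side where the target was last seen.
        return "search", self.search_rate * self.last_side
```

Separating a short coasting window from the search behaviour keeps the robot from spinning on every single missed frame while still reacting to a genuine occlusion.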

Dynamic Tracking

LOVON can also track dynamic objects in complex, unstructured environments. Here we show an example of tracking a person in real-world wild grass: the person moves around the scene, and LOVON keeps the person in view while tracking them. We also show an example of tracking a person with the H1-2 robot in a real-world environment. The robot detects and tracks the person in real time, even while the person is moving, and stays a safe distance from them.
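Keeping a safe distance from a tracked person can be illustrated with a clamped proportional controller on the range error. This is a minimal sketch assuming a distance estimate is available. The function name, the safe distance, the gain, and the speed limit are hypothetical values chosen for illustration, not parameters reported for LOVON.

```python
def follow_speed(distance_m, safe_distance_m=1.5, gain=0.8, max_speed=1.0):
    """Forward speed command for following a person.

    Positive output moves toward the person, negative backs away,
    and zero holds position at exactly the safe distance.
    """
    error = distance_m - safe_distance_m
    # Clamp so a distant target never commands more than max_speed.
    return max(-max_speed, min(max_speed, gain * error))
```

For example, `follow_speed(2.0)` commands a gentle advance, while `follow_speed(0.5)` commands a retreat because the person is closer than the assumed 1.5 m margin.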

Challenging Terrain

LOVON is capable of navigating challenging terrain such as stairs, uneven surfaces, and gravel. Here we show an example of LOVON climbing a spiral staircase. The robot is able to detect and track the target while negotiating the stairs.

BibTeX

@article{daojie2025lovon,
  title={LOVON: Legged Open-Vocabulary Object Navigator},
  author={Peng, Daojie and Cao, Jiahang and Zhang, Qiang and Ma, Jun},
  journal={arxiv},
  year={2025},
}