LOVON: Legged Open-Vocabulary Object Navigator

1The Hong Kong University of Science and Technology (Guangzhou)
2Beijing Innovation Center of Humanoid Robotics
3The Hong Kong University of Science and Technology
LOVON System Overview

LOVON is a novel framework that integrates large language models (LLMs) for hierarchical task planning with open-vocabulary visual detection models, tailored for effective long-range object navigation in dynamic, unstructured environments.

Abstract

Object navigation in open-world environments remains a formidable and pervasive challenge for robotic systems, particularly when it comes to executing long-horizon tasks that require both open-world object detection and high-level task planning. Traditional methods often struggle to integrate these components effectively, limiting their ability to handle complex, long-range navigation missions. In this paper, we propose LOVON, a novel framework that integrates large language models (LLMs) for hierarchical task planning with open-vocabulary visual detection models, tailored for effective long-range object navigation in dynamic, unstructured environments. To tackle real-world challenges including visual jittering, blind zones, and temporary target loss, we design dedicated solutions such as Laplacian Variance Filtering for visual stabilization. We also develop a functional execution logic for the robot that guarantees LOVON's capabilities in autonomous navigation, task adaptation, and robust task completion. Extensive evaluations demonstrate the successful completion of long-sequence tasks involving real-time detection, search, and navigation toward open-vocabulary dynamic targets. Furthermore, real-world experiments across different legged robots (Unitree Go2, B2, and H1-2) showcase the compatibility and appealing plug-and-play feature of LOVON.

Video

LOVON Pipeline

First, the LLM task planner decomposes the human's task into basic instructions, while the detection model processes the video stream with Laplacian variance filtering. The mission instruction, target object, bounding box, and robot states are then fed to the Language-to-Motion Model, which generates the robot's control vector and feedback, progressively completing all tasks.
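The Laplacian variance test is a standard sharpness metric, so the stabilization step can be pictured as a gate in front of the detector. Below is a minimal Python sketch under that reading; the blur threshold and the detect callback are illustrative assumptions, not values taken from the paper.

import cv2


def laplacian_variance(frame) -> float:
    """Sharpness score: variance of the Laplacian of the grayscale frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()


def filtered_detections(frames, detect, blur_threshold=100.0):
    """Run the detector only on sharp frames; blurred frames (e.g., motion
    jitter during locomotion) reuse the last valid detection instead of
    producing a flickering bounding box."""
    last_result = None
    for frame in frames:
        if laplacian_variance(frame) >= blur_threshold:  # frame is sharp
            last_result = detect(frame)
        yield last_result  # None until the first sharp frame arrives

Low-variance frames are exactly those blurred by gait-induced camera shake, which is why this simple statistic works well as a jitter filter.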


Simulation Results

As shown in the Table, our method, LOVON, outperforms several baseline approaches, achieving a perfect success rate (SR) of 1.00 in most environments, including ParkingLot, UrbanCity, and SnowVillage. Compared to EVT, LOVON demonstrates superior tracking performance (e.g., 500/1.00 vs. 484/0.92 in ParkingLot). Even against the state-of-the-art TrackVLA, which also reaches 1.00 SR but requires 360 hours of training, LOVON stands out with a training time of just 1.5 hours, offering both high accuracy and significant efficiency.


Multi-Embodiment

LOVON is designed as a multi-embodiment system capable of operating on various legged robots. Here we show examples of LOVON running on the H1-2, a bipedal humanoid, and on the Go2 and B2, which are quadruped robots. The robots navigate complex environments, detecting and tracking objects in real time.
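The plug-and-play behavior suggests an embodiment-agnostic control interface. The sketch below illustrates one plausible arrangement, assuming LOVON emits a (vx, vy, yaw_rate) control vector and a thin per-robot adapter forwards it to each platform's high-level velocity interface; the adapter internals are placeholders, not the actual Unitree SDK.

from dataclasses import dataclass


@dataclass
class ControlVector:
    vx: float        # forward velocity (m/s)
    vy: float        # lateral velocity (m/s)
    yaw_rate: float  # turning rate (rad/s)


class RobotAdapter:
    """One adapter per embodiment; the rest of LOVON never sees the SDK."""
    def send(self, cmd: ControlVector) -> None:
        raise NotImplementedError


class Go2Adapter(RobotAdapter):
    def send(self, cmd: ControlVector) -> None:
        # Placeholder: a real adapter would forward cmd to the robot's
        # velocity-command API here.
        print(f"[go2] vx={cmd.vx:.2f} vy={cmd.vy:.2f} wz={cmd.yaw_rate:.2f}")


Go2Adapter().send(ControlVector(vx=0.8, vy=0.0, yaw_rate=0.1))

Swapping the H1-2 for the Go2 then only means swapping the adapter, which is what makes the system portable across embodiments.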

Open-World Object Seeking

LOVON is capable of operating in a variety of environments, both indoor and outdoor. Here we show examples of LOVON running in multiple settings: indoors in an office, a library, a tea room, and on stairs; outdoors in a parking area, on a playground, and in wild grass. The robot traverses sand and grass while detecting and tracking a backpack on the playground in real time.

Long-Horizon Tasks

LOVON is capable of handling long-horizon tasks, such as navigating to multiple targets within a single mission. Here we show an example of LOVON reaching three different targets in a long-horizon task: the robot is instructed to navigate first to the backpack, then to the chair, and finally to the person.
Instruction: "You are a robot, you need firstly run to the backpack. After that move to the chair at 0.5 m/s and then approach the person fastly."
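As a concrete illustration of the hierarchical planning step, the sketch below shows how this instruction could be decomposed into an ordered queue of basic instructions. The (action, target, speed) schema and the assumed speeds for "run" and "fastly" are illustrative; the page does not specify the planner's actual output format.

from dataclasses import dataclass


@dataclass
class Subtask:
    action: str   # e.g., "run", "move", "approach"
    target: str   # open-vocabulary object name
    speed: float  # commanded speed in m/s


# A plausible decomposition of the instruction above; each subtask is
# executed until its target is reached, then the next one is popped.
plan = [
    Subtask(action="run", target="backpack", speed=1.0),     # assumed speed
    Subtask(action="move", target="chair", speed=0.5),       # stated 0.5 m/s
    Subtask(action="approach", target="person", speed=1.0),  # "fastly": assumed
]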

Robustness

Recapture the Lost Target

LOVON is robust to visual disturbances such as motion blur, occlusion, and dynamic state changes. Here we show an example in which the view of the umbrella is temporarily blocked, disturbing the visual input; LOVON recovers from the disturbance and continues approaching the target.
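The recovery behavior can be pictured as a small state machine inside the execution logic: when the detector loses the target, the robot switches to a search behavior until the target is reacquired. A minimal sketch follows, assuming two states and a detector that returns None when the target is not visible; it illustrates the idea rather than the paper's actual implementation.

from enum import Enum, auto


class State(Enum):
    TRACK = auto()   # target visible: approach it
    SEARCH = auto()  # target lost: rotate in place to reacquire it


def step(detection):
    """One control tick; returns the new state and a (v, yaw_rate) command."""
    if detection is None:
        # Target lost (e.g., occluded): stop advancing and turn to search.
        return State.SEARCH, (0.0, 0.5)  # assumed search turn rate (rad/s)
    # Target visible: resume approaching it.
    return State.TRACK, (0.8, 0.0)       # assumed approach speed (m/s)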

Dynamic Tracking

LOVON can also track dynamic objects in complex, unstructured environments. Here we show an example of tracking a person through real-world wild grass: the person moves around the scene, and LOVON keeps the person in view throughout. We also show an example of tracking a person with the H1-2 robot in a real-world environment: the robot detects and tracks the person in real time, even as the person moves around, while maintaining a safe distance.
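Keeping a safe distance from a monocular camera can be done with the bounding-box size as a distance proxy: a proportional controller drives the box height toward a setpoint, so the robot advances when the person is far and backs off when too close. The sketch below illustrates this; the setpoint and gain are assumptions, not parameters from the paper.

def follow_velocity(bbox_height: float, target_height: float = 0.4,
                    kp: float = 2.0, v_max: float = 1.2) -> float:
    """Forward velocity (m/s) from normalized bbox height in [0, 1].

    A smaller-than-target box means the person is far (move forward);
    a larger-than-target box means too close (back off, negative v).
    """
    error = target_height - bbox_height
    return max(-v_max, min(v_max, kp * error))


print(follow_velocity(0.2))  # far away  -> +0.4 m/s forward
print(follow_velocity(0.7))  # too close -> -0.6 m/s back off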

Challenging Terrain

LOVON is capable of navigating challenging terrain such as stairs, uneven surfaces, and gravel. Here we show an example of LOVON climbing a spiral staircase, detecting and tracking the target as it ascends.

BibTeX

@article{daojie2025lovon,
  title={LOVON: Legged Open-Vocabulary Object Navigator},
  author={Peng, Daojie and Cao, Jiahang and Zhang, Qiang and Ma, Jun},
  journal={arXiv preprint},
  year={2025}
}