SOL-Nav — Structured Observation Language for Vision-Language Navigation

Abstract

Vision-Language Navigation (VLN) requires an embodied agent to navigate complex environments by following natural language instructions, which typically demands tight fusion of visual and language modalities. Existing VLN methods often convert raw images into visual tokens or implicit features, requiring large-scale visual pre-training and suffering from poor generalization under environmental variations (e.g., lighting, texture).

To address these issues, we propose SOL-Nav (Structured Observation Language for Navigation), a novel framework that translates egocentric visual observations into compact structured language descriptions for efficient and generalizable navigation. Specifically, we divide RGB-D images into an N×N grid, extract representative semantic, color, and depth information for each grid cell to form structured text, and concatenate this with the language instruction as pure language input to a pre-trained language model (PLM). Experimental results on standard VLN benchmarks (R2R-CE, RxR-CE) and real-world deployments demonstrate that SOL-Nav significantly reduces the model size and training data dependency, fully leverages the reasoning and representation capabilities of PLMs, and achieves strong generalization to unseen environments.
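The grid-to-text step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the choice of "representative" statistics (majority semantic label, named color nearest to the cell's mean RGB, median depth), the `PALETTE` vocabulary, and the output row format are all assumptions made for the example.

```python
from collections import Counter
from statistics import median

# Hypothetical color vocabulary; the source does not list the one SOL-Nav uses.
PALETTE = {
    "black": (0, 0, 0), "white": (255, 255, 255), "gray": (128, 128, 128),
    "red": (255, 0, 0), "green": (0, 128, 0), "blue": (0, 0, 255),
    "brown": (139, 69, 19),
}

def nearest_color(rgb):
    """Name of the palette entry closest to rgb in squared RGB distance."""
    return min(PALETTE, key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, PALETTE[name])))

def describe_grid(labels, colors, depths, n):
    """Summarize H x W label/color/depth maps as n*n text rows (requires H, W >= n)."""
    h, w = len(labels), len(labels[0])
    rows = []
    for gy in range(n):
        for gx in range(n):
            cell = [(y, x)
                    for y in range(gy * h // n, (gy + 1) * h // n)
                    for x in range(gx * w // n, (gx + 1) * w // n)]
            # Assumed statistics: majority label, mean color, median depth.
            label = Counter(labels[y][x] for y, x in cell).most_common(1)[0][0]
            mean_rgb = tuple(sum(colors[y][x][c] for y, x in cell) / len(cell) for c in range(3))
            depth = median(depths[y][x] for y, x in cell)
            rows.append(f"({gy},{gx}) {label}, {nearest_color(mean_rgb)}, {depth:.1f}m")
    return rows
```

For a 4×4 toy image whose left half is a white wall at 2.0 m and whose right half is a brown door at 3.5 m, `describe_grid(..., n=2)` yields four rows such as `(0,0) wall, white, 2.0m`, which is the kind of compact text that would be concatenated with the instruction.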

Video Demonstration

SOL-Nav's performance on simulated benchmarks and real-world robotic deployments with Unitree Go2.

Core Advantages

A pure-language approach to vision-language navigation that is efficient, generalizable, and deployable.

Reduced Training Cost

Eliminates training visual encoders from scratch. Fine-tuning the PLM with LoRA requires only small-scale navigation data, which sharply reduces computational overhead.
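To make the saving concrete: LoRA keeps a pre-trained weight matrix frozen and trains only two low-rank factors, A (rank × d_in) and B (d_out × rank). The sketch below counts trainable parameters for a single square projection; the hidden size and rank are illustrative assumptions, not the configuration used for SOL-Nav.

```python
def lora_trainable_params(d_in, d_out, rank):
    """Parameters in the low-rank factors A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)

def full_params(d_in, d_out):
    """Parameters in the dense weight matrix that LoRA leaves frozen."""
    return d_in * d_out

hidden = 1024  # assumed hidden size, for illustration only
rank = 8       # assumed LoRA rank, for illustration only

trainable = lora_trainable_params(hidden, hidden, rank)
ratio = trainable / full_params(hidden, hidden)
print(f"{trainable} trainable vs {full_params(hidden, hidden)} frozen ({ratio:.2%})")
```

Under these assumed sizes, the adapter holds 16,384 parameters against 1,048,576 in the frozen matrix, about 1.6% of the layer.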

Strong Generalization

Structured text abstracts away environmental noise (lighting and texture variations), enabling robust adaptation to unseen scenes without visual domain adaptation.

Simplified Pipeline

No complex multimodal encoders or fusion modules: a pure language model makes the navigation decisions at lower computational cost.

Real-World Deployable

Small model size (0.6B parameters) and low inference latency (0.8 s on an edge device) allow practical integration on a physical robot, the Unitree Go2.

Method Overview

RGB-D observations are converted into structured textual descriptions with multi-resolution grids encoding depth, semantic, and color information, forming a pure language prompt for a pre-trained LLM.

Fig. 1. Pipeline of SOL-Nav. RGB-D observations are converted into structured textual descriptions with 2×2 / 4×4 / 6×6 multi-resolution grids (long/short-term history, current observation) encoding depth, semantic, and color information. The structured observation sequence, navigation instruction, and system description form a pure language prompt for action prediction.
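One way to read the multi-resolution scheme in Fig. 1 is that older frames are described more coarsely than recent ones. The sketch below assigns 2×2 to long-term history, 4×4 to short-term history, and 6×6 to the current observation; that mapping is inferred from the caption's ordering, and the `short_window` cutoff is a hypothetical parameter not given in the source.

```python
def grid_size_for_age(age, short_window=4):
    """Grid resolution for a frame that is `age` steps old (0 = current frame)."""
    if age == 0:
        return 6          # current observation: finest grid
    if age <= short_window:
        return 4          # short-term history
    return 2              # long-term history: coarsest grid

def history_schedule(num_frames, short_window=4):
    """Grid sizes for the last `num_frames` frames, oldest first, current last."""
    return [grid_size_for_age(age, short_window) for age in range(num_frames - 1, -1, -1)]

def total_cells(schedule):
    """Number of grid cells that have to be verbalized for the whole history."""
    return sum(n * n for n in schedule)
```

With seven frames and `short_window=2`, the schedule is `[2, 2, 2, 2, 4, 4, 6]`, which amounts to 84 cells instead of the 252 needed if every frame used a 6×6 grid. This illustrates why coarser history keeps the prompt compact.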

Fig. 2. Structured Observation Language Prompt for LLM. The prompt integrates system description, structured observations at multiple time steps, and task instruction to guide the language model for action prediction.
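The three-part prompt in Fig. 2 could be assembled roughly as shown here. The section tags (`[System]`, `[Observation ...]`, `[Instruction]`) and the row format are placeholders chosen for this sketch; the actual prompt template is the one shown in the figure.

```python
def build_prompt(system_description, observations, instruction):
    """Join system text, per-step structured observations, and the instruction.

    observations: list of (step_label, rows) pairs, oldest first, where rows
    is a list of per-cell description strings.
    """
    parts = [f"[System]\n{system_description}"]
    for step_label, rows in observations:
        parts.append(f"[Observation {step_label}]\n" + "\n".join(rows))
    parts.append(f"[Instruction]\n{instruction}")
    return "\n\n".join(parts)

prompt = build_prompt(
    "You are a navigation agent. Choose the next actions.",
    [("t-1", ["(0,0) wall, white, 2.0m"]), ("t", ["(0,0) door, brown, 1.2m"])],
    "Walk through the door and stop.",
)
```

Because the result is plain text, no visual tokens or fusion modules are involved at this stage; the language model only ever sees a string.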

Experimental Results

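SOL-Nav achieves state-of-the-art or comparable performance with a 0.6B-parameter model, 10× smaller than SOTA multimodal models, without additional training data or waypoint predictors. On R2R-CE Val-Unseen it leads the listed baselines on all four metrics; on RxR-CE Val-Unseen it leads UniNaVid on OS and SPL, trails on NE, and is within 0.1 points on SR.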

R2R-CE Val-Unseen · No Extra Data / No Waypoint Predictor

Method	NE (↓)	OS (↑)	SR (↑)	SPL (↑)
SOL-Nav (Ours)	5.11	72.9	53.6	49.2
NaVILA	5.37	57.6	49.7	45.5
UniNaVid	5.58	53.3	47.0	42.7
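For readers unfamiliar with the column headers, the sketch below computes the four metrics from per-episode statistics using their standard VLN definitions: NE is the mean final distance to the goal in meters, OS is the oracle success rate, SR is the success rate, and SPL is success weighted by path length. The `Episode` container is hypothetical, and the 3 m success radius is the usual R2R convention, assumed here rather than stated on this page.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    final_dist: float       # distance to goal when the agent stops (m)
    min_dist: float         # closest distance to goal along the trajectory (m)
    path_length: float      # length of the path the agent actually took (m)
    shortest_length: float  # shortest-path length from start to goal (m)

def evaluate(episodes, success_radius=3.0):
    """Return NE in meters and OS, SR, SPL as percentages."""
    n = len(episodes)
    ne = sum(e.final_dist for e in episodes) / n
    oracle = sum(e.min_dist <= success_radius for e in episodes) / n
    success = sum(e.final_dist <= success_radius for e in episodes) / n
    # SPL: each success is discounted by shortest / max(taken, shortest).
    spl = sum(
        (e.final_dist <= success_radius) * e.shortest_length
        / max(e.path_length, e.shortest_length)
        for e in episodes
    ) / n
    return {"NE": ne, "OS": 100 * oracle, "SR": 100 * success, "SPL": 100 * spl}
```

An agent that passes within the radius but stops too far away counts toward OS but not SR, which is why OS is never lower than SR in the tables.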

RxR-CE Val-Unseen · No Extra Data

Method	NE (↓)	OS (↑)	SR (↑)	SPL (↑)
SOL-Nav (Ours)	6.87	60.5	48.6	42.3
UniNaVid	6.24	55.5	48.7	40.9

Ablation Study · R2R-CE Val-Unseen

Ablation	NE (↓)	OS (↑)	SR (↑)	SPL (↑)
Full Model (SOL-Nav)	5.11	72.9	53.6	49.2
Lower Res (4×4)	6.84	43.4	34.5	29.8
No Historical Obs	7.81	39.4	26.5	21.9
No Depth Info	7.98	31.2	21.6	17.8
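The size of each ablation's effect is easier to compare as a drop in success rate relative to the full model. The snippet below does that arithmetic on the SR column of the table above; the dictionary keys are shorthand introduced for this example.

```python
# SR (%) values copied from the ablation table above.
ablation_sr = {
    "full_model": 53.6,
    "lower_res_4x4": 34.5,
    "no_history": 26.5,
    "no_depth": 21.6,
}

full = ablation_sr["full_model"]
sr_drop = {
    name: round(full - sr, 1)
    for name, sr in ablation_sr.items()
    if name != "full_model"
}
for name, drop in sr_drop.items():
    print(f"{name}: -{drop} SR points")
```

Removing depth costs the most (32.0 SR points), followed by removing historical observations (27.1) and lowering the grid resolution to 4×4 (19.1).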

Real-World Deployment

SOL-Nav is deployed on a Unitree Go2 equipped with an NVIDIA Jetson AGX Orin and an Intel RealSense D435i. With TensorRT quantization, inference latency is 0.8 s on the edge device.

Fig. 3. The SOL-Nav deployment pipeline. RGB-D input from the RealSense camera is processed through SegFormer for semantic segmentation, then converted to structured observations. The prompt builder integrates observations with system description and task instruction, feeding into Qwen3-Embedding (0.6B) for multi-step action prediction on the edge device.
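The stages in Fig. 3 can be wired together as a simple per-frame loop. In this sketch every stage is an injected callable, so `segment`, `to_structured`, `compose_prompt`, and `predict_actions` are hypothetical stand-ins for the SegFormer wrapper, the grid-to-text converter, the prompt builder, and the language model; the `max_history` limit is also an assumption.

```python
from collections import deque

def make_navigator(segment, to_structured, compose_prompt, predict_actions, max_history=8):
    """Build a step function that maps one RGB-D frame to a list of actions."""
    history = deque(maxlen=max_history)  # oldest observation is dropped first

    def step(rgb, depth, instruction):
        labels = segment(rgb)                               # semantic segmentation
        history.append(to_structured(labels, rgb, depth))   # structured observation
        prompt = compose_prompt(list(history), instruction)  # pure-text model input
        return predict_actions(prompt)                      # multi-step action list

    return step
```

Keeping the stages as plain callables means each one can be swapped independently, for example replacing the segmentation model or the quantized language model without touching the loop.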

Real-World Navigation Experiments

Experiments cover three distinct indoor scenarios (Tea Area, Hall Stairs, and Meeting Room) with varying lighting, furniture density, and terrain complexity.

Fig. 4. Real-world navigation experiments in three distinct scenarios. Each column shows the task instruction, environment photo, and multi-modal perception outputs (RGB-D, depth, semantic segmentation, and color mapping), demonstrating SOL-Nav's robustness across diverse environments.

Citation

If you find SOL-Nav useful for your research, please cite our paper.

@article{peng2026structured,
  title={Structured Observation Language for Efficient and Generalizable Vision-Language Navigation},
  author={Peng, Daojie and Ma, Fulong and Ma, Jun},
  journal={arXiv preprint arXiv:2603.27577},
  year={2026}
}
