M3E: Continual Vision-and-Language Navigation
via Mixture of Macro and Micro Experts

Yongliang Jiang1, Huaidong Zhang1,*, Xuandi Luo1, Shengfeng He2
1South China University of Technology, 2Singapore Management University
*Corresponding Author

Abstract

Vision-and-Language Navigation (VLN) agents have shown strong capabilities in following natural language instructions. However, they often struggle to generalize across environments due to catastrophic forgetting, which limits their practical use in real-world settings where agents must continually adapt to new domains. We argue that overcoming forgetting across environments hinges on decoupling global scene reasoning from local perceptual alignment, allowing the agent to adapt to new domains while preserving specialized capabilities.

To this end, we propose M3E, the Mixture of Macro and Micro Experts, an environment-aware hierarchical MoE framework for continual VLN. Our method introduces a dual-router architecture that separates navigation into two levels of reasoning. A macro-level, scene-aware router selects strategy experts based on global environmental features (e.g., office vs. residential), while a micro-level, instance-aware router activates perception experts based on local instruction-vision alignment for step-wise decision making. To preserve knowledge across domains, we adopt a dynamic momentum update strategy that estimates each expert's utility in new environments and selectively updates or freezes its parameters accordingly.
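The selective update/freeze idea can be sketched as a per-expert momentum blend: experts that proved useful on earlier domains change slowly, while underused experts adapt quickly to the new one. The momentum coefficients and the usage threshold below are illustrative placeholders, not values from the paper.

```python
import torch

def momentum_update_experts(experts_old, experts_new, usage,
                            m_high=0.999, m_low=0.9, thresh=0.5):
    """Blend previous and newly adapted expert parameters.

    experts_old / experts_new: lists of parameter tensors, one per expert.
    usage: per-expert utility scores on past domains (hypothetical signal).
    A heavily used expert (usage >= thresh) gets a high momentum, so its
    parameters barely move and old skills are preserved; a rarely used
    expert gets a low momentum and is free to specialize on the new domain.
    """
    updated = []
    for p_old, p_new, u in zip(experts_old, experts_new, usage):
        m = m_high if u >= thresh else m_low
        updated.append(m * p_old + (1.0 - m) * p_new)
    return updated
```

Setting `m_high = 1.0` would freeze high-utility experts outright; values just below 1 instead allow a slow drift, which is the usual trade-off between stability and plasticity.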

We evaluate M3E in a domain-incremental setting on the R2R and REVERIE datasets, where agents learn across unseen scenes without revisiting prior data. Results show that our method consistently outperforms standard fine-tuning and existing continual learning baselines in both adaptability and knowledge retention, offering a parameter-efficient solution for building generalizable embodied agents.

Method

M3E Architecture Overview

Overall architecture of M3E. The framework decouples macro-level scene reasoning and micro-level token grounding via a dual-router design. The Macro Router (blue) builds a task-aware scene representation using GNN-based propagation over a cognitive map. The Micro Router (purple) computes token-wise expert weights from hidden states. Both signals are fused to route experts in the MoE-LoRA layers for planning.
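To make the fused routing concrete, here is a minimal PyTorch sketch of an MoE-LoRA layer whose expert mixture is gated by both signals: a scene-level macro gate (shared across tokens) and a token-wise micro gate from hidden states. Class and parameter names, shapes, and the multiplicative fusion are our assumptions for illustration, not the paper's exact design; the GNN-based cognitive map that would produce `macro_gate` is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRALinear(nn.Module):
    """Frozen base linear layer plus a pool of LoRA experts,
    mixed by fused macro (scene-level) and micro (token-level) gates."""

    def __init__(self, d_in, d_out, n_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)   # backbone weight stays frozen
        # Low-rank expert factors: delta_e = A_e @ B_e
        self.lora_a = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.micro_router = nn.Linear(d_in, n_experts)  # token-wise gate

    def forward(self, x, macro_gate):
        # x: (B, S, d_in); macro_gate: (B, n_experts), from the scene router
        micro_gate = F.softmax(self.micro_router(x), dim=-1)       # (B, S, E)
        gate = micro_gate * macro_gate.unsqueeze(1)                # fuse signals
        gate = gate / gate.sum(-1, keepdim=True).clamp_min(1e-8)   # renormalize
        # Per-expert low-rank updates applied to every token: (B, S, E, d_out)
        delta = torch.einsum("bsi,eir,ero->bseo", x, self.lora_a, self.lora_b)
        return self.base(x) + (gate.unsqueeze(-1) * delta).sum(dim=2)
```

Multiplying the two gates lets the scene context veto experts that are irrelevant to the current environment while the token gate still differentiates steps within it.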

Experiments

Table 1. Domain-incremental learning results on R2R. Methods are categorized by strategy: Reg (regularization), Reh (rehearsal), and RF (replay-free).

Method       Strategy   AvgSR% ↑   AvgSPL% ↑   AvgNE ↓   BWT ↑   FWT ↑
Finetune     –          63.28      59.08       3.72      -5.42   -2.41
L2           Reg        58.78      56.20       4.23      -5.10   -3.43
EWC          Reg        64.15      60.21       3.60      -3.50   -2.80
ER           Reh        66.35      62.10       3.45      -1.50    0.50
PerR         Reh        67.05      62.93       3.38      -1.35    0.62
ESR          Reh        68.12      63.88       3.25      -1.10    0.85
Dual-SR      RF         70.25      65.40       3.05      -0.45    1.85
M3E (ours)   RF         71.92      66.96       2.95       0.04    2.15
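The BWT and FWT columns follow the standard continual-learning definitions: backward transfer measures how much training on later domains degrades (negative) or improves (positive) earlier ones, and forward transfer measures zero-shot gain on a domain before training on it. A minimal sketch, assuming a T x T success-rate matrix and an independent per-domain baseline:

```python
import numpy as np

def bwt_fwt(acc, base):
    """Backward/forward transfer from a T x T success-rate matrix.

    acc[i, j] = success rate on domain j after training up to domain i.
    base[j]   = success rate of a reference model that never trained on
                domain j (used as the FWT baseline).
    """
    T = acc.shape[0]
    # BWT: final performance on each earlier domain vs. right after learning it.
    bwt = np.mean([acc[T - 1, j] - acc[j, j] for j in range(T - 1)])
    # FWT: performance on each new domain just before training on it vs. baseline.
    fwt = np.mean([acc[j - 1, j] - base[j] for j in range(1, T)])
    return bwt, fwt
```

Under these definitions, M3E's positive BWT (0.04) means later domains on average slightly improve performance on earlier ones, rather than merely limiting forgetting.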