M3E: Continual Vision-and-Language Navigation
via Mixture of Macro and Micro Experts

Yongliang Jiang1, Huaidong Zhang1,*, Xuandi Luo1, Shengfeng He2
1South China University of Technology, 2Singapore Management University
*Corresponding Author

Abstract

Vision-and-Language Navigation (VLN) agents have shown strong capabilities in following natural language instructions. However, they often struggle to generalize across environments due to catastrophic forgetting, which limits their practical use in real-world settings where agents must continually adapt to new domains. We argue that overcoming forgetting across environments hinges on decoupling global scene reasoning from local perceptual alignment, allowing the agent to adapt to new domains while preserving specialized capabilities.

To this end, we propose M3E, the Mixture of Macro and Micro Experts, an environment-aware hierarchical MoE framework for continual VLN. Our method introduces a dual-router architecture that separates navigation into two levels of reasoning. A macro-level, scene-aware router selects strategy experts based on global environmental features (e.g., office vs. residential), while a micro-level, instance-aware router activates perception experts based on local instruction-vision alignment for step-wise decision making. To preserve knowledge across domains, we adopt a dynamic momentum update strategy that estimates each expert's utility in new environments and selectively updates or freezes its parameters accordingly.
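The selective update/freeze idea can be sketched as a per-expert momentum blend: experts that proved useful on earlier domains change slowly, while underused experts adapt quickly to the new one. The momentum coefficients and the usage threshold below are illustrative placeholders, not values from the paper.

```python
import torch

def momentum_update_experts(experts_old, experts_new, usage,
                            m_high=0.999, m_low=0.9, thresh=0.5):
    """Blend previous and newly adapted expert parameters.

    experts_old / experts_new: lists of parameter tensors, one per expert.
    usage: per-expert utility scores on past domains (hypothetical signal).
    A heavily used expert (usage >= thresh) gets a high momentum, so its
    parameters barely move and old skills are preserved; a rarely used
    expert gets a low momentum and is free to specialize on the new domain.
    """
    updated = []
    for p_old, p_new, u in zip(experts_old, experts_new, usage):
        m = m_high if u >= thresh else m_low
        updated.append(m * p_old + (1.0 - m) * p_new)
    return updated
```

Setting `m_high = 1.0` would freeze high-utility experts outright; values just below 1 instead allow a slow drift, which is the usual trade-off between stability and plasticity.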

We evaluate M3E in a domain-incremental setting on the R2R and REVERIE datasets, where agents learn across unseen scenes without revisiting prior data. Results show that our method consistently outperforms standard fine-tuning and existing continual learning baselines in both adaptability and knowledge retention, offering a parameter-efficient solution for building generalizable embodied agents.

Method

M3E Architecture Overview

Overall architecture of M3E. The framework decouples macro-level scene reasoning and micro-level token grounding via a dual-router design. The Macro Router (blue) builds a task-aware scene representation using GNN-based propagation over a cognitive map. The Micro Router (purple) computes token-wise expert weights from hidden states. Both signals are fused to route experts in the MoE-LoRA layers for planning.
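To make the fused routing concrete, here is a minimal PyTorch sketch of an MoE-LoRA layer whose expert mixture is gated by both signals: a scene-level macro gate (shared across tokens) and a token-wise micro gate from hidden states. Class and parameter names, shapes, and the multiplicative fusion are our assumptions for illustration, not the paper's exact design; the GNN-based cognitive map that would produce `macro_gate` is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRALinear(nn.Module):
    """Frozen base linear layer plus a pool of LoRA experts,
    mixed by fused macro (scene-level) and micro (token-level) gates."""

    def __init__(self, d_in, d_out, n_experts=4, rank=8):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)   # backbone weight stays frozen
        # Low-rank expert factors: delta_e = A_e @ B_e
        self.lora_a = nn.Parameter(torch.randn(n_experts, d_in, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(n_experts, rank, d_out))
        self.micro_router = nn.Linear(d_in, n_experts)  # token-wise gate

    def forward(self, x, macro_gate):
        # x: (B, S, d_in); macro_gate: (B, n_experts), from the scene router
        micro_gate = F.softmax(self.micro_router(x), dim=-1)       # (B, S, E)
        gate = micro_gate * macro_gate.unsqueeze(1)                # fuse signals
        gate = gate / gate.sum(-1, keepdim=True).clamp_min(1e-8)   # renormalize
        # Per-expert low-rank updates applied to every token: (B, S, E, d_out)
        delta = torch.einsum("bsi,eir,ero->bseo", x, self.lora_a, self.lora_b)
        return self.base(x) + (gate.unsqueeze(-1) * delta).sum(dim=2)
```

Multiplying the two gates lets the scene context veto experts that are irrelevant to the current environment while the token gate still differentiates steps within it.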

Experiments

Table 1. Domain-incremental learning results on R2R. Methods are categorized by strategy: Reg (regularization), Reh (rehearsal), and RF (replay-free).

Method       Strategy   AvgSR% ↑   AvgSPL% ↑   AvgNE ↓   BWT ↑   FWT ↑
Finetune     –          63.28      59.08       3.72      -5.42   -2.41
L2           Reg        58.78      56.20       4.23      -5.10   -3.43
EWC          Reg        64.15      60.21       3.60      -3.50   -2.80
ER           Reh        66.35      62.10       3.45      -1.50    0.50
PerR         Reh        67.05      62.93       3.38      -1.35    0.62
ESR          Reh        68.12      63.88       3.25      -1.10    0.85
Dual-SR      RF         70.25      65.40       3.05      -0.45    1.85
M3E (ours)   RF         71.92      66.96       2.95       0.04    2.15
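The BWT and FWT columns follow the standard continual-learning definitions: backward transfer measures how much training on later domains degrades (negative) or improves (positive) earlier ones, and forward transfer measures zero-shot gain on a domain before training on it. A minimal sketch, assuming a T x T success-rate matrix and an independent per-domain baseline:

```python
import numpy as np

def bwt_fwt(acc, base):
    """Backward/forward transfer from a T x T success-rate matrix.

    acc[i, j] = success rate on domain j after training up to domain i.
    base[j]   = success rate of a reference model that never trained on
                domain j (used as the FWT baseline).
    """
    T = acc.shape[0]
    # BWT: final performance on each earlier domain vs. right after learning it.
    bwt = np.mean([acc[T - 1, j] - acc[j, j] for j in range(T - 1)])
    # FWT: performance on each new domain just before training on it vs. baseline.
    fwt = np.mean([acc[j - 1, j] - base[j] for j in range(1, T)])
    return bwt, fwt
```

Under these definitions, M3E's positive BWT (0.04) means later domains on average slightly improve performance on earlier ones, rather than merely limiting forgetting.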