We introduce HAMLET, a history-aware fine-tuning framework for Vision-Language-Action (VLA) models that augments pre-trained policies with temporal memory while preserving their existing competencies. HAMLET summarizes instantaneous vision-language representations at each timestep into moment tokens and consolidates them through a lightweight memory module to form a temporally informed condition for action prediction. This design delivers strong gains on long-horizon, history-dependent manipulation tasks without retraining from scratch.
HAMLET is a fine-tuning framework for VLAs that introduces History-Aware Memory with LEarned Tokens.
The framework has two components: (a) moment tokens, which summarize instantaneous VLM representations at each timestep, and (b) a memory module that aggregates moment tokens across timesteps to provide a temporally informed context for action prediction.
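How the two components fit together can be illustrated with a minimal PyTorch sketch. It assumes a transformer-style VLM backbone that maps a sequence of token embeddings to hidden states of the same length; the class and parameter names (`MomentMemoryPolicy`, `num_moment_tokens`, `action_dim`), the mean-pooled summary, and the small causal transformer used as the memory module are illustrative choices, not HAMLET's exact architecture.

```python
import torch
import torch.nn as nn

class MomentMemoryPolicy(nn.Module):
    """Illustrative wrapper: per-timestep moment tokens + cross-timestep memory."""

    def __init__(self, vlm: nn.Module, d_model: int, num_moment_tokens: int = 4,
                 memory_layers: int = 2, action_dim: int = 7):
        super().__init__()
        self.vlm = vlm  # assumed to map (B, L, d_model) embeddings -> (B, L, d_model) hidden states
        # Learnable moment tokens appended to the VLM input at every timestep.
        self.moment_tokens = nn.Parameter(torch.randn(num_moment_tokens, d_model) * 0.02)
        # Lightweight memory: a small causal transformer over per-timestep summaries.
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.memory = nn.TransformerEncoder(layer, num_layers=memory_layers)
        self.action_head = nn.Linear(d_model, action_dim)

    def summarize(self, obs_lang_embeds: torch.Tensor) -> torch.Tensor:
        """One timestep: append moment tokens, run the VLM, pool their outputs."""
        b = obs_lang_embeds.size(0)
        moment = self.moment_tokens.unsqueeze(0).expand(b, -1, -1)
        hidden = self.vlm(torch.cat([obs_lang_embeds, moment], dim=1))
        # Keep only the hidden states at the moment-token positions, then pool.
        moment_out = hidden[:, -moment.size(1):]
        return moment_out.mean(dim=1)                       # (B, d_model)

    def forward(self, obs_lang_embeds_seq: torch.Tensor) -> torch.Tensor:
        """obs_lang_embeds_seq: (B, T, L, d_model) per-timestep VLM inputs."""
        b, t = obs_lang_embeds_seq.shape[:2]
        summaries = torch.stack(
            [self.summarize(obs_lang_embeds_seq[:, i]) for i in range(t)], dim=1)
        # Causal mask so the memory at step t only attends to steps <= t.
        mask = torch.triu(torch.full((t, t), float('-inf'),
                                     device=summaries.device), diagonal=1)
        context = self.memory(summaries, mask=mask)
        # The latest memory state conditions action prediction for the current step.
        return self.action_head(context[:, -1])
```

In practice the per-timestep summaries would be cached during rollout so that only the newest frame requires a VLM forward pass at each control step; that bookkeeping is omitted from the sketch.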
The moment tokens are appended to the VLM input at each timestep and initialized with time-contrastive learning, which encourages them to be distinctive across timesteps. Building on this, we incorporate a lightweight memory module that stores and integrates the resulting moment-token representations over the course of an episode.
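For the time-contrastive initialization, one common formulation is a symmetric InfoNCE objective in which summaries of the same timestep under two views are positives and summaries of all other timesteps in the trajectory serve as negatives. The sketch below follows that formulation; the choice of views, the temperature, and the exact objective used by HAMLET are assumptions here.

```python
import torch
import torch.nn.functional as F

def time_contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                          temperature: float = 0.1) -> torch.Tensor:
    """Symmetric InfoNCE over timesteps within one trajectory.

    z_a, z_b: (T, d) moment-token summaries of the same trajectory under two
    views (e.g., two augmentations of each frame). Positives are same-timestep
    pairs; every other timestep acts as a negative, pushing the summaries of
    different timesteps apart.
    """
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # (T, T) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    # Each timestep should match itself in the other view, in both directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```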
We present several real-robot demonstrations for each task.
Instruction "Pick up the cube and place it on the opposite side, and then return it to the original side."
Instruction "Cover the cube with the nearest cup, then stack the other cup on top of it."
Instruction "Swap the positions of two cubes, starting with the blue one."