HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

Myungkyu Koo*1   Daewon Choi*1   Taeyoung Kim1   Kyungmin Lee1   Changyeon Kim1
Younggyo Seo†2   Jinwoo Shin†1,3

1KAIST   2UC Berkeley   3RLWRLD
{jameskoo0503, daeone0920}@kaist.ac.kr
* Equal contribution.   † Equal advising.

TL;DR:  HAMLET is a scalable framework that enables pre-trained VLAs to leverage historical information without costly retraining from scratch.


Figure: Project teaser

Abstract

We introduce HAMLET, a history-aware fine-tuning framework for Vision-Language-Action (VLA) models that augments pre-trained policies with temporal memory while preserving their existing competencies. HAMLET summarizes instantaneous vision-language representations at each timestep into moment tokens and consolidates them through a lightweight memory module to form a temporally informed condition for action prediction. This design delivers strong gains on long-horizon, history-dependent manipulation tasks without retraining from scratch.

Method

HAMLET is a fine-tuning framework for VLAs that introduces History-Aware Memory with LEarned Tokens.
The framework has two components: (a) moment tokens, which summarize instantaneous VLM representations at each timestep, and (b) a memory module that aggregates moment tokens across timesteps to provide a temporally informed context for action prediction.
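As a rough illustration of component (b), the sketch below aggregates one moment-token feature per timestep with a single causal self-attention layer, so that the context at timestep t depends only on timesteps up to t. This is a minimal sketch under stated assumptions, not the authors' implementation: the text does not specify the memory module's architecture, and the function name `aggregate_history`, the weight matrices, and all dimensions are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_history(moment_tokens, w_q, w_k, w_v):
    """Causal single-head attention over per-timestep moment features.

    moment_tokens: (T, D) array, one moment-token feature per timestep
                   (assumed to be already extracted from the VLM).
    w_q, w_k, w_v: (D, D_out) projection matrices (hypothetical parameters).
    Returns a (T, D_out) array; row t depends only on timesteps 0..t.
    """
    q = moment_tokens @ w_q
    k = moment_tokens @ w_k
    v = moment_tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Mask out future timesteps so the policy never sees information
    # from steps that have not happened yet.
    num_steps = moment_tokens.shape[0]
    future = np.triu(np.ones((num_steps, num_steps), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1) @ v

# Toy usage with random features standing in for real moment tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
context = aggregate_history(tokens, w_q, w_k, w_v)  # shape (4, 8)
```

In this sketch, the last row of `context` would play the role of the temporally informed condition passed to the action predictor at the current timestep.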

Figure: HAMLET overview diagram

The moment tokens are appended to the VLM input at each timestep and initialized with time-contrastive learning, which encourages distinctiveness across timesteps. Building on this, we incorporate a lightweight memory module that stores and integrates moment token representations across timesteps.
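To make the idea of "distinctiveness across timesteps" concrete, the sketch below uses an InfoNCE-style objective, one common way to write a contrastive loss. It is an assumption that this matches the formulation used for HAMLET: the text only names time-contrastive learning. Here each timestep's moment feature is pulled toward a second view of the same timestep and pushed away from features of the other timesteps. The function name, the `temperature` value, and the way the two views are obtained are all hypothetical.

```python
import numpy as np

def time_contrastive_loss(anchors, positives, temperature=0.1):
    """InfoNCE-style loss over timesteps (illustrative, not the authors' code).

    anchors, positives: (T, D) arrays; row t of each comes from timestep t
                        (e.g. two views of the same observation, an assumption).
    For anchor t, positives[t] is the positive and all other rows are negatives.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature            # (T, T) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching timestep (the diagonal) as the target.
    return float(-np.mean(np.diag(log_probs)))

# Toy comparison: distinct per-timestep features vs. collapsed features.
num_steps = 6
distinct = np.eye(num_steps)                  # every timestep is orthogonal
collapsed = np.ones((num_steps, num_steps))   # every timestep looks the same
loss_distinct = time_contrastive_loss(distinct, distinct)
loss_collapsed = time_contrastive_loss(collapsed, collapsed)
```

When all timesteps share the same feature, every candidate is equally likely and the loss equals log T for T timesteps. Features that differ across timesteps give a lower loss, which is the behavior the initialization is meant to encourage.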

Results

For each real-world task, we present real-robot demonstrations comparing GR00T N1.5 with and without HAMLET.

Task: Pick-and-Place Twice

Instruction: "Pick up the cube and place it on the opposite side, and then return it to the original side."

[Cube on the left]

GR00T N1.5 (Failure #1)
GR00T N1.5 (Failure #2)
GR00T N1.5 + HAMLET (Success)

[Cube on the right]

GR00T N1.5 (Failure #1)
GR00T N1.5 (Failure #2)
GR00T N1.5 + HAMLET (Success)

Task: Cover-and-Stack

Instruction: "Cover the cube with the nearest cup, then stack the other cup on top of it."

[Cube near the left cup]

GR00T N1.5 (Failure #1)
GR00T N1.5 (Failure #2)
GR00T N1.5 + HAMLET (Success)

[Cube near the right cup]

GR00T N1.5 (Failure #1)
GR00T N1.5 (Failure #2)
GR00T N1.5 + HAMLET (Success)

Task: Swap Cubes

Instruction: "Swap the positions of two cubes, starting with the blue one."

[Blue cube on the left]

GR00T N1.5 (Failure #1)
GR00T N1.5 (Failure #2)
GR00T N1.5 + HAMLET (Success)

[Blue cube on the right]

GR00T N1.5 (Failure #1)
GR00T N1.5 (Failure #2)
GR00T N1.5 + HAMLET (Success)

Typical failure cases for the naïve multi-frame baseline

Task #1: Pick-and-Place Twice
Task #2: Cover-and-Stack
Task #3: Swap Cubes