Representation-Centric World Action Modeling

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

Junke Wang¹, Qihang Zhang², Shuai Yang², Yiming Luo², Yujun Shen², Zuxuan Wu¹*, Yu-Gang Jiang¹*, Yinghao Xu²,³*

¹ Institute of Trustworthy Embodied AI, Fudan University; ² Robbyant, Ant Group; ³ Hong Kong University of Science and Technology; * Corresponding authors

Real-Robot Deployment

RepWAM learns semantic visual and action latents in a shared representation space, then models future states and action transitions under language instructions.

Closed-Loop Success Rate

On three dual-arm manipulation tasks, RepWAM outperforms existing vision-language-action models (e.g., π0.5) and world-action models (e.g., Motus and Lingbot-VA).

Real-world robot success rates for RepWAM compared with baseline policies.

Real Robot Demos

60% Picking fruits, +50% over π0.5
80% Pushing the drawer, +30% over π0.5
60% Inserting the tube, +50% over π0.5

How It Works

At the core of RepWAM is RepViTok, a representation visual-action tokenizer that learns visual tokens with pixel reconstruction and semantic alignment. On top of this visual latent space, RepViTok learns latent action tokens as compact transitions between visual states.
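As a rough illustration of how pixel reconstruction and semantic alignment can be combined into a single training objective, the sketch below adds a mean-squared reconstruction term to a cosine-distance alignment term. The function names, the cosine form of the alignment term, and the weight `align_weight` are all assumptions made for illustration; this page does not specify RepViTok's actual losses.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_distance(a, b):
    """1 - cosine similarity; 0 when both vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def tokenizer_loss(pixels, recon, latent, semantic_target, align_weight=0.5):
    # Hypothetical objective: pixel reconstruction plus weighted semantic alignment.
    # `semantic_target` stands in for features from some frozen semantic encoder.
    return mse(pixels, recon) + align_weight * cosine_distance(latent, semantic_target)

# Perfect reconstruction and perfectly aligned latent: both terms vanish.
perfect = tokenizer_loss([0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [2.0, 0.0])
# Poor reconstruction and orthogonal latent: both terms contribute.
poor = tokenizer_loss([0.0, 1.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

The weighting between the two terms decides how much the latent space favors pixel fidelity over semantic structure; the value used here is arbitrary.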

With paired video and action latents, RepWAM trains a chunk-wise causal world action model to predict future visual states together with the latent actions that connect them.
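One common way to realize chunk-wise causality is a block-causal attention mask, sketched below. It assumes that tokens attend freely within their own chunk and to all earlier chunks, but never to later ones; the page does not give RepWAM's exact masking pattern, and `chunk_causal_mask` is a hypothetical helper.

```python
def chunk_causal_mask(num_chunks, chunk_len):
    """Return mask[i][j] = True if token i may attend to token j.

    Tokens see every token in their own chunk and in earlier chunks,
    but no token in a later chunk.
    """
    n = num_chunks * chunk_len
    return [[(j // chunk_len) <= (i // chunk_len) for j in range(n)]
            for i in range(n)]

mask = chunk_causal_mask(num_chunks=3, chunk_len=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Compared with a token-level causal mask, this pattern lets all tokens of one chunk be predicted together while still preventing information from leaking backward from future chunks.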

Representation visual-action tokenization

RepViTok keeps visual states and action-induced scene changes in the same latent space, allowing the WAM to learn manipulation-relevant dynamics rather than only low-level pixel details.

Architecture diagram of the RepViTok representation visual-action tokenizer.

Causal world action modeling

The causal WAM takes language instructions and the first image observation as input, then outputs future visual states and the corresponding actions. It uses a causal diffusion transformer with a video expert and an action expert: attention weights are shared, while feed-forward network weights are expert-specific.
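The weight-sharing pattern described above can be sketched in a few lines: every token passes through the same attention projections, then is routed to a feed-forward network chosen by its modality. This is a simplified illustration, not the model's implementation; it uses a single head and omits residual connections, normalization, causal masking, and diffusion timestep conditioning. All names (`block`, `shared_attention`, `ffn`) are hypothetical.

```python
import math

def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def shared_attention(tokens, wq, wk, wv):
    """Single-head self-attention; the same wq/wk/wv serve every token."""
    q = [matvec(wq, t) for t in tokens]
    k = [matvec(wk, t) for t in tokens]
    v = [matvec(wv, t) for t in tokens]
    scale = math.sqrt(len(q[0]))
    out = []
    for qi in q:
        weights = softmax([sum(a * b for a, b in zip(qi, kj)) / scale for kj in k])
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out

def ffn(x, w1, w2):
    """Two-layer feed-forward network with a ReLU in between."""
    hidden = [max(0.0, h) for h in matvec(w1, x)]
    return matvec(w2, hidden)

def block(tokens, modalities, attn_weights, ffn_weights):
    """attn_weights is shared; ffn_weights maps modality -> (w1, w2)."""
    attended = shared_attention(tokens, *attn_weights)
    return [ffn(x, *ffn_weights[m]) for x, m in zip(attended, modalities)]

identity = [[1.0, 0.0], [0.0, 1.0]]
attn = (identity, identity, identity)          # one set of attention weights
experts = {
    "video": (identity, identity),             # video expert FFN
    "action": (identity, [[2.0, 0.0], [0.0, 2.0]]),  # action expert FFN
}
tokens = [[1.0, 0.0], [1.0, 0.0]]
out = block(tokens, ["video", "action"], attn, experts)
print(out)  # [[1.0, 0.0], [2.0, 0.0]]
```

The two input tokens are identical and share attention, so they differ in the output only because each goes through its own expert FFN.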

Inputs: language instruction; observed video latents (the first image observation).

Causal WAM: video and action experts share attention weights but use separate FFN weights.

Outputs: future video latents; action outputs (future visual states and actions).

Stage 1: Visual-action pretraining

Inputs: language, observed videos
Outputs: future videos + latent actions
Learn paired video and latent-action chunks from RepViTok.

Stage 2: Robot action adaptation

Inputs: language, observed videos
Outputs: future videos + robot actions
Map latent actions to executable real-robot trajectories.

RoboTwin Benchmark

No WAN pretraining: RepWAM trains from scratch.
89.3 / 88.4: Easy / Hard 50-task average (RepWAM-5B).
+8.6 / +7.1: Easy / Hard gain of RepViTok over WAN2.2 VAE (1.3B tokenizer ablation, 50-task average).

Without WAN pretraining, RepWAM-5B still surpasses π0.5 and Motus on the 50-task average.

Main RoboTwin 2.0 Benchmark
Metric               π0.5         Motus        Lingbot-VA   Ours 1.3B    Ours 5B
                     Easy / Hard  Easy / Hard  Easy / Hard  Easy / Hard  Easy / Hard
Backbone pretrained  Yes          Yes          Yes          No           No
Average hor=2        79.3 / 73.0  85.2 / 80.9  85.3 / 86.9  85.7 / 84.0  87.4 / 87.6
Average hor=3        78.6 / 67.4  85.0 / 84.2  89.6 / 90.6  92.0 / 85.4  88.0 / 90.4
Average 50 tasks     82.7 / 76.8  88.7 / 87.0  92.9 / 91.6  86.6 / 83.1  89.3 / 88.4

For a fair comparison, we train the 1.3B WAM from scratch with WAN2.2 VAE and compare it against the same 1.3B setup using RepViTok. RepViTok consistently outperforms WAN2.2 VAE across horizons and difficulty levels.

Tokenizer Ablation for 1.3B WAM
Metric            WAN2.2 VAE   RepViTok
                  Easy / Hard  Easy / Hard
Average hor=1     81.1 / 78.4  86.2 / 83.1
Average hor=2     75.5 / 73.9  85.7 / 84.0
Average hor=3     67.2 / 68.0  92.0 / 85.4
Average 50 tasks  78.0 / 76.0  86.6 / 83.1
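The headline gains can be recomputed directly from the ablation table above. The short script below copies the (Easy, Hard) success rates from that table and subtracts the WAN2.2 VAE numbers from the RepViTok numbers; the variable names are ours, the data is from the table.

```python
# (Easy, Hard) success rates copied from the 1.3B tokenizer ablation table.
wan_vae = {
    "hor=1": (81.1, 78.4),
    "hor=2": (75.5, 73.9),
    "hor=3": (67.2, 68.0),
    "50 tasks": (78.0, 76.0),
}
repvitok = {
    "hor=1": (86.2, 83.1),
    "hor=2": (85.7, 84.0),
    "hor=3": (92.0, 85.4),
    "50 tasks": (86.6, 83.1),
}

# Absolute gain of RepViTok over WAN2.2 VAE, rounded to one decimal place.
gains = {
    row: tuple(round(r - w, 1) for r, w in zip(repvitok[row], wan_vae[row]))
    for row in wan_vae
}

for row, (easy, hard) in gains.items():
    print(f"{row:>8}: +{easy} Easy, +{hard} Hard")
```

The 50-task row reproduces the +8.6 / +7.1 figure, and the gap widens with horizon: +5.1, +10.2, and +24.8 on Easy for horizons 1, 2, and 3.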

Visual Action Latents Analysis

We report four groups of metrics for a comprehensive comparison.

Reconstruction

PSNR, SSIM, rFID, and rFVD for visual reconstruction fidelity.

World Modeling

gFVD for world-model video rollout quality.

Action Accuracy

OLS for open-loop execution success rate.

Real Robot

PickFruit for closed-loop execution success rate.

S = seen AgiBot tasks, U = unseen held-out tasks. ↑ higher is better, ↓ lower is better.
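For readers less familiar with the reconstruction metrics, PSNR is the simplest to state: it is a log-scaled ratio between the peak pixel value and the mean squared error. The sketch below uses the standard definition; the 8-bit peak value of 255 is an assumption, since the page does not state the pixel range or how scores are averaged across frames.

```python
import math

def psnr(reference, reconstruction, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)

# Every pixel off by 10 intensity levels on a 0-255 scale.
example = psnr([100.0, 100.0], [110.0, 110.0])
```

Because the scale is logarithmic, a gain of about 1 dB corresponds to roughly a 20% reduction in mean squared error.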

Visual Tokenization

We design two tokenizer baselines for comparison:

WAN2.2 VAE: a reconstruction-oriented video VAE adopted by WAN.
ViTok: a ViT-based VAE without semantic alignment.

Compared with both baselines, RepViTok performs best on world modeling and action transfer.

Tokenizer    Eval Seen (S)                   Eval Unseen (U)                 PickFruit ↑
             gFVD ↓  PSNR ↑  SSIM ↑  OLS ↑   gFVD ↓  PSNR ↑  SSIM ↑  OLS ↑
WAN2.2 VAE   67.42   17.34   0.67    13.68   83.98   16.86   0.64    11.21   20%
ViTok        69.23   17.21   0.68    16.29   80.14   17.19   0.67    13.81   10%
RepViTok     61.01   18.47   0.70    18.82   72.91   17.72   0.67    14.15   30%

RepViTok also maintains strong reconstruction quality across ImageNet and UCF101, preserving pixel-level fidelity while retaining semantic details such as faces and text.

Tokenizer    ImageNet 256            ImageNet 512            UCF101 256×17           UCF101 512×17
             rFID ↓  PSNR ↑  SSIM ↑  rFID ↓  PSNR ↑  SSIM ↑  rFVD ↓  PSNR ↑  SSIM ↑  rFVD ↓  PSNR ↑  SSIM ↑
WAN2.2 VAE   0.50    28.16   0.87    0.20    30.48   0.90    4.28    36.61   0.98    0.68    41.45   0.99
ViTok        0.96    28.65   0.89    0.24    30.77   0.92    1.23    35.52   0.97    0.16    40.68   0.98
RepViTok     0.80    28.90   0.89    0.23    31.00   0.92    1.09    36.03   0.97    0.16    41.12   0.98

Less dependence on CFG

RepWAM performs best at video CFG scale 1.0 on RoboTwin 2.0, so inference can avoid the extra unconditional video branch and run faster.
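The speed-up follows from the usual classifier-free guidance formula, `uncond + scale * (cond - uncond)`, which reduces to the conditional prediction alone when the scale is 1.0. The sketch below assumes that standard formulation; `toy_model` is a hypothetical stand-in for the denoiser and only counts forward passes.

```python
def cfg_combine(cond, uncond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def guided_prediction(model, x, condition, scale):
    """Run guidance, skipping the unconditional pass when scale is 1.0."""
    cond = model(x, condition)
    if scale == 1.0:
        return cond  # the formula collapses to cond, so one pass is enough
    uncond = model(x, None)
    return cfg_combine(cond, uncond, scale)

calls = []

def toy_model(x, condition):
    """Hypothetical denoiser: records each call and returns a shifted copy of x."""
    calls.append(condition)
    shift = 0.0 if condition is None else 1.0
    return [v + shift for v in x]

out_fast = guided_prediction(toy_model, [0.5, -0.5], "pick the fruit", scale=1.0)
passes_at_scale_1 = len(calls)   # 1 forward pass
calls.clear()
out_guided = guided_prediction(toy_model, [0.5, -0.5], "pick the fruit", scale=2.0)
passes_at_scale_2 = len(calls)   # 2 forward passes
```

At any other scale each denoising step needs both a conditional and an unconditional forward pass, so running at scale 1.0 roughly halves the per-step video compute in this setup.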

Video classifier-free guidance curve on RoboTwin.

High-fidelity reconstruction

RepViTok preserves fine-grained details well, including human faces, object boundaries, and text regions.

RepViTok reconstruction examples.

Action Tokenization

We compare two-stage training with two baselines:

w/o latent actions: removes the latent-action pathway and trains directly with robot actions.
Joint Pred: asks the video expert to predict video and action latents jointly.

Two-stage latent-action training achieves the best world-modeling, action-prediction, and real-robot results among the ablations.

Training            Eval Seen (S)                   Eval Unseen (U)                 PickFruit ↑
                    gFVD ↓  PSNR ↑  SSIM ↑  OLS ↑   gFVD ↓  PSNR ↑  SSIM ↑  OLS ↑
w/o latent actions  61.01   18.47   0.70    18.82   72.91   17.72   0.67    14.15   30%
Joint Pred          94.25   15.24   0.61    18.52   98.77   15.09   0.55    15.22   20%
Two Stages          48.23   22.86   0.75    19.87   58.83   19.93   0.74    16.98   50%

RepViTok latent actions also transfer better than LAPA to real robot action decoding, yielding lower inverse-dynamics action loss.

Latent action visualization and IDM action loss comparison.

Citation

If you find our work useful, please consider citing us.

@article{wang2026repwam,
  title  = {RepWAM: World Action Modeling with Representation Visual-Action Tokenizers},
  author = {Wang, Junke and Zhang, Qihang and Yang, Shuai and Luo, Yiming and Shen, Yujun and Wu, Zuxuan and Jiang, Yu-Gang and Xu, Yinghao},
  year   = {2026}
}