Representation-Centric World Action Modeling

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

Junke Wang¹ Qihang Zhang² Shuai Yang² Yiming Luo² Yujun Shen² Zuxuan Wu¹* Yu-Gang Jiang¹* Yinghao Xu²,³*

¹ Institute of Trustworthy Embodied AI, Fudan University; ² Robbyant, Ant Group; ³ Hong Kong University of Science and Technology; * Corresponding authors


Real-Robot Deployment

RepWAM learns semantic visual and action latents in a shared representation space, then models future states and action transitions under language instructions.

Closed-Loop Success Rate

On three dual-arm manipulation tasks, RepWAM outperforms existing vision-language-action models (e.g., π0.5) and world-action models (e.g., Motus and Lingbot-VA).

Real-world robot success rates for RepWAM compared with baseline policies.

Real Robot Demos

Pick the fruits

60% Picking fruits, +50% over π0.5

80% Pushing the drawer, +30% over π0.5

60% Inserting the tube, +50% over π0.5

How It Works

At the core of RepWAM is RepViTok, a representation visual-action tokenizer that learns visual tokens with pixel reconstruction and semantic alignment. On top of this visual latent space, RepViTok learns latent action tokens as compact transitions between visual states.
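As a rough illustration of combining pixel reconstruction with semantic alignment, the sketch below sums a mean-squared reconstruction error and a cosine-distance alignment term against features from a frozen reference encoder. The function name `tokenizer_loss`, the cosine form of the alignment term, and the weight `align_weight` are all assumptions made for illustration; the page does not specify RepViTok's actual objective.

```python
import numpy as np

def tokenizer_loss(pixels, recon, latents, teacher_feats, align_weight=0.5):
    """Hypothetical objective: pixel reconstruction + semantic alignment.

    pixels, recon:          arrays of the same shape (input and decoded frames)
    latents, teacher_feats: (num_tokens, dim) arrays; teacher_feats stands in
                            for a frozen semantic encoder (an assumption)
    """
    # Pixel reconstruction term: mean squared error.
    recon_loss = np.mean((pixels - recon) ** 2)

    # Semantic alignment term: 1 - cosine similarity, averaged over tokens.
    eps = 1e-8
    z = latents / (np.linalg.norm(latents, axis=-1, keepdims=True) + eps)
    t = teacher_feats / (np.linalg.norm(teacher_feats, axis=-1, keepdims=True) + eps)
    align_loss = np.mean(1.0 - np.sum(z * t, axis=-1))

    total = recon_loss + align_weight * align_loss
    return total, recon_loss, align_loss
```

Both terms go to zero when the decoder reproduces the input exactly and the latents point in the same direction as the reference features, which is the behavior the two training signals are meant to encourage.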

With paired video and action latents, RepWAM trains a chunk-wise causal world action model to predict future visual states together with the latent actions that connect them.
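One common way to realize chunk-wise causality is a block-causal attention mask: tokens attend freely within their own chunk and to all earlier chunks, but never to later ones. The sketch below builds such a mask. The helper name `chunk_causal_mask` and the uniform chunk size are assumptions; the page does not state how RepWAM implements its masking.

```python
def chunk_causal_mask(num_tokens, chunk_size):
    """Return mask[i][j] == True when token i may attend to token j."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Index of the chunk each token belongs to.
    chunk_of = [i // chunk_size for i in range(num_tokens)]
    # A token sees every token whose chunk is the same or earlier.
    return [[chunk_of[j] <= chunk_of[i] for j in range(num_tokens)]
            for i in range(num_tokens)]

mask = chunk_causal_mask(num_tokens=6, chunk_size=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With six tokens in chunks of two, the first two rows allow only the first chunk, the middle two rows allow the first two chunks, and the last two rows allow everything.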

Representation visual-action tokenization

RepViTok keeps visual states and action-induced scene changes in the same latent space, allowing the WAM to learn manipulation-relevant dynamics rather than only low-level pixel details.

Architecture diagram of the RepViTok representation visual-action tokenizer.

Causal world action modeling

The causal WAM takes language instructions and the first image observation as input, then outputs future visual states and the corresponding actions. It uses a causal diffusion transformer with a video expert and an action expert: attention weights are shared, while feed-forward network weights are expert-specific.
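The weight-sharing pattern described above can be sketched as a toy single-head block in which every token passes through the same attention projections and is then routed to a video or action feed-forward network. The class name `TwoExpertBlock`, the ReLU activation, the omission of normalization and diffusion conditioning, and the boolean `is_action` routing flag are simplifying assumptions, not details taken from RepWAM.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TwoExpertBlock:
    """Toy block: shared attention, per-expert FFN (illustrative only)."""

    def __init__(self, dim, hidden, seed=0):
        rng = np.random.default_rng(seed)
        init = lambda *shape: rng.normal(scale=dim ** -0.5, size=shape)
        # Attention projections are shared by video and action tokens.
        self.wq, self.wk, self.wv, self.wo = (init(dim, dim) for _ in range(4))
        # Each expert owns its own feed-forward weights.
        self.ffn = {name: (init(dim, hidden), init(hidden, dim))
                    for name in ("video", "action")}

    def __call__(self, tokens, is_action, mask=None):
        # tokens: (n, dim); is_action: (n,) bool; mask: (n, n) bool, True = may attend.
        q, k, v = tokens @ self.wq, tokens @ self.wk, tokens @ self.wv
        scores = q @ k.T / np.sqrt(tokens.shape[-1])
        if mask is not None:
            scores = np.where(mask, scores, -1e9)
        h = tokens + softmax(scores) @ v @ self.wo  # shared attention + residual
        out = h.copy()
        # Route each token to the FFN of its own expert.
        for name, select in (("video", ~is_action), ("action", is_action)):
            w1, w2 = self.ffn[name]
            out[select] = h[select] + np.maximum(h[select] @ w1, 0.0) @ w2
        return out
```

Because only the feed-forward weights differ, the two token types still exchange information through the common attention layer while keeping modality-specific processing afterwards.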

Inputs

Language instruction

Observed video latents

Language instructions and the first image observation.

Causal WAM

Video and action experts share attention weights but use separate FFN weights.

Outputs

Future video latents

Action outputs

Future visual states and actions.

Stage 1

Visual-action pretraining

Language + observed videos

Future videos + latent actions

Learn paired video and latent-action chunks from RepViTok.

Stage 2

Robot action adaptation

Language + observed videos

Future videos + robot actions

Map latent actions to executable real-robot trajectories.

RoboTwin Benchmark

No WAN pretraining: RepWAM trains from scratch.

89.3 / 88.4: RepWAM-5B Easy / Hard 50-task average.

+8.6 / +7.1: Easy / Hard 50-task gains of RepViTok over WAN2.2 VAE in the 1.3B tokenizer ablation.

Without WAN pretraining, RepWAM-5B still surpasses π0.5 and Motus on the 50-task average.

Main RoboTwin 2.0 Benchmark

Metric	π0.5 Easy	π0.5 Hard	Motus Easy	Motus Hard	Lingbot-VA Easy	Lingbot-VA Hard	Ours 1.3B Easy	Ours 1.3B Hard	Ours 5B Easy	Ours 5B Hard
Backbone pretrained	Yes		Yes		Yes		No		No
Average hor=2	79.3	73.0	85.2	80.9	85.3	86.9	85.7	84.0	87.4	87.6
Average hor=3	78.6	67.4	85.0	84.2	89.6	90.6	92.0	85.4	88.0	90.4
Average 50 tasks	82.7	76.8	88.7	87.0	92.9	91.6	86.6	83.1	89.3	88.4

For a fair comparison, we train the 1.3B WAM from scratch with WAN2.2 VAE and compare it against the same 1.3B setup using RepViTok. RepViTok consistently outperforms WAN2.2 VAE across horizons and difficulty levels.

Tokenizer Ablation for 1.3B WAM

Metric	WAN2.2 VAE Easy	WAN2.2 VAE Hard	RepViTok Easy	RepViTok Hard
Average hor=1	81.1	78.4	86.2	83.1
Average hor=2	75.5	73.9	85.7	84.0
Average hor=3	67.2	68.0	92.0	85.4
Average 50 tasks	78.0	76.0	86.6	83.1
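The +8.6 / +7.1 gains quoted for the RoboTwin benchmark follow directly from the 50-task row of this table. The short script below recomputes the per-row differences from the numbers above; the dictionary layout and variable names are only for illustration.

```python
# (Easy, Hard) success rates copied from the tokenizer ablation table.
ablation = {
    "hor=1":    {"wan": (81.1, 78.4), "rep": (86.2, 83.1)},
    "hor=2":    {"wan": (75.5, 73.9), "rep": (85.7, 84.0)},
    "hor=3":    {"wan": (67.2, 68.0), "rep": (92.0, 85.4)},
    "50 tasks": {"wan": (78.0, 76.0), "rep": (86.6, 83.1)},
}

# RepViTok minus WAN2.2 VAE, rounded to one decimal like the table.
gains = {
    row: tuple(round(rep - wan, 1) for rep, wan in zip(v["rep"], v["wan"]))
    for row, v in ablation.items()
}

for row, (easy, hard) in gains.items():
    print(f"{row:>8}: +{easy} Easy, +{hard} Hard")
```

The gap is largest at hor=3 (+24.8 Easy, +17.4 Hard), where the WAN2.2 VAE numbers are lowest.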

Visual Action Latents Analysis

We report four groups of metrics for a comprehensive comparison.

Reconstruction

PSNR, SSIM, rFID, and rFVD for visual reconstruction fidelity.

World Modeling

gFVD for world-model video rollout quality.

Action Accuracy

OLS for open-loop execution success rate.

Real Robot

PickFruit for closed-loop execution success rate.

S = seen AgiBot tasks, U = unseen held-out tasks. ↑ higher is better, ↓ lower is better.
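For readers unfamiliar with the reconstruction metrics, PSNR has a simple closed form: ten times the base-10 logarithm of the squared peak value divided by the mean squared error. The sketch below uses that standard definition on flat pixel sequences. The `max_value=1.0` pixel range and the function name are assumptions; the page does not describe its evaluation protocol.

```python
import math

def psnr(reference, reconstruction, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length pixel sequences."""
    if len(reference) != len(reconstruction) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A uniform error of 0.1 on a [0, 1] pixel range gives an MSE of 0.01 and therefore a PSNR of about 20 dB.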

Visual Tokenization

We design two tokenizer baselines for comparison:

WAN2.2 VAE: a reconstruction-oriented video VAE adopted by WAN.

ViTok: a ViT-based VAE without semantic alignment.

Compared with both baselines, RepViTok performs best on world modeling and action transfer.

Tokenizer	Seen gFVD ↓	Seen PSNR ↑	Seen SSIM ↑	Seen OLS ↑	Unseen gFVD ↓	Unseen PSNR ↑	Unseen SSIM ↑	Unseen OLS ↑	PickFruit ↑
WAN2.2 VAE	67.42	17.34	0.67	13.68	83.98	16.86	0.64	11.21	20%
ViTok	69.23	17.21	0.68	16.29	80.14	17.19	0.67	13.81	10%
RepViTok	61.01	18.47	0.70	18.82	72.91	17.72	0.67	14.15	30%

RepViTok also maintains strong reconstruction quality across ImageNet and UCF101, preserving pixel-level fidelity while retaining semantic details such as faces and text.

Tokenizer	ImageNet 256 rFID ↓	ImageNet 256 PSNR ↑	ImageNet 256 SSIM ↑	ImageNet 512 rFID ↓	ImageNet 512 PSNR ↑	ImageNet 512 SSIM ↑	UCF101 256×17 rFVD ↓	UCF101 256×17 PSNR ↑	UCF101 256×17 SSIM ↑	UCF101 512×17 rFVD ↓	UCF101 512×17 PSNR ↑	UCF101 512×17 SSIM ↑
WAN2.2 VAE	0.50	28.16	0.87	0.20	30.48	0.90	4.28	36.61	0.98	0.68	41.45	0.99
ViTok	0.96	28.65	0.89	0.24	30.77	0.92	1.23	35.52	0.97	0.16	40.68	0.98
RepViTok	0.80	28.90	0.89	0.23	31.00	0.92	1.09	36.03	0.97	0.16	41.12	0.98

Less dependence on CFG

RepWAM performs best at video CFG scale 1.0 on RoboTwin 2.0, so inference can avoid the extra unconditional video branch and run faster.
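The speed-up follows from the usual classifier-free guidance combination, `uncond + scale * (cond - uncond)`, which reduces to the conditional prediction when the scale is 1.0. The sketch below shows how an inference loop could skip the unconditional pass in that case. The `model(latents, text)` callable and the function names are hypothetical stand-ins, not RepWAM's API.

```python
def guided_prediction(cond, uncond, scale):
    """Standard classifier-free guidance combination, element-wise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def predict_video(model, latents, text, scale=1.0):
    """`model(latents, text)` is a hypothetical denoiser; text=None means unconditional."""
    cond = model(latents, text)
    if scale == 1.0:
        # u + 1.0 * (c - u) reduces to c, so the unconditional pass is unnecessary.
        return cond
    uncond = model(latents, None)
    return guided_prediction(cond, uncond, scale)
```

At scale 1.0 each denoising step needs one forward pass through the video branch instead of two.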

Video classifier-free guidance curve on RoboTwin.

High-fidelity reconstruction

RepViTok preserves fine-grained details well, including human faces, object boundaries, and text regions.

Action Tokenization

We compare two-stage training with two baselines:

w/o latent actions: removes the latent-action pathway and trains directly with robot actions.

Joint Pred: asks the video expert to predict video and action latents jointly.

Two-stage latent-action training achieves the best world-modeling, action-prediction, and real-robot results among the ablations.

Training	Seen gFVD ↓	Seen PSNR ↑	Seen SSIM ↑	Seen OLS ↑	Unseen gFVD ↓	Unseen PSNR ↑	Unseen SSIM ↑	Unseen OLS ↑	PickFruit ↑
w/o latent actions	61.01	18.47	0.70	18.82	72.91	17.72	0.67	14.15	30%
Joint Pred	94.25	15.24	0.61	18.52	98.77	15.09	0.55	15.22	20%
Two Stages	48.23	22.86	0.75	19.87	58.83	19.93	0.74	16.98	50%

RepViTok latent actions also transfer better than LAPA to real robot action decoding, yielding lower inverse-dynamics action loss.

Latent action visualization and IDM action loss comparison.

Citation

If you find our work useful, please consider citing us.

@article{wang2026repwam,
  title  = {RepWAM: World Action Modeling with Representation Visual-Action Tokenizers},
  author = {Wang, Junke and Zhang, Qihang and Yang, Shuai and Luo, Yiming and Shen, Yujun and Wu, Zuxuan and Jiang, Yu-Gang and Xu, Yinghao},
  year   = {2026}
}