Not All Actions Are Equal:
Rethinking Conditioning for Dexterous World Model

¹The Hong Kong University of Science and Technology (Guangzhou)  
 ²Shenzhen University   ³Beijing University of Technology   ⁴JD Explore Academy

Corresponding author
Qualitative comparison of action-conditioned video prediction on EgoDex

DexAC predicts temporally coherent egocentric video of 57-DoF dexterous hand interactions on EgoDex and EgoVerse, by treating action conditioning as a structured process instead of compressing every action dimension into one global embedding.

Abstract

Recent advances in action-conditioned world models show promising progress in modeling complex interactions and forecasting future states under diverse action sequences. While these models are often driven by stronger visual representations and model capacity, action conditioning itself remains underexplored. Most existing approaches compress the entire action sequence into a single representation, which works well for low-DoF control but becomes less reliable in high-DoF scenarios. We observe that high-DoF dexterous actions are inherently heterogeneous, spanning multiple orders of magnitude, where large-scale motions coexist with subtle but important signals. When these actions are uniformly aggregated, optimization becomes imbalanced across action components, which hinders the modeling of fine-grained effects and degrades action fidelity. We therefore propose DexAC-WM, which treats action conditioning as a structured process rather than global compression. DexAC preserves dimension-level semantics via action tokenization and aligns action signals with visual dynamics through local refinement and global modulation. To address the limited high-level semantic grounding in existing world models, we further introduce a semantic branch that provides rich object-scene priors, which enables the world model to capture dynamic visual details while supporting high-DoF action-conditioned video prediction. Experiments on EgoDex and EgoVerse show that combining the semantic branch with DexAC significantly improves FID, FVD, and PCK, demonstrating gains in visual-temporal realism and action-following consistency. We further verify that DexAC extends to other backbones, showing the scalability of our structured action-conditioning design. These results suggest that scaling world models to high-DoF control requires both structured action modeling and semantic grounding.

Why Vanilla Action Conditioning Breaks at High DoF

Low-DoF gripper actions (6 dimensions) are neatly bounded: translation and rotation live at a comparable, narrow scale, so a global MLP can aggregate them without issue. Dexterous human hands are a different story — large wrist and camera motions coexist with finger articulations that are five orders of magnitude smaller.

Empirical comparison of action magnitude distributions between low-DoF and high-DoF action spaces. (a) Low-DoF action space (6-DoF). The translation and rotation dimensions maintain bounded variances at the 10⁻² scale, which prevents gradient domination. (b) High-DoF dexterous action space (57-DoF). The severe 10⁵ scale gap between macro-movements (wrist/camera) and micro-movements (fingers) creates critical optimization bottlenecks. Note that the finger articulations (30 DoF) appear almost flat because of their 10⁻⁵ scale compared with the 10⁰ macro-movements.
Wrist / camera scale: 10⁰
Finger articulation scale: 10⁻⁵
Scale gap: 10⁵×
Total action dimensions: 57

When all 57 action dimensions are flattened into a single embedding, the high-variance wrist and camera dimensions dominate the gradient signal and effectively silence the finger articulations — the very signals needed for dexterous manipulation. This heterogeneous semantic collapse shows up directly in training: a vanilla global-conditioning baseline struggles to converge in the 57-DoF regime, while DexAC's structured conditioning stabilizes training and reaches a markedly lower loss.
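
The gradient imbalance described above can be reproduced in a toy setting. The sketch below is not the paper's model: it assumes a single linear layer with a squared-error loss, and only the two scales (10⁰ and 10⁻⁵) come from the figure. The function name `gradient_ratio` and every other parameter are illustrative. For `y = w · x`, the gradient with respect to `w_i` is proportional to `x_i`, so the weight attached to a small-scale input receives a proportionally small update.

```python
import random

def gradient_ratio(scales=(1.0, 1e-5), normalize=False, num_samples=2000, seed=0):
    """Ratio of mean |dL/dw_0| to mean |dL/dw_1| for y = w . x at w = 0.

    The two scales mimic a wrist/camera channel (~1e0) and a finger channel
    (~1e-5). The linear model and squared-error loss are toy assumptions.
    """
    rng = random.Random(seed)
    grad_sums = [0.0, 0.0]
    for _ in range(num_samples):
        z = [rng.gauss(0.0, 1.0) for _ in scales]      # unit-scale latent motion
        x = [s * zi for s, zi in zip(scales, z)]       # raw action values
        target = sum(z)                                # both channels matter equally
        if normalize:
            x = [xi / s for xi, s in zip(x, scales)]   # per-dimension normalization
        error = 0.0 - target                           # prediction is 0 because w = 0
        for i, xi in enumerate(x):
            grad_sums[i] += abs(2.0 * error * xi)      # |d(error^2)/dw_i|
    return grad_sums[0] / grad_sums[1]

raw_ratio = gradient_ratio()
normalized_ratio = gradient_ratio(normalize=True)
print(f"raw: {raw_ratio:.3g}, normalized: {normalized_ratio:.3g}")
```

In this toy setup the raw gradient ratio is on the order of the scale gap itself, while per-dimension normalization brings it close to one. This matches the motivation for normalizing each action dimension independently.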

Method: DexAC-WM

DexAC-WM is built on the Cosmos-Predict2.5 backbone and replaces global compression with a structured pipeline that keeps each action dimension semantically independent throughout conditioning. Three components work together: a structured action representation, a unified local-global conditioning module, and a semantic condition branch with dual cross-attention.

1. Structured Action Representation

Rather than flattening the full action sequence into one vector, each action dimension is independently normalized and temporally tokenized into a structured set of (B, N, C) tokens, where each token summarizes the full temporal evolution of one action dimension. This preserves dimension-level semantics that a single global embedding would wash out.
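
A minimal NumPy sketch of this idea follows. It assumes a single linear projection over the time axis as the temporal tokenizer, which the text does not specify, and it computes normalization statistics from the batch itself rather than from a training set. The function name `tokenize_actions`, the 27/30 split between macro and finger dimensions (57 minus the 30 finger DoF), and all sizes are illustrative assumptions.

```python
import numpy as np

def tokenize_actions(actions, proj, eps=1e-12):
    """Turn (B, T, N) actions into (B, N, C) tokens, one token per dimension."""
    mean = actions.mean(axis=(0, 1), keepdims=True)
    std = actions.std(axis=(0, 1), keepdims=True)
    normalized = (actions - mean) / (std + eps)   # each dimension now ~unit scale
    per_dim = normalized.transpose(0, 2, 1)       # (B, N, T): time becomes features
    return per_dim @ proj                         # (B, N, C)

rng = np.random.default_rng(0)
B, T, N, C = 2, 16, 57, 32
# Hypothetical layout: 27 macro dimensions at scale 1, 30 finger dimensions at 1e-5.
scales = np.concatenate([np.ones(27), np.full(30, 1e-5)])
actions = rng.normal(size=(B, T, N)) * scales
proj = rng.normal(size=(T, C)) / np.sqrt(T)       # stand-in for learned weights
tokens = tokenize_actions(actions, proj)
print(tokens.shape)
```

Because each dimension is normalized before tokenization, the finger tokens end up with magnitudes comparable to the wrist and camera tokens instead of being five orders of magnitude smaller.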

2. Unified Local-Global Conditioning

A local action refinement branch injects fine-grained action tokens directly into the latent noise via cross-attention, letting every latent token query the structured action representation. A global action modulation branch summarizes the action tokens through a learnable query and injects the result through Adaptive LayerNorm, keeping overall motion temporally coherent.
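
The two branches can be sketched in NumPy as follows. This is a simplified illustration under several assumptions: attention is single-head with no learned query, key, or value projections, the local branch is added residually, and the global summary is mapped to scale and shift by plain matrices `w_scale` and `w_shift`. None of these details are confirmed by the text, and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context):
    """Single-head attention without learned projections (simplification)."""
    d = queries.shape[-1]
    scores = queries @ context.transpose(0, 2, 1) / np.sqrt(d)   # (B, L, N)
    return softmax(scores) @ context                             # (B, L, C)

def layer_norm(x, eps=1e-6):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def local_global_condition(latents, action_tokens, global_query, w_scale, w_shift):
    # Local branch: every latent token queries the structured action tokens.
    refined = latents + cross_attention(latents, action_tokens)
    # Global branch: one learnable query pools the action tokens into a summary.
    batch, _, channels = latents.shape
    query = np.broadcast_to(global_query, (batch, 1, channels))
    summary = cross_attention(query, action_tokens)              # (B, 1, C)
    # AdaLN-style modulation: scale and shift the normalized latents.
    scale, shift = summary @ w_scale, summary @ w_shift
    return layer_norm(refined) * (1.0 + scale) + shift

rng = np.random.default_rng(0)
latents = rng.normal(size=(2, 10, 16))          # (B, L, C) latent tokens
action_tokens = rng.normal(size=(2, 57, 16))    # (B, N, C) action tokens
global_query = rng.normal(size=(16,))
w_scale = rng.normal(size=(16, 16)) * 0.1
w_shift = rng.normal(size=(16, 16)) * 0.1
out = local_global_condition(latents, action_tokens, global_query, w_scale, w_shift)
print(out.shape)
```

The local branch lets each spatial-temporal location pick out the action dimensions relevant to it, while the global branch applies one shared modulation to all latent tokens.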

3. Semantic Condition with Dual Cross-Attention

DINOv3-L dense spatial features and VLM text embeddings are concatenated and jointly injected into the latent space through cross-attention, with latent tokens as queries. DINO features give image-aligned spatial cues for hand-object geometry, while text embeddings provide compact, high-level intent — together improving spatio-temporal consistency.
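
As a rough sketch of the joint injection, the snippet below projects visual and text features to a shared width, concatenates them along the token axis, and lets the latent tokens attend over the combined context. The projection matrices `w_dino` and `w_text`, the residual connection, and all feature sizes are placeholder assumptions for illustration and do not reflect the actual encoder dimensions.

```python
import numpy as np

def semantic_cross_attention(latents, dino_feats, text_embeds, w_dino, w_text):
    """Latents (B, L, C) attend over concatenated visual and text tokens."""
    context = np.concatenate(
        [dino_feats @ w_dino, text_embeds @ w_text], axis=1
    )                                                        # (B, P + S, C)
    scores = latents @ context.transpose(0, 2, 1) / np.sqrt(latents.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)  # (B, L, P + S)
    return latents + weights @ context, weights

rng = np.random.default_rng(0)
latents = rng.normal(size=(1, 8, 16))        # latent tokens act as queries
dino_feats = rng.normal(size=(1, 12, 24))    # 12 hypothetical patch features
text_embeds = rng.normal(size=(1, 5, 20))    # 5 hypothetical text tokens
w_dino = rng.normal(size=(24, 16)) * 0.1
w_text = rng.normal(size=(20, 16)) * 0.1
fused, attn = semantic_cross_attention(latents, dino_feats, text_embeds, w_dino, w_text)
print(fused.shape, attn.shape)
```

Since both sources share one softmax, each latent token decides how much weight to give spatial cues versus text intent on a per-token basis.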

Structure of the semantic condition module with dual cross-attention
Schematic overview of the proposed DexAC-WM. DexAC is designed to explicitly capture both precise local dexterity and globally coherent motion in high-DoF action regimes, while the semantic condition provides rich scene- and object-based representations for DexAC-WM. (b) details the DiT backbone architecture, which consists of 28 blocks, and illustrates how the structured action condition is injected into each block through Adaptive Layer Normalization (AdaLN). (c) presents the structure of DexAC, which preserves dimension-wise structure through an action tokenizer with local and global attention refinement for adaptive action injection.

Quantitative Results

We evaluate on EgoDex (829 hours, 194 manipulation tasks, 500 distinct objects) and EgoVerse (1,362 hours, 1,965 tasks, 240 scenes, in-the-wild). All Cosmos-based models are trained on 8 NVIDIA H200 GPUs with a 2B action-conditioned backbone. Metrics: PSNR / SSIM for pixel and structural quality, LPIPS for perceptual similarity, FID / FVD for spatial and temporal realism, and PCK@10 / PCK@20 for fine-grained and overall action consistency.
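
For readers unfamiliar with PCK (percentage of correct keypoints), a minimal sketch is shown below. It assumes that predicted and ground-truth keypoints are already available as 2D coordinates and that the threshold is in the same units as those coordinates. How keypoints are extracted from generated frames and the exact threshold convention behind PCK@10 and PCK@20 are not specified here, so treat this only as an illustration of the general metric.

```python
import numpy as np

def pck(pred, gt, threshold):
    """Percentage of keypoints whose prediction lies within `threshold` of GT.

    pred, gt: arrays of shape (..., K, 2) holding 2D keypoint coordinates.
    """
    dist = np.linalg.norm(pred - gt, axis=-1)
    return 100.0 * float((dist <= threshold).mean())

gt = np.zeros((4, 2))
pred = np.array([[3.0, 4.0], [6.0, 8.0], [9.0, 12.0], [15.0, 20.0]])  # errors 5, 10, 15, 25
print(pck(pred, gt, 10), pck(pred, gt, 20))
```

A tighter threshold rewards fine-grained accuracy, while a looser one reflects whether the overall motion lands in roughly the right place.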

Table 1 · EgoDex
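| Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | FVD↓ | PCK@10↑ | PCK@20↑ |
|---|---|---|---|---|---|---|---|
| Wan2.1-Fun-1.3B-Control | 21.89 | 0.89 | 0.34 | 194.11 | 1532.49 | 20.35 | 36.89 |
| Wan2.2-Fun-5B-Control | 22.97 | 0.73 | 0.31 | 167.98 | 1434.19 | 21.51 | 36.84 |
| IRASim | 22.12 | 0.80 | 0.20 | 153.81 | 615.21 | 27.76 | 44.84 |
| IRASim + DexAC | 23.11 | 0.81 | 0.16 | 142.76 | 565.30 | 33.94 | 51.37 |
| Cosmos-Predict2.5-2B (Base) | 25.02 | 0.80 | 0.25 | 114.51 | 352.19 | 31.07 | 58.33 |
| Base + DINOv3 | 25.74 | 0.81 | 0.23 | 110.25 | 977.68 | 33.86 | 60.78 |
| Base + DexAC | 25.14 | 0.80 | 0.25 | 114.26 | 349.29 | 34.15 | 61.41 |
| Ours (Base + DINOv3 + DexAC) | 25.13 | 0.80 | 0.24 | 106.67 | 284.40 | 32.70 | 60.59 |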

Quantitative comparison of advanced action-conditioned world models on EgoDex. Higher PSNR and SSIM indicate better reconstruction quality, while lower LPIPS, FID, and FVD indicate better perceptual and temporal quality. Higher PCK represents better action consistency. The best and second-best results are highlighted in bold and underlined, respectively.

Table 2 · EgoVerse
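| Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | FVD↓ | PCK@10↑ | PCK@20↑ |
|---|---|---|---|---|---|---|---|
| Wan2.1-Fun-1.3B-Control | 22.43 | 0.74 | 0.37 | 176.99 | 1370.18 | 20.17 | 33.95 |
| Wan2.2-Fun-5B-Control | 21.93 | 0.79 | 0.41 | 151.97 | 1203.89 | 25.74 | 41.32 |
| IRASim | 22.59 | 0.71 | 0.35 | 229.74 | 989.21 | 41.68 | 57.90 |
| IRASim + DexAC | 23.66 | 0.75 | 0.39 | 224.77 | 963.25 | 44.68 | 58.12 |
| Cosmos-Predict2.5-2B (Base) | 21.45 | 0.63 | 0.39 | 151.62 | 955.74 | 24.58 | 41.16 |
| Base + DINOv3 | 21.09 | 0.62 | 0.41 | 162.13 | 857.82 | 26.45 | 41.73 |
| Base + DexAC | 21.37 | 0.63 | 0.40 | 152.10 | 919.56 | 24.24 | 42.72 |
| Ours (Base + DINOv3 + DexAC) | 21.67 | 0.64 | 0.38 | 139.60 | 830.03 | 40.62 | 60.51 |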

Quantitative comparison of advanced action-conditioned world models on EgoVerse. Higher PSNR and SSIM indicate better reconstruction quality, while lower LPIPS, FID, and FVD indicate better perceptual and temporal quality. Higher PCK represents better action consistency. The best and second-best results are highlighted in bold and underlined, respectively.

Table 3 · Ablation on structured action conditioning components (EgoDex)
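| Metric | w/o Local Attention | w/o Global Attention | MLP Action Embed | Full DexAC |
|---|---|---|---|---|
| PSNR↑ | 25.12 | 24.82 | 25.02 | 25.13 |
| SSIM↑ | 0.80 | 0.79 | 0.80 | 0.80 |
| LPIPS↓ | 0.25 | 0.26 | 0.25 | 0.24 |
| FID↓ | 114.51 | 116.31 | 114.51 | 106.67 |
| FVD↓ | 377.03 | 419.90 | 352.19 | 284.40 |
| PCK@10↑ | 30.48 | 30.81 | 31.07 | 32.70 |
| PCK@20↑ | 58.50 | 54.21 | 58.33 | 60.59 |

Table 4 · Per-action-family PCK evaluation (EgoDex)
| Method | Wrist PCK@10 | Wrist PCK@20 | Finger PCK@10 | Finger PCK@20 | Head PCK@10 | Head PCK@20 |
|---|---|---|---|---|---|---|
| MLP Action Embed | 7.03 | 31.80 | 7.03 | 22.51 | 3.89 | 14.27 |
| w/o Global Attention | 88.83 | 93.76 | 0 | 0 | 0 | 0 |
| w/o Local Attention | 48.88 | 72.60 | 29.03 | 53.87 | 35.75 | 56.03 |
| DexAC | 33.04 | 58.42 | 33.04 | 58.42 | 40.00 | 60.99 |
| DexAC + DINOv3 | 54.65 | 76.09 | 29.68 | 54.44 | 41.21 | 65.51 |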

Qualitative Results

Across both datasets, the Cosmos base model produces reasonable reconstructions but suffers from temporal inconsistencies such as drifting hand positions and unstable object interactions. Adding DINOv3 alone improves visual sharpness but introduces flickering and inconsistent motion trajectories. DexAC alone generates more temporally stable sequences with hand trajectories that align more closely with intended actions. The full model combines both properties, yielding sharper yet stable outputs with reduced motion jitter and more accurate long-horizon interaction dynamics.

Open-Loop Rollouts

Given the initial state and a dexterous action sequence, DexAC predicts future latent states autoregressively. Latent states are decoded into images for visualization.
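
The rollout loop can be sketched as follows. The `predict_next` callable stands in for the world model and is purely hypothetical, as is the assumption that actions are consumed in fixed-size chunks. The point of the sketch is only the open-loop structure: each prediction is fed back as the next input and is never corrected by ground-truth frames.

```python
def open_loop_rollout(initial_latent, actions, predict_next, chunk_size):
    """Autoregressively predict latent states from an action sequence.

    predict_next(latent, action_chunk) is a placeholder for the world model.
    """
    states = [initial_latent]
    for start in range(0, len(actions), chunk_size):
        chunk = actions[start:start + chunk_size]
        # Condition on the model's own previous prediction, not on ground truth.
        states.append(predict_next(states[-1], chunk))
    return states

def toy_model(latent, action_chunk):
    # Toy dynamics for illustration: the state accumulates the actions.
    return latent + sum(action_chunk)

trajectory = open_loop_rollout(0, [1, 2, 3, 4, 5], toy_model, chunk_size=2)
print(trajectory)  # [0, 3, 10, 15]
```

Because errors are never reset, any drift in an early step carries into all later steps, which is why long-horizon stability is a meaningful test of action conditioning.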

Rollout videos, each pairing our prediction (Ours) with the ground truth (GT): Bench Chair, Pick Place, Dry Hands, Open Close Insert Remove Case, Stack, Wipe Kitchen Surfaces.
Qualitative comparison of action-conditioned video prediction on EgoDex across four manipulation tasks
Figure 4. Qualitative comparison on EgoDex. From top to bottom in each block: Ground Truth (GT), our full method (Ours), DexAC only, DINOv3 only, and the Cosmos Base model, across four manipulation tasks (Insert Remove USB, Clean Surface, Scoop Dump Ice, Vertical Pick Place). Baselines repeatedly show motion tracking failures and hand distortion that DexAC corrects.
Qualitative comparison of action-conditioned video prediction on EgoVerse across four manipulation tasks
Figure 5. Qualitative comparison on EgoVerse. Same row ordering as above, on four in-the-wild tasks (Fold Clothes in Domain, Cup on Saucer in Domain, Cup on Saucer, Fold Clothes). DexAC's structured conditioning remains robust even as scene diversity increases.

Why Structured Conditioning Helps

To verify that subtle action dimensions are effectively utilized, we extract the action-conditioned embedding before AdaLN modulation and compute channel-wise activation magnitudes, forming a feature heatmap across four settings: without local attention, without global attention, DexAC, and our full model (DINOv3 + DexAC).
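
One plausible way to build such a heatmap is sketched below. It assumes that one conditioning embedding of shape (B, C) is captured per forward pass and that the channel-wise magnitude is the mean absolute activation over the batch. The exact reduction is not stated in the text, so the function `activation_heatmap` is an assumption for illustration.

```python
import numpy as np

def activation_heatmap(embeddings):
    """Stack per-pass channel magnitudes into a (num_forward, C) heatmap.

    embeddings: list of (B, C) arrays, one per forward pass.
    """
    return np.stack([np.abs(e).mean(axis=0) for e in embeddings])

rng = np.random.default_rng(0)
captured = [rng.normal(size=(4, 8)) for _ in range(3)]   # 3 passes, 8 channels
heatmap = activation_heatmap(captured)
print(heatmap.shape)  # rows: forward index, columns: feature channels
```

Rows with only a few active channels would suggest that a handful of dominant action dimensions drive the embedding, while a more even spread suggests that subtle dimensions also contribute.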

Action heatmap in the AdaLN embedding space across four ablation settings
Action heatmap in the AdaLN embedding space. The x-axis represents feature channels and the y-axis denotes the forward index, with color indicating activation strength. We compare four settings: (1) without Local Attention, (2) without Global Attention, (3) DexAC, and (4) Ours (DINO+DexAC).
PCA projection of action tokens for finger, wrist, and head action families
PCA results on (a) EgoVerse and (b) EgoDex tasks. The projections reveal different distribution patterns across action groups. For example, finger-related samples exhibit both compact clusters and locally scattered points, suggesting that the tokenizer captures common finger-motion structures while remaining sensitive to task-dependent hand dynamics.
t-SNE projection of action tokens for finger, wrist, and head action families
t-SNE results on (a) EgoVerse and (b) EgoDex tasks. The visualization further shows that action features from different tasks form distinct but partially clustered distributions, indicating that the proposed tokenizer can separate both local and global action patterns across different motion groups.
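
For reference, a PCA projection of token features can be computed with a plain SVD, as in the sketch below. The synthetic two-cluster data stands in for real action tokens, and the function name `pca_project` is illustrative. The actual preprocessing used for the figures is not described, so this shows only the standard technique.

```python
import numpy as np

def pca_project(tokens, num_components=2):
    """Project (num_samples, C) features onto their top principal components."""
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    return centered @ vt[:num_components].T, explained[:num_components]

rng = np.random.default_rng(0)
# Two synthetic groups in 5-D, separated along the first feature axis.
group_a = rng.normal(scale=0.1, size=(50, 5))
group_b = rng.normal(scale=0.1, size=(50, 5)) + np.array([10.0, 0, 0, 0, 0])
projected, explained = pca_project(np.vstack([group_a, group_b]))
print(projected.shape, explained)
```

Groups that separate along the first components, as the two synthetic clusters do here, correspond to the distinct distributions described for the different action families.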

BibTeX

@misc{yuan2026actionsequalrethinkingconditioning,
      title={Not All Actions Are Equal: Rethinking Conditioning for Dexterous World Model}, 
      author={Zizhao Yuan and Zhengtu Liang and Taowen Wang and Qiwei Liang and Yichi Wang and Yunheng Wang and Yuetong Fang and Lusong Li and Zecui Zeng and Renjing Xu},
      year={2026},
      eprint={2606.27325},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.27325}, 
}