World Action Models are Zero-shot Policies

or, Robotics Data Abundance

Tim Dingman

Jun 15, 2026

Originally presented as a live talk on February 25, 2026

Paper · Website

Background

I think we have to start with, what are world models?

We actually covered this in a paper last year, where I explained that world models are not necessarily models of The World, like the physical world around us, although in this paper we actually are talking about that kind.

At its most abstract, the concept of a world model is about whether an ML model can form a coherent and predictive view of a given environment based on limited information.

That may sound like a requirement for making good predictions, which we know models can do, but it’s actually not. Let me give you an example.

Let’s say you’re flipping a coin and you don’t already know that coin flips are inherently 50-50. One way you could discover they’re 50-50 is to flip the coin a bunch of times and extrapolate from that data to predict future data. You don’t need to understand the nature of coins or the laws of physics to do that extrapolation, you just need basic data analysis.

Of course the way a human would conclude coin flips are 50-50 is to look at the coin, maybe toss it a few times to make sure it’s not weighted, and then reason a bit or just intuit that the coin toss will be 50-50. That does require a world model, because it requires underlying assumptions about how objects behave in the physical world.
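To make the first, purely statistical route concrete, here is a minimal Python sketch. It never looks at the coin, only at outcomes. The simulated flips, the fair coin behind them, and the seed are all assumptions made for illustration.

```python
import random

def estimate_heads_probability(flips):
    """Estimate P(heads) from observed outcomes alone (True = heads)."""
    return sum(flips) / len(flips)

# Simulated data standing in for real coin flips. The fair coin and the
# seed are assumptions for illustration, not anything we "know" in the story.
rng = random.Random(0)
flips = [rng.random() < 0.5 for _ in range(10_000)]

estimate = estimate_heads_probability(flips)
print(f"estimated P(heads) = {estimate:.3f}")
```

Nothing in there knows what a coin is. The estimate only gets good because there is a lot of data, which is exactly what the world model route lets you avoid.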

Another example where you hear both the statistical model and the world model-ish views of a thing is in the stock market. Some people just look at trends, with no concept or view of the underlying company; the ticker symbol could mean anything. A lot of quant trading is like this, where you just feed a bunch of factors into a mathematical model and get a “buy” or “sell” recommendation out. On the other end of the spectrum are the fundamentals investors, who look at the company’s financial statements, the market, the strategy, etc., and form a model of the company’s future performance.

So “world model” describes how the prediction gets made, by reasoning about the underlying system rather than extrapolating from surface statistics, not which particular domain the prediction is about.

Here’s another example of a world model, this time in the world of code.

So normal code data for us would be a prompt and a response, if it’s SFT, or a prompt and a set of unit tests that verify whether the model’s answer works, if it’s RLVR. But in either case we’re training the model directly for writing code.

By contrast, this coding world model has a different goal: to predict how a piece of code will work. If the model can do that, then it probably has a good world model for the world of code, its “laws of physics” so to speak.

The way they train that in is to provide a piece of code and an example of using that code, then ask the model to predict the state and action at each step of the program.

Here the state is in yellow and the action is in blue. As you go down the rows, or “frames” as the paper calls them, you can see what the model is keeping track of and how it changes after each action. Like for example in the second frame, we get a new variable n, with the starting value 0. Then in the third frame we see n being tracked in state, along with its current value, 0. The two dots in quotes is just a visual shorthand that means the value of the variable hasn’t changed, so like s is “strawberry” for the first frame and then just the two dots for the rest of the frames.
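If you want a feel for what those frames contain, here is a rough Python sketch that records a (state, action) pair for every executed line of a small function. It is not the paper's data pipeline: the `count_letter` function, the `trace_frames` helper, and the exact frame layout are my own assumptions, and the only thing borrowed from the slide is the “..” shorthand for unchanged values.

```python
import sys

# The traced program, kept as a string so each executed line can be looked up.
SOURCE = '''def count_letter(s, letter):
    n = 0
    for ch in s:
        if ch == letter:
            n += 1
    return n
'''

def trace_frames(source, func_name, *args):
    """Run a function and record a (state, action) frame for every executed line."""
    lines = source.splitlines()
    namespace = {}
    exec(compile(source, "<traced>", "exec"), namespace)
    func = namespace[func_name]

    frames, previous = [], {}

    def tracer(frame, event, arg):
        nonlocal previous
        if frame.f_code is not func.__code__:
            return None                      # ignore every other function
        if event == "line":
            current = dict(frame.f_locals)   # locals *before* the line runs
            state = {
                name: ".." if name in previous and previous[name] == value else value
                for name, value in current.items()
            }
            action = lines[frame.f_lineno - 1].strip()
            frames.append((state, action))
            previous = current
        return tracer

    old_tracer = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(old_tracer)
    return frames, result

frames, result = trace_frames(SOURCE, "count_letter", "strawberry", "r")
for state, action in frames[:4]:
    print(state, "->", action)
print("result:", result)
```

A model trained on traces like this has to predict the next state from the current state and action, which is a different skill from writing the function in the first place.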

Of course in the end what we care about is whether it can produce good code, like with any coding model. But this idea of training or testing the world model, separate from the end result, will be relevant later on.

Now let’s look at some video generation models, what laymen might think of when we say “world model”.

This is a clip from Sora, OpenAI’s video generation model, when it first came out back in February 2024. I remember being very impressed when it came out, and honestly it still holds up at least in this clip, but you can see some weirdness happening. Take a look at the manhole for instance and how it kinda morphs into pavement, that’s a classic AI video-ism. Consistency and memory are tough for video gen models.

One reason it’s generally so accurate, though, is reportedly what they trained on: synthetic videos, generated by engines like Unreal Engine and Unity, which have programmed in very detailed rules of physics. If you train on tons of footage from physics engines, you’re going to end up with a pretty good sense of physics. OpenAI actually views Sora as a “general-purpose simulator of the physical world”. So they’re going for a world model here.

The latest and greatest in this space is Genie 3, which generates good videos but also allows real-time exploration. Look at the legend in the bottom-left, showing arrow key presses. Genie 3 generates the world in real time, lets you navigate it, and it remembers what has been in the scene before but is out of the scene now, to maintain consistency. Again, you need a really keen idea of the world to do something like this. So the question is, how do we put it to use in the actual physical world?

Now getting more into the robotics side, let’s talk a brief bit about inverse dynamics models, or IDMs. It’s a complicated name, but really the idea is simple: instead of a regular dynamics model, where you infer outcomes given actions, you invert it, inferring actions given outcomes.

OpenAI actually pioneered this concept in the modern AI age. This graphic is from a 2022 paper called Video Pretraining: Learning to Act by Watching Unlabeled Online Videos. Their goal was to teach a model to play Minecraft. To do that, they figured the best way was to take advantage of the many many hours of Minecraft gameplay video available out on the internet - some 70k hours apparently, after significant filtering. But how to teach the model with videos yet no key presses or mouse clicks? The model wouldn’t learn how to actually play the game!

To make their 70k hour corpus useful, they synthetically added actions in. They collected 2k hours of labeled video, i.e. with the actions recorded, and then trained an IDM to predict actions based on video. Once they had that, they put these “pseudo-labels” as they’re sometimes called on all the initial videos, then trained their Minecraft agent on the pseudo-labeled corpus.

Ultimately though they’re learning actions from videos, without any sort of world model; the IDM doesn’t demonstrably know what the key presses and mouse clicks mean, just what they correlate to.
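The pseudo-labeling recipe is easier to see in a toy setting. The Python sketch below is not the VPT code. It assumes a made-up one-dimensional world where the “video” is a list of positions and the actions are -1, 0 or +1, and every function name in it is hypothetical. The three stages are the ones described above: fit an IDM on a small labeled set, pseudo-label a larger unlabeled set, then train the agent on the result.

```python
from collections import Counter, defaultdict

def expert_action(x):
    """Toy 'player': step toward position 0. Actions are -1, 0 or +1."""
    return (x < 0) - (x > 0)

def record_episode(start, length=20):
    """Return (observations, actions); observations play the role of video frames."""
    x, observations, actions = start, [start], []
    for _ in range(length):
        a = expert_action(x)
        x += a
        observations.append(x)
        actions.append(a)
    return observations, actions

def train_idm(labeled_episodes):
    """IDM: map an observed change between frames to the most common action."""
    votes = defaultdict(Counter)
    for observations, actions in labeled_episodes:
        for before, after, a in zip(observations, observations[1:], actions):
            votes[after - before][a] += 1
    return {delta: counts.most_common(1)[0][0] for delta, counts in votes.items()}

def pseudo_label(idm, observations):
    """Infer the action behind each consecutive pair of frames."""
    return [idm[after - before] for before, after in zip(observations, observations[1:])]

def train_policy(episodes):
    """Behavior cloning: most common action seen at each observation."""
    votes = defaultdict(Counter)
    for observations, actions in episodes:
        for x, a in zip(observations, actions):
            votes[x][a] += 1
    return {x: counts.most_common(1)[0][0] for x, counts in votes.items()}

labeled = [record_episode(start) for start in (-3, 3)]           # small, with actions
unlabeled = [record_episode(start)[0] for start in range(-10, 11)]  # larger, video only

idm = train_idm(labeled)
pseudo_labeled = [(obs, pseudo_label(idm, obs)) for obs in unlabeled]
policy = train_policy(pseudo_labeled)
print(policy[5], policy[-5], policy[0])
```

Notice the IDM here is literally a lookup table from frame differences to actions. It recovers the labels perfectly without knowing anything about why the player moves, which is the point about correlation above.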

We should also discuss briefly how a modern robotics model looks. For that we’re going to turn to the Pi series of models from Physical Intelligence, which is roughly the OpenAI equivalent of robotics model makers.

The Pi models are VLAs, vision-language-action models. That’s an LLM at the center, with a vision encoder added on the front so it can see, and an action expert added on the end so the model can do stuff - the action expert outputs signals for the motors and such. In the case of Pi 0.5 the LLM is Gemma 2 2.6B, made by Google. The vision encoder is SigLIP, also made by Google, so the vision-language model so far is 3B parameters. Then you add on the action expert for another 300M parameters, so 3.3B parameters total. It’s a pretty small model, but it really has to be in order to produce actions in real time. You have to keep latency down, and there’s no way to do that with the parameter counts of SOTA LLMs, which are in the tens or hundreds of billions, maybe even trillions. By the way, this is the same reason that people speculate Sora and Genie are also in the single billions of parameters. Kinda crazy that all the knowledge to model a world can fit in that few parameters.

So one of the major contributions of Pi 0.5 was this concept of “co-training”, i.e. training the VLM part on a bunch of multimodal web data. If you’re not familiar with this work already it may be somewhat surprising that a robot could learn to manipulate the world better by just looking at images or watching videos, but of course we know humans can do the same - we can learn how to do things just by watching. Hands-on practice is often better but it’s not always required.

They also throw in training data from other robots, like other embodiments that this model doesn’t get used on.

Anyway, the main thing to know here for our purposes is that VLAs are the currently dominant paradigm in robotics models, and they start from LLMs.

The Paper

First let’s talk hardware. This is the star of our show, AgiBot G1. This model has grippers, although they make a version with dexterous hands. The grippers can rotate. Each arm has two “elbows” so to speak. The torso can move up and down on the base, and the base can wheel around. Finally, there’s one camera watching each gripper, and one behind the faceplate watching the whole scene.

This is the robot they use for gathering data and for doing most of the evals. They do use a couple other robots, specifically a Franka Emika Panda and a YAM, but if you’re going to imagine one robot for these results I would pick this one.

Now let’s cover the model.

The big difference between normal transformers like your typical LLM and robotics models of any sort is that robotics models have to predict things in parallel, because robots have different parts that can move simultaneously. You don’t have to do that with text, you can generate one word at a time. But a robot can’t take turns moving one limb at a time, at least if you want to produce smooth motion.

Image and video generation models, on the other hand, do predict things in parallel, although that’s more because of the inherently parallel nature of vision than necessarily having to produce many pixels at once. Still, you do have a body of image generation research to draw on.

So the architecture here combines the transformer, which predicts in serial and is good for language, with diffusion, which predicts in parallel and is good for images. That’s the diffusion transformer, DiT. And the way it works is by predicting small blocks of actions where stuff can happen in parallel, but predicting each block in series, so only one block at a time. Specifically, they predict 1.6 seconds of action per block. That’s how far ahead the model is “thinking” or “envisioning”. Predicting many actions in a block like that results in smoother motion compared to predicting just one action at a time.

But it doesn’t blindly act in 1.6s increments and wait until the next increment to make adjustments; just like how your body frequently adjusts balance or grip while carrying your dinner across the kitchen, the model makes updated predictions on a similarly short timescale. With this model, this hardware, and a suite of optimizations we don’t need to get into, they can make new predictions every 150 ms. So if the researchers swap out an object in the scene, or the robot’s grip slips, or the wind blows a cup over, the robot can react in a reasonable amount of time. So you never get to the end of that 1.6s block, you always have a new block ready to start acting on.
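Here is a small Python sketch of that timing, to show how little of each block actually gets executed. The 1.6 s block and 150 ms refresh come from the talk. The 20 Hz control rate is an assumption I picked so the numbers divide evenly, `predict_chunk` is a hypothetical stand-in for the model, and inference latency is ignored entirely.

```python
CHUNK_MS = 1600   # horizon of each predicted action block (from the talk)
REPLAN_MS = 150   # how often a fresh block is available (from the talk)
STEP_MS = 50      # control period; 20 Hz is an assumption for illustration

def predict_chunk(start_ms):
    """Hypothetical stand-in for the model: one (planned_at, target_time) per step."""
    steps = CHUNK_MS // STEP_MS
    return [(start_ms, start_ms + i * STEP_MS) for i in range(steps)]

def run(duration_ms):
    """Execute actions step by step, swapping in a new block every REPLAN_MS."""
    executed = []
    chunk, chunk_start = predict_chunk(0), 0
    for t in range(0, duration_ms, STEP_MS):
        if t - chunk_start >= REPLAN_MS:
            chunk, chunk_start = predict_chunk(t), t   # fresh block, old one dropped
        index = (t - chunk_start) // STEP_MS
        executed.append(chunk[index])
    return executed

log = run(1000)
steps_used = max((target - planned) // STEP_MS for planned, target in log) + 1
print(f"{CHUNK_MS // STEP_MS} actions per block, at most {steps_used} executed before replanning")
# prints "32 actions per block, at most 3 executed before replanning"
```

Under these assumptions the robot only ever acts on the first few steps of each block. The rest of the 1.6 s is lookahead that keeps the motion smooth.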

Now the other thing they mention here that I want to dig into is this word “joint”, as in “joint video-action flow matching” and such. What they’re saying there is that given some inputs, they want to predict video and action simultaneously, NOT predict video and then predict action based on that predicted video. There are a couple reasons for this.

The first is that with two simultaneous predictions, errors in one are unlikely to appear in the same way as errors in the other. So like if my video part incorrectly predicts the plate I’m carrying starts to tilt, but my action part has predicted no change in my hands, that disagreement gets picked up as loss and thus is targeted in future training steps. But if I predicted the video and then predicted the action on top of the video, then my action part is basically stuck predicting something wrong in order to agree with the video prediction and thus minimize loss.

The second is that joint prediction shares the world model knowledge with video and action prediction. That’s the main theoretical thrust of the paper really, that there’s this world model implicit in a video generation model that gets locked away when you try to learn actions from video instead of learning actions alongside video. Like you don’t want the video to be an intermediary between the implicit world model and the action predictions, you want to go straight to the source.
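To show what “joint” means mechanically, here is a toy flow matching loss in Python with NumPy. This is a sketch under heavy assumptions, not the paper's objective: the linear `model` is a hypothetical stand-in for the DiT, the dimensions are arbitrary, and using one shared time `t` and an unweighted sum of the two losses are my choices for illustration. The part that matters is that a single forward pass sees the noisy video and the noisy actions together and predicts a velocity for both.

```python
import numpy as np

rng = np.random.default_rng(0)

VIDEO_DIM, ACTION_DIM, OBS_DIM = 8, 3, 4   # toy sizes, chosen arbitrarily

# Hypothetical stand-in for the DiT: one linear map over everything at once.
W = rng.normal(scale=0.1, size=(VIDEO_DIM + ACTION_DIM + OBS_DIM + 1,
                                VIDEO_DIM + ACTION_DIM))

def model(noisy_video, noisy_action, obs, t):
    """Predict velocities for video and action from one shared set of features."""
    features = np.concatenate([noisy_video, noisy_action, obs, [t]])
    velocity = features @ W
    return velocity[:VIDEO_DIM], velocity[VIDEO_DIM:]

def joint_flow_matching_loss(video, action, obs):
    """Flow matching on video and action together, conditioned on obs."""
    t = rng.uniform()                          # shared time (an assumption)
    noise_v = rng.normal(size=video.shape)
    noise_a = rng.normal(size=action.shape)
    # Straight-line path from noise (t=0) to data (t=1).
    noisy_v = (1 - t) * noise_v + t * video
    noisy_a = (1 - t) * noise_a + t * action
    pred_v, pred_a = model(noisy_v, noisy_a, obs, t)
    # The target velocity along that path is data minus noise.
    loss_v = np.mean((pred_v - (video - noise_v)) ** 2)
    loss_a = np.mean((pred_a - (action - noise_a)) ** 2)
    return loss_v + loss_a                     # unweighted sum (an assumption)

video = rng.normal(size=VIDEO_DIM)     # stands in for future video latents
action = rng.normal(size=ACTION_DIM)   # stands in for an action block
obs = rng.normal(size=OBS_DIM)         # stands in for current frames and prompt
loss = joint_flow_matching_loss(video, action, obs)
print(f"joint loss: {loss:.3f}")
```

The alternative the authors argue against would be two stages: generate the video first, then feed that generated video into a separate action predictor. Here both outputs come off the same features, so both get gradients from the same underlying representation.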

Here’s a quick example of some generated video. We don’t get a side-by-side of generated vs actual sadly, but you can see they line up generated video with where the real action happened. In the Generated row, the top-left square is the faceplate view, the top-right square is the right arm, and the bottom-left square is the left arm. You can see the robot uses the left arm to get a second view of the scene, which is kinda cheating from a human point of view.

OK, on to the data. They need this robotics data to transform their starting point, the open-weights video generation model Wan 14B from Alibaba, into a world action model, a WAM.

The main body of data they collect is here, ~500 hours of teleoperation data on their AgiBot G1 across 7.2k episodes and 22 different environments. Since one key advantage of WAMs is that they rely on their world models to guide actions, they don’t need tons of demonstrations per task to then imitate. That allowed the researchers to focus on diversity instead of repetition with their time budget.

The average episode is about 4.4 minutes and has over 40 steps, what they call “subtasks”, which is relatively long-horizon for this type of thing. So with this pretraining data, we can turn Wan 14B into DreamZero, which can predict video and actions.

So with their 500 hours of pretraining data in hand, they can train some models and start comparing results.

Let’s cover the models and terminology first. On the bottom you’ll see three models mentioned: GR00T, a VLA made by NVIDIA, the same company behind DreamZero; Pi 0.5, a VLA by Physical Intelligence that we’ve covered before and whose architecture we saw earlier; and DreamZero.

They also give two different descriptors in parentheses: scratch and pretrained. Scratch means no robotics data other than the ~500 hours these researchers collected, to give a fair comparison between DreamZero and the VLAs. Pretrained means it does have other robotics data in it, so that’s like the full versions of GR00T and Pi 0.5 against the full version of DreamZero.

Finally, for the tasks, PnP means “pick and place”, things like putting fruit in a bowl. Contact-Rich basically means folding clothes, which is super common in robot evals and is also, to a surprising degree, one of the most common use cases for robots in the real world.

In any case, DreamZero outperforms the other two on all tasks and both embodiments. It seems like that world model really does confer a lot of physical common sense and allow the robot to handle a wider variety of cases.

Now for the previous chart, those were all seen tasks, tasks present in the training data, although the evals were in new environments and with different objects.

In this chart, the tasks are all unseen, not present in the training data. Again we see the same general trends, although Pi 0.5 does come close to DreamZero in a couple cases. To be fair to the Physical Intelligence folks, 0.5 is not the latest version of their model, but it is the most recent open-weights one so it’s a fair base of comparison.

One fun note from the authors: apparently the VLAs often try to grab objects regardless of the prompt given, suggesting they’re overfit to those types of tasks. That makes sense given the most common tasks are versions of pick and place. I like the variety of actions they represented here.

So the results we’ve been looking at so far are for models that are pretrained only, i.e. no task-specific post-training on demonstration data. Here they change that.

The post-training data covers three tasks: shirt folding (33 hrs collected), fruit packing (12 hrs), and table bussing (40 hrs). In all cases they use a variety of objects, different counts and positions, etc.

Here we see near-parity between Pi 0.5 and DreamZero. Unfortunately we don’t have pretrained vs post-trained results to understand the impact of post-training per se, but my interpretation is that WAMs can learn from demonstrations just as well as VLAs can. Like sure, WAMs seem to do better right out of the box, but you could have objected and said we don’t care about out of the box performance, we care about absolute performance, and maybe VLAs have a higher ceiling because they learn so well from demonstrations, i.e. from post-training data. But that doesn’t appear to be the case; the ceiling for a WAM seems just as high as the ceiling for a VLA.

The bull case here might be that WAMs don’t even need post-training data or might only need a few human egocentric examples rather than teleop data, but again, they don’t show these evals before post-training.

Finally, to drive home the point about most work happening within the world model rather than within the robot-specific parts, they collect two small sets of data for a set of new tasks:

- 12 minutes of human egocentric video
- 20 minutes of video from YAM, another robot. No actions collected, just video

Then they train on them and run the evals on three versions of DreamZero: the initial version, a version trained with the human data, and a version trained with the YAM data. Then they see how well DreamZero learned to do these new tasks from the human video and the robot video.

What they found was that both transferred about equally well, meaning you don’t need expensive robot hardware to make good data. All you need is a human with head- and wrist-mounted cameras who’s willing to visit a lot of new environments.

Now running in the opposite direction, they took 30 minutes of video from YAM on a different set of tasks, unrelated to any evals, and trained their model on that. And suddenly the model was able to operate the YAM, even though its pretraining data was only on AgiBot. So in both directions the embodiment seems to matter only a little, and the bigger factor by far is whether anything in the pretraining corpus looks like the task at hand.

The authors give an explanation that I want to quote in full to round out our review:

“Learning an implicit IDM from predicted videos may be inherently more sample-efficient than direct policy learning - the model only needs to learn the mapping from visual features to actions, while leveraging the pretrained video model’s existing understanding of physical dynamics. Consistent with our AgiBot findings, failures primarily stem from video prediction errors rather than action extraction, suggesting that increasing task diversity during post-training could further improve performance.”

My Takeaways

We might be able to ditch the robots, at least for pretraining data
- For evals you always want to test in real conditions, i.e. with the robot
Post-training for improving video generation will be valuable
- RLHF
- Rubrics? Seems challenging but potentially worth it
This could be the start of a shift away from VLAs to WAMs
- Even at similar quality, the data for WAMs is more scalable
