ICLR 2025 World Model Workshop Notes
https://iclr.cc/virtual/2025/workshop/24000
-
highlights
- a world model is a model that, given states and actions, predicts the next states
- world models can be used for simulation and for training RL agents
- for short-horizon planning, combine a local reward with a global value function
- diffusion models can be combined with or replace transformers/VAEs to speed up inference after training
- Robotics
- one model for high level instruction generation and one for low level motion generation
- use vision language model
- simulate with the world model and pick the best trajectory before executing in the real environment
- World models can be pure video frame prediction
- encode video into a latent space and apply a diffusion or transformer model for generation
- one component of the model infers latent actions from video frames
- could also concatenate video tokens with action tokens and do next-frame prediction
- Quality diversity algorithms
- find a diverse set of high-performing agents (policies) and combine them
- exploration and goal switching
- Training
- Use an LLM to find interesting tasks to train on first
- learnability: feed the agent tasks that are neither too hard nor too easy
- besides video generation, can also leverage an LLM to write code that generates various environments
- With enough data, a neural network world model can learn causal representation
- we can extract a world model from the policy of a goal-oriented agent
- Scalable Humanoid Whole-Body Control via Differentiable Neural Network Dynamics
- find a scalable way to train robot action planner
- 2 parts: a world model to predict the next state from the action and previous state; a policy to predict the next action from the current state and target state
- train the world model with state and action data from the real environment
- use the trained world model to train the action policy to predict the next action
- both the world model and the action policy are several layers of MLP
-
keynote #1 TD-MPC2
- scalable, robust world model
- scalable: to data + params (RL can’t do)
- robust: apply to variety of problems
- MPC: model predictive control
- optimize over local trajectories for a couple of steps
- TD: Temporal Difference Learning
- Learn global value function
- combine TD and MPC
- Planner: given a state, encode it into a latent representation, simulate multiple steps with different actions, compute the local MPC reward and the global value, and combine them as a heuristic for the current state (sketch at the end of this section)
- state could be image of current state
- Training: given trajectory data generated from either interaction or an existing dataset, encode states into the latent space, predict the optimal action, reward, and value for a couple of steps; the objective is to minimize the difference between the predicted future state’s latent encoding and the real state’s encoding
- can be trained on a single 3090 for 5 days with 48M params and achieve good results
- performs better than previous work on 100+ tasks without tuning
- performance increases as model params increase, up to 300M params
- tdmpc2.com
- data and params are both open sourced
- CNN for encoder and MLP for other parts
- application on robotics
- challenges in robotics
- autonomous interaction: cheap but exploration is challenging
- demonstration (human demonstrate movement to robot): strong supervision but costly to collect
- solution:
- collect a few demonstrations through teleoperation to initialize the world model, then let the world model explore automatically as finetuning
- performs much better than training on demonstration data alone
- note that for each real robotic task, model is not pretrained beforehand
- In robotics, "long-horizon" refers to complex tasks needing many steps and long-term planning
- some details
- planning before exploration
- add noise to planning
- short planning horizon to make it more efficient
- use a learned policy prior (a neural network) to predict the next action and value; this is not critical to the algorithm, it just speeds up training
- mostly short horizon tasks
- don’t train a decoder, to make training easier and more stable
- limitation: short horizon task
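- a minimal numpy sketch of the planner above (toy stand-in networks, not the actual TD-MPC2 code): sample action sequences, roll them out in the latent dynamics, score each with local rewards plus a terminal value estimate, and refit a Gaussian over actions, CEM/MPPI style

```python
import numpy as np

def plan(z0, dynamics, reward_fn, value_fn, horizon=3, samples=256,
         iters=6, elite_frac=0.1, action_dim=4, gamma=0.99, seed=0):
    """Minimal MPC planner sketch: sample action sequences, roll them out in a
    learned latent dynamics model, score each with short-horizon rewards plus a
    terminal value estimate, and iteratively refit a Gaussian over actions."""
    rng = np.random.default_rng(seed)
    mu = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    n_elite = max(1, int(samples * elite_frac))
    for _ in range(iters):
        acts = mu + std * rng.standard_normal((samples, horizon, action_dim))
        acts = np.clip(acts, -1.0, 1.0)
        z = np.repeat(z0[None, :], samples, axis=0)
        ret = np.zeros(samples)
        for t in range(horizon):
            ret += (gamma ** t) * reward_fn(z, acts[:, t])   # local MPC reward
            z = dynamics(z, acts[:, t])                      # latent rollout
        ret += (gamma ** horizon) * value_fn(z)              # global TD value
        elites = acts[np.argsort(ret)[-n_elite:]]
        mu, std = elites.mean(axis=0), elites.std(axis=0) + 1e-4
    return mu[0]  # execute the first action, then replan (receding horizon)

# toy stand-ins for the learned networks
latent_dim, action_dim = 8, 4
dynamics = lambda z, a: 0.9 * z + 0.1 * a.sum(axis=-1, keepdims=True)
reward_fn = lambda z, a: -np.linalg.norm(z, axis=-1)
value_fn = lambda z: -np.linalg.norm(z, axis=-1)
print(plan(np.ones(latent_dim), dynamics, reward_fn, value_fn, action_dim=action_dim))
```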
-
keynote #2 Developing Generalist Vision-Language-Action Models (for long horizon task)
- 2018:
- from one robot state image and future actions, predict the future state images ((I_t, a_t:t+H) → I_t:t+H); also plan using the model
- train the model with both demonstration and interaction data
- collect diverse demonstration data and use these data to do 3 things
- train a policy to guide future interaction data collection
- improve the model
- guide planning
- only applies to short-horizon tasks like moving an object from one point to another
- typical robot language following
- collect demo
- segment and label demos with task instructions e.g. pick up a tomato
- train instruction conditioned policy e.g. given robot state image and task description, plan actions
- limitation
- more complex prompt
- interjection about how a task should be done. e.g. some constraint like don’t touch something
- correct some actions in the middle
- hierarchical vision language modeling
- User prompt → high-level policy → low-level language commands → low-level policy trained with instructions plus demonstration and interaction data (sketch below)
- all use pretrained vision language model
- How to train high level policy?
- collect robot interaction tasks such as making a sandwich
- use humans to segment and label the tasks into atomic instructions
- finetune the vision language model to predict atomic instructions based on image + task string
- How to handle open ended language?
- Use an LLM to generate synthetic data: given the current image and a task, ask the LLM to generate a plausible open-ended user question plus the concrete task string; then train the high-level policy on image + generated user question, with the policy outputting the concrete task string + atomic instruction
- evaluation
- the finetuned high-level policy performs better than GPT-4o at predicting atomic commands
- based on the ablation study, hierarchical data and synthetic data are both necessary
- limitation
- no memory of previous interaction
- lots of language interaction not supported, e.g. clarifying questions
- how to generalize to novel tasks
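- a minimal sketch of the two-level inference flow described above; `high_level_vlm` and `low_level_policy` are hypothetical stand-ins for the fine-tuned models

```python
def hierarchical_step(image, user_prompt, high_level_vlm, low_level_policy):
    """One control step of the hierarchical VLA pipeline sketched above:
    the high-level VLM turns the open-ended prompt + current image into an
    atomic instruction; the low-level policy turns image + instruction into
    motor actions."""
    atomic_instruction = high_level_vlm(image, user_prompt)   # e.g. "pick up the tomato"
    actions = low_level_policy(image, atomic_instruction)     # e.g. an action chunk
    return atomic_instruction, actions

# toy stand-ins (the real models are fine-tuned pretrained VLMs)
high_level_vlm = lambda img, prompt: "pick up the tomato"
low_level_policy = lambda img, instr: [0.1, -0.2, 0.0]
print(hierarchical_step(None, "make me a sandwich", high_level_vlm, low_level_policy))
```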
- paper: pi.website/blog/pi05
- data collection
- collect home robot manipulation data from 100 homes in SF
- diverse mobile manipulation data (with moving robots) is only 2.4% of the data
- for both low level instruction prediction and high level command prediction
- static manipulation data from lab and homes
- web data such as caption
- high level instructional data
- Pretraining
- use pretrained 3B model
- Combined token prediction task (predict various things including: low level command (use imitation learning), high level instruction, caption)
- discrete action tokenizer: frequency space tokenizer (FAST) for more efficient pre-training
- transform the time series into a frequency-domain series and tokenize it (sketch below)
- this is for motor control signals
- caption with bounding boxes
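- a rough sketch of the frequency-space tokenization idea (DCT per action dimension, then quantize); assumes scipy for the DCT, and the scale factor and rounding are illustrative choices — the real FAST tokenizer also compresses the resulting integers further (e.g. with BPE)

```python
import numpy as np
from scipy.fft import dct, idct

def fast_style_tokens(action_chunk, scale=10.0):
    """Sketch of frequency-space action tokenization: DCT each action dimension
    of a (timesteps, action_dim) chunk, quantize the coefficients by rounding,
    and flatten into a sequence of integer tokens."""
    coeffs = dct(action_chunk, axis=0, norm="ortho")      # time -> frequency
    quantized = np.round(coeffs * scale).astype(int)       # coarse quantization
    return quantized.flatten()

def fast_style_decode(tokens, timesteps, action_dim, scale=10.0):
    coeffs = tokens.reshape(timesteps, action_dim) / scale
    return idct(coeffs, axis=0, norm="ortho")               # frequency -> time

chunk = np.cumsum(np.random.randn(50, 7) * 0.02, axis=0)    # smooth 7-DoF action chunk
tokens = fast_style_tokens(chunk)
recon = fast_style_decode(tokens, 50, 7)
print(tokens[:10], np.abs(recon - chunk).max())
```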
- post training
- frequency space tokenizer (FAST) is slow for inference
- post-train a smaller diffusion head with 300M params to predict actions, making inference faster
- fine tune model end to end on diverse mobile manipulation data
- why not pretrain diffusion head
- trains slower and doesn’t follow language well
- Evaluation
- Hierarchical data is important
- performance matches lab-room performance when the number of test rooms is high
- static manipulator data is still important for mobile manipulation training
- takeaway
- increasingly diverse data allows robots to generalize to new environments
- static manipulator data is still important for mobile manipulation training
- pretraining with tokenizing action better than pretraining with diffusion
- qa
- single unified model vs 2 models?
- a single unified model has inference-speed challenges, but they haven’t tried it yet
- rl for planning?
- RL will exploit inaccuracies in the model
- use a weak optimizer (the cross-entropy method): sample a few options, fit a Gaussian to them, and resample
- data filtering
- filter idle robot data and task failure data
-
Demo: World Models: Understanding, Modelling and Scaling , Wayve
- **Wayve: end to end autonomous driving**
- emergent capability: handle edge cases
- current autonomous driving training challenge:
- not enough edge case data
- hard to evaluate before releasing model
- solution: build a world model to simulate the environment
- GAIA2: a world model which generates video for driving interaction
- it generates driving video from the different camera angles on a car
- Training:
- a video encoder and decoder to encode video into a spatio-temporal latent space
- a latent diffusion model to generate in that latent space with controllable context such as (sketch at the end of this section):
- driving actions, different vehicle platforms, different environment scenarios like different weather, and 3D cuboids to control dynamic scenes
- application
- for evaluation purpose
- could encode an edge case and generate variations, e.g. different weather or sunlight
- not for pretraining, because the synthetic data quality is not high enough
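- a toy sketch of the conditional latent-generation loop: iteratively denoise a spatio-temporal latent conditioned on a context vector; the denoiser, update rule, and context encoding are placeholders, not GAIA-2 internals

```python
import numpy as np

def sample_latent_video(denoiser, context, steps=50, latent_shape=(8, 16, 16), seed=0):
    """Minimal sketch of conditional latent generation: start from noise in the
    spatio-temporal latent space and iteratively denoise, with a context vector
    (e.g. encoded driving actions, weather, 3D cuboids) fed to the denoiser at
    every step. A separate video decoder would turn the latent into frames."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(latent_shape)
    for i in reversed(range(steps)):
        t = (i + 1) / steps
        predicted_noise = denoiser(z, t, context)
        z = z - predicted_noise / steps      # crude Euler-style update
    return z

# toy stand-ins: a fake denoiser and a conditioning vector for "rainy, slow speed"
context = np.array([1.0, 0.2])
denoiser = lambda z, t, c: z * t + 0.01 * c.sum()
latents = sample_latent_video(denoiser, context)
print(latents.shape)
```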
-
keynote #3 generative interactive environments
- recipe for training adaptive agent
- rich interactive environment: 25B distinct tasks
- auto curriculum
- divide tasks into 4 categories
- expected easy and truly easy
- expected easy and truly difficult (interesting)
- expected difficult and truly easy (interesting)
- expected difficult and truly difficult
- buffer and prioritize these tasks based on category when serving training data (PLR, Prioritized Level Replay)
- PLR leads to significant performance gain
- Improved algorithm: generate new tasks by automatically mutating previous ones using language model
- large model
- up to a 265M-param transformer model
- performance scales as params and the number of trials grow
- all three pillars above are co-dependent and improve performance
- but it is still hard to generalize to real environments
- solution: world model
- types of world models
- Model Class | Training Data | Controllability
- World Models | Video + Actions | Frame-level
- Video Models | Video + Text | Video-level
- Genie | Video | Frame-level
- genie is a combination of world model and video model
- Genie Goal: Train a generative world model from all internet videos, that can be used as a simulator for embodied AGI and a new form of generative entertainment.
- train on video data only, with inferred latent action
- input frames → a video tokenizer produces video tokens + a latent action model produces latent actions → both feed into a dynamics model to generate the next frames (sketch below)
- predicted latent action can be translated to physical key on game controller
- scales nicely with batch size and model param size
- tested on video game playing recordings
- latent actions learned across different environments have similar meanings, e.g. across different games one latent action always means "move up"
- we can prompt the model with OOD images to generate trajectories, e.g. prompt with an image of a figurine and make it move
- can be used as robotic simulator.
- instead of simulating physically, just simulate with lab robot videos and input actions
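- a toy sketch of the tokenizer / latent-action / dynamics split described above; all three callables are placeholders for the learned models

```python
import numpy as np

def genie_style_next_frame(frames, tokenizer, latent_action_model, dynamics_model):
    """Sketch of the Genie-style pipeline: tokenize past frames, infer a discrete
    latent action from consecutive frames, then let the dynamics model predict the
    next frame's tokens from (video tokens, latent action)."""
    tokens = tokenizer(frames)                                    # (T, n_tokens) discrete codes
    latent_action = latent_action_model(frames[-2], frames[-1])   # e.g. one of 8 codes
    next_tokens = dynamics_model(tokens, latent_action)
    return next_tokens, latent_action

# toy stand-ins
tokenizer = lambda frames: np.random.randint(0, 1024, size=(len(frames), 256))
latent_action_model = lambda prev, cur: int(np.abs(cur - prev).sum()) % 8
dynamics_model = lambda tokens, a: (tokens[-1] + a) % 1024
frames = [np.zeros((64, 64)), np.ones((64, 64))]
print(genie_style_next_frame(frames, tokenizer, latent_action_model, dynamics_model)[1])
```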
- genie 2, large scale foundation model
- Genie 1 was mainly for 2D videos
- replace MaskGIT with Latent diffusion
- prompt with text to generate initial frame, then encode to latent space
- generate everything in latent space, control with action in latent space and decode to generate frames
- capability
- can simulate 3D games from a starting frame
- can animate a character as it moves, e.g. a bird flaps its wings when moving forward
- emergent capability: knowledge of physics effects, e.g. lighting effects
- can automatically interact with objects, e.g. open door
- environment consistency: change viewpoint and environment is consistent
- can simulate multi agent, e.g. have multiple agents in the scene
- can start with real world image and simulate
- can work with other planning agents like SIMA, so that SIMA plays the game by providing actions and Genie 2 simulates the gameplay
- State of play
- Genie 1 showed a path to training foundation world models
- Genie 2 showed that scaling yields drastically improved results:
- Long, consistent generation
- Consistent control of diverse character morphologies
- Emergent multi-agent interaction
- We have the first signs of life that our most capable AI agents (SIMA) can use these world models to achieve goals in new, generated environments
-
Keynote #4 foundation model for sequence decision making
- data is not used efficiently compared with the data-to-performance ratio humans achieve
- we have used up the training tokens from the internet, but performance still hasn’t saturated
- we need a world model in latent space
- common representation
- basic laws of physics
- TACO: temporal latent action driven contrastive loss
- state encoder & action encoder & projection layer → embedding; compute a contrastive loss against the final state encoder’s output passed through the same projection layer (sketch below)
- result:
- generalize to unseen data
- robust to low quality training data
- encoded representation can be used to generate generalist policy and universal world model
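- a minimal InfoNCE-style sketch of the contrastive objective above (random toy projections; this is the generic recipe, not the exact TACO loss)

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE-style contrastive loss: each row of `anchor` (projection of a
    state + action-sequence embedding) should match the same row of `positive`
    (projection of the corresponding future-state embedding) against all other
    rows in the batch."""
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # (batch, batch) similarities
    labels = np.arange(len(a))
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[labels, labels].mean()

batch, dim = 32, 64
z_state_action = np.random.randn(batch, dim)                           # projected (s_t, a_t..a_{t+k})
z_future_state = z_state_action + 0.1 * np.random.randn(batch, dim)    # projected s_{t+k}
print(info_nce(z_state_action, z_future_state))
```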
- OpenVLA
- impressive out-of-the-box generalist policy, but still struggles with spatio-temporal reasoning
- can be improved with visual traces
- given the original image + a visual trace image + the user prompt, generate the next action tokens, which are detailed movement instructions
- TraceVLA improves both the large and small OpenVLA models in robustness and generalization
- meta policy generation
- given robot motion trajectories, instead of learning the motion policy directly, generate the policy
- trajectory → behavior embedding as a context signal → latent diffusion model → policy network that controls how the robot moves
-
Keynote #5 Open-ended and AI generated Algorithms in the era of foundation models
- key scientific innovations mostly come from open-ended exploration rather than from trying hard to solve one hard problem directly. For example, to solve problem A you might first need to solve a seemingly unrelated problem B and then reuse that approach on A; e.g. to speed up cooking, you first had to research electricity and invent the microwave
- Quality diversity algorithms
- find a diverse set of high-performing agents (policies) and combine them
- exploration and goal switching
- MAP elites
- choose the dimensions you are interested in, e.g. robot weight and height
- randomly initialize a robot, evaluate it, and put it on the map based on its chosen properties (sketch below)
- if the map location is already taken, replace it if the new robot has better performance
- then add perturbations along the dimensions, evaluate again, and iterate
- produces fast adaptation because all interesting variable combinations have been simulated
- and efficient exploration
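- a minimal MAP-Elites sketch (toy fitness and behavior descriptors; the grid size and mutation scale are made-up choices)

```python
import numpy as np

def map_elites(evaluate, n_dims=8, iterations=2000, grid=(10, 10), seed=0):
    """Minimal MAP-Elites loop: `evaluate(x)` returns (fitness, behavior descriptor
    in [0,1]^2, e.g. normalized robot weight and height). Keep one elite per grid
    cell, replace it only if the newcomer is fitter, and mutate existing elites
    to explore."""
    rng = np.random.default_rng(seed)
    archive = {}  # cell -> (fitness, solution)
    for _ in range(iterations):
        if archive and rng.random() < 0.9:
            parent = archive[list(archive.keys())[rng.integers(len(archive))]][1]
            x = parent + 0.1 * rng.standard_normal(n_dims)   # mutate an elite
        else:
            x = rng.standard_normal(n_dims)                   # random initialization
        fitness, descriptor = evaluate(x)
        cell = tuple((np.clip(descriptor, 0, 0.999) * np.array(grid)).astype(int))
        if cell not in archive or fitness > archive[cell][0]:
            archive[cell] = (fitness, x)                      # keep the best per niche
    return archive

# toy evaluation: fitness = -||x||, descriptors = first two coords squashed to [0,1]
evaluate = lambda x: (-np.linalg.norm(x), 1 / (1 + np.exp(-x[:2])))
archive = map_elites(evaluate)
print(len(archive), "niches filled")
```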
- but these algorithms are limited to the problem and context
- goal is to generalize to new environments
- Paired open ended trailblazer POET
- Endlessly generating increasingly complex and diverse learning environments and their solutions
- Encode environments with different params so that we can keep generating new ones
- Periodically generate new learning envs; add to the population if the env is not too easy, not too hard, and novel
- optimize an agent to solve each one; allow goal switching, e.g. move the best agent from one env to a different env to solve it
- pro: auto generated tasks
- con: small, hand-chosen distribution of envs and tasks
- hand designed pipelines are ultimately outperformed by learned solutions
- Open endedness via Models of human Notions of interestingness OMNI
- interestingness is hard to quantify, so it is hard to find diverse problems
- Foundation models learn human preferences from huge amounts of internet data, so we can rely on them to tell us which problems are interesting
- Methods
- Reinforcement learning in a particular task
- task sampler based on
- high learning progress, i.e. which tasks are learnable
- Prompt an LLM to find interesting tasks that are different from previous tasks
- Darwin complete: can express any environment
- solution:
- Deep neural network (Genie)
- Code (Omni EPIC)
- let an LLM generate code that generates environments
- also let the LLM generate reward-function code, e.g. to check whether a task is complete
- meta learn learning algorithm
- to learn more efficiently
- solution: Video Pretraining (VPT)
- pretrain a model on large amounts of internet videos first
- then fine-tune it with RL for a particular task
- learning speed outperforms plain RL on a non-pretrained model
- the SIMA agent works in a similar way
-
Keynote #7 Diffusion Language Models: Towards A Unifying Paradigm for Multimodal Generative Modeling
- diffusion model
- works well on continuous objects
- a diffusion model computes a score based on the gradient of the (log) density function; this way there is no need to compute the normalizing constant (the total area under the density), so it is easier
- by gradually matching the score (denoising), the diffusion model generates the image
- pros:
- flexible architecture, no normalization needed
- efficient training via denoising score matching (sketch below)
- training and inference are decoupled (different numbers of steps for training and inference)
- more parameter-efficient
- supports controllable generation
- coarse-to-fine generation that can correct itself during sampling (autoregressive models can’t correct earlier tokens once they are generated)
- cons
- doesn’t work for discrete data
- not clear how to handle variable length sequences
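- a tiny denoising score matching sketch on toy data; the "score net" here is just a hand-written closure to illustrate the objective, not a trained network

```python
import numpy as np

def dsm_loss(score_net, x, sigma=0.5, rng=None):
    """Denoising score matching: perturb data with Gaussian noise and train the
    score network so that score_net(x_noisy, sigma) ≈ -(x_noisy - x) / sigma^2,
    i.e. the gradient of the log density of the noised data. No normalizing
    constant is ever needed."""
    rng = rng or np.random.default_rng(0)
    noise = rng.standard_normal(x.shape) * sigma
    x_noisy = x + noise
    target_score = -noise / sigma**2
    return np.mean((score_net(x_noisy, sigma) - target_score) ** 2)

# toy example: data near 0, a hand-written "score net" that points back toward 0
data = np.random.randn(1024, 1) * 0.1
score_net = lambda x, sigma: -x / (0.1**2 + sigma**2)
print(dsm_loss(score_net, data))
```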
- auto regressive model
- works well on discrete objects
- an autoregressive model estimates the probability density function directly; you need to keep the total probability (area) at 1, and it can be tricky to compute that normalization
- Diffusion for text
- there is a sentence graph,
- each sentence is a node with a probability
- the denoising process is a random walk on the graph guided by the score function
- at the beginning of training every sentence’s probability is the same; after training they become different
- it is like having a vocab-size × sequence-length matrix of probabilities, and for each position you estimate the score of changing that position’s token from one token to another
- surpasses auto regressive model for generation quality/speed
-
panel discussion
- what is world model
- a world model is for interacting with the world
- a world model should help the agent evolve
- a world model should handle unknown unknowns
- should future world model development be more general (like GPT) or cover each specific domain?
- we can start with a large pretrained model like a vision transformer and fine-tune it on a specific domain, or we can distill the large model into a specific agent for a task and further fine-tune it
- collect useful feedback signal
- find data with new patterns so that they can be compressed into the world model
- develop an agent to discover new data with new patterns
- should develop better compact model for causal representation and model
- language is abstract and easy to memorize and debug; it is better for a world model to have such language abstraction so that it can memorize more things and be easier to debug
- better to let the agent access the world model’s params
- any benchmark for world model?
- world models span many different domains, so it is hard to find a general benchmark for all of them
- one test is whether an agent can learn a task quicker with the world model than from scratch
- another test is how long a human can play with the model before getting bored
- perplexity of video prediction, trainability (how hard the world model is to train), and how much an agent can learn from the world model
- research direction
- representation learning that produces constraints and guarantees
- data curation and filtering
- efficient model algorithm
- how to train a world model without needing to decode into video, so that it has a better latent representation
- train a model with less compute
- how to handle planning in world model
- once we have good representation, we can apply more planning constraint
- Is it important to incorporate action and control signals into world model training data
- we can infer actions from video, or capture multi-agent actions from video
- but it is also important to learn the intentions of the agents in the video
- closing words
- humans think in a multimodal and parallel way; for example, while doing something, a human mentally simulates all the different consequences from different entities. This is an area to explore in world models
-
Oral #1 Improving Transformer World Models for Data-Efficient RL
- a Minecraft-like environment, Crafter
- goal: train a SOTA agent within 1M interactions
- Model-free RL baseline (MFRL): stateless CNN policy trained with PPO + a memory (GRU with a low-dimensional hidden state)
- Model-based RL baseline (MBRL): model the dynamics with a transformer world model (TWM), as in IRIS
- A VQ-VAE maps each image into 64 discrete latent codes
- the TWM predicts future code indices over time given the history of code indices and actions
- roll out the policy in the transformer world model to hallucinate imaginary trajectories, then use those trajectories for RL on the policy
- the MBRL baseline performs much worse than MFRL
- improvement
- warmup with real training data instead of imaginary training data
- patch factorization
- Limitation: VQ-VAE codes do not correspond to "meaningful entities".
- Solution: Exploit domain-specific knowledge to segment image into patches, then encode the patch, to get "disentangled" latents.
- Nearest neighbor tokenizer
- use discrete codes chosen by nearest neighbor instead of VQ-VAE-learned codes (sketch at the end of this section)
- Add a patch to the codebook if it is not close to any existing code
- block teacher forcing
- instead of generating each token one by one within a timestep, generate all tokens for the same timestep in parallel
- this means tokens depend only on previous timesteps’ tokens, not on each other
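- a minimal sketch of such a nearest-neighbor tokenizer (the distance threshold is illustrative)

```python
import numpy as np

def nn_tokenize(patches, codebook, threshold=0.5):
    """Nearest-neighbor tokenizer sketch: each flattened image patch is assigned
    the index of its closest codebook entry; if no entry is within `threshold`,
    the patch itself is added as a new code. No VQ-VAE training involved."""
    tokens = []
    for patch in patches:
        if codebook:
            dists = [np.linalg.norm(patch - c) for c in codebook]
            idx = int(np.argmin(dists))
            if dists[idx] <= threshold:
                tokens.append(idx)
                continue
        codebook.append(patch)           # grow the codebook with novel patches
        tokens.append(len(codebook) - 1)
    return tokens

codebook = []
patches = [np.zeros(48), np.zeros(48) + 0.01, np.ones(48)]  # e.g. 4x4x3 flattened patches
print(nn_tokenize(patches, codebook), len(codebook))
```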
-
Oral #2 From Foresight to Forethought: VLM-IN-THE- LOOP Policy Steering via Latent Alignment
- make open world interaction more reliable
- robots can make many mistakes when repeating the same task; e.g. when grasping a cup, the robot may knock the cup over
- solution: at execution time, simulate the outcomes of different action sequences and pick the best one (sketch below)
- sample multiple action sequences from policy
- pass the action sequences to a world model and simulate the results, output as predicted latents
- pass the predicted latents to a vision language model to calculate a reward
- to align the latents with the VLM, finetune the VLM to translate the world model’s latent outputs into behavior narrations and then decide whether the result is the best one
- the ablation study shows that both the world model and the VLM contribute to the robot control policy’s test-time performance
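- a minimal sketch of the sample-simulate-score loop above; the policy sampler, world model, and VLM scorer are toy stand-ins for the learned components

```python
import numpy as np

def steer_policy(state, policy_sampler, world_model, vlm_scorer, k=8):
    """Test-time policy steering sketch: sample k candidate action sequences from
    the base policy, roll each out in the world model to get predicted latents,
    have a VLM-based scorer rate the predicted outcomes, and execute the best
    candidate."""
    candidates = [policy_sampler(state) for _ in range(k)]
    scores = []
    for actions in candidates:
        predicted_latents = world_model(state, actions)   # simulated outcome
        scores.append(vlm_scorer(predicted_latents))       # e.g. "did we pick up the cup?"
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# toy stand-ins
policy_sampler = lambda s: np.random.randn(10, 4) * 0.1     # a 10-step action chunk
world_model = lambda s, a: s + a.sum(axis=0)
vlm_scorer = lambda latents: -np.abs(latents - 1.0).sum()   # prefer latents near a "goal"
actions, score = steer_policy(np.zeros(4), policy_sampler, world_model, vlm_scorer)
print(actions.shape, score)
```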
-
Keynote #8 The simulation hypothesis
- Design the right task levels for the RL agent so that it learns faster
- Need to estimate regret to decide which task level to feed the agent for training
- Previous regret-estimation formulas don’t generalize to more realistic environments
- proposed solution is
- sample a few environments for the same level; the agent’s success rate is p
- use p*(1-p) (learnability) to prioritize which level to train on first
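- a minimal sketch of this learnability-based prioritization

```python
import numpy as np

def learnability_scores(success_counts, attempts):
    """Learnability sketch: estimate per-level success rate p from a few sampled
    rollouts, then score each level by p * (1 - p). Levels the agent always solves
    (p ≈ 1) or never solves (p ≈ 0) score near zero; levels it sometimes solves
    score highest and get prioritized for training."""
    p = success_counts / np.maximum(attempts, 1)
    return p * (1 - p)

successes = np.array([10, 0, 4, 7])    # per level, out of 10 rollouts each
attempts = np.array([10, 10, 10, 10])
scores = learnability_scores(successes, attempts)
print(scores, "-> train first on level", int(np.argmax(scores)))
```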
- learn a foundation model for decision making
- Environment: Kinetix
- a physics puzzle like environment
- simulate large number of environments and vast space of tasks
- Parallelising over heterogeneous scenes achieves great performance
- Running RL agents and environments jointly on the GPU using JAX results in up to orders of magnitude of speedups
- Kinetix can be used to cheaply and quickly study large-scale RL training of generalist agents
- skills can zero-shot transfer to different tasks within Kinetix without fine-tuning
- Learnability value function also applies to picking reasoning data
- limitation
- learnability is only sound in deterministic environments, where changing the action sequence can actually improve the agent’s success rate
- for more complex, more reward-rich environments, this learnability measure should be reconsidered
-
Keynote #9 Robust Agents Learn World Models
- With enough data, a neural network world model can learn causal representation
- we can extract a world model from the policy of a goal-oriented agent
-
Oral #4 Scalable Humanoid Whole-Body Control via Differentiable Neural Network Dynamics
- find a scalable way to train robot action planner
- 2 parts: a world model to predict the next state from the action and previous state; a policy to predict the next action from the current state and target state
- train the world model with state and action data from the real environment
- use the trained world model to train the action policy to predict the next action (sketch below)
- both the world model and the action policy are several layers of MLP
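- a minimal PyTorch sketch of this two-MLP setup under assumed dimensions: fit the world model on logged transitions, then train the policy by backpropagating a target-reaching loss through the frozen, differentiable dynamics (the data and loss here are toy choices, not the paper’s exact recipe)

```python
import torch
import torch.nn as nn

state_dim, action_dim = 16, 6
world_model = nn.Sequential(nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
                            nn.Linear(128, state_dim))            # predicts next state
policy = nn.Sequential(nn.Linear(state_dim * 2, 128), nn.ReLU(),
                       nn.Linear(128, action_dim))                # (state, target) -> action

# 1) fit the world model on logged (state, action, next_state) transitions
s, a, s_next = torch.randn(512, state_dim), torch.randn(512, action_dim), torch.randn(512, state_dim)
wm_opt = torch.optim.Adam(world_model.parameters(), lr=1e-3)
for _ in range(200):
    wm_opt.zero_grad()
    loss = nn.functional.mse_loss(world_model(torch.cat([s, a], dim=-1)), s_next)
    loss.backward()
    wm_opt.step()

# 2) train the policy by backpropagating through the (frozen) differentiable dynamics
for p in world_model.parameters():
    p.requires_grad_(False)
pi_opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
state, target = torch.randn(512, state_dim), torch.randn(512, state_dim)
for _ in range(200):
    pi_opt.zero_grad()
    action = policy(torch.cat([state, target], dim=-1))
    predicted_next = world_model(torch.cat([state, action], dim=-1))
    loss = nn.functional.mse_loss(predicted_next, target)   # reach the target state
    loss.backward()
    pi_opt.step()
print(loss.item())
```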
-
Oral #5 Masked Generative Priors Improve World Models Sequence Modelling Capabilities
- base model: current state × action → latent embedding (Zt) → transformer → hidden state embedding (Ht) → prior model → next state’s latent embedding
- the latent embedding is a 32×32 grid, more like an image
- so replace the prior with MaskGIT + a bidirectional transformer to predict the next latent embedding, like predicting an image
- during inference, perform draft-and-revise using masked decoding
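- a toy sketch of draft-and-revise masked decoding; the predictor is a stand-in, and the confidence-based re-masking schedule is illustrative

```python
import numpy as np

MASK = -1

def maskgit_decode(predictor, length=16, vocab=32, rounds=4):
    """Draft-and-revise masked decoding sketch: start from an all-masked token grid,
    predict every position in parallel, commit only the highest-confidence tokens
    each round, and re-mask (revise) the rest until nothing is masked."""
    tokens = np.full(length, MASK)
    for r in range(rounds):
        probs = predictor(tokens)                      # (length, vocab) probabilities
        draft = probs.argmax(axis=-1)
        confidence = probs.max(axis=-1)
        confidence[tokens != MASK] = np.inf            # already-committed tokens stay
        n_keep = int(length * (r + 1) / rounds)        # commit more tokens each round
        keep = np.argsort(-confidence)[:n_keep]
        new_tokens = np.full(length, MASK)
        new_tokens[keep] = np.where(tokens[keep] != MASK, tokens[keep], draft[keep])
        tokens = new_tokens
    return tokens

# toy stand-in predictor: fixed per-position distributions
rng = np.random.default_rng(1)
fixed = rng.random((16, 32))
predictor = lambda toks: fixed / fixed.sum(axis=-1, keepdims=True)
print(maskgit_decode(predictor))
```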
-
Oral #6 Temporal Difference Flows
- Current World Models limitation
- Unrolling one-step world models incurs "test-time" compounding errors
- This severely limits the effective horizon for downstream applications
- Solution:
- train a model to directly predict states multiple steps into the future, instead of one step at a time
- use ideas from flow matching for the loss calculation and for sampling future states, to reduce loss variance, make training more stable, and balance the model’s exposure between samples of distant future states and next-state data
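- a generic conditional flow-matching loss sketch (straight-line interpolation path, regress the velocity) to illustrate the flow-matching ingredient mentioned above; this is the standard recipe, not the TD-Flows objective itself

```python
import numpy as np

def flow_matching_loss(velocity_net, x0, x1, rng=None):
    """Generic flow-matching loss: pick a random time t, interpolate
    x_t = (1 - t) * x0 + t * x1 along a straight path from noise x0 to data x1,
    and regress the network output v(x_t, t) onto the target velocity (x1 - x0).
    Sampling then integrates the learned velocity field starting from noise."""
    rng = rng or np.random.default_rng(0)
    t = rng.random((len(x0), 1))
    x_t = (1 - t) * x0 + t * x1
    target_velocity = x1 - x0
    return np.mean((velocity_net(x_t, t) - target_velocity) ** 2)

# toy example: "future states" x1 clustered near 2, noise x0 ~ N(0, 1)
x0 = np.random.randn(256, 4)
x1 = 2.0 + 0.1 * np.random.randn(256, 4)
velocity_net = lambda x, t: (2.0 - x) / np.maximum(1 - t, 1e-3)   # near-optimal for this toy data
print(flow_matching_loss(velocity_net, x0, x1))
```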