October 16, 2025
A research preview of RTFM, a new generative world model that generates video in real-time as you interact with it.
RTFM: A Real-Time Frame Model
Today we’re sharing RTFM, a new real-time generative World Model. RTFM (Real-Time Frame Model) generates video in real-time as you interact with it. It can be used to explore generated 3D worlds and real-world locations. RTFM is available today as a research preview.
RTFM is designed around three key principles:
- Efficiency: RTFM runs inference at interactive framerates using just a single H100 GPU.
- Scalability: RTFM is designed to scale up with increasing data and compute. It models 3D worlds without relying on explicit 3D representations, and uses a general end-to-end architecture that learns from large-scale video data.
- Persistence: You can interact with RTFM forever and the world will never be forgotten. It models a persistent 3D world that doesn’t disappear when you turn your back.
You can try the RTFM demo in your browser today.
World Models are Compute-Hungry
We are excited for a future where powerful World Models can reconstruct, generate, and simulate persistent, interactive, and physically accurate worlds in real‑time. Such models will transform industries from media to robotics and beyond. The past year has been an exciting time for this emerging technology, as advances in generative video modeling have been applied to generative world modeling.
As this technology develops, one thing is becoming clear: generative World Models will be very computationally demanding, much more so than LLMs today. If we naively apply modern video architectures to this problem, generating an interactive 4K video stream at 60fps requires producing over 100K tokens per second (roughly the length of Frankenstein or the first Harry Potter book); keeping these generations persistent for an hour or more of interaction requires attending to contexts of well over 100M tokens. This is neither feasible nor economically viable given today’s computing infrastructure.
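To make the arithmetic concrete, here is a back-of-envelope token budget in Python. The compression rates (16×16 pixels per token after VAE downsampling and patchification, 4× temporal compression) are illustrative assumptions, not RTFM's actual settings; most plausible choices land well above the thresholds quoted above.

```python
# Back-of-envelope token budget for naively applying a modern video
# architecture to an interactive 4K 60fps stream.
# Assumed compression rates (hypothetical, for illustration only):
#   - each token covers 16x16 pixels (e.g. 8x spatial VAE + 2x2 patches)
#   - 4x temporal compression of the 60 fps stream

width, height, fps = 3840, 2160, 60           # 4K at 60 fps
pixels_per_token = 16 * 16                    # assumed spatial compression
temporal_compression = 4                      # assumed temporal compression

tokens_per_frame = (width * height) // pixels_per_token
tokens_per_second = tokens_per_frame * fps // temporal_compression
context_for_one_hour = tokens_per_second * 3600

print(f"{tokens_per_frame:,} tokens/frame")       # 32,400 tokens/frame
print(f"{tokens_per_second:,} tokens/s")          # 486,000 tokens/s
print(f"{context_for_one_hour:,} tokens/hour")    # 1,749,600,000 tokens/hour
```

Even under generous compression assumptions, an hour of interaction implies a context orders of magnitude beyond what today's serving stacks handle economically.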
We are strong believers in The Bitter Lesson: simple methods that scale gracefully with increasing compute tend to dominate in AI, since they can benefit from the exponentially decreasing costs of compute that have driven all of technology forward for decades. Generative World Models are perfectly positioned to benefit from a future where the cost of computation continues to fall.
This leads to a natural question: are generative World Models blocked by today’s hardware constraints? Or are there ways to preview this technology today?
Efficiency: Pulling the Future Forward
We set forth with a simple goal: design a generative World Model that is efficient enough to deploy today, yet can continue to scale up with more compute. Our ambitious target was a model that could be deployed on a single H100 GPU while maintaining both interactive framerates and worlds that persist no matter how long you interact with them. Meeting these constraints would let us pull the future forward, delivering an experience today that hints at what these models could achieve in the future.
This goal influenced the design of our entire system, from task setup to model architecture. We carefully optimized all parts of our inference stack, applying the latest advances in architecture design, model distillation, and inference optimization in order to give the highest-fidelity preview of tomorrow’s models running on today’s hardware.
Scalability: World Models as Learned Renderers
Traditional 3D graphics pipelines model the world using an explicit 3D representation (e.g. triangle mesh, Gaussian splats) which is then rendered to produce 2D images. They use hand‑designed data structures and algorithms to model 3D geometry, materials, lighting, shadows, reflections, and more. These approaches have been the trusted workhorse of computer graphics for decades, but they do not trivially scale to more data and compute.
RTFM takes a different approach. It builds on recent advances in generative video modeling and trains a single neural network that inputs one or more 2D images of a scene, and generates 2D images of that scene from new viewpoints without building any explicit 3D representation of the world. RTFM is implemented as an autoregressive diffusion transformer operating on sequences of frames, trained end-to-end on large-scale video data to predict the next frame conditioned on previous frames.
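The autoregressive loop described above can be sketched as follows. This is a toy illustration, not RTFM's implementation: `sample_next_frame` and `toy_denoise` are hypothetical names, the denoiser stands in for the trained diffusion transformer, and the "frames" are short vectors rather than images.

```python
import random

def sample_next_frame(denoise, context, num_steps=4):
    """One autoregressive step: start from noise and iteratively denoise,
    conditioning on previously generated frames (hypothetical sketch)."""
    frame = [random.gauss(0, 1) for _ in range(8)]  # toy "frame" of 8 values
    for step in reversed(range(num_steps)):
        frame = denoise(frame, step, context)
    return frame

def toy_denoise(frame, step, context):
    # Dummy denoiser standing in for the trained transformer: pulls the
    # noisy frame toward the elementwise mean of the context frames.
    target = [sum(vals) / len(vals) for vals in zip(*context)]
    alpha = 0.5
    return [alpha * t + (1 - alpha) * f for t, f in zip(target, frame)]

context = [[1.0] * 8]          # one conditioning frame
for _ in range(3):             # generate three frames autoregressively
    context.append(sample_next_frame(toy_denoise, context))
print(len(context))  # 4
```

Each generated frame is appended to the context, so later frames are conditioned on everything generated before them; the Persistence section below discusses why that growing context becomes a problem.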
RTFM can be seen as a learned renderer. Its input frames are converted to neural network activations (the KV cache) which implicitly represent the world; while generating new frames the network reads from this representation (via attention) to create new views of the world consistent with the input views. The mechanisms for converting input views to world representations and then rendering new frames from those representations are learned end-to-end from data rather than being hand-engineered. RTFM learns to model complex effects like reflections and shadows simply by observing them during training.
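A minimal NumPy sketch of the "learned renderer" idea: input frames are projected into keys and values (standing in for the KV cache), and a new view is produced by an attention read over that cache. All shapes, projections, and function names here are toy assumptions for illustration, not RTFM's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # toy token dimension

def encode_frames(frames, w_k, w_v):
    # "Encode" input frames into a KV cache: in a real model these would be
    # transformer keys/values computed from the context frames.
    tokens = frames.reshape(-1, d)
    return tokens @ w_k, tokens @ w_v  # keys, values

def render_new_view(queries, keys, values):
    # Attention read over the cached (implicit) world representation.
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values

w_k, w_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))
context_frames = rng.normal(size=(3, 16, d))  # 3 input frames, 16 tokens each
keys, values = encode_frames(context_frames, w_k, w_v)

new_view_queries = rng.normal(size=(16, d))   # queries for the frame to render
out = render_new_view(new_view_queries, keys, values)
print(out.shape)  # (16, 64)
```

The point of the sketch is the data flow, not the math: the "world" lives only in the keys and values, and rendering is nothing more than attending to them.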
RTFM blurs the line between reconstruction (interpolating between existing views) and generation (creating new content not visible in the input views), which have historically been treated as separate problems in computer vision. When RTFM is provided with many input views, it leans toward reconstruction since the task is more constrained; when provided with fewer input views, it is forced to extrapolate beyond them.
Persistence: Posed Frames as Spatial Memory
A key property of the real world is persistence: the world doesn't disappear or change completely when you look away, and you can always return to previously visited locations no matter how much time you have spent away.
This has been a challenge for autoregressive frame models. The world is only represented implicitly via 2D image frames, so persistence requires the model to reason over an ever-growing set of frames as a user explores the world. This means that each new frame is more expensive to generate than the previous one, so the model’s memory of the world is effectively bounded by its compute budget.
RTFM circumvents this problem by modeling each frame as having a pose (position and orientation) in 3D space. We generate new frames by querying the model with the pose of the frame to be generated. The model’s memory of the world (contained in its frames) thus has a spatial structure; it uses posed frames as a spatial memory. This endows the model with a weak prior – that the world it models is a three-dimensional Euclidean space – without forcing it to explicitly predict the 3D geometry of objects in that world.
RTFM’s spatial memory enables unbounded persistence. When generating a new frame, we retrieve nearby frames from the spatial memory of posed frames to form a custom context for the model. We refer to this technique as context juggling: the model uses different context frames when generating in different regions of space. This allows RTFM to persist large worlds over long interactions without reasoning over an ever‑growing set of frames.
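The retrieval step described above can be sketched as a pose-keyed frame store. Everything here (the class names, the nearest-neighbor distance on camera position, the fixed context size `k`) is a hypothetical illustration of context juggling, not RTFM's actual implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class PosedFrame:
    position: tuple  # camera position (x, y, z) in world space
    yaw: float       # camera orientation (simplified to one angle)
    tokens: object   # frame content (latents, pixels, ...)

class SpatialMemory:
    """Stores every generated frame with its pose; retrieval is by pose."""

    def __init__(self):
        self.frames = []

    def add(self, frame):
        self.frames.append(frame)

    def context_for(self, query_pos, k=4):
        # Context size stays fixed at k no matter how many frames exist,
        # so per-frame generation cost does not grow with world size.
        return sorted(self.frames,
                      key=lambda f: math.dist(f.position, query_pos))[:k]

memory = SpatialMemory()
for x in range(10):  # frames captured along a line in world space
    memory.add(PosedFrame(position=(float(x), 0.0, 0.0), yaw=0.0, tokens=None))

context = memory.context_for(query_pos=(2.2, 0.0, 0.0), k=3)
print([f.position[0] for f in context])  # nearest frames: [2.0, 3.0, 1.0]
```

Generating in a different region of space simply selects a different subset of frames as context, which is the "juggling" in context juggling.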
Looking Ahead
RTFM pulls the future forward, offering a vision of future World Models deployed on today's hardware, and charts a technical approach that treats World Models as renderers learned end-to-end from data.
There are many exciting pathways to extending RTFM. We can augment it to model dynamic worlds, and allow users to interact with generated worlds. It is also well-suited for scaling – our current model targets real-time inference on a single H100 GPU, but we expect larger models targeting larger inference budgets to continue improving.
If you haven’t already, make sure to try RTFM today.
If you are excited about this vision and want to help us build it, join us!
This post was produced by the World Labs team.